airr-community / common-repo-wg Goto Github PK


AIRR Community Common Repository Working Group

License: Apache License 2.0

common-repo-wg's People

Contributors

Stargazers

Watchers

Forkers

lgcowell bpeters42 adriensix-i3

common-repo-wg's Issues

species ontology implementation in API

Hey @bcorrie, @laserson

In the last CRWG meeting, it was decided to use the NCBI taxonomy for species; however, we left the technical API discussion to be done outside of the meeting. So I’m creating this issue to kick that off.

The main question is which of the following we support:

(1) just a taxonomy id field in the API?
(2) just a species text field in the API (presumably with the id hidden)?
(3) both fields in the API?

I don't think (2) really makes sense, so it's between (1) and (3).

We should think about how UIs will handle this. With (1), I have these sorts of questions:

How does the UI get the list of ids? Does the API provide them?
Where does it get the names that go with the ids?
How does the UI allow searching by name?

With (3), some of the same questions arise as with (1), but we now also need to handle query combinations:

The UI completely ignores the ids and uses just a plain text field.
If a query supplies both an id and a text field value, then do what?

My personal preference is currently for (1) but with the added twist that the data returned from the query has both the id and the text field species name.
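A minimal sketch of option (1) with that twist, for discussion only. The field names `species_id` and `species` and the in-memory lookup table are assumptions made for illustration; nothing in this thread fixes them. The ids 9606 and 10090 are the NCBI taxonomy ids for Homo sapiens and Mus musculus.

```python
# Hypothetical lookup table standing in for the NCBI taxonomy.
TAXONOMY_NAMES = {9606: "Homo sapiens", 10090: "Mus musculus"}

def query_by_species_id(records, species_id):
    """Filter on the taxonomy id only, but return both id and name."""
    results = []
    for record in records:
        if record["species_id"] == species_id:
            enriched = dict(record)  # copy so the stored record is untouched
            enriched["species"] = TAXONOMY_NAMES.get(species_id)
            results.append(enriched)
    return results

# Made-up records; "species_id" is an assumed field name.
records = [
    {"subject_id": "S1", "species_id": 9606},
    {"subject_id": "S2", "species_id": 10090},
]
human = query_by_species_id(records, 9606)
```

With this shape, a UI never has to query on free text, but it still gets a display name back with every result.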

Recommendation 7: Do we want to recommend a specific technology (Thrift)?

It seems like the second sentence in Recommendation 7 is out of place. It seems to me that this phrase would be more appropriate as one of the sub-clauses. For example, something like this:

Recommendation 7: The AIRR Working Groups should collaboratively develop operational criteria for compliant repositories. Operational Criteria should include implementation of:

standardized data elements with exact (computable) specifications;
a standardized data submission process (including standardized data and metadata formats);
a standardized set of queries;
a standardized, open source, data serialization framework for ensuring inter-operability, performance, maintainability, and evolution (e.g., Apache Avro, Thrift, and Protocol Buffers – the CRWG currently recommends Thrift).

It also seems a bit inappropriate for the CRWG to be recommending a specific technology (Thrift), in particular because such technologies change quickly. Listing example technologies would be fine, but recommending a specific technology in a high-level recommendation seems too specific??? This is not to say that somewhere else we couldn't specify what this technology is, but having it in the high-level recommendations doesn't feel right to me...

develop an initial swagger API based upon MS and formats WG data models

Starting point for WG discussion about implementation issues for common repository interface.

ADC API not and is operators

@schristley are the not and is operators as documented here:

https://github.com/airr-community/airr-standards/blob/metadata-docs/docs/api/overview.rst#request-parameters

from the GDC API? I find that writing a query with them is very cumbersome. If this is GDC based, do they explain why they do it this way?

The following looks for things that have a clone ID and a specific V Gene, and then retrieves the junction for each clone:

{
    "filters": {"op":"and", "content": [
        {"op":"not", "content": {"field":"clone_id"}},
        {"op":"contains","content": {"field":"v_call","value":"IGHV3-30"}}
    ]},
    "fields":["clone_id","junction_aa"]
}

According to the docs the clause

{"op":"not", "content": {"field":"clone_id"}}

returns true if clone_id is "not missing" which is what I want. For a given rearrangement, this will return values if the rearrangement is representing a clone with a specific v_call.

This would be much clearer if it were something like:

{"op":"notmissing", "content": {"field":"clone_id"}}
{"op":"exists", "content": {"field":"clone_id"}}

Maybe having an "exists"/"missing" or "notmissing"/"missing" pair of ops would be clearer?
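To make the proposal concrete, here is a rough sketch of how an "exists"/"missing" pair could be evaluated against a record. The `matches` function is hypothetical and is not how the ADC API or the GDC API is implemented; the op names "exists" and "missing" are the ones suggested above, and the rearrangement records are made up.

```python
def matches(record, flt):
    """Evaluate a filter dict against one record (a flat dict)."""
    op = flt["op"]
    content = flt["content"]
    if op == "and":
        return all(matches(record, sub) for sub in content)
    if op == "exists":    # proposed op: field is present and not null
        return record.get(content["field"]) is not None
    if op == "missing":   # proposed op: field is absent or null
        return record.get(content["field"]) is None
    if op == "contains":  # substring match on a string field
        value = record.get(content["field"])
        return value is not None and content["value"] in value
    raise ValueError(f"unsupported op: {op}")

query = {"op": "and", "content": [
    {"op": "exists", "content": {"field": "clone_id"}},
    {"op": "contains", "content": {"field": "v_call", "value": "IGHV3-30"}},
]}

# Made-up rearrangements for illustration only.
rearrangements = [
    {"clone_id": "c1", "v_call": "IGHV3-30*01", "junction_aa": "CARDYW"},
    {"clone_id": None, "v_call": "IGHV3-30*02", "junction_aa": "CAKGGW"},
    {"clone_id": "c2", "v_call": "IGHV1-69*01", "junction_aa": "CARSSW"},
]
hits = [r["junction_aa"] for r in rearrangements if matches(r, query)]
```

Read this way, the filter says what it means: the clone id exists and the V call contains IGHV3-30.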

Recommendation 11: Registry VS cross repository queries.

My feeling is that having a registry and the capability to perform cross repository queries are two independent issues and should not be listed together. I would recommend striking the second phrase of this recommendation and adding some wording around the FAIR principles instead. Something like:

Recommendation 11: The AIRR Community should maintain a central registry of compliant repositories, providing a mechanism for AIRR Repositories to be "Findable" (see FAIR principles https://www.nature.com/articles/sdata201618).

It seems like a mistake to mix the registry capability that Recommendation 10 and Recommendation 11 define with the ability to perform operations/queries across repositories (the second phrase in Recommendation 11). The registry should be primarily about facilitating the Findable components of the FAIR aspects of AIRR repositories.

Note: This does not mean that a cross-repository operation/query capability should not exist. This is what iReceptor does. I think it is a mistake to have that capability explicitly linked to the registry.
Note: Recommendation 7 states that there should be a standard set of queries that each repository should support. This, combined with Recommendation 11 that repositories are FAIR, means that it is possible to implement a tool or gateway that supports cross-repository queries. It just shouldn't be the registry that does it!
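A small sketch of that separation of roles, under loud assumptions: the registry entries, the example.org URLs, and the `fake_run_query` stand-in are all hypothetical, and no real registry or repository API is being described. The registry only answers "which repositories exist"; a separate gateway function does the fan-out.

```python
# Hypothetical registry content: it only lists compliant repositories.
REGISTRY = [
    {"name": "repo-a", "url": "https://repo-a.example.org/airr/v1"},
    {"name": "repo-b", "url": "https://repo-b.example.org/airr/v1"},
]

def find_repositories(registry):
    """Registry role: make repositories findable, nothing more."""
    return [entry["url"] for entry in registry]

def cross_repository_query(urls, run_query, query):
    """Gateway role: send the same standard query to every repository."""
    return {url: run_query(url, query) for url in urls}

def fake_run_query(url, query):
    # Stand-in for an HTTP call to one repository; returns canned counts.
    canned = {
        "https://repo-a.example.org/airr/v1": 2,
        "https://repo-b.example.org/airr/v1": 5,
    }
    return canned[url]

urls = find_repositories(REGISTRY)
totals = cross_repository_query(urls, fake_run_query, {"fields": ["junction_aa"]})
```

Because the gateway takes the URL list as an argument, any tool can play that role without the registry knowing anything about queries.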

AIRR Data Commons - is this a term we can/should use?

Hello All,

Today, on the iReceptor web site, in our papers, and in our grant proposals, we talk about the iReceptor Data Commons. This is a network of distributed data repositories which currently consists of the existing iReceptor repositories and VDJserver's public repository.

Quite often we also use the term AIRR Data Commons to refer to what we anticipate will become the international network of data repositories that follow the MiAIRR standard and more importantly adhere to the CRWG's Recommendations, implement the CRWG API, and are discoverable in the AIRR Registry. Tools like the iReceptor Scientific Gateway (as well as any other tool) then query the "AIRR Data Commons" to find, explore, and analyze data. Once the above are implemented, there would no longer be two Data Commons, there would only be one - the AIRR Data Commons...

My question to the group is - is this terminology something that we want to encourage? Do people like (or dislike) the AIRR Data Commons concept/terminology?

Brian

Create a list of AIRR-seq data repositories

We want to have a list of repositories that store AIRR-seq (or AIRR-seq related data) as a general list of repositories that are useful to the community. Note that this is NOT a list of AIRR Data Commons compliant repositories, it is a list of repositories that might be of interest to the broad community.

"changelog" for repositories...

Now that we have had a number of repositories up and running for a while, we have realized that the provenance of the data is quite important. For example, consider the following use case:

Time A: Repertoire R and its rearrangements are loaded into the IPA repository.
Time B: User 1 performs Search S on the repository and finds something interesting in Repertoire R
Time C: We find a mistake in the metadata for a Repertoire R. For example, the wrong cell_subset was used when the data was originally uploaded. We fix the problem.
Time D: User 1 comes back, performs Search S on Repertoire R, and gets a different result because their search involved the cell_subset in the erroneous repertoire metadata.

User 1 scratches their head and has no idea what happened... From a science reproducibility perspective, this is bad. If there is a "changelog" on each repository, then it would be possible to determine if it was a change in the repository that caused this issue.

There is a relatively simple solution to this issue. Each repository could optionally (not sure we can make it mandatory) maintain a "changelog" for their repository. One easy way for this to work would be to assume that a "changelog" exists on a web page somewhere. It could be maintained manually on a static site, it could exist on the repository site, and it could even be generated by the repository itself.

From a CRWG API perspective, this would be easy to implement through the /info entry point for the API, simply adding a new field to the /info response so it would look something like:

{
"name": "iReceptor Public Archive (IPA)",
"version": "v1.0",
"changelog": "http://www.ireceptor.org/repositories/IPA/changelog"
}

This makes the changelog completely independent of the actual repository (the DB and the service don't have to do anything). The nice thing is that if the repository actually managed its own changelog (when new data is added) and generated an automatic page, you could still point to the repository's generated page:

https://ipa.ireceptor.org/airr/v1/info

could generate:

{
"name": "iReceptor Public Archive (IPA)",
"version": "v1.0",
"changelog": "http://ipa.ireceptor.org/airr/v1/changelog"
}

The changelog interface isn't part of the AIRR API, but it doesn't stop the repository and service from providing one...
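On the client side, picking up the field would be equally light. A minimal sketch, assuming the /info response is a JSON object and that "changelog" is an optional key as proposed above; the `changelog_url` helper and the second, changelog-free response are made up for illustration.

```python
import json

def changelog_url(info_response_text):
    """Return the optional "changelog" URL from an /info response, or None."""
    info = json.loads(info_response_text)
    return info.get("changelog")

# The first response mirrors the example above; the second is a made-up
# repository that chose not to publish a changelog.
with_log = (
    '{"name": "iReceptor Public Archive (IPA)", "version": "v1.0", '
    '"changelog": "http://www.ireceptor.org/repositories/IPA/changelog"}'
)
without_log = '{"name": "Some repository", "version": "v1.0"}'

found = changelog_url(with_log)
absent = changelog_url(without_log)
```

A client that finds no "changelog" key simply carries on, which is what keeps the field optional for repositories.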

We could go further in solving this issue by adding a /changelog entry point into the API, but that feels like overkill to me...

Thoughts?

This seems like it is a pretty important concept that we should be considering...

Brian

Recommendation 12: IEDB scope?

Will IEDB accept submitted data for all epitopes/antigen receptors? I have notes from them that their scope is "infectious diseases, allergy, autoimmunity, and transplant", and "HIV, cancer, etc only curated as structural data or when presented with above subjects as per NIH/NIAID".

Which platform to use for AIRR Repository Registry

Looking at DataMed as a registry platform for the AIRR repositories. Initial discussions were held on Issue #12 - closing #12 and continuing discussion on this issue.

https://datamed.org/index.php

Also for consideration, fairsharing.org:

https://fairsharing.org/license
https://fairsharing.org/api#/

Recommendation 5: Location of processed repertoire-sequencing data

The data referred to in recommendation 5 is a superset of the data defined in the current minimal standards WG (MiniStd) draft, sections 5, 6 and 7 (information on processing, processed sequences, basic V(D)J+CDR3 annotation). However, currently MiniStd assumes that the information of sections 5-7 will be stored in Genbank or TSA and is trying to map the respective fields of sections 5-7 to the INSDC Feature Table. The main consideration behind this is that until the distributed AIRR repository infrastructure described by this document is accepted for data deposition by journals and funders, MiniStd has to recommend suitable procedures for deposition of essential data in generic (non-AIRR) repositories.

This issue will likely resolve in time, but for now there should be a consistent solution between the MiniStd and the Common Repo recommendations.

Intellectual Property and AIRR community common repository recommendations

If funders and journals require deposition in this type of repository but these repositories do not allow for any retention or protection of intellectual property, I believe many of the most important translational studies will be much less accessible, since we'll just drive away investigators and studies that generate medically important receptor discovery. If we want to impact human health, we need commercial parties to develop the discovered molecules, and they need exclusivity through ownership of IP, which in this case will be tightly linked to sequences. The way these guidelines currently read, anyone who wants to discover and pass on for development biomedically important antibodies, CAR-T receptors etc, couldn't really participate in this as I read it.

Recommendation 4: too specific (SRA/Genbank)?

Hi All,
In reading the recommendations and seeing how they apply to iReceptor, it occurred to me that Recommendation 4 might be too specific - that is, specifying explicitly SRA and Genbank and only those... Although I am not an expert about other repositories (nor these ones), it seems that this is very narrow and somewhat North America specific. Would it make more sense to have something like this:

Recommendation 4: For long-term storage, data and metadata should be deposited in one of the International Nucleotide Sequence Database Collaboration (INSDC) archives such as SRA, Genbank, and ENA, per the recommendations established by the AIRR Minimal Standards Working Group. The AIRR Working Groups should work with the INSDC archives to coordinate the accurate gathering and storage of metadata for AIRR data.

In this way, we are recommending that data be published in one of the recognized national/international repositories but not telling people "exactly" what to do. If INSDC has another collaborator soon, then that should be a reasonable option. As long as the second phrase is there, and the AIRR Community works with the repositories to ensure there are easy mechanisms to store data (as has been done with SRA and Genbank), then this should be fine...

Repertoire API return rearrangement count?

Hi @schristley

One of the things that the current iReceptor API returns for /samples is the count of the number of rearrangements for each sample. For those not in the know, /samples in the iReceptor API is basically the equivalent of /repertoire in the AIRR API.

The return of the rearrangement count for a repertoire is a convenience value that is returned by the API that for iReceptor at least, is pretty fundamental to the purpose of the API call. That is, we always want to know how many rearrangements are associated with a repertoire. It is one thing to know that there is a repertoire, but it seems to me that the next question one would likely ask for each repertoire would be the number of rearrangements for that repertoire. Certainly this is something that iReceptor always wants (we want to tell the user if the repertoire they found had 10 or 10 Million rearrangements associated with it).

My question is, does it make sense to provide this as part of the /repertoire API directly (as we do in the current iReceptor API) as a summary statistic? Given that this is the fundamental link between the two conceptual levels of the API (and the AIRR Data Representation) this makes some sense to me.

The "facets" capability of the query API allows one to aggregate and count on a feature of the repertoire (e.g. "facets":"subject.subject_id"). I don't think there is a mechanism within the API to count the number of rearrangements for each repertoire, as it isn't part of the metadata. It is a summary statistic of the repertoire's rearrangements and is an operation on the /rearrangements API. Thus, in order to produce the equivalent functionality that we have today in the iReceptor API as a single call (returning a set of repertoires and their rearrangement counts), we would need to make N + 1 API calls: one query to the /repertoire API to get the set of repertoires that meet the query criteria, and one query to the /rearrangements API for each of the N repertoires that are returned to ask for the count of the rearrangements in that repertoire.
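The N + 1 pattern can be sketched as follows. `FakeRepository` is a hypothetical in-memory stand-in that only counts how many calls a client makes; its method names are invented for this example and are not part of the iReceptor or AIRR APIs.

```python
class FakeRepository:
    """In-memory stand-in for a repository; counts API calls made to it."""

    def __init__(self, repertoires, rearrangements):
        self._repertoires = repertoires
        self._rearrangements = rearrangements
        self.calls = 0

    def query_repertoires(self):
        # Stands in for one query to the repertoire entry point.
        self.calls += 1
        return list(self._repertoires)

    def count_rearrangements(self, repertoire_id):
        # Stands in for one count query to the rearrangement entry point.
        self.calls += 1
        return sum(1 for r in self._rearrangements
                   if r["repertoire_id"] == repertoire_id)

def repertoires_with_counts(repo):
    """One repertoire query plus one count query per repertoire (N + 1)."""
    result = []
    for rep in repo.query_repertoires():
        count = repo.count_rearrangements(rep["repertoire_id"])
        result.append({**rep, "rearrangement_count": count})
    return result

# Made-up data: two repertoires, three rearrangements.
repo = FakeRepository(
    repertoires=[{"repertoire_id": "r1"}, {"repertoire_id": "r2"}],
    rearrangements=[{"repertoire_id": "r1"}, {"repertoire_id": "r1"},
                    {"repertoire_id": "r2"}],
)
summary = repertoires_with_counts(repo)
```

With two repertoires this makes three calls; returning the count directly from the repertoire query would collapse that to one call regardless of N.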

At some level, this is a question as to how "clean" or "simple" we want the API to be in terms of "just" querying fields versus the functionality that we want it to play to meet the use cases that we have. This goes back to the use cases and takes the question to the next logical step - for the API calls that we are developing, what are users of the API going to do next and is there anything that the API can do to facilitate those next steps? This is potentially one example of such a case?

Thoughts?

Brian

Define what we mean by /repertoire and /rearrangement in decisions document

Hi All,

Should we provide a description/definition in the decision document (at least as far as we currently have a description/definition) as to what we mean by repertoire and rearrangement? Perhaps a few sentences as to what is meant by these terms? If nothing else, perhaps point to the specs used in the API to implement them (and by default referring to MiAIRR) such that one can more easily see what is the intent of the two types of "gettable" objects. I am not the person to write that description or I would do it myself 8-)

Brian

Review CRWG Recommendations document

It is probably time to review these as we approach the May meeting.

I have created a branch that we can use to make changes...

https://github.com/airr-community/common-repo-wg/blob/issue-27/recommendations.md

Recommendations 12 & 13: Whereabouts of processed sequences

Currently (c8e751a) recommendations 12 and 13 suggest that:

raw data is deposited in SRA/ENA
processed data and annotations are deposited in a compliant AIRR repository

However, according to current MSWG recommendations processed data and some annotations should in addition be deposited in GenBank/ENA/DDBJ.

The wording of the recommendations should be changed accordingly.

Determine how to use ontology definitions in the API

We have discussed ontological terms and have agreed that species/organism is an "easy" one to solve. We use the binomial species classification. A likely candidate for the definition of this term in its entirety is the NCBI species taxonomy.

There are other more complex and important terms that we are interested in, for example "cell_subset" and/or "strain". These may come from multiple ontology definitions and may include custom terms. How do we handle this in the API?
