To do when reviewing proteomics collections. Currently there is not an enforced patter

review id_locus pattern about nmdc-schema HOT 6 OPEN

aclum commented on July 19, 2024

review id_locus pattern

from nmdc-schema.

Comments (6)

SamuelPurvine commented on July 19, 2024

Please keep in mind we are also intending to add annotations that aren't NMDC derived when using the version 2 pipeline. Currently we are planning to use UniProt annotations, but we shouldn't expect to limit ourselves to only that repository. Given the wild west of protein naming I've encountered over the past 20 years, trying to enforce naming structures or patterns will end in misery, but we can always add that next standard (xkcd standards comic reference) :)

all_proteins is going to firmly and happily go away in the schema as soon as we can tackle the bloat, which comes after implementation of the refactored schema, which comes after re-id-ing the proteomics data, which comes after the metagenome annotations have completed...

There's also discussion to rename best_protein to something more descriptive, such as most_confidently_associated_protein or something of similar ilk, and to add a slot that defines and describes how that association was made (currently parsimony, others will likely come online).

aclum commented on July 19, 2024

At the metap meeting on 6/4/24 we discussed a minimum constraint of a CURIE and a maximum constraint of an NMDC workflow identifier + UniProt.
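
As a rough illustration of the minimum constraint only, here is a sketch of a loose "is this shaped like a CURIE" check. The regular expression and the name `looks_like_curie` are assumptions made for this example, not the pattern used in nmdc-schema, and the stricter maximum constraint is not sketched because its exact form is not spelled out here.

```python
import re

# Assumed, deliberately loose CURIE shape: a prefix, a colon, and a
# non-empty local part with no whitespace. Not the nmdc-schema pattern.
CURIE_SHAPE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*:\S+")


def looks_like_curie(value: str) -> bool:
    """Return True if the whole string has a prefix:local_id shape."""
    return CURIE_SHAPE.fullmatch(value) is not None
```

A raw UniProt string such as "sp|A1L190|SYCE3_HUMAN" has no colon-separated prefix, so it would fail this check until it is rewritten in prefixed form.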

SamuelPurvine commented on July 19, 2024

Current plan for UniProt IDs (incorporated in the current Kaiko implementation) would be to use their full string, such as "tr|A0A1D5Q1C9|A0A1D5Q1C9_MACMU" or "sp|A1L190|SYCE3_HUMAN", which denotes three elements: the sequence source (TrEMBL or Swiss-Prot), the Entry ID, and the Entry Name, each separated by a pipe. One supposes adding a prefix of "uniprot:" might CURIE these adequately?
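
To make the three-element layout concrete, a minimal sketch of splitting such a string could look like the following. The names `UniProtHeader` and `parse_uniprot_header` are hypothetical and are not taken from Kaiko or the proteomics workflow.

```python
from typing import NamedTuple


class UniProtHeader(NamedTuple):
    """Hypothetical container for the three pipe-separated elements."""
    source: str      # "sp" for Swiss-Prot, "tr" for TrEMBL
    entry_id: str    # e.g. "A1L190"
    entry_name: str  # e.g. "SYCE3_HUMAN"


def parse_uniprot_header(header: str) -> UniProtHeader:
    """Split 'source|entry_id|entry_name' into its three elements."""
    parts = header.strip().split("|")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected three pipe-separated fields: {header!r}")
    source, entry_id, entry_name = parts
    if source not in ("sp", "tr"):
        raise ValueError(f"unexpected sequence source: {source!r}")
    return UniProtHeader(source, entry_id, entry_name)
```

For example, `parse_uniprot_header("sp|A1L190|SYCE3_HUMAN").entry_id` gives `"A1L190"`.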

aclum commented on July 19, 2024

@SamuelPurvine The existing documentation about uniprot prefix registration is https://bioregistry.io/registry/uniprot
NMDC uses the prefix UniProtKB, which expands to https://bioregistry.io/uniprot
So for "sp|A1L190|SYCE3_HUMAN", the code that generates the JSON file for the schema should write the value for best_protein/occam_protein/preferred_slot_name as UniProtKB:Entry ID, for example UniProtKB:A1L190
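
A minimal sketch of that conversion, assuming the input is always the three-field pipe-separated string described earlier in the thread (the function name `uniprot_header_to_curie` is hypothetical, not part of the workflow code):

```python
def uniprot_header_to_curie(header: str, prefix: str = "UniProtKB") -> str:
    """Turn 'sp|A1L190|SYCE3_HUMAN' into 'UniProtKB:A1L190'.

    Only the middle field (the Entry ID) is kept; the sequence source
    and Entry Name are dropped.
    """
    parts = header.strip().split("|")
    if len(parts) != 3 or not parts[1]:
        raise ValueError(f"expected 'source|entry_id|entry_name': {header!r}")
    return f"{prefix}:{parts[1]}"
```

With this, `uniprot_header_to_curie("sp|A1L190|SYCE3_HUMAN")` yields `UniProtKB:A1L190`.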

SamuelPurvine commented on July 19, 2024

OK, very cool, we should certainly be able to accommodate that. Is there / will there be machinery to apply functional annotations for UniProt entries, or do "we" (the Kaiko team as implemented through the proteomics workflow) need to provide that to allow the portal to show Kaiko search results? There's probably more packed into that question (like who will end up making that aggregation table and populating it... and how?... and when?) than easily fits here, but thought I'd ask!

aclum commented on July 19, 2024

The only functional annotation supported now is KEGG, so this would require development in the data portal. The workflow should not generate the aggregation; if needed, the aggregation would be written separately. If Cam is up for maintaining it, that would be good, because he'd have knowledge of the workflow itself. The aggregation code now has its own repository, https://github.com/microbiomedata/nmdc-aggregator
