review id_locus pattern (nmdc-schema, OPEN)

aclum commented on July 19, 2024
review id_locus pattern

from nmdc-schema.

Comments (6)

SamuelPurvine commented on July 19, 2024

Please keep in mind we also intend to add annotations that aren't NMDC-derived when using the version 2 pipeline. Currently we are planning to use UniProt annotations, but we shouldn't expect to limit ourselves to only that repository. Given the wild west of protein naming I've encountered over the past 20 years, trying to enforce a naming structure or pattern will end in misery, but we can always add that next standard (xkcd standards comic reference) :)

all_proteins is going to firmly and happily go away in the schema as soon as we can tackle the bloat, which comes after implementation of the refactored schema, which comes after re-id-ing the proteomics data, which comes after the metagenome annotations have completed...

There's also discussion of renaming best_protein to something more descriptive, such as most_confidently_associated_protein or something of similar ilk, and adding a slot that defines and describes how that association was made (currently parsimony, others will likely come online).

aclum commented on July 19, 2024

At the metaP meeting on 6/4/24 we discussed a minimum constraint of a CURIE and a maximum constraint of an NMDC workflow identifier + UniProt.
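
As a rough illustration of the minimum constraint only, the sketch below checks that a value has the general `prefix:local_id` shape of a CURIE. The regular expression and the function name `looks_like_curie` are assumptions made for this example; they are not the pattern used in nmdc-schema, and the maximum constraint (NMDC workflow identifier + UniProt) is not modelled here.

```python
import re

# Hypothetical, deliberately loose shape check: a prefix, a colon, then a
# non-empty local part without whitespace. This is NOT the nmdc-schema pattern.
CURIE_SHAPE = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*:\S+")


def looks_like_curie(value: str) -> bool:
    """Return True if value has the general prefix:local_id shape."""
    return CURIE_SHAPE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("UniProtKB:A1L190", "sp|A1L190|SYCE3_HUMAN"):
        print(candidate, looks_like_curie(candidate))
```

A raw UniProt string such as "sp|A1L190|SYCE3_HUMAN" fails this check because it has no prefix and colon, which is why a prefix has to be added before it can satisfy a CURIE constraint.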

SamuelPurvine commented on July 19, 2024

The current plan for UniProt IDs (incorporated in the current Kaiko implementation) would be to use their full string, such as "tr|A0A1D5Q1C9|A0A1D5Q1C9_MACMU" or "sp|A1L190|SYCE3_HUMAN", which denotes three elements: the sequence source (TrEMBL or Swiss-Prot), the Entry ID, and the Entry Name, each separated by a pipe. One supposes adding a prefix of "uniprot:" might curie these adequately?

aclum commented on July 19, 2024

@SamuelPurvine The existing documentation about UniProt prefix registration is https://bioregistry.io/registry/uniprot
NMDC uses the prefix UniProtKB, which expands to https://bioregistry.io/uniprot
So for "sp|A1L190|SYCE3_HUMAN", the code that generates the JSON file should set the value of best_protein/occam_protein/preferred_slot_name to UniProtKB:<Entry ID>, for example UniProtKB:A1L190
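
The conversion described above could be sketched roughly as follows, assuming the input is always the three-part, pipe-separated UniProt string mentioned earlier in the thread. The function name `uniprot_header_to_curie` and its validation rules are hypothetical and are not taken from the Kaiko or NMDC code.

```python
def uniprot_header_to_curie(header: str, prefix: str = "UniProtKB") -> str:
    """Turn 'sp|A1L190|SYCE3_HUMAN' into 'UniProtKB:A1L190'.

    Assumes three pipe-separated fields: source (sp or tr), Entry ID,
    and Entry Name. Only the Entry ID is kept in the CURIE.
    """
    parts = header.strip().split("|")
    if len(parts) != 3:
        raise ValueError(f"expected source|entry_id|entry_name, got {header!r}")
    source, entry_id, _entry_name = parts
    if source not in {"sp", "tr"}:  # Swiss-Prot or TrEMBL
        raise ValueError(f"unknown sequence source {source!r} in {header!r}")
    if not entry_id:
        raise ValueError(f"empty Entry ID in {header!r}")
    return f"{prefix}:{entry_id}"


if __name__ == "__main__":
    print(uniprot_header_to_curie("sp|A1L190|SYCE3_HUMAN"))
    print(uniprot_header_to_curie("tr|A0A1D5Q1C9|A0A1D5Q1C9_MACMU"))
```

Note that the sequence source and Entry Name are dropped in this sketch; if they need to be retained, they would have to be stored somewhere other than the CURIE itself.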

SamuelPurvine commented on July 19, 2024

OK, very cool, we should certainly be able to accommodate that. Is there / will there be machinery to apply functional annotations for UniProt entries, or do "we" (the Kaiko team as implemented through the proteomics workflow) need to provide that to allow the portal to show Kaiko search results? There's probably more packed into that question (like who will end up making that aggregation table and populating it... and how?... and when?) than easily fits here, but thought I'd ask!

aclum commented on July 19, 2024

The only functional annotation supported now is KEGG, so this would require development in the data portal. The workflow should not generate the aggregation; if aggregation is needed, it would be written separately. If Cam is up for maintaining it, that would be good, because he'd have knowledge of the workflow itself. The aggregation code now has its own repository, https://github.com/microbiomedata/nmdc-aggregator
