Comments (6)
Please keep in mind that we also intend to add annotations that aren't NMDC-derived when using the version 2 pipeline. Currently we plan to use UniProt annotations, but we shouldn't expect to limit ourselves to that repository alone. Given the wild west of protein naming I've encountered over the past 20 years, trying to enforce a naming structure or pattern will end in misery, but we can always add that next standard (xkcd standards comic reference) :)
all_proteins is going to firmly and happily go away in the schema as soon as we can tackle the bloat, which comes after implementation of the refactored schema, which comes after re-id-ing the proteomics data, which comes after the metagenome annotations have completed...
There's also discussion about renaming best_protein to something more descriptive, such as most_confidently_associated_protein or something of that ilk, and adding a slot that defines and describes how that association was made (currently parsimony; others will likely come online).
from nmdc-schema.
At the metap meeting on 6/4/24 we discussed a minimum constraint of a CURIE and a maximum constraint of an NMDC workflow identifier + UniProt.
The current plan for UniProt IDs (incorporated in the current Kaiko implementation) would be to use their full string, such as "tr|A0A1D5Q1C9|A0A1D5Q1C9_MACMU" or "sp|A1L190|SYCE3_HUMAN", which denotes three elements, each separated by a pipe: the sequence source (TrEMBL or Swiss-Prot), the Entry ID, and the Entry Name. One supposes adding a prefix of "uniprot:" might curie these adequately?
@SamuelPurvine The existing documentation about UniProt prefix registration is https://bioregistry.io/registry/uniprot
NMDC has the prefix of UniProtKB to expand to https://bioregistry.io/uniprot
So for "sp|A1L190|SYCE3_HUMAN" the code that makes the JSON file for the schema that contains the value for best_protein/occam_protein/prefered_slot_name would be UniProtKB:Entry ID, for example UniProtKB:A1L190
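As a rough illustration of that conversion, here is a minimal Python sketch. It assumes the three pipe-separated fields described earlier in the thread; the function names `parse_uniprot_header` and `to_uniprotkb_curie` are hypothetical and are not part of nmdc-schema or the Kaiko workflow.

```python
# Sketch only: helper names are hypothetical, not taken from nmdc-schema or Kaiko.
# Assumes the "source|Entry ID|Entry Name" layout described in this thread.

SOURCE_NAMES = {"sp": "Swiss-Prot", "tr": "TrEMBL"}


def parse_uniprot_header(header: str) -> dict:
    """Split a string like 'sp|A1L190|SYCE3_HUMAN' into its three elements."""
    parts = header.strip().split("|")
    if len(parts) != 3 or parts[0] not in SOURCE_NAMES or not all(parts):
        raise ValueError(f"not a recognized UniProt identifier string: {header!r}")
    source, entry_id, entry_name = parts
    return {
        "source": SOURCE_NAMES[source],
        "entry_id": entry_id,
        "entry_name": entry_name,
    }


def to_uniprotkb_curie(header: str, prefix: str = "UniProtKB") -> str:
    """Build a CURIE of the form 'UniProtKB:<Entry ID>' from the full string."""
    return f"{prefix}:{parse_uniprot_header(header)['entry_id']}"


if __name__ == "__main__":
    print(to_uniprotkb_curie("sp|A1L190|SYCE3_HUMAN"))
    print(to_uniprotkb_curie("tr|A0A1D5Q1C9|A0A1D5Q1C9_MACMU"))
```

Only the Entry ID ends up in the CURIE, so the sequence source and Entry Name would have to be stored elsewhere if they need to be kept.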
OK, very cool, we should certainly be able to accommodate that. Is there / will there be machinery to apply functional annotations for UniProt entries, or do "we" (the Kaiko team as implemented through the proteomics workflow) need to provide that to allow the portal to show Kaiko search results? There's probably more packed into that question (like who will end up making that aggregation table and populating it... and how?... and when?) than easily fits here, but thought I'd ask!
The only functional annotation supported now is KEGG, so this would require development in the data portal. The workflow should not generate the aggregation; if needed, the aggregation would be written separately. If Cam is up for maintaining it, that would be good, because he'd have knowledge of the workflow itself. The aggregation code now has its own repository, https://github.com/microbiomedata/nmdc-aggregator