
microbiomedata / nmdc-schema


National Microbiome Data Collaborative (NMDC) unified data model

Home Page: https://microbiomedata.github.io/nmdc-schema/

License: Creative Commons Zero v1.0 Universal

Languages: Python 78.89%, Jupyter Notebook 17.89%, Makefile 1.58%, JavaScript 0.74%, Jinja 0.69%, Dockerfile 0.19%, Shell 0.01%
Topics: nmdc, microbiome, standards, linkml, genomics, proteomics, metabolomics, transcriptomics, schema, microbiomedata

nmdc-schema's Introduction

National Microbiome Data Collaborative Schema


The mission of the NMDC is to build a FAIR microbiome data sharing network, through infrastructure, data standards, and community building, that addresses pressing challenges in environmental sciences. The NMDC platform is built on top of a unified data model (schema) that weaves together existing standards and ontologies to provide a systematic representation of all aspects of the microbiome data life cycle.

This repository mainly defines a LinkML schema for managing metadata from the National Microbiome Data Collaborative (NMDC).

Documentation

The documentation for the NMDC schema can be found at https://microbiomedata.github.io/nmdc-schema/. This documentation is aimed at consumers of NMDC data and metadata; it describes the different data elements used to describe studies, samples, sample processing, data generation, workflows, and downstream data objects.

The NMDC Introduction to metadata and ontologies primer provides some of the context for this project.

The remainder of this page is primarily for internal maintainers of, and contributors to, the NMDC schema.

Repository Contents Overview

Some of the products maintained and tasks orchestrated within this repository are:

  • Maintenance of LinkML YAML that specifies the NMDC Schema
  • Makefile targets for converting the schema from its native LinkML YAML format to other artifacts like JSON Schema
  • Build, deployment and distribution of the schema as a PyPI package
  • Automatic publishing of refreshed documentation upon changes to the schema, available at https://microbiomedata.github.io/nmdc-schema/

Maintaining the Schema

See DEVELOPMENT.md for instructions on setting up a development environment.

See MAINTAINERS.md for instructions on using that development environment to maintain the schema.

Makefiles

Makefiles are text files people can use to tell make (a computer program) how it can make things (or—in general—do things). In the world of Makefiles, those things are called targets.

This repo contains two Makefiles:

  • Makefile, based on the generic Makefile from the LinkML cookiecutter
  • project.Makefile, which contains targets that are specific to this project

Here's an example of using make in this repo:

# Deletes all files in `examples/output`.
make examples-clean

The examples-clean target is defined in the project.Makefile. In this repo, the Makefile includes the project.Makefile. As a result, make has access to the targets defined in both files.
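
As an illustration of that include mechanism, here is a minimal sketch; the real examples-clean recipe in project.Makefile may differ:

# Makefile
include project.Makefile

# project.Makefile
examples-clean:
	rm -rf examples/output/*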

Data downloads

The NMDC's metadata about biosamples, studies, bioinformatics workflows, etc. can be obtained from our nmdc-runtime API. Try entering "biosample_set" or "study_set" into the collection_name box at https://api.microbiomedata.org/docs#/metadata/list_from_collection_nmdcschema__collection_name__get

Or use the API programmatically! Note that some collections are large, so the responses are paged.

You can learn about the other available collections at https://microbiomedata.github.io/nmdc-schema/Database/
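
As a sketch of programmatic access, assuming the endpoint path /nmdcschema/{collection_name} and the paging fields max_page_size, page_token, and next_page_token shown in the API docs:

import requests

BASE = "https://api.microbiomedata.org"

def fetch_collection(collection_name, page_size=100):
    # Yield documents from an NMDC collection, following page tokens.
    params = {"max_page_size": page_size}
    while True:
        resp = requests.get(f"{BASE}/nmdcschema/{collection_name}", params=params)
        resp.raise_for_status()
        body = resp.json()
        yield from body.get("resources", [])
        token = body.get("next_page_token")
        if not token:
            break
        params["page_token"] = token

# e.g. print the id of every biosample
for biosample in fetch_collection("biosample_set"):
    print(biosample["id"])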

nmdc-schema's People

Contributors

aclum, anastasiyaprymolenna, bmeluch, brynnz22, cmungall, corilo, dehays, dependabot[bot], dwinston, eecavanna, emileyfadrosh, hubin-keio, jamestessmer, mbthornton-lbl, mslarae13, pkalita-lbl, scanon, shalsh23, shreddd, sujaypatil96, turbomam, wdduncan


nmdc-schema's Issues

conflicting principal_investigator_name

principal_investigator_name is a slot in both Study and OmicsProcessing.

The issue is similar to #10: must an omics processing's PI be the same as for the study? Does one "override" the other in a given context? Again, it would be better to (a) have it exclusively either in study or in omics processing (I'm guessing study makes more sense), or (b) to name them differently so as to clarify, e.g. "study PI" vs "processing PI".
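
A sketch of option (b) in LinkML terms; the slot names below are illustrative only, not decided:

study:
  slots:
    - study principal investigator name        # hypothetical renamed slot
omics processing:
  slots:
    - processing principal investigator name   # hypothetical renamed slot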

@dehays @cmungall

existing NMDC entity attributes to make required

Part of the decomposition of #41

Existing attributes to be indicated in the schema as required (a slot_usage sketch follows the lists below):

omics_processing

  • has_input - note that this one should not be permitted to have an empty array value
  • has_output - ditto
  • part_of

data_object

  • name
  • description - this is an addition from #20

workflow execution activity

  • started_at_time
  • ended_at_time
  • execution_resource
  • git_url
  • name
  • type
  • has_input
  • has_output
  • was_informed_by
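
A sketch of how these requirements might be expressed as LinkML slot_usage; the exact YAML in the schema may differ, and the same pattern would apply to the workflow execution activity slots:

omics processing:
  slot_usage:
    has input:
      required: true   # an empty array should also be rejected (minItems: 1 in generated JSON Schema)
    has output:
      required: true
    part of:
      required: true

data object:
  slot_usage:
    name:
      required: true
    description:
      required: true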

cannot import name 'Id' from 'nmdc_schema.basic_slots'

@cmungall I've reordered the dependency tree to remove cycles. Slots that are used across multiple schemas are now in a schema named basic_slots.yaml. This appears to be working fine, in the sense that make install all builds the targets without error.

However, there is a problem with the PyPI package. When I try to import the nmdc Python module from the package, I receive the error cannot import name 'Id' from 'nmdc_schema.basic_slots'.

Steps to reproduce:

  1. Create a new virtual environment.
  2. Install the nmdc schema: pip install nmdc-schema.
  3. Start ipython.
  4. Execute from nmdc_schema import nmdc.
    Result:
<ipython-input-1-108b8f058d11> in <module>
----> 1 from nmdc_schema import nmdc

~/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/nmdc.py in <module>
     25 from rdflib import Namespace, URIRef
     26 from linkml.utils.curienamespace import CurieNamespace
---> 27 from . annotation import FunctionalAnnotation, GenomeFeature, ReactionId, ReactionParticipant
     28 from . core import AttributeValue, Bytes, ChemicalEntityId, ControlledTermValue, GeneProductId, GeolocationValue, MetaboliteQuantification, NamedThing, NamedThingId, PeptideQuantification, PersonValue, QuantityValue, TextValue, TimestampValue
     29 from . prov import ActivityId

~/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/annotation.py in <module>
     24 from rdflib import Namespace, URIRef
     25 from linkml.utils.curienamespace import CurieNamespace
---> 26 from . core import ChemicalEntityId, GeneProductId, MetaboliteQuantification, OntologyClass, OntologyClassId, PeptideQuantification
     27 from . workflow_execution_activity import MetagenomeAnnotationActivityId
     28 from linkml.utils.metamodelcore import Bool

~/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/core.py in <module>
     22 from rdflib import Namespace, URIRef
     23 from linkml.utils.curienamespace import CurieNamespace
---> 24 from . basic_slots import Id
     25 from . prov import ActivityId
     26 from linkml.utils.metamodelcore import Bool

ImportError: cannot import name 'Id' from 'nmdc_schema.basic_slots' (/Users/wdduncan/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/basic_slots.py)

conflicting gold paths in biosample and study

Currently, both biosample and study include slots for the five GOLD path fields (ecosystem, ecosystem_category, ecosystem_type, ecosystem_subtype, specific_ecosystem):

This is an issue because it's not clear if all biosamples must have the same values as the study, if the biosample values "override" the "default" study values, etc.

It would be better to (a) have them exclusively either in study or in biosample, or (b) to name them differently so as to clarify, e.g. prefix the slots in study with "default_", "assumed_", etc.

@cmungall @dehays

biosample processing and omics processing should not be in a subclass relationship

biosample processing is documented to be a process that takes samples as inputs and generates samples as outputs; ie sample->sample

  biosample processing:
    is_a: named thing
    description: >-
      A process that takes one or more biosamples as inputs and generates one or more biosamples as outputs.
      Examples of outputs include samples cultivated from another sample or data objects created by instrument runs.
    slots:
      - has input
    slot_usage:
      has input:
        range: biosample

we can map this to material processing in OBI

this can be used to represent graphs of arbitrary depth of processes on samples, from treatments through to subsampling and making aliquots. In practice these graphs may be flat, especially for retrospective samples

omics_processing doesn't have a sample as output - it generates omics data, i.e. sample->data. Yet it is a subclass of biosample processing:

 omics processing:
    is_a: biosample processing
    in_subset: 
      - sample subset
    description: >-
      The methods and processes used to generate omics data from a biosample or organism.
    comments:
      - The IDs for objects coming from GOLD will have prefixes gold:GpNNNN
    slots:
      - part of
      - has output
      - omics type
    slot_usage:
      id:
        description: >-
          The primary identifier for the omics processing. E.g. GOLD:GpNNNN
      name:
        description: >-
          A human readable name or description of the omics processing.
      alternate identifiers:
        description: >-
          The same omics processing may have distinct identifiers in different databases (e.g. GOLD and EMSL, as well as NCBI)
      part of:
        range: study
      has output:
        range: data object

I think this can be mapped to assay in OBI

I don't think this is quite coherent: if OP inherits from BP, then it should inherit the condition that the output is a sample.

It may be more straightforward to have these be siblings.

If we want, we can have a grouping class for both of these, as in the sketch below.
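
A sketch of the sibling arrangement; the grouping class name here is hypothetical:

biosample relevant process:   # hypothetical grouping class
  is_a: named thing
  abstract: true

biosample processing:
  is_a: biosample relevant process
  slot_usage:
    has output:
      range: biosample

omics processing:
  is_a: biosample relevant process
  slot_usage:
    has output:
      range: data object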

Add more integrity checks on fields in JSON schema 2

There are different levels of checks; this issue is to cover the first two:

  1. Is the ID prefix valid? (e.g. KEGG.KO vs KEGG.ORTHOLOG)
  2. Is the local part of the ID syntactically conformant? (e.g. KEGG:K\d+)

The first is very easy to do using the existing id_prefixes annotations in the schema.
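
Both checks could, for example, be compiled into a single JSON Schema pattern per prefix; the KEGG.ORTHOLOG regex below is illustrative:

{
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^KEGG\\.ORTHOLOG:K\\d+$"
    }
  }
}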

add alternate_identifiers slot to biosample

We will need this sooner than expected because GOLD is creating new biosample records that merge RNA and DNA samples when a common source is known. We will need to connect these new GOLD biosample records to EMSL processes by IGSN - so we need the new GOLD biosamples to retain their IGSNs in NMDC biosample records.

Expect that the range is a list of strings
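
A minimal sketch of the slot definition; exact naming and description are assumptions:

alternate identifiers:
  description: >-
    Other identifiers for the same biosample, e.g. the IGSN
    carried over from a merged GOLD biosample record.
  multivalued: true
  range: string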

make NMDC data object a ga4gh DRS object?

The GA4GH Data Repository Service (DRS) API spec defines an object type, DrsObject, that has properties useful for workflow automation, for example url+headers for authorized access, or tokens for deferring url generation. I have sketched out a pydantic model for it in the nmdc-runtime repo. Also, #49 suggests a checksum field, which DRS addresses as an array (e.g. [{checksum: ..., type: 'crc32c'}, {checksum: ..., type: 'md5'}, ...]).

My suggestion here is to make NMDC's data object a DRS object, i.e. align its LinkML definition with DRS's DrsObject spec.
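
For the checksum piece specifically, a minimal pydantic sketch (not the actual nmdc-runtime model) might look like:

from typing import List
from pydantic import BaseModel

class Checksum(BaseModel):
    checksum: str
    type: str  # e.g. "md5", "crc32c"

class DrsObjectSketch(BaseModel):
    id: str
    self_uri: str
    size: int
    checksums: List[Checksum]  # DRS models checksums as an array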

Add slots to database class to capture versioning

As part of the ETL, add versioning info to the database class. I have already created some stubs for this:

database:
  ...
  slot_usage:
    nmdc schema version:  # pulled from the nmdc.yaml file; I have a util to get this
      description: >-
        TODO
    date created:  # date the database was created
      description: >-
        TODO
    etl software version:  # requires versioning of our etl scripts
      description: >-
        TODO

Do we need others? @dehays @cmungall @dwinston @jeffbaumes @jbeezley

add started_at_time, ended_at_time slot to workflow execution activity

@cmungall I just noticed that the workflow execution activity class does not have the started at time and ended at time slots:

workflow execution activity:
    is_a: activity
    in_subset: 
      - workflow subset
    description: >-
      Represents an instance of an execution of a particular workflow
    slots:
      - execution resource
      - git url
      - has input
      - has output
      - type  # custom slot that specifies object type
    slot_usage:
      was associated with:
        required: false
        description: >-
          the agent/entity associated with the generation of the file
        range: workflow execution activity
        inlined: false # allow for strings of IDs

However, the data coming from Aim 2 does have these slots. E.g.:

{
    "id": "nmdc:f2fc8f5aade3092ea97769f0a892d2a9",
    "name": "MAGs activiity 1781_86101",
    "was_informed_by": "gold:Gp0115663",
    "started_at_time": "2021-01-10",
    "ended_at_time": "2021-01-10",
    "type": "nmdc:MAGsAnalysisActivity",
    "execution_resource": "NERSC - Cori",
    "git_url": "https://img.jgi.doe.gov",
    ...
}

I think we should add these slots to the schema.

Thoughts?

add slot relating biosample to study

Currently, biosamples are related to studies via omics processing. I.e.:

(biosample) <-- has input -- (omics process) -- part of --> (study)

It would be useful to have a slot that directly relates biosamples to studies.
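
A sketch of one way to express this in LinkML; reusing the part_of slot here is an assumption:

biosample:
  slots:
    - part of
  slot_usage:
    part of:
      range: study
      description: the study for which the biosample was collected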

Register EMSL project prefix on identifiers.org

I recommend registering EMSL.project as a prefix just in case the same 5 digit number is reused for other entities, e.g. metabolites.

It looks like the redirect should be to:

https://search.emsl.pnnl.gov/?project[0]=projects_$LocalID

E.g.

https://search.emsl.pnnl.gov/?project[0]=projects_51283

Is this correct, @SamuelPurvine?

See https://microbiomedata.github.io/nmdc-schema/identifiers/

Note: once registered with identifiers.org it should percolate out to n2t, bioregistry, prefixcommons, ...

run tests for invalid data

For invalid datasets in test/data, add a target to the Makefile to check that invalid data is confirmed to be invalid.

As a convention, prefix the name of the invalid dataset with invalid_; e.g., invalid_study_test.json.
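
A sketch of such a Makefile pattern rule, assuming the file layout used by the existing test-jsonschema target; the rule succeeds only when validation fails:

# each invalid_*.json file must FAIL validation for the target to succeed
validate-invalid-%: test/data/invalid_%.json
	! jsonschema -i $< jsonschema/nmdc.schema.json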

cc @turbomam

Share examples of repaired NCBI/INSDC Biosample metadata

Efforts for mapping categorical data from the NCBI/INSDC Biosample metadata to OBO Foundry terms have been taking place in several branches or forks of INCATools/biosample-analysis and turbomam/scoped-mapping, at least. That's partially because these techniques will be applied to other projects besides NMDC.

Existing efforts have been batchwise or notebook-based. We are planning to move towards a repair API.

For now, we need a place to store mappings so that NMDC personnel can review them and suggest changes to our strategies.

CCing @cmungall @wdduncan

additional optional attributes to be added to study

Part of the decomposition of #41.

To be added to the study entity as optional attributes. These are not currently available from GOLD studies and therefore cannot presently be populated via the GOLD -> NMDC ETL.

  • study DOI
  • study websites
  • publication DOIs
  • proposal name
  • scientific objective

Add ability to link studies or make umbrella studies

At JGI we have many cases where there are multiple proposals that span a long-term “study” (e.g., a large, multi-year SFA project that involves many hundreds of samples).

We also need to add mappings for concepts of studies across different systems.

add depth2 slots

For biosamples (and related package tables), depth2 and subsurface_depth2 slots are needed for when the depth is within a range.
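
A minimal sketch of one of the paired slots; the description and range are assumptions:

depth2:
  description: >-
    The upper value of a depth range, used together with depth
    when a sample was collected across a range of depths.
  range: quantity value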

jsonschema validation working for new file locally but not in GH actions

locally:

% make test-jsonschema
jsonschema -i test/data/biosample_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/gold_project_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/img_mg_annotation_objects.json jsonschema/nmdc.schema.json
jsonschema -i test/data/nmdc_example_database.json jsonschema/nmdc.schema.json
jsonschema -i test/data/MAGs_activity.json jsonschema/nmdc.schema.json
jsonschema -i test/data/mg_assembly_activities_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/mg_assembly_data_objects_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/study_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/functional_annotation_set_valid.json jsonschema/nmdc.schema.json

GitHub Actions:

make: *** No rule to make target 'validate-functional_annotation_set_valid', needed by 'test-jsonschema'.  Stop.

add a schema version on releases of NMDC schema

We discussed using semantic versioning or simply a timestamp on release versions. Either could work; to avoid needing to define the semantics of semantic versioning (does a minor or patch version break ingest, or only major?), we decided timestamped release versions were simplest.

With versions on the schema, it becomes possible to ask what schema version the metadata store (MongoDB or Terminus) is at, and what versions the search application's ingest and relationship schema support.
