
microbiomedata / nmdc-schema


National Microbiome Data Collaborative (NMDC) unified data model

Home Page: https://microbiomedata.github.io/nmdc-schema/

License: Creative Commons Zero v1.0 Universal

Languages: Python 78.89%, Jupyter Notebook 17.89%, Makefile 1.58%, JavaScript 0.74%, Jinja 0.69%, Dockerfile 0.19%, Shell 0.01%
Topics: nmdc, microbiome, standards, linkml, genomics, proteomics, metabolomics, transcriptomics, schema, microbiomedata

nmdc-schema's Introduction

National Microbiome Data Collaborative Schema


The mission of the NMDC is to build a FAIR microbiome data sharing network, through infrastructure, data standards, and community building, that addresses pressing challenges in environmental sciences. The NMDC platform is built on top of a unified data model (schema) that weaves together existing standards and ontologies to provide a systematic representation of all aspects of the microbiome data life cycle.

This repository mainly defines a LinkML schema for managing metadata from the National Microbiome Data Collaborative (NMDC).

Documentation

The documentation for the NMDC schema can be found at https://microbiomedata.github.io/nmdc-schema/. This documentation is aimed at consumers of NMDC data and metadata; it describes the different data elements used to describe studies, samples, sample processing, data generation, workflows, and downstream data objects.

The NMDC Introduction to metadata and ontologies primer provides some of the context for this project.

The remainder of this page is primarily for internal maintainers of, and contributors to, the NMDC schema.

Repository Contents Overview

Some of the products maintained and tasks orchestrated within this repository are:

  • Maintenance of LinkML YAML that specifies the NMDC Schema
  • Makefile targets for converting the schema from its native LinkML YAML format to other artifacts like JSON Schema
  • Build, deployment and distribution of the schema as a PyPI package
  • Automatic publishing of refreshed documentation upon changes to the schema, available at https://microbiomedata.github.io/nmdc-schema/

Maintaining the Schema

See DEVELOPMENT.md for instructions on setting up a development environment.

See MAINTAINERS.md for instructions on using that development environment to maintain the schema.

Makefiles

Makefiles are text files people can use to tell make (a computer program) how it can make things (or—in general—do things). In the world of Makefiles, those things are called targets.

This repo contains two Makefiles:

  • Makefile, based on the generic Makefile from the LinkML cookiecutter
  • project.Makefile, which contains targets that are specific to this project

Here's an example of using make in this repo:

# Deletes all files in `examples/output`.
make examples-clean

The examples-clean target is defined in the project.Makefile. In this repo, the Makefile includes the project.Makefile. As a result, make has access to the targets defined in both files.
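
As an illustration of that include mechanism, here is a minimal sketch; the real examples-clean recipe in project.Makefile may differ:

# Makefile
include project.Makefile

# project.Makefile
examples-clean:
	rm -rf examples/output/*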

Data downloads

The NMDC's metadata about biosamples, studies, bioinformatics workflows, etc. can be obtained from our nmdc-runtime API. Try entering "biosample_set" or "study_set" into the collection_name box at https://api.microbiomedata.org/docs#/metadata/list_from_collection_nmdcschema__collection_name__get

Or use the API programmatically! Note that some collections are large, so the responses are paged.

You can learn about the other available collections at https://microbiomedata.github.io/nmdc-schema/Database/
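
As a sketch of programmatic access, assuming the endpoint path /nmdcschema/{collection_name} and the paging fields max_page_size, page_token, and next_page_token shown in the API docs:

import requests

BASE = "https://api.microbiomedata.org"

def fetch_collection(collection_name, page_size=100):
    # Yield documents from an NMDC collection, following page tokens.
    params = {"max_page_size": page_size}
    while True:
        resp = requests.get(f"{BASE}/nmdcschema/{collection_name}", params=params)
        resp.raise_for_status()
        body = resp.json()
        yield from body.get("resources", [])
        token = body.get("next_page_token")
        if not token:
            break
        params["page_token"] = token

# e.g. print the id of every biosample
for biosample in fetch_collection("biosample_set"):
    print(biosample["id"])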

nmdc-schema's People

Contributors

aclum, anastasiyaprymolenna, bmeluch, brynnz22, cmungall, corilo, dehays, dependabot[bot], dwinston, eecavanna, emileyfadrosh, hubin-keio, jamestessmer, mbthornton-lbl, mslarae13, pkalita-lbl, scanon, shalsh23, shreddd, sujaypatil96, turbomam, wdduncan


nmdc-schema's Issues

conflicting principal_investigator_name

principal_investigator_name is a slot in both Study and OmicsProcessing.

The issue is similar to #10: must an omics processing's PI be the same as for the study? Does one "override" the other in a given context? Again, it would be better to (a) have it exclusively either in study or in omics processing (I'm guessing study makes more sense), or (b) to name them differently so as to clarify, e.g. "study PI" vs "processing PI".
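
A sketch of option (b) in LinkML terms; the slot names below are illustrative only, not decided:

study:
  slots:
    - study principal investigator name        # hypothetical renamed slot
omics processing:
  slots:
    - processing principal investigator name   # hypothetical renamed slot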

@dehays @cmungall

existing NMDC entity attributes to make required

Part of the decomposition of #41

Existing attributes to be indicated in the schema as required (a slot_usage sketch follows the lists below):

omics_processing

  • has_input - note that this one should not be permitted to have an empty array value
  • has_output - ditto
  • part_of

data_object

  • name
  • description - this is an addition from #20

workflow execution activity

  • started_at_time
  • ended_at_time
  • execution_resource
  • git_url
  • name
  • type
  • has_input
  • has_output
  • was_informed_by
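
A sketch of how these requirements might be expressed as LinkML slot_usage; the exact YAML in the schema may differ, and the same pattern would apply to the workflow execution activity slots:

omics processing:
  slot_usage:
    has input:
      required: true   # an empty array should also be rejected (minItems: 1 in generated JSON Schema)
    has output:
      required: true
    part of:
      required: true

data object:
  slot_usage:
    name:
      required: true
    description:
      required: true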

cannot import name 'Id' from 'nmdc_schema.basic_slots'

@cmungall I've reordered the dependency tree to remove cycles. Slots that are used across multiple schemas are now in a schema named basic_slots.yaml. This appears to be working fine, in the sense that make install all builds the targets without error.

However, there is a problem with the PyPI package. When I try to import the nmdc Python module from the package, I receive the error cannot import name 'Id' from 'nmdc_schema.basic_slots'.

Steps to reproduce:

  1. Create a new virtual environment.
  2. Install the nmdc schema: pip install nmdc-schema.
  3. Start ipython.
  4. Execute from nmdc_schema import nmdc.
    Result:
<ipython-input-1-108b8f058d11> in <module>
----> 1 from nmdc_schema import nmdc

~/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/nmdc.py in <module>
     25 from rdflib import Namespace, URIRef
     26 from linkml.utils.curienamespace import CurieNamespace
---> 27 from . annotation import FunctionalAnnotation, GenomeFeature, ReactionId, ReactionParticipant
     28 from . core import AttributeValue, Bytes, ChemicalEntityId, ControlledTermValue, GeneProductId, GeolocationValue, MetaboliteQuantification, NamedThing, NamedThingId, PeptideQuantification, PersonValue, QuantityValue, TextValue, TimestampValue
     29 from . prov import ActivityId

~/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/annotation.py in <module>
     24 from rdflib import Namespace, URIRef
     25 from linkml.utils.curienamespace import CurieNamespace
---> 26 from . core import ChemicalEntityId, GeneProductId, MetaboliteQuantification, OntologyClass, OntologyClassId, PeptideQuantification
     27 from . workflow_execution_activity import MetagenomeAnnotationActivityId
     28 from linkml.utils.metamodelcore import Bool

~/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/core.py in <module>
     22 from rdflib import Namespace, URIRef
     23 from linkml.utils.curienamespace import CurieNamespace
---> 24 from . basic_slots import Id
     25 from . prov import ActivityId
     26 from linkml.utils.metamodelcore import Bool

ImportError: cannot import name 'Id' from 'nmdc_schema.basic_slots' (/Users/wdduncan/repos/NMDC/temp/nmdc-schema-install/.env/lib/python3.7/site-packages/nmdc_schema/basic_slots.py)

conflicting gold paths in biosample and study

Currently, both biosample and study include slots for the five GOLD path fields (ecosystem, ecosystem_category, ecosystem_type, ecosystem_subtype, specific_ecosystem):

This is an issue because it's not clear if all biosamples must have the same values as the study, if the biosample values "override" the "default" study values, etc.

It would be better to (a) have them exclusively either in study or in biosample, or (b) to name them differently so as to clarify, e.g. prefix the slots in study with "default_", "assumed_", etc.

@cmungall @dehays

biosample processing and omics processing should not be in a subclass relationship

biosample processing is documented to be a process that takes samples as inputs and generates samples as outputs; ie sample->sample

  biosample processing:
    is_a: named thing
    description: >-
      A process that takes one or more biosamples as inputs and generates one or more biosamples as outputs.
      Examples of outputs include samples cultivated from another sample or data objects created by instrument runs.
    slots:
      - has input
    slot_usage:
      has input:
        range: biosample

we can map this to material processing in OBI

this can be used to represent graphs of arbitrary depth of processes on samples, from treatments through to subsampling and making aliquots. In practice these graphs may be flat, especially for retrospective samples

omics_processing doesn't have a sample as output - it generates omics data, i.e. sample->data. Yet it is a subclass of biosample processing:

 omics processing:
    is_a: biosample processing
    in_subset: 
      - sample subset
    description: >-
      The methods and processes used to generate omics data from a biosample or organism.
    comments:
      - The IDs for objects coming from GOLD will have prefixes gold:GpNNNN
    slots:
      - part of
      - has output
      - omics type
    slot_usage:
      id:
        description: >-
          The primary identifier for the omics processing. E.g. GOLD:GpNNNN
      name:
        description: >-
          A human readable name or description of the omics processing.
      alternate identifiers:
        description: >-
          The same omics processing may have distinct identifiers in different databases (e.g. GOLD and EMSL, as well as NCBI)
      part of:
        range: study
      has output:
        range: data object

I think this can be mapped to assay in OBI

I don't think this is quite coherent: if OP inherits from BP, then it should inherit the condition that the output is a sample.

It may be more straightforward to have these be siblings.

If we want, we can have a grouping class for both of these, as in the sketch below.
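
A sketch of the sibling arrangement; the grouping class name here is hypothetical:

biosample relevant process:   # hypothetical grouping class
  is_a: named thing
  abstract: true

biosample processing:
  is_a: biosample relevant process
  slot_usage:
    has output:
      range: biosample

omics processing:
  is_a: biosample relevant process
  slot_usage:
    has output:
      range: data object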

Add more integrity checks on fields in JSON schema 2

There are different levels of checks; this issue is to cover the first two:

  1. Is the ID prefix valid? (e.g. KEGG.KO vs KEGG.ORTHOLOG)
  2. Is the local part of the ID syntactically conformant? (e.g. KEGG:K\d+)

The first is very easy to do using the existing id_prefixes annotations in the schema.
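
Both checks could, for example, be compiled into a single JSON Schema pattern per prefix; the KEGG.ORTHOLOG regex below is illustrative:

{
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^KEGG\\.ORTHOLOG:K\\d+$"
    }
  }
}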

add alternate_identifiers slot to biosample

We will need this sooner than expected because GOLD is creating new biosample records that merge RNA and DNA samples when a common source is known. We will need to connect these new GOLD biosample records to EMSL processes by IGSN - so we need the new GOLD biosamples to retain their IGSNs in NMDC biosample records.

Expect that the range is a list of strings
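
A minimal sketch of the slot definition; exact naming and description are assumptions:

alternate identifiers:
  description: >-
    Other identifiers for the same biosample, e.g. the IGSN
    carried over from a merged GOLD biosample record.
  multivalued: true
  range: string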

make NMDC data object a ga4gh DRS object?

The GA4GH Data Repository Service (DRS) API spec defines an object type, DrsObject, that has properties useful for workflow automation, for example url+headers for authorized access, or tokens for deferring url generation. I have sketched out a pydantic model for it in the nmdc-runtime repo. Also, #49 suggests a checksum field, which DRS addresses as an array (e.g. [{checksum: ..., type: 'crc32c'}, {checksum: ..., type: 'md5'}, ...]).

My suggestion here is to make NMDC's data object a DRS object, i.e. align its LinkML definition with DRS's DrsObject spec.
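
For the checksum piece specifically, a minimal pydantic sketch (not the actual nmdc-runtime model) might look like:

from typing import List
from pydantic import BaseModel

class Checksum(BaseModel):
    checksum: str
    type: str  # e.g. "md5", "crc32c"

class DrsObjectSketch(BaseModel):
    id: str
    self_uri: str
    size: int
    checksums: List[Checksum]  # DRS models checksums as an array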

Add slots to database class to capture versioning

As part of the ETL, add versioning info to the database class. I have already created some stubs for this:

database:
  ...
  slot_usage:
    nmdc schema version:  # pulled from the nmdc.yaml file; I have a util to get this
      description: >-
        TODO
    date created:  # date the database was created
      description: >-
        TODO
    etl software version:  # requires versioning of our etl scripts
      description: >-
        TODO

Do we need others? @dehays @cmungall @dwinston @jeffbaumes @jbeezley

add started_at_time, ended_at_time slot to workflow execution activity

@cmungall I just noticed that the workflow execution activity class does not have the started at time and ended at time slots:

workflow execution activity:
    is_a: activity
    in_subset: 
      - workflow subset
    description: >-
      Represents an instance of an execution of a particular workflow
    slots:
      - execution resource
      - git url
      - has input
      - has output
      - type  # custom slot that specifies object type
    slot_usage:
      was associated with:
        required: false
        description: >-
          the agent/entity associated with the generation of the file
        range: workflow execution activity
        inlined: false # allow for strings of IDs

However, the data coming from Aim 2 does have these slots. E.g.:

{
    "id": "nmdc:f2fc8f5aade3092ea97769f0a892d2a9",
    "name": "MAGs activiity 1781_86101",
    "was_informed_by": "gold:Gp0115663",
    "started_at_time": "2021-01-10",
    "ended_at_time": "2021-01-10",
    "type": "nmdc:MAGsAnalysisActivity",
    "execution_resource": "NERSC - Cori",
    "git_url": "https://img.jgi.doe.gov",
    ...
}

I think we should add these slots to the schema.

Thoughts?

add slot relating biosample to study

Currently, biosamples are related to studies via omics processing. I.e.:

(biosample) <-- has input -- (omics process) -- part of --> (study)

It would be useful to have a slot that directly relates biosamples to studies.
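
A sketch of one way to express this in LinkML; reusing the part_of slot here is an assumption:

biosample:
  slots:
    - part of
  slot_usage:
    part of:
      range: study
      description: the study for which the biosample was collected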

Register EMSL project prefix on identifiers.org

I recommend registering EMSL.project as a prefix just in case the same 5 digit number is reused for other entities, e.g. metabolites.

It looks like the redirect should be to:

https://search.emsl.pnnl.gov/?project[0]=projects_$LocalID

E.g.

https://search.emsl.pnnl.gov/?project[0]=projects_51283

Is this correct, @SamuelPurvine?

See https://microbiomedata.github.io/nmdc-schema/identifiers/

Note: once registered with identifiers.org it should percolate out to n2t, bioregistry, prefixcommons, ...

run tests for invalid data

For invalid datasets in test/data, add a target to the Makefile to check that invalid data is confirmed to be invalid.

As a convention, prefix the name of the invalid dataset with invalid_; e.g., invalid_study_test.json.
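
A sketch of such a Makefile pattern rule, assuming the file layout used by the existing test-jsonschema target; the rule succeeds only when validation fails:

# each invalid_*.json file must FAIL validation for the target to succeed
validate-invalid-%: test/data/invalid_%.json
	! jsonschema -i $< jsonschema/nmdc.schema.json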

cc @turbomam

Share examples of repaired NCBI/INSDC Biosample metadata

Efforts for mapping categorical data from the NCBI/INSDC Biosample metadata to OBO Foundry terms have been taking place in several branches or forks of INCATools/biosample-analysis and turbomam/scoped-mapping, at least. That's partially because these techniques will be applied to other projects besides NMDC.

Existing efforts have been batchwise or notebook-based. We are planning to move towards a repair API.

For now, we need a place to store mappings so that NMDC personnel can review them and suggest changes to our strategies.

CCing @cmungall @wdduncan

additional optional attributes to be added to study

Part of the decomposition of #41.

To be added to the study entity as optional attributes. These are not currently available from GOLD studies and therefore cannot presently be populated via the GOLD -> NMDC ETL.

  • study DOI
  • study websites
  • publication DOIs
  • proposal name
  • scientific objective

Add ability to link studies or make umbrella studies

At JGI we have many cases where there are multiple proposals that span a long-term “study” (e.g., a large, multi-year SFA project that involves many hundreds of samples).

We also need to add mappings for concepts of studies across different systems.

add depth2 slots

For biosamples (and related package tables), depth2 and subsurface_depth2 slots are needed for when the depth is within a range.
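
A minimal sketch of one of the paired slots; the description and range are assumptions:

depth2:
  description: >-
    The upper value of a depth range, used together with depth
    when a sample was collected across a range of depths.
  range: quantity value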

jsonschema validation working for new file locally but not in GH actions

locally:

% make test-jsonschema
jsonschema -i test/data/biosample_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/gold_project_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/img_mg_annotation_objects.json jsonschema/nmdc.schema.json
jsonschema -i test/data/nmdc_example_database.json jsonschema/nmdc.schema.json
jsonschema -i test/data/MAGs_activity.json jsonschema/nmdc.schema.json
jsonschema -i test/data/mg_assembly_activities_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/mg_assembly_data_objects_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/study_test.json jsonschema/nmdc.schema.json
jsonschema -i test/data/functional_annotation_set_valid.json jsonschema/nmdc.schema.json

GitHub Actions:

make: *** No rule to make target 'validate-functional_annotation_set_valid', needed by 'test-jsonschema'.  Stop.

add a schema version on releases of NMDC schema

We discussed using semantic versioning or simply a timestamp on release versions. Either could work; to avoid needing to define the semantics of semantic versioning (does a minor or patch version break ingest, or only major?), we decided timestamped release versions were simplest.

With versions on the schema, it becomes possible to ask what schema version the metadata store (MongoDB or Terminus) is at, and what versions the search application's ingest and relationship schema support.
