realmarcin / sample-annotator

This project forked from microbiomedata/sample-annotator

NMDC Sample Annotator

Home Page: https://microbiomedata.github.io/sample-annotator/static/intro.html

sample-annotator's Introduction

NMDC Sample Annotator API

Installing

pipenv

This requires Python 3.7 or later, available as the default python.

If you have pipenv installed:

git clone ...
cd sample-annotator
make test

venv

For those using venv, you'll need something like:

git clone ...
cd sample-annotator
python3.7 -m venv env
source ./env/bin/activate
pip install pipenv
PIPENV_IGNORE_VIRTUALENVS=1 make test

While there may be more concise ways of running commands like those below, this works:

PIPENV_IGNORE_VIRTUALENVS=1 pipenv run python -m sample_annotator.sample_annotator -R examples/report.tsv examples/gold.json

What is it?

This is a Python and Flask API for annotating samples from semi-structured or untidy data.

The API takes as input a JSON object or dictionary representing a simple sample, where each key is a metadata field.

It will attempt to tidy the sample and infer missing data according to a specified schema (currently MIxS).
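
To make the expected input shape concrete, here is a minimal sketch that checks a JSON document is an array of sample dictionaries before any annotation happens. The function name load_samples is hypothetical and is not part of the sample_annotator package; it only illustrates the input contract described above.

```python
import json


def load_samples(text):
    """Parse JSON text and check that it is an array of sample dicts.

    Hypothetical helper for illustration; not the package's own loader.
    """
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of samples")
    for index, sample in enumerate(data):
        if not isinstance(sample, dict):
            raise ValueError(f"sample {index} is not a JSON object")
    return data


samples = load_samples('[{"id": "gold:Gb0108335", "depth": "0.0 m"}]')
print(sorted(samples[0]))  # the metadata field names of the first sample
```

Rejecting anything that is not a list of objects up front keeps later repair steps free of shape checks.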

Command Line

pipenv run annotate-sample --help

Usage: annotate-samples [OPTIONS] SAMPLEFILE

  Annotate a file of samples, producing a "repaired"/enhanced sample file as
  output, together with a report

  The input file must be a JSON file containing an array of dicts

Options:
  -v, --validateonly / -g, --generate
                                  Just validate / generate output (default:
                                  generate)

  -s, --output TEXT               JSON for tidied samples
  -R, --report-file TEXT          report file
  -G, --googlemaps-api-key-path TEXT
                                  path to file containing google maps API KEY
  -B, --bioportal-api-key-path TEXT
                                  path to file containing bioportal API KEY
  --help                          Show this message and exit.

E.g.

pipenv run annotate-sample -G config/googlemaps-api-key.txt -R examples/report.tsv examples/gold.json

This will transform input such as:

[
    {
        "id": "gold:Gb0108335",
        "community": "microbial communities",
        "depth": "0.0 m",
        "ecosystem": "Environmental",
        "ecosystem_category": "Terrestrial",
        "ecosystem_subtype": "Wetlands",
        "ecosystem_type": "Soil",
        "env_broad_scale": "ENVO:00000446",
        "env_local_scale": "ENVO:00000489",
        "env_medium": "ENVO:00000134",
        "geo_loc_name": "Sweden: Kiruna",
        "habitat": "Thawing permafrost",
        "identifier": "studying carbon transformations",
        "lat_lon": "68.3534 19.0472",
        "location": "from the Arctic",
        "mod_date": "15-MAY-20 10.04.19.473000000 AM",
        "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
        "ncbi_taxonomy_name": "permafrost metagenome",
        "sample_collection_site": "Palsa",
        "specific_ecosystem": "Permafrost",
        "study_description": "A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century.",
        "type": "nmdc:Biosample"
    },

into:

[
    {
        "id": "gold:Gb0108335",
        "community": "microbial communities",
        "depth": {
            "has_numeric_value": 0.0,
            "has_raw_value": "0.0 m",
            "has_unit": "metre"
        },
        "ecosystem": "Environmental",
        "ecosystem_category": "Terrestrial",
        "ecosystem_subtype": "Wetlands",
        "ecosystem_type": "Soil",
        "elev": {
            "has_numeric_value": 359,
            "has_unit": "meter"
        },
        "env_broad_scale": "ENVO:00000446",
        "env_local_scale": "ENVO:00000489",
        "env_medium": "ENVO:00000134",
        "geo_loc_name": "Sweden: Kiruna",
        "habitat": "Thawing permafrost",
        "identifier": "studying carbon transformations",
        "lat_lon": {
            "latitude": 68.3534,
            "longitude": 19.0472
        },
        "location": "from the Arctic",
        "mod_date": "15-MAY-20 10.04.19.473000000 AM",
        "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
        "ncbi_taxonomy_name": "permafrost metagenome",
        "sample_collection_site": "Palsa",
        "specific_ecosystem": "Permafrost",
        "study_description": "A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century.",
        "type": "nmdc:Biosample"
    },

Differences between input and output:

  • measurement fields are normalized
  • information inferred from lat_lon (currently only elev)
  • TODO: ENVO from text mining
  • TODO: annotation sufficiency score
  • TODO: more...
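
The lat_lon change in the example can be sketched as follows. This is an illustration under assumptions, not the package's actual code: the function name parse_lat_lon and the choice to return None for malformed input are hypothetical.

```python
def parse_lat_lon(raw):
    """Split a 'LAT LON' string into a dict of floats.

    Returns None when the string is malformed or out of range
    (an assumption; the real module may report an error instead).
    """
    parts = raw.replace(",", " ").split()
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return {"latitude": lat, "longitude": lon}


print(parse_lat_lon("68.3534 19.0472"))
# {'latitude': 68.3534, 'longitude': 19.0472}
```

Once the coordinates are numeric, they can be passed to an external lookup such as an elevation service, which is how a field like elev could be inferred.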

Validation reports

These are created as report objects and exported to pandas dataframes for basic statistical aggregation. See the tests for details.

Example report:

| description | severity | field | was_repaired | category |
|---|---|---|---|---|
| No package specified | 1 | | | Category.MissingCore |
| No checklist specified | 1 | | | Category.Unclassified |
| Key not underscored: total particulate carbon | 1 | | True | Category.Unclassified |
| Invalid field: id | 1 | | | Category.UnknownField |
| Alias used: total_particulate_carbon => tot_part_carb | 1 | | True | Category.Unclassified |
| Parsed unit-value: 2.0 metre | 1 | | | Category.Unclassified |
| Missing unit 5 | 1 | | | Category.Unclassified |
| Skipping geo-checks | 0 | | | Category.Unclassified |
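
As a rough illustration of the kind of aggregation such a report supports, the sketch below counts report rows per category and tallies repairs using only the standard library. The rows are transcribed from the example report; representing a blank was_repaired cell as False is an assumption. With a pandas dataframe, the same count would be a groupby on the category column followed by size().

```python
from collections import Counter

# (category, was_repaired) pairs transcribed from the example report.
# A blank was_repaired cell is represented as False (an assumption).
rows = [
    ("Category.MissingCore", False),
    ("Category.Unclassified", False),
    ("Category.Unclassified", True),
    ("Category.UnknownField", False),
    ("Category.Unclassified", True),
    ("Category.Unclassified", False),
    ("Category.Unclassified", False),
    ("Category.Unclassified", False),
]

# Number of report rows in each category.
by_category = Counter(category for category, _ in rows)

# Number of rows where the annotator repaired the problem.
repaired = sum(1 for _, was_repaired in rows if was_repaired)

for category, count in by_category.most_common():
    print(category, count)
print("repaired:", repaired)
```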

API Docs

TODO: readthedocs

Testing

Currently the best way to understand this code is to read the tests.

The tests contain 'fake' samples that are intended to exercise validation and repair.

Schema Validation

See the schema folder -- this contains a copy of the LinkML rendering of the MIxS schema from mixs-source, which will later be integrated by GSC.

Modules

Each module handles a different aspect of annotation.

For example, the measurement module normalizes all fields in the schema with range QuantityValue.

E.g. Input:

sample:
  id: TEST:1
  alt: 2m
  ...

Repair Output:

sample:
  id: TEST:1
  alt:
    has_numeric_value: 2.0
    has_raw_value: 2m
    has_unit: metre
    ...
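
The repair above can be approximated with a small parser. This is a minimal sketch under assumptions, not the measurement module itself: parse_quantity, QUANTITY_RE, and the UNIT_ALIASES table are hypothetical names, and the real module likely handles far more unit forms.

```python
import re

# Hypothetical unit aliases; the real module's unit table is not shown here.
UNIT_ALIASES = {"m": "metre", "cm": "centimetre"}

# A number, optional whitespace, then an optional alphabetic unit.
QUANTITY_RE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)?\s*$")


def parse_quantity(raw):
    """Turn a string such as '2m' into a QuantityValue-like dict."""
    match = QUANTITY_RE.match(raw)
    if match is None:
        # Unparseable: keep only the raw value so nothing is lost.
        return {"has_raw_value": raw}
    number, unit = match.groups()
    result = {"has_numeric_value": float(number), "has_raw_value": raw}
    if unit is not None:
        # Expand known abbreviations; pass unknown units through unchanged.
        result["has_unit"] = UNIT_ALIASES.get(unit, unit)
    return result


print(parse_quantity("2m"))
# {'has_numeric_value': 2.0, 'has_raw_value': '2m', 'has_unit': 'metre'}
```

Keeping has_raw_value alongside the parsed fields means the original string is never lost, even when a unit is missing or unrecognized.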

Starting the web API

  • TODO: write flask code

sample-annotator's People

Contributors

cmungall, hrshdhgd, kltm, realmarcin