related-sciences / nxontology-data

NXOntology data: making ontologies accessible as simple JSON files

This repository imports public ontologies/taxonomies into Python NXOntology objects and writes the ontologies in the JSON-based node-link data format. The goal is to standardize and simplify data access to ontologies.
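
As a rough illustration of the node-link format, the sketch below parses a tiny hand-written document using only the standard library. The `nodes` and `links` keys follow the networkx node-link convention. The node identifiers and names are invented for illustration, and the parent-to-child edge direction is an assumption about how the exported graphs are oriented.

```python
import json

# A made-up three-node ontology in node-link form (not real exported data).
document = """
{
  "directed": true,
  "nodes": [
    {"id": "root", "name": "Root term"},
    {"id": "a", "name": "Child A"},
    {"id": "b", "name": "Child B"}
  ],
  "links": [
    {"source": "root", "target": "a"},
    {"source": "root", "target": "b"}
  ]
}
"""

data = json.loads(document)

# Index node attributes by identifier.
nodes = {node["id"]: node for node in data["nodes"]}

# Map each parent to its children, assuming edges point from parent to child.
children = {node_id: [] for node_id in nodes}
for link in data["links"]:
    children[link["source"]].append(link["target"])

# Roots are nodes that never appear as a link target.
targets = {link["target"] for link in data["links"]}
roots = [node_id for node_id in nodes if node_id not in targets]

print(roots)             # ['root']
print(children["root"])  # ['a', 'b']
```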

For ontologies that have been imported into NXOntology and exported to JSON, see the output/* branches on GitHub, for example output/pubchem.

Once you find the ontology you'd like to read, you can read it in Python (after installing any dependencies, for example with pip install nxontology):

# URL to the exported dataset.
# Here we read the ChEMBL protein/target classification hierarchy.
url = "https://github.com/related-sciences/nxontology-data/raw/output/pubchem/087_chembl_target_tree.json"
# Versioning with the commit hash is a good idea, since we might change the branch structure where data is stored.
url = "https://github.com/related-sciences/nxontology-data/raw/71cf538dc5c258ada880d58663b0205b7b7f8561/087_chembl_target_tree.json"

# To read as an NXOntology object,
# which encapsulates the networkx graph.
# Will also work for the gzip compressed files.
from nxontology import NXOntology
nxo = NXOntology.read_node_link_json(url)

# To read as a networkx.DiGraph
import requests
from networkx.readwrite.json_graph import node_link_graph
digraph = node_link_graph(requests.get(url).json())

or in R:

url <- "https://github.com/related-sciences/nxontology-data/raw/71cf538dc5c258ada880d58663b0205b7b7f8561/087_chembl_target_tree.json"
json_ont <- jsonlite::read_json(path = url)
digraph <- tidygraph::tbl_graph(
  nodes = dplyr::bind_rows(json_ont$nodes),
  edges = dplyr::bind_rows(json_ont$links),
)
digraph
#> # A tbl_graph: 904 nodes and 889 edges

Note: There's currently an open issue on reading in json.gz files with the R package jsonlite.
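
In Python, one fallback for the gzip-compressed files that needs only the standard library is to decompress while parsing. The sketch below writes a small made-up node-link document to a temporary `.json.gz` file and reads it back. The helper name, file name, and contents are placeholders, not real exports.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_node_link_json_gz(path):
    """Read a gzip-compressed node-link JSON file into a dict."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


# Placeholder data standing in for a downloaded export.
example = {"nodes": [{"id": 1}, {"id": 2}], "links": [{"source": 1, "target": 2}]}

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "example.json.gz"
    # Write the compressed file so the example is self-contained.
    with gzip.open(path, mode="wt", encoding="utf-8") as handle:
        json.dump(example, handle)
    loaded = read_node_link_json_gz(path)

print(len(loaded["nodes"]), len(loaded["links"]))  # 2 1
```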

Sources

The data sources that are currently imported are listed below. Please open an issue if you are interested in contributing support for additional sources.

EFO

This project converts all three variants of the Experimental Factor Ontology (EFO, EFO OTAR Profile, and EFO OTAR Slim) into NXOntology objects. See nxontology_data/efo for a detailed README.

HGNC Gene Groups

HGNC (HUGO Gene Nomenclature Committee) maintains a directed acyclic graph of gene groups/families. See nxontology_data/hgnc for a detailed README. Output data is on the output/hgnc branch.

MeSH

MeSH (Medical Subject Headings) is created by the National Library of Medicine and integrated into many projects including PubMed. See nxontology_data/mesh for a detailed README. Output data is on the output/mesh branch.

PubChem

We import ontologies from the PubChem Classifications service (see browser & docs). Most ontologies indexed by the service do not originate with PubChem, but PubChem provides convenient and standardized bulk access. Output data is on the output/pubchem branch.

Development

# Install the environment
poetry install --no-root

# Update the lock file
poetry update

# Run tests
pytest

# Set up the git pre-commit hooks.
# `git commit` will now trigger automatic checks including linting.
pre-commit install

# Run all pre-commit checks (CI will also run this).
pre-commit run --all

License

The source code in this repository is released under the Apache License 2.0 (see LICENSE.md). Source code refers to the contents of the main branch and any other development branches containing code and documentation.

The output branches contain data from external ontologies. Please refer to each respective ontology for its data license. If available, we include license information in the graph metadata for each ontology, but often license information is not supplied in the ontology data we ingest. Please attribute the source ontology when reusing data obtained from this project and, as a best practice, mention that the data was obtained via NXOntology data.

Any original data produced by this repository is released under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. As noted above, the underlying ontology data is not original to this repository and upstream licenses should be consulted.

nxontology-data's People

Contributors

bfoltyn, dhimmel, ravwojdyla, trangdata

nxontology-data's Issues

MeSH: add name synonyms from mesh concepts / terms

It would be nice to have synonyms for each MeSH node (i.e. descriptor / SCRs).

Background

From Concept Structure in MeSH:

Terms in a MeSH record which are strictly synonymous with each other are grouped in a category called a "Concept." (Not to be confused with Supplementary Concept Records.) See the Concept element in MeSH. Each MeSH record consists of one or more Concepts, and each Concept consists in one or more synonymous terms. For example,

Cardiomegaly [Descriptor]
     Cardiomegaly                      [Concept, Preferred]
          Cardiomegaly                    [Term, Preferred]
          Enlarged Heart                  [Term]
          Heart Enlargement               [Term]
     Cardiac Hypertrophy               [Concept, Narrower]
          Cardiac Hypertrophy             [Term, Preferred]
          Heart Hypertrophy               [Term]

This Descriptor record consists of two Concepts and five terms. Each Concept has a Preferred Term, which is also said to be the name of the Concept. And each record has a Preferred Concept. The name of the record - the term most often used to refer to the Descriptor - is the Preferred Term of the preferred Concept.

Within each Concept the terms are synonymous with each other. In contrast, the terms in one Concept are not strictly synonymous with terms in another Concept, even in the same record. For example, one concept in a record may be narrower than the Preferred Concept, as in the above example. Also note that the terms in a concept inherit this relationship and so are narrower, for example, than the terms in the other concept. However, all the terms in a record are equivalent for purposes of indexing and searching MEDLINE and so they are still entry terms for the record.

A more complex example, with three Concepts and 12 terms.

AIDS Dementia Complex [Descriptor]
     AIDS Dementia Complex                                   [Concept, Preferred]
          AIDS Dementia Complex                                 [Term, Preferred]
          Acquired-Immune Deficiency Syndrome Dementia Complex  [Term]
          AIDS-Related Dementia Complex                         [Term]
          HIV Dementia                                          [Term]
          Dementia Complex, Acquired Immune Deficiency Syndrome [Term]
          Dementia Complex, AIDS-Related                        [Term]
     HIV Encephalopathy                                       [Concept, Narrower]
          HIV Encephalopathy                                    [Term, Preferred]
          AIDS Encephalopathy                                   [Term]
          Encephalopathy, HIV                                   [Term]
          Encephalopathy, AIDS                                  [Term]
     HIV-1-Associated Cognitive Motor Complex                [Concept, Narrower]
          HIV-1-Associated Cognitive Motor Complex              [Term, Preferred]
          HIV-1 Cognitive and Motor Complex                     [Term]

... Note that this three-tiered structure is within a given record, not between separate records. This is in contrast to the MeSH Tree Structures, which are hierarchical in structure, but the relationships are between different Descriptor records. MeSH includes both types of relationships. See "Concepts, Synonyms, and Descriptor Structure" in Introduction to MeSH in XML format.

Also noting this reference from "Concepts, Synonyms, and Descriptor Structure":

Redefining a Thesaurus: Term-Centric No More
Douglas Johnston, Stuart J Nelson, Jacque-Lynne A Schulman, Allan G Savage, Tammy P Powell
Proceedings of the AMIA Symposium (1998)
PMCID: PMC2232255
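
To make the descriptor / concept / term structure quoted above concrete, here is a minimal sketch that models the Cardiomegaly record as nested plain data and derives synonyms from it. The field names (`concepts`, `relation`, `terms`) and both helper functions are hypothetical and are not part of this repository.

```python
# The Cardiomegaly record from the quoted example, as nested plain data.
descriptor = {
    "name": "Cardiomegaly",
    "concepts": [
        {
            "name": "Cardiomegaly",
            "relation": "preferred",
            "terms": ["Cardiomegaly", "Enlarged Heart", "Heart Enlargement"],
        },
        {
            "name": "Cardiac Hypertrophy",
            "relation": "narrower",
            "terms": ["Cardiac Hypertrophy", "Heart Hypertrophy"],
        },
    ],
}


def strict_synonyms(record):
    """Terms of the preferred concept, excluding the record name itself."""
    for concept in record["concepts"]:
        if concept["relation"] == "preferred":
            return [term for term in concept["terms"] if term != record["name"]]
    return []


def entry_terms(record):
    """All terms in the record, across every concept."""
    return [term for concept in record["concepts"] for term in concept["terms"]]


print(strict_synonyms(descriptor))   # ['Enlarged Heart', 'Heart Enlargement']
print(len(entry_terms(descriptor)))  # 5
```

Whether a node's synonyms should include only the preferred concept's terms or every entry term in the record is a design choice this sketch leaves open.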

MeSH: extract pharmacological action relationships

Here's an example MeSH query for pharmacological action relationships:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  ?source_uri meshv:pharmacologicalAction ?action_uri .
  ?source_uri rdfs:label ?source_label.
  ?action_uri rdfs:label ?action_label.
  ?source_uri meshv:identifier ?source_id.
  ?action_uri meshv:identifier ?action_id.
}
ORDER BY ?source_uri ?action_uri

That produces results like

| source_uri | action_uri | source_label | action_label | source_id | action_id |
| --- | --- | --- | --- | --- | --- |
| mesh:C000002 | mesh:D000894 | bevonium | Anti-Inflammatory Agents, Non-Steroidal | C000002 | D000894 |
| mesh:C000006 | mesh:D007004 | insulin, neutral | Hypoglycemic Agents | C000006 | D007004 |
| mesh:C000081 | mesh:D000697 | 4-methylaminorex | Central Nervous System Stimulants | C000081 | D000697 |
| mesh:C000082 | mesh:D000903 | alanosine | Antibiotics, Antineoplastic | C000082 | D000903 |
| mesh:C000082 | mesh:D002614 | alanosine | Chelating Agents | C000082 | D002614 |

Currently, we are not extracting meshv:pharmacologicalAction relationships. Should we? Either as a separate table or in the core ontology as a valid edge type?
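
As one possible shape for the "separate table" option, the sketch below groups the example result rows by source record. The `rows` list is copied from the results shown above. The variable names are illustrative only, and nothing here reflects how the repository stores data.

```python
from collections import defaultdict

# (source_id, action_id) pairs copied from the example results above.
rows = [
    ("C000002", "D000894"),
    ("C000006", "D007004"),
    ("C000081", "D000697"),
    ("C000082", "D000903"),
    ("C000082", "D002614"),
]

# Group pharmacological actions by the record that has them.
actions_by_source = defaultdict(list)
for source_id, action_id in rows:
    actions_by_source[source_id].append(action_id)

print(actions_by_source["C000082"])  # ['D000903', 'D002614']
```

Putting these relationships in the core ontology instead would need an edge-type attribute to keep them distinct from hierarchical edges.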

Extract MeSH mappings to external registries / vocabularies

MeSH includes some external mappings via the following predicates (from docs):

  • meshv:registryNumber: A property of Concepts. A unique identifier from one of these sources: Enzyme Commission (Example: EC 2.4.2.17; Example for Partial enzyme number: EC 1.4.3.-); Chemical Abstracts Service (CAS) (Example: 7004-12-8); FDA Substance Registration System Unique Identifier (UNII) in 10-character format (Example: R16CO5Y76E); or the value of 0 if no match is available from the previous sources. A single MeSH Concept can only have one Registry Number. Used for Concepts related to Descriptors in the D Category Drugs and Chemicals and for SupplementaryConceptRecords. MUI M0000115 example: 362O9ITL9D.

  • meshv:relatedRegistryNumber: A property of Concepts. An additional unique identifier for chemicals, which is sometimes followed by a label in parentheses. Multiple Related Registry Numbers are allowed for each Concept. For example, these might be salts and/or stereoisomers of the parent compound. Used for Concepts related to Descriptors in the D Category Drugs and Chemicals and for SupplementaryConceptRecords. MUI M0000115 example: 103-90-2 (Acetaminophen). MUI M0068239 example: 75821-71-5 (Ca salt)

  • meshv:casn1_label: A property of Concepts. Free-text of the Chemical Abstracts Type N1 Name which is the systematic name used in the Chemical Abstracts Chemical Substance and Formula Indexes. The systematic name is a unique name assigned to a chemical substance to represent its structure. First available in 1995. MUI M0000115 example: Acetamide, N-(4-hydroxyphenyl)-

Here's a query to access these:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT DISTINCT *
FROM <http://id.nlm.nih.gov/mesh>
WHERE { 
  ?concept_uri rdf:type meshv:Concept.
  ?concept_uri rdfs:label ?concept_label.
  ?concept_uri meshv:identifier ?concept_id.
  VALUES ?predicate_uri {
    meshv:registryNumber
    meshv:relatedRegistryNumber
    meshv:casn1_label
  }
  ?concept_uri ?predicate_uri ?registry_number.
  BIND( STRAFTER(STR(?predicate_uri), "mesh/vocab#") AS ?relationship_type )
  FILTER (?registry_number != "0")
}
ORDER BY ?concept_uri ?predicate_uri ?registry_number

Example results:

| concept_uri | concept_label | concept_id | predicate_uri | registry_number | relationship_type |
| --- | --- | --- | --- | --- | --- |
| mesh:M0000001 | Calcimycin | M0000001 | meshv:casn1_label | 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))- | casn1_label |
| mesh:M0000001 | Calcimycin | M0000001 | meshv:registryNumber | 37H9VM9WZL | registryNumber |
| mesh:M0000001 | Calcimycin | M0000001 | meshv:relatedRegistryNumber | 52665-69-7 (Calcimycin) | relatedRegistryNumber |
| mesh:M0000002 | Temefos | M0000002 | meshv:casn1_label | Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester | casn1_label |
| mesh:M0000002 | Temefos | M0000002 | meshv:registryNumber | ONP3ME32DL | registryNumber |
| mesh:M0000002 | Temefos | M0000002 | meshv:relatedRegistryNumber | 3383-96-8 (Temefos) | relatedRegistryNumber |
| mesh:M0000011 | Abelson murine leukemia virus | M0000011 | meshv:registryNumber | txid11788 | registryNumber |
| mesh:M0000055 | Abrin | M0000055 | meshv:casn1_label | Abrins | casn1_label |
| mesh:M0000055 | Abrin | M0000055 | meshv:registryNumber | 1393-62-0 | registryNumber |
| mesh:M0000061 | Abscisic Acid | M0000061 | meshv:registryNumber | 72S9A8J5GW | registryNumber |
| mesh:M0000061 | Abscisic Acid | M0000061 | meshv:relatedRegistryNumber | 113349-29-4 ((Z,E)-isomer) | relatedRegistryNumber |

One challenge is that registry numbers appear to be local identifiers without any notation of their source.
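
One way to recover a likely source is to match each value against the identifier formats described in the documentation quoted above. The sketch below is a heuristic only: the patterns are approximations, the function names are hypothetical, and treating `txid` values as NCBI Taxonomy identifiers is a guess based on the prefix.

```python
import re

# Approximate patterns for the formats named in the quoted MeSH documentation.
PATTERNS = [
    ("EC", re.compile(r"EC \d+(\.(\d+|-)){3}")),
    ("CAS", re.compile(r"\d{2,7}-\d{2}-\d")),
    ("UNII", re.compile(r"[A-Z0-9]{10}")),
    ("NCBI Taxonomy (assumed)", re.compile(r"txid\d+")),
]


def split_label(value):
    """Split '103-90-2 (Acetaminophen)' into ('103-90-2', 'Acetaminophen')."""
    number, separator, label = value.partition(" (")
    if separator and label.endswith(")"):
        return number, label[:-1]
    return value, None


def guess_source(value):
    """Return a guessed source for a registry number, or None if unrecognized."""
    number, _label = split_label(value)
    for source, pattern in PATTERNS:
        if pattern.fullmatch(number):
            return source
    return None


print(guess_source("37H9VM9WZL"))                 # UNII
print(guess_source("1393-62-0"))                  # CAS
print(guess_source("EC 1.4.3.-"))                 # EC
print(split_label("113349-29-4 ((Z,E)-isomer)"))  # ('113349-29-4', '(Z,E)-isomer')
```

A stricter version could also verify the CAS check digit. Pattern matching alone cannot rule out a value that merely happens to look like one of these formats.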

EFO cross-references: classify as exact/close when possible

Background in EBISPOT/efo#935.

We currently extract database cross-references for EFO using the oboInOwl:hasDbXref predicate. However, MONDO is providing xrefs with greater specificity using the mondo:exactMatch and mondo:closeMatch predicates. Furthermore, there are axioms (with rdf:type owl:Axiom) that annotate oboInOwl:hasDbXref instances with values like MONDO:equivalentTo.

EFO:0000479 is a good example of a class that has all types of xrefs:

  1. oboInOwl:hasDbXref without axioms
  2. oboInOwl:hasDbXref with axioms
  3. mondo:exactMatch and mondo:closeMatch

It would be nice to further understand the relation between 2 and 3.
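
Purely as a sketch of what "classify as exact/close when possible" could look like, the function below merges the three xref types into one label per xref. It assumes that an axiom value of `MONDO:equivalentTo` can be treated as an exact match, which is exactly the open question above, so read it as a placeholder policy. The function name, the `EX:` identifiers, and `MONDO:otherValue` are all made up.

```python
def classify_xrefs(plain, axiom_annotated, exact, close):
    """Assign each xref the most specific match type available.

    plain: set of xrefs from oboInOwl:hasDbXref without axioms
    axiom_annotated: dict mapping xref -> axiom annotation value
    exact, close: sets of xrefs from mondo:exactMatch and mondo:closeMatch
    """
    result = {}
    for xref in plain | set(axiom_annotated) | exact | close:
        # ASSUMPTION: MONDO:equivalentTo is treated like an exact match.
        if xref in exact or axiom_annotated.get(xref) == "MONDO:equivalentTo":
            result[xref] = "exact"
        elif xref in close:
            result[xref] = "close"
        else:
            result[xref] = "unspecified"
    return result


# Hypothetical identifiers, not real EFO cross-references.
result = classify_xrefs(
    plain={"EX:1"},
    axiom_annotated={"EX:2": "MONDO:equivalentTo", "EX:3": "MONDO:otherValue"},
    exact={"EX:4"},
    close={"EX:5"},
)
for xref, match_type in sorted(result.items()):
    print(xref, match_type)
```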

MeSH vocabulary subclass graph

Here's a display of the vocabulary subclass graph from 2022 MeSH:

[Image: vocabulary subclass graph from 2022 MeSH]

Code to create it:

from nxontology_data.mesh.mesh import MeshLoader
from IPython.display import Image
from networkx.drawing.nx_agraph import to_agraph

# `rdf` is assumed to be the MeSH RDF graph loaded earlier (loading not shown here).
vocab = MeshLoader.create_vocab_digraph(rdf)
gviz = to_agraph(vocab)
gviz.layout("dot")
Image(gviz.draw(format="png"))

This code requires pygraphviz, which in turn requires Graphviz. That turned out to be a problematic dependency on CI (it failed on the self-hosted runner), so I'm just posting this visualization in a GitHub issue and will remove pygraphviz from CI.

Accessing MeSH NXO

Hey @dhimmel, I was trying to pull the 2021 MeSH NXO like this:

from nxontology import NXOntology
url = "https://github.com/related-sciences/nxontology-data/raw/71cf538dc5c258ada880d58663b0205b7b7f8561/001_medical_subject_headings_mesh_desctree.json.gz"
nxo = NXOntology.read_node_link_json(url)

I was a little surprised to find that the node ids are ints and that there isn't a lot of data attached to them:

import pandas as pd
pd.Series(type(n) for n in nxo.graph.nodes).value_counts()
<class 'int'>    920388

nxo.node_info(1).data
{'name': 'Organisms Category',
 'description': None,
 'pubchem_hnid': 1269010,
 'url': 'http://www.ncbi.nlm.nih.gov/mesh/1000066'}

Is there another way to get the unique ids, class, and tree numbers (for descriptors)?

MeSH: include qualifiers as a node property

It would be great for topical descriptor nodes in our MeSH ontologies to have a data attribute/property like qualifiers for the list of allowed qualifiers.

For example, the disease Exostoses, Multiple Hereditary has the following qualifiers available through hasDescriptor:

  • diagnostic imaging D005097
  • blood D005097
  • therapy D005097
  • history D005097
  • mortality D005097
  • prevention & control D005097
  • surgery D005097
  • diagnosis D005097
  • classification D005097
  • radiotherapy D005097
  • and some more

See also https://hhs.github.io/meshrdf/descriptor-qualifier-pairs
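
A minimal sketch of how a `qualifiers` list per descriptor could be assembled from descriptor-qualifier pair identifiers. It assumes a pair identifier is the descriptor identifier followed by the qualifier identifier, as in `D014199Q000031` (an identifier mentioned elsewhere in these issues). The function names and the second pair identifier are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: pair ids concatenate a descriptor id (D...) and a qualifier id (Q...).
PAIR_ID = re.compile(r"(D\d+)(Q\d+)")


def split_pair_id(pair_id):
    """Split an identifier like 'D014199Q000031' into ('D014199', 'Q000031')."""
    match = PAIR_ID.fullmatch(pair_id)
    if match is None:
        raise ValueError(f"not a descriptor-qualifier pair id: {pair_id!r}")
    return match.group(1), match.group(2)


def qualifiers_by_descriptor(pair_ids):
    """Collect the qualifier ids allowed for each descriptor id."""
    qualifiers = defaultdict(list)
    for pair_id in pair_ids:
        descriptor_id, qualifier_id = split_pair_id(pair_id)
        qualifiers[descriptor_id].append(qualifier_id)
    return dict(qualifiers)


# The second pair id is made up for illustration.
pairs = ["D014199Q000031", "D014199Q999999"]
print(qualifiers_by_descriptor(pairs))
# {'D014199': ['Q000031', 'Q999999']}
```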

MeSH: should we exclude non-English labels?

From the June 18, 2015 MeSH RDF release notes:

Users now must specify the language tag @en when searching rdfs:label or any other string literal. See the sample queries page (queries 5 and 6) for examples. One preferred MeSH Heading, Central Nervous System which is D002493, has non-English strings as a proof-of-concept example. This sample will remain in the beta version but may not be included in the production MeSH RDF version.

We already filter out non-English matches in our identifiers query:

OPTIONAL {
  # meshv:prefLabel is used for meshv:Term
  ?mesh_uri rdfs:label|meshv:prefLabel ?mesh_label .
  FILTER (langMatches(lang(?mesh_label), "EN")) .
}

But not in our synonyms table.
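
If the same filter were applied to synonyms after they are extracted, rather than inside the query, it could look roughly like the sketch below. The `lang_matches` helper approximates SPARQL's `langMatches` for simple ranges such as "EN". The tuple layout and the placeholder labels are assumptions made for illustration.

```python
def lang_matches(tag, language_range):
    """Approximate SPARQL langMatches for a simple range like 'EN' or '*'."""
    tag = tag.lower()
    language_range = language_range.lower()
    if language_range == "*":
        # '*' matches any literal that has a language tag at all.
        return tag != ""
    return tag == language_range or tag.startswith(language_range + "-")


# (text, language tag) pairs; every label except the first is a placeholder.
synonyms = [
    ("Central Nervous System", "en"),
    ("placeholder label A", "fr"),
    ("placeholder label B", "en-US"),
    ("placeholder label C", ""),
]

english = [text for text, tag in synonyms if lang_matches(tag, "EN")]
print(english)  # ['Central Nervous System', 'placeholder label B']
```

Untagged literals are dropped by this filter, just as they would be by the `FILTER` clause above.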

How to handle MeSH supplemental concepts that only map to an AllowedDescriptorQualifierPair

Some supplemental concept records (SCRs) in MeSH only have a preferredMappedTo whose object is an AllowedDescriptorQualifierPair rather than the usual TopicalDescriptor.

For example, the SCR Disease Familial spinal arachnoiditis has a preferredMappedTo relationship to the AllowedDescriptorQualifierPair Arachnoiditis/congenital. So Arachnoiditis is the parent topical descriptor and congenital is the qualifier.

Currently, these edges are dropped:

except nxontology.exceptions.NodeNotFound:
    # meshv:AllowedDescriptorQualifierPair nodes like D014199Q000031 aren't included as nodes
    pass

We need to investigate whether any of these AllowedDescriptorQualifierPair parents would break our is-a / parent assumption. If they are consistent with a hierarchical conceptual relationship, then I'm thinking we just add an edge from Arachnoiditis to Familial spinal arachnoiditis with an edge property that records the congenital qualifier.
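
The proposed edge could be sketched as follows, using plain tuples rather than the project's actual graph code. It assumes that a pair identifier such as `D014199Q000031` concatenates the descriptor and qualifier identifiers. The function name, the `qualifier` attribute name, and the `C_EXAMPLE` identifier are placeholders.

```python
import re

# ASSUMPTION: pair ids concatenate a descriptor id (D...) and a qualifier id (Q...).
PAIR_ID = re.compile(r"(D\d+)(Q\d+)")


def scr_edge(scr_id, mapped_to_id):
    """Build a (parent, child, attributes) edge for one preferredMappedTo target."""
    match = PAIR_ID.fullmatch(mapped_to_id)
    if match:
        # Pair target: the descriptor becomes the parent, the qualifier an edge property.
        descriptor_id, qualifier_id = match.groups()
        return descriptor_id, scr_id, {"qualifier": qualifier_id}
    # Plain descriptor target: there is no qualifier to record.
    return mapped_to_id, scr_id, {"qualifier": None}


# "C_EXAMPLE" is a placeholder, not a real SCR identifier.
print(scr_edge("C_EXAMPLE", "D014199Q000031"))
# ('D014199', 'C_EXAMPLE', {'qualifier': 'Q000031'})
print(scr_edge("C_EXAMPLE", "D014199"))
# ('D014199', 'C_EXAMPLE', {'qualifier': None})
```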
