biopragmatics / obo-db-ingest

🗄️ Conversion of biomedical nomenclatures like HGNC to OBO

Home Page: https://biopragmatics.github.io/obo-db-ingest/

Python 100.00%

biocuration obofoundry ontologies obo obographs obographviz owl rdf

obo-db-ingest's Introduction

OBO Database Ingestion

This repository shows how databases can be formalized as OBO ontologies in the OBO flat file format, OWL format, and OBO Graph JSON format. A list of the databases whose controlled vocabularies and related content can be readily converted to OBO can be found in the sources/ folder of the PyOBO source code.

Further discussion:

Limits of ontologies: How should databases be represented in OBO? presented by Chris Mungall
OBOFoundry/OBOFoundry.github.io#1981

Each resource gets a subdirectory in the export/ directory containing the following exports:

A manifest of all resources is available at manifest.yml.

Build

To generate OBO files, run the following shell commands (Python 3.8+):

$ pip install tox
$ tox

PURLs

See PURL configuration at https://github.com/perma-id/w3id.org/tree/master/biopragmatics. This W3ID entry makes ontology artifacts in the "export" folder (https://github.com/biopragmatics/obo-db-ingest/tree/main/export) resolvable. Here are a few examples:

Resource	Version Type	Example PURL
Reactome	Sequential	https://w3id.org/biopragmatics/resources/reactome/83/reactome.obo
InterPro	Major/Minor	https://w3id.org/biopragmatics/resources/interpro/92.0/interpro.obo
DrugBank Salts	Semantic	https://w3id.org/biopragmatics/resources/drugbank.salt/5.1.9/drugbank.salt.obo
MeSH	Year	https://w3id.org/biopragmatics/resources/mesh/2003/mesh.obo.gz
UniProt	Year/Month	https://w3id.org/biopragmatics/resources/uniprot/2022_05/uniprot.obo.gz
HGNC	Date	https://w3id.org/biopragmatics/resources/hgnc/2023-02-01/hgnc.obo
CGNC	unversioned	https://w3id.org/biopragmatics/resources/cgnc/cgnc.obo
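
The example PURLs above appear to follow one pattern: base URL, resource prefix, an optional version segment, then a file named after the prefix. A minimal sketch of that pattern, inferred only from the table (the function name `build_purl` and its parameters are hypothetical, not part of this repository):

```python
from typing import Optional

# Base URL taken from the example PURLs in the table above.
BASE = "https://w3id.org/biopragmatics/resources"


def build_purl(prefix: str, version: Optional[str] = None, extension: str = "obo") -> str:
    """Build a PURL in the shape shown by the examples (hypothetical helper)."""
    parts = [BASE, prefix]
    if version:
        # Unversioned resources (e.g. CGNC) have no version segment.
        parts.append(version)
    parts.append(f"{prefix}.{extension}")
    return "/".join(parts)


if __name__ == "__main__":
    print(build_purl("reactome", "83"))
    print(build_purl("mesh", "2003", extension="obo.gz"))
    print(build_purl("cgnc"))
```

Whether every resource and artifact type follows this pattern is an assumption; the W3ID configuration linked above is the authority.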

obo-db-ingest's Issues

rename uniprot to swissprot

the uniprot obo file is actually just swissprot

grep -c '^id: uniprot:' ../obo-db-ingest/export/uniprot/2022_02/uniprot.obo
567483

which is useful in its own right, but it should be called swissprot

uniprot has another 229m entries from trembl, which might be harder to get by github size limits

another useful slice is all the reference proteomes. For human this more or less equates to swissprot but for other organisms it gives a representative entry for each gene

Document the governance strategy for rendering databases as OBO

Note: it is acceptable to close this and say this project is not intended to represent consensus mapping of databases to OBO. I think that would be a wasted opportunity and we should be moving this towards a community standard.

When OBO-izing a database, many decisions are made, either explicitly or implicitly. These have long-term consequences for us.

Identifiers: do ECs have dashes in them?
Terminological: What gets the primary label: symbol or full name?
Metadata: What kinds of textual information are included, and what is excluded? Definitions?
Ontological: What is the relationship between a gene, an EC, and a RHEA? (see this diagram)
Isomorphism: To what extent do we retain isomorphism with the source vs. introducing additional information that is useful in an OBO context? (Lots of examples here: https://github.com/obophenotype/ncbitaxon/issues)
Principles / Guidelines: What is best practice? When is OBO followed, when is it not suitable?

Some of these decisions can be punted elsewhere; identifiers can largely be punted to bioregistry.

As for general principles, I believe there is a general (if unstated) best effort here to conform to OBO principles. However, as per slide 43 from my databases as OBO deck, I don't think it makes sense to blindly apply OBO principles to OBOized databases. I think there are probably many lessons learned from existing efforts in the obo-db-ingest project that could be explicitly articulated as parallel guidelines.

I think there are some very practical OBO principles that should be adhered to, such as: entries should as far as possible have labels: biopragmatics/pyobo#169

Some modeling decisions are potentially very impactful in terms of constraining how the OBO products are used. For example, using an RO relation to link two entities (e.g. biopragmatics/pyobo#168 (comment)) basically injects a superclass into both sides of the relation. You are making a statement on behalf of the resource about what kind of thing they represent. This is a form of axiom injection.

Of course, this is necessary to some extent to make the resulting OBO usable. And we already do this to some extent in biolink. E.g. these are what biolink considers acceptable ID prefixes for a Gene: https://biolink.github.io/biolink-model/Gene/#valid-id-prefixes

These will be some of the harder ones (just look at the COB repo, and the dreaded D****** discussion). But I think we should be very practical here and not get bogged down.

However, it would be good to have some decision process - it's good to move fast but we don't want to be building up technical debt

drugbank obo file fails to parse using owlapi

the owlapi parser is the canonical reference parser, if there are ambiguities in the spec then it's considered the decider

drugbank.obo fails:

Parser: org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser@3e5499cc
    Stack trace:
LINENO: 111 - Could not find tag separator ':' in line.
LINE: Leuprolide was first approved in 1985 as a daily subcutaneous injection under the tradename Lupron™ by Abbvie Endocrine Inc.[L13850] Since this initial approval, various long-acting intramuscular and su
bcutaneous products have been developed such that patients can be dosed once every six months.[L13781, L13790] Leuprolide remains frontline therapy in all conditions for which it is indicated for use." []    
    org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:220)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.actualParse(OWLOntologyManagerImpl.java:1254)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1208)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1108)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1064)
        owltools.io.ParserWrapper.parseOWL(ParserWrapper.java:163)   
        owltools.io.ParserWrapper.parseOWL(ParserWrapper.java:150)   
        owltools.io.ParserWrapper.parseOBO(ParserWrapper.java:136)   
        owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:4801)
LINENO: 111 - Could not find tag separator ':' in line.
LINE: Leuprolide was first approved in 1985 as a daily subcutaneous injection under the tradename Lupron™ by Abbvie Endocrine Inc.[L13850] Since this initial approval, various long-acting intramuscular and su
bcutaneous products have been developed such that patients can be dosed once every six months.[L13781, L13790] Leuprolide remains frontline therapy in all conditions for which it is indicated for use." []    
    org.obolibrary.oboformat.parser.OBOFormatParser.error(OBOFormatParser.java:1465)
        org.obolibrary.oboformat.parser.OBOFormatParser.getParseTag(OBOFormatParser.java:861)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClause(OBOFormatParser.java:610)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClauseEOL(OBOFormatParser.java:598)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrame(OBOFormatParser.java:572)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseEntityFrame(OBOFormatParser.java:539)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseOBODoc(OBOFormatParser.java:349)
        org.obolibrary.oboformat.parser.OBOFormatParser.parse(OBOFormatParser.java:307)
        org.obolibrary.oboformat.parser.OBOFormatParser.parse(OBOFormatParser.java:259)
        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:76)

I think the newlines need to be escaped
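
A rough sketch of what escaping could look like when writing a def: line, assuming the standard OBO escape sequences (`\n` for newline, `\"` for a quote, `\\` for a backslash). The helper names `escape_obo_text` and `format_def` are hypothetical; this is not how pyobo necessarily writes its output.

```python
def escape_obo_text(text: str) -> str:
    """Escape a free-text value so it stays on one line inside OBO quotes."""
    # Backslashes go first so the escapes added below are not escaped again.
    text = text.replace("\\", "\\\\")
    # Normalize Windows and old Mac line endings before escaping newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\n", "\\n")
    return text.replace('"', '\\"')


def format_def(text: str, provenance=()) -> str:
    """Render a single-line def: clause (hypothetical helper)."""
    return f'def: "{escape_obo_text(text)}" [{", ".join(provenance)}]'


if __name__ == "__main__":
    raw = "First paragraph.\nSecond paragraph."
    print(format_def(raw))
```

With this, a multi-paragraph description stays on one physical line, so the parser should no longer see a continuation line with no tag separator.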

HGNC gene family IDs fail to resolve

[Term]
id: hgnc:5956
name: IHH
def: "Indian hedgehog signaling molecule" [pubmed:7590746, pubmed:14770182]
xref: ccds:CCDS33380
xref: ena:L38517
xref: ensembl:ENSG00000163501
xref: merops:C46.003
xref: ncbigene:3549
xref: omim:600726
xref: orphanet:122605
xref: refseq:NM_002181
xref: ucsc:uc002vjo.3
xref: vega:OTTHUMG00000154631
is_a: hgnc.genefamily:1373 ! Hedgehog signaling molecule family
is_a: hgnc.genefamily:1691 ! MicroRNA protein coding host genes
relationship: ro:0002205 uniprot:Q14623
relationship: ro:HOM0000017 rgd:620021
relationship: ro:HOM0000017 mgi:96533
relationship: ro:0002162 ncbitaxon:9606
property_value: locus_group "protein-coding gene" xsd:string
property_value: locus_type "gene with protein product" xsd:string
property_value: location "2q35" xsd:string
synonym: "BDA1" EXACT alias_symbol []
synonym: "HHG2" EXACT alias_symbol []
synonym: "Indian hedgehog (Drosophila) homolog" EXACT previous_name []

https://bioregistry.io/hgnc.genefamily:1373 -->
https://registry.identifiers.org/deprecation/resources/MIR:00100671/1373

with a 404 "go home" message :-(

Add dashes to EC grouping classes

(lmk if you prefer these in pyobo vs here)

The correct ID to use for grouping class is something like "EC:1.1.1.-", not "EC:1.1.1"

See:
biopragmatics/bioregistry#681
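
For illustration, padding a partial EC number out to four fields might look like the sketch below. The function name `normalize_ec` is hypothetical, and whether this belongs in pyobo or here is still the open question above.

```python
def normalize_ec(ec: str) -> str:
    """Pad a partial EC number with dashes so it always has four fields."""
    parts = ec.strip().split(".")
    if len(parts) > 4:
        raise ValueError(f"too many fields in EC number: {ec!r}")
    # Treat an empty field (e.g. from a trailing dot) as unspecified.
    parts = [part or "-" for part in parts]
    # Fill the missing trailing fields with dashes.
    parts += ["-"] * (4 - len(parts))
    return ".".join(parts)


if __name__ == "__main__":
    for value in ["1", "1.1.1", "1.1.1.-", "1.1.1.1"]:
        print(value, "->", normalize_ec(value))
```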

cc @balhoff

rhea 125 build is near-empty

https://github.com/biopragmatics/obo-db-ingest/tree/main/export/rhea/125

only has Typedefs

OBO CURIEs (e.g. RO) should be uppercase

[Term]
id: hgnc:5956
name: IHH
def: "Indian hedgehog signaling molecule" [pubmed:7590746, pubmed:14770182]
xref: ccds:CCDS33380
xref: ena:L38517
xref: ensembl:ENSG00000163501
xref: merops:C46.003
xref: ncbigene:3549
xref: omim:600726
xref: orphanet:122605
xref: refseq:NM_002181
xref: ucsc:uc002vjo.3
xref: vega:OTTHUMG00000154631
is_a: hgnc.genefamily:1373 ! Hedgehog signaling molecule family
is_a: hgnc.genefamily:1691 ! MicroRNA protein coding host genes
relationship: ro:0002205 uniprot:Q14623
relationship: ro:HOM0000017 rgd:620021
relationship: ro:HOM0000017 mgi:96533
relationship: ro:0002162 ncbitaxon:9606
property_value: locus_group "protein-coding gene" xsd:string
property_value: locus_type "gene with protein product" xsd:string
property_value: location "2q35" xsd:string
synonym: "BDA1" EXACT alias_symbol []
synonym: "HHG2" EXACT alias_symbol []
synonym: "Indian hedgehog (Drosophila) homolog" EXACT previous_name []

Neither obo nor owl will treat these as being the same IDs as RO:0002205 etc

But awesome to see RO used this way!

HGNC ingest fails as ingester assumes latest release is present

py run-test: commands[0] | python build.py -x hgnc
INFO: [2023-02-03 11:33:41] pystow.utils - downloading with urllib from https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/json/hgnc_complete_set_2023-02-01.json to /Users/cjm/.data/pyobo/raw/hgnc/2023-02-01/hgnc_complete_set.json
Making OBO examples:   0%|                                                                                                                                                                                      | 0/1 [00:00<?, ?it/s, prefix=hgnc]
Traceback (most recent call last):
  File "/Users/cjm/repos/obo-db-ingest/build.py", line 143, in <module>
    main()

[...]

urllib.error.HTTPError: HTTP Error 404: Not Found
ERROR: InvocationError for command /Users/cjm/repos/obo-db-ingest/.tox/py/bin/python build.py -x hgnc (exited with code 1)
_____________________________________________________________________________________________________________________ summary _____________________________________________________________________________________________________________________
  lint: commands succeeded
ERROR:   py: commands failed

it looks like 2023-02-01 is not up yet (but it may be by the time you get to this issue):

https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/json/

not sure if this is a regular occurrence, but the assumption that there is always a release present for the 1st of the current month may be unreliable
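
One possible way to avoid the hard assumption is to walk back month by month until a release is actually found. A sketch under stated assumptions: the URL template is copied from the download log above, while `monthly_versions`, `latest_available`, and the injectable `exists` check are hypothetical and not the ingester's real code.

```python
import datetime
from typing import Callable, Iterator, Optional

# File naming taken from the URL in the log above.
URL_TEMPLATE = (
    "https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/json/"
    "hgnc_complete_set_{version}.json"
)


def monthly_versions(today: datetime.date, lookback: int = 3) -> Iterator[str]:
    """Yield first-of-month version strings, newest first."""
    year, month = today.year, today.month
    for _ in range(lookback):
        yield datetime.date(year, month, 1).isoformat()
        month -= 1
        if month == 0:
            year, month = year - 1, 12


def latest_available(
    today: datetime.date, exists: Callable[[str], bool], lookback: int = 3
) -> Optional[str]:
    """Return the newest version whose URL passes the `exists` check."""
    for version in monthly_versions(today, lookback):
        if exists(URL_TEMPLATE.format(version=version)):
            return version
    return None


if __name__ == "__main__":
    # Stand-in for a real HTTP check: pretend only January is published.
    published = {URL_TEMPLATE.format(version="2023-01-01")}
    print(latest_available(datetime.date(2023, 2, 3), published.__contains__))
```

In practice `exists` would be something like an HTTP HEAD request; it is passed in as a callable here so the fallback logic can be exercised without network access.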

Always use correct casing for OBO prefixes in xrefs

complexportal.obo uses pr:, should be PR:
drugbank.obo and swisslipid.obo use chebi:, should be CHEBI:
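
A small sketch of the kind of normalization meant here, applied to any CURIE before it is written out (xrefs and relationship lines alike). The mapping is illustrative and only covers prefixes named in these issues; a real implementation would presumably take preferred casing from a registry such as Bioregistry. `normalize_curie` is a hypothetical name.

```python
# Illustrative only: lowercase prefix -> casing used by the OBO ontology.
OBO_PREFIXES = {"ro": "RO", "chebi": "CHEBI", "pr": "PR"}


def normalize_curie(curie: str) -> str:
    """Restore canonical casing for known OBO prefixes; leave others alone."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        # Not a CURIE at all, so return it unchanged.
        return curie
    canonical = OBO_PREFIXES.get(prefix.lower())
    if canonical is None:
        # Database prefixes such as hgnc or uniprot are kept as written.
        return curie
    return f"{canonical}:{local_id}"


if __name__ == "__main__":
    for curie in ["ro:0002205", "chebi:15377", "pr:P12345", "hgnc:5956"]:
        print(curie, "->", normalize_curie(curie))
```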
