biopragmatics / obo-db-ingest

🗄️ Conversion of biomedical nomenclatures like HGNC to OBO

Home Page: https://biopragmatics.github.io/obo-db-ingest/

Python 100.00%
biocuration obofoundry ontologies obo obographs obographviz owl rdf

obo-db-ingest's Introduction

OBO Database Ingestion

This repository shows how databases can be formalized as OBO ontologies in the OBO flat file format, OWL format, and OBO Graph JSON format. A list of the databases whose controlled vocabularies and related content can be readily converted to OBO can be found in the sources/ folder of the PyOBO source code.

Further discussion:

Contents

Each resource gets a subdirectory in the export/ directory containing the following exports:

A manifest of all resources is available at manifest.yml.

Build

To generate OBO files, run the following shell commands (Python 3.8+):

$ pip install tox
$ tox

PURLs

See PURL configuration at https://github.com/perma-id/w3id.org/tree/master/biopragmatics. This W3ID entry makes ontology artifacts in the "export" folder (https://github.com/biopragmatics/obo-db-ingest/tree/main/export) resolvable. Here are a few examples:

Resource        Version Type  Example PURL
Reactome        Sequential    https://w3id.org/biopragmatics/resources/reactome/83/reactome.obo
InterPro        Major/Minor   https://w3id.org/biopragmatics/resources/interpro/92.0/interpro.obo
DrugBank Salts  Semantic      https://w3id.org/biopragmatics/resources/drugbank.salt/5.1.9/drugbank.salt.obo
MeSH            Year          https://w3id.org/biopragmatics/resources/mesh/2003/mesh.obo.gz
UniProt         Year/Month    https://w3id.org/biopragmatics/resources/uniprot/2022_05/uniprot.obo.gz
HGNC            Date          https://w3id.org/biopragmatics/resources/hgnc/2023-02-01/hgnc.obo
CGNC            Unversioned   https://w3id.org/biopragmatics/resources/cgnc/cgnc.obo
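
The example PURLs above appear to share one layout: a base URL, the resource prefix, an optional version segment, and a file named after the prefix. The following is a minimal Python sketch of that layout, inferred only from the examples listed here and not taken from the W3ID configuration itself. The function name build_purl and its parameters are hypothetical.

```python
from typing import Optional

# Base URL as it appears in the example PURLs above.
BASE = "https://w3id.org/biopragmatics/resources"


def build_purl(prefix: str, version: Optional[str] = None, extension: str = "obo") -> str:
    """Assemble a PURL following the pattern seen in the examples (an assumption)."""
    parts = [BASE, prefix]
    if version:
        # Unversioned resources such as CGNC have no version segment.
        parts.append(version)
    parts.append(f"{prefix}.{extension}")
    return "/".join(parts)


if __name__ == "__main__":
    print(build_purl("reactome", "83"))
    print(build_purl("uniprot", "2022_05", extension="obo.gz"))
    print(build_purl("cgnc"))
```

Whether a given prefix, version, and extension combination actually resolves depends on what has been exported, so a generated URL should be checked before it is relied on.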

obo-db-ingest's People

Contributors

cthoyt, github-actions[bot]

obo-db-ingest's Issues

rename uniprot to swissprot

the uniprot obo file is actually just swissprot

grep -c '^id: uniprot:' ../obo-db-ingest/export/uniprot/2022_02/uniprot.obo
567483

which is useful in its own right, but it should be called swissprot

uniprot has another 229m entries from trembl, which might be harder to get by github size limits

another useful slice is all the reference proteomes. For human this more or less equates to swissprot but for other organisms it gives a representative entry for each gene

Document the governance strategy for rendering databases as OBO

Note: it is acceptable to close this and say this project is not intended to represent consensus mapping of databases to OBO. I think that would be a wasted opportunity and we should be moving this towards a community standard.

When OBO-izing a database, there are many decisions that are made, either explicitly or implicitly. These have long term consequences for us.

  • Identifiers: do ECs have dashes in them?
  • Terminological: What gets the primary label: symbol or full name?
  • Metadata: What kinds of textual information are included, and what is excluded? Definitions?
  • Ontological: What is the relationship between a gene, an EC, and a RHEA? (see this diagram)
  • Isomorphism: To what extent do we retain isomorphism with the source vs. introducing additional information that is useful in an OBO context? (lots of examples here: https://github.com/obophenotype/ncbitaxon/issues)
  • Principles / Guidelines: What is best practice? When is OBO followed, when is it not suitable?

Some of these decisions can be punted elsewhere; identifiers can largely be punted to bioregistry.

As for general principles, I believe there is a general (if unstated) best effort here to conform to OBO principles. However, as per slide 43 of my databases-as-OBO deck, I don't think it makes sense to blindly apply OBO principles to OBOized databases. I think there are probably many lessons learned from existing efforts in the obo-db-ingest project that could be explicitly articulated as parallel guidelines.

I think there are some very practical OBO principles that should be adhered to, such as: entries should as far as possible have labels: biopragmatics/pyobo#169

Some modeling decisions are potentially very impactful in terms of constraining how the OBO products are used. For example, using an RO relation to link two entities (e.g. biopragmatics/pyobo#168 (comment)) basically injects a superclass into both sides of the relation. You are making a statement on behalf of the resource about what kind of thing its entries represent. This is a form of axiom injection.

Of course, this is necessary to some extent to make the resulting OBO usable. And we already do this to some extent in biolink. E.g. these are what biolink considers acceptable ID prefixes for a Gene: https://biolink.github.io/biolink-model/Gene/#valid-id-prefixes

These will be some of the harder ones (just look at the COB repo, and the dreaded D****** discussion). But I think we should be very practical here and not get bogged down.

However, it would be good to have some decision process: it's good to move fast, but we don't want to be building up technical debt.

drugbank obo file fails to parse using owlapi

the owlapi parser is the canonical reference parser, if there are ambiguities in the spec then it's considered the decider

drugbank.obo fails:

Parser: org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser@3e5499cc
    Stack trace:
LINENO: 111 - Could not find tag separator ':' in line.
LINE: Leuprolide was first approved in 1985 as a daily subcutaneous injection under the tradename Lupron™ by Abbvie Endocrine Inc.[L13850] Since this initial approval, various long-acting intramuscular and subcutaneous products have been developed such that patients can be dosed once every six months.[L13781, L13790] Leuprolide remains frontline therapy in all conditions for which it is indicated for use." []
    org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:220)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.actualParse(OWLOntologyManagerImpl.java:1254)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1208)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1108)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1064)
        owltools.io.ParserWrapper.parseOWL(ParserWrapper.java:163)   
        owltools.io.ParserWrapper.parseOWL(ParserWrapper.java:150)   
        owltools.io.ParserWrapper.parseOBO(ParserWrapper.java:136)   
        owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:4801)
LINENO: 111 - Could not find tag separator ':' in line.
LINE: Leuprolide was first approved in 1985 as a daily subcutaneous injection under the tradename Lupron™ by Abbvie Endocrine Inc.[L13850] Since this initial approval, various long-acting intramuscular and subcutaneous products have been developed such that patients can be dosed once every six months.[L13781, L13790] Leuprolide remains frontline therapy in all conditions for which it is indicated for use." []
    org.obolibrary.oboformat.parser.OBOFormatParser.error(OBOFormatParser.java:1465)
        org.obolibrary.oboformat.parser.OBOFormatParser.getParseTag(OBOFormatParser.java:861)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClause(OBOFormatParser.java:610)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClauseEOL(OBOFormatParser.java:598)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrame(OBOFormatParser.java:572)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseEntityFrame(OBOFormatParser.java:539)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseOBODoc(OBOFormatParser.java:349)
        org.obolibrary.oboformat.parser.OBOFormatParser.parse(OBOFormatParser.java:307)
        org.obolibrary.oboformat.parser.OBOFormatParser.parse(OBOFormatParser.java:259)
        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:76)

I think the newlines need to be escaped
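
To illustrate the suggestion, here is a minimal Python sketch of escaping a free-text definition before it is written into a quoted OBO value. It relies on the OBO flat file format's backslash escapes for newline, double quote, and backslash. The helper names escape_obo_text and format_def are hypothetical and are not taken from PyOBO, so this shows the idea and not the project's actual fix.

```python
def escape_obo_text(text: str) -> str:
    """Escape characters that would break a quoted OBO string onto several lines."""
    # Escape backslashes first so the escapes added below are not doubled.
    text = text.replace("\\", "\\\\")
    text = text.replace('"', '\\"')
    # Normalize all newline styles to the two-character sequence backslash + n.
    text = text.replace("\r\n", "\\n").replace("\r", "\\n").replace("\n", "\\n")
    return text


def format_def(text: str) -> str:
    """Render a def: clause with an empty xref list (hypothetical helper)."""
    return f'def: "{escape_obo_text(text)}" []'


if __name__ == "__main__":
    raw = "First paragraph of a description.\n\nSecond paragraph."
    print(format_def(raw))
```

With the definition kept on a single physical line, the parser should no longer see a continuation line that lacks a tag separator.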

HGNC gene family IDs fail to resolve

[Term]
id: hgnc:5956
name: IHH
def: "Indian hedgehog signaling molecule" [pubmed:7590746, pubmed:14770182]
xref: ccds:CCDS33380
xref: ena:L38517
xref: ensembl:ENSG00000163501
xref: merops:C46.003
xref: ncbigene:3549
xref: omim:600726
xref: orphanet:122605
xref: refseq:NM_002181
xref: ucsc:uc002vjo.3
xref: vega:OTTHUMG00000154631
is_a: hgnc.genefamily:1373 ! Hedgehog signaling molecule family
is_a: hgnc.genefamily:1691 ! MicroRNA protein coding host genes
relationship: ro:0002205 uniprot:Q14623
relationship: ro:HOM0000017 rgd:620021
relationship: ro:HOM0000017 mgi:96533
relationship: ro:0002162 ncbitaxon:9606
property_value: locus_group "protein-coding gene" xsd:string
property_value: locus_type "gene with protein product" xsd:string
property_value: location "2q35" xsd:string
synonym: "BDA1" EXACT alias_symbol []
synonym: "HHG2" EXACT alias_symbol []
synonym: "Indian hedgehog (Drosophila) homolog" EXACT previous_name []

https://bioregistry.io/hgnc.genefamily:1373 -->
https://registry.identifiers.org/deprecation/resources/MIR:00100671/1373

with a 404 "go home" message :-(

OBO CURIEs (e.g. RO) should be uppercase

[Term]
id: hgnc:5956
name: IHH
def: "Indian hedgehog signaling molecule" [pubmed:7590746, pubmed:14770182]
xref: ccds:CCDS33380
xref: ena:L38517
xref: ensembl:ENSG00000163501
xref: merops:C46.003
xref: ncbigene:3549
xref: omim:600726
xref: orphanet:122605
xref: refseq:NM_002181
xref: ucsc:uc002vjo.3
xref: vega:OTTHUMG00000154631
is_a: hgnc.genefamily:1373 ! Hedgehog signaling molecule family
is_a: hgnc.genefamily:1691 ! MicroRNA protein coding host genes
relationship: ro:0002205 uniprot:Q14623
relationship: ro:HOM0000017 rgd:620021
relationship: ro:HOM0000017 mgi:96533
relationship: ro:0002162 ncbitaxon:9606
property_value: locus_group "protein-coding gene" xsd:string
property_value: locus_type "gene with protein product" xsd:string
property_value: location "2q35" xsd:string
synonym: "BDA1" EXACT alias_symbol []
synonym: "HHG2" EXACT alias_symbol []
synonym: "Indian hedgehog (Drosophila) homolog" EXACT previous_name []

Neither obo nor owl will treat these as being the same IDs as RO:0002205 etc

But awesome to see RO used this way!
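
A minimal Python sketch of the requested normalization follows. It rewrites the prefix of a CURIE to its canonical capitalization when the prefix is in a lookup table. The table CANONICAL_PREFIXES and both function names are hypothetical, and only RO is listed because it is the only prefix named in this issue. A real implementation would presumably take preferred prefixes from a registry.

```python
# Illustrative lookup of lowercase prefix -> canonical OBO prefix (an assumption).
CANONICAL_PREFIXES = {"ro": "RO"}


def normalize_curie(curie: str) -> str:
    """Return the CURIE with a canonical prefix if one is known, else unchanged."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie  # not a CURIE, leave untouched
    canonical = CANONICAL_PREFIXES.get(prefix.lower())
    return f"{canonical}:{local_id}" if canonical else curie


def normalize_relationship_line(line: str) -> str:
    """Normalize every CURIE after the tag of a relationship: line."""
    tag, *tokens = line.split(" ")
    return " ".join([tag, *(normalize_curie(token) for token in tokens)])


if __name__ == "__main__":
    print(normalize_relationship_line("relationship: ro:0002205 uniprot:Q14623"))
```

Prefixes that are not in the table, such as uniprot in the stanza above, are left as they are.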

HGNC ingest fails as ingester assumes latest release is present

py run-test: commands[0] | python build.py -x hgnc
INFO: [2023-02-03 11:33:41] pystow.utils - downloading with urllib from https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/json/hgnc_complete_set_2023-02-01.json to /Users/cjm/.data/pyobo/raw/hgnc/2023-02-01/hgnc_complete_set.json
Making OBO examples:   0%| | 0/1 [00:00<?, ?it/s, prefix=hgnc]
Traceback (most recent call last):
  File "/Users/cjm/repos/obo-db-ingest/build.py", line 143, in <module>
    main()

[...]

urllib.error.HTTPError: HTTP Error 404: Not Found
ERROR: InvocationError for command /Users/cjm/repos/obo-db-ingest/.tox/py/bin/python build.py -x hgnc (exited with code 1)
___________________________________ summary ___________________________________
  lint: commands succeeded
ERROR:   py: commands failed

it looks like 2023-02-01 is not up yet (but it may be by the time you get to this issue):

https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/archive/monthly/json/

not sure if this is a regular occurrence, but the assumption that there is always a release present for the 1st of the current month may be unreliable
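
One way to avoid that assumption is to pick the newest release that is actually listed in the archive, instead of deriving the version from the current date. The Python sketch below does this over a list of file names that follow the hgnc_complete_set_YYYY-MM-DD.json pattern seen in the log above. The function name latest_available_version is hypothetical, how the listing is fetched is left out, and this is not how PyOBO currently resolves the HGNC version.

```python
import datetime
import re
from typing import Iterable, Optional

# File name pattern taken from the download URL in the log above.
PATTERN = re.compile(r"hgnc_complete_set_(\d{4}-\d{2}-\d{2})\.json$")


def latest_available_version(filenames: Iterable[str], today: datetime.date) -> Optional[str]:
    """Return the newest release date on or before `today`, or None if none is listed."""
    dates = []
    for name in filenames:
        match = PATTERN.search(name)
        if match is None:
            continue  # skip unrelated files in the listing
        release = datetime.date.fromisoformat(match.group(1))
        if release <= today:
            dates.append(release)
    return max(dates).isoformat() if dates else None


if __name__ == "__main__":
    # Hypothetical listing in which the current month's file is not up yet.
    listing = [
        "hgnc_complete_set_2022-12-01.json",
        "hgnc_complete_set_2023-01-01.json",
    ]
    print(latest_available_version(listing, datetime.date(2023, 2, 3)))
```

Returning None when nothing matches lets the caller decide whether to fail or to reuse a cached version.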
