translatorsri / babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

License: MIT License

Python 99.01% Dockerfile 0.61% Shell 0.39%
ncats-translator

babel's People

Contributors

cbizon, gaurav, jdr0887, phillipsowen, shalsh23

Forkers

cthoyt jdr0887

babel's Issues

Improve typing of chebi

In biolink 2, we are following these rules:

if it has a smiles:
    if '.' in smiles:
        MolecularMixture
    else:
        SmallMolecule
else:
    ChemicalEntity

That last one can be improved.

  1. If there were CHEBI mappings to biolink
  2. Using subclass of relations in chebi. If all the subclasses of a particular thing without a SMILES are Small Molecules, then maybe the parent class is too.
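The rules above, plus the subclass idea in item 2, could be sketched roughly as follows. This is only an illustration: the function names and the `biolink:`-prefixed strings are assumptions made for the example, not Babel's actual code.

```python
# Hypothetical constants; Babel may represent these types differently.
SMALL_MOLECULE = "biolink:SmallMolecule"
MOLECULAR_MIXTURE = "biolink:MolecularMixture"
CHEMICAL_ENTITY = "biolink:ChemicalEntity"


def chebi_biolink_type(smiles):
    """Apply the SMILES-based typing rules described above."""
    if smiles:
        # A '.' in a SMILES string separates disconnected components.
        return MOLECULAR_MIXTURE if "." in smiles else SMALL_MOLECULE
    return CHEMICAL_ENTITY


def refine_from_subclasses(own_type, subclass_types):
    """Promote a SMILES-less term if every known subclass is a small molecule."""
    if (
        own_type == CHEMICAL_ENTITY
        and subclass_types  # do not promote terms that have no subclasses
        and all(t == SMALL_MOLECULE for t in subclass_types)
    ):
        return SMALL_MOLECULE
    return own_type
```

Whether "all subclasses" is the right threshold is an open question; the guard against an empty subclass list is there so that leaf terms are not promoted by accident.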

Unable to build chemical compendium on missing input file

I tried to build the chemical compendium using the documentation in the README, but it failed as follows:

Babel % snakemake --cores 1 chemical
Building DAG of jobs...
MissingInputException in line 86 of /Users/bsmith/isb/Babel/src/snakefiles/chemical.snakefile:
Missing input files for rule chemical_drugbank_ids:
    output: /Users/bsmith/isb/Babel/babel_downloads/chemicals/ids/DRUGBANK
    affected files:
        /Users/bsmith/isb/Babel/babel_downloads/DrugBank/UC_XREF.srcfiltered.txt

I assume that this may simply be a case where the documentation is out of date, as per #32.

If there are updated instructions, even if it's just a few commands specific to the chemicals, that can be shared here ahead of an update to the README, I'd really appreciate seeing them.

Please let me know if there is anything I can clarify or contribute here. Thanks!

CHEBI

It looks like there's a problem with the non-InChI path for bringing in CHEBI identifiers.

If we have no InChI but do have a SMILES, here are a few that are getting dropped:
71095
72219
145476

If we have neither, here are a few: 52707, 91003

And sometimes we are overwriting something that does have an InChI with something that doesn't have one. So there's a chemical that should have id 83920 but for some reason has 81762.

Validating a new Babel run

I've just completed my first run of Babel on Sterling (on a container with 500GB of memory!) using the changes in draft PR #37. The results I've obtained (on Hatteras at /scratch/gaurav/babel-outputs/2022apr4) have lots of differences from the 2022-01-01 run, but I haven't come up with a good way of summarizing the changes or figuring out if it's working "correctly".

I've tried using diff/diffstat, but there are tons of changes, so it's not easy to see how significant the changes are. I tried diffing some files individually, and was able to find a few patterns: for example, the polypeptide LSM-37009 in synonyms/Polypeptide.txt is referred to as CHEBI:125504 in the new run and INCHIKEY:GGLDQJNBYFODOM-RDCMKPLUSA-N in the previous run.

Diffstat comparison of Jan 1 and Apr 4 Babel runs
 compendia/AnatomicalEntity.txt         |284873 
 compendia/BiologicalProcess.txt        |55258 
 compendia/Cell.txt                     |15690 
 compendia/CellularComponent.txt        |24855 
 compendia/ChemicalEntity.txt           |6976071 
 compendia/ChemicalMixture.txt          |  889 
 compendia/ComplexMolecularMixture.txt  |  296 
 compendia/Disease.txt                  |654029 
 compendia/Gene.txt                     |77898179 ++---
 compendia/GeneFamily.txt               |55418 
 compendia/GrossAnatomicalStructure.txt |20397 
 compendia/MolecularActivity.txt        |294143 
 compendia/MolecularMixture.txt         |16366879 -
 compendia/OrganismTaxon.txt            |4783919 
 compendia/Pathway.txt                  |104290 
 compendia/PhenotypicFeature.txt        |700283 
 compendia/Polypeptide.txt              |  753 
 compendia/Protein.txt                  |456484451 ++++++++++++++++-----------------
 compendia/SmallMolecule.txt            |204804339 +++++++-------
 conflation/GeneProtein.txt             |16857753 -
 reports/AnatomicalEntity.txt           |  100 
 reports/BiologicalProcess.txt          |   17 
 reports/Cell.txt                       |   73 
 reports/CellularComponent.txt          |   60 
 reports/ChemicalEntity.txt             | 1451 
 reports/ChemicalMixture.txt            |   20 
 reports/ComplexMolecularMixture.txt    |   23 
 reports/Disease.txt                    | 8278 
 reports/Gene.txt                       |   72 
 reports/GeneFamily.txt                 |    8 
 reports/GrossAnatomicalStructure.txt   |   80 
 reports/MolecularActivity.txt          |   70 
 reports/MolecularMixture.txt           | 2310 
 reports/OrganismTaxon.txt              |   12 
 reports/Pathway.txt                    |    8 
 reports/PhenotypicFeature.txt          | 1175 
 reports/Polypeptide.txt                |   30 
 reports/Protein.txt                    |  445 
 reports/SmallMolecule.txt              | 8790 
 reports/disease_completeness.txt       |   69 
 reports/process_completeness.txt       |    4 
 synonyms/AnatomicalEntity.txt          |624380 
 synonyms/BiologicalProcess.txt         |224432 
 synonyms/Cell.txt                      |43534 
 synonyms/CellularComponent.txt         |57426 
 synonyms/ChemicalEntity.txt            |1428975 
 synonyms/ChemicalMixture.txt           | 3566 
 synonyms/ComplexMolecularMixture.txt   | 1672 
 synonyms/Disease.txt                   |2407347 
 synonyms/Gene.txt                      |1060645 
 synonyms/GeneFamily.txt                |55418 
 synonyms/GrossAnatomicalStructure.txt  |105021 
 synonyms/MolecularActivity.txt         |393102 
 synonyms/MolecularMixture.txt          |16811698 -
 synonyms/OrganismTaxon.txt             |139483 
 synonyms/Pathway.txt                   |109508 
 synonyms/PhenotypicFeature.txt         |1712740 
 synonyms/Polypeptide.txt               | 3133 
 synonyms/Protein.txt                   |2871694 
 synonyms/SmallMolecule.txt             |215593042 +++++++--------
 60 files changed, 525618409 insertions(+), 504434267 deletions(-)

Probably the best way to compare the changes is by comparing line counts, which shows that most files are pretty similarly sized, except for compendia/ChemicalEntity.txt (which is 1577.58% bigger), compendia/MolecularMixture.txt (58.43% bigger) and synonyms/MolecularMixture.txt (56.51% bigger).

Does anybody have suggestions for comparing/validating the new Babel output before we try to move it to the dev server? We could for instance dump all the IDs alphabetically and run a massive diff on that. Having some method to do this would help with #36 as well.

File (line count) January 1, 2022 April 4, 2022 Percentage change
reports/chemical_completeness.txt 1 1 0.00%
reports/disease_completeness.txt 60 123 105.00%
reports/taxon_done 1 1 0.00%
reports/process_done 1 1 0.00%
reports/ChemicalEntity.txt 741 732 -1.21%
reports/MolecularMixture.txt 1144 1182 3.32%
reports/gene_done 1 1 0.00%
reports/ChemicalMixture.txt 15 17 13.33%
reports/protein_done 1 1 0.00%
reports/anatomy_done 1 1 0.00%
reports/MolecularActivity.txt 39 41 5.13%
reports/Disease.txt 4154 4174 0.48%
reports/OrganismTaxon.txt 11 11 0.00%
reports/Protein.txt 197 274 39.09%
reports/Cell.txt 42 43 2.38%
reports/genefamily_done 1 1 0.00%
reports/CellularComponent.txt 36 40 11.11%
reports/process_completeness.txt 3 1 -66.67%
reports/ComplexMolecularMixture.txt 18 15 -16.67%
reports/taxon_completeness.txt 1 1 0.00%
reports/anatomy_completeness.txt 1 1 0.00%
reports/PhenotypicFeature.txt 603 626 3.81%
reports/GrossAnatomicalStructure.txt 43 45 4.65%
reports/Polypeptide.txt 20 22 10.00%
reports/BiologicalProcess.txt 15 14 -6.67%
reports/disease_done 1 1 0.00%
reports/gene_completeness.txt 1 1 0.00%
reports/Pathway.txt 11 11 0.00%
reports/genefamily_completeness.txt 1 1 0.00%
reports/AnatomicalEntity.txt 52 62 19.23%
reports/SmallMolecule.txt 4384 4432 1.09%
reports/protein_completeness.txt 1 1 0.00%
reports/chemicals_done 1 1 0.00%
reports/GeneFamily.txt 9 9 0.00%
reports/Gene.txt 45 47 4.44%
compendia/ChemicalEntity.txt 392499 6584478 1577.58%
compendia/MolecularMixture.txt 6334558 10035657 58.43%
compendia/ChemicalMixture.txt 475 482 1.47%
compendia/MolecularActivity.txt 145925 149030 2.13%
compendia/Disease.txt 322229 332754 3.27%
compendia/OrganismTaxon.txt 2375027 2412122 1.56%
compendia/Protein.txt 223676217 232834484 4.09%
compendia/Cell.txt 7678 8210 6.93%
compendia/CellularComponent.txt 12510 12623 0.90%
compendia/ComplexMolecularMixture.txt 165 169 2.42%
compendia/PhenotypicFeature.txt 355408 345793 -2.71%
compendia/GrossAnatomicalStructure.txt 10379 10238 -1.36%
compendia/Polypeptide.txt 408 409 0.25%
compendia/BiologicalProcess.txt 27790 27714 -0.27%
compendia/Pathway.txt 52370 52452 0.16%
compendia/AnatomicalEntity.txt 142269 143562 0.91%
compendia/SmallMolecule.txt 104226454 100590131 -3.49%
compendia/GeneFamily.txt 27892 27770 -0.44%
compendia/Gene.txt 37802616 40108195 6.10%
synonyms/ChemicalEntity.txt 698121 732464 4.92%
synonyms/MolecularMixture.txt 6555269 10259687 56.51%
synonyms/ChemicalMixture.txt 1856 1870 0.75%
synonyms/MolecularActivity.txt 195416 198534 1.60%
synonyms/Disease.txt 1189024 1219429 2.56%
synonyms/OrganismTaxon.txt 69926 69993 0.10%
synonyms/Protein.txt 1421157 1451959 2.17%
synonyms/Cell.txt 20674 23034 11.42%
synonyms/CellularComponent.txt 28577 29027 1.57%
synonyms/ComplexMolecularMixture.txt 878 890 1.37%
synonyms/PhenotypicFeature.txt 858920 855136 -0.44%
synonyms/GrossAnatomicalStructure.txt 52860 52553 -0.58%
synonyms/Polypeptide.txt 1641 1628 -0.79%
synonyms/BiologicalProcess.txt 112432 112364 -0.06%
synonyms/Pathway.txt 54941 55021 0.15%
synonyms/AnatomicalEntity.txt 309911 315239 1.72%
synonyms/SmallMolecule.txt 108292041 107313775 -0.90%
synonyms/GeneFamily.txt 27892 27770 -0.44%
synonyms/Gene.txt 497027 564344 13.54%
conflation/GeneProtein.txt 8168582 8692887 6.42%
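For reference, the percentage-change column above can be reproduced with a few lines of Python. This is a minimal sketch; `count_lines` and `percent_change` are hypothetical helper names, not part of Babel.

```python
def count_lines(path):
    """Count lines in a file without loading it into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def percent_change(old, new):
    """Relative change from old to new, as a percentage."""
    if old == 0:
        raise ValueError("cannot compute a percentage change from zero")
    return (new - old) / old * 100


# Line counts for compendia/ChemicalEntity.txt taken from the table above.
print(f"{percent_change(392499, 6584478):.2f}%")
```

A comparison like this only flags files whose size changed; two runs with the same line count can still differ in content, so it would complement rather than replace an ID-level diff.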

Should we explicitly add "A1.2.3" to the list of anatomical entities?

In the following code, A1.2.3 Fully Formed Anatomical Structure is listed as a UMLS category in the comments, but is not listed on line 77:

#UMLS categories:
#A1.2 Anatomical Structure
#A1.2.1 Embryonic Structure
#A1.2.3 Fully Formed Anatomical Structure
#A1.2.3.1 Body Part, Organ, or Organ Component
#A1.2.3.2 Tissue
#A1.2.3.3 Cell
#A1.2.3.4 Cell Component
#A2.1.4.1 Body System
#A2.1.5.1 Body Space or Junction
#A2.1.5.2 Body Location or Region
umlsmap = {x: ANATOMICAL_ENTITY for x in ['A1.2', 'A1.2.1', 'A1.2.3.1', 'A1.2.3.2', 'A2.1.4.1', 'A2.1.5.1', 'A2.1.5.2']}

According to TranslatorSRI/NodeNormalization#119 (comment), there are 41 UMLS IDs classified as A1.2.3 that are leftover at the end of processing. Adding them as anatomical entities could remove them from the UMLS generation.

Next steps:

  • Get a list of the 41 anatomical entities leftover in UMLS at the end of processing.
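If we do decide to add it, the change would presumably look something like the sketch below. The value of `ANATOMICAL_ENTITY` is an assumption made so the example runs on its own; A1.2.3.3 (Cell) and A1.2.3.4 (Cell Component) are left out, as in the original list.

```python
# Assumed value; the real constant is defined elsewhere in Babel.
ANATOMICAL_ENTITY = "biolink:AnatomicalEntity"

# Same list as the existing code, with 'A1.2.3' added.
umls_anatomy_types = [
    'A1.2', 'A1.2.1', 'A1.2.3', 'A1.2.3.1', 'A1.2.3.2',
    'A2.1.4.1', 'A2.1.5.1', 'A2.1.5.2',
]
umlsmap = {x: ANATOMICAL_ENTITY for x in umls_anatomy_types}
```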

GTOPDB

There are a few failure modes:

  1. The chemical is something without a structure. We should probably bring in all the identifiers, a la the mesh update and chebi.
  2. There is an InChI, but it isn't in unichem for gtopdb, even if it is for e.g. pubchem (10532). This one is really bad, because it means we can't 100% rely on unichem. Maybe not a lot of these? Hopefully? If we can pull a list of InChIs with the chemicals, we should still be ok; glom should handle it.
  3. Peptides (4440, 6759). Not sure if we're rejecting these on purpose, but we shouldn't.

Babel -> kboom

Babel was meant to be a prototype solution for identifier equivalence. It solves some problems, and is based on pre-existing code. In the long run, we would like to move to a more principled form of identifier equivalence mapping.

Babel should turn into a set of scripts that just pull data from sources and write that data in a format that can serve as fodder for more advanced algorithms, such as kboom.

Targeting for Translator 1B.

GO

There are GO terms that are not getting in, like GO:0052859 (molecular function) and GO:1990333 (cellular component).

The only thing I can think of at the moment is that our subclass query is flawed somehow.

Non-phenotypes given phenotype type

HP contains terms that are not phenotypes. Things like HP:0001427 "Mitochondrial Inheritance", as well as terms like "triggered by" or other disease characteristics.

These are appearing in here as 'phenotypic_features' which they are not. I'm fairly sure that they're not getting in via HP, but probably via UMLS, and then we see the HP and say 'oh that must be a phenotype'.

BSPO and CARO

These need to get added to anatomical entity as prefixes.

Upgrade to pyoxigraph 0.3

We currently use pyoxigraph 0.2. In order to upgrade to pyoxigraph 0.3, we'll need to replace references to the MemoryStore class (which has been removed from this package) with references to Store instead.

too many files open when running chemicals.py

When running chemicals.py I get

Traceback (most recent call last):
  File "babel/chemicals.py", line 682, in <module>
    load_chemicals(refresh_mesh=False,refresh_uniprot=False,refresh_pubchem=False,refresh_chembl=False)
  File "babel/chemicals.py", line 151, in load_chemicals
    concord = load_unichem(refresh=True)
  File "/opt/Babel/babel/unichem/unichem.py", line 21, in load_unichem
    return refresh_unichem(working_dir,xref_file,struct_file)
  File "/opt//Babel/babel/unichem/unichem.py", line 40, in refresh_unichem
    sorted_xref_file = sort_xref_file(srcfiltered_xref_file, xref_file)
  File "/opt/Babel/babel/unichem/unichem.py", line 244, in sort_xref_file
    batch_sort(inf, outf, key=uci_key, tempdirs='.')
  File "/opt/Babel/babel/big_gz_sort.py", line 41, in batch_sort
    output_chunk = open(os.path.join(tempdir,'%06i'%len(chunks)),'w+b',64*1024)
OSError: [Errno 24] Too many open files: './001016'

Is there a way to fix this without adjusting ulimit on the client OS?
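One possible direction, sketched here under assumptions: instead of opening every chunk file at once for the final merge, merge them in several passes with a bounded number of files open at a time. The function below is hypothetical and is not the `batch_sort` in big_gz_sort.py; it works on plain text files and assumes every line ends with a newline.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_chunk(lines, tmpdir):
    """Write an iterable of lines to a new temp file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(lines)
    return path


def _merge(paths, tmpdir, key):
    """Merge a small group of sorted chunk files into one sorted chunk."""
    files = [open(p) for p in paths]
    try:
        return _write_chunk(heapq.merge(*files, key=key), tmpdir)
    finally:
        for f in files:
            f.close()
        for p in paths:
            os.remove(p)


def bounded_sort(in_path, out_path, key=None, chunk_lines=100_000, max_open=64):
    """External sort that never holds more than max_open chunk files open."""
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    with tempfile.TemporaryDirectory() as tmpdir:
        # Pass 1: split the input into sorted chunks.
        chunks = []
        with open(in_path) as inf:
            while True:
                batch = sorted(islice(inf, chunk_lines), key=key)
                if not batch:
                    break
                chunks.append(_write_chunk(batch, tmpdir))
        # Pass 2..n: merge groups of at most max_open chunks until one remains.
        while len(chunks) > 1:
            chunks = [
                _merge(chunks[i:i + max_open], tmpdir, key)
                for i in range(0, len(chunks), max_open)
            ]
        # Copy the single remaining chunk (if any) to the output file.
        with open(out_path, "w") as outf:
            if chunks:
                with open(chunks[0]) as f:
                    outf.writelines(f)
```

The trade-off is extra disk I/O from the additional merge passes in exchange for staying under the open-file limit. Using larger chunks in the first pass would also reduce the number of files.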

Mesh chemicals

There are mesh chemicals (like MESH:C545823) that are not getting into babel. We should probably do like we do for chebi and bring in everything whether or not it synonymizes. (i.e. if we don't have any xrefs for it, we can at least give it the mesh id).

SSSOM output

Instead of a special file format, we should perhaps be writing out SSSOM files.

Is that going to make the files even more gigantic? Probably? Is that a problem?

Improve typing of CHEMBL

We are just using SMILES to determine what kind of thing a CHEMBL.COMPOUND is, but there is a CHEMBL structure that we should make some attempt to use. There's proteins and enzymes and biologicals etc...

Add complex portal to downloads

  • Add a complex portal section to src/snakefiles/datacollect.snakefile
  • Create a complex portal ingester to do the work in src.datahandlers to download, extract labels and synonyms
  • Decide on the biolink class (Cellular Component or Macromolecular Complex)
  • Add the appropriate prefix to biolink

No Mesh for Human?

organism_taxon has been added, handling NCBITaxon and Mesh.

But it appears that NCBITaxon:9606 (Homo sapiens) doesn't map to a mesh term? Seems mighty suspicious.

kegg chemicals are not getting in

We are looking for KEGG.COMPOUND, but biolink model is using KEGG.

I think KEGG.COMPOUND is the identifiers.org version? Fix model or fix babel/rk ?

Potential bug in chemicals.write_unii_ids()

I think the continue in this code block is going to the wrong place -- it should be skipping the line, but instead I think it just continues the inner for loop. Something to check.

for line in inf:
    x = line.strip().split('\t')
    for bcn in bad_colnos:
        if len(x[bcn]) > 0:
            #This is a plant or an eye of newt or something
            continue
    outf.write(f'{UNII}:{x[0]}\t{CHEMICAL_ENTITY}\n')
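If the intent is to skip the whole line whenever any of the bad columns is populated, one way to express that is with `any()`, so the `continue` applies to the outer loop. This is a sketch only: the constant values and the function signature are assumptions, and it uses `rstrip('\n')` rather than `strip()` so that empty trailing columns are not lost before indexing.

```python
import io

# Assumed values; the real constants are defined elsewhere in Babel.
UNII = "UNII"
CHEMICAL_ENTITY = "biolink:ChemicalEntity"


def write_unii_ids(inf, outf, bad_colnos):
    """Write one ID line per input row, skipping rows with any bad column set."""
    for line in inf:
        x = line.rstrip("\n").split("\t")
        if any(len(x[bcn]) > 0 for bcn in bad_colnos):
            # This is a plant or an eye of newt or something: skip the row.
            continue
        outf.write(f"{UNII}:{x[0]}\t{CHEMICAL_ENTITY}\n")


# Example: the second row has a value in a bad column, so only A1 is written.
out = io.StringIO()
write_unii_ids(io.StringIO("A1\t\t\nB2\tplant\t\n"), out, bad_colnos=[1, 2])
```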

Document Babel

The documentation of how babel works is pretty out of date. Write some words to help people understand what it is and how to use it.

Add macromolecular machine compendium

  • Add ComplexPortal to biolink as a prefix
  • Create a mm snakemake and createcompendia module
  • Check in SGD for other identifiers we want to merge (check with Jon-Michael)
  • Assuming we do find something, add datahandler for it
  • in mm snakefile, add rules for ids ; if necessary add code to module
  • in mm snakefile, add rules for relationships/concords; add code to module
  • in mm snakefile add rule for create compendium; add code to module
  • in mm snakefile add assessment rules

Implement Disease/Phenotype conflation

We merge diseases and phenotypes when the same term occurs in both MONDO and HP. But this isn't totally correct because diseases are not phenotypes (even if they kind of are). Sometimes, for unclear reasons, the mappings don't work out too well (see e.g. asthma).

We should make disease and phenotype another form of conflation and be more careful with it. We can at least partially use MONDO:otherHierarchy to build the conflation tables. The main problem I foresee is when you have MONDO claiming equivalence to (say) a UMLS term and HP doing the same, so in that case we'll need to have some kind of rule about what goes where.

Handle missing efo

Jim has removed EFO from ubergraph, so we will need to get it directly rather than from ubergraph.

Estrogens turning into a phenotype

If you normalize MESH:D004967 it gets merged with a bunch of stuff like NCIT:C483, which is therapeutic estrogen. That's not too bad, but the problem comes in because somehow that is getting made into a phenotypic feature.

Improve process

Currently, runs are somewhat ad hoc. We should probably have scheduled builds, and improved reporting / testing on what has changed.

There are reports generated for each file, but nothing comparing to previous runs to get a sense of what has changed or to look for unexpected differences.

There should also be some cross-file checks. Like does the same identifier get pulled into more than one file?
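A cross-file check of that kind could be as simple as the sketch below. It deliberately takes already-extracted identifiers rather than file paths, so it makes no assumption about the compendium file format; `find_shared_identifiers` is a hypothetical name.

```python
from collections import defaultdict


def find_shared_identifiers(ids_by_file):
    """Return {identifier: [files]} for identifiers found in more than one file."""
    files_for_id = defaultdict(set)
    for filename, identifiers in ids_by_file.items():
        for identifier in identifiers:
            files_for_id[identifier].add(filename)
    return {
        identifier: sorted(files)
        for identifier, files in files_for_id.items()
        if len(files) > 1
    }


# Toy example with made-up contents.
shared = find_shared_identifiers({
    "compendia/Disease.txt": ["MONDO:0011122", "UMLS:C0000001"],
    "compendia/PhenotypicFeature.txt": ["HP:0012378", "UMLS:C0000001"],
})
```

For the largest files, holding every identifier in memory may not be practical, so a real check might need to work on sorted ID dumps instead.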

Trembl

Do we incorporate Trembl? How?

Do trembl identifiers go into a clique with swissprots? Or do we handle outside with similarity edges?

UniProtKB not synonymized with genes

In genes.py we're unifying UniProtKB with genes. But because UniProtKB is not a gene prefix in biolink model, we're filtering those out when writing the compendium. What do we want to do here?

  1. Modify the real biolink model
  2. fork bl
  3. add a gene-product compendium with uniprots and PRs and then have a source linking genes to gene products (nasty).

I think the right answer is 2.

Maybe we fork bl into the prototypes repo here?

Add gene family

Add compendia for gene family (panther family, hgnc). I think each will be independent, i.e. there is no synonymization across the two.

Mesh not-chemicals

There are some MeSH phenotype terms that don't correspond to anything in mondo or hp, so they get dropped.
"fibrosis" (MeSH:D005355) is an example. HP/MONDO have many specific things like renal fibrosis etc, but not the concept of fibrosis itself.

UMLS does have this concept, and maps it to this mesh term, so we should be using that I think.

Pharos & Chembl

There is a large number of chembl compounds coming from pharos that don't appear to actually exist in chembl. Perhaps these are obsolete?

Make sure labels are explicitly set as prerequisites for rules that need them

I noticed that the 2022sep6 Babel release doesn't have any UMLS labels in ChemicalEntity.txt. I have a theory as to why this might be: since chemical.snakefile doesn't explicitly indicate that it depends on [download_directory]/UMLS/labels, snakemake could be scheduling it to run before it generates [download_directory]/UMLS/labels. I will investigate this further.

include previous versions of panther

Right now, the panther families are only from the latest version, but there are identifiers that only exist in the older versions. So we should be pulling either all or a bunch of past versions and including them as well.

Move KGX transform from NN to Babel

We have at least one consumer of babel output that wants KGX format. We have a KGX converter, but it lives in nodenorm for some reason. It should be extracted from there and put over here and generation of KGX outputs should be made part of the build process.

Distribute work (Slurm, AirFlow)

snakemake workflows can be run in parallel. It can interface with slurm and AirFlow can also handle it. What's the right implementation at RENCI?

Yeast ensembl gene ids

Nodenorm includes yeast genes. For these genes, it looks like we have ncbigene, SGD, UniProt and PR, but for some reason, not ensembl.

We're bringing in ensembl genes, so why not yeasts?

Mesh chemicals + UNII

There are a bunch of chemicals that ended up with a MESH as their main identifier, and which have a UNII and nothing else.

And there is a CHEBI that they seem like maybe they should be associated with. See e.g. Chloroquine. Is this just a case of hydrous/anhydrous or is there more to it?

Mondo id has bad HP links

MONDO:0005379 is neurotic disorder. It is being grouped with many, many UMLS and other terms. And it is also grouped with several HP terms:

      {
        "identifier": "HP:0030973",
        "label": "Postexertional malaise"
      },
      {
        "identifier": "HP:0012432",
        "label": "Chronic fatigue"
      },
      {
        "identifier": "HP:0012378",
        "label": "Fatigue"
      },
      {
        "identifier": "HP:0025406",
        "label": "Asthenia"
      }

All of those HP terms are wrong.

Match KGX serialization for nodes. Was: Is this the right format for the final identifier?

curl -X GET "https://nodenormalization-sri.renci.org/get?key=MONDO%3A0011122" -H "accept: application/json"

...
"id": {
      "identifier": "MONDO:0011122",
      "label": "obesity disorder"
    },
...

Should it be like this, or should it be more like:

"id":"MONDO:0011122",
"label":"obesity disorder"

i.e. without the wrapper around identity. Because of how we assign labels, it's not necessarily true that the id and the label come from the same vocabulary, which is sort of implied by the grouping.
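To make the difference concrete, here is a small sketch that converts the current nested form into the flat form proposed above. `flatten_id` is a hypothetical helper, not something that exists in Babel or NodeNormalization.

```python
def flatten_id(node):
    """Turn {"id": {"identifier": ..., "label": ...}} into flat id/label keys."""
    flat = {k: v for k, v in node.items() if k != "id"}
    wrapped = node["id"]
    flat["id"] = wrapped["identifier"]
    # The label is optional, so only copy it when present.
    if "label" in wrapped:
        flat["label"] = wrapped["label"]
    return flat


nested = {"id": {"identifier": "MONDO:0011122", "label": "obesity disorder"}}
flat = flatten_id(nested)
```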
