translatorsri / babel
Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
License: MIT License
In biolink 2, we are following these rules:
if it has a smiles:
    if '.' in smiles:
        MolecularMixture
    else:
        SmallMolecule
else:
    ChemicalEntity
That last one can be improved.
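As a minimal sketch, the rules above could be written as a Python function. The function name `classify_chemical` is hypothetical and is not taken from the Babel code base, and treating an empty string the same as a missing SMILES is an assumption.

```python
def classify_chemical(smiles):
    """Return a Biolink 2 class name using only the SMILES-based rules above.

    `classify_chemical` is a hypothetical name used for illustration.
    A missing or empty SMILES is treated as "no SMILES" (an assumption).
    """
    if smiles:
        # A '.' in SMILES separates disconnected components, so it signals a mixture.
        if "." in smiles:
            return "MolecularMixture"
        return "SmallMolecule"
    # Fallback for anything without a SMILES; this is the case that can be improved.
    return "ChemicalEntity"


if __name__ == "__main__":
    for example in ("CCO", "[Na+].[Cl-]", None):
        print(example, classify_chemical(example))
```

Keeping the fallback in one place makes it easier to later refine the `ChemicalEntity` branch without touching the SMILES logic.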
I tried to build the chemical compendium using the documentation in the README, but it failed as follows:
Babel % snakemake --cores 1 chemical
Building DAG of jobs...
MissingInputException in line 86 of /Users/bsmith/isb/Babel/src/snakefiles/chemical.snakefile:
Missing input files for rule chemical_drugbank_ids:
output: /Users/bsmith/isb/Babel/babel_downloads/chemicals/ids/DRUGBANK
affected files:
/Users/bsmith/isb/Babel/babel_downloads/DrugBank/UC_XREF.srcfiltered.txt
I assume that this may simply be a case where the documentation is out of date, as per #32.
If there are updated instructions, even if it's just a few commands specific to the chemicals, that can be shared here ahead of an update to the README, I'd really appreciate seeing them.
Please let me know if there is anything I can clarify or contribute here. Thanks!
It looks like there's a problem with the non-InChI path for bringing in CHEBI identifiers.
If we have no InChI but do have a SMILES, here are a few that are getting dropped:
71095
72219
145476
If we have neither, here are a few: 52707, 91003
And sometimes we are overwriting something that does have an InChI with something that doesn't have one. So there's a chemical that should have ID 83920 but for some reason has 81762.
I've just completed my first run of Babel on Sterling (on a container with 500GB of memory!) using the changes in draft PR #37. The results I've obtained (on Hatteras at /scratch/gaurav/babel-outputs/2022apr4) have lots of differences from the 2022-01-01 run, but I haven't come up with a good way of summarizing the changes or figuring out if it's working "correctly".
I've tried using diff/diffstat, but there are tons of changes, so it's not easy to see how significant the changes are. I tried diffing some files individually, and was able to find a few patterns: for example, the polypeptide LSM-37009 in synonyms/Polypeptide.txt is referred to as CHEBI:125504 in the new run and INCHIKEY:GGLDQJNBYFODOM-RDCMKPLUSA-N in the previous run.
compendia/AnatomicalEntity.txt |284873
compendia/BiologicalProcess.txt |55258
compendia/Cell.txt |15690
compendia/CellularComponent.txt |24855
compendia/ChemicalEntity.txt |6976071
compendia/ChemicalMixture.txt | 889
compendia/ComplexMolecularMixture.txt | 296
compendia/Disease.txt |654029
compendia/Gene.txt |77898179 ++---
compendia/GeneFamily.txt |55418
compendia/GrossAnatomicalStructure.txt |20397
compendia/MolecularActivity.txt |294143
compendia/MolecularMixture.txt |16366879 -
compendia/OrganismTaxon.txt |4783919
compendia/Pathway.txt |104290
compendia/PhenotypicFeature.txt |700283
compendia/Polypeptide.txt | 753
compendia/Protein.txt |456484451 ++++++++++++++++-----------------
compendia/SmallMolecule.txt |204804339 +++++++-------
conflation/GeneProtein.txt |16857753 -
reports/AnatomicalEntity.txt | 100
reports/BiologicalProcess.txt | 17
reports/Cell.txt | 73
reports/CellularComponent.txt | 60
reports/ChemicalEntity.txt | 1451
reports/ChemicalMixture.txt | 20
reports/ComplexMolecularMixture.txt | 23
reports/Disease.txt | 8278
reports/Gene.txt | 72
reports/GeneFamily.txt | 8
reports/GrossAnatomicalStructure.txt | 80
reports/MolecularActivity.txt | 70
reports/MolecularMixture.txt | 2310
reports/OrganismTaxon.txt | 12
reports/Pathway.txt | 8
reports/PhenotypicFeature.txt | 1175
reports/Polypeptide.txt | 30
reports/Protein.txt | 445
reports/SmallMolecule.txt | 8790
reports/disease_completeness.txt | 69
reports/process_completeness.txt | 4
synonyms/AnatomicalEntity.txt |624380
synonyms/BiologicalProcess.txt |224432
synonyms/Cell.txt |43534
synonyms/CellularComponent.txt |57426
synonyms/ChemicalEntity.txt |1428975
synonyms/ChemicalMixture.txt | 3566
synonyms/ComplexMolecularMixture.txt | 1672
synonyms/Disease.txt |2407347
synonyms/Gene.txt |1060645
synonyms/GeneFamily.txt |55418
synonyms/GrossAnatomicalStructure.txt |105021
synonyms/MolecularActivity.txt |393102
synonyms/MolecularMixture.txt |16811698 -
synonyms/OrganismTaxon.txt |139483
synonyms/Pathway.txt |109508
synonyms/PhenotypicFeature.txt |1712740
synonyms/Polypeptide.txt | 3133
synonyms/Protein.txt |2871694
synonyms/SmallMolecule.txt |215593042 +++++++--------
60 files changed, 525618409 insertions(+), 504434267 deletions(-)
Probably the best way to compare the changes is by comparing line counts, which shows that most files are pretty similarly sized, except for compendia/ChemicalEntity.txt (which is 1577.58% bigger), compendia/MolecularMixture.txt (58.43% bigger) and synonyms/MolecularMixture.txt (56.51% bigger).
Does anybody have suggestions for comparing/validating the new Babel output before we try to move it to the dev server? We could for instance dump all the IDs alphabetically and run a massive diff on that. Having some method to do this would help with #36 as well.
File | January 1, 2022 | April 4, 2022 | Percentage change |
---|---|---|---|
reports/chemical_completeness.txt | 1 | 1 | 0.00% |
reports/disease_completeness.txt | 60 | 123 | 105.00% |
reports/taxon_done | 1 | 1 | 0.00% |
reports/process_done | 1 | 1 | 0.00% |
reports/ChemicalEntity.txt | 741 | 732 | -1.21% |
reports/MolecularMixture.txt | 1144 | 1182 | 3.32% |
reports/gene_done | 1 | 1 | 0.00% |
reports/ChemicalMixture.txt | 15 | 17 | 13.33% |
reports/protein_done | 1 | 1 | 0.00% |
reports/anatomy_done | 1 | 1 | 0.00% |
reports/MolecularActivity.txt | 39 | 41 | 5.13% |
reports/Disease.txt | 4154 | 4174 | 0.48% |
reports/OrganismTaxon.txt | 11 | 11 | 0.00% |
reports/Protein.txt | 197 | 274 | 39.09% |
reports/Cell.txt | 42 | 43 | 2.38% |
reports/genefamily_done | 1 | 1 | 0.00% |
reports/CellularComponent.txt | 36 | 40 | 11.11% |
reports/process_completeness.txt | 3 | 1 | -66.67% |
reports/ComplexMolecularMixture.txt | 18 | 15 | -16.67% |
reports/taxon_completeness.txt | 1 | 1 | 0.00% |
reports/anatomy_completeness.txt | 1 | 1 | 0.00% |
reports/PhenotypicFeature.txt | 603 | 626 | 3.81% |
reports/GrossAnatomicalStructure.txt | 43 | 45 | 4.65% |
reports/Polypeptide.txt | 20 | 22 | 10.00% |
reports/BiologicalProcess.txt | 15 | 14 | -6.67% |
reports/disease_done | 1 | 1 | 0.00% |
reports/gene_completeness.txt | 1 | 1 | 0.00% |
reports/Pathway.txt | 11 | 11 | 0.00% |
reports/genefamily_completeness.txt | 1 | 1 | 0.00% |
reports/AnatomicalEntity.txt | 52 | 62 | 19.23% |
reports/SmallMolecule.txt | 4384 | 4432 | 1.09% |
reports/protein_completeness.txt | 1 | 1 | 0.00% |
reports/chemicals_done | 1 | 1 | 0.00% |
reports/GeneFamily.txt | 9 | 9 | 0.00% |
reports/Gene.txt | 45 | 47 | 4.44% |
compendia/ChemicalEntity.txt | 392499 | 6584478 | 1577.58% |
compendia/MolecularMixture.txt | 6334558 | 10035657 | 58.43% |
compendia/ChemicalMixture.txt | 475 | 482 | 1.47% |
compendia/MolecularActivity.txt | 145925 | 149030 | 2.13% |
compendia/Disease.txt | 322229 | 332754 | 3.27% |
compendia/OrganismTaxon.txt | 2375027 | 2412122 | 1.56% |
compendia/Protein.txt | 223676217 | 232834484 | 4.09% |
compendia/Cell.txt | 7678 | 8210 | 6.93% |
compendia/CellularComponent.txt | 12510 | 12623 | 0.90% |
compendia/ComplexMolecularMixture.txt | 165 | 169 | 2.42% |
compendia/PhenotypicFeature.txt | 355408 | 345793 | -2.71% |
compendia/GrossAnatomicalStructure.txt | 10379 | 10238 | -1.36% |
compendia/Polypeptide.txt | 408 | 409 | 0.25% |
compendia/BiologicalProcess.txt | 27790 | 27714 | -0.27% |
compendia/Pathway.txt | 52370 | 52452 | 0.16% |
compendia/AnatomicalEntity.txt | 142269 | 143562 | 0.91% |
compendia/SmallMolecule.txt | 104226454 | 100590131 | -3.49% |
compendia/GeneFamily.txt | 27892 | 27770 | -0.44% |
compendia/Gene.txt | 37802616 | 40108195 | 6.10% |
synonyms/ChemicalEntity.txt | 698121 | 732464 | 4.92% |
synonyms/MolecularMixture.txt | 6555269 | 10259687 | 56.51% |
synonyms/ChemicalMixture.txt | 1856 | 1870 | 0.75% |
synonyms/MolecularActivity.txt | 195416 | 198534 | 1.60% |
synonyms/Disease.txt | 1189024 | 1219429 | 2.56% |
synonyms/OrganismTaxon.txt | 69926 | 69993 | 0.10% |
synonyms/Protein.txt | 1421157 | 1451959 | 2.17% |
synonyms/Cell.txt | 20674 | 23034 | 11.42% |
synonyms/CellularComponent.txt | 28577 | 29027 | 1.57% |
synonyms/ComplexMolecularMixture.txt | 878 | 890 | 1.37% |
synonyms/PhenotypicFeature.txt | 858920 | 855136 | -0.44% |
synonyms/GrossAnatomicalStructure.txt | 52860 | 52553 | -0.58% |
synonyms/Polypeptide.txt | 1641 | 1628 | -0.79% |
synonyms/BiologicalProcess.txt | 112432 | 112364 | -0.06% |
synonyms/Pathway.txt | 54941 | 55021 | 0.15% |
synonyms/AnatomicalEntity.txt | 309911 | 315239 | 1.72% |
synonyms/SmallMolecule.txt | 108292041 | 107313775 | -0.90% |
synonyms/GeneFamily.txt | 27892 | 27770 | -0.44% |
synonyms/Gene.txt | 497027 | 564344 | 13.54% |
conflation/GeneProtein.txt | 8168582 | 8692887 | 6.42% |
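The percentage-change column in the table above can be reproduced, and the outliers picked out automatically, with a short script. This is only a sketch: the function names and the 25% threshold are assumptions chosen for illustration, and the line counts would have to be gathered separately (for example with `wc -l`).

```python
def percent_change(old, new):
    """Percentage change from old to new, or None when old is zero."""
    if old == 0:
        return None
    return (new - old) / old * 100.0


def flag_large_changes(old_counts, new_counts, threshold=25.0):
    """Return (name, old, new, change) for files whose line count moved a lot.

    Files missing from one run are counted as having zero lines there.
    The default threshold is an arbitrary illustrative value.
    """
    flagged = []
    for name in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(name, 0)
        new = new_counts.get(name, 0)
        change = percent_change(old, new)
        if change is None or abs(change) >= threshold:
            flagged.append((name, old, new, change))
    return flagged


if __name__ == "__main__":
    # Line counts copied from three rows of the table above.
    january = {
        "compendia/ChemicalEntity.txt": 392499,
        "compendia/Polypeptide.txt": 408,
        "reports/process_completeness.txt": 3,
    }
    april = {
        "compendia/ChemicalEntity.txt": 6584478,
        "compendia/Polypeptide.txt": 409,
        "reports/process_completeness.txt": 1,
    }
    for name, old, new, change in flag_large_changes(january, april):
        print(f"{name}: {old} -> {new} ({change:+.2f}%)")
```

Line counts only catch changes in size; a file whose contents were reshuffled at constant length would still need an identifier-level comparison.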
curl -X GET "https://nodenormalization-sri.renci.org/get?key=NCBIGene%3A144571" -H "accept: application/json"
...
"type": [
"gene",
"named_thing",
"biological_entity",
"molecular_entity",
"genomic_entity",
"macromolecular_machine",
"gene_or_gene_product"
]
...
It starts at the right level, but then the ancestors should be inverted, starting at the parent of gene and moving up to named_thing.
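One way to get the requested ordering is a breadth-first walk up from the most specific type, so that nearer ancestors are listed before more distant ones. The sketch below assumes a child-to-parents mapping is available; the mapping shown is a toy example for illustration and is not copied from the Biolink Model, and `order_types` is a hypothetical name.

```python
def order_types(leaf, parents):
    """List a type followed by its ancestors, nearest ancestors first.

    `parents` maps a type name to a list of its direct parents (including mixins).
    """
    ordered = [leaf]
    seen = {leaf}
    frontier = [leaf]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    ordered.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return ordered


if __name__ == "__main__":
    # Toy hierarchy (an assumption, for illustration only).
    toy_parents = {
        "gene": ["gene_or_gene_product", "genomic_entity"],
        "gene_or_gene_product": ["macromolecular_machine"],
        "genomic_entity": ["molecular_entity"],
        "molecular_entity": ["biological_entity"],
        "biological_entity": ["named_thing"],
    }
    print(order_types("gene", toy_parents))
```

With mixins in the hierarchy, a breadth-first walk does not by itself guarantee that named_thing comes last; that holds for the toy mapping here but would need checking against the real model.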
In the following code, A1.2.3 Fully Formed Anatomical Structure is listed as a UMLS category in the comments, but is not listed on line 77:
Babel/src/createcompendia/anatomy.py
Lines 66 to 77 in b0d638e
According to TranslatorSRI/NodeNormalization#119 (comment), there are 41 UMLS IDs classified as A1.2.3 that are left over at the end of processing. Adding them as anatomical entities could remove them from the UMLS generation.
Next steps:
There are a few failure modes:
Babel was meant to be a prototype solution for identifier equivalence. It solves some problems, and is based on pre-existing code. In the long run, we would like to move to a more principled form of identifier equivalence mapping.
Babel should become a set of scripts that simply pull data from sources and write it in a format that can serve as input for more advanced algorithms, such as kboom.
Targeting for Translator 1B.
There are GO terms that are not getting in, like GO:0052859 (molecular function) and GO:1990333 (cellular component).
The only thing I can think of at the moment is that our subclass query is flawed somehow.
There's a bl pr for this https://github.com/biolink/biolink-model/pull/295/files which will then have to be pulled into our bl service.
HP contains terms that are not phenotypes. Things like
HP:0001427 "Mitochondrial Inheritance", as well as terms like "triggered by" or other disease characteristics.
These are appearing in here as 'phenotypic_features' which they are not. I'm fairly sure that they're not getting in via HP, but probably via UMLS, and then we see the HP and say 'oh that must be a phenotype'.
Need to get from NCBI gene / ENSEMBL
These need to get added to anatomical entity as prefixes.
We currently use pyoxigraph 0.2. In order to upgrade to pyoxigraph 0.3, we'll need to replace references to the MemoryStore class (which has been removed from this package) with references to Store instead.
When running chemicals.py I get
Traceback (most recent call last):
File "babel/chemicals.py", line 682, in <module>
load_chemicals(refresh_mesh=False,refresh_uniprot=False,refresh_pubchem=False,refresh_chembl=False)
File "babel/chemicals.py", line 151, in load_chemicals
concord = load_unichem(refresh=True)
File "/opt/Babel/babel/unichem/unichem.py", line 21, in load_unichem
return refresh_unichem(working_dir,xref_file,struct_file)
File "/opt//Babel/babel/unichem/unichem.py", line 40, in refresh_unichem
sorted_xref_file = sort_xref_file(srcfiltered_xref_file, xref_file)
File "/opt/Babel/babel/unichem/unichem.py", line 244, in sort_xref_file
batch_sort(inf, outf, key=uci_key, tempdirs='.')
File "/opt/Babel/babel/big_gz_sort.py", line 41, in batch_sort
output_chunk = open(os.path.join(tempdir,'%06i'%len(chunks)),'w+b',64*1024)
OSError: [Errno 24] Too many open files: './001016'
Is there a way to fix this without adjusting ulimit on the client OS?
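One possible approach, sketched here under assumptions, is to merge the sorted chunks in several passes so that only a bounded number of files is open at any time. This is not the code in big_gz_sort.py: `bounded_sort`, `chunk_lines` and `max_open` are hypothetical names, the sketch works on plain text files, and it assumes every line (including the last) ends with a newline.

```python
import heapq
import os
import tempfile
from contextlib import ExitStack
from itertools import islice


def _merge_files(paths, out_path, key):
    """Merge already-sorted files into out_path, then delete the inputs."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in paths]
        out = stack.enter_context(open(out_path, "w"))
        out.writelines(heapq.merge(*files, key=key))
    for p in paths:
        os.remove(p)


def bounded_sort(in_path, out_path, key=None, chunk_lines=100_000, max_open=64):
    """External sort that never reads more than max_open chunk files at once."""
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    with tempfile.TemporaryDirectory() as tmpdir:
        # Pass 1: split the input into sorted chunks, one open file at a time.
        chunks = []
        with open(in_path) as inf:
            while True:
                lines = sorted(islice(inf, chunk_lines), key=key)
                if not lines:
                    break
                fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
                with os.fdopen(fd, "w") as chunk:
                    chunk.writelines(lines)
                chunks.append(path)
        # Intermediate passes: merge groups of max_open chunks into bigger chunks.
        while len(chunks) > max_open:
            merged = []
            for start in range(0, len(chunks), max_open):
                fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
                os.close(fd)
                _merge_files(chunks[start:start + max_open], path, key)
                merged.append(path)
            chunks = merged
        # Final pass: at most max_open chunks remain.
        _merge_files(chunks, out_path, key)
```

The trade-off is extra disk I/O for the intermediate passes in exchange for staying under the open-file limit; a larger chunk size reduces the number of chunks and therefore the number of passes.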
There are mesh chemicals (like MESH:C545823) that are not getting into babel. We should probably do like we do for chebi and bring in everything whether or not it synonymizes. (i.e. if we don't have any xrefs for it, we can at least give it the mesh id).
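A rough sketch of that idea: after cliques have been built from xrefs, any identifier that did not end up in a clique is added as a clique of its own. The function name `add_singletons` is hypothetical, and the identifiers ending in EXAMPLE1 are placeholders rather than real terms.

```python
def add_singletons(cliques, all_ids):
    """Return the cliques plus a one-member clique for every uncovered identifier."""
    covered = set().union(*cliques)
    singletons = [{identifier} for identifier in sorted(all_ids) if identifier not in covered]
    return list(cliques) + singletons


if __name__ == "__main__":
    # Placeholder clique built from xrefs, plus one MeSH chemical with no xrefs.
    cliques = [{"MESH:EXAMPLE1", "CHEBI:EXAMPLE1"}]
    all_mesh_ids = {"MESH:EXAMPLE1", "MESH:C545823"}
    print(add_singletons(cliques, all_mesh_ids))
```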
We appear to have some incomplete merging of UMLS and PUBCHEM identifiers. See e.g. Valsartan
Instead of a special file format, we should perhaps be writing out SSSOM files.
Is that going to make the files even more gigantic? Probably? Is that a problem?
We are just using SMILES to determine what kind of thing a CHEMBL.COMPOUND is, but there is a CHEMBL structure that we should make some attempt to use. There's proteins and enzymes and biologicals etc...
Need to get from NCBI gene / ENSEMBL
https://github.com/mapping-commons/
How should Babel be interacting with this? Contributing? Pulling? Both?
organism_taxon has been added, handling NCBITaxon and Mesh.
But it appears that NCBITaxon:9606 (homo sapiens) doesn't map to a mesh term? Seems mighty suspicious.
We are looking for KEGG.COMPOUND, but biolink model is using KEGG.
I think KEGG.COMPOUND is the identifiers.org version? Fix model or fix babel/rk ?
This is likely already on the radar - but it would be useful to be able to run the disease_phenotype parser without a UMLS license
I think the continue in this code block is going to the wrong place -- it should be skipping the line, but instead I think it just continues the inner for loop. Something to check.
Babel/src/createcompendia/chemicals.py
Lines 142 to 148 in 0f1eb14
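To illustrate the suspected behaviour in general terms (this is not the code from chemicals.py, and the function and variable names are made up), a `continue` inside a nested loop only advances the inner loop. Skipping the whole line needs something like `break` together with `for ... else`:

```python
def keep_lines_buggy(lines, banned):
    """Intended to drop lines containing a banned token, but keeps everything."""
    kept = []
    for line in lines:
        for token in banned:
            if token in line:
                continue  # only moves on to the next token, not the next line
        kept.append(line)
    return kept


def keep_lines_fixed(lines, banned):
    """Drop any line that contains a banned token."""
    kept = []
    for line in lines:
        for token in banned:
            if token in line:
                break  # stop checking tokens for this line
        else:
            # Runs only when the inner loop finished without hitting break.
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = ["a x", "b", "c y"]
    print(keep_lines_buggy(sample, ["x", "y"]))
    print(keep_lines_fixed(sample, ["x", "y"]))
```

An equivalent alternative is `if any(token in line for token in banned): continue`, which keeps the `continue` in the outer loop where it has the intended effect.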
The documentation of how babel works is pretty out of date. Write some words to help people understand what it is and how to use it.
The UMLS concordance starts with a single blank line (causing trouble for the kgx transformer)
We merge diseases and phenotypes when the same term occurred in both MONDO and HP. But this isn't totally correct because diseases are not phenotypes (even if kinda they are). Sometimes for unclear reasons, the mappings don't work out too well (see e.g. asthma).
We should make disease and phenotype another form of conflation and be more careful with it. We can at least partially use MONDO:otherHierarchy to build the conflation tables. The main problem I foresee is when you have MONDO claiming equivalence to (say) a UMLS and HP doing the same, so in that case we'll need to have some kind of rule about what goes where.
Jim has removed EFO from ubergraph, we will need to get it directly rather than from ubergraph
curl -X GET "https://nodenormalization-sri.renci.org/get?key=NCBIGene%3A144571" -H "accept: application/json"
...
"id": {
"identifier": "NCBIGene:144571"
},
...
If you normalize MESH:D004967 it gets merged with a bunch of stuff like NCIT:C483, which is therapeutic estrogen. That's not too bad, but the problem comes in because somehow that is getting made into a phenotypic feature.
Currently, runs are somewhat ad hoc. We should probably have scheduled builds, and improved reporting / testing on what has changed.
There are reports generated for each file, but nothing comparing to previous runs to get a sense of what has changed or to look for unexpected differences.
There should also be some cross-file checks. Like does the same identifier get pulled into more than one file?
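A cross-file check of that kind could look roughly like the sketch below. It assumes each compendium line is a JSON object with an "identifiers" list whose entries carry the CURIE under the key "i"; that layout is an assumption to verify against the actual files, and `find_shared_identifiers` is a hypothetical name.

```python
import json
from collections import defaultdict


def find_shared_identifiers(compendia):
    """Map each identifier found in more than one file to the files containing it.

    `compendia` maps a file name to an iterable of JSON lines (for example an
    open file handle). The "identifiers"/"i" layout is assumed, not confirmed.
    """
    files_by_id = defaultdict(set)
    for name, lines in compendia.items():
        for line in lines:
            line = line.strip()
            if not line:
                continue
            clique = json.loads(line)
            for entry in clique.get("identifiers", []):
                files_by_id[entry["i"]].add(name)
    return {i: sorted(names) for i, names in files_by_id.items() if len(names) > 1}


if __name__ == "__main__":
    # HP:EXAMPLE is a placeholder identifier used only for this demonstration.
    demo = {
        "Disease.txt": ['{"identifiers": [{"i": "MONDO:0011122"}, {"i": "HP:EXAMPLE"}]}'],
        "PhenotypicFeature.txt": ['{"identifiers": [{"i": "HP:EXAMPLE"}]}'],
    }
    print(find_shared_identifiers(demo))
```

Holding every identifier in memory may not scale to the largest files (Protein and SmallMolecule have hundreds of millions of lines), so a real check would probably need sorting on disk or a per-prefix pass.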
Do we incorporate Trembl? How?
Do trembl identifiers go into a clique with swissprots? Or do we handle outside with similarity edges?
In genes.py we're unifying UniProtKB with genes. But because UniProtKB is not a gene prefix in biolink model, we're filtering those out when writing the compendium. What do we want to do here?
I think the right answer is 2.
Maybe we fork bl into the prototypes repo here?
Add compendia for gene family (panther family, hgnc). I think each will be independent, i.e. there is no synonymization across the two.
There are some MeSH phenotype terms that don't correspond to anything in mondo or hp, so they get dropped.
"fibrosis" (MeSH:D005355) is an example. HP/MONDO have many specific things like renal fibrosis etc, but not the concept of fibrosis itself.
UMLS does have this concept, and maps it to this mesh term, so we should be using that I think.
There is a large number of chembl compounds coming from pharos that don't appear to actually exist in chembl. Perhaps these are obsolete?
https://github.com/cambridgeltl/sapbert
RENCI NER fork here: https://github.com/renci-ner/sapbert
I noticed that the 2022sep6 Babel release doesn't have any UMLS labels in ChemicalEntity.txt. I have a theory as to why this might be: since chemical.snakefile doesn't explicitly indicate that it depends on [download_directory]/UMLS/labels, snakemake could be scheduling it to run before it generates [download_directory]/UMLS/labels. I will investigate this further.
Right now, the panther families are only from the latest version, but there are identifiers that only exist in the older versions. So we should be pulling either all or a bunch of past ones and include them as well.
We have at least one consumer of babel output that wants KGX format. We have a KGX converter, but it lives in nodenorm for some reason. It should be extracted from there and put over here and generation of KGX outputs should be made part of the build process.
snakemake workflows can be run in parallel. It can interface with slurm and AirFlow can also handle it. What's the right implementation at RENCI?
Nodenorm includes yeast genes. For these genes, it looks like we have ncbigene, SGD, UniProt and PR, but for some reason, not ensembl.
We're bringing in ensembl genes, so why not yeasts?
There are a bunch of chemicals that ended up with a MESH as their main identifier, and which have a UNII and nothing else.
And there is a CHEBI that they seem like maybe they should be associated with. See e.g. Chloroquine. Is this just a case of hydrous/anhydrous or is there more to it?
MONDO:0005379 is neurotic disorder. It is being grouped with many, many UMLS and other terms. It is also grouped with several HP terms:
{
"identifier": "HP:0030973",
"label": "Postexertional malaise"
},
{
"identifier": "HP:0012432",
"label": "Chronic fatigue"
},
{
"identifier": "HP:0012378",
"label": "Fatigue"
},
{
"identifier": "HP:0025406",
"label": "Asthenia"
}
All of those HP terms are wrong.
Right now this identifier (Amylases) gets a class of Protein. But shouldn't it be ProteinFamily?
curl -X GET "https://nodenormalization-sri.renci.org/get?key=MONDO%3A0011122" -H "accept: application/json"
...
"id": {
"identifier": "MONDO:0011122",
"label": "obesity disorder"
},
...
Should it be like this, or should it be more like:
"id":"MONDO:0011122",
"label":"obesity disorder"
i.e. without the wrapper around identity. Because of how we assign labels, it's not necessarily true that the id and the label come from the same vocabulary, which is sort of implied by the grouping.