translatorsri / babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

License: MIT License

Python 99.01% Dockerfile 0.61% Shell 0.39%

ncats-translator

babel's Issues

Improve typing of chebi

In biolink 2, we are following these rules:

if it has a smiles:
    if '.' in smiles:
        MolecularMixture
    else:
        SmallMolecule
else:
    ChemicalEntity

That last one (ChemicalEntity) can be improved. Two possibilities:

If there were CHEBI mappings to biolink, we could use those.
We could use the subclass-of relations in CHEBI: if all the subclasses of a particular thing without a SMILES are Small Molecules, then maybe the parent class is too.
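As a rough sketch, the rule above and the subclass-based idea could look like this in Python. The function names and the string constants are made up for illustration and are not taken from Babel's code; the subclass check is only one possible reading of the proposal.

```python
from typing import Iterable, Optional

SMALL_MOLECULE = "SmallMolecule"
MOLECULAR_MIXTURE = "MolecularMixture"
CHEMICAL_ENTITY = "ChemicalEntity"


def classify_by_smiles(smiles: Optional[str]) -> str:
    """Apply the rule from the issue: a '.' in the SMILES means a mixture."""
    if smiles:
        return MOLECULAR_MIXTURE if "." in smiles else SMALL_MOLECULE
    # No SMILES at all: fall back to the most general chemical type.
    return CHEMICAL_ENTITY


def infer_parent_type(child_types: Iterable[str]) -> str:
    """Hypothetical refinement for a parent class without a SMILES.

    If the parent has at least one subclass and every subclass is a
    SmallMolecule, promote the parent too; otherwise keep ChemicalEntity.
    """
    child_types = list(child_types)
    if child_types and all(t == SMALL_MOLECULE for t in child_types):
        return SMALL_MOLECULE
    return CHEMICAL_ENTITY
```

For example, classify_by_smiles("[Na+].[Cl-]") gives MolecularMixture, while a parent whose subclasses are all SmallMolecule would be promoted by infer_parent_type.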

Unable to build chemical compendium on missing input file

I tried to build the chemical compendium using the documentation in the README, but it failed as follows:

Babel % snakemake --cores 1 chemical
Building DAG of jobs...
MissingInputException in line 86 of /Users/bsmith/isb/Babel/src/snakefiles/chemical.snakefile:
Missing input files for rule chemical_drugbank_ids:
    output: /Users/bsmith/isb/Babel/babel_downloads/chemicals/ids/DRUGBANK
    affected files:
        /Users/bsmith/isb/Babel/babel_downloads/DrugBank/UC_XREF.srcfiltered.txt

I assume that this may simply be a case where the documentation is out of date, as per #32.

If there are updated instructions, even if it's just a few commands specific to the chemicals, that can be shared here ahead of an update to the README, I'd really appreciate seeing them.

Please let me know if there is anything I can clarify or contribute here. Thanks!

CHEBI

It looks like there's a problem with the non-InChI path for bringing in CHEBIs.

If we have no InChI but do have a SMILES, here are a few that are getting dropped:
71095
72219
145476

If we have neither, here are a few: 52707, 91003

And sometimes we are overwriting something that does have an InChI with something that doesn't have one. So there's a chem that should have id 83920 but for some reason has 81762.

I've just completed my first run of Babel on Sterling (on a container with 500GB of memory!) using the changes in draft PR #37. The results I've obtained (on Hatteras at /scratch/gaurav/babel-outputs/2022apr4) have lots of differences from the 2022-01-01 run, but I haven't come up with a good way of summarizing the changes or figuring out if it's working "correctly".

I've tried using diff/diffstat, but there are tons of changes, so it's not easy to see how significant the changes are. I tried diffing some files individually, and was able to find a few patterns: for example, the polypeptide LSM-37009 in synonyms/Polypeptide.txt is referred to as CHEBI:125504 in the new run and INCHIKEY:GGLDQJNBYFODOM-RDCMKPLUSA-N in the previous run.

Diffstat comparison of Jan 1 and Apr 4 Babel runs

 compendia/AnatomicalEntity.txt         |284873 
 compendia/BiologicalProcess.txt        |55258 
 compendia/Cell.txt                     |15690 
 compendia/CellularComponent.txt        |24855 
 compendia/ChemicalEntity.txt           |6976071 
 compendia/ChemicalMixture.txt          |  889 
 compendia/ComplexMolecularMixture.txt  |  296 
 compendia/Disease.txt                  |654029 
 compendia/Gene.txt                     |77898179 ++---
 compendia/GeneFamily.txt               |55418 
 compendia/GrossAnatomicalStructure.txt |20397 
 compendia/MolecularActivity.txt        |294143 
 compendia/MolecularMixture.txt         |16366879 -
 compendia/OrganismTaxon.txt            |4783919 
 compendia/Pathway.txt                  |104290 
 compendia/PhenotypicFeature.txt        |700283 
 compendia/Polypeptide.txt              |  753 
 compendia/Protein.txt                  |456484451 ++++++++++++++++-----------------
 compendia/SmallMolecule.txt            |204804339 +++++++-------
 conflation/GeneProtein.txt             |16857753 -
 reports/AnatomicalEntity.txt           |  100 
 reports/BiologicalProcess.txt          |   17 
 reports/Cell.txt                       |   73 
 reports/CellularComponent.txt          |   60 
 reports/ChemicalEntity.txt             | 1451 
 reports/ChemicalMixture.txt            |   20 
 reports/ComplexMolecularMixture.txt    |   23 
 reports/Disease.txt                    | 8278 
 reports/Gene.txt                       |   72 
 reports/GeneFamily.txt                 |    8 
 reports/GrossAnatomicalStructure.txt   |   80 
 reports/MolecularActivity.txt          |   70 
 reports/MolecularMixture.txt           | 2310 
 reports/OrganismTaxon.txt              |   12 
 reports/Pathway.txt                    |    8 
 reports/PhenotypicFeature.txt          | 1175 
 reports/Polypeptide.txt                |   30 
 reports/Protein.txt                    |  445 
 reports/SmallMolecule.txt              | 8790 
 reports/disease_completeness.txt       |   69 
 reports/process_completeness.txt       |    4 
 synonyms/AnatomicalEntity.txt          |624380 
 synonyms/BiologicalProcess.txt         |224432 
 synonyms/Cell.txt                      |43534 
 synonyms/CellularComponent.txt         |57426 
 synonyms/ChemicalEntity.txt            |1428975 
 synonyms/ChemicalMixture.txt           | 3566 
 synonyms/ComplexMolecularMixture.txt   | 1672 
 synonyms/Disease.txt                   |2407347 
 synonyms/Gene.txt                      |1060645 
 synonyms/GeneFamily.txt                |55418 
 synonyms/GrossAnatomicalStructure.txt  |105021 
 synonyms/MolecularActivity.txt         |393102 
 synonyms/MolecularMixture.txt          |16811698 -
 synonyms/OrganismTaxon.txt             |139483 
 synonyms/Pathway.txt                   |109508 
 synonyms/PhenotypicFeature.txt         |1712740 
 synonyms/Polypeptide.txt               | 3133 
 synonyms/Protein.txt                   |2871694 
 synonyms/SmallMolecule.txt             |215593042 +++++++--------
 60 files changed, 525618409 insertions(+), 504434267 deletions(-)

Probably the best way to compare the changes is by comparing line counts, which shows that most files are pretty similarly sized, except for compendia/ChemicalEntity.txt (which is 1577.58% bigger), compendia/MolecularMixture.txt (58.43% bigger) and synonyms/MolecularMixture.txt (56.51% bigger).

Does anybody have suggestions for comparing/validating the new Babel output before we try to move it to the dev server? We could for instance dump all the IDs alphabetically and run a massive diff on that. Having some method to do this would help with #36 as well.

File	January 1, 2022	April 4, 2022	Percentage change
reports/chemical_completeness.txt	1	1	0.00%
reports/disease_completeness.txt	60	123	105.00%
reports/taxon_done	1	1	0.00%
reports/process_done	1	1	0.00%
reports/ChemicalEntity.txt	741	732	-1.21%
reports/MolecularMixture.txt	1144	1182	3.32%
reports/gene_done	1	1	0.00%
reports/ChemicalMixture.txt	15	17	13.33%
reports/protein_done	1	1	0.00%
reports/anatomy_done	1	1	0.00%
reports/MolecularActivity.txt	39	41	5.13%
reports/Disease.txt	4154	4174	0.48%
reports/OrganismTaxon.txt	11	11	0.00%
reports/Protein.txt	197	274	39.09%
reports/Cell.txt	42	43	2.38%
reports/genefamily_done	1	1	0.00%
reports/CellularComponent.txt	36	40	11.11%
reports/process_completeness.txt	3	1	-66.67%
reports/ComplexMolecularMixture.txt	18	15	-16.67%
reports/taxon_completeness.txt	1	1	0.00%
reports/anatomy_completeness.txt	1	1	0.00%
reports/PhenotypicFeature.txt	603	626	3.81%
reports/GrossAnatomicalStructure.txt	43	45	4.65%
reports/Polypeptide.txt	20	22	10.00%
reports/BiologicalProcess.txt	15	14	-6.67%
reports/disease_done	1	1	0.00%
reports/gene_completeness.txt	1	1	0.00%
reports/Pathway.txt	11	11	0.00%
reports/genefamily_completeness.txt	1	1	0.00%
reports/AnatomicalEntity.txt	52	62	19.23%
reports/SmallMolecule.txt	4384	4432	1.09%
reports/protein_completeness.txt	1	1	0.00%
reports/chemicals_done	1	1	0.00%
reports/GeneFamily.txt	9	9	0.00%
reports/Gene.txt	45	47	4.44%
compendia/ChemicalEntity.txt	392499	6584478	1577.58%
compendia/MolecularMixture.txt	6334558	10035657	58.43%
compendia/ChemicalMixture.txt	475	482	1.47%
compendia/MolecularActivity.txt	145925	149030	2.13%
compendia/Disease.txt	322229	332754	3.27%
compendia/OrganismTaxon.txt	2375027	2412122	1.56%
compendia/Protein.txt	223676217	232834484	4.09%
compendia/Cell.txt	7678	8210	6.93%
compendia/CellularComponent.txt	12510	12623	0.90%
compendia/ComplexMolecularMixture.txt	165	169	2.42%
compendia/PhenotypicFeature.txt	355408	345793	-2.71%
compendia/GrossAnatomicalStructure.txt	10379	10238	-1.36%
compendia/Polypeptide.txt	408	409	0.25%
compendia/BiologicalProcess.txt	27790	27714	-0.27%
compendia/Pathway.txt	52370	52452	0.16%
compendia/AnatomicalEntity.txt	142269	143562	0.91%
compendia/SmallMolecule.txt	104226454	100590131	-3.49%
compendia/GeneFamily.txt	27892	27770	-0.44%
compendia/Gene.txt	37802616	40108195	6.10%
synonyms/ChemicalEntity.txt	698121	732464	4.92%
synonyms/MolecularMixture.txt	6555269	10259687	56.51%
synonyms/ChemicalMixture.txt	1856	1870	0.75%
synonyms/MolecularActivity.txt	195416	198534	1.60%
synonyms/Disease.txt	1189024	1219429	2.56%
synonyms/OrganismTaxon.txt	69926	69993	0.10%
synonyms/Protein.txt	1421157	1451959	2.17%
synonyms/Cell.txt	20674	23034	11.42%
synonyms/CellularComponent.txt	28577	29027	1.57%
synonyms/ComplexMolecularMixture.txt	878	890	1.37%
synonyms/PhenotypicFeature.txt	858920	855136	-0.44%
synonyms/GrossAnatomicalStructure.txt	52860	52553	-0.58%
synonyms/Polypeptide.txt	1641	1628	-0.79%
synonyms/BiologicalProcess.txt	112432	112364	-0.06%
synonyms/Pathway.txt	54941	55021	0.15%
synonyms/AnatomicalEntity.txt	309911	315239	1.72%
synonyms/SmallMolecule.txt	108292041	107313775	-0.90%
synonyms/GeneFamily.txt	27892	27770	-0.44%
synonyms/Gene.txt	497027	564344	13.54%
conflation/GeneProtein.txt	8168582	8692887	6.42%
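
A line-count comparison like the table above could be produced with something along these lines. This is only a sketch: the function names are invented, and it assumes both runs sit in directories with the same relative layout (compendia/, synonyms/, and so on).

```python
from pathlib import Path
from typing import Dict


def count_lines(path: Path) -> int:
    """Count lines by streaming, so very large files are never loaded whole."""
    with path.open("rb") as f:
        return sum(1 for _ in f)


def percent_change(old: int, new: int) -> float:
    """Percentage change from old to new, rounded to two decimals."""
    return round((new - old) / old * 100, 2)


def compare_runs(old_dir: Path, new_dir: Path) -> Dict[str, float]:
    """Map each file present in both runs to its line-count change in percent."""
    changes = {}
    for old_file in sorted(old_dir.rglob("*")):
        if not old_file.is_file():
            continue
        rel = old_file.relative_to(old_dir)
        new_file = new_dir / rel
        if new_file.is_file():
            old_n = count_lines(old_file)
            if old_n:  # skip empty files to avoid dividing by zero
                changes[rel.as_posix()] = percent_change(old_n, count_lines(new_file))
    return changes
```

With the numbers reported above, percent_change(392499, 6584478) reproduces the 1577.58 figure for compendia/ChemicalEntity.txt. Line counts only show size changes, so they would not catch a file where many identifiers changed but the total stayed the same.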

Order of types is not right

curl -X GET "https://nodenormalization-sri.renci.org/get?key=NCBIGene%3A144571" -H "accept: application/json"

...
"type": [
      "gene",
      "named_thing",
      "biological_entity",
      "molecular_entity",
      "genomic_entity",
      "macromolecular_machine",
      "gene_or_gene_product"
    ]
...

It starts at the right level, but the ancestors should then be listed in the reverse order, starting at the parent of gene and moving up to named_thing.
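
One way to get that ordering, sketched here under assumptions: if each type's full ancestor set is known, sorting by descending ancestor count puts the most specific type first and named_thing last. The ANCESTORS table below is a simplified, hypothetical hierarchy (the mixin types from the response are left out on purpose); the real sets would have to come from the Biolink Model.

```python
from typing import Dict, List, Set


def order_types(types: List[str], ancestors: Dict[str, Set[str]]) -> List[str]:
    """Order types from most specific to most general.

    A type with more ancestors sits deeper in the hierarchy, so sorting by
    descending ancestor count puts the leaf first and the root last.
    sorted() is stable, so types with equal counts keep their input order.
    """
    return sorted(types, key=lambda t: len(ancestors.get(t, set())), reverse=True)


# Hypothetical, simplified hierarchy for illustration only.
ANCESTORS: Dict[str, Set[str]] = {
    "named_thing": set(),
    "biological_entity": {"named_thing"},
    "molecular_entity": {"biological_entity", "named_thing"},
    "genomic_entity": {"molecular_entity", "biological_entity", "named_thing"},
    "gene": {"genomic_entity", "molecular_entity", "biological_entity", "named_thing"},
}
```

Types that are missing from the ancestor table are treated as having no ancestors, so they end up at the general end of the list.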

Should we explicitly add "A1.2.3" to the list of anatomical entities?

In the following code, A1.2.3 Fully Formed Anatomical Structure is listed as a UMLS category in the comments, but is not listed on line 77:

Babel/src/createcompendia/anatomy.py

Lines 66 to 77 in b0d638e

#UMLS categories:
#A1.2 Anatomical Structure
#A1.2.1 Embryonic Structure
#A1.2.3 Fully Formed Anatomical Structure
#A1.2.3.1 Body Part, Organ, or Organ Component
#A1.2.3.2 Tissue
#A1.2.3.3 Cell
#A1.2.3.4 Cell Component
#A2.1.4.1 Body System
#A2.1.5.1 Body Space or Junction
#A2.1.5.2 Body Location or Region
umlsmap = {x: ANATOMICAL_ENTITY for x in ['A1.2', 'A1.2.1', 'A1.2.3.1', 'A1.2.3.2', 'A2.1.4.1', 'A2.1.5.1', 'A2.1.5.2']}

According to TranslatorSRI/NodeNormalization#119 (comment), there are 41 UMLS IDs classified as A1.2.3 that are leftover at the end of processing. Adding them as anatomical entities could remove them from the UMLS generation.

Next steps:

Get a list of the 41 anatomical entities leftover in UMLS at the end of processing.

GTOPDB

There are a few failure modes:

The chemical is something without a structure. We should probably bring in all the identifiers, a la the mesh update and chebi.
There is an InChI, but it isn't in UniChem for gtopdb, even if it is for e.g. pubchem (10532). This one is really bad, because it means we can't 100% rely on UniChem. Maybe there are not a lot of these? Hopefully? If we can pull a list of InChIs with the chemicals, we should still be ok; glom should handle it.
Peptides (4440, 6759). Not sure if we're rejecting these on purpose, but we shouldn't.

Babel -> kboom

Babel was meant to be a prototype solution for identifier equivalence. It solves some problems, and is based on pre-existing code. In the long run, we would like to move to a more principled form of identifier equivalence mapping.

Babel should transform into a set of scripts that just pull data from sources and write that data in a format that can become the fodder for more advanced algorithms, such as kboom.

Targeting for Translator 1B.

GO

There are GO terms that are not getting in, like GO:0052859 (molecular function) and GO:1990333 (cellular component).

The only thing I can think of at the moment is that our subclass query is flawed somehow.

HP needs to be an allowed disease

There's a bl pr for this https://github.com/biolink/biolink-model/pull/295/files which will then have to be pulled into our bl service.

Non-phenotypes given phenotype type

HP contains terms that are not phenotypes. Things like
HP:0001427 "Mitochondrial Inheritance", as well as terms like "triggered by" or other disease characteristics.

These are appearing in here as 'phenotypic_features' which they are not. I'm fairly sure that they're not getting in via HP, but probably via UMLS, and then we see the HP and say 'oh that must be a phenotype'.

Add wikipathways

https://www.wikipathways.org/index.php/WikiPathways

Non-human genes

Need to get from NCBI gene / ENSEMBL

BSPO and CARO

These need to get added to anatomical entity as prefixes.

Upgrade to pyoxigraph 0.3

We currently use pyoxigraph 0.2. In order to upgrade to pyoxigraph 0.3, we'll need to replace references to the MemoryStore class (which has been removed from this package) with references to Store instead.

too many files open when running chemicals.py

When running chemicals.py I get

Traceback (most recent call last):
  File "babel/chemicals.py", line 682, in <module>
    load_chemicals(refresh_mesh=False,refresh_uniprot=False,refresh_pubchem=False,refresh_chembl=False)
  File "babel/chemicals.py", line 151, in load_chemicals
    concord = load_unichem(refresh=True)
  File "/opt/Babel/babel/unichem/unichem.py", line 21, in load_unichem
    return refresh_unichem(working_dir,xref_file,struct_file)
  File "/opt//Babel/babel/unichem/unichem.py", line 40, in refresh_unichem
    sorted_xref_file = sort_xref_file(srcfiltered_xref_file, xref_file)
  File "/opt/Babel/babel/unichem/unichem.py", line 244, in sort_xref_file
    batch_sort(inf, outf, key=uci_key, tempdirs='.')
  File "/opt/Babel/babel/big_gz_sort.py", line 41, in batch_sort
    output_chunk = open(os.path.join(tempdir,'%06i'%len(chunks)),'w+b',64*1024)
OSError: [Errno 24] Too many open files: './001016'

Is there a way to fix this without adjusting ulimit on the client OS?
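
One possible approach, sketched under assumptions: the traceback suggests batch_sort keeps every chunk file open until the final merge. If each sorted chunk is instead closed after writing, the chunks can be merged in groups of at most max_open files, repeating until one group remains. The names bounded_merge and merge_files are invented for this sketch, and it assumes text chunks whose lines all end in a newline. Note that it deletes the input chunk files once they have been merged.

```python
import heapq
import os
import tempfile
from typing import Callable, List


def merge_files(paths: List[str], out_path: str, key: Callable[[str], object]) -> None:
    """Merge already-sorted text files into out_path, then close every handle."""
    handles = [open(p, "r") for p in paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*handles, key=key))
    finally:
        for h in handles:
            h.close()


def bounded_merge(chunks: List[str], out_path: str, key: Callable[[str], object],
                  max_open: int = 100, tempdir: str = ".") -> None:
    """Merge sorted chunk files with at most max_open input files open at once."""
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    chunks = list(chunks)
    # Each pass merges groups of max_open chunks into fewer, larger chunks.
    while len(chunks) > max_open:
        merged = []
        for i in range(0, len(chunks), max_open):
            group = chunks[i:i + max_open]
            fd, tmp = tempfile.mkstemp(dir=tempdir, text=True)
            os.close(fd)
            merge_files(group, tmp, key)
            for p in group:
                os.remove(p)  # this chunk is now part of tmp
            merged.append(tmp)
        chunks = merged
    merge_files(chunks, out_path, key)
    for p in chunks:
        os.remove(p)
```

The trade-off is extra disk I/O: every additional pass rewrites all the data once, in exchange for staying under the open-file limit without touching ulimit.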

Mesh chemicals

There are mesh chemicals (like MESH:C545823) that are not getting into babel. We should probably do like we do for chebi and bring in everything whether or not it synonymizes. (i.e. if we don't have any xrefs for it, we can at least give it the mesh id).

Valsartan not merging UMLS/PUBCHEM

We appear to have some incomplete merging of UMLS and PUBCHEM identifiers. See e.g. Valsartan

SSSOM output

Instead of a special file format, we should perhaps be writing out SSSOM files.

Is that going to make the files even more gigantic? Probably? Is that a problem?

Improve typing of CHEMBL

We are just using SMILES to determine what kind of thing a CHEMBL.COMPOUND is, but there is a CHEMBL structure that we should make some attempt to use. There's proteins and enzymes and biologicals etc...

non-human genes

Need to get from NCBI gene / ENSEMBL

Add complex portal to downloads

Add a complex portal section to src/snakefiles/datacollect.snakefile
Create a complex portal ingester to do the work in src.datahandlers to download, extract labels and synonyms
Decide on the biolink class (Cellular Component or Macromolecular Complex)
Add the appropriate prefix to biolink

Mapping commons

https://github.com/mapping-commons/

How should Babel be interacting with this? Contributing? Pulling? Both?

No Mesh for Human?

organism_taxon has been added, handling NCBITaxon and Mesh.

But it appears that NCBITaxon:9606 (homo sapiens) doesn't map to a mesh term? Seems mighty suspicious.

kegg chemicals are not getting in

We are looking for KEGG.COMPOUND, but biolink model is using KEGG.

I think KEGG.COMPOUND is the identifiers.org version? Fix model or fix babel/rk ?

Allow running of disease_phenotype.py without UMLS license

This is likely already on the radar - but it would be useful to be able to run the disease_phenotype parser without a UMLS license

Potential bug in chemicals.write_unii_ids()

I think the continue in this code block is going to the wrong place -- it should be skipping the line, but instead I think it just continues the inner for loop. Something to check.

Babel/src/createcompendia/chemicals.py

Lines 142 to 148 in 0f1eb14

for line in inf:
    x = line.strip().split('\t')
    for bcn in bad_colnos:
        if len(x[bcn]) > 0:
            #This is a plant or an eye of newt or something
            continue
    outf.write(f'{UNII}:{x[0]}\t{CHEMICAL_ENTITY}\n')
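
If the intent is to skip the whole line, one possible fix is to decide about the row with any() so that the continue belongs to the outer loop. This is a sketch rather than the actual patch: the function signature and the two constant values are placeholders, since the real ones are not shown in the issue.

```python
from typing import Iterable, List, TextIO

# Placeholder values for illustration; Babel defines its own constants.
UNII = "UNII"
CHEMICAL_ENTITY = "biolink:ChemicalEntity"


def write_unii_ids(inf: Iterable[str], outf: TextIO, bad_colnos: List[int]) -> None:
    """Write one ID per input row, skipping rows with a value in any bad column."""
    for line in inf:
        x = line.strip().split("\t")
        # any() judges the whole row, so `continue` now skips the line itself.
        # The length check is an added guard: strip() can drop trailing
        # empty columns, which would otherwise raise an IndexError.
        if any(bcn < len(x) and len(x[bcn]) > 0 for bcn in bad_colnos):
            continue
        outf.write(f"{UNII}:{x[0]}\t{CHEMICAL_ENTITY}\n")
```

In the original block the continue only moves on to the next bcn, so the write on the last line runs for every row regardless of the bad columns.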

Document Babel

The documentation of how babel works is pretty out of date. Write some words to help people understand what it is and how to use it.

Add macromolecular machine compendium

Add ComplexPortal to biolink as a prefix
Create a mm snakemake and createcompendia module
Check in SGD for other identifiers we want to merge (check with Jon-Michael)
Assuming we do find something, add datahandler for it
in mm snakefile, add rules for ids ; if necessary add code to module
in mm snakefile, add rules for relationships/concords; add code to module
in mm snakefile add rule for create compendium; add code to module
in mm snakefile add assessment rules

Stray empty line in umls.txt

The UMLS concordance starts with a single blank line (causing trouble for the kgx transformer)

Implement Disease/Phenotype conflation

We merge diseases and phenotypes when the same term occurs in both MONDO and HP. But this isn't totally correct, because diseases are not phenotypes (even if they kind of are). Sometimes, for unclear reasons, the mappings don't work out too well (see e.g. asthma).

We should make disease and phenotype another form of conflation and be more careful with it. We can at least partially use MONDO:otherHierarchy to build the conflation tables. The main problem I foresee is when MONDO claims equivalence to (say) a UMLS term and HP does the same; in that case we'll need some kind of rule about what goes where.

Handle missing efo

Jim has removed EFO from ubergraph, so we will need to get it directly rather than from ubergraph.

Gene is missing label

curl -X GET "https://nodenormalization-sri.renci.org/get?key=NCBIGene%3A144571" -H "accept: application/json"

...
"id": {
      "identifier": "NCBIGene:144571"
    },
...

Estrogens turning into a phenotype

If you normalize MESH:D004967, it gets merged with a bunch of stuff like NCIT:C483, which is therapeutic estrogen. That's not too bad, but the problem comes in because somehow that is getting made into a phenotypic feature.

Improve process

Currently, runs are somewhat ad hoc. We should probably have scheduled builds, and improved reporting / testing on what has changed.

There are reports generated for each file, but nothing comparing to previous runs to get a sense of what has changed or to look for unexpected differences.

There should also be some cross-file checks. For example, does the same identifier get pulled into more than one file?
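
A cross-file check of that kind could be sketched as follows. The function name is invented, and the sketch assumes the identifiers have already been extracted from each output file; how to parse them depends on the file format, which is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def find_shared_identifiers(ids_by_file: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Return every identifier that occurs in more than one file.

    The result maps each such identifier to the set of files containing it.
    """
    files_by_id = defaultdict(set)
    for filename, identifiers in ids_by_file.items():
        for identifier in identifiers:
            files_by_id[identifier].add(filename)
    return {i: files for i, files in files_by_id.items() if len(files) > 1}
```

This keeps every identifier in memory, which may be impractical for the largest files; a sorted dump of IDs per file followed by a streaming comparison would be an alternative.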

Trembl

Do we incorporate Trembl? How?

Do trembl identifiers go into a clique with swissprots? Or do we handle outside with similarity edges?

UniProtKB not synonymized with genes

In genes.py we're unifying UniProtKB with genes. But because UniProtKB is not a gene prefix in biolink model, we're filtering those out when writing the compendium. What do we want to do here?

1. Modify the real biolink model
2. Fork bl
3. Add a gene-product compendium with uniprots and PRs and then have a source linking genes to gene products (nasty).

I think the right answer is 2.

Maybe we fork bl into the prototypes repo here?

Add gene family

Add compendia for gene family (panther family, hgnc). I think each will be independent, i.e. there is no synonymization across the two.

Mesh not-chemicals

There are some MeSH phenotype terms that don't correspond to anything in mondo or hp, so they get dropped.
"fibrosis" (MeSH:D005355) is an example. HP/MONDO have many specific things like renal fibrosis etc, but not the concept of fibrosis itself.

UMLS does have this concept, and maps it to this mesh term, so we should be using that I think.

Pharos & Chembl

There are a large number of chembl compounds coming from pharos that don't appear to actually exist in chembl. Perhaps these are obsolete?

Look into SapBERT for grouping together related concepts

https://github.com/cambridgeltl/sapbert

RENCI NER fork here: https://github.com/renci-ner/sapbert

Make sure labels are explicitly set as prerequisites for rules that need them

I noticed that the 2022sep6 Babel release doesn't have any UMLS labels in ChemicalEntity.txt. I have a theory as to why this might be: since chemical.snakefile doesn't explicitly indicate that it depends on [download_directory]/UMLS/labels, snakemake could be scheduling it to run before it generates [download_directory]/UMLS/labels. I will investigate this further.

include previous versions of panther

Right now, the panther families are only from the latest version, but there are identifiers that only exist in the older versions. So we should be pulling either all or a bunch of past ones and include them as well.

Move KGX transform from NN to Babel

We have at least one consumer of babel output that wants KGX format. We have a KGX converter, but it lives in nodenorm for some reason. It should be extracted from there and put over here and generation of KGX outputs should be made part of the build process.

      {
        "identifier": "HP:0030973",
        "label": "Postexertional malaise"
      },
      {
        "identifier": "HP:0012432",
        "label": "Chronic fatigue"
      },
      {
        "identifier": "HP:0012378",
        "label": "Fatigue"
      },
      {
        "identifier": "HP:0025406",
        "label": "Asthenia"
      }

All those HP are wrong

Is UMLS:C0002712 a Protein?

Right now this identifier (Amylases) gets a class of Protein. But shouldn't it be ProteinFamily?

Match KGX serialization for nodes. Was: Is this the right format for the final identifier?

curl -X GET "https://nodenormalization-sri.renci.org/get?key=MONDO%3A0011122" -H "accept: application/json"

...
"id": {
      "identifier": "MONDO:0011122",
      "label": "obesity disorder"
    },
...

Should it be like this, or should it be more like:

"id":"MONDO:0011122",
"label":"obesity disorder"

i.e. without the wrapper around identity. Because of how we assign labels, it's not necessarily true that the id and the label come from the same vocabulary, which is sort of implied by the grouping.
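
To make the proposed shape concrete, here is a small sketch that lifts the wrapped fields to the top level. The function name flatten_node is hypothetical, and the sketch only covers the two fields shown in the example; it does not claim to produce a complete KGX node.

```python
from typing import Any, Dict


def flatten_node(node: Dict[str, Any]) -> Dict[str, Any]:
    """Lift id.identifier and id.label to top-level "id" and "label" keys."""
    # Copy everything except the wrapper, then re-add its contents flattened.
    flat = {k: v for k, v in node.items() if k != "id"}
    wrapper = node.get("id", {})
    flat["id"] = wrapper.get("identifier")
    if "label" in wrapper:  # some nodes have no label at all
        flat["label"] = wrapper["label"]
    return flat
```

Applied to the MONDO:0011122 example above, this yields "id" and "label" as sibling keys, which no longer suggests that both came from the same vocabulary.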
