cancervariants / disease-normalization
Services and guidelines for normalizing disease terms
Home Page: https://disease-normalizer.readthedocs.io/latest/
License: MIT License
This shouldn't ever happen, but occasionally does for one reason or another -- in this case, we should handle it a bit more gracefully and enable more verbose logging.
The test GitHub action has been failing for some time in main. We should update the test data so that the tests pass.
E.g., ACCC -> oncotree:ACCC. Not sure if this should be treated as an alias and stored in the DB, or just written into the query method.
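If the inference were written into the query method rather than stored as an alias, it might look roughly like the sketch below. The function name `candidate_concept_ids` and the `BARE_CODE_PREFIXES` tuple are hypothetical and not part of the existing codebase; only the oncotree prefix from the example above is assumed.

```python
from typing import List, Tuple

# Assumption: namespaces whose codes users may enter without a prefix.
BARE_CODE_PREFIXES: Tuple[str, ...] = ("oncotree",)


def candidate_concept_ids(query: str) -> List[str]:
    """Return the identifiers to try for a raw query string (hypothetical helper)."""
    query = query.strip()
    if ":" in query:
        # Already namespaced, so look it up as given.
        return [query]
    # Try the bare term first, then each inferred namespace.
    return [query] + [f"{prefix}:{query}" for prefix in BARE_CODE_PREFIXES]
```

Keeping this in the query path avoids writing extra alias rows to the DB, at the cost of one additional lookup per inferred prefix.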
If not all sources are loaded in the database, the queries on the search endpoint will fail.
They're currently treated as xrefs, but I think (?) they can be converted into direct OMIM concept identifiers, which would broaden the reach of some of the normalized concept entries.
Different sources inconsistently provide specific low and/or high grade concepts for some cancers. In the tree- and ontology-based sources, these are usually represented as children of the generic grade-less concept of the specific cancer.
This happens during the ETL for MONDO.
Traceback:
Loading mondo...
...
File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 132, in <dictcomp>
str(key): [self._get_concept_id(g[1]) for g in group] # type: ignore
File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 132, in <listcomp>
str(key): [self._get_concept_id(g[1]) for g in group] # type: ignore
File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 103, in _get_concept_id
prefix, id_no = ref.split(":")
ValueError: too many values to unpack (expected 2)
This doesn't seem to be a valid ID, as seen here, so we will just catch the exception and log it.
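A minimal sketch of the catch-and-log approach, assuming the failing reference simply contains an unexpected number of colons. The standalone function `get_concept_id` is a hypothetical stand-in for the `_get_concept_id` method in the traceback, and the lowercase-prefix return format is an assumption for illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def get_concept_id(ref: str) -> Optional[str]:
    """Build a concept ID from a "prefix:id" reference, or return None if malformed."""
    try:
        # Raises ValueError when the reference has zero or several colons.
        prefix, id_no = ref.split(":")
    except ValueError:
        logger.warning("Skipping malformed reference: %s", ref)
        return None
    return f"{prefix.lower()}:{id_no}"
```

Callers building the grouped lists would then need to drop the `None` values, for example with `[c for c in map(get_concept_id, refs) if c is not None]`.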
Our EB environment currently uses Python 3.7. We should upgrade to 3.8.
For the sake of keeping the 3 normalizer responses consistent:
In the analysis of MetaKB, we were unable to normalize these terms for a disease in MOA. This resulted in us not being able to capture 6 MOA assertions.
oncotree:TALL
T-Cell Acute Lymphoid Leukemia
T-Cell Acute Lymphoid Leukemia
It might belong in this concept group, but I might be wrong.
Change
"ga4gh.vrsatile.pydantic" = "==0.0.11"
to
"ga4gh.vrsatile.pydantic" = "~=0.0.11"
in the main branch.
When running python3 -m disease.cli --normalizer="ncit" --dev, I get a zipfile.BadZipFile: File is not a zip file. We should fix this to be able to use NCIt.
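Whatever the root cause turns out to be, one way to surface it earlier is to validate the downloaded file before extracting it. This is only a sketch: `check_zip` is a hypothetical helper, and the error message wording is an assumption.

```python
import zipfile
from pathlib import Path


def check_zip(path: Path) -> None:
    """Raise a descriptive error if the file at `path` is not a zip archive."""
    # is_zipfile returns False for missing files and for non-zip content,
    # such as an HTML error page saved in place of the archive.
    if not zipfile.is_zipfile(path):
        raise ValueError(
            f"{path} is not a valid zip file; the download may have failed "
            "or the source URL may have changed"
        )
```

Calling this right after the download step would replace the bare BadZipFile traceback with a message that points at the source file.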
bad things are happening
To be consistent with other normalizers on prod. I think some staging changes accidentally got merged into the main branch.
We'll be moving away from Elastic Beanstalk. This requires a few things to help clean up environments + improve performance:
When disease-norm has been imported as a package into another environment, it's pretty unclear upon initialization where the "no OMIM found" error is coming from -- we should tighten up the description a bit.
Currently outputs:
Loading ncit...
Loaded <disease.etl.ncit.NCIt object at 0x107a261d0> in 624.06806 seconds.
Total time for <disease.etl.ncit.NCIt object at 0x107a261d0>: 624.15317 seconds.
This is due to source being reinitialized here.
Expected output:
Loading ncit...
Loaded ncit in 624.06806 seconds.
Total time for ncit: 624.15317 seconds.
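The expected output can be obtained by keeping the source name and the ETL instance in separate variables, so the name is never rebound to an object. The sketch below assumes a simplified loader; `load_source`, the stand-in `NCIt` class, and its `perform_etl` method are hypothetical and only illustrate the idea.

```python
import timeit
from typing import List


class NCIt:
    """Hypothetical stand-in for disease.etl.ncit.NCIt."""

    def perform_etl(self) -> None:
        pass


def load_source(name: str, source_cls: type) -> List[str]:
    """Run one source's ETL and return the progress messages."""
    messages = [f"Loading {name}..."]
    start = timeit.default_timer()
    etl = source_cls()  # separate variable, so `name` stays a string
    etl.perform_etl()
    elapsed = timeit.default_timer() - start
    messages.append(f"Loaded {name} in {elapsed:.5f} seconds.")
    return messages
```

For example, `load_source("ncit", NCIt)` produces messages that mention `ncit` rather than the object's default repr.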
We currently have the production database as the default, but we should switch to using the local database.
Occasionally it's helpful to compare disease normalizer terms with a focus on whether they are cancers. Probably doable by looking at attributes/term lineage in MONDO.
Somewhat bizarrely, owlready2 doesn't seem to recognize local copies of the DO OBO or OWL files as valid, but it is able to gather data from the remote repository. We should figure out what's going on so that we can use a consistent source file, given that DO updates intermittently.
A Dockerfile would be useful.
disease_id --> id
E.g., NCIT:C27819 == UMLS:C0027819?
Merges for the normalization endpoint are showing about 350 concepts from DO and NCIt that our first pass missed. For NCIt, many of them are subclasses of "finding" (not disease/disorder), but not all (it's not clear why ncit:C98923 isn't getting picked up, for example).
A full list is here: https://gist.github.com/jsstevenson/2d625d962edeeb21b2fd3f6a974c75eb
Many of the DOID concepts seem to be deprecated (but still referenced by Mondo -- should be ok to skip).
We are retrieving the latest version from bioversions, which pulls it from bioregistry. bioregistry currently stores the latest version as 2022-12-01. However, the latest version is actually 2023-01-04, as seen in MONDO's latest releases here. I need to do a deeper dive on how bioregistry gets the latest version for MONDO in order to create an issue/PR, but for now we can use GitHub's API.
To get the correct latest version, we can do:
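import requests
# Ask the GitHub API for the most recent MONDO release
response = requests.get("https://api.github.com/repos/monarch-initiative/mondo/releases/latest")
response.raise_for_status()  # fail loudly on rate limiting or other HTTP errors
# Strip the "v" from the release name to get the bare version string
self._version = response.json()["name"].replace("v", "")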
We could also replace the download URL with the following if we wanted to ensure that the latest release is always being downloaded. The current URL works, but it might be safer to change it in case there is ever a version mismatch (seems unlikely?):
url = f"http://purl.obolibrary.org/obo/mondo/releases/{self._version}/mondo.owl"
We seem to be missing some terms containing "MTHU" identifiers from OMIM -- we should investigate why, and whether they're worth adding.
Two things seem to be happening on a recent build of the data:
For the term "testicular cancer", normalization is creating a group which puts the "pediatric" term as the group label. I.e., the group containing mondo:0037250 is using ncit:C5053 as its anchor label.
That group also seems to be dropping IDs. The Mondo term lists a couple of xrefs but they're falling out of the finalized merge group.
Currently, we use Mondo cross-references to create record ID sets, and then generate merged records from all 3 sources, in priority order (NCIt > Mondo > DO).
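As a rough illustration of that priority ordering, the sketch below sorts the concept IDs in one record ID set so that the highest-priority source comes first. `SOURCE_PRIORITY` and `order_by_source` are hypothetical names, the prefixes are assumed to be lowercase CURIE prefixes, and `doid:0000000` is a placeholder ID rather than a real concept.

```python
from typing import Dict, Iterable, List

# Assumption: lower number means higher priority (NCIt > Mondo > DO).
SOURCE_PRIORITY: Dict[str, int] = {"ncit": 0, "mondo": 1, "doid": 2}


def order_by_source(concept_ids: Iterable[str]) -> List[str]:
    """Sort concept IDs by source priority, then alphabetically within a source."""

    def sort_key(concept_id: str):
        prefix = concept_id.split(":", 1)[0].lower()
        # Unknown sources sort after all known ones.
        return (SOURCE_PRIORITY.get(prefix, len(SOURCE_PRIORITY)), concept_id)

    return sorted(concept_ids, key=sort_key)


group = order_by_source(["doid:0000000", "mondo:0037250", "ncit:C5053"])
```

Under this scheme the first element of `group` would supply the merged record's label, which matches the behavior described above where an NCIt term ends up as the anchor.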
Some of this behavior should be refined: