cancervariants / disease-normalization Goto Github PK

Services and guidelines for normalizing disease terms

Home Page: https://disease-normalizer.readthedocs.io/latest/

License: MIT License

Python 56.33% Shell 0.15% Jupyter Notebook 43.52%

bioinformatics precision-medicine disease-classification biomedical-informatics python

disease-normalization's Issues

Automate pypi release using GH Actions

Add DO data

http://www.obofoundry.org/ontology/doid.html

Handle missing merge ref

This shouldn't ever happen, but occasionally does for one reason or another -- in this case, we should handle it a bit more gracefully and enable more verbose logging.

Get GitHub Actions to pass

The test GitHub Action has been failing for some time on main. We should update the test data so that the tests pass.

Support lookup by context free OncoTree code

eg ACCC -> oncotree:ACCC. Not sure if this should be treated as an alias and stored in the DB, or just written into the query method.
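
A minimal sketch of the "written into the query method" option, assuming the lookup can try several candidate IDs in turn. The function name `candidate_ids` and the code pattern are hypothetical and not part of the existing codebase:

```python
import re

# Assumption: a bare OncoTree code is a short run of letters, digits, or
# underscores. Adjust the pattern if the real code set differs.
ONCOTREE_CODE = re.compile(r"[A-Z0-9_]{2,12}")


def candidate_ids(query: str) -> list[str]:
    """Return the identifiers to try for a query, original form first."""
    query = query.strip()
    candidates = [query]
    # A namespace-free code such as "ACCC" is also tried as an OncoTree
    # CURIE, so no extra alias has to be stored in the database.
    if ":" not in query and ONCOTREE_CODE.fullmatch(query.upper()):
        candidates.append(f"oncotree:{query.upper()}")
    return candidates
```

Storing the code as an alias would keep the query path simpler, while this approach keeps the database unchanged at the cost of one extra lookup per bare query.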

Error when searching without all sources loaded

If not all sources are loaded in the database, the queries on the search endpoint will fail.

Clean OMIMPS xref values

They're currently treated as xrefs, but I think (?) they can be converted into direct OMIM concept identifiers, which would broaden the reach of some of the normalized concept entries.

Different sources inconsistently provide specific low and/or high grade concepts for some cancers. In the tree- and ontology-based sources, these are usually represented as children of the generic grade-less concept of the specific cancer.

Is it possible/desirable to retain that relationship? Practically, is there a way to retrieve both the graded and grade-less concept in one search?

Add other_identifier match type and DB reference

Add service meta to response

MONDO `_get_concept_id` raises ValueError when `ref=MESH:MESH:1622152`

This happens during the ETL for MONDO.

Traceback:

Loading mondo...
  ...
  File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 132, in <dictcomp>
    str(key): [self._get_concept_id(g[1]) for g in group]  # type: ignore
  File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 132, in <listcomp>
    str(key): [self._get_concept_id(g[1]) for g in group]  # type: ignore
  File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 103, in _get_concept_id
    prefix, id_no = ref.split(":")
  ValueError: too many values to unpack (expected 2)

This doesn't seem to be a valid ID as seen here. So will just catch the exception and log.
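
The catch-and-log approach could look roughly like the sketch below. `parse_ref` is a hypothetical stand-in for `_get_concept_id`, not the actual method body, and the log wording is an assumption:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_ref(ref: str) -> Optional[tuple[str, str]]:
    """Split a reference into (prefix, local ID), or return None if malformed."""
    try:
        prefix, id_no = ref.split(":")
    except ValueError:
        # Raised whenever the reference does not contain exactly one colon,
        # e.g. "MESH:MESH:1622152".
        logger.warning("Skipping malformed reference: %s", ref)
        return None
    return prefix, id_no
```

Callers would then need to filter out the `None` values before building the xref lists.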

Add OncoTree data

http://oncotree.mskcc.org/#/home

EB use python 3.8

Our EB currently uses python 3.7. We should upgrade to 3.8.

Return all matches in search/

For the sake of keeping the 3 normalizer responses consistent:

Return all matches for a query. If different match types point to same record, use the highest match type
Add match_type to Disease class
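
One way to apply the "highest match type" rule, sketched under the assumption that match types can be ranked numerically. The `MatchType` values and the `best_matches` helper below are illustrative only:

```python
from enum import IntEnum


class MatchType(IntEnum):
    """Hypothetical ranking: a higher value means a stronger match."""

    NO_MATCH = 0
    XREF = 20
    ALIAS = 60
    LABEL = 80
    CONCEPT_ID = 100


def best_matches(hits: list[tuple[str, MatchType]]) -> dict[str, MatchType]:
    """Keep one entry per concept ID, with the strongest match type seen."""
    best: dict[str, MatchType] = {}
    for concept_id, match_type in hits:
        if concept_id not in best or match_type > best[concept_id]:
            best[concept_id] = match_type
    return best
```

For example, a record hit by both an alias and a label match would be returned once, tagged as a label match.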

Unable to normalize MetaKB terms

In the analysis of MetaKB, we were unable to normalize these terms for a disease in MOA. This resulted in us not being able to capture 6 MOA assertions.

oncotree:TALL
T-Cell Acute Lymphoid Leukemia
T-Cell Acute Lymphoid Leukemia

It might belong in this concept group, but I might be wrong.

Unpin exact ga4gh.vrsatile.pydantic version in main

Change

"ga4gh.vrsatile.pydantic" = "==0.0.11"

to

"ga4gh.vrsatile.pydantic" = "~=0.0.11"

in main branch

NCIt Download Error

When running python3 -m disease.cli --normalizer="ncit" --dev, I get a zipfile.BadZipFile: File is not a zip file. We should fix this to be able to use NCIt.
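
A possible guard, assuming the downloaded payload is available as bytes before extraction. This only detects the problem early (for instance, when the server returns an error page instead of the archive). It does not fix the download itself, and `looks_like_zip` is a hypothetical helper name:

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the downloaded bytes form a readable zip archive."""
    # zipfile.is_zipfile accepts a file-like object as well as a path.
    return zipfile.is_zipfile(io.BytesIO(payload))
```

Checking this before calling `zipfile.ZipFile` would let the loader raise a clearer error that names the download URL.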

Add tests for api endpoints for schema validation

emergency

bad things are happening

Capture SSSOM mappings from MONDO

https://github.com/monarch-initiative/mondo/tree/master/src/ontology/mappings

Add OMIM data

Add normalize endpoint

Revert `DiseaseDescriptor.disease` to `DiseaseDescriptor.disease_id` in main

To be consistent with other normalizers on prod. I think some staging changes accidentally got merged into the main branch.

update vrsatile-pydantic dependency

Transition to serverless

We'll be moving away from Elastic Beanstalk. This requires a few things to help clean up environments + improve performance:

Remove Elastic Beanstalk code / files
Refactor app environment variables
Separate out dev dependencies

Handle OMIM deprecated concepts

Add MONDO data

http://www.obofoundry.org/ontology/mondo.html

Improve error description if OMIM file is missing

When disease-norm has been imported as a package into another environment, it's pretty unclear upon initialization where the "no OMIM found" error is coming from -- we should tighten up the description a bit.

Cleanup click message format

Currently outputs:

Loading ncit...
Loaded <disease.etl.ncit.NCIt object at 0x107a261d0> in 624.06806 seconds.
Total time for <disease.etl.ncit.NCIt object at 0x107a261d0>: 624.15317 seconds.

This is due to source being reinitialized here.

Expected output:

Loading ncit...
Loaded ncit in 624.06806 seconds.
Total time for ncit: 624.15317 seconds.
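
A sketch of one way to get the expected output, assuming the fix is simply to keep the source name and the ETL instance in separate variables. `ExampleSource`, `perform_etl`, and `load_source` are hypothetical names used for illustration:

```python
import time


class ExampleSource:
    """Hypothetical stand-in for an ETL class such as disease.etl.ncit.NCIt."""

    def perform_etl(self) -> None:
        pass


def load_source(name: str, source_cls: type) -> list[str]:
    """Run one ETL and return its messages, formatted with the source name."""
    start = time.perf_counter()
    # Bind the instance to its own variable so `name` stays a plain string
    # and never gets formatted as "<... object at 0x...>".
    instance = source_cls()
    instance.perform_etl()
    elapsed = time.perf_counter() - start
    return [f"Loaded {name} in {elapsed:.5f} seconds."]
```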

Use local DynamoDB by default

We currently have the production database as the default, but we should switch to using the local database.
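
A minimal sketch of the defaulting logic, assuming the endpoint is chosen from an environment variable. The variable name `EXAMPLE_DISEASE_DB_URL` is made up for illustration. Port 8000 is the default port for DynamoDB Local:

```python
import os
from typing import Mapping, Optional

LOCAL_DYNAMODB_URL = "http://localhost:8000"


def resolve_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured database URL, falling back to the local one."""
    env = os.environ if env is None else env
    # Production has to be requested explicitly. An unset variable means local.
    return env.get("EXAMPLE_DISEASE_DB_URL", LOCAL_DYNAMODB_URL)
```

With this shape, connecting to production becomes an opt-in step rather than the default.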

Indicate whether term is oncologic

Occasionally it's helpful to compare disease normalizer terms with a focus on whether they are cancers. Probably doable by looking at attributes/term lineage in MONDO.

Update ga4gh.vrsatile.pydantic to reflect schema changes

Don't import obsolete MONDO/NCIt concepts

Load Disease Ontology from local file

Somewhat bizarrely, owlready2 doesn't seem to recognize local copies of the DO OBO or OWL files as valid, but it is able to gather data from the remote repository. We should figure out what's going on so that we can use a consistent source file, given that DO updates intermittently.

Dockerfile

A Dockerfile would be useful

Get app logs to show in EB

other_id -> xref, xref -> associated_with

Rename Value Object IDs

disease_id --> id

Resolve NCIT/UMLS identifier confusions

eg, NCIT:C27819 == UMLS:C0027819?

Add missing NCIt and DOID concepts

Merges for the normalization endpoint are showing about 350 concepts from DO and NCIt that our first pass missed. For NCIt, many of them are subclasses of "finding" (not disease/disorder), but not all (it's not clear why ncit:C98923 isn't getting picked up, for example).

A full list is here: https://gist.github.com/jsstevenson/2d625d962edeeb21b2fd3f6a974c75eb

Many of the DOID concepts seem to be deprecated (but still referenced by Mondo -- should be ok to skip)

Add pediatric attribute to NCIt

Use bioversions to get MONDO version

We are retrieving the latest version from bioversions, which pulls it from bioregistry. bioregistry currently stores the latest version as 2022-12-01. However, the latest version is actually 2023-01-04, as seen in MONDO's latest releases here. I need to do a deeper dive on how bioregistry gets the latest version for MONDO to create an issue/PR, but for now we can use GitHub's API.

To get the correct latest version, we can do:

import requests

response = requests.get("https://api.github.com/repos/monarch-initiative/mondo/releases/latest")
self._version = response.json()["name"].replace("v", "")

We could also replace the download URL with the one below if we wanted to ensure that the latest release is always being downloaded. The current URL works, but it might be safer to change it in the event that there is a mismatch (seems unlikely?):

url = f"http://purl.obolibrary.org/obo/mondo/releases/{self._version}/mondo.owl"

For the term "testicular cancer", normalization is creating a group which puts the "pediatric" term as the group label. IE, the group containing mondo:0037250 is using ncit:C5053 as its anchor label.
That group also seems to be dropping IDs. The Mondo term lists a couple of xrefs but they're falling out of the finalized merge group.

Refine normalized concept set construction/record generation

Currently, we use Mondo cross-references to create record ID sets, and then generate merged records from all 3 sources (NCIt > Mondo > DO).

Some of this behavior should be refined:

Should we build merged concepts for NCIt/DO records that aren't linked from Mondo?
Should we include all 3 sources in merged concept construction?
How should this change as we add OMIM and OncoTree?
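
For reference, the current NCIt > Mondo > DO ordering can be expressed as a simple sort key. This is a sketch only: `SOURCE_PRIORITY` and `order_by_priority` are hypothetical names, and the prefixes are assumed to be `ncit`, `mondo`, and `doid`:

```python
# Lower number means higher priority when building a merged record.
SOURCE_PRIORITY = {"ncit": 0, "mondo": 1, "doid": 2}


def order_by_priority(concept_ids: list[str]) -> list[str]:
    """Sort concept IDs so the highest-priority source comes first."""

    def rank(concept_id: str) -> int:
        prefix = concept_id.split(":")[0].lower()
        # Unknown sources sort after every known one.
        return SOURCE_PRIORITY.get(prefix, len(SOURCE_PRIORITY))

    return sorted(concept_ids, key=rank)
```

Adding OMIM and OncoTree would then mostly be a matter of deciding where they sit in the priority table.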

Use ga4gh.vrsatile.pydantic models

https://pypi.org/project/ga4gh.vrsatile.pydantic/

Enable search by associated_with fields

Also double-check handling of ICD codes, "icd.o" -> "icdo"
