
cancervariants / disease-normalization


Services and guidelines for normalizing disease terms

Home Page: https://disease-normalizer.readthedocs.io/latest/

License: MIT License

Python 56.33% Shell 0.15% Jupyter Notebook 43.52%
bioinformatics precision-medicine disease-classification biomedical-informatics python

disease-normalization's People

Contributors

jsstevenson, jsstevenson-tmp, korikuzma


disease-normalization's Issues

Handle missing merge ref

This shouldn't ever happen, but occasionally does for one reason or another -- in this case, we should handle it a bit more gracefully and enable more verbose logging.

Get GitHub Actions to pass

The test GitHub Action has been failing on main for some time. We should update the test data so that the tests pass.

Clean OMIMPS xref values

They're currently treated as xrefs, but I think (?) they can be converted into direct OMIM concept identifiers, which would broaden the reach of some of the normalized concept entries.

Handling grade in cancers

Different sources inconsistently provide specific low and/or high grade concepts for some cancers. In the tree- and ontology-based sources, these are usually represented as children of the generic grade-less concept of the specific cancer.

  • Is it possible/desirable to retain that relationship? Practically, is there a way to retrieve both the graded and grade-less concept in one search?

MONDO `_get_concept_id` raises ValueError when `ref=MESH:MESH:1622152`

This happens during the ETL for MONDO.

Traceback:

Loading mondo...
  ...
  File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 132, in <dictcomp>
    str(key): [self._get_concept_id(g[1]) for g in group]  # type: ignore
  File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 132, in <listcomp>
    str(key): [self._get_concept_id(g[1]) for g in group]  # type: ignore
  File "metakb/.venv/lib/python3.10/site-packages/disease/etl/mondo.py", line 103, in _get_concept_id
    prefix, id_no = ref.split(":")
  ValueError: too many values to unpack (expected 2)

This doesn't seem to be a valid ID, as seen here, so we will just catch the exception and log it.
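A minimal sketch of the catch-and-log approach described above. The function name `parse_concept_ref` and its lower-cased return format are hypothetical and are not taken from `disease/etl/mondo.py`; only the `ref.split(":")` unpacking comes from the traceback.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_concept_ref(ref: str) -> Optional[str]:
    """Return "prefix:id" for a well-formed reference, or None if malformed.

    Hypothetical stand-in for the ETL's `_get_concept_id`; the real method
    may map prefixes differently.
    """
    try:
        # Raises ValueError unless ref contains exactly one ":"
        prefix, id_no = ref.split(":")
    except ValueError:
        logger.warning("Skipping malformed concept reference: %s", ref)
        return None
    return f"{prefix.lower()}:{id_no}"
```

With this shape, callers building the xref groups would need to filter out `None` values before storing them.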

Return all matches in search/

For the sake of keeping the 3 normalizer responses consistent:

  • Return all matches for a query. If different match types point to same record, use the highest match type
  • Add match_type to Disease class
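One way the "highest match type wins" rule could look, as a rough sketch. The `MatchType` members and their numeric values here are assumptions for illustration, as is the `best_matches` helper; the project's actual enum and response schema may differ.

```python
from enum import IntEnum
from typing import Dict, Iterable, Tuple


class MatchType(IntEnum):
    """Hypothetical ranking: a larger value means a stronger match."""

    NO_MATCH = 0
    XREF = 40
    ALIAS = 60
    LABEL = 80
    CONCEPT_ID = 100


def best_matches(hits: Iterable[Tuple[str, MatchType]]) -> Dict[str, MatchType]:
    """Keep every matched record once, tagged with its strongest match type."""
    best: Dict[str, MatchType] = {}
    for concept_id, match_type in hits:
        if concept_id not in best or match_type > best[concept_id]:
            best[concept_id] = match_type
    return best
```

Here a record reached through both an alias and a label would be returned once, carrying the label match type.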

Unable to normalize MetaKB terms

In the analysis of MetaKB, we were unable to normalize these terms for a disease in MOA. This resulted in us not being able to capture 6 MOA assertions.

oncotree:TALL
T-Cell Acute Lymphoid Leukemia
T-Cell Acute Lymphoid Leukemia

It might belong in this concept group, but I might be wrong.

NCIt Download Error

When running python3 -m disease.cli --normalizer="ncit" --dev, I get a zipfile.BadZipFile: File is not a zip file. We should fix this to be able to use NCIt.
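The issue does not state why the download is not a zip file. As a possible first step, the ETL could validate the downloaded file before extracting it, so that the failure points at the download rather than at the extraction. The helper name `ensure_zip` below is hypothetical; `zipfile.is_zipfile` is part of the standard library.

```python
import zipfile
from pathlib import Path


def ensure_zip(path: Path) -> Path:
    """Return path if it is a readable zip archive, else raise ValueError."""
    if not zipfile.is_zipfile(path):
        raise ValueError(
            f"{path} is not a valid zip archive; the download may have "
            "returned an error page or been truncated"
        )
    return path
```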

Transition to serverless

We'll be moving away from Elastic Beanstalk. This requires a few things to help clean up environments + improve performance:

  • Remove Elastic Beanstalk code / files
  • Refactor app environment variables
  • Separate out dev dependencies

Improve error description if OMIM file is missing

When disease-norm has been imported as a package into another environment, it's pretty unclear upon initialization where the "no OMIM found" error is coming from. We should make the description more specific.

Cleanup click message format

Currently outputs:

Loading ncit...
Loaded <disease.etl.ncit.NCIt object at 0x107a261d0> in 624.06806 seconds.
Total time for <disease.etl.ncit.NCIt object at 0x107a261d0>: 624.15317 seconds.

This is due to source being reinitialized here.

Expected output:

Loading ncit...
Loaded ncit in 624.06806 seconds.
Total time for ncit: 624.15317 seconds.
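A sketch of one way to get the expected output: keep the source's name as a plain string and format messages from it, rather than from the reinitialized source object. The `load_source` helper and its `load` callback are hypothetical; in the CLI, each message would presumably be passed to `click.echo`.

```python
import time
from typing import Callable, List


def load_source(name: str, load: Callable[[], None]) -> List[str]:
    """Run load() and return progress messages that use the source's name."""
    messages = [f"Loading {name}..."]
    start = time.perf_counter()
    load()
    elapsed = time.perf_counter() - start
    # Format with the name string, never the ETL object itself,
    # so the object's default repr cannot leak into the output
    messages.append(f"Loaded {name} in {elapsed:.5f} seconds.")
    return messages
```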

Indicate whether term is oncologic

Occasionally it's helpful to compare disease normalizer terms with a focus on whether they are cancers. Probably doable by looking at attributes/term lineage in MONDO.

Load Disease Ontology from local file

Somewhat bizarrely, owlread2 doesn't seem to recognize local copies of the DO OBO or OWL files as valid, but it is able to gather data from the remote repository. We should figure out what's going on so that we can use a consistent source file, given that DO updates intermittently.

Add missing NCIt and DOID concepts

Merges for the normalization endpoint are showing about 350 concepts from DO and NCIt that our first pass missed. For NCIt, many of them are subclasses of "finding" (not disease/disorder), but not all (it's not clear why ncit:C98923 isn't getting picked up, for example).

A full list is here: https://gist.github.com/jsstevenson/2d625d962edeeb21b2fd3f6a974c75eb

Many of the DOID concepts seem to be deprecated but are still referenced by Mondo; these should be OK to skip.

Use bioversions to get MONDO version

We are retrieving the latest version from bioversions, which pulls it from bioregistry. bioregistry currently stores the latest version as 2022-12-01. However, the latest version is actually 2023-01-04, as seen in MONDO's latest releases here. I need to do a deeper dive on how bioregistry gets the latest version for MONDO in order to create an issue/PR, but for now we can use GitHub's API.

To get the correct latest version, we can do:

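import requests

# Fetch the latest MONDO release; fail loudly on HTTP errors
releases_url = "https://api.github.com/repos/monarch-initiative/mondo/releases/latest"
response = requests.get(releases_url, timeout=30)
response.raise_for_status()
# Strip only the leading "v" from the release name (Python 3.9+)
self._version = response.json()["name"].removeprefix("v")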

We could also replace the download URL with the following if we wanted to ensure that the latest release is always being downloaded. The current URL works, but it might be safer to change it in the event that there is a mismatch (seems unlikely?):

url = f"http://purl.obolibrary.org/obo/mondo/releases/{self._version}/mondo.owl"

Capture MTHU terms from OMIM

We seem to be missing some terms containing "MTHU" identifiers from OMIM. We should investigate why, and whether they're worth adding.

Normalization wonkiness

Two things seem to be happening on a recent build of the data:

  1. For the term "testicular cancer", normalization is creating a group which puts the "pediatric" term as the group label, i.e., the group containing mondo:0037250 is using ncit:C5053 as its anchor label.

  2. That group also seems to be dropping IDs. The Mondo term lists a couple of xrefs but they're falling out of the finalized merge group.

Refine normalized concept set construction/record generation

Currently, we use Mondo cross-references to create record ID sets, and then generate merged records from all 3 sources (NCIt > Mondo > DO).

Some of this behavior should be refined:

  • Should we build merged concepts for NCIt/DO records that aren't linked from Mondo?
  • Should we include all 3 sources in merged concept construction?
  • How should this change as we add OMIM and OncoTree?
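For reference while discussing these questions, the NCIt > Mondo > DO ordering mentioned above could be expressed as a simple priority sort. This is only a sketch: the prefix strings in `SOURCE_PRIORITY` and the `order_by_priority` helper are assumptions, not the project's actual merge logic.

```python
from typing import List, Sequence

# Assumed concept ID prefixes, highest priority first (NCIt > Mondo > DO)
SOURCE_PRIORITY = ("ncit", "mondo", "doid")


def order_by_priority(concept_ids: Sequence[str]) -> List[str]:
    """Sort concept IDs by source priority; unknown sources go last."""

    def rank(concept_id: str) -> int:
        prefix = concept_id.split(":", 1)[0].lower()
        if prefix in SOURCE_PRIORITY:
            return SOURCE_PRIORITY.index(prefix)
        return len(SOURCE_PRIORITY)

    # sorted() is stable, so IDs from the same source keep their input order
    return sorted(concept_ids, key=rank)
```

Adding OMIM or OncoTree would then mostly be a question of where their prefixes go in the tuple.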
