cancervariants / therapy-normalization

Services and guidelines for normalizing drug and other therapy terms

Home Page: https://normalize.cancervariants.org/therapy/

License: MIT License

Python 75.67% Jupyter Notebook 24.22% Shell 0.11%

bioinformatics precision-medicine bioinformatics-data

therapy-normalization's Issues

Version of normalizer in response

NCIt ETL Methods

Update NCIt ETL methods to work with DynamoDB.

Application Refactor

We are currently writing normalization routines for each normalizer, but we should move to a single datastore for queries going forward.

This will reduce the amount of code we need to test and maintain.

This is a big issue, and will require several components. I am adding the EPIC label to designate this as the top-level issue, and we can track the related tasks on the Application Refactor project.

Update other_identifiers and xrefs

other_identifiers should now include RxNorm and ChemIDPlus.

Non-drug therapies

Our recommendations will require handling of non-drug (e.g. radiation) therapy terms.

Application Refactor: DynamoDB

Transition from sqlite database to DynamoDB.

We can track the related tasks on the Application Refactor Project.

Create normalize endpoint

The normalize endpoint should generate a single, merged concept for search terms.

Standardize logging

Non-urgent, but I think it'd be helpful at some point to briefly talk about what/how we should be logging across normalizers and sources and make it uniform.

Implement RxNorm Normalizer

https://www.nlm.nih.gov/research/umls/rxnorm/index.html

Add DB URL environment variable check to CLI updating tool

Merge in the dgidb branch and parameterize

Currently, the dgidb branch uses a response format that differs from the master branch: instead of an array of normalizer responses, it returns a JSON object for each normalizer class.

I think it is better to maintain the array, but parameterize what normalizers are returned and optionally allow for the DGIdb response format.

Update source meta information

Add additional attributes to source meta:

```
rdp_url: Optional[str]
non_commercial: Optional[bool]
share_alike: Optional[bool]
attribution: Optional[bool]
```
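As a rough illustration of how these attributes could sit alongside existing source metadata, here is a minimal sketch using a standard-library dataclass. The class name `SourceMeta`, the `data_license` and `version` fields, and the sample values are assumptions made for the example, not the project's actual model.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SourceMeta:
    """Hypothetical container for per-source metadata."""

    # Fields assumed to exist already (illustrative only).
    data_license: str
    version: str
    # Attributes proposed in this issue; all optional.
    rdp_url: Optional[str] = None
    non_commercial: Optional[bool] = None
    share_alike: Optional[bool] = None
    attribution: Optional[bool] = None


# Sample values are made up for the example.
meta = SourceMeta(
    data_license="CC BY-SA 3.0",
    version="27",
    share_alike=True,
    attribution=True,
)
print(asdict(meta))
```

Defaulting the new fields to `None` would let existing sources keep working until their license details are filled in.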

Capture drug approval status

Drug approval status by locale is commonly requested information. We should capture this in our term normalization effort.

Use ORM API for insertion and selection statements

We're currently using a lot of raw SQL for both inserting and selecting data. For ease of development and better security, we should make more use of SQLAlchemy's ORM features instead.

Remove xrefs from RxNorm with SRL > 1

Wikidata normalizer update / retrieve methods

Currently we focus only on loading data for normalizers, but for this to be production ready / reusable we need to provide methods to update the data files used for the normalizers.

Add test suite and CI

We currently have one lonely unit test implemented in pytest, but should build that out to cover the ChEMBL normalizer and other use cases.

@korikuzma and @jsstevenson have identified GitHub Actions as a good starting point for CI.

Add CLI for populating database

Make database updates a stand-alone task that can be called from a command line interface.

Click is a common and useful CLI tool. Typer is a Click-based CLI tool by the FastAPI creators. I mention the latter due to its closeness to the FastAPI toolset, but I think we should implement the CLI using Click directly due to its maturity and active community.

The first CLI util we'll want to make is one that runs one or more of the ETL ops created in #30. By default users should specify which source they want to update, though we should provide an --all flag as a shortcut for updating everything. We should build this utility with some thought on how we would parallelize this I/O-bound process.
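One possible way to think about the parallelization is sketched below with the standard library's `ThreadPoolExecutor`, which suits I/O-bound work. The registry `ETL_REGISTRY`, the stand-in `_fake_etl` function, and `update_sources` are all hypothetical names for illustration; a Click command could call something like this after parsing its options.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def _fake_etl(name: str) -> str:
    # Stand-in for a real, I/O-bound ETL run (download, transform, load).
    return f"{name}: updated"


# Hypothetical registry mapping source names to ETL callables.
ETL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "chembl": _fake_etl,
    "wikidata": _fake_etl,
    "ncit": _fake_etl,
}


def update_sources(
    sources: Optional[Iterable[str]] = None, update_all: bool = False
) -> Dict[str, str]:
    """Run the selected ETL callables concurrently and collect their results."""
    selected = list(ETL_REGISTRY) if update_all else list(sources or [])
    if not selected:
        raise ValueError("Specify at least one source or pass update_all=True")
    unknown = [s for s in selected if s not in ETL_REGISTRY]
    if unknown:
        raise ValueError(f"Unknown source(s): {', '.join(unknown)}")
    # One thread per source; map() yields results in the order of `selected`.
    with ThreadPoolExecutor(max_workers=len(selected)) as pool:
        results = pool.map(lambda s: ETL_REGISTRY[s](s), selected)
        return dict(zip(selected, results))
```

For example, `update_sources(["chembl", "ncit"])` would run both stand-in jobs in separate threads, and `update_sources(update_all=True)` corresponds to the --all flag.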

License meta

We should include metadata on data licenses on a per-normalizer basis:

e.g. CC licenses: https://creativecommons.org/

Existing drug terminologies and normalization services

This thread tracks known and public drug terminologies and normalization services. Initial list here: https://docs.google.com/spreadsheets/d/1bhA4RVt7HT2oIMVWeaXcnxFNcUQ4rEl9mB5cBSK8Kzw/edit#gid=0

Implement NCIt normalizer

It would be great to get NCIt concept codes where possible. For drugs, these will be descendant concepts of Pharmacologic Substance.

Create download methods for normalizers

Right now, we manually download the normalizer's data file. We need to create methods to download from the source's site.

Refactor data path checks out of query

Currently, the normalizer list in query.py instantiates the normalizers on test runs (even when mocked), which raises file not found errors. We need to do some dependency injection or something of the sort to divert it to the correct pathways.
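A minimal sketch of the dependency-injection idea follows, assuming the query code can receive factories instead of building normalizers at import time. `QueryHandler`, `get_normalizer`, and the factory mapping are hypothetical names, not the current contents of query.py.

```python
from typing import Callable, Dict


class QueryHandler:
    """Hypothetical handler that builds normalizers lazily from injected factories."""

    def __init__(self, factories: Dict[str, Callable[[], object]]) -> None:
        self._factories = factories
        self._instances: Dict[str, object] = {}

    def get_normalizer(self, name: str) -> object:
        # Instantiate on first use only, so importing the module (or
        # constructing the handler in a test) never touches data files.
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]


class FakeNormalizer:
    """Test double that needs no data file."""


handler = QueryHandler({"chembl": FakeNormalizer})
normalizer = handler.get_normalizer("chembl")
```

In tests, the real factories would simply be replaced with fakes like `FakeNormalizer`, so no file paths are checked.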

Whitespace sanitization

We should consider how we want to handle queries with whitespace.

The below example shows how the addition of whitespace between hyphens can change the results we get.

No whitespace

Whitespace in between hyphens
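One candidate policy, shown only as a sketch, is to trim the query, collapse runs of whitespace, and drop whitespace around hyphens before searching. The function name `sanitize_query` is hypothetical, and removing spaces around hyphens is an assumption that may not suit every term.

```python
import re


def sanitize_query(query: str) -> str:
    """Normalize whitespace in a search term (one possible policy)."""
    # Trim the ends and collapse internal runs of whitespace to one space.
    collapsed = re.sub(r"\s+", " ", query.strip())
    # Drop spaces on either side of a hyphen ("a - b" becomes "a-b").
    return re.sub(r"\s*-\s*", "-", collapsed)
```

Whether hyphen-adjacent whitespace should be removed, preserved, or searched both ways is exactly the decision this issue asks us to make.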

Handle wrongly-excluded ChemIDplus records

Our current filtering strategy to get relevant ChemIDplus records (checking for nomenclature tags) removes some entries that our NCIt data links to. So, we want to be able to opt-in records pointed to by NCIt. We need some combo of the following:

NCIt should be able to tell us which ChemIDplus records are missing (or at least build a list of all referenced ChemIDplus records).
When performing a normal ChemIDplus load, it should be able to reference an existing allow-list to rule-in records that it would otherwise exclude.
ChemIDplus should probably be able to perform a run that only adds in allow-listed records, rather than reimporting the entire source.
We need to think about how this would work if users import both sources at once, or if they import one and then the other later.
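The allow-list idea could be sketched roughly as follows. The record shape (dicts with `concept_id`, `xrefs`, and `has_nomenclature_tag` keys), the `chemidplus:` prefix, and both function names are assumptions for illustration, and the IDs are made up.

```python
from typing import Dict, Iterable, List, Set


def collect_referenced_ids(ncit_records: Iterable[Dict]) -> Set[str]:
    """Build an allow-list of ChemIDplus IDs referenced by NCIt records."""
    return {
        xref
        for record in ncit_records
        for xref in record.get("xrefs", [])
        if xref.startswith("chemidplus:")
    }


def select_records(
    chemidplus_records: Iterable[Dict],
    allow_list: Set[str],
    only_allow_listed: bool = False,
) -> List[Dict]:
    """Keep records that pass the usual filter or appear on the allow-list."""
    kept = []
    for record in chemidplus_records:
        allowed = record["concept_id"] in allow_list
        # In an allow-list-only run, the normal filter is ignored entirely.
        passes_filter = (not only_allow_listed) and record["has_nomenclature_tag"]
        if allowed or passes_filter:
            kept.append(record)
    return kept


# Made-up sample data.
ncit = [{"concept_id": "ncit:C0000", "xrefs": ["chemidplus:0000-00-0"]}]
chemid = [
    {"concept_id": "chemidplus:0000-00-0", "has_nomenclature_tag": False},
    {"concept_id": "chemidplus:1111-11-1", "has_nomenclature_tag": True},
    {"concept_id": "chemidplus:2222-22-2", "has_nomenclature_tag": False},
]
allow = collect_referenced_ids(ncit)
full_load = select_records(chemid, allow)
patch_load = select_records(chemid, allow, only_allow_listed=True)
```

Persisting the allow-list between runs would be one way to cover the case where the two sources are imported at different times.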

Add xrefs attribute

Distinguish identifiers representing therapy concepts from those representing associated concepts.

Refactor query.py to search db

Currently query.py queries each resource using independent search routines based off of the normalize routine for each normalizer class. After implementing the db in #29, we are now positioned to build out a generalized query using the ORM.

Add "alternative matches" item to normalization endpoint response

The normalization endpoint returns a single merged record, but it should also provide a list of merge IDs to other possible matches, if there are any. This would include any distinct merged groups that come up at the same MatchType level (should only apply to TradeName/Alias/Xref/AssociatedWith).

We should capture every time this happens with grouped concepts drawn from the same source(s) in the logs under WARNING.

We also want to run a big analysis to get an idea of where/how often this could happen.

Run an analysis of all label_and_type fields, break out by item_type, to get an idea of how frequent alternative matches could come up (and break out by different vs same MatchType numeric values)
log instances in real time under WARNING
provide list of other possible merged concept IDs (probably under warnings)
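The steps above could be sketched as follows, assuming each candidate is a pair of merged concept ID and numeric match score, and that a higher score means a stronger match. The logger name, the function `pick_match`, and the response keys are hypothetical.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("therapy.query")  # hypothetical logger name


def pick_match(query: str, candidates: List[Tuple[str, int]]) -> Dict:
    """Choose one merged concept and report ties at the same match level."""
    if not candidates:
        return {"match": None, "alternative_matches": []}
    # Assumption: larger numeric values indicate stronger match types.
    best_score = max(score for _, score in candidates)
    # Sort tied IDs so the choice is deterministic.
    tied = sorted({cid for cid, score in candidates if score == best_score})
    match, alternatives = tied[0], tied[1:]
    if alternatives:
        logger.warning(
            "Query %r matched multiple merged concepts at level %d: %s",
            query, best_score, ", ".join(tied),
        )
    return {"match": match, "alternative_matches": alternatives}
```

Picking the first ID alphabetically is only a placeholder tie-break; the analysis proposed in this issue should inform a better rule.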

ChEMBL ETL Methods

Update ChEMBL ETL methods to work with DynamoDB.

Use Mock classes for testing

(where possible)

https://docs.pytest.org/en/stable/monkeypatch.html?highlight=mocking#monkeypatching-returned-objects-building-mock-classes

It's probably fine to pull most of the 'mock' db layer out of the existing normalization tests, as long as there's a way to catch update requests before they alter a local DB, validate them, and prevent them from making any changes.
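As a sketch of what such a mock class might look like, the test double below serves reads from an in-memory dict and records update requests without applying them. The class name and method names are hypothetical and would need to mirror the real database interface.

```python
from typing import Dict, List, Optional, Tuple


class RecordingDatabase:
    """Hypothetical test double: reads from memory, never writes."""

    def __init__(self, records: Dict[str, Dict]) -> None:
        self._records = dict(records)
        self.update_requests: List[Tuple[str, Dict]] = []

    def get_record(self, concept_id: str) -> Optional[Dict]:
        return self._records.get(concept_id)

    def update_record(self, concept_id: str, attrs: Dict) -> None:
        # Validate the request, then record it instead of changing any data.
        if not concept_id or not isinstance(attrs, dict):
            raise ValueError("update requires a concept ID and a dict of attributes")
        self.update_requests.append((concept_id, dict(attrs)))


db = RecordingDatabase({"example:1": {"label": "old"}})
db.update_record("example:1", {"label": "new"})
```

In a pytest test, `monkeypatch.setattr` could swap an instance like this in for the real database object, and assertions could then inspect `update_requests`.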

Trade Name normalization

DGIdb did not successfully identify ChEMBL concepts for several drug trade names:

TECFIDERA
ADRIAMYCIN
WESTCORT
CASODEX
FOSCAVIR

We may not be searching those. Let's be sure to add ChEMBL trade names to our search index.

Update documentation

Given the various changes and additions to the app's structure and features, we should update the public-facing docs to provide a more thorough introduction.

ChEMBL normalizer not matching case-sensitively

The ChEMBL normalizer returns match_type code 100 (a case-sensitive primary match) for what should be case-insensitive primary matches, e.g.:

...and the same for alias matches:

Regimens

We will need to appropriately map terms to regimens and combination therapies.

DrugBank ETL Methods

Update DrugBank ETL methods to work with DynamoDB.

Capture biosimilar relationships

Collect and represent biosimilar linkages between drug concepts

Refactor CLI

Refactor CLI to work with DynamoDB

Implement DrugBank normalizer

From: https://go.drugbank.com/releases/latest

Update meta info

Declare tool version variable
Add response date/time

Provide in both endpoints (so shelved until #62 closes)

Create ETL methods in normalizer classes

We currently have data load methods in normalizers. We should extend these classes to allow for the full Extract, Transform, Load (ETL) process to the ORM implemented in #29.

Related issue: #7

Add service-info endpoint

Per https://github.com/ga4gh-discovery/ga4gh-service-info.

See https://editor.swagger.io/?url=https://raw.githubusercontent.com/ga4gh-discovery/ga4gh-service-info/develop/service-info.yaml

Add ORM and define initial schema for db

As we transition to a unified datastore for the application, the first objective will be to implement an ORM and define the schema for our database.

A guide on doing this with FastAPI is available here.

We already have a data model specified, which should help with constructing the initial db. As this is going to be largely read-only, we can and should specify some useful db indices during this process.

AttributeError when no args are given to cli.py

Eg:

python3 cli.py

raises an AttributeError - should instead print the help message
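The intended behavior might look like the sketch below. It uses the standard-library `argparse` rather than Click, purely to stay dependency-free, and the `--normalizer` option name is an assumption; only the `--all` flag is mentioned elsewhere in these issues.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="cli.py", description="Update normalizer data (illustrative)."
    )
    # Hypothetical option name for selecting a single source.
    parser.add_argument("--normalizer", help="name of the source to update")
    parser.add_argument(
        "--all", action="store_true", dest="update_all", help="update every source"
    )
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    # With nothing to do, show usage instead of failing later on a missing value.
    if not args.normalizer and not args.update_all:
        parser.print_help()
        return 1
    # The real update logic would be called here.
    return 0
```

Checking for the empty case explicitly, before any attribute is used, is what avoids the error path.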

Background data loading

Currently the application will silently load data for the normalizers in the background if it is missing on startup. Users are not notified, and may think the application is hanging.

Possible solutions include:

A console handler for the logger, which can be used to emit loading messages to terminal as they are captured by the logger.
Raising an exception, with an alternate endpoint for preparing data.
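The first option could be sketched with the standard `logging` module as below. The logger name `therapy.loader`, the helper name, and the message format are assumptions for illustration.

```python
import logging
import sys


def add_console_handler(
    logger: logging.Logger, level: int = logging.INFO
) -> logging.Handler:
    """Attach a stderr handler so loading messages reach the terminal."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setLevel(level)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    # Lower the logger's threshold if it would otherwise drop these messages.
    if logger.getEffectiveLevel() > level:
        logger.setLevel(level)
    return handler


loader_log = logging.getLogger("therapy.loader")  # hypothetical logger name
console = add_console_handler(loader_log)
loader_log.info("Data missing; loading normalizer data, this may take a while")
```

This keeps the existing log capture intact while making the startup load visible to the user.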

Outstanding questions:

Some records have multiple "NameOfSubstance" attributes - could possibly resolve by "Source" tag? (currently grabbing the first item listed -- which appears to always be NLM)
Label vs trade names - "NameOfSubstance" vs "SystematicName"?

Wikidata ETL Methods

Update Wikidata ETL methods to work with DynamoDB.
