cancervariants / therapy-normalization

Services and guidelines for normalizing drug and other therapy terms

Home Page: https://normalize.cancervariants.org/therapy/

License: MIT License

Languages: Python 75.67%, Jupyter Notebook 24.22%, Shell 0.11%
Topics: bioinformatics, precision-medicine, bioinformatics-data

therapy-normalization's People

Contributors

ahwagner, dependabot[bot], jsstevenson, jsstevenson-tmp, korikuzma, mcannon068nw, ohsu-machineuser, susannasiebert

therapy-normalization's Issues

Application Refactor

We are currently writing normalization routines for each normalizer, but we should move to a single datastore for queries going forward.

This will reduce the amount of code we need to test and maintain.

This is a big issue, and will require several components. I am adding the EPIC label to designate this as the top-level issue, and we can track the related tasks on the Application Refactor project.

Non-drug therapies

Our recommendations will require handling of non-drug (e.g. radiation) therapy terms.

Standardize logging

Non-urgent, but I think it'd be helpful at some point to briefly talk about what/how we should be logging across normalizers and sources and make it uniform.

Merge in the dgidb branch and parameterize

Currently, the dgidb branch uses a response format that differs from the one on the master branch: instead of an array of normalizer responses, it returns a JSON object for each normalizer class.

I think it is better to maintain the array, but parameterize what normalizers are returned and optionally allow for the DGIdb response format.
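
A minimal sketch of that parameterization, under stated assumptions: the `include` and `keyed` parameters, the `NORMALIZERS` registry, and the stub normalizers are all hypothetical names used for illustration, not the project's actual API. The default stays an array, and the keyed-object shape is opt-in.

```python
from typing import Callable, Dict, Iterable, List, Optional, Union


def _stub_normalizer(name: str) -> Callable[[str], dict]:
    """Build a hypothetical stand-in for a real normalizer class."""
    def run(query: str) -> dict:
        return {"normalizer": name, "query": query, "records": []}
    return run


# Hypothetical registry; the real application would register its own classes.
NORMALIZERS: Dict[str, Callable[[str], dict]] = {
    name: _stub_normalizer(name) for name in ("ChEMBL", "Wikidata")
}


def normalize(
    query: str,
    include: Optional[Iterable[str]] = None,
    keyed: bool = False,
) -> Union[List[dict], Dict[str, dict]]:
    """Return an array of responses by default, or one object per normalizer."""
    names = list(NORMALIZERS) if include is None else list(include)
    unknown = [n for n in names if n not in NORMALIZERS]
    if unknown:
        raise ValueError(f"Unknown normalizer(s): {unknown}")
    responses = [NORMALIZERS[n](query) for n in names]
    if keyed:
        # Assumed shape of the DGIdb-style response: keyed by normalizer name.
        return {r["normalizer"]: r for r in responses}
    return responses
```

Calling `normalize("cisplatin", include=["Wikidata"])` would return a one-element array, while `keyed=True` switches to the object-per-normalizer shape.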

Update source meta information

Add additional attributes to source meta:

  • rdp_url: Optional[str]
    
  • non_commercial: Optional[bool]
    
  • share_alike: Optional[bool]
    
  • attribution: Optional[bool]
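
As a rough illustration, the four attributes could be declared as follows. This sketch uses a standard-library dataclass named `SourceMeta`, which is an assumption; the existing source meta model and its other fields are not shown in this issue, so only the new attributes appear here.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SourceMeta:
    """Hypothetical container holding only the attributes named in this issue."""

    rdp_url: Optional[str] = None
    non_commercial: Optional[bool] = None
    share_alike: Optional[bool] = None
    attribution: Optional[bool] = None


# Example values are placeholders, not real licensing data for any source.
meta = SourceMeta(rdp_url="https://example.org/rdp", share_alike=True)
print(asdict(meta))
```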
    

Capture drug approval status

Drug approval status by locale is commonly requested information. We should capture this in our term normalization effort.

Wikidata normalizer update / retrieve methods

Currently we focus only on loading data for normalizers, but for this to be production ready / reusable we need to provide methods to update the data files used for the normalizers.

Add test suite and CI

We currently have one lonely unit test implemented in pytest, but should build that out to cover the ChEMBL normalizer and other use cases.

@korikuzma and @jsstevenson have identified GitHub Actions as a good starting point for CI.

Add CLI for populating database

Make database updates a stand-alone task that can be called from a command line interface.

Click is a common and useful CLI tool. Typer is a Click-based CLI tool by the FastAPI creators. I mention the latter due to its closeness to the FastAPI toolset, but I think we should implement the CLI using Click directly due to its maturity and active community.

The first CLI util we'll want to make is one that runs one or more of the ETL ops created in #30. By default, users should specify which source they want to update, though we should provide an --all flag as a shortcut for updating everything. We should build this utility with thought given to how we would parallelize this I/O-bound process.
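
The following is a rough sketch of the interface described above. To keep it runnable without extra dependencies it uses the standard library's argparse rather than Click, so it illustrates the flag design only, not the proposed Click implementation. The source names, the `--source` flag, and `run_etl` are hypothetical; a thread pool is one plausible way to overlap I/O-bound ETL runs.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

SOURCES = ("chembl", "wikidata")  # hypothetical source keys


def run_etl(source: str) -> str:
    """Placeholder for a real ETL op; returns a status message."""
    return f"updated {source}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Update normalizer data.")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--source", action="append", choices=SOURCES,
                       help="source to update (may be repeated)")
    group.add_argument("--all", action="store_true", dest="update_all",
                       help="update every source")
    return parser


def main(argv: Optional[List[str]] = None) -> List[str]:
    args = build_parser().parse_args(argv)
    targets = list(SOURCES) if args.update_all else args.source
    # Threads suit I/O-bound work; map() keeps results in input order.
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        return list(pool.map(run_etl, targets))
```

For example, `main(["--source", "chembl", "--source", "wikidata"])` updates two named sources, and `main(["--all"])` updates everything.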

Refactor data path checks out of query

Currently, the normalizer list in query.py instantiates the normalizers on test runs (even when mocked), which raises file not found errors. We need to do some dependency injection or something of the sort to divert it to the correct pathways.
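
One possible shape for this, sketched with hypothetical class names: the query handler receives factories instead of instances, and only builds a normalizer the first time it is needed. Tests can then inject stubs that never touch the filesystem.

```python
from pathlib import Path
from typing import Callable, Dict


class FileBackedNormalizer:
    """Hypothetical normalizer that checks its data path on construction."""

    def __init__(self, data_path: Path) -> None:
        if not data_path.exists():
            raise FileNotFoundError(data_path)
        self.data_path = data_path

    def normalize(self, query: str) -> dict:
        return {"query": query, "source": str(self.data_path)}


class StubNormalizer:
    """Test double that needs no data files."""

    def normalize(self, query: str) -> dict:
        return {"query": query, "source": "stub"}


class QueryHandler:
    """Accepts zero-argument factories and instantiates them lazily."""

    def __init__(self, factories: Dict[str, Callable[[], object]]) -> None:
        self._factories = factories
        self._instances: Dict[str, object] = {}

    def _get(self, name: str):
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]

    def normalize(self, query: str) -> Dict[str, dict]:
        return {n: self._get(n).normalize(query) for n in self._factories}


# In tests, inject the stub; no file checks run.
handler = QueryHandler({"chembl": StubNormalizer})
```

Because construction is deferred, building a handler around `FileBackedNormalizer` does not raise until a query is actually run.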

Whitespace sanitization

We should consider how we want to handle queries with whitespace.

The below example shows how the addition of whitespace between hyphens can change the results we get.

No whitespace:
[Screenshot, 2020-10-05 11:31:40 AM; image not preserved]

Whitespace in between hyphens:
[Screenshot, 2020-10-05 11:31:56 AM; image not preserved]
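
One candidate policy, offered only as a starting point for discussion: trim the query, remove whitespace around hyphens, and collapse remaining runs of whitespace. The function name `sanitize_query` is hypothetical, and whether stripping whitespace around hyphens is actually the desired behavior is the open question in this issue.

```python
import re


def sanitize_query(query: str) -> str:
    """Trim, drop whitespace around hyphens, and collapse other whitespace."""
    query = query.strip()
    query = re.sub(r"\s*-\s*", "-", query)
    return re.sub(r"\s+", " ", query)
```

With this policy, a query typed with spaces around a hyphen and the same query typed without them would reach the search routines as the same string.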

Handle wrongly-excluded ChemIDplus records

Our current filtering strategy for selecting relevant ChemIDplus records (checking for nomenclature tags) removes some entries that our NCIt data links to. So, we want to be able to opt in records pointed to by NCIt. We need some combo of the following:

  • NCIt should be able to tell us which ChemIDplus records are missing (or at least build a list of all referenced ChemIDplus records).
  • When performing a normal ChemIDplus load, it should be able to reference an existing allow-list to rule-in records that it would otherwise exclude.
  • ChemIDplus should probably be able to perform a run that only adds in allow-listed records, rather than reimporting the entire source.
  • We need to think about how this would work if users import both sources at once, or if they import one and then the other later.
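
The first three points could fit together roughly as below. All field names (`concept_id`, `xrefs`, `nomenclature_tag`), the `chemidplus:` prefix, and the sample records are assumptions for illustration; the real record formats may differ.

```python
from typing import Dict, Iterable, List, Set


def referenced_chemidplus_ids(ncit_records: Iterable[Dict]) -> Set[str]:
    """Collect every ChemIDplus ID that NCIt records point to."""
    return {
        xref
        for record in ncit_records
        for xref in record.get("xrefs", [])
        if xref.startswith("chemidplus:")
    }


def select_records(
    records: Iterable[Dict],
    allow_list: Set[str],
    only_allow_listed: bool = False,
) -> List[Dict]:
    """Keep records that pass the tag filter or appear on the allow-list."""
    selected = []
    for record in records:
        allowed = record["concept_id"] in allow_list
        passes_filter = bool(record.get("nomenclature_tag"))
        if allowed or (passes_filter and not only_allow_listed):
            selected.append(record)
    return selected


# Made-up sample data, not real identifiers.
ncit = [{"concept_id": "ncit:C1", "xrefs": ["chemidplus:111", "other:9"]}]
chemidplus = [
    {"concept_id": "chemidplus:111", "nomenclature_tag": None},
    {"concept_id": "chemidplus:222", "nomenclature_tag": "INN"},
    {"concept_id": "chemidplus:333", "nomenclature_tag": None},
]
allow_list = referenced_chemidplus_ids(ncit)
```

Setting `only_allow_listed=True` corresponds to the run that adds only allow-listed records without reimporting the whole source. The import-ordering question in the last point is not addressed by this sketch.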

Add xrefs attribute

Separate the xrefs that represent therapy concepts from those that represent associated concepts.

Refactor query.py to search db

Currently, query.py queries each resource using independent search routines based on the normalize routine for each normalizer class. After implementing the db in #29, we are now positioned to build out a generalized query using the ORM.

Add "alternative matches" item to normalization endpoint response

The normalization endpoint returns a single merged record, but it should also provide a list of merge IDs to other possible matches, if there are any. This would include any distinct merged groups that come up at the same MatchType level (should only apply to TradeName/Alias/Xref/AssociatedWith).

We should capture every time this happens with grouped concepts drawn from the same source(s) in the logs under WARNING.

We also want to run a big analysis to get an idea of where/how often this could happen.

  • Run an analysis of all label_and_type fields, break out by item_type, to get an idea of how frequent alternative matches could come up (and break out by different vs same MatchType numeric values)
  • log instances in real time under WARNING
  • provide list of other possible merged concept IDs (probably under warnings)
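
A small sketch of the response-side behavior, under several assumptions: matches arrive as `(merged concept ID, numeric MatchType)` pairs, a higher number means a stronger match, and the alternatives are reported under a `warnings` key named `alternative_matches`. None of these names or conventions are confirmed by the issue, and the sketch does not restrict itself to the TradeName/Alias/Xref/AssociatedWith levels.

```python
import logging
from typing import List, Tuple

logger = logging.getLogger("therapy.query")  # hypothetical logger name


def pick_with_alternatives(matches: List[Tuple[str, int]]) -> dict:
    """Return the first top-ranked merged ID plus any same-rank alternatives."""
    if not matches:
        return {"record": None, "warnings": {}}
    best = max(score for _, score in matches)
    top_ids: List[str] = []
    for merged_id, score in matches:
        if score == best and merged_id not in top_ids:
            top_ids.append(merged_id)
    response = {"record": top_ids[0], "warnings": {}}
    if len(top_ids) > 1:
        response["warnings"]["alternative_matches"] = top_ids[1:]
        # Logged under WARNING so these cases can be counted later.
        logger.warning("Alternative matches for %s: %s", top_ids[0], top_ids[1:])
    return response
```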

Trade Name normalization

DGIdb did not successfully identify ChEMBL concepts for several drug trade names:

  • TECFIDERA
  • ADRIAMYCIN
  • WESTCORT
  • CASODEX
  • FOSCAVIR

We may not be searching trade names at all. Let's be sure to add ChEMBL trade names to our search index.

Update documentation

Given the various changes and additions to the app's structure and features, we should update the public-facing docs to provide a more thorough introduction.

Regimens

We will need to appropriately map terms to regimens and combination therapies.

Update meta info

  • Declare tool version variable
  • Add response date/time

Provide in both endpoints (so shelved until #62 closes)
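
For illustration, both items could come from one helper shared by the endpoints. The key names `version` and `response_datetime`, the helper name `service_meta`, and the placeholder version string are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional

__version__ = "0.0.0"  # placeholder; the real tool version would be declared here


def service_meta(now: Optional[datetime] = None) -> dict:
    """Build the meta block attached to every endpoint response."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return {"version": __version__, "response_datetime": moment.isoformat()}
```

Accepting `now` as a parameter keeps the helper easy to test with a fixed timestamp.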

Add ORM and define initial schema for db

As we transition to a unified datastore for the application, the first objective will be to implement an ORM and define the schema for our database.

A guide on doing this with FastAPI is available here.

We already have a data model specified, which should help with constructing the initial db. As this is going to be a largely read-only database, we can and should specify some useful db indices during this process.
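
To make the indexing point concrete, here is a sketch that uses the standard library's sqlite3 directly instead of an ORM, so it is a stand-in for the idea rather than the proposed implementation. The table and column names are invented; the project's actual data model is not reproduced here.

```python
import sqlite3
from typing import Optional

# Hypothetical two-table schema with case-insensitive lookup indices.
SCHEMA = """
CREATE TABLE therapies (
    concept_id TEXT PRIMARY KEY,
    label TEXT NOT NULL
);
CREATE TABLE aliases (
    alias TEXT NOT NULL,
    concept_id TEXT NOT NULL REFERENCES therapies (concept_id)
);
CREATE INDEX idx_therapies_label ON therapies (lower(label));
CREATE INDEX idx_aliases_alias ON aliases (lower(alias));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample rows.
conn.execute("INSERT INTO therapies VALUES (?, ?)", ("example:1", "Cisplatin"))
conn.execute("INSERT INTO aliases VALUES (?, ?)", ("CDDP", "example:1"))


def lookup(term: str) -> Optional[str]:
    """Find a concept ID by label or alias, ignoring case."""
    row = conn.execute(
        "SELECT t.concept_id FROM therapies t "
        "LEFT JOIN aliases a ON a.concept_id = t.concept_id "
        "WHERE lower(t.label) = lower(?) OR lower(a.alias) = lower(?) "
        "LIMIT 1",
        (term, term),
    ).fetchone()
    return row[0] if row else None
```

Since writes are rare in a read-mostly store, the extra cost of maintaining such indices on insert is usually an acceptable trade for faster lookups.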

Background data loading

Currently, the application silently loads data for the normalizers in the background if it is missing on startup. Users are not notified, and may think the application is hanging.

Possible solutions include:

  1. A console handler for the logger, which can be used to emit loading messages to terminal as they are captured by the logger.
  2. Raising an exception, with an alternate endpoint for preparing data.
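
The first option might look roughly like this, using the standard logging module. The logger name `therapy.loader`, the message format, and the helper name are assumptions.

```python
import logging
from typing import Optional, TextIO


def add_console_handler(
    logger: logging.Logger, stream: Optional[TextIO] = None
) -> logging.Handler:
    """Attach a handler that echoes INFO-and-above records to the terminal.

    When stream is None, StreamHandler writes to sys.stderr.
    """
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return handler


logger = logging.getLogger("therapy.loader")  # hypothetical logger name
logger.setLevel(logging.INFO)
```

After calling `add_console_handler(logger)` at startup, a call such as `logger.info("Loading data for %s", "ChEMBL")` would be printed to the terminal as it happens, so users can see that a load is in progress.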
