cancervariants / therapy-normalization
Services and guidelines for normalizing drug and other therapy terms
Home Page: https://normalize.cancervariants.org/therapy/
License: MIT License
Update NCIt ETL methods to work with DynamoDB.
We are currently writing normalization routines for each normalizer, but we should move to a single datastore for queries going forward.
This will reduce the amount of code we need to test and maintain.
This is a big issue, and will require several components. I am adding the EPIC label to designate this as the top-level issue, and we can track the related tasks on the Application Refactor project.
other_identifiers should now include RxNorm and ChemIDplus.
Our recommendations will require handling of non-drug (e.g. radiation) therapy terms.
The normalize endpoint should generate a single, merged concept for search terms.
Non-urgent, but I think it'd be helpful at some point to briefly talk about what/how we should be logging across normalizers and sources and make it uniform.
Currently, the dgidb branch uses a revised response format from the master branch, where instead of an array of normalizer responses it returns a JSON object for each normalizer class.
I think it is better to maintain the array, but parameterize what normalizers are returned and optionally allow for the DGIdb response format.
Add additional attributes to source meta:
rdp_url: Optional[str]
non_commercial: Optional[bool]
share_alike: Optional[bool]
attribution: Optional[bool]
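As a rough illustration of how these attributes might sit on a source metadata record, here is a minimal sketch using a standard-library dataclass. The class name SourceMeta, the data_license and version fields, and all example values are assumptions made for illustration; only the four optional attributes come from the list above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SourceMeta:
    """Hypothetical per-source metadata record (name is an assumption)."""

    # Assumed pre-existing fields, included only to make the record realistic.
    data_license: str
    version: str
    # Attributes proposed above; all optional, defaulting to "unknown".
    rdp_url: Optional[str] = None
    non_commercial: Optional[bool] = None
    share_alike: Optional[bool] = None
    attribution: Optional[bool] = None


# Placeholder values, not the real terms of any particular source.
meta = SourceMeta(
    data_license="example-license",
    version="example-version",
    non_commercial=False,
    share_alike=True,
    attribution=True,
)
print(asdict(meta))
```

Leaving each new attribute as None by default lets existing sources load without a license review while still distinguishing "not yet recorded" from an explicit False.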
Drug approval status by locale is commonly requested information. We should capture this in our term normalization effort.
We're using a lot of raw SQL currently for both entering and pulling data -- for development reasons and for better security, we should alter this to use more of SQLAlchemy's feature base.
Currently we focus only on loading data for normalizers, but for this to be production ready / reusable we need to provide methods to update the data files used for the normalizers.
We currently have one lonely unit test implemented in pytest, but should build that out to cover the ChEMBL normalizer and other use cases.
@korikuzma and @jsstevenson have identified GitHub Actions as a good starting point for CI.
Make database updates a stand-alone task that can be called from a command line interface.
Click is a common and useful CLI tool. Typer is a Click-based CLI tool by the FastAPI creators. I mention the latter due to its closeness to the FastAPI toolset, but I think we should implement the CLI using Click directly due to its maturity and active community.
The first CLI util we'll want to make is to run one or more of the ETL ops created in #30. By default users should specify which source they want to update, though we should provide an --all flag that can shortcut to everything. We should build this utility with thought on how we would parallelize this I/O-bound process.
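The shape of such a utility can be sketched as follows. This is a minimal sketch, not the project's implementation: it uses the standard-library argparse as a stand-in for Click so that it runs without extra dependencies, and a thread pool as one possible way to parallelize I/O-bound updates. The SOURCES registry, the _fake_etl helper, and the source names in it are hypothetical placeholders for the real ETL operations.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def _fake_etl(name):
    """Return a placeholder ETL callable (hypothetical; real ETL ops differ)."""
    def run():
        return f"{name}: updated"
    return run


# Hypothetical registry mapping source names to ETL callables.
SOURCES = {name: _fake_etl(name) for name in ("chembl", "ncit", "wikidata")}


def select_sources(requested, update_all):
    """Resolve the list of sources to update, rejecting unknown names."""
    if update_all:
        return sorted(SOURCES)
    unknown = [s for s in requested if s not in SOURCES]
    if unknown:
        raise ValueError(f"unknown source(s): {', '.join(unknown)}")
    return list(requested)


def run_updates(names, max_workers=4):
    """Run ETL callables in threads; results keep the order of `names`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda n: SOURCES[n](), names))


def main(argv=None):
    parser = argparse.ArgumentParser(description="Update normalizer data")
    parser.add_argument("sources", nargs="*", help="sources to update")
    parser.add_argument("--all", dest="update_all", action="store_true",
                        help="update every known source")
    args = parser.parse_args(argv)
    names = select_sources(args.sources, args.update_all)
    if not names:
        parser.error("specify at least one source or pass --all")
    results = run_updates(names)
    for line in results:
        print(line)
    return results
```

Calling main(["ncit"]) updates a single source, while main(["--all"]) fans out over everything in the registry. Threads suit this case because the work is dominated by downloads and database writes rather than CPU time.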
We should include metadata on data licenses on a per-normalizer basis:
e.g. CC licenses: https://creativecommons.org/
This thread tracks known and public drug terminologies and normalization services. Initial list here: https://docs.google.com/spreadsheets/d/1bhA4RVt7HT2oIMVWeaXcnxFNcUQ4rEl9mB5cBSK8Kzw/edit#gid=0
It would be great to get NCIt concept codes where possible. For drugs, these will be descendant concepts of Pharmacologic Substance.
Right now, we manually download the normalizer's data file. We need to create methods to download from the source's site.
Currently, the normalizer list in query.py instantiates the normalizers on test runs (even when mocked), which raises file not found errors. We need to do some dependency injection or something of the sort to divert it to the correct pathways.
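One way to approach the dependency injection mentioned here is to pass normalizer factories into the query handler and instantiate them lazily, so that importing or constructing the handler never touches data files. The sketch below is only an illustration under that assumption; QueryHandler, FakeNormalizer, and the normalize method are hypothetical names, not the actual contents of query.py.

```python
class QueryHandler:
    """Hypothetical handler that receives normalizer factories at construction."""

    def __init__(self, factories):
        # Store callables only; nothing is instantiated (or loaded) yet.
        self._factories = dict(factories)
        self._instances = {}

    def _get(self, name):
        # Instantiate a normalizer the first time it is actually needed.
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]

    def search(self, term):
        # Preserve the array-of-responses shape, one entry per normalizer.
        return [self._get(name).normalize(term) for name in self._factories]


class FakeNormalizer:
    """Test double that needs no data file."""

    def normalize(self, term):
        return {"normalizer": "fake", "term": term.lower()}


# In tests, inject the fake instead of the file-backed normalizers.
handler = QueryHandler({"fake": FakeNormalizer})
print(handler.search("Cisplatin"))
```

Because instantiation is deferred to the first search call, a test can construct the handler with any mix of fakes without triggering file-not-found errors from normalizers it never uses.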
Our current filtering strategy to get relevant ChemIDplus records (checking for nomenclature tags) removes some entries that our NCIt data links to. So, we want to be able to opt-in records pointed to by NCIt. We need some combo of the following:
Separate those representing therapy concepts from those representing associated concepts.
The normalization endpoint returns a single merged record, but it should also provide a list of merge IDs to other possible matches, if there are any. This would include any distinct merged groups that come up at the same MatchType level (should only apply to TradeName/Alias/Xref/AssociatedWith).
We should capture every time this happens with grouped concepts drawn from the same source(s) in the logs under WARNING.
We also want to run a big analysis to get an idea of where/how often this could happen.
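The selection logic described above might look roughly like the following sketch. Everything here is an assumption for illustration: the MatchType ranking order, the (match type, merge ID) tuple shape, the logger name, and the placeholder IDs. It also simplifies the logging rule by warning on every ambiguous case rather than only when the competing groups are drawn from the same source(s).

```python
import logging
from enum import IntEnum

logger = logging.getLogger("therapy.query")  # hypothetical logger name


class MatchType(IntEnum):
    """Assumed ranking of the match levels named above; higher is stronger."""

    ASSOCIATED_WITH = 1
    XREF = 2
    ALIAS = 3
    TRADE_NAME = 4


def pick_merged(matches):
    """Return (primary merge ID, other merge IDs at the same match level).

    `matches` is a list of (MatchType, merge_id) tuples in lookup order.
    """
    if not matches:
        return None, []
    best = max(match_type for match_type, _ in matches)
    ids = []
    for match_type, merge_id in matches:
        # Keep distinct merged groups found at the strongest level only.
        if match_type == best and merge_id not in ids:
            ids.append(merge_id)
    primary, others = ids[0], ids[1:]
    if others:
        logger.warning(
            "Ambiguous %s match: returning %s, other groups: %s",
            best.name, primary, others,
        )
    return primary, others


primary, others = pick_merged([
    (MatchType.ALIAS, "group-a"),
    (MatchType.TRADE_NAME, "group-b"),
    (MatchType.TRADE_NAME, "group-c"),
])
print(primary, others)
```

Counting how often the warning fires over a large batch of search terms would be one simple way to start the analysis mentioned above.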
warnings)
Update ChEMBL ETL methods to work with DynamoDB.
(where possible)
DGIdb did not successfully identify ChEMBL concepts for several drug trade names:
We may not be searching those; let's be sure to add ChEMBL trade names to our search index.
Given the various changes and additions to the app's structure and features, we should update the public-facing docs to provide a more thorough introduction.
We will need to appropriately map terms to regimens and combination therapies.
Update DrugBank ETL methods to work with DynamoDB.
Collect and represent biosimilar linkages between drug concepts
Refactor CLI to work with DynamoDB
Provide in both endpoints (so shelved until #62 closes)
We currently have data load methods in normalizers. We should extend these classes to allow for the full Extract, Transform, Load (ETL) process to the ORM implemented in #29.
Related issue: #7
As we transition to a unified datastore for the application, the first objective will be to implement an ORM and define the schema for our database.
A guide on doing this with FastAPI is available here.
We already have a data model specified, which should help with constructing the initial db. As this is going to be a largely read-only database, we can and should specify some useful db indices during this process.
E.g., python3 cli.py raises an AttributeError; it should instead print the help message.
Currently the application will silently load data for the normalizers in the background if it is missing on startup. Users are not notified, and may think the application is hanging.
Possible solutions include:
Alongside #29, we should add migrations.
See https://alembic.sqlalchemy.org/en/latest/tutorial.html to get started.
Refactor query.py to work with DynamoDB
https://www.nlm.nih.gov/databases/download/chemidplus.html
https://chem.nlm.nih.gov/chemidsearch/api
Outstanding questions:
Update Wikidata ETL methods to work with DynamoDB.