cancervariants / gene-normalization
Services and guidelines for normalizing genes
Home Page: https://gene-normalizer.readthedocs.io/latest/
License: MIT License
The normalize endpoint should generate a single, merged concept for search terms.
NCBI has retired gene identifiers in the past, e.g.:
warnings attribute for each such entry, akin to: ncbigene:103344718 is a discontinued gene concept.
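A minimal sketch of how such warnings could be assembled, assuming the set of discontinued identifiers is already known. The function name and the `discontinued` argument are hypothetical and are not taken from the project's code.

```python
from typing import Iterable, List, Set


def discontinued_warnings(concept_ids: Iterable[str], discontinued: Set[str]) -> List[str]:
    """Build one warning message per discontinued concept ID.

    `discontinued` is an assumed, pre-computed set of retired identifiers
    (for example, collected from the NCBI history file).
    """
    return [
        f"{concept_id} is a discontinued gene concept"
        for concept_id in concept_ids
        if concept_id in discontinued
    ]


warnings = discontinued_warnings(
    ["ncbigene:103344718", "hgnc:37133"],
    {"ncbigene:103344718"},
)
```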
A Docker container would be useful.
Consider creating sample data to test ETL methods. If we don't go this route, we should clean up the current test data.
For Elastic Beanstalk
Separate those representing gene concepts from those representing associated concepts.
If not all sources are loaded in the database, the queries on the search endpoint will fail.
When specifying locations, we should use VRS Location objects:
ChromosomeLocation for the ISCN-style entries in the HGNC "location" field
SequenceLocation for the Chr/Start/Stop entries from Ensembl
This should reduce the following attributes:
seqid
start
stop
strand
location
down to:
location: (VRS Location)
strand: enum('+', '-', Null)
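The reduction described above can be sketched with plain dataclasses. These classes are hypothetical stand-ins for the VRS Location models, not the vrs-python API, and the example values are made up. Conversion between coordinate systems is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ChromosomeLocation:
    """Stand-in for a VRS ChromosomeLocation (ISCN-style, e.g. from HGNC)."""
    chr: str
    start: str  # cytoband name
    end: str    # cytoband name


@dataclass
class SequenceLocation:
    """Stand-in for a VRS SequenceLocation (Chr/Start/Stop, e.g. from Ensembl)."""
    sequence_id: str
    start: int
    end: int


@dataclass
class GeneLocation:
    """Replaces seqid/start/stop/strand/location with location + strand."""
    location: Union[ChromosomeLocation, SequenceLocation]
    strand: Optional[str] = None

    def __post_init__(self) -> None:
        # strand is restricted to '+', '-', or None
        if self.strand not in ("+", "-", None):
            raise ValueError(f"invalid strand: {self.strand!r}")


def from_chr_start_stop(seqid: str, start: int, stop: int, strand: Optional[str]) -> GeneLocation:
    """Fold the four flat attributes into a single GeneLocation."""
    return GeneLocation(SequenceLocation(seqid, start, stop), strand)


example = from_chr_start_stop("example_seq", 100, 200, "-")
```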
The merged concept for hgnc:37133 has alternate_labels: "A1BGAS", "FLJ23569", "NCRNA00181", "A1BG-AS". Querying these alternate_labels returns different match_type scores, when they theoretically should return the same score.
Apr 7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/main.py", line 114, in normalize
Apr 7 00:37:15 ip-10-130-14-142 web: resp = query_handler.normalize(html.unescape(q))
Apr 7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 483, in normalize
Apr 7 00:37:15 ip-10-130-14-142 web: matching_records.sort(key=self._record_order)
Apr 7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 412, in _record_order
Apr 7 00:37:15 ip-10-130-14-142 web: src = record['src_name'].upper()
Apr 7 00:37:15 ip-10-130-14-142 web: TypeError: 'NoneType' object is not subscriptable
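The traceback suggests that a `None` entry reached the sort key. One possible guard, sketched below, is to drop missing records before sorting. The `SOURCE_PRIORITY` ordering, the `concept_id` tie-breaker, and the function names are assumptions for illustration and do not reproduce the code in `gene/query.py`.

```python
from typing import Dict, List, Optional, Tuple

# Assumed source ordering, for illustration only.
SOURCE_PRIORITY = {"HGNC": 1, "ENSEMBL": 2, "NCBI": 3}


def record_order(record: Dict) -> Tuple[int, str]:
    """Sort key: source priority first, then concept ID."""
    src = record["src_name"].upper()
    # Unknown sources sort after all known ones.
    rank = SOURCE_PRIORITY.get(src, len(SOURCE_PRIORITY) + 1)
    return rank, record.get("concept_id", "")


def sort_records(records: List[Optional[Dict]]) -> List[Dict]:
    """Drop records that failed to load (None), then sort the rest."""
    present = [record for record in records if record is not None]
    present.sort(key=record_order)
    return present


ordered = sort_records([
    {"src_name": "NCBI", "concept_id": "ncbigene:1"},
    None,  # e.g. a concept ID with no matching record in the database
    {"src_name": "hgnc", "concept_id": "hgnc:5"},
])
```

Filtering hides the symptom only; if a `None` indicates a dangling reference in the database, it may also be worth logging the missing concept ID.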
Creating concept groups is slow, and doing so in the production environment is even slower. We should look into speeding this up.
Switch to downloading files from FTP sites
Our Elastic Beanstalk (EB) environment currently uses Python 3.7. We should upgrade to 3.8.
NCBI uses three different files (history, info, gff). History and info data are updated daily, but gff data is versioned by assembly. We currently use the timestamp at which we retrieve the data; we should fix this so that it is the timestamp from the FTP site. I think we should consider storing metadata for each file. Also, the current source meta does not indicate the files used and instead points to the FTP site.
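A sketch of what per-file metadata could look like, assuming the FTP server answers the `MDTM` command with the standard `213 YYYYMMDDHHMMSS` reply (with `ftplib`, that reply string can be obtained via `FTP.sendcmd("MDTM <path>")`). The `SourceFileMeta` class, its fields, the file names, and the assembly value are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SourceFileMeta:
    """Metadata stored per downloaded file rather than per source."""
    file_name: str
    kind: str      # "history", "info", or "gff"
    version: str   # date stamp for daily files, assembly name for gff


def parse_mdtm(response: str) -> datetime:
    """Parse an FTP MDTM reply such as '213 20210407003715' into UTC."""
    code, _, stamp = response.partition(" ")
    if code != "213":
        raise ValueError(f"unexpected MDTM reply: {response!r}")
    # Ignore optional fractional seconds after the first 14 digits.
    parsed = datetime.strptime(stamp.strip()[:14], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def build_meta(file_name: str, kind: str,
               mdtm_response: Optional[str] = None,
               assembly: Optional[str] = None) -> SourceFileMeta:
    """Version gff files by assembly and daily files by server timestamp."""
    if kind == "gff":
        if assembly is None:
            raise ValueError("gff files need an assembly name")
        return SourceFileMeta(file_name, kind, assembly)
    if mdtm_response is None:
        raise ValueError("daily files need an MDTM reply")
    return SourceFileMeta(file_name, kind, parse_mdtm(mdtm_response).strftime("%Y%m%d"))


history_meta = build_meta("example_history.gz", "history", mdtm_response="213 20210407003715")
gff_meta = build_meta("example.gff.gz", "gff", assembly="GRCh38.p13")
```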
We currently use only the non-alternative loci set. We should also include the alternative loci set from the download page.
Forgot to update schema examples to reflect VRS/VRSATILE updates
We had been using vrs-python models for validation. The validators added to the schemas are now causing pydantic validation errors when loading sources.
gene.vrs_locations
EBSampleApp-Python.iml
@jarbesfeld's GH Actions in py-gene-fusions are failing due to our schema classes
Not just the strongest match per source
@jarbesfeld will be using these models in py-gene-fusions
Some helpful posts:
We currently have the production database as the default, but we should switch to using the local database.
Some models/fields have been renamed or deprecated
Add an option to CLI to use local files rather than downloading from the source's FTP site
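One way such an option could look, sketched with the standard library's `argparse`. The flag name `--use-local-files`, the `--source` argument, and the helper functions are hypothetical and are not the project's existing CLI.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical flag for skipping downloads."""
    parser = argparse.ArgumentParser(description="Load gene source data.")
    parser.add_argument("--source", default=None,
                        help="name of the source to load (hypothetical argument)")
    parser.add_argument("--use-local-files", action="store_true",
                        help="read previously downloaded files instead of fetching from FTP")
    return parser


def data_mode(argv: Optional[List[str]] = None) -> str:
    """Return 'local' when the flag is set, otherwise 'download'."""
    args = build_parser().parse_args(argv)
    return "local" if args.use_local_files else "download"


mode = data_mode(["--source", "ncbi", "--use-local-files"])
```

Keeping download as the default preserves current behavior, so the flag is opt-in.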
This will help with going serverless