The gene-normalization from cancervariants

gene-normalization's Issues

The normalize endpoint should generate a single, merged concept for search terms.

Implement NCBI Normalizer

https://www.ncbi.nlm.nih.gov/home/download/

Update search and normalize response

Set use_enum_values in pydantic model config
Change response_datetime to str

Import architecture updates from therapy-normalization

Capture previous gene identifiers from NCBI

NCBI has retired gene identifiers in the past, e.g.:

ncbigene:401317 now maps to ncbigene:9586. Our normalizer should match the old ID to the current record. This should be treated analogously to the "previous symbols" attribute in HGNC.
ncbigene:103344718 is a discontinued gene. We should normalize to concepts like this, but also have a status attribute that makes it clear this is no longer considered a gene. We should also emit a warning in our warnings attribute for each such entry, akin to: ncbigene:103344718 is a discontinued gene concept.
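
A minimal sketch of the two cases above, assuming the NCBI history data has already been loaded into two lookups. The names `PREVIOUS_IDS`, `DISCONTINUED_IDS`, `NormalizedGene`, and `resolve_ncbi_id`, and the status values, are hypothetical and not part of the existing code:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical in-memory stand-ins for data parsed from the NCBI history file.
PREVIOUS_IDS: Dict[str, str] = {"ncbigene:401317": "ncbigene:9586"}
DISCONTINUED_IDS: Set[str] = {"ncbigene:103344718"}


@dataclass
class NormalizedGene:
    concept_id: str
    status: str  # "current" or "discontinued" (assumed value names)
    warnings: List[str] = field(default_factory=list)


def resolve_ncbi_id(query: str) -> NormalizedGene:
    """Map a retired ID to its current record, or flag a discontinued one."""
    concept_id = query.strip().lower()
    if concept_id in DISCONTINUED_IDS:
        # Still normalize to the concept, but mark it and emit a warning.
        warning = f"{concept_id} is a discontinued gene concept"
        return NormalizedGene(concept_id, "discontinued", [warning])
    # IDs with no history entry pass through unchanged.
    return NormalizedGene(PREVIOUS_IDS.get(concept_id, concept_id), "current")
```

Keeping the discontinued check ahead of the previous-ID lookup means a discontinued concept is never silently rewritten to another record.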

Allow SEQREPO_DATA_PATH to be set by env var
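
One way this could look, as a sketch only. The fallback path and the helper name `get_seqrepo_data_path` are assumptions for illustration; only the `SEQREPO_DATA_PATH` variable name comes from this issue:

```python
import os
from pathlib import Path
from typing import Mapping

# Assumed fallback location; the project's real default may differ.
DEFAULT_SEQREPO_DATA_PATH = "/usr/local/share/seqrepo/latest"


def get_seqrepo_data_path(env: Mapping[str, str] = os.environ) -> Path:
    """Return the SeqRepo data directory, preferring the env var if set."""
    return Path(env.get("SEQREPO_DATA_PATH", DEFAULT_SEQREPO_DATA_PATH))
```

Passing the environment as a parameter keeps the helper easy to test without modifying `os.environ`.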

Add other_identifier match type and DB reference

Test data

Consider creating sample data to test ETL methods. If we don't go this route, we should clean up the current test data.

Add xrefs attribute

Separate the references that represent gene concepts from those that represent associated concepts.

Error when searching without all sources loaded

If not all sources are loaded in the database, the queries on the search endpoint will fail.
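
A possible shape for a fix, sketched under the assumption that each loaded source can be represented as a lookup function. `ALL_SOURCES`, `search_loaded_sources`, and the response keys shown here are hypothetical, not the current search implementation:

```python
from typing import Callable, Dict, List

# Source names mentioned in these issues; the ordering is an assumption.
ALL_SOURCES = ("HGNC", "Ensembl", "NCBI")

Lookup = Callable[[str], List[dict]]


def search_loaded_sources(query: str, loaded: Dict[str, Lookup]) -> dict:
    """Query only the sources that are present; warn about the rest."""
    response: dict = {"query": query, "warnings": [], "source_matches": {}}
    for source in ALL_SOURCES:
        lookup = loaded.get(source)
        if lookup is None:
            # Record the gap instead of failing the whole request.
            response["warnings"].append(f"{source} is not loaded; skipping")
            continue
        response["source_matches"][source] = lookup(query)
    return response
```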

VRS locations

When specifying locations, we should use VRS Location objects.

ChromosomeLocation for the ISCN-style entries in the HGNC "location" field

SequenceLocation for the Chr/Start/Stop entries from Ensembl.

This should reduce the following attributes:
seqid
start
stop
strand
location

down to:
location: (VRS Location)
strand: enum( '+', '-', Null)
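
The reduced record could be modeled roughly as follows. This is an illustrative sketch: the two location classes are simplified stand-ins rather than the actual VRS schema, and `GeneLocation`, `Strand`, and all field names are assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class Strand(str, Enum):
    FORWARD = "+"
    REVERSE = "-"


@dataclass(frozen=True)
class ChromosomeLocation:
    """Simplified stand-in for an ISCN-style location (HGNC)."""
    chr: str
    start: str  # cytoband, as a string
    end: str


@dataclass(frozen=True)
class SequenceLocation:
    """Simplified stand-in for a Chr/Start/Stop location (Ensembl)."""
    sequence_id: str
    start: int
    end: int


@dataclass
class GeneLocation:
    # One location object replaces seqid/start/stop/location.
    location: Union[ChromosomeLocation, SequenceLocation, None]
    # None covers the "Null" case for unknown strand.
    strand: Optional[Strand] = None
```

For example, with made-up placeholder values, `GeneLocation(SequenceLocation("example-seq-id", 100, 200), Strand("-"))` carries everything the five original attributes did.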

The merged concept for hgnc:37133 has alternate_labels: "A1BGAS", "FLJ23569", "NCRNA00181", "A1BG-AS". Querying these alternate_labels returns different match_type scores, when they theoretically should return the same score.

TPX2 raises TypeError in search and normalize

Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/main.py", line 114, in normalize
Apr  7 00:37:15 ip-10-130-14-142 web: resp = query_handler.normalize(html.unescape(q))
Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 483, in normalize
Apr  7 00:37:15 ip-10-130-14-142 web: matching_records.sort(key=self._record_order)
Apr  7 00:37:15 ip-10-130-14-142 web: File "/var/app/current/gene/query.py", line 412, in _record_order
Apr  7 00:37:15 ip-10-130-14-142 web: src = record['src_name'].upper()
Apr  7 00:37:15 ip-10-130-14-142 web: TypeError: 'NoneType' object is not subscriptable
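
The traceback suggests that `matching_records` contained a `None` entry when it was sorted. One defensive option, sketched here with standalone functions rather than the real `query.py` methods, is to drop missing records before sorting. `SOURCE_PRIORITY` and its ordering are assumptions:

```python
from typing import List, Optional, Tuple

# Assumed source ranking; the real ordering lives in query.py.
SOURCE_PRIORITY = {"HGNC": 1, "ENSEMBL": 2, "NCBI": 3}


def record_order(record: dict) -> Tuple[int, str]:
    """Sort key: source rank first, then concept ID."""
    src = record["src_name"].upper()
    rank = SOURCE_PRIORITY.get(src, len(SOURCE_PRIORITY) + 1)
    return (rank, record.get("concept_id", ""))


def sort_matching_records(records: List[Optional[dict]]) -> List[dict]:
    """Drop records that failed to load (None), then sort the rest."""
    present = [r for r in records if r is not None]
    present.sort(key=record_order)
    return present
```

Filtering hides the symptom only; why a lookup returned `None` for TPX2 would still need investigating.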

Improve performance for creating concept groups

Creating concept groups is slow, and it is even slower in the production environment. We should look into speeding this up.

Consider supporting GRCh37 assemblies

Consider switching current partition and sort keys

We currently add a GSI on concept_id in #97. However, we should see if we're able to use concept_id as the partition key and label_and_type as the sort key, which would avoid creating the extra GSI. We did not do this in #97 in the interest of time.
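
For reference, the proposed key layout would look roughly like this as a DynamoDB table definition. The sketch only builds the keyword arguments that would be passed to a `create_table` call; the helper name, the billing mode, and the string attribute types are assumptions:

```python
from typing import Any, Dict


def build_table_definition(table_name: str) -> Dict[str, Any]:
    """Build create_table arguments with concept_id as the partition key."""
    return {
        "TableName": table_name,
        "KeySchema": [
            {"AttributeName": "concept_id", "KeyType": "HASH"},       # partition key
            {"AttributeName": "label_and_type", "KeyType": "RANGE"},  # sort key
        ],
        "AttributeDefinitions": [
            {"AttributeName": "concept_id", "AttributeType": "S"},
            {"AttributeName": "label_and_type", "AttributeType": "S"},
        ],
        "BillingMode": "PAY_PER_REQUEST",  # assumed; not stated in the issue
    }
```

Because a DynamoDB query needs the partition key, lookups by label_and_type alone would then need another access path, which is the trade-off to evaluate.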

FTP Download

Switch to downloading files from FTP sites

Fix normalize response

EB use python 3.8

Our EB currently uses python 3.7. We should upgrade to 3.8.

NCBI Source Meta

NCBI uses three different files (history, info, gff). History and info data are updated daily, but gff data is versioned by assembly. We currently record the timestamp at which we retrieve the data; we should fix this so that it is the timestamp from the FTP site. I think we should consider storing metadata for each file. Also, the current source meta does not indicate which files were used and instead points only to the FTP site.
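
A sketch of what per-file metadata might look like, together with a helper that parses the reply to an FTP `MDTM` command (as returned by, for example, `ftplib.FTP.sendcmd("MDTM <filename>")`). `SourceFileMeta`, its fields, and `parse_mdtm` are hypothetical names, not existing project code:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SourceFileMeta:
    kind: str     # "history", "info", or "gff"
    url: str      # location of the specific file, not just the FTP site
    version: str  # server timestamp (history/info) or assembly name (gff)


def parse_mdtm(response: str) -> datetime:
    """Parse an FTP MDTM reply such as '213 20210407003715'."""
    code, _, stamp = response.partition(" ")
    if code != "213":
        raise ValueError(f"unexpected MDTM reply: {response!r}")
    # Ignore optional fractional seconds after the 14-digit timestamp.
    return datetime.strptime(stamp.strip()[:14], "%Y%m%d%H%M%S")
```

Usage, with a placeholder URL: `SourceFileMeta("info", "ftp://example.org/gene_info.gz", parse_mdtm(reply).strftime("%Y%m%d"))`.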

Better documentation
- Add type hints
- DRY
- Remove unused code
~~- [ ] Rather than using vrs-python's VRS models, use ga4gh.vrsatile.pydantic models in gene.vrs_locations~~
- Check if we can remove EBSampleApp-Python.iml?
- Add flake8-annotations + double quotes
- String enums in schemas

NCBI
Ensembl

Ensembl: biotype
