data-transform-vkgl's Introduction

THIS MOLGENIS VERSION IS IN ARCHIVE MODE. PLEASE USE NEXT GENERATION AT MOLGENIS-EMX2

Welcome to MOLGENIS

MOLGENIS is a collaborative open source project on a mission to generate great software infrastructure for life science research.

Develop

MOLGENIS has a frontend and a backend, which you can develop separately. To develop an API and an app simultaneously, you need to check out both.

Useful links

Deploy MOLGENIS

data-transform-vkgl's People

Contributors

bartcharbon, dennishendriksen, dependabot[bot], fdlk, marikaris, sidohaakma

data-transform-vkgl's Issues

Content-Type header missing in translator service response

How to Reproduce

curl -i -H 'Content-Type: application/json' -d '["NM_000088.3:c.589G>T"]' <url>

Expected behavior

Response contains a Content-Type: application/json header.

Observed behavior

Response doesn't contain a Content-Type: application/json header.

Genes with invalid HGNC symbols are always written to errorfile

For some genes, the symbol might still be valid, although they are not HGNC approved. This is the case for LOC genes and Corf genes.

This variant, for instance, wasn't valid in the April release:
hg19 2 NC_000002.11:g.88019368G>A LOC730268
Currently it does have a valid HGNC symbol, but this can happen again in the future for other genes.
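
One way to recognise such symbols is a pattern check that runs before a gene is written to the error file. The sketch below is only an illustration: the two regular expressions and the function name `is_tolerated_unapproved_symbol` are assumptions, not part of the pipeline.

```python
import re

# Assumed patterns for symbols that are not HGNC approved but may still be usable.
LOC_PATTERN = re.compile(r"^LOC\d+$")                  # e.g. LOC730268
CORF_PATTERN = re.compile(r"^C(\d{1,2}|X|Y)orf\d+$")   # e.g. C9orf72


def is_tolerated_unapproved_symbol(symbol: str) -> bool:
    """Return True for LOC-style and Corf-style symbols (hypothetical helper)."""
    return bool(LOC_PATTERN.match(symbol) or CORF_PATTERN.match(symbol))
```

A symbol for which this returns True could be passed through instead of being rejected outright.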

More info

Translation fails because extrinsic validation is not possible

How to Reproduce

curl -i -H 'Content-Type: application/json' -d '["NM_000014.4:c.1104+11T>C"]' <url>/h2v
HTTP/1.1 200 OK
Server: nginx/1.17.10
Date: <cut>
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: <cut>

Expected behavior

The variant is translated without performing validation. Optionally, the validation error is reported as a warning alongside the result.

Observed behavior

[{"error": "Cannot validate sequence of an intronic variant (NM_000014.4:c.1104+11T>C)"}]

See: https://hgvs.readthedocs.io/en/stable/_modules/hgvs/validator.html
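
The expected behavior could be sketched roughly as follows. The `translate` and `validate` callables are hypothetical stand-ins for whatever the translator service uses internally, and this sketch downgrades every validation error to a warning; a real implementation would probably restrict that to the "cannot validate" case.

```python
from typing import Callable, Dict, Optional


def translate_leniently(
    hgvs_string: str,
    translate: Callable[[str], Dict],
    validate: Callable[[str], None],
) -> Dict:
    """Translate a variant and report a validation failure as a warning only."""
    warning: Optional[str] = None
    try:
        validate(hgvs_string)
    except Exception as error:  # the concrete exception type is not assumed here
        warning = str(error)
    result = dict(translate(hgvs_string))
    if warning is not None:
        # Keep the translation and attach the validation message to it.
        result["warning"] = warning
    return result
```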

Build stability warnings

> mvn clean spring-boot:run

[WARNING] 
[WARNING] Some problems were encountered while building the effective model for org.molgenis:data-transform-vkgl:jar:1.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 177, column 15
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING] 

Variant formatter reports 'Failed to fetch from SeqRepo' errors

How to Reproduce

Download jar via https://github.com/molgenis/mvl-vcf-converter/releases/tag/v0.0.1

java -jar mvl-vcf-converter.jar -i <path>/MVL_Totaal-Molecular_variants-2020-10-08_07-47-00.txt -t https://variants.molgenis.org/h2v -o <path>/MVL_Totaal-Molecular_variants-2020-10-08_07-47-00.vcf

Ask me for MVL_Totaal-Molecular_variants-2020-10-08_07-47-00.txt which is not a public resource.

Expected

No SeqRepo errors

Actual

Many errors such as:

skipping variant due to error 'NM_025185.4:c.536G>C': Failed to fetch NM_025185.4 from SeqRepo (/usr/local/share/seqrepo/2019-06-20) ('Alias NM_025185.4 (namespace: None)')
skipping variant due to error 'NM_025185.4:c.645G>A': Failed to fetch NM_025185.4 from SeqRepo (/usr/local/share/seqrepo/2019-06-20) ('Alias NM_025185.4 (namespace: None)')

Pipeline outputs "" for empty timestamp input

How to Reproduce

Run the pipeline with input based on:

$ head -n 2 vkgl_export_amc_20210614.tsv | cut -f 1-3
timestamp       id      chromosome
                1

Expected

Output is similar to:

$ head -n 2 vkgl_raw_amc_v2.tsv | cut -f 1-3
timestamp       id      chromosome
              1

Observed

All timestamps in the output are empty double quotes.

$ head -n 2 vkgl_raw_amc_v2.tsv | cut -f 1-3
timestamp       id      chromosome
""              1

Importing this file in MOLGENIS results in:

Conversion failure in entity type [vkgl_raw_amc_v2] attribute [timestamp]; Failed to convert from type [java.lang.String] to type [java.time.Instant] for value '"'; nested exception is java.time.format.DateTimeParseException: Text '"' could not be parsed, unparsed text found at index 0
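
For comparison, this is how an empty value can be kept as an empty field when writing TSV. It is a minimal sketch using Python's standard `csv` module, not the pipeline's own writer; the function name `write_tsv` is an assumption.

```python
import csv
import io


def write_tsv(rows):
    """Write rows as TSV, leaving empty values as empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    for row in rows:
        # Normalise None to an empty string so no placeholder text is written.
        writer.writerow(["" if value is None else value for value in row])
    return buffer.getvalue()
```

With minimal quoting, an empty field in a multi-column row is written as nothing between the two tab characters, which is the form the importer expects here.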

Variants that are reported twice, once with the correct gene and once with an outdated one, are not reported as duplicate

Run a duplicate variant through the pipeline, once with the current gene name and once with an outdated one, for instance KMT2D and MLL2.

Example variant: 12:49440141-49440141 G>C on KMT2D (previously MLL2)

Expected:
Variant is reported as duplicate and written to the errorfile

Observed:
Duplicate variant determination is done before gene translation. The outdated gene name is translated correctly afterwards, so the variant is reported twice in the output file.
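
Reordering the two steps might look like the sketch below, which translates the gene symbol first and only then groups variants. The `PREVIOUS_TO_CURRENT` mapping and the dictionary keys are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from outdated symbols to current ones.
PREVIOUS_TO_CURRENT = {"MLL2": "KMT2D"}


def find_duplicates(variants):
    """Group variants by position, alleles and translated gene symbol."""
    groups = defaultdict(list)
    for variant in variants:
        # Translate the symbol before building the comparison key.
        gene = PREVIOUS_TO_CURRENT.get(variant["gene"], variant["gene"])
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"], gene)
        groups[key].append(variant)
    return [group for group in groups.values() if len(group) > 1]
```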

Use `GCA_000001405.15_GRCh38_no_alt_analysis_set` as `GRCh38` reference

Please consider switching the GRCh38 reference from GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set to GCA_000001405.15_GRCh38_no_alt_analysis_set, because:

Variant identifiers change when gene symbols change

If a variant's gene symbol changes but still indicates the same gene (e.g. from the previous approved symbol to the current approved symbol), the variant identifier changes as well. This causes issues for users who use the identifier to keep track of variant changes over time.

The variant identifier should be based on the stable HGNC gene identifier instead of the unstable HGNC gene symbol.
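
A minimal sketch of that idea follows. The lookup table, the placeholder value `HGNC:0000` and the underscore-separated layout are all assumptions; the real identifier format is not specified here.

```python
# Hypothetical lookup from any known symbol to its HGNC ID.
# "HGNC:0000" is a placeholder, not a real identifier.
SYMBOL_TO_HGNC_ID = {"KMT2D": "HGNC:0000", "MLL2": "HGNC:0000"}


def variant_identifier(chrom, pos, ref, alt, symbol):
    """Build an identifier from the stable gene ID instead of the symbol."""
    hgnc_id = SYMBOL_TO_HGNC_ID[symbol]
    return "_".join([chrom, str(pos), ref, alt, hgnc_id])
```

Because both the previous and the current symbol resolve to the same gene ID, the identifier stays the same when a symbol is renamed.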

Id's are too long

IDs should be shorter than 255 characters.

Observed: refs and alts can be very large, which can make the ID longer than 255 characters.
Expected: IDs should not be longer than 255 characters. They should be hashed.
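
Hashing gives a fixed-length identifier regardless of allele size. The sketch below uses SHA-256 from Python's standard library as one possible choice; the hash function and the input layout are assumptions, not the pipeline's implementation.

```python
import hashlib


def hashed_identifier(chrom, pos, ref, alt, gene_id):
    """Return a fixed-length identifier derived from the variant fields."""
    raw = "_".join([chrom, str(pos), ref, alt, gene_id])
    # A SHA-256 hex digest is always 64 characters long.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```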

Treat gene symbols with incorrect casing as invalid

Symbols contain only uppercase Latin letters and Arabic numerals, and punctuation is avoided, with an exception for hyphens in specific groups

source: https://www.genenames.org/about/guidelines/

Currently, gene symbols with invalid casing (e.g. all lower-case) are considered valid gene symbols. This causes problems when determining consensus, because the gene-variant identifiers differ. Furthermore, downstream users (e.g. the VEP VKGL plugin or Alissa) have to take these casing issues into account when coupling data.

Gene-variants with gene symbols with invalid casing should be written to the error file with a message stating that the gene symbol is invalid because it contains lower-case characters.
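
Such a check could be sketched as below. The exception for Corf-style symbols such as C9orf72, which contain a lower-case "orf", is an assumption added here, as are the function name and the exact message wording.

```python
import re

# Assumed exception: Corf-style symbols legitimately contain lower-case "orf".
CORF_PATTERN = re.compile(r"^C(\d{1,2}|X|Y)orf\d+$")


def casing_error(symbol):
    """Return an error message for symbols with lower-case characters, else None."""
    if CORF_PATTERN.match(symbol):
        return None
    if any(character.islower() for character in symbol):
        return (
            f"Gene symbol '{symbol}' is invalid because it contains "
            "lower-case characters"
        )
    return None
```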

Valid withdrawn genes using alias or previous symbols

Consider the following two records:

HGNC ID	Status	Approved symbol	Approved name	Alias symbol	Previous symbol	Chromosome	Chromosome location	Locus group	NCBI gene ID	Ensembl gene ID	UCSC gene ID
HGNC:623	Entry Withdrawn	APPL	amyloid beta (A4) precursor protein-like		APPL1	9	9q31-qter	other
HGNC:13196	Entry Withdrawn	ZWINTAS	ZWINT antisense RNA	MPP5		10	10q21.1	non-coding RNA

Expected: APPL, APPL1, ZWINTAS, MPP5 are 'invalid' genes
Actual: APPL and ZWINTAS are 'invalid' genes. APPL1 and MPP5 are 'valid' genes.
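
The expected outcome can be sketched by collecting every symbol column of a withdrawn record. The column names come from the header above; the comma handling for multi-valued fields is an assumption. This sketch does not address what should happen when the same symbol is also the approved symbol of another, non-withdrawn record.

```python
import csv
import io

SYMBOL_COLUMNS = ("Approved symbol", "Alias symbol", "Previous symbol")


def invalid_symbols(hgnc_tsv):
    """Collect approved, alias and previous symbols of withdrawn entries."""
    invalid = set()
    reader = csv.DictReader(io.StringIO(hgnc_tsv), delimiter="\t")
    for record in reader:
        if record["Status"] != "Entry Withdrawn":
            continue
        for column in SYMBOL_COLUMNS:
            # Missing trailing columns are read as None, so fall back to "".
            for symbol in (record[column] or "").split(","):
                if symbol.strip():
                    invalid.add(symbol.strip())
    return invalid
```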

Translator service produces invalid alt values

How to Reproduce

curl -i -H 'Content-Type: application/json' -d '["NM_144670.5:c.2719_2720delGG"]' <url>/h2v
HTTP/1.1 200 OK
Server: nginx/1.17.10
Date: <cut>
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: <cut>

Expected behavior

[{"ref": "AGG", "alt": "A", "chrom": "12", "pos": 9007380, "type": "del"}]

Observed behavior

[{"ref": "GG", "alt": ".", "chrom": "12", "pos": 9007381, "type": "del"}]

The values of the ref/alt/chrom/pos fields should be VCF values, since that is the goal of the translator service: converting HGVS to VCF. In VCF, a . value for the alt allele denotes a missing value, meaning that there are no alternate alleles for the record (https://samtools.github.io/hts-specs/VCFv4.2.pdf).

I am aware of the keep_left_anchor=True URL parameter which produces the expected behavior.

Proposed solution:

  • Make keep_left_anchor=True the default and only behavior since keep_left_anchor=False can't produce valid VCF in some cases.
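
The difference between the two outputs is the left anchor base that VCF requires for deletions. A minimal sketch of that conversion follows; the function name is hypothetical and the preceding reference base is passed in rather than looked up.

```python
def left_anchor_deletion(pos, ref, preceding_base):
    """Convert an unanchored deletion (alt '.') into VCF-style fields."""
    return {
        "pos": pos - 1,                # shift to the anchor base
        "ref": preceding_base + ref,   # anchor base plus deleted bases
        "alt": preceding_base,         # only the anchor base remains
    }
```

Applied to the observed output above with preceding base A, this yields the ref, alt and pos values of the expected output.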

Variant validator has issues with delins that's possibly an inversion

Variant:

chr start stop ref alt gene
2 215632255 215632256 CA TG BARD1

Converted to:
NC_000002.11:g.215632255_215632256delinsTG

When submitting this to the variant validation service we get the following error:
Unexpected error processing :NC_000002.11:g.215632255_215632256delinsTG; 'Inv' object has no attribute 'alt'

Seeing that error, the variant might actually be an inversion, which would mean that we need to write it like:
NC_000002.11:g.215632255_215632256inv

However, looking at the reported variant, this is probably not an inversion.

Reference genome from 215632253-215632257:
CACATG
Complementary strand:
GTGTAC
Taking a look at the complementary strand suggests it might be an inversion (GT>TG), but if the variant is on the forward strand (CA>TG), it isn't. We need to look into that and find out what the actual variant is.

To do:

  1. Find out if the variant is an inversion
  2. Write code that creates proper HGVS g for inversions
  3. Find out why the variant validator won't take this variant, whether it's an inversion or not
  4. Find out what happens with other inversions:
    a. check for the delins notation
    b. check for the proper inv notation
  5. Find out how we can properly process correct variants and report invalid ones
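
For steps 1 and 2, a possible starting point is the HGVS definition of an inversion: the new sequence is the reverse complement of the sequence it replaces. The sketch below applies that rule; the function names are assumptions. Under this rule CA to TG does count as an inversion, which could explain the 'Inv' object in the error above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def hgvs_g(accession, start, end, ref, alt):
    """Write a multi-base substitution as inv when alt is the reverse complement of ref."""
    if len(ref) > 1 and alt == reverse_complement(ref):
        return f"{accession}:g.{start}_{end}inv"
    return f"{accession}:g.{start}_{end}delins{alt}"
```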

SyntaxError: invalid syntax when starting translator service

How to Reproduce

data-transform-vkgl\docker> docker-compose up

Expected behavior

No errors

Observed behavior

Starting docker_variant-formatter_1 ... done
Starting docker_uta_1               ... done
Starting docker_seqrepo_1           ... done
Attaching to docker_uta_1, docker_seqrepo_1, docker_variant-formatter_1
uta_1                | LOG:  database system was shut down at 2020-09-24 04:47:43 UTC
uta_1                | LOG:  MultiXact member wraparound protections are now enabled
uta_1                | LOG:  autovacuum launcher started
uta_1                | LOG:  database system is ready to accept connections
seqrepo_1            | WARNING:biocommons.seqrepo.cli:2018-08-21: instance already exists; skipping
docker_seqrepo_1 exited with code 0
variant-formatter_1  | No handlers could be found for logger "biocommons.seqrepo"
variant-formatter_1  | /usr/local/lib/python2.7/site-packages/bioutils/_versionwarning.py:12: UserWarning: Support for Python < 3.6 is now deprecated and will be dropped on 2019-03-31. See https://github.com/biocommons/org/wiki/Migrating-to-Python-3.6
variant-formatter_1  |   "Support for Python < 3.6 is now deprecated and"
variant-formatter_1  | Local instances (/usr/local/share/seqrepo)
variant-formatter_1  |   2018-08-21
variant-formatter_1  | No handlers could be found for logger "hgvs"
variant-formatter_1  | /usr/local/lib/python2.7/site-packages/bioutils/_versionwarning.py:12: UserWarning: Support for Python < 3.6 is now deprecated and will be dropped on 2019-03-31. See https://github.com/biocommons/org/wiki/Migrating-to-Python-3.6
variant-formatter_1  |   "Support for Python < 3.6 is now deprecated and"
variant-formatter_1  | Traceback (most recent call last):
variant-formatter_1  |   File "./server.py", line 2, in <module>
variant-formatter_1  |     from hgvs.parser import Parser
variant-formatter_1  |   File "/usr/local/lib/python2.7/site-packages/hgvs/parser.py", line 28, in <module>
variant-formatter_1  |     import hgvs.sequencevariant
variant-formatter_1  |   File "/usr/local/lib/python2.7/site-packages/hgvs/sequencevariant.py", line 11, in <module>
variant-formatter_1  |     import hgvs.variantmapper
variant-formatter_1  |   File "/usr/local/lib/python2.7/site-packages/hgvs/variantmapper.py", line 17, in <module>
variant-formatter_1  |     import hgvs.normalizer
variant-formatter_1  |   File "/usr/local/lib/python2.7/site-packages/hgvs/normalizer.py", line 103
variant-formatter_1  |     raise HGVSInvalidVariantError(f"{var}: coordinates are out-of-bounds")
variant-formatter_1  |                                                                         ^
variant-formatter_1  | SyntaxError: invalid syntax
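
The traceback points at an f-string in hgvs/normalizer.py, while the paths show the container running Python 2.7. F-strings were introduced in Python 3.6, so the installed hgvs release appears to need a newer interpreter than the image provides. A small guard like the sketch below could make that mismatch explicit at startup; the function name is an assumption.

```python
import sys

MINIMUM_VERSION = (3, 6)  # f-strings were introduced in Python 3.6


def supports_fstrings(version_info=sys.version_info):
    """Return True when the interpreter is new enough to parse f-strings."""
    return tuple(version_info[:2]) >= MINIMUM_VERSION
```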

Pseudogenes should not be submitted as such to ClinVar

ClinVar doesn't accept pseudogenes, so we should remove their gene names from the ClinVar submission files to get them through validation. A pseudogene that has caused trouble: FTHL18P.

If a variant with a pseudogene is encountered in the ClinVar submission tool, its genename should be removed.
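
As a rough sketch, the removal could be a lookup against a set of known pseudogenes. The set below contains only the example from this issue; where the full list would come from is not specified here, and the function name is hypothetical.

```python
# Hypothetical lookup; only the pseudogene named in this issue is listed.
PSEUDOGENES = {"FTHL18P"}


def gene_for_clinvar(symbol):
    """Return an empty gene name for pseudogenes, otherwise the symbol itself."""
    return "" if symbol in PSEUDOGENES else symbol
```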
