counsyl / hgvs

HGVS variant name parsing and generation

License: MIT License

Makefile 0.80% Python 99.20%

hgvs's Issues

enhancement - add compatibility with RNA and non-coding RNA sequences

At the moment the hgvs module can only work with coding DNA, genomic and protein sequences. It would be great if the module accepted all sequence types. I would be very happy to contribute to this task, so please let me know how I can help.

Syntax error when trying to parse valid R variant

import hgvs.parser
hp = hgvs.parser.Parser()
hp.parse_hgvs_variant("NM13423:r.831_832ins831+1_831+60")
...
ometa.runtime.ParseError:
NM13423:r.831_832ins831+1_831+60
^
Parse error at line 1, column 20: Syntax error. trail: [rna_iupac rna rna_ins rna_edit r_posedit r_variant hgvs_variant]
...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
hgvs.exceptions.HGVSParseError: NM13423:r.831_832ins831+1_831+60: char 20: Syntax error

Issue with installing in Ubuntu

I seem to have an issue installing HGVS. When running "python setup.py install" I encounter the following:

Traceback (most recent call last):
  File "setup.py", line 35, in <module>
    main()
  File "setup.py", line 30, in main
    parse_requirements('requirements-dev.txt')],
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1200, in parse_requirements
    skip_regex = options.skip_requirements_regex
AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

Support inversions (inv)

NM_007300.4:c.2902_2959inv currently fails

https://varnomen.hgvs.org/recommendations/DNA/variant/inversion/
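As a rough illustration of what inversion support involves: under the HGVS recommendation linked above, an inversion replaces the affected range with its reverse complement. The sketch assumes a plain string as the reference and 1-based inclusive coordinates. `apply_inversion` is a hypothetical helper, not a pyhgvs function.

```python
# Hypothetical helper, not part of pyhgvs: apply an HGVS-style inversion
# to a toy reference sequence.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def apply_inversion(seq, start, end):
    """Return seq with positions start..end (1-based, inclusive)
    replaced by their reverse complement."""
    if not 1 <= start < end <= len(seq):
        raise ValueError("inversion range must satisfy 1 <= start < end <= len(seq)")
    inverted = seq[start - 1:end].translate(_COMPLEMENT)[::-1]
    return seq[:start - 1] + inverted + seq[end:]


# Toy example: invert positions 2..4 ("ACG") of "AACGTT".
print(apply_inversion("AACGTT", 2, 4))
```

For a c. name such as the one above, the transcript coordinates would first have to be mapped to genomic coordinates. That step is not shown here.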

Catch invalid HGVS names like NC_000005.10:g.177421339_177421327delACTCGAGTGCTCC

NC_000005.10:g.177421339_177421327delACTCGAGTGCTCC appears in ClinVar, and is an invalid name (the genomic start/stop coords are not in increasing order). This causes parse_hgvs_name to raise an IndexError. It should raise InvalidHGVSName instead.
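A minimal sketch of the kind of up-front check being requested, assuming it is enough to compare the two coordinates of a g. range before any sequence lookup. The `InvalidHGVSName` class here is a local stand-in for the library's exception, and `check_genomic_range` is a hypothetical helper.

```python
import re


class InvalidHGVSName(ValueError):
    """Local stand-in for the library's exception of the same name."""


# Matches a genomic range such as ":g.100_200" (illustrative pattern only).
_GENOMIC_RANGE = re.compile(r":g\.(\d+)_(\d+)")


def check_genomic_range(name):
    """Raise InvalidHGVSName if a g. range has start > end, else return name."""
    match = _GENOMIC_RANGE.search(name)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        if start > end:
            raise InvalidHGVSName(
                "%s: start %d is greater than end %d" % (name, start, end))
    return name


try:
    check_genomic_range("NC_000005.10:g.177421339_177421327delACTCGAGTGCTCC")
except InvalidHGVSName as error:
    print("rejected:", error)
```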

'dict' object has no attribute 'tx_position'

python hgvs-convert.py

DEBUG seqdb._create_seqLenDict: Building sequence length index...
Traceback (most recent call last):
  File "hgvs-convert.py", line 35, in <module>
    print(hgvs.parse_hgvs_name("NM_000352.3:c.215A>G",genome,transcripts))
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1365, in parse_hgvs_name
    chrom, start, end, ref, alt = get_vcf_allele(hgvs, genome, transcript)
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 662, in get_vcf_allele
    chrom, start, end = hgvs.get_vcf_coords(transcript)
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1181, in get_vcf_coords
    chrom, start, end = self.get_coords(transcript)
  File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1142, in get_coords
    chrom = transcript.tx_position.chrom
AttributeError: 'dict' object has no attribute 'tx_position'

The script I am using is

import pyhgvs as hgvs
import pyhgvs.utils as hgvs_utils
from pygr.seqdb import SequenceFileDB

genome = SequenceFileDB('/ifs/e63data/offitlab/Human_Decoy_REF/hs37d5.fa')

with open('/ifs/e63data/offitlab/REFGENE/sorted.curated_geneTrack_wo_chr_sorted.refgene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)

def get_transcript(name):
    return transcripts.get(name)

print(hgvs.parse_hgvs_name("NM_000352.3:c.215A>G",genome,transcripts))

parse_hgvs_name() crashes if start>end

I have trouble converting chr19:g.10291325_10291323dup (rs147441348) into chrom, pos, ref, alt using parse_hgvs_name(). The traceback is

Traceback (most recent call last):
  File "XXX", line 76, in main
    get_transcript=get_transcript)
  File "xxxx/pyhgvs/__init__.py", line 1360, in parse_hgvs_name
    chrom, start, end, ref, alt = get_vcf_allele(hgvs, genome, transcript)
  File "xxxx/pyhgvs/__init__.py", line 672, in get_vcf_allele
    alt = ref[0] + alt
IndexError: string index out of range

pyhgvs is unable to retrieve the ref bases, which is likely caused by get_genomic_sequence(), which in turn does not support start coordinates larger than end coordinates. Now, I am not sure this is wrong. However, I can paste chr19:g.10291325_10291323dup into Alamut and find the variant. Exchanging start/end seems to yield the correct result, too.

AttributeError: 'module' object has no attribute 'utils'

In your example, the line

transcripts = hgvs.utils.read_transcripts('genes.refGene')

is throwing the error:
transcripts = hgvs.utils.read_transcripts('genes.refGene')
AttributeError: 'module' object has no attribute 'utils'

Any thoughts?

Running UTA locally

Hi, I would prefer to run UTA locally, and I have downloaded and installed Docker and the PostgreSQL Docker image. But Docker is quite new to me, and I am not sure how to run the database. Could you help me with this? Thanks

hgvs/pyhgvs/data/genes.refGene file

Hello,

How do I create the file hgvs/pyhgvs/data/genes.refGene? This file is out of date and I want to update it.

I want to use the latest transcripts.

how to get genes.refGene with version

README.md (import hgvs error)

I noticed that the code in the README didn't work for me; it looks like this was resolved in the examples1.py file. In the second line, use

import pyhgvs.utils as hgvs_utils
instead of
import hgvs.utils as hgvs_utils
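The reason the corrected import matters is that an alias created with "import X as Y" is only a local name; the import system still looks up packages by their real names. The sketch uses the standard-library json package as a stand-in for pyhgvs, so that it runs without pyhgvs installed. The alias name and `can_import` helper are made up for the example.

```python
import importlib
import json as my_json_alias  # the alias is a local variable, not a module name


def can_import(dotted_name):
    """Return True if the import system can resolve dotted_name."""
    try:
        importlib.import_module(dotted_name)
        return True
    except ImportError:
        return False


# The real package name resolves; the alias does not.
print(can_import("json.decoder"))
print(can_import("my_json_alias.decoder"))
```

By the same logic, after "import pyhgvs as hgvs", the statement "import hgvs.utils" looks for a top-level package literally named hgvs. It either fails or picks up an unrelated package of that name.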

pip install for python3 fails (os x 10.11.3)

Unable to parse a HGVS variant in format that VEP accepts

pyhgvs.InvalidHGVSName: Invalid HGVS cDNA allele "3252delC+3263insC"

VEP's web interface was able to translate that just fine, so I'm assuming that is the correct HGVS format. I gave it the variant as such:

ENST00000333535:c.3252delC+3263insC

a format which worked for all of my other variants. Just a PSA unless there is some older/newer format version for this kind of variant of which I am unaware.

Is there a way to use this to convert protein HGVS to genome space VCF coordinates?

I have been using Pierre Lindenbaum's tool BackLocate to accomplish this (http://lindenb.github.io/jvarkit/BackLocate.html) but if there's a smoother, more pythonic way using this tool it's not clear to me from the documentation.

No module read_transcripts in hgvs_utils

My code is an exact copy of the README.md file on your site. I can't get your package to work as directed.
>>> import pyhgvs as hgvs
>>> import hgvs.utils as hgvs_utils
>>> hgvs_utils.read_transcripts
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'read_transcripts'

I am trying to use Ensembl transcripts as well, and the documentation is rather sparse on that.

hgvs_utils not installing?

OS X 10.11.3
python 2.7.10

Or am I supposed to install this separately?

I git cloned hgvs and ran python setup.py install

>>> import hgvs.utils as hgvs_utils
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named hgvs.utils

AttributeError: 'module' object has no attribute 'read_transcripts'

Hello, I installed hgvs using:
pip install 'hgvs'
pip install 'pygr'

But I ran into some issues. How can I fix them?

[root@bio-x-2 hgvs]# python
Python 2.7.5 (default, Sep 15 2016, 22:37:39)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pyhgvs as hgvs
>>> import hgvs.utils as hgvs_utils
>>> from pygr.seqdb import SequenceFileDB
>>> hgvs_utils.read_transcripts()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'read_transcripts'

HGVS / genome coordinate conversion does not account for cDNA alignment gaps

RefSeq transcript sequences can differ from the reference genome sequence (even if they agree with one build, they can differ across builds). These sequences are aligned against the genome to produce exon coordinates in GFF releases.

This alignment sometimes contains insertions or deletions (in 5-10% of transcripts). For example, the GFF file has a “cDNA_match” record that describes the alignment and carries a “Gap” entry:

NC_000002.12    RefSeq  cDNA_match      73385758        73386192        431.411 +       .       ID=daa36283c6058f57b6347eb074291b21;Target=NM_015120.4 1 438 +;assembly_bases_aln=5003;assembly_bases_seq=5003;consensus_splices=44;exon_identity=0.999768;for_remapping=2;gap_count=1;identity=0.999768;idty=0.993151;matches=12925;num_ident=12925;num_mismatch=0;pct_coverage=99.9768;pct_coverage_hiqual=99.9768;pct_identity_gap=99.9768;pct_identity_ungap=100;product_coverage=1;rank=1;splices=44;weighted_identity=0.999771;Gap=M185 I3 M250

NM_015120.4 has cDNA_match Gap=M185 I3 M250, meaning 185 bases matched, then 3 bases were inserted, then matching resumed. You can see how this affects PyHGVS conversion downstream from the gap:

2:73385942 A>T: NM_015120.4(ALMS1):c.74A>T (correct)
2:73385943 A>T: NM_015120.4(ALMS1):c.75A>T (off by 3, VEP gives NM_015120.4:c.78A>T)
2:73385944 G>C: NM_015120.4(ALMS1):c.76G>C (off by 3, VEP gives NM_015120.4:c.79G>C)
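The off-by-3 shift can be reproduced with a small gap-aware coordinate mapper. This is a sketch under stated assumptions: the transcript is on the plus strand, only the M, I and D operations are handled, and "I" is read as bases present in the transcript (the Target) but absent from the genome, which is consistent with the record above (Target spans 1 to 438, while the genomic span covers 435 bases). `parse_gap` and `genome_to_transcript` are hypothetical helpers, not PyHGVS functions.

```python
import re


def parse_gap(gap):
    """Split a Gap string such as 'M185 I3 M250' into (op, length) pairs."""
    return [(op, int(length)) for op, length in re.findall(r"([MID])(\d+)", gap)]


def genome_to_transcript(pos, genome_start, target_start, gap):
    """Map a genomic position to a 1-based transcript position.

    Returns None if the position falls in a deletion or outside the alignment.
    Plus strand only.
    """
    g, t = genome_start, target_start
    for op, length in parse_gap(gap):
        if op == "M":            # aligned block: both coordinates advance
            if g <= pos < g + length:
                return t + (pos - g)
            g += length
            t += length
        elif op == "I":          # extra transcript bases: only transcript advances
            t += length
        elif op == "D":          # extra genomic bases: only genome advances
            if g <= pos < g + length:
                return None
            g += length
    return None


GAP = "M185 I3 M250"
for pos in (73385942, 73385943, 73385944):
    print(pos, genome_to_transcript(pos, 73385758, 1, GAP))
```

Positions 73385942 and 73385943 are adjacent on the genome but map to transcript positions 185 and 189. That jump of 3 matches the difference between the c.75 reported by PyHGVS and the c.78 reported by VEP.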

License

Thanks for sharing very useful library!

Would you mind adding License for this software?

enhancement - Compatibility with Ensembl genePred information

Since RefSeq (NCBI) is not the only source for annotation, it's also useful to have compatibility with other gene sets sources, like Ensembl genePred information (easy to obtain from Ensembl gtf files)

add pip install

having an option to pip install pyhgvs would make package management much easier.

Python3 version

Any chance of making it happen? It seems much better than biocommons hgvs, since that package requires a connection to UTA resources.

Announcing cdot - a way to load lots of transcripts fast

I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS

https://github.com/SACGF/cdot

You can either download a JSON.gz file, or use a REST service. To use it:

from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory

factory = RESTPyHGVSTranscriptFactory()
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.1.refseq.grch38.json.gz"])  # Uses local JSON file
pyhgvs.parse_hgvs_name(hgvs_c, genome, get_transcript=factory.get_transcript_grch37)

Need updated version of genes.refGene

Hi, I have some variants in HGVS format that use the NM_004364.4 transcript.

This transcript is not in the pyhgvs/data/genes.refGene file.

Can you please tell me how I can get the updated file, or how to add this transcript to it?

Thank you

Regards

how to get coordinate of "AB026906.1:c.40_42del" by hgvs code

I have used genes.refGene (#26 (comment)) and hg19.fa.

genes.refGene does not have the "AB026906.1" transcript.

Error:
Traceback (most recent call last):
  File "first_py.py", line 38, in <module>
    hgvs_name, genome, get_transcript=get_transcript)
  File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required

Incorrect translation when the HGVS string does not contain a reference or alt allele

I've come across this problem with strings such as NM_007294.3:c.1209dup - which IMHO should actually be NM_007294.3:c.1209dupT (which is how ClinVar represents the variant), but mutalyzer claims that NM_007294.3:c.1209dup is valid HGVS... When I parse its name with

chrom, offset, ref, alt = hgvs.parse_hgvs_name(variant, genome, get_transcript=get_transcript)

I get the results that ref and alt are both 'C', where alt should be 'CC'. If there's a way around this, please let me know!
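For reference, the expected behaviour can be sketched on a toy sequence: when the duplicated bases are omitted from the name, they can be read from the reference over the given range and appended to an anchor base. This is an illustrative sketch only, assuming plus-strand genomic coordinates, 1-based inclusive positions and no left/right normalization. `dup_to_vcf` is a hypothetical helper, not the pyhgvs implementation.

```python
def dup_to_vcf(seq, start, end):
    """Convert a duplication of seq[start..end] (1-based, inclusive) into a
    VCF-style (pos, ref, alt) triple anchored on the base before start."""
    if start > end:
        raise ValueError("start must not be greater than end")
    if start < 2 or end > len(seq):
        raise ValueError("range must leave room for an anchor base inside seq")
    duplicated = seq[start - 1:end]   # bases taken from the reference itself
    anchor = seq[start - 2]           # base immediately before the duplicated range
    return start - 1, anchor, anchor + duplicated


# Duplicating the single base at position 3 of the toy reference "GATTACA".
print(dup_to_vcf("GATTACA", 3, 3))
```

The alt allele is always longer than ref by the length of the duplicated range, so a result where ref and alt are identical indicates that the duplicated bases were never filled in.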

Thanks!

how to Check whether p.val meets the HGVS specification

I'm trying to localize all variants of CIVIC

But I'm not sure whether some of the variants meet the HGVS standard.

This is an outstanding project, but I haven't seen an example of analyzing protein-level variants in the README.

I would like to know whether it can do this, and I would appreciate any other suggestions.

update of genes.refGene files

I need to use an updated version of RefSeq. Is there a script available to download the current version of the 'genes.refGene' file, or should I build it by hand? Thank you. Angela

get_transcripts()

I am running the sample script from GitHub, but using my local versions of refGene and the human genome reference.

import pyhgvs as hgvs
import pyhgvs.utils as hgvs_utils
from pygr.seqdb import SequenceFileDB

genome = SequenceFileDB('hs37d5.fa')

with open('sorted.curated_geneTrack_wo_chr_sorted.refgene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)

def get_transcript(name):
    return transcripts.get(name)

chrom, offset, ref, alt = hgvs.parse_hgvs_name('NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
print(chrom, offset, ref, alt)

I am encountering this error:

  File "hgvs-convert.py", line 34, in <module>
    chrom, offset, ref, alt = hgvs.parse_hgvs_name('NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
  File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required
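The traceback does not show why no transcript was found. One thing worth checking is whether get_transcript returns None because the keys of the transcripts dictionary do not match the versioned name, for instance if the local refGene file lists NM_000352 without the ".3" suffix. That cause is an assumption and is not confirmed anywhere in this thread. The sketch uses a plain dictionary as a stand-in for the real transcripts mapping, and `make_get_transcript` is a hypothetical helper.

```python
def make_get_transcript(transcripts):
    """Build a lookup that falls back to the unversioned accession."""
    def get_transcript(name):
        transcript = transcripts.get(name)
        if transcript is None and "." in name:
            # Assumption: keys may lack the ".N" version suffix.
            transcript = transcripts.get(name.rsplit(".", 1)[0])
        return transcript
    return get_transcript


# Toy stand-in for the dictionary returned by read_transcripts().
transcripts = {"NM_000352": "<transcript object>"}
get_transcript = make_get_transcript(transcripts)

print(get_transcript("NM_000352.3"))   # found through the fallback
print(get_transcript("NM_999999.1"))   # genuinely missing
```

Printing get_transcript(name) before calling parse_hgvs_name is a quick way to confirm whether the lookup is the problem. Note that the fallback ignores the transcript version, so the coordinates may belong to a different version than the one named.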

how to create or find "genes.refGene" file for hg19 and hg38

How can I create or find a "genes.refGene" file for hg19 and hg38?
I got "genes.refGene" files from UCSC, but they are not working in my case.

The error is:

Traceback (most recent call last):
  File "first_py.py", line 38, in <module>
    hgvs_name, genome, get_transcript=get_transcript)
  File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required

how to get pdot

Hello, I see from example usage how to get HGVS cdot from REF/ALT. Is there a built-in function to get the pdot? Thanks.

pyhgvs normalize is right,but not give base number instead of base itself

I am new to this package and would like to know how to get the correct normalized result.
Thanks a lot.

dup longer than 100 bases converted back to delins (due to hardcoding of 100 in code)

Expected: Converting a long HGVS dup to variant coordinates then back again will make a dup
Actual: A long dup is converted to a delins:

from pyhgvs import parse_hgvs_name, variant_to_hgvs_name

g_hgvs_str = "NC_000001.10:g.235611675_235611994dup"
c_hgvs_str = "NM_003193.4(TBCE):c.1411_1501dup"


chrom, offset, ref, alt = parse_hgvs_name(g_hgvs_str, f, None)
g_hgvs_name = variant_to_hgvs_name(chrom, offset, ref, alt, f, None)

print(f"{g_hgvs_str=} => {g_hgvs_name=}")

chrom, offset, ref, alt = parse_hgvs_name(c_hgvs_str, f, transcript)
c_hgvs_name = variant_to_hgvs_name(chrom, offset, ref, alt, f, transcript)

print(f"{c_hgvs_str=} => {c_hgvs_name=}")

Output:

g_hgvs_str='NC_000001.10:g.235611675_235611994dup' => g_hgvs_name=HGVSName('g.235611773_235611774ins320')
c_hgvs_str='NM_003193.4(TBCE):c.1411_1501dup' => c_hgvs_name=HGVSName('NM_003193.4(TBCE):c.1491+18_1491+19ins320')

This is because hgvs_justify_indel only looks at a hardcoded 100 bases around the indel.

If you change the code to:

    size = max(len(ref), len(alt)) + 1
    start = max(offset - size, 0)
    end = offset + size

It keeps the dup:

g_hgvs_str='NC_000001.10:g.235611675_235611994dup' => g_hgvs_name=HGVSName('g.235611675_235611994dup320')
c_hgvs_str='NM_003193.4(TBCE):c.1411_1501dup' => c_hgvs_name=HGVSName('NM_003193.4(TBCE):c.1411_1501dup320')
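The effect of the proposed change on the search window can be checked in isolation. This sketch reproduces only the three lines of window arithmetic quoted above, with the old behaviour modelled as a fixed size of 100. How hgvs_justify_indel uses the window afterwards is not shown, and `justify_window` is a hypothetical name.

```python
def justify_window(offset, ref, alt, fixed_size=None):
    """Return (start, end) of the region inspected around an indel.

    fixed_size=100 models the hardcoded behaviour; None models the proposed
    size that scales with the longer allele.
    """
    size = fixed_size if fixed_size is not None else max(len(ref), len(alt)) + 1
    start = max(offset - size, 0)
    end = offset + size
    return start, end


# A 320-base insertion written VCF-style: anchor base plus 320 inserted bases.
ref = "A"
alt = "A" + "C" * 320

fixed = justify_window(235611994, ref, alt, fixed_size=100)
scaled = justify_window(235611994, ref, alt)
print("fixed window length: ", fixed[1] - fixed[0])
print("scaled window length:", scaled[1] - scaled[0])
```

With the fixed size, the window spans 200 bases, which is shorter than the 320-base repeat unit. With the scaled size it spans 644 bases.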

Add NC_ALLELE parse

Awesome work! Thanks!

There are some variants that have no mRNA or cDNA HGVS name,

e.g. rs716274, NC_000011.9:g.103418158A>G

NC_ALLELE is empty and not being processed now.

HGVS output from oncotator MAF gives an error

I get the following error when reading data from the HGVS_coding_DNA_change column of oncotator MAF output (http://www.broadinstitute.org/oncotator/).

InvalidHGVSName: Invalid HGVS cDNA allele "5407-17T>-"

Not sure if this is an oncotator issue or a pyhgvs issue.

error naming CFTR:c.1521_1523delCTT

Using hg18.fa and the provided genes.refGene in the git repo. I don't think this is a problem but let me know if you think it is.

chrom, offset, ref, alt = ('chr7', 116986881, 'TCTT', 'T')
transcript = get_transcript('NM_000492.3')
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
print(hgvs_name)
#returns NM_000492.3(CFTR):c.-133267_-133265delCTT

However I don't think this is correct. Shouldn't it be CFTR:c.1521_1523delCTT?
Good news: I tried an alternative form of FDel508 and got the same result

#NM_000492.3 is the transcript for CFTR
chrom, offset, ref, alt = ('chr7', 11698688, 'ATCT', 'A')
transcript = get_transcript('NM_000492.3')
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
print(hgvs_name)
#returns NM_000492.3(CFTR):c.-133267_-133265delCTT

So I think the position it is counting from is possibly off. Any thoughts? Thanks! Let me know if I can help contribute!

single base pair insertion name comes up as slightly off

Getting a systematic issue:
Every cdna name from vcf records is correct except for single base pair insertion.

Should be               Getting
CFTR:c.1006_1007insG    CFTR:c.1007insG
CFTR:c.1029_1030insG    CFTR:c.1030insG
CFTR:c.1660_1661insA    CFTR:c.1661insA
CFTR:c.3883_3884insG    CFTR:c.3884insG

So it's close, but it doesn't include the first coordinate. Multi-bp insertions are correct. Any idea why there is a difference?
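The expected names follow the HGVS convention that an insertion is described by the two positions flanking it, whatever the length of the inserted sequence. This sketch assumes both flanking positions are plain coding positions, with no intron offsets or UTR coordinates. `format_cdna_insertion` is a hypothetical helper, not a pyhgvs function.

```python
def format_cdna_insertion(gene, pos_before, inserted):
    """Format an insertion between pos_before and pos_before + 1."""
    if not inserted:
        raise ValueError("inserted sequence must not be empty")
    return "%s:c.%d_%dins%s" % (gene, pos_before, pos_before + 1, inserted)


# A single-base insertion still gets both flanking positions.
print(format_cdna_insertion("CFTR", 1006, "G"))
print(format_cdna_insertion("CFTR", 1660, "A"))
```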

Rename repository to pyhgvs?

This package was renamed from hgvs to pyhgvs a while ago, but the GitHub url still uses hgvs. Switching is actually pretty low-cost, since GH sets up redirects from the old name to the new name, so old links don't break. Even git pull/push still works (I've done this with a few repositories in the past).

Incorrect HGVS to VCF conversion for some genomic indels

Hi, genomic indels are often wrong because get_coords() adjustment of start/end is only done for indels if self.kind == 'c'

Testing against examples from the ClinGen allele registry:

http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele?hgvsOrDescriptor=NM_000492.3%3Ac.1155_1156dupTA

    'NM_000492.3:c.1155_1156dupTA' # Correct - resolves to ('chr7', 117182104, 'A', 'AAT')
    # Same as above but without optional trailing base - issue #32
    'NM_000492.3:c.1155_1156dup' # Error - resolves to ('chr7', 117182107, 'A', 'A')
    # Genomic coordinate of above
    "chr7:g.117182108_117182109dup" # Error - resolves to ('7', 117182109, 'A', 'A')

    # Genomic coordinate of above but shifted with optional base suffix
    "chr7:g.117182105_117182106dupAT" # Error - resolves to ('7', 117182106, 'T', 'T')

I would do a pull request but I've been working with existing pull request #25 and it doesn't look like this project is being updated anymore. If you merge #25 please ping this issue and I'll make a pull request.

The fix is to remove the test for if self.kind == 'c': in get_coords()

I've patched my fork: https://github.com/sacgf/hgvs

Experiment w/ github integration

Reference HGVS without reference base leads to wrong coordinates and reference allele

The current regex treats the last digit as a ref digit, ie it uses it to multiply "N" that many times. This makes the coordinate wrong as the last digit is cut off, eg:

In [6]: HGVSName("NC_000017.11:g.50199235=")
Out[6]: HGVSName('NC_000017.11:g.5019923NNNNN=')

In [7]: HGVSName("NM_018090.5:c.462=")
Out[7]: HGVSName('NM_018090.5:c.46NN=')

Unit test test_hgvs_names.py

# Copy pasted from BRCA1:c.101A= test with "A" removed

    ('BRCA1:c.101=', True,
     {
         'gene': 'BRCA1',
         'kind': 'c',
         'cdna_start': CDNACoord(101),
         'cdna_end': CDNACoord(101),
         'ref_allele': '',
         'alt_allele': '',
         'mutation_type': '=',
     }),

# Copy pasted from BRCA1:g.101A= test with "A" removed

    ('BRCA1:g.101=', True,
     {
         'gene': 'BRCA1',
         'kind': 'g',
         'start': 101,
         'end': 101,
         'ref_allele': '',
         'alt_allele': '',
         'mutation_type': '=',
     }),

Currently fails with:

AssertionError: CDNACoord(10, 0) != CDNACoord(101, 0)

Fix is to add a new regex just above the existing "No change" regexes, ie in HGVSRegex:

CDNA_ALLELE = [
    CDNA_START + EQUAL, 
    # old regexes
]

GENOMIC_ALLELE = [
    COORD_START + EQUAL,
    # old regexes
]

I am not sure whether protein HGVS is affected, or whether the ref must be specified there, i.e. whether "p.1000=" is valid or not.
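The failure and the ordering fix can be reproduced with simplified patterns. These are illustrative stand-ins written for this sketch, not the library's actual HGVSRegex definitions: the old pattern requires a reference (bases or a length) after the coordinate, so backtracking gives up the last digit of the coordinate to satisfy it. `parse_no_change` is a hypothetical helper.

```python
import re

# Stand-in for the existing "no change" pattern: coordinate, then a reference
# given either as bases or as a length, then "=".
OLD_PATTERNS = [re.compile(r"^(?P<start>\d+)(?P<ref>[ACGTN]+|\d+)=$")]

# Proposed ordering: try "coordinate followed directly by =" first.
NEW_PATTERNS = [re.compile(r"^(?P<start>\d+)=$")] + OLD_PATTERNS


def parse_no_change(allele, patterns):
    """Return (start, ref) from the first matching pattern, or None."""
    for pattern in patterns:
        match = pattern.match(allele)
        if match:
            start = int(match.group("start"))
            ref = match.groupdict().get("ref") or ""
            if ref.isdigit():
                ref = "N" * int(ref)   # a numeric ref is expanded to that many N's
            return start, ref
    return None


for allele in ("50199235=", "462=", "101A="):
    print(allele, parse_no_change(allele, OLD_PATTERNS),
          parse_no_change(allele, NEW_PATTERNS))
```

Names that already carry a reference base, such as 101A=, fall through to the old pattern unchanged, so the extra pattern only affects the case where the reference is omitted.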
