counsyl / hgvs
HGVS variant name parsing and generation
License: MIT License
At the moment the hgvs module can only work with coding DNA, genomic and protein sequences. It would be great if the module accepted all sequence types. I would be very happy to contribute to this task, so please let me know how I could help.
import hgvs.parser
hp = hgvs.parser.Parser()
hp.parse_hgvs_variant("NM13423:r.831_832ins831+1_831+60")
...
ometa.runtime.ParseError:
NM13423:r.831_832ins831+1_831+60
^
Parse error at line 1, column 20: Syntax error. trail: [rna_iupac rna rna_ins rna_edit r_posedit r_variant hgvs_variant]
...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
hgvs.exceptions.HGVSParseError: NM13423:r.831_832ins831+1_831+60: char 20: Syntax error
I have an issue installing HGVS. When running "python setup.py install" I encounter the following:
Traceback (most recent call last):
File "setup.py", line 35, in <module>
main()
File "setup.py", line 30, in main
parse_requirements('requirements-dev.txt')],
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1200, in parse_requirements
skip_regex = options.skip_requirements_regex
AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
NM_007300.4:c.2902_2959inv currently fails
https://varnomen.hgvs.org/recommendations/DNA/variant/inversion/
NC_000005.10:g.177421339_177421327delACTCGAGTGCTCC appears in ClinVar, and is an invalid name (the genomic start/stop coords are not in increasing order). This causes parse_hgvs_name to raise an IndexError. It should raise InvalidHGVSName instead.
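The requested behaviour can be sketched as a coordinate-order check that runs before any sequence lookup. This is only an illustration: `InvalidHGVSName` below is a stand-in class defined locally so the snippet runs on its own, and `GENOMIC_RANGE` and `check_genomic_range` are hypothetical names, not part of pyhgvs.

```python
import re


class InvalidHGVSName(ValueError):
    """Stand-in for pyhgvs.InvalidHGVSName so this sketch runs on its own."""


# Hypothetical pattern: matches only the "g.START_END" part of a genomic name.
GENOMIC_RANGE = re.compile(r":g\.(?P<start>\d+)_(?P<end>\d+)")


def check_genomic_range(name):
    """Return (start, end) for a genomic range, or None if the name has no range.

    Raises InvalidHGVSName when the start coordinate is greater than the end.
    """
    match = GENOMIC_RANGE.search(name)
    if match is None:
        return None
    start = int(match.group("start"))
    end = int(match.group("end"))
    if start > end:
        raise InvalidHGVSName(
            'Invalid HGVS name "%s": start %d is greater than end %d'
            % (name, start, end))
    return start, end
```

Calling `check_genomic_range` on the ClinVar name from this report raises `InvalidHGVSName` instead of failing later with an `IndexError`.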
python hgvs-convert.py
DEBUG seqdb._create_seqLenDict: Building sequence length index...
Traceback (most recent call last):
File "hgvs-convert.py", line 35, in <module>
print(hgvs.parse_hgvs_name("NM_000352.3:c.215A>G",genome,transcripts))
File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1365, in parse_hgvs_name
chrom, start, end, ref, alt = get_vcf_allele(hgvs, genome, transcript)
File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 662, in get_vcf_allele
chrom, start, end = hgvs.get_vcf_coords(transcript)
File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1181, in get_vcf_coords
chrom, start, end = self.get_coords(transcript)
File "/home/josephv/Pythonmodules/lib/python2.7/site-packages/pyhgvs-0.9.4-py2.7.egg/pyhgvs/__init__.py", line 1142, in get_coords
chrom = transcript.tx_position.chrom
AttributeError: 'dict' object has no attribute 'tx_position'
The script I am using is
import pyhgvs as hgvs
import pyhgvs.utils as hgvs_utils
from pygr.seqdb import SequenceFileDB
genome = SequenceFileDB('/ifs/e63data/offitlab/Human_Decoy_REF/hs37d5.fa')
with open('/ifs/e63data/offitlab/REFGENE/sorted.curated_geneTrack_wo_chr_sorted.refgene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)
def get_transcript(name):
    return transcripts.get(name)
print(hgvs.parse_hgvs_name("NM_000352.3:c.215A>G",genome,transcripts))
I have trouble converting chr19:g.10291325_10291323dup (rs147441348) into chrom, pos, ref, alt using parse_hgvs_name(). The traceback is
Traceback (most recent call last):
File "XXX", line 76, in main
get_transcript=get_transcript)
File "xxxx/pyhgvs/__init__.py", line 1360, in parse_hgvs_name
chrom, start, end, ref, alt = get_vcf_allele(hgvs, genome, transcript)
File "xxxx/pyhgvs/__init__.py", line 672, in get_vcf_allele
alt = ref[0] + alt
IndexError: string index out of range
pyhgvs is unable to retrieve the ref bases, which is likely caused by get_genomic_sequence(), which in turn does not support start coordinates larger than end coordinates. Now, I am not sure this is wrong. However, in my case I can paste chr19:g.10291325_10291323dup into Alamut and find the variant. Exchanging start/end seems to yield the correct result, too.
In your example, the line
transcripts = hgvs.utils.read_transcripts('genes.refGene')
is throwing the error:
transcripts = hgvs.utils.read_transcripts('genes.refGene')
AttributeError: 'module' object has no attribute 'utils'
Any thoughts?
Hi, I would prefer to run UTA locally, and I have downloaded and installed Docker and the PostgreSQL Docker image. But Docker is quite new to me, and I am not sure how to run the database. Could you help me with this? Thanks
Dear maintainers,
How do I create this file: hgvs/pyhgvs/data/genes.refGene? The file is out of date and I want to update it.
I want to use the latest transcripts.
I noticed that the code in the UI's README didn't work for me; it looks like it was resolved in the examples1.py file. In the second line, use
import pyhgvs.utils as hgvs_utils
instead of
import hgvs.utils as hgvs_utils
pyhgvs.InvalidHGVSName: Invalid HGVS cDNA allele "3252delC+3263insC"
VEP's web interface was able to translate that just fine, so I'm assuming that is the correct HGVS format. I gave it the variant as such:
ENST00000333535:c.3252delC+3263insC
a format which worked for all of my other variants. Just a PSA unless there is some older/newer format version for this kind of variant of which I am unaware.
I have been using Pierre Lindenbaum's tool BackLocate to accomplish this (http://lindenb.github.io/jvarkit/BackLocate.html) but if there's a smoother, more pythonic way using this tool it's not clear to me from the documentation.
My code is an exact copy of the README.md file on your site. I can't get your package to work as directed.
>>> import pyhgvs as hgvs
>>> import hgvs.utils as hgvs_utils
>>> hgvs_utils.read_transcripts
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'read_transcripts'
I am trying to use Ensembl transcripts as well, and the documentation is rather sparse on that.
OS X 10.11.3
python 2.7.10
Or am I supposed to install this separately?
I git cloned hgvs and ran python setup.py install
>>> import hgvs.utils as hgvs_utils
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named hgvs.utils
Hello, I installed 'hgvs' using:
pip install 'hgvs'
pip install 'pygr'
But there are some issues. How can I fix them?
[root@bio-x-2 hgvs]# python
Python 2.7.5 (default, Sep 15 2016, 22:37:39)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyhgvs as hgvs
>>> import hgvs.utils as hgvs_utils
>>> from pygr.seqdb import SequenceFileDB
>>> hgvs_utils.read_transcripts()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'read_transcripts'
RefSeq transcript sequences can be different from the reference sequence (even if they agree with 1 build they can be different across builds). These sequences are aligned against the genome to produce exon coordinates in GFF releases.
This alignment can sometimes produce insertions / deletions (5-10% of transcripts), eg in the GFF file there is a “cDNA match” string that records the alignment, and has a “Gap” entry:
NC_000002.12 RefSeq cDNA_match 73385758 73386192 431.411 + . ID=daa36283c6058f57b6347eb074291b21;Target=NM_015120.4 1 438 +;assembly_bases_aln=5003;assembly_bases_seq=5003;consensus_splices=44;exon_identity=0.999768;for_remapping=2;gap_count=1;identity=0.999768;idty=0.993151;matches=12925;num_ident=12925;num_mismatch=0;pct_coverage=99.9768;pct_coverage_hiqual=99.9768;pct_identity_gap=99.9768;pct_identity_ungap=100;product_coverage=1;rank=1;splices=44;weighted_identity=0.999771;Gap=M185 I3 M250
NM_015120.4 has cDNA_match Gap=M185 I3 M250, meaning 185 bases matched, then 3 bases were inserted, then 250 more bases matched. You can see how this affects PyHGVS conversion downstream from the gaps:
2:73385942 A>T: NM_015120.4(ALMS1):c.74A>T (correct)
2:73385943 A>T: NM_015120.4(ALMS1):c.75A>T (off by 3, VEP gives NM_015120.4:c.78A>T)
2:73385944 G>C: NM_015120.4(ALMS1):c.76G>C (off by 3, VEP gives NM_015120.4:c.79G>C)
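The off-by-3 shift can be reproduced from the Gap string alone. The sketch below is not PyHGVS code: `parse_gap` and `genomic_to_transcript_offset` are hypothetical helpers, offsets are 0-based within the aligned block, and the mapping assumes a plus-strand alignment like the one shown.

```python
import re


def parse_gap(gap):
    """Split a GFF3 Gap attribute such as 'M185 I3 M250' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        match = re.fullmatch(r"([MID])(\d+)", token)
        if match is None:
            raise ValueError("unsupported Gap token: %r" % token)
        ops.append((match.group(1), int(match.group(2))))
    return ops


def genomic_to_transcript_offset(ops, genomic_offset):
    """Map a 0-based genomic offset inside the block to a 0-based transcript offset.

    Returns None when the genomic base has no transcript counterpart (a D block).
    """
    genome_pos = 0
    transcript_pos = 0
    for op, length in ops:
        if op == "M":
            if genomic_offset < genome_pos + length:
                return transcript_pos + (genomic_offset - genome_pos)
            genome_pos += length
            transcript_pos += length
        elif op == "I":
            # Bases present in the transcript but not in the genome.
            transcript_pos += length
        else:
            # "D": bases present in the genome but not in the transcript.
            if genomic_offset < genome_pos + length:
                return None
            genome_pos += length
    raise ValueError("offset lies beyond the aligned block")


ops = parse_gap("M185 I3 M250")
block_start = 73385758  # genomic start of the cDNA_match feature above
last_before_gap = genomic_to_transcript_offset(ops, 73385942 - block_start)
first_after_gap = genomic_to_transcript_offset(ops, 73385943 - block_start)
```

Here 73385942 is the last base of the first match block, so its transcript offset equals its genomic offset, while 73385943 lands after the 3-base insertion and its transcript offset is 3 larger. That agrees with the c.75 versus c.78 difference reported above.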
Thanks for sharing a very useful library!
Would you mind adding a license for this software?
Since RefSeq (NCBI) is not the only source of annotation, it would also be useful to have compatibility with other gene set sources, like Ensembl genePred information (easy to obtain from Ensembl GTF files).
Having an option to pip install pyhgvs would make package management much easier.
Any chance of making it happen? This package seems much better than biocommons hgvs, since that one requires a connection to UTA resources.
I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS
You can either download a JSON.gz file, or use a REST service. To use it:
import pyhgvs
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory
factory = RESTPyHGVSTranscriptFactory()
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.1.refseq.grch38.json.gz"]) # Uses local JSON file
# hgvs_c is the HGVS string to resolve; genome is the reference genome object loaded elsewhere
pyhgvs.parse_hgvs_name(hgvs_c, genome, get_transcript=factory.get_transcript_grch37)
Hi, I have some variants in HGVS format that use the NM_004364.4 transcript.
This transcript is not in the pyhgvs/data/genes.refGene file.
Can you please tell me how I can get an updated file or add this transcript to the file?
Thank you
Regards
I have used genes.refGene (#26 (comment)) and hg19.fa.
genes.refGene does not have the "AB026906.1" transcript.
Error:
Traceback (most recent call last):
File "first_py.py", line 38, in <module>
hgvs_name, genome, get_transcript=get_transcript)
File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required
I've come across this problem with strings such as NM_007294.3:c.1209dup - which IMHO should actually be NM_007294.3:c.1209dupT (which is how ClinVar represents the variant), but mutalyzer claims that NM_007294.3:c.1209dup is valid HGVS... When I parse its name with
chrom, offset, ref, alt = hgvs.parse_hgvs_name(variant, genome, get_transcript=get_transcript)
I get the results that ref and alt are both 'C', where alt should be 'CC'. If there's a way around this, please let me know!
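For reference, the expected shape of a duplication written as a VCF-style insertion can be shown on a toy sequence. This is a minimal sketch and not how pyhgvs computes it: `dup_to_vcf` is a hypothetical helper, coordinates are 1-based and inclusive, and the result is anchored on the base immediately before the duplicated range without any further left-normalization.

```python
def dup_to_vcf(sequence, start, end):
    """Convert a duplication of sequence[start..end] into (pos, ref, alt).

    The duplicated bases are read from the sequence itself, so the HGVS name
    does not need to spell them out (as in "c.1209dup").
    """
    if not 2 <= start <= end <= len(sequence):
        raise ValueError("need 2 <= start <= end <= len(sequence)")
    duplicated = sequence[start - 1:end]
    anchor_pos = start - 1            # 1-based position of the anchor base
    ref = sequence[anchor_pos - 1]    # the anchor base itself
    alt = ref + duplicated            # anchor followed by the extra copy
    return anchor_pos, ref, alt
```

The point is that `alt` is always longer than `ref` by the length of the duplicated range, whether or not the name carries the optional trailing bases.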
Thanks!
I'm trying to localize all the variants in CIViC,
but I'm not sure whether some of the variants meet HGVS standards.
This is an outstanding project, but I haven't seen an example of analyzing protein-level variation in the README.
I would like to know whether it can do this, and I would be grateful for any other suggestions.
I need to use an updated version of RefSeq. Is there a script available to download the current version of the file 'genes.refGene', or should I build it by hand? Thank you. Angela
I am running the sample script from GitHub but using my local version of refGene and the human genome reference.
import pyhgvs as hgvs
import pyhgvs.utils as hgvs_utils
from pygr.seqdb import SequenceFileDB
genome = SequenceFileDB('hs37d5.fa')
with open('sorted.curated_geneTrack_wo_chr_sorted.refgene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)
def get_transcript(name):
    return transcripts.get(name)
chrom, offset, ref, alt = hgvs.parse_hgvs_name('NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
print(chrom, offset, ref, alt)
I am encountering this error:
File "hgvs-convert.py", line 34, in <module>
chrom, offset, ref, alt = hgvs.parse_hgvs_name('NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required
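One way to narrow this down is to make the lookup callback fail loudly instead of returning None. The sketch below assumes, without confirmation from this thread, that the error appears when `get_transcript` finds no entry for the requested name, for example because the file stores accessions without the version suffix. `make_get_transcript` is a hypothetical helper and works with any dict-like mapping of names to transcripts.

```python
def make_get_transcript(transcripts):
    """Build a lookup callback that reports missing transcripts explicitly."""

    def get_transcript(name):
        transcript = transcripts.get(name)
        if transcript is None and "." in name:
            # Assumption: the mapping may be keyed without the version suffix.
            # Note this can return a different version than the one requested.
            transcript = transcripts.get(name.rsplit(".", 1)[0])
        if transcript is None:
            raise KeyError("transcript %r not found in the loaded annotation" % name)
        return transcript

    return get_transcript
```

Passing `get_transcript=make_get_transcript(transcripts)` would then show whether the transcript is simply absent from the annotation file.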
How can I create or find the "genes.refGene" file for hg19 and hg38?
I got a "genes.refGene" file from UCSC, but it does not work in my case.
The error shows:
Traceback (most recent call last):
File "first_py.py", line 38, in <module>
hgvs_name, genome, get_transcript=get_transcript)
File "build/bdist.linux-x86_64/egg/pyhgvs/__init__.py", line 1356, in parse_hgvs_name
ValueError: transcript is required
Hello, I see from example usage how to get HGVS cdot from REF/ALT. Is there a built-in function to get the pdot? Thanks.
Expected: Converting a long HGVS dup to variant coordinates then back again will make a dup
Actual: A long dup is converted to a delins:
from pyhgvs import parse_hgvs_name, variant_to_hgvs_name
g_hgvs_str = "NC_000001.10:g.235611675_235611994dup"
c_hgvs_str = "NM_003193.4(TBCE):c.1411_1501dup"
chrom, offset, ref, alt = parse_hgvs_name(g_hgvs_str, f, None)
g_hgvs_name = variant_to_hgvs_name(chrom, offset, ref, alt, f, None)
print(f"{g_hgvs_str=} => {g_hgvs_name=}")
chrom, offset, ref, alt = parse_hgvs_name(c_hgvs_str, f, transcript)
c_hgvs_name = variant_to_hgvs_name(chrom, offset, ref, alt, f, transcript)
print(f"{c_hgvs_str=} => {c_hgvs_name=}")
Output:
g_hgvs_str='NC_000001.10:g.235611675_235611994dup' => g_hgvs_name=HGVSName('g.235611773_235611774ins320')
c_hgvs_str='NM_003193.4(TBCE):c.1411_1501dup' => c_hgvs_name=HGVSName('NM_003193.4(TBCE):c.1491+18_1491+19ins320')
This is because hgvs_justify_indel only looks at a hardcoded 100 bases around the indel.
If you change the code to:
size = max(len(ref), len(alt)) + 1
start = max(offset - size, 0)
end = offset + size
It keeps the dup:
g_hgvs_str='NC_000001.10:g.235611675_235611994dup' => g_hgvs_name=HGVSName('g.235611675_235611994dup320')
c_hgvs_str='NM_003193.4(TBCE):c.1411_1501dup' => c_hgvs_name=HGVSName('NM_003193.4(TBCE):c.1411_1501dup320')
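The effect of the proposed change on the search window can be illustrated in isolation. `flank_window` below is a hypothetical helper written for this comparison, not the body of hgvs_justify_indel; it only reproduces the three lines of window arithmetic quoted above, with an optional fixed size standing in for the hardcoded value.

```python
def flank_window(offset, ref, alt, fixed_size=None):
    """Return the (start, end) window examined around an indel.

    With fixed_size set, the window ignores the allele length; otherwise it
    grows with the longer of the two alleles, as in the proposed change.
    """
    if fixed_size is not None:
        size = fixed_size
    else:
        size = max(len(ref), len(alt)) + 1
    start = max(offset - size, 0)
    end = offset + size
    return start, end


# A 320-base insertion: anchor base plus 320 inserted bases.
fixed = flank_window(1000, "A", "A" * 321, fixed_size=100)
scaled = flank_window(1000, "A", "A" * 321)
```

With a fixed size of 100 the window spans 200 bases, which is shorter than the 320-base duplicated sequence, while the scaled window covers it on both sides.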
Awesome work! Thanks!
There are some variants which have no mRNA or cDNA hgvs,
eg. rs716274,NC_000011.9:g.103418158A>G
NC_ALLELE is empty and not being processed now.
I get the following error when reading data from the HGVS_coding_DNA_change column of oncotator MAF output (http://www.broadinstitute.org/oncotator/).
InvalidHGVSName: Invalid HGVS cDNA allele "5407-17T>-"
Not sure if this is an oncotator issue or a pyhgvs issue.
Using hg18.fa and the provided genes.refGene in the git repo. I don't think this is a problem but let me know if you think it is.
chrom, offset, ref, alt = ('chr7', 116986881, 'TCTT', 'T')
transcript = get_transcript('NM_000492.3')
hgvs_name = hgvs.format_hgvs_name(
chrom, offset, ref, alt, genome, transcript)
print(hgvs_name)
#returns NM_000492.3(CFTR):c.-133267_-133265delCTT
However I don't think this is correct. Shouldn't it be CFTR:c.1521_1523delCTT?
Good news: I tried an alternative form of FDel508 and got the same result
#NM_000492.3 is the transcript for CFTR
chrom, offset, ref, alt = ('chr7', 11698688, 'ATCT', 'A')
transcript = get_transcript('NM_000492.3')
hgvs_name = hgvs.format_hgvs_name(
chrom, offset, ref, alt, genome, transcript)
print(hgvs_name)
#returns NM_000492.3(CFTR):c.-133267_-133265delCTT
So I think the position it is counting from is possibly off. Any thoughts? Thanks! Let me know if I can help contribute!
Getting a systematic issue:
Every cDNA name from VCF records is correct except for single-base-pair insertions.
should be             getting
CFTR:c.1006_1007insG  CFTR:c.1007insG
CFTR:c.1029_1030insG  CFTR:c.1030insG
CFTR:c.1660_1661insA  CFTR:c.1661insA
CFTR:c.3883_3884insG  CFTR:c.3884insG
So it's close, but it doesn't get the first coordinate. Multi-bp insertions are correct. Any idea why there is a difference?
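HGVS names an insertion by the two positions that flank it, regardless of how many bases are inserted, which is why the expected column always has a range. A minimal sketch of that formatting rule follows; `format_cdna_insertion` is a hypothetical helper that handles plain exonic coding positions only (no intronic offsets or UTR coordinates).

```python
def format_cdna_insertion(gene, position_before, inserted):
    """Format an insertion between position_before and position_before + 1."""
    if position_before < 1:
        raise ValueError("position_before must be a positive coding position")
    if not inserted:
        raise ValueError("inserted sequence must not be empty")
    return "%s:c.%d_%dins%s" % (gene, position_before, position_before + 1, inserted)
```

A single-base insertion such as `format_cdna_insertion("CFTR", 1006, "G")` is formatted the same way as a multi-base one.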
This package was renamed from hgvs to pyhgvs a while ago, but the GitHub URL still uses hgvs. Switching is actually pretty low-cost, since GH sets up redirects from the old name to the new name, so old links don't break. Even git pull/push still works (I've done this with a few repositories in the past).
Hi, genomic indels are often wrong because get_coords() adjustment of start/end is only done for indels if self.kind == 'c'
Testing against examples from the ClinGen allele registry:
'NM_000492.3:c.1155_1156dupTA' # correct resolves to ('chr7', 117182104, 'A', 'AAT')
# Same as above but without optional trailing base - issue #32
'NM_000492.3:c.1155_1156dup' # Error - resolves to ('chr7', 117182107, 'A', 'A')
# Genomic coordinate of above
"chr7:g.117182108_117182109dup" # Error - resolves to ('7', 117182109, 'A', 'A')
# Genomic coordinate of above but shifted with optional base suffix
"chr7:g.117182105_117182106dupAT" # Error - resolves to ('7', 117182106, 'T', 'T')
I would do a pull request but I've been working with existing pull request #25 and it doesn't look like this project is being updated anymore. If you merge #25 please ping this issue and I'll make a pull request.
The fix is to remove the test for if self.kind == 'c': in get_coords()
I've patched my fork: https://github.com/sacgf/hgvs
The current regex treats the last digit as a ref digit, ie it uses it to multiply "N" that many times. This makes the coordinate wrong as the last digit is cut off, eg:
In [6]: HGVSName("NC_000017.11:g.50199235=")
Out[6]: HGVSName('NC_000017.11:g.5019923NNNNN=')
In [7]: HGVSName("NM_018090.5:c.462=")
Out[7]: HGVSName('NM_018090.5:c.46NN=')
Unit test test_hgvs_names.py
# Copy pasted from BRCA1:c.101A= test with "A" removed
('BRCA1:c.101=', True,
 {
     'gene': 'BRCA1',
     'kind': 'c',
     'cdna_start': CDNACoord(101),
     'cdna_end': CDNACoord(101),
     'ref_allele': '',
     'alt_allele': '',
     'mutation_type': '=',
 }),
# Copy pasted from BRCA1:g.101A= test with "A" removed
('BRCA1:g.101=', True,
 {
     'gene': 'BRCA1',
     'kind': 'g',
     'start': 101,
     'end': 101,
     'ref_allele': '',
     'alt_allele': '',
     'mutation_type': '=',
 }),
Currently fails with:
AssertionError: CDNACoord(10, 0) != CDNACoord(101, 0)
Fix is to add a new regex just above the existing "No change" regexes, ie in HGVSRegex:
CDNA_ALLELE = [
    CDNA_START + EQUAL,
    # old regexes
]
GENOMIC_ALLELE = [
    COORD_START + EQUAL,
    # old regexes
]
I am not sure whether protein HGVS is affected, or whether the ref needs to be specified there, i.e. whether "p.1000=" is valid or not.
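The backtracking behaviour behind this bug, and the effect of trying a ref-less pattern first, can be reproduced with simplified patterns. The regexes below are assumptions written for this illustration and are much simpler than the ones in HGVSRegex; only the ordering idea is taken from the fix described above.

```python
import re

# Simplified stand-in for the existing "no change" pattern: the ref may be
# bases or a digit count, so the last coordinate digit can be taken as ref.
OLD_PATTERNS = [r"(?P<start>\d+)(?P<ref>[ACGTN]+|\d+)="]

# Proposed ordering: try the ref-less form before the old patterns.
NEW_PATTERNS = [r"(?P<start>\d+)="] + OLD_PATTERNS


def parse_no_change(allele, patterns):
    """Return (start, ref) from the first pattern that matches, else None."""
    for pattern in patterns:
        match = re.fullmatch(pattern, allele)
        if match:
            groups = match.groupdict()
            return int(groups["start"]), groups.get("ref") or ""
    return None


old_result = parse_no_change("101=", OLD_PATTERNS)
new_result = parse_no_change("101=", NEW_PATTERNS)
```

With the old list, the regex engine backtracks so that the final "1" is consumed as the ref and the coordinate becomes 10, which mirrors the `CDNACoord(10, 0) != CDNACoord(101, 0)` failure. With the ref-less pattern first, the coordinate stays 101, and names that do carry a ref such as "101A=" still fall through to the old pattern.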