rasmushenningsson / variantcallformat.jl

Read and write VCF and BCF files

License: Other

Julia 100.00%

vcf bcf vcf-files bioinformatics julia genetics

variantcallformat.jl's Introduction

VariantCallFormat.jl

VariantCallFormat.jl is based on previous work in GeneticVariation.jl. Big thanks to the original authors!

Description

VariantCallFormat.jl provides read/write functionality for VCF files as well as for its binary sister format BCF.

VCF files are used ubiquitously in bioinformatics to represent genetic variants.

Installation

Install VariantCallFormat.jl from the Julia REPL:

using Pkg
Pkg.add("VariantCallFormat")

variantcallformat.jl's People

anmabu mashu jakobnissen

variantcallformat.jl's Issues

Support upcoming Automa.jl release

The description below was written by @jakobnissen. Saving it here for future reference!

For the rewrite, it should be a minor rewrite, with just a few lines changed.
I recommend reading this tutorial on Automa and how it works:
https://biojulia.net/post/automa1/
Basically, the change in Automa is that the new version will, when you create a Machine, check that the actions can be unambiguously resolved. And in the current version of the VCF parser, they can't be.

When you try to compile VCF with the latest version of Automa, you get this error:
ERROR: LoadError: LoadError: Ambiguous DFA: Input 0x2e can lead to actions nothing or [:mark]
Stacktrace:
What it means is that there is at least one possible input where the byte 0x2e (the ASCII character .) can lead to two different actions, and it's impossible to resolve which one applies. To fix it, go into this file https://github.com/rasmushenningsson/VariantCallFormat.jl/blob/main/src/reader.jl, and try to see which regex pattern contains 0x2e where there may be an ambiguity.

Or to put it even more starkly: With the current version of Automa, there is at least one input that will cause the VCF parser to do the wrong thing, silently.

To debug it, it may be useful to first figure out exactly which Machine raises the error, then take the regex that produces the machine and convert it to an NFA, then convert that NFA to a dot file and visualize it. Then look for places in the graph where 0x2e leads to two distinct paths.
(I could do the PR, but I think it's more durable if you learn to debug Automa yourself, and I'm happy to help) :)
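
To make the idea of an "ambiguous action" concrete, here is a toy model in plain Julia. It is not Automa's algorithm, and the two branches are hypothetical: whether reader.jl actually contains a pattern like this has not been confirmed. Each branch of an alternation lists the bytes it may start with and the actions fired on entering it; a byte is ambiguous if it can enter two branches that fire different actions.

```julia
# Toy model (NOT Automa's implementation): each branch of an alternation
# lists the bytes it can start with and the actions fired on entering it.
struct Branch
    name::String
    firstbytes::Set{UInt8}
    actions::Vector{Symbol}
end

# Return the bytes that can enter two branches with different action lists.
function ambiguous_bytes(branches::Vector{Branch})
    seen = Dict{UInt8,Vector{Symbol}}()
    conflicts = Set{UInt8}()
    for b in branches, byte in b.firstbytes
        if haskey(seen, byte) && seen[byte] != b.actions
            push!(conflicts, byte)
        else
            seen[byte] = b.actions
        end
    end
    return sort!(collect(conflicts))
end

# Hypothetical field grammar: either a literal '.' (missing value, no action)
# or a general value that may also start with '.', entered with :mark.
missing_value = Branch("missing", Set([UInt8('.')]), Symbol[])
general_value = Branch("value", Set(UInt8.(collect("._0123456789"))), [:mark])

conflicts = ambiguous_bytes([missing_value, general_value])
println(conflicts)
```

In this toy grammar the only conflicting byte is 0x2e, mirroring the shape of the error message above. One way out is to make the branches disjoint, so that a leading '.' belongs to exactly one of them.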

Weird headers throwing error

Hi,

I received some (G)VCF files where lines 16-19 of the header lines look like

##GVCFBlock0-20=minGQ=0(inclusive),maxGQ=20(exclusive)
##GVCFBlock20-30=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40-100=minGQ=40(inclusive),maxGQ=100(exclusive)

which causes the following error

using VariantCallFormat
file = "MH0289561.v1.1a091483-abb5-4bb4-b14a-b5e4046d0a84.rb.g.vcf"
reader = VCF.Reader(open(file))

ERROR: VariantCallFormat.Reader file format error on line 16
Stacktrace:
 [1] error(::String, ::Int64)
   @ Base ./error.jl:42
 [2] _readheader!(reader::VariantCallFormat.Reader, state::BioCore.Ragel.State{BufferedStreams.BufferedInputStream{IOStream}})
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:106
 [3] readheader!(reader::VariantCallFormat.Reader)
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:80
 [4] Reader
   @ ~/.julia/packages/VariantCallFormat/wT4q6/src/reader.jl:7 [inlined]
 [5] VariantCallFormat.Reader(input::IOStream)
   @ VariantCallFormat ~/.julia/packages/VariantCallFormat/wT4q6/src/reader.jl:20
 [6] top-level scope
   @ REPL[7]:1

However the following works (deleting the - in the key name)

##GVCFBlock020=minGQ=0(inclusive),maxGQ=20(exclusive)
##GVCFBlock2030=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock3040=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40100=minGQ=40(inclusive),maxGQ=100(exclusive)

Could you consider changing the behavior of your package? I'm not sure, however, whether including the - in the header key violates the VCF spec (this is v4.2).
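
Until the parser accepts such keys, the workaround described above (deleting the - from the key) can be applied to header lines before they reach the reader. The sketch below uses only Base Julia. Both regexes are assumptions made for illustration: STRICT_META is a guess at what a strict key pattern might look like, not the pattern used in reader.jl, and strip_dashes_from_key is a hypothetical helper.

```julia
# Assumption: a strict key pattern that only allows letters, digits and '_'.
const STRICT_META = r"^##([A-Za-z0-9_]+)=(.*)$"
# A more permissive pattern: any non-space characters up to the first '='.
const LENIENT_META = r"^##([^=\s]+)=(.*)$"

# Workaround from the report: drop '-' from the key of a "##key=value" line.
function strip_dashes_from_key(line::AbstractString)
    m = match(LENIENT_META, line)
    m === nothing && return String(line)   # not a meta-information line
    key = replace(m.captures[1], "-" => "")
    return "##" * key * "=" * m.captures[2]
end

line = "##GVCFBlock0-20=minGQ=0(inclusive),maxGQ=20(exclusive)"
println(match(STRICT_META, line) === nothing)  # strict pattern rejects the key
println(strip_dashes_from_key(line))
```

Only the key is rewritten; a - inside the value part of the line is left alone.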

Roadmap (near future)

Roadmap overview

Here we outline the updates that are coming in VariantCallFormat.jl 0.5.x and 0.6.
It's mainly an effort to update VariantCallFormat.jl to be written in "modern" Julia and to clean up the interface.

0.5.x goals

No breaking changes (compared to GeneticVariation.jl 0.4).
Define the public interface of VariantCallFormat.jl. (Since most functions are unexported at the moment, it is currently not clear what is public and what is private.)
Implement the same methods/interface for VCF and BCF (making it easier for users to write generic code).
Follow conventions from Base Julia.
Support getproperty for Record. E.g. record.pos, record.filter, record.info[key]. This is an easy, understandable interface and a nice way to get around name clash issues with Base.
Add deprecations in preparation for 0.6.

0.6 goals

Improve type stability for info and genotype.
Breaking changes needed to unify VCF/BCF interface.
Minimize and document breaking changes.

Specific changes planned

Header interface

Public interface: Add metainfo(header) and sampleids(header)
Do not wrap Base methods without reason - deprecate eltype, length, iterate, pushfirst!, push!. (Use e.g. push!(metainfo(header)) instead.)
Deprecate findall(header,tag) and replace with metainfo(header,tag).

VCF/BCF records

Add AbstractRecord.
Add getproperty support.
VCF/BCF methods should take the same argument types and use the same return types (as far as possible).
Implement missing BCF methods such as hasqual(), hasinfo()
Unify missing data handling. E.g. hasqual(), qual() should behave identically for VCF/BCF.

Type stability

Achieving type stability for some VCF/BCF fields is tricky, because the data types are encoded in the file and cannot be known at compile time. However, the user often knows the type. Consider the INFO field DP, which must be an integer. By specifying the desired type in the call, the user code becomes type stable, and the user gets a good error message if the type is wrong (malformed file). Example:

record.info(Int,"DP")

As a bonus, this makes it easy to handle both VCF/BCF with generic, type-stable code, even though the field is saved as a String in VCF and as (vector of) int8_t/int16_t/int32_t/float/char in BCF.

If the user doesn't specify the type, the VCF/BCF header will be used to determine the return type (when needed).
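
The idea of passing the desired type in the call can be sketched without the package. The helper infovalue below is hypothetical and works on a raw INFO string; it is not the planned record.info implementation. It only illustrates how a type argument fixes the return type and yields a readable error on a mismatch.

```julia
# Hypothetical helper, not the package API: look up `key` in a raw INFO
# string and parse its value as `T`. The return type is always `T`.
function infovalue(::Type{T}, info::AbstractString, key::AbstractString) where {T}
    for entry in split(info, ';')
        parts = split(entry, '='; limit=2)
        # Skip flags (no '=') and entries with a different key.
        (length(parts) == 2 && parts[1] == key) || continue
        parsed = tryparse(T, parts[2])
        parsed === nothing &&
            error("INFO field $key: cannot parse \"$(parts[2])\" as $T")
        return parsed
    end
    throw(KeyError(key))
end

info = "DP=43;AF=0.023256;SB=3"
depth = infovalue(Int, info, "DP")       # an Int
freq = infovalue(Float64, info, "AF")    # a Float64
println((depth, freq))
```

Asking for infovalue(Int, info, "AF") raises an error that names the field, the offending text and the requested type, which is the "good error message" behavior described above.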

1000Genomes VCF not parsing

shell> ls
1000GENOMES-phase_3.vcf.gz  Manifest.toml  Project.toml

julia> reader = VCF.Reader(GzipDecompressorStream(open("1000GENOMES-phase_3.vcf.gz")))
VariantCallFormat.Reader(BioCore.Ragel.State{BufferedStreams.BufferedInputStream{TranscodingStreams.TranscodingStream{GzipDecompressor, IOStream}}}(BufferedStreams.BufferedInputStream{TranscodingStreams.TranscodingStream{GzipDecompressor, IOStream}}(<128.0 KiB buffer, 95% filled, data immobilized>), 1, 68, false), VariantCallFormat.Header:
  metainfo tags: fileformat fileDate source reference INFO contig
     sample IDs: )

julia> for record in reader
       end
ERROR: VariantCallFormat.Reader file format error on line 68 ~>";MA=T;MA"
Stacktrace:
 [1] error(::Type, ::String, ::Int64, ::String, ::String)
   @ Base ./error.jl:44
 [2] _read!(reader::VariantCallFormat.Reader, state::BioCore.Ragel.State{BufferedStreams.BufferedInputStream{TranscodingStreams.TranscodingStream{GzipDecompressor, IOStream}}}, record::VariantCallFormat.Record)
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:164
 [3] read!(reader::VariantCallFormat.Reader, record::VariantCallFormat.Record)
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:134
 [4] tryread!(reader::VariantCallFormat.Reader, output::VariantCallFormat.Record)
   @ BioCore.IO ~/.julia/packages/BioCore/YBJvb/src/IO.jl:73
 [5] iterate(reader::VariantCallFormat.Reader, nextone::VariantCallFormat.Record)
   @ BioCore.IO ~/.julia/packages/BioCore/YBJvb/src/IO.jl:84
 [6] top-level scope
   @ ./REPL[10]:2

julia>

Problem with parsing a VCF file

Hi,

I used samtools and bcftools, as well as bwa, to generate a VCF file.
These tools are well established, so it's strange that they generate a VCF file that I can't read with VariantCallFormat. As a check, I read the same file with pyvcf 0.6.8 without any problems.

It appears the problem was with the CHROM field, which contained NC_000014.8:106641561-106641856, while POS was 1 instead of the absolute position.
After writing a small function that rewrites the first column to NC_000014.8 and the second to 106641561+1, I could read the file.
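
A small function of the kind described might look like the sketch below. It is a hypothetical helper, not the reporter's actual code. The default offset reproduces the 106641561+1 arithmetic quoted above; whether that or start + POS - 1 is right depends on the coordinate convention of the region string, so the offset is exposed as the keyword shift.

```julia
# Hypothetical preprocessing helper (not part of VariantCallFormat.jl).
# Rewrites "NAME:START-END<TAB>POS..." to "NAME<TAB>START+POS+shift...".
# shift = 0 reproduces the report (POS 1 becomes START + 1); use shift = -1
# if START is 1-based and inclusive, so that POS 1 maps to START itself.
function absolutize(line::AbstractString; shift::Integer = 0)
    startswith(line, '#') && return String(line)   # header lines are kept
    fields = String.(split(line, '\t'))
    length(fields) < 2 && return String(line)
    m = match(r"^(.+):(\d+)-(\d+)$", fields[1])
    m === nothing && return String(line)           # CHROM has no region suffix
    start = parse(Int, m.captures[2])
    pos = parse(Int, fields[2])
    fields[1] = m.captures[1]
    fields[2] = string(start + pos + shift)
    return join(fields, '\t')
end

# Made-up record using the CHROM value from the report.
line = "NC_000014.8:106641561-106641856\t1\t.\tG\tA"
println(absolutize(line))
```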

However, accessing record.pos gives me a range, which is not even correct, when it should give the position.
I checked another VCF file that is available online and has no such issue, and again pos returns the wrong value:

v[1]
VariantCallFormat.Record:
   chromosome: NC_000001.10
     position: 10001
   identifier: rs1570391677
    reference: T
    alternate: A C
      quality: <missing>
       filter: <missing>
  information: RS=1570391677 dbSNPBuildID=154 SSR=0 PSEUDOGENEINFO=DDX11L1:100287102 VC=SNV R5 GNO FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0 COMMON 
       format: <missing>

julia> v[1].pos
14:18

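whereas in Python, PyVCF rightfully gives me the value 10001 for this record's position. (The session below indexes the list with [1], which in Python is the second record, hence 10002.)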

In [2]: import vcf

In [3]: vcf_reader = vcf.Reader(open('test.vcf','r'))

In [4]: r = [r for r in vcf_reader][1]

In [5]: r.POS
Out[5]: 10002

Again, this is a public dbSNP dataset, something very well established.
This all suggests that VariantCallFormat (v0.5.5 according to Pkg.status) is broken, so as suggested I am filing a bug report.

I hope this helps to solve the problem. Do you have any tips for a temporary workaround? Or is writing a custom parser the quickest option?
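
One observation that may explain the 14:18 value: it coincides with the character range of the POS column in the record's text line, since the CHROM value NC_000001.10 occupies characters 1 to 12 and a tab follows at 13. That suggests record.pos reads an internal field holding the location of the POS text, not the parsed number. This is a hypothesis, not confirmed from the package source. If so, a function-style accessor in the spirit of the VCF.info(record, key) call used elsewhere on this page (for example VCF.pos(record), an assumption to check against the package documentation) would be the way to get the integer. The plain-Julia sketch below reproduces the coincidence; column_range is a hypothetical helper and the line is reconstructed from the record printed above.

```julia
# Return the character range of the n-th tab-separated column of `line`.
# Assumes ASCII text, so that index arithmetic on the string is safe.
function column_range(line::AbstractString, n::Integer)
    start = 1
    for _ in 1:n-1
        tab = findnext('\t', line, start)
        tab === nothing && throw(ArgumentError("line has fewer than $n columns"))
        start = tab + 1
    end
    stop = findnext('\t', line, start)
    return start:(stop === nothing ? lastindex(line) : stop - 1)
end

line = "NC_000001.10\t10001\trs1570391677\tT\tA,C\t.\t.\tRS=1570391677"
r = column_range(line, 2)
println(r)                      # where the POS text sits in the line
println(parse(Int, line[r]))    # the parsed position
```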

Assertion Error: endpos != 0 when accessing info field

I have several VCF files generated by lofreq. The VCF files generated for tumor samples have no problems with VCF.Reader(). The normal VCF files, however, fail on VCF.info(record,"DP4"), which throws the assertion error endpos != 0 from line 517 of record.jl.

INFO line of the header
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">
##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant).">
##INFO=<ID=HRUN,Number=1,Type=Integer,Description="Homopolymer length to the right of report indel position">

Example Record
1 11877 . G A 13 . DP=43;AF=0.023256;SB=3;DP4=24,18,0,1

I have surmised that the error occurs because DP4 is at the end of the line. If I add a ; followed by any random letter, R for example,

1 11877 . G A 13 . DP=43;AF=0.023256;SB=3;DP4=24,18,0,1;R

the reader has no issue reading the field "DP4" and returns 24,18,0,1.

This seems trivial. I do not understand why "DP4" cannot be at the end of the line (or of the information field), as there clearly are VCF files written this way. Other readers I have tried have no problems reading these files. Any suggestions for a patch that avoids altering lots of VCF files would be appreciated.
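
The symptom is consistent with a value scanner that looks for a terminating ; and does not handle the case where none is found. That is a guess based on the assertion text, not on reading record.jl. The sketch below shows the end-of-field case handled explicitly; raw_info is a hypothetical helper operating on the raw INFO string, not the package's implementation.

```julia
# Hypothetical sketch, not the code in record.jl: return the raw value of
# `key` from an INFO string by scanning for the terminating ';'.
function raw_info(info::AbstractString, key::AbstractString)
    pos = firstindex(info)
    while pos <= lastindex(info)
        stop = findnext(';', info, pos)
        # No ';' left: the entry runs to the end of the string.
        entry_end = stop === nothing ? lastindex(info) : prevind(info, stop)
        entry = SubString(info, pos, entry_end)
        eq = findfirst('=', entry)
        name = eq === nothing ? entry : SubString(entry, 1, prevind(entry, eq))
        if name == key
            return eq === nothing ? "" : String(SubString(entry, nextind(entry, eq)))
        end
        stop === nothing && break
        pos = nextind(info, stop)
    end
    throw(KeyError(key))
end

info = "DP=43;AF=0.023256;SB=3;DP4=24,18,0,1"
dp4 = raw_info(info, "DP4")
println(dp4)
println(parse.(Int, split(dp4, ',')))
```

With this handling, the last entry is returned whether or not anything follows it, so appending ;R is not needed.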

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Read from middle of VCF file?

I suppose this package is the same as GeneticVariation.jl, but any chance this package will support multithreaded read? The standard

reader = VCF.Reader(open("example.vcf", "r"))
for record in reader
    # do something
end
close(reader)

requires looping over every record. On large VCF files, just looping through all records can take a few hours. Essentially we need some way to query the reader at the ith position.
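
For uncompressed VCF text, one way to reach the ith record without parsing everything before it is a byte-offset index built in a single cheap pass over the lines. The sketch below uses only Base Julia and an in-memory sample with made-up records. record_offsets and record_line are hypothetical helpers, not package functions, and the approach does not carry over to a gzip-compressed stream, which cannot be seeked this way.

```julia
# Record the byte offset at which every data line (non-'#' line) starts.
function record_offsets(io::IO)
    offsets = Int[]
    seekstart(io)
    while !eof(io)
        offset = position(io)
        line = readline(io)
        (isempty(line) || startswith(line, '#')) && continue
        push!(offsets, offset)
    end
    return offsets
end

# Jump straight to the i-th record using the precomputed offsets.
function record_line(io::IO, offsets::Vector{Int}, i::Integer)
    seek(io, offsets[i])
    return readline(io)
end

# Made-up sample data for illustration.
vcf = """
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t11877\t.\tG\tA\t13\t.\tDP=43
1\t11900\t.\tC\tT\t20\t.\tDP=51
1\t12000\t.\tT\tG\t35\t.\tDP=38
"""
io = IOBuffer(vcf)
offsets = record_offsets(io)
println(record_line(io, offsets, 3))
```

With such an index, separate tasks could each open their own file handle and seek to a different slice of the offsets, which is one possible route to the multithreaded read asked about here.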