medvedevgroup / vargeno

Towards fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics.

Home Page: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty641/5056043

License: MIT License

Topics: snps, genotyping, bioinformatics, algorithms, data-structures, computational-biology

vargeno's Introduction

VarGeno

A fast SNP genotyping tool for whole genome sequencing data and large SNP databases.

Install from Bioconda

VarGeno can be installed from Bioconda with the command conda install vargeno.

See https://bioconda.github.io for more information about Bioconda.
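
If the Bioconda channels have not been set up yet, specifying them explicitly on the command line should also work (a sketch; the -c options are standard conda flags and the package name is taken from this README):

# install VarGeno with the bioconda and conda-forge channels given explicitly
conda install -c conda-forge -c bioconda vargeno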

If you do not have Bioconda installed, you can install VarGeno from source code.

Quick Usage

VarGeno takes as input:

  1. A reference genome sequence in FASTA file format.
  2. A list of SNPs to be genotyped in VCF file format.
  3. Sequencing reads from the donor genome in FASTQ file format.

Before genotyping an individual, you must construct indices for the reference and the SNP list using the following command:

vargeno index ref.fa snp.vcf index_prefix

To perform the genotyping:

vargeno geno index_prefix reads.fq snp.vcf output_filename

Here index_prefix must be the same string that was used when generating the index.
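
For example, a minimal end-to-end sketch (human.fa, dbsnp.vcf and sample.fq are placeholder file names, not files shipped with VarGeno):

# build the indices once for a given reference and SNP list
vargeno index human.fa dbsnp.vcf human_dbsnp

# genotype a sample, reusing the same prefix (human_dbsnp) that was passed to vargeno index
vargeno geno human_dbsnp sample.fq dbsnp.vcf sample.genotypes.vcf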

Output format: VCF

VarGeno reports genotyping results in the per-sample fields declared in the "FORMAT" column of the output VCF file:

  1. genotype: in the "GT" field, one of 0/0, 0/1 or 1/1.
  2. genotype quality: in the "GQ" field, encoded as a Phred-scaled integer quality.

For details of the "GT" and "GQ" fields, please refer to The Variant Call Format (VCF) Version 4.2 Specification.
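
As an illustration, the genotype and quality of every genotyped site can be pulled out of the output with a short awk one-liner (a sketch that assumes a single sample column and a FORMAT string of exactly GT:GQ):

# print CHROM, POS, ID, GT and GQ for each non-header line
awk 'BEGIN{OFS="\t"} !/^#/ {split($10, f, ":"); print $1, $2, $3, f[1], f[2]}' output_filename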

Install from Source Code

Prerequisites

  • A modern, C++11-ready compiler, such as g++ version 4.9 or higher.
  • The cmake build system (only needed to install the SDSL library; if the SDSL library is already installed, cmake is not required).
  • A 64-bit operating system. Mac OS X and Linux are currently supported.

Install Command

git clone https://github.com/medvedevgroup/vargeno.git
cd vargeno
export PREFIX=$HOME
bash ./install.sh

You should then see the vargeno executable in the vargeno directory. To verify that your installation is correct, you can run the toy example below.

Example

The example dataset is available at https://github.com/medvedevgroup/vargeno/tree/master/test.

In this example, we genotype 100 SNPs on human chromosome 22 with a small subset of 1000 Genomes Project Illumina sequencing reads. The whole process should finish in around a minute and requires 34 GB of RAM.

  1. Go to the test data directory.

  2. Pre-process the reference and SNP list to generate the indices:

vargeno index chr22.fa snp.vcf test_prefix

  3. Genotype the variants:

vargeno geno test_prefix reads.fq snp.vcf genotyped.vcf

The expected output of VarGeno on the example dataset should match https://github.com/medvedevgroup/vargeno/blob/master/test/expected_output.
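
To compare a run against the expected output, something along these lines can be used (a sketch; it assumes expected_output has been downloaded into the test directory and that both files are plain-text VCF):

# compare the variant lines, ignoring the meta-information header
diff <(grep -v '^##' genotyped.vcf) <(grep -v '^##' expected_output)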

Memory Lite Version

The memory-lite version of VarGeno (VarGeno-Lite) is maintained as an independent project at https://github.com/medvedevgroup/vargeno_lite.

Citation

If you use VarGeno in your research, please cite

  • Chen Sun and Paul Medvedev, Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics, doi:10.1093/bioinformatics/bty641.

VarGeno's algorithm builds on LAVA's, and its code base reuses a substantial amount of LAVA's code as well as some code from the AllSome project.

vargeno's People

Contributors

bbsunchen, pashadag


vargeno's Issues

Indexing problem

I'm facing the same problem reported in #7 .
I am using the test dataset, but the index command generates empty .dict files. I'm running vargeno on Ubuntu 20.04.2 in a conda environment, and the terminal output shows that the bit vector associated with the Bloom filter built from the VCF is empty:

[BloomFilter constructBfFromGenomeseq] bit vector: 27926073/9600000000                                                  
[BloomFilter constructBfFromGenomeseq] lite bit vector: 29747190/18400000000                                            
[BloomFilter constructBfFromVCF] bit vector: 0/1120000000                                                               
SNP Dictionary                                                                                                          
Total k-mers:        2816                                                                                               
Unambig k-mers:      2816                                                                                               
Ambig unique k-mers: 0                                                                                                  
Ambig total k-mers:  0                                                                                                  
Ref Dictionary                                                                                                         
Total k-mers:        34894128                                                                                          
Unambig k-mers:      31402166                                                                                           
Ambig unique k-mers: 926632                                                                                             
Ambig total k-mers:  3491962  

Empty .dict files

I've been trying to run vargeno on non-human data and running into problems at the indexing stage. No error is reported during the process, but the .dict files are both empty, and so the genotyping step fails.

I'm working with a fragmentary reference assembly of a grasshopper genome, so both the bioinformatic and biological properties of the data are not at all what vargeno was designed for.

Do you have any tips for troubleshooting? Attached (here) is a sample of the .vcf input. Since my data is not human and I'm obviously not working with dbSNP, it's a little unclear how to properly format this file. Variants were detected with freebayes in the first instance.

Here is the terminal output:

$ vargeno index packardii.sub.fa snp.vcf test
[BloomFilter constructBfFromGenomeseq] bit vector: 755356701/9600000000
[BloomFilter constructBfFromGenomeseq] lite bit vector: 988176227/18400000000
[BloomFilter constructBfFromVCF] bit vector: 0/1120000000
SNP Dictionary
Total k-mers:        21626752
Unambig k-mers:      20575340
Ambig unique k-mers: 296062
Ambig total k-mers:  1051412
Ref Dictionary
Total k-mers:        1305711431
Unambig k-mers:      1130124620
Ambig unique k-mers: 36489256
Ambig total k-mers:  175586811

And here are the output files:

-rw-r--r--  1 oliver users   12348187 Feb  5 11:42 test.chrlens
-rw-r--r--  1 oliver users 1200000008 Feb  5 10:43 test.ref.bf
-rw-r--r--  1 oliver users 2300000008 Feb  5 10:43 test.ref.bf.lite.bf
-rw-r--r--  1 oliver users          0 Feb  5 14:47 test.ref.dict
-rw-r--r--  1 oliver users  140000008 Feb  5 11:41 test.snp.bf
-rw-r--r--  1 oliver users          0 Feb  5 11:42 test.snp.dict

All of the test files (in /vargeno/test) run fine and reproduce the provided output files. I'm running on Ubuntu 18.04.5 in a conda environment with the following packages:

# packages in environment at /home/oliver/miniconda2/envs/vargeno:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
bioawk                    1.0                  hed695b0_5    bioconda
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
seqtk                     1.3                  hed695b0_2    bioconda
vargeno                   1.0.3                hc9558a2_1    bioconda
zlib                      1.2.11            h516909a_1010    conda-forge
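
A generic first check in a case like this (a diagnostic sketch, not part of the original report) is whether the sequence names in the FASTA headers match the names in the first column of the VCF, since mismatched chromosome names are a known cause of empty indices (see the next issue):

# list the sequence names used by each input and show any that appear in only one of them
grep '^>' packardii.sub.fa | sed 's/^>//; s/[[:space:]].*//' | sort -u > fa_names.txt
grep -v '^#' snp.vcf | cut -f1 | sort -u > vcf_names.txt
comm -3 fa_names.txt vcf_names.txt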

Test example produces an empty VCF file

Running the test example from the documentation produces an empty VCF file:

vargeno index chr22.fa snp.vcf test_prefix
vargeno geno test_prefix reads.fq snp.vcf genotyped.vcf

The resulting genotyped.vcf file does not contain any variants:

cat genotyped.vcf | tail -3

produces:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	DONOR

Output VCF file is empty

Hello,
I tried to run vargeno (last commit, 00ee0f0) on the following data: the reference, the VCFs provided by 1000genomes, and a sample sequenced from the NA12878 individual.

I ran some tests on different chromosomes, but I always got an output VCF file that contains only the header. I tried to figure out the reasons for this, and these are my hypotheses.

I think the main problem is in the index step. Indeed, in my various tests, I obtain

...
[BloomFilter constructBfFromVCF] bit vector: 0/1120000000
...

From here, I started digging into the code and I saw that there could be some problems with how you index the input reference. In more detail:

  • when you read the VCF file and extract the chromosome name from each line, you add "chr" at its start:
    if(chr_name[0] != 'c') chr_name = "chr" + chr_name;
    but you don't do the same when you read the headers from the input FASTA file:
    id = line.substr(1);
    For instance, this is a problem if the headers in the FASTA file contain only the chromosome number. I think this problem also affects the way you store information in the .chrlens file of your index (vargeno/src/qv.cc, line 2345 at commit 00ee0f0):

    fprintf(chrlens, "%s %lu\n", ref.seqs[i].name, ref.seqs[i].size);

  • when a header in the FASTA file contains the unique identifier for the sequence and also additional information (such as ">22 dna:chromosome chromosome:GRCh37:22:1:51304566:1"), you treat the whole line as the unique identifier:
    id = line.substr(1);

    but when you parse the VCF file, you consider only the unique identifier as the chromosome name, since you take it from the first column of the VCF (this should not affect the .chrlens file, since you use a different function to read the reference FASTA when you write the .chrlens file).

I tried to solve these two problems by changing some lines in your code, but I'm not sure whether what I've done is correct and sufficient (I'll open a pull request anyway; it could be a good starting point for you). With my fixes, the output is no longer empty.
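
A possible pre-processing workaround along the lines described above (a sketch only, not from the original report: it rewrites the FASTA headers to a bare, "chr"-prefixed identifier so that they match what the VCF-parsing code produces; ref.fa is a placeholder file name):

# keep only the first whitespace-separated token of each header and prepend "chr" when missing
awk '/^>/ { name = substr($1, 2); if (name !~ /^chr/) name = "chr" name; print ">" name; next } { print }' ref.fa > ref.renamed.fa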

Moreover, I think this behaviour also occurs if the input VCF declares the GT field in the header and contains per-sample GT columns (as in the VCFs provided by the 1000 Genomes project). If I run vargeno index, I obtain:

...
SNP Dictionary               
Total k-mers:        0
Unambig k-mers:      0
Ambig unique k-mers: 0
Ambig total k-mers:  0
...

For now, I worked around this problem by removing the GT line from the header and the ~2500 sample columns from the VCFs. Maybe you can find a better solution, or update the readme accordingly.
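
The same workaround can be scripted along these lines (a sketch; it assumes an uncompressed, tab-delimited VCF, and ALL.samples.vcf / snp.vcf are placeholder file names):

# drop the ##FORMAT header lines and keep only the 8 fixed VCF columns,
# discarding the FORMAT column and the ~2500 per-sample genotype columns
grep -v '^##FORMAT' ALL.samples.vcf | cut -f1-8 > snp.vcf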

Thanks in advance!

Best,
Luca

Segmentation fault during geno step

Hi,
when I try to run vargeno on the same data linked in my previous issue (#2), it crashes during the geno step.

This is the output of vargeno index:

[BloomFilter constructBfFromGenomeseq] bit vector: 1130814221/9600000000
[BloomFilter constructBfFromGenomeseq] lite bit vector: 2131757218/18400000000
[BloomFilter constructBfFromVCF] bit vector: 68265608/1120000000
SNP Dictionary
Total k-mers:        2593345952
Unambig k-mers:      2367171409
Ambig unique k-mers: 37905369
Ambig total k-mers:  226174543
Ref Dictionary
Total k-mers:        2858648351
Unambig k-mers:      2488558606
Ambig unique k-mers: 61723937
Ambig total k-mers:  370089745

and these are the files produced during the index step:

4.0K    vargeno.RMNISTHS_30xdownsample.index.chrlens
1.2G    vargeno.RMNISTHS_30xdownsample.index.ref.bf
2.2G    vargeno.RMNISTHS_30xdownsample.index.ref.bf.lite.bf
34G     vargeno.RMNISTHS_30xdownsample.index.ref.dict
134M    vargeno.RMNISTHS_30xdownsample.index.snp.bf
39G     vargeno.RMNISTHS_30xdownsample.index.snp.dict

When running the geno step, vargeno prints "Processing..." and crashes shortly thereafter:

Initializing...
Processing...
Segmentation fault (core dumped)

\time reports that it is terminated by signal 11, but I'm not sure where this happens. At first I thought it was due to RAM saturation (the machine used to test the tool is equipped with 256 GB of RAM), but the same behaviour occurs on a cluster with 1 TB of RAM.
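
One way to confirm whether memory is the limiting factor (a generic diagnostic sketch, not part of the original report; it assumes GNU time is available as /usr/bin/time, and reads.fq / snp.vcf / out.vcf are placeholder file names):

# record peak resident memory and the terminating signal of the geno step
/usr/bin/time -v vargeno geno vargeno.RMNISTHS_30xdownsample.index reads.fq snp.vcf out.vcf 2> geno.time.log
grep -E 'Maximum resident set size|Command terminated' geno.time.log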

Anyway, I also tried to run vargeno on a smaller set of variants (I halved the input VCF) and it was able to complete the analysis.

The complete VCF contains 84,739,838 variants and the sample consists of 696,168,435 reads. The whole (unzipped) data set occupies ~240 GB of disk space. If you want to reproduce this behaviour on your machine, I can share the data with you.

Luca
