GithubHelp home page GithubHelp logo

zilong-li / vcfppr Goto Github PK

View Code? Open in Web Editor NEW
8.0 2.0 1.0 12.74 MB

vcfpp.h + htslib + Rcpp = fast VCF/BCF https://doi.org/10.1093/bioinformatics/btae049

Home Page: https://zilong-li.github.io/vcfppR/

License: Other

R 0.77% C++ 1.38% Makefile 1.76% C 93.45% Roff 1.10% M4 1.27% Shell 0.27%
bioinformatics fastr vcf htslib population-genetics population-genomics rstats vcf-data visulization

vcfppr's Introduction

vcfppR: rapid manipulation of the VCF/BCF file

R-CMD-check CRAN status https://github.com/Zilong-Li/vcfppR/releases/latest codecov Downloads

The vcfppR package implements various powerful functions for fast genomics analyses with VCF/BCF files using the C++ API of vcfpp.h.

  • Load/save content of VCF/BCF into R objects with highly customizable options
  • Visualize and chracterize variants
  • Compare two VCF/BCF files and report various statistics
  • Streaming read/write VCF/BCF files with fine control of everything
  • Paper shows vcfppR is 20x faster than vcfR. Also, much faster than cyvcf2

Installation

## install.package("vcfppR") ## from CRAN
remotes::install_github("Zilong-Li/vcfppR") ## from latest github

Cite the work

If you find it useful, please cite the paper

library(vcfppR)
citation("vcfppR")

URL as filename

All functions in vcfppR support URL as filename of VCF/BCF files.

phasedvcf <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
rawvcf <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr21.recalibrated_variants.vcf.gz"
svfile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20210124.SV_Illumina_Integration/1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.vcf.gz"
popfile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/20130606_g1k_3202_samples_ped_population.txt"

vcfcomp: compare two VCF files and report concordance

Want to investigate the concordance between two VCF files? vcfcomp is the utility function you need! For example, in benchmarkings, we intend to calculate the genotype correlation between the test and the truth.

library(vcfppR)
res <- vcfcomp(test = rawvcf, truth = phasedvcf,
               stats = "r2", region = "chr21:1-5100000", 
               formats = c("GT","GT"))
par(mar=c(5,5,2,2), cex.lab = 2)
vcfplot(res, col = 2, cex = 2, lwd = 3, type = "b")

res <- vcfcomp(test = rawvcf, truth = phasedvcf,
               stats = "pse",
               region = "chr21:5000000-5500000",
               samples = "HG00673,NA10840",
               return_pse_sites = TRUE)
#> stats F1 or NRC or PSE only uses GT format
vcfplot(res, what = "pse", which=1:2, main = "Phasing switch error", ylab = "HG00673,NA10840")

Check out the vignettes for more!

vcfsummary: variants characterization

Want to summarize variants discovered by genotype caller e.g. GATK? vcfsummary is the utility function you need!

Small variants

res <- vcfsummary(rawvcf,"chr21:10000000-10010000")
vcfplot(res, pop = popfile, col = 1:5, main = "Number of SNP & INDEL variants per population")

Complex structure variants

res <- vcfsummary(svfile, svtype = TRUE, region = "chr20")
vcfplot(res, main = "Structure Variant Counts", col = 1:7)

vcftable: read VCF as tabular data

vcftable gives you fine control over what you want to extract from VCF/BCF files.

Read SNP variants with GT format and two samples

res <- vcftable(phasedvcf, "chr21:1-5100000", samples = "HG00673,NA10840", vartype = "snps")
str(res)
#> List of 10
#>  $ samples: chr [1:2] "HG00673" "NA10840"
#>  $ chr    : chr [1:194] "chr21" "chr21" "chr21" "chr21" ...
#>  $ pos    : int [1:194] 5030578 5030588 5030596 5030673 5030957 5030960 5031004 5031031 5031194 5031224 ...
#>  $ id     : chr [1:194] "21:5030578:C:T" "21:5030588:T:C" "21:5030596:A:G" "21:5030673:G:A" ...
#>  $ ref    : chr [1:194] "C" "T" "A" "G" ...
#>  $ alt    : chr [1:194] "T" "C" "G" "A" ...
#>  $ qual   : num [1:194] 2.14e+09 2.14e+09 2.14e+09 2.14e+09 2.14e+09 ...
#>  $ filter : chr [1:194] "." "." "." "." ...
#>  $ info   : chr [1:194] "AC=74;AF=0.0115553;CM=0;AN=6404;AN_EAS=1170;AN_AMR=980;AN_EUR=1266;AN_AFR=1786;AN_SAS=1202;AN_EUR_unrel=1006;AN"| __truncated__ "AC=53;AF=0.00827608;CM=1.78789e-05;AN=6404;AN_EAS=1170;AN_AMR=980;AN_EUR=1266;AN_AFR=1786;AN_SAS=1202;AN_EUR_un"| __truncated__ "AC=2;AF=0.000312305;CM=3.21821e-05;AN=6404;AN_EAS=1170;AN_AMR=980;AN_EUR=1266;AN_AFR=1786;AN_SAS=1202;AN_EUR_un"| __truncated__ "AC=2;AF=0.000312305;CM=0.00016985;AN=6404;AN_EAS=1170;AN_AMR=980;AN_EUR=1266;AN_AFR=1786;AN_SAS=1202;AN_EUR_unr"| __truncated__ ...
#>  $ gt     : int [1:194, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
#>  - attr(*, "class")= chr "vcftable"

Read SNP variants with PL format and drop the INFO column

res <- vcftable(rawvcf, "chr21:1-5100000", vartype = "snps", format = "PL", info = FALSE)
str(res)
#> List of 10
#>  $ samples: chr [1:3202] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>  $ chr    : chr [1:1383] "chr21" "chr21" "chr21" "chr21" ...
#>  $ pos    : int [1:1383] 5030082 5030088 5030105 5030253 5030278 5030347 5030356 5030357 5030391 5030446 ...
#>  $ id     : chr [1:1383] "." "." "." "." ...
#>  $ ref    : chr [1:1383] "G" "C" "C" "G" ...
#>  $ alt    : chr [1:1383] "A" "T" "A" "T" ...
#>  $ qual   : num [1:1383] 70.1 2773.1 3897.8 102.6 868.9 ...
#>  $ filter : chr [1:1383] "VQSRTrancheSNP99.80to100.00" "VQSRTrancheSNP99.80to100.00" "VQSRTrancheSNP99.80to100.00" "VQSRTrancheSNP99.80to100.00" ...
#>  $ info   : chr(0) 
#>  $ PL     : int [1:1383, 1:9606] NA 64 64 0 0 0 0 0 0 0 ...
#>  - attr(*, "class")= chr "vcftable"

Read INDEL variants with DP format and QUAL>100

res <- vcftable(rawvcf, "chr21:1-5100000", vartype = "indels", format = "DP", qual=100)
str(res)
#> List of 10
#>  $ samples: chr [1:3202] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>  $ chr    : chr [1:180] "chr21" "chr21" "chr21" "chr21" ...
#>  $ pos    : int [1:180] 5030240 5030912 5030937 5031018 5031125 5031147 5031747 5031881 5031919 5031984 ...
#>  $ id     : chr [1:180] "." "." "." "." ...
#>  $ ref    : chr [1:180] "AC" "CA" "T" "C" ...
#>  $ alt    : chr [1:180] "A" "C" "TGGTGCACGCCTGCAGTCCCGGC" "CCTATGATCACACCGT" ...
#>  $ qual   : num [1:180] 6872 321 233 270 58361 ...
#>  $ filter : chr [1:180] "VQSRTrancheINDEL99.00to100.00" "PASS" "PASS" "PASS" ...
#>  $ info   : chr [1:180] "AC=82;AF=0.0136439;AN=6010;BaseQRankSum=-0.55;ClippingRankSum=0;DP=18067;FS=0;InbreedingCoeff=0.0664;MLEAC=76;M"| __truncated__ "AC=19;AF=0.0030586;AN=6212;BaseQRankSum=0.358;ClippingRankSum=-0.594;DP=30725;FS=44.58;InbreedingCoeff=-1.1608;"| __truncated__ "AC=1;AF=0.00015753;AN=6348;BaseQRankSum=-1.793;ClippingRankSum=-0.48;DP=32486;FS=27.337;InbreedingCoeff=-0.0212"| __truncated__ "AC=2;AF=0.000314268;AN=6364;BaseQRankSum=1.59;ClippingRankSum=0.358;DP=44916;FS=0;InbreedingCoeff=-0.0092;MLEAC"| __truncated__ ...
#>  $ DP     : int [1:180, 1:3202] 3 15 13 16 19 20 5 34 29 24 ...
#>  - attr(*, "class")= chr "vcftable"

R API of vcfpp.h

There are two classes i.e. vcfreader and vcfwriter offering the full R-bindings of vcfpp.h. Check out the examples in the tests folder or refer to the manual.

?vcfppR::vcfreader

vcfppr's People

Contributors

kalibera avatar zilong-li avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

kalibera

vcfppr's Issues

vcfcomp returns NAs

Hi Zilong-Li,

Thank you for your software. I found it from your reply here.

I am currently trying to use vcfcomp to test concordance between some imputed genotypes and their original genotypes using the following script:
library(vcfppR)
res <- vcfcomp(test = "x.vcf.gz", truth = "y.vcf.gz", region="z|z", af="af.tsv", stats = "r2")
str(res)
write.table(res, file = "concordance", sep = "\t", col.names = T, row.names = T, quote = FALSE)

I used vcftools --freq to output the allele frequency for af.tsv.

However, this returns a dataframe fulls of NAs:

str(res)
List of 2
$ samples: chr [1:25] "/fastdata/bop21kgl/RawData/LIMS26076p5/Clean_aligned/418-2054_221014_L003__all_mapped_rehead.bam" "/fastdata/bop21$
$ r2 : logi [1, 1:17] NA NA NA NA NA NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:17] "(0,1e-05]" "(1e-05,2e-05]" "(2e-05,5e-05]" "(5e-05,0.0001]" ...
write.table(res, file = "concordance", sep = "\t", col.names = T, row.names = T, quote = FALSE)

proc.time()
user system elapsed
190.199 5.955 196.606

Do you have a suggestion of where this is going wrong?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.