mklarqvist / tomahawk

Fast calculations of linkage-disequilibrium in large-scale human cohorts

Home Page: https://mklarqvist.github.io/tomahawk/

License: MIT License

Makefile 1.28% C++ 97.24% Dockerfile 0.67% Shell 0.81%
bioinformatics genetics genomics linkage-disequilibrium population-genetics vectorization

tomahawk's Issues

Tomahawk crash with "std::regex_error" error

I'm trying to use Tomahawk on CentOS 7. Compilation is successful, but the resulting binary crashes immediately after execution:

$ ./tomahawk
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
zsh: abort      ./tomahawk

I am not using the provided install.sh script, instead using the following library versions as provided by CentOS 7:

  • libcurl-devel-7.29.0-57.el7.x86_64
  • openssl-devel-1.0.2k-19.el7.x86_64
  • htslib-devel-1.9-5.el7.x86_64
  • libzstd-devel-1.4.4-1.el7.x86_64

I notice you're using a specific commit hash for htslib in install.sh. Does tomahawk depend on certain versions of any of these libraries? Thanks!
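The issue does not state which compiler was used, so the following is a possibility to rule out rather than a confirmed diagnosis. The libstdc++ shipped with GCC releases before 4.9 has an incomplete <regex> that compiles but throws std::regex_error at run time, and the default compiler on CentOS 7 is GCC 4.8.5. A minimal probe, with a pattern and function name chosen only for illustration, can be built with the same compiler to check whether std::regex works at all:

```cpp
#include <regex>
#include <string>

// Returns true if std::regex can compile and apply a simple pattern,
// and false if the standard library throws std::regex_error instead.
// The pattern is hypothetical; it is not taken from the tomahawk sources.
bool regexIsUsable() {
    try {
        const std::regex pattern("^[a-z]+_[0-9]+$");
        return std::regex_match(std::string("chr_21"), pattern);
    } catch (const std::regex_error&) {
        return false;
    }
}
```

If this probe returns false, the crash would point at the toolchain rather than at any of the listed library versions.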

compilation issue

I'm having trouble compiling this on Linux. I tried both your Dockerfile (docker build -t tomahawk .) and our local system (CentOS 7.8 with gcc 7.3.1) and received the same error in both cases. It looks like a problem with the way headers are being included.

g++ -std=c++0x -O3 -msse4.2  -I../htslib/ -I./include/ -I./lib/ -I/usr/local/include/ -c -DVERSION=\"beta-0.7.1\" -o lib/header_internal.o lib/header_internal.cpp
In file included from lib/header_internal.cpp:1:0:
lib/header_internal.h:16:44: error: expected class-name before '{' token
 class VcfHeaderInternal : public VcfHeader {
                                            ^
lib/header_internal.h:37:28: error: 'string' in namespace 'std' does not name a type
  void AddSample(const std::string& sample_name);
                            ^
lib/header_internal.cpp: In function 'const bcf_hrec_t* tomahawk::GetPopulatedHrec(const bcf_idpair_t&)':
lib/header_internal.cpp:12:2: error: 'cerr' is not a member of 'std'
  std::cerr << "No populated hrec in idPair. Error in htslib." << std::endl;
  ^
lib/header_internal.cpp:12:66: error: 'endl' is not a member of 'std'
  std::cerr << "No populated hrec in idPair. Error in htslib." << std::endl;
...

best,

Jared
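Errors of this shape usually mean that a header relies on names it never includes itself: "'string' in namespace 'std' does not name a type" and "'cerr' is not a member of 'std'" appear when <string> and <iostream> are not pulled in directly or transitively, and "expected class-name before '{'" appears when the base class declaration is not visible at that point. The sketch below shows the pattern with explicit includes. BaseHeader, DerivedHeader, and CheckPopulated are hypothetical stand-ins, not the actual tomahawk classes, and the real fix depends on which header declares VcfHeader.

```cpp
#include <cstddef>   // std::size_t
#include <iostream>  // std::cerr, std::endl
#include <string>    // std::string
#include <vector>    // std::vector

// Hypothetical base class; it must be fully declared before it is derived from.
class BaseHeader {
public:
    virtual ~BaseHeader() {}
};

// Hypothetical derived class that uses std::string in its interface.
class DerivedHeader : public BaseHeader {
public:
    void AddSample(const std::string& sample_name) { samples_.push_back(sample_name); }
    std::size_t SampleCount() const { return samples_.size(); }

private:
    std::vector<std::string> samples_;
};

// Reports an empty header on std::cerr and returns false in that case.
bool CheckPopulated(const DerivedHeader& header) {
    if (header.SampleCount() == 0) {
        std::cerr << "No samples in header." << std::endl;
        return false;
    }
    return true;
}
```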

Legacy support for < SSE4.2

Tomahawk will not compile on target architectures without SSE4.2 support, because Cloudflare ZLIB requires the _mm_crc32_u32 intrinsic, which was introduced with SSE4.2.

We are currently considering reverting this implementation to standard ZLIB.
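One common way to keep such code buildable on older targets is to guard the intrinsic behind the compiler's __SSE4_2__ macro and fall back to portable code otherwise. The sketch below is not Cloudflare ZLIB's code and is only an assumption about how a fallback could look. It computes CRC-32C (the Castagnoli polynomial that the SSE4.2 crc32 instruction implements) one byte at a time, with crc32cStep and crc32c as illustrative names.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE4_2__)
#include <nmmintrin.h>  // _mm_crc32_u8, available only with SSE4.2
#endif

// One CRC-32C update step for a single byte.
inline std::uint32_t crc32cStep(std::uint32_t crc, std::uint8_t byte) {
#if defined(__SSE4_2__)
    // Hardware path: one instruction per byte.
    return _mm_crc32_u8(crc, byte);
#else
    // Portable path: bitwise update with the reflected polynomial 0x82F63B78.
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
        const std::uint32_t mask = 0u - (crc & 1u);  // all ones if the low bit is set
        crc = (crc >> 1) ^ (0x82F63B78u & mask);
    }
    return crc;
#endif
}

// CRC-32C over a buffer, with the conventional initial value and final inversion.
std::uint32_t crc32c(const unsigned char* data, std::size_t length) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < length; ++i) {
        crc = crc32cStep(crc, data[i]);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Both paths produce the same checksum, so the choice only affects speed and which machines can build the code.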

how to extract TWO binary for region

I'm trying to extract .two format data in native binary format, but this always yields a text file:

tomahawk view -i in.calc_sorted.two -B -I tig00000855 -a 0 -f 2 >tig00000855.two

How do I output binary TWO format?

Another core dump error on import

Hi,

I'm trying to import a vcf file tmp_subset.vcf.gz and getting the following error:

$ tomahawk import -i out/tmp_subset.vcf -o tmp/test.twk
Program: tomahawk-beta-0.7.1-dirty (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.5.6; htslib 1.13+ds
Contact: Marcus D. R. Klarqvist [email protected]
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT

[2024-05-29 10:33:41,791][LOG] Calling import...
[2024-05-29 10:33:41,792][LOG][READER] Opening out/tmp_subset.vcf...
Segmentation fault (core dumped)

The VCF file format looks OK to me so I began trying to use gdb to diagnose the problem:

$ gdb --args tomahawk import -i out/tmp_subset.vcf -o tmp/test.twk
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from tomahawk...
(gdb) break tomahawk::VcfHeaderInternal::AddContigInfo
Breakpoint 1 at 0x3ef80: file lib/header_internal.cpp, line 17.

Could you help with this?

tomahawk import does not generate a twk file

I've been trying tomahawk with different VCF data sets. For one set, it worked very well and the results looked great. When I tried another VCF for the same region, but with more samples, the tomahawk import step ran without complaint but did not generate a "twk" file. It did not print an error message either.

The command run was (also tried with default options):

tomahawk import -I locus.bcf -o locus_snp -m 0.2 -h 0.001

The output for the good set has these messages:

Program: tomahawk beta-0.6.1
Contact: Marcus D. R. Klarqvist <[email protected]>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2018-11-08 10:31:01,989][LOG] Calling import...
[2018-11-08 10:31:01,997][LOG][VCF] Constructing lookup table for 603 contigs...
[2018-11-08 10:31:01,998][LOG][RLE] Samples: 86 > 15... Skip
[2018-11-08 10:31:01,998][LOG][RLE] Samples: 86 < 4095...
[2018-11-08 10:31:01,998][LOG][RLE] Using 16-bit width...
[2018-11-08 10:31:01,998][LOG][WRITER] Opening: locus_snp.twk...
[2018-11-08 10:31:02,129][LOG][WRITER] Wrote: 2,576 variants to 6 blocks...

The messages for the bigger data set are:

Program: tomahawk beta-0.6.1
Contact: Marcus D. R. Klarqvist <[email protected]>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2018-11-08 10:32:18,162][LOG] Calling import...
[2018-11-08 10:32:18,164][LOG][VCF] Constructing lookup table for 603 contigs...

Apparently, tomahawk didn't import any variants.

The VCF files came from a GATK pipeline. Both VCFs are sliced at the same locus, a 2 Mbp region: the working one has 86 samples and the failing one has 353 samples.

Could you take a look at this?

LTO and gcc version <= 4.9.2

We have permanently removed the Link Time Optimisation flags from the makefiles because a gcc bug triggers an internal compiler error in versions <= 4.9.2. Users seeking to squeeze out some additional performance may want to update their gcc builds and add these compiler flags back.
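Before adding LTO back (GCC's flag for it is -flto; the exact flags the makefiles used are not stated here), it may help to confirm that the compiler is newer than the affected range. A small sketch of that version comparison, using GCC's predefined version macros and the hypothetical names gccAtOrBelow492 and kAvoidLto:

```cpp
// True if the given GCC version is 4.9.2 or older, the range named in this issue.
constexpr bool gccAtOrBelow492(int major, int minor, int patch) {
    return major < 4 ||
           (major == 4 && (minor < 9 || (minor == 9 && patch <= 2)));
}

#if defined(__GNUC__) && !defined(__clang__)
// Evaluated at compile time for the compiler building this file.
constexpr bool kAvoidLto =
    gccAtOrBelow492(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#else
// Not GCC: the issue makes no statement about other compilers.
constexpr bool kAvoidLto = false;
#endif
```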

core dump error on import

Hi,

I'm trying to import a bcf file that was generated by first converting a GATK vcf to bcf with bcftools. I'm getting the following error:

Program:   tomahawk-beta-0.7.1 (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.4.0; htslib 1.9
Contact: Marcus D. R. Klarqvist <[email protected]>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-05-20 16:53:15,426][LOG] Calling import...
[2019-05-20 16:53:15,426][LOG][READER] Opening snp.bcf...
[2019-05-20 16:53:15,433][LOG][VCF] Constructing lookup table for 608 contigs...
[2019-05-20 16:53:15,434][LOG][VCF] Samples: 56...
[2019-05-20 16:53:15,434][LOG][WRITER] Opening snp.twk...
00000000
00001010
tomahawk: lib/core.cpp:117: void tomahawk::twk1_t::calculateHardyWeinberg(): Assertion `ref == 0 || ref == 1 || ref == 4 || ref == 5' failed.
Aborted (core dumped)

The SNPs seem to meet the expectations of the program. I'm not entirely sure what's going wrong here. Please let me know if additional info would be useful.
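The assertion only admits the values 0, 1, 4, and 5. If the 8-character bit strings printed just before the failure are the values being tested (an assumption, since the log does not label them), then 00001010 is decimal 10 and falls outside that set. The sketch below mirrors the assertion's condition so that such strings can be checked by hand. isAcceptedRef and parseBits are illustrative names, and nothing here says what the individual values mean inside tomahawk.

```cpp
#include <bitset>
#include <string>

// Mirrors the condition in the failed assertion: only these four values pass.
bool isAcceptedRef(unsigned long value) {
    return value == 0 || value == 1 || value == 4 || value == 5;
}

// Converts an 8-character bit string such as "00001010" to its integer value.
// std::bitset throws std::invalid_argument for characters other than '0' and '1'.
unsigned long parseBits(const std::string& bits) {
    return std::bitset<8>(bits).to_ulong();
}
```

Checking the records behind any rejected value, for example for multi-allelic sites or missing genotypes, may narrow down which input lines trigger the abort.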

tomahawk fails to import from bcf: Assertion failed: (ref == 0 || ref == 1 || ref == 4 || ref == 5)

First of all, thank you for the amazing software!

I was able to convert all bcf files but one, so I guess the input file is the problem. Nonetheless, the error message is a bit cryptic for me:

$ ../tomahawk/tomahawk import -i BR.bcf -o BR
program:   tomahawk-beta-0.7.1 (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.5.1; htslib 1.14
Contact: Marcus D. R. Klarqvist <[email protected]>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2021-12-28 12:10:21,629][LOG] Calling import...
[2021-12-28 12:10:21,629][LOG][READER] Opening BR.bcf...
[2021-12-28 12:10:21,637][LOG][VCF] Constructing lookup table for 22 contigs...
[2021-12-28 12:10:21,637][LOG][VCF] Samples: 171...
[2021-12-28 12:10:21,637][LOG][WRITER] Opening BR.twk...00000000
00000001
00000000
00000001
00000000
00000001
00000000
00000001
00000000
00000101
00000000
00000001
00000000
00000001
00000101
00000001
00000101
00000000
00000001
00000000
...
Assertion failed: (ref == 0 || ref == 1 || ref == 4 || ref == 5), function calculateHardyWeinberg, file lib/core.cpp, line 117.
Abort trap: 6

I have circumvented this by commenting out lines 116-117 in core.cpp and compiling again, since I know there are no HW equilibrium issues in this population. Still, it would be nice to know what is causing this problem. Should I send my .bcf file to [email protected]?

Thank you

Unphased Math Reference

Hi,

Is there a citation/reference for the math tomahawk does on unphased genotypes to compute LD?

Clumping and pairwise LD for specified SNP lists

I was alerted to your package recently and it looks extremely valuable, congratulations!

I have a couple of feature requests. Apologies if these are already implemented; I didn't see them in the documentation.

  1. Clumping - where SNPs are ordered based on their p-value in GWAS and are iteratively filtered by removing any SNPs in LD with the SNP with the lowest p-value
  2. Creating an LD matrix for a list of SNPs (e.g. rather than a region)
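The clumping procedure in item 1 can be sketched as a greedy pass over SNPs sorted by p-value. This is a minimal illustration of the request, not an existing tomahawk feature: the Snp struct, the clump function, the r2 callback, and the threshold parameter are all hypothetical, and the pairwise LD values are assumed to come from elsewhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record: a SNP identifier and its GWAS p-value.
struct Snp {
    std::string id;
    double p_value;
};

// Greedy clumping: repeatedly keep the remaining SNP with the lowest p-value
// and remove every other remaining SNP whose r2 with it meets the threshold.
// Returns the identifiers of the kept (index) SNPs in order of selection.
std::vector<std::string> clump(
    std::vector<Snp> snps,
    const std::function<double(const std::string&, const std::string&)>& r2,
    double r2_threshold) {
    std::sort(snps.begin(), snps.end(),
              [](const Snp& a, const Snp& b) { return a.p_value < b.p_value; });

    std::vector<bool> removed(snps.size(), false);
    std::vector<std::string> index_snps;
    for (std::size_t i = 0; i < snps.size(); ++i) {
        if (removed[i]) continue;  // already clumped under a stronger SNP
        index_snps.push_back(snps[i].id);
        for (std::size_t j = i + 1; j < snps.size(); ++j) {
            if (!removed[j] && r2(snps[i].id, snps[j].id) >= r2_threshold) {
                removed[j] = true;
            }
        }
    }
    return index_snps;
}
```

The callback keeps the sketch independent of how LD is stored; in practice it would look up precomputed pairwise values rather than recompute them.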

segmentation fault

This project is very exciting! Thanks for sharing the code.

I'm just trying to follow along with your examples, and I ran into a problem.

I wonder if you can provide any advice for debugging?

These steps work OK:

wget -nc http://s3.amazonaws.com/1000genomes/release/20101123/interim_phase1_release/ALL.chr21.phase1.projectConsensus.genotypes.vcf.gz

vcfgz=ALL.chr21.phase1.projectConsensus.genotypes.vcf.gz
bcf=ALL.chr21.phase1.projectConsensus.genotypes.bcf
prefix=${vcfgz%%.vcf.gz}

tabix -p vcf ALL.chr21.phase1.projectConsensus.genotypes.vcf.gz
bcftools convert --output-type b --threads 8 --output ${prefix}.bcf $vcfgz

tomahawk import -i $bcf -o $prefix -n 0.2 -h 0.001
ls -lha ALL.chr21*
-rw-rw-r-- 1 slowikow srlab 328M Feb  2 21:04 ALL.chr21.phase1.projectConsensus.genotypes.bcf
-rw-rw-r-- 1 slowikow srlab  17M Feb  2 21:17 ALL.chr21.phase1.projectConsensus.genotypes.twk
-rw-rw-r-- 1 slowikow srlab  29K Feb  2 21:17 ALL.chr21.phase1.projectConsensus.genotypes.twk.twi
-rw-rw-r-- 1 slowikow srlab 301M May 23  2012 ALL.chr21.phase1.projectConsensus.genotypes.vcf.gz
-rw-rw-r-- 1 slowikow srlab  34K Feb  2 20:59 ALL.chr21.phase1.projectConsensus.genotypes.vcf.gz.tbi

Calculating LD failed:

tomahawk calc -pdi ALL.chr21.phase1.projectConsensus.genotypes.twk -o ALL.chr21.phase1.projectConsensus.genotypes -a 5 -r 0.1 -P 0.1 -c 990 -C 1 -t 28

Program: tomahawk beta-0.2-3-g850e04d5-master
Contact: Marcus D. R. Klarqvist <[email protected]>
Documentation: https://github.com/mklarqvist/Tomahawk
License: MIT
----------
[2018-02-02 21:19:12,375][LOG] Calling calc...
[2018-02-02 21:19:13,563][LOG][TOTEMPOLE] Found: 496 blocks...
[2018-02-02 21:19:13,563][LOG][TOTEMPOLE] Found: 1 contigs and 1,094 samples...
[2018-02-02 21:19:13,563][LOG][TOTEMPOLE] Found: 476,154 variants...
[2018-02-02 21:19:13,564][LOG][RLE] Samples: 1094 > 15... Skip
[2018-02-02 21:19:13,564][LOG][RLE] Samples: 1094 < 4095...
[2018-02-02 21:19:13,564][LOG][RLE] Using 16-bit width...
[2018-02-02 21:19:13,564][LOG][BALANCER] Case is diagonal (chunk 0/990)...
[2018-02-02 21:19:13,564][LOG][BALANCER] Total comparisons: 66 and per thread: 2
Segmentation fault (core dumped)
