bioinfologics / w2rap-contigger


An Illumina PE genome contig assembler that can handle large (17 Gbp), complex (hexaploid) genomes.

Home Page: http://bioinfologics.github.io/the-w2rap-contigger/

License: MIT License

Languages: C++ 97.78%, C 1.96%, CMake 0.25%

w2rap-contigger's People

Contributors

bjclavijo, gonzalogacc, ljyanesm, wookietreiber


w2rap-contigger's Issues

hbv2gfa segmentation fault

Hi,
I've run w2rap-contigger to step 4 and then run hbv2gfa as follows:

export OMP_PROC_BIND=spread
export MALLOC_PER_THREAD=1
source w2rap-a43f5a0;
source gcc-5.2.0;
hbv2gfa -o polecat_k180_out -i polecat_k180.large_K.clean --stats_only 1 -g 2400000

Here’s the stdout:
hbv2gfa from w2rap-contigger
Reading graph and paths...
DONE!
=== Graph stats ===
Canonical graph sequences size: 46916876167429

And here’s the stderr:
/var/spool/PBS/mom_priv/jobs/68505.UVL00000253-P000.SC: line 7: 331372 Segmentation fault hbv2gfa -o polecat_k180_out -i polecat_k180.large_K.clean --stats_only 1 -g 2400000

I've also tried without --stats_only, but I get the same Segmentation fault.

Check step 2 counting when using a large number of batches

Step 2 k-mer counts to compare:

w2rap-contigger -K 200 --threads 64 (no batches):
Tue Sep 25 15:04:08 2018: 3669737625 / 17919060456 kmers with Freq >= 4

w2rap-contigger -p w2rap -o . -t 48 -m 900 -d 60 (60 disk batches):
Sat Mar 10 18:30:02 2018: 3687182131 / 8645611169 kmers with Freq >= 4

Step 1 Speed for w2rap contigger

Hey!

Running a profiler on w2rap-contigger shows that ~98.5% of the time is spent in the function PQVecEncoder::init, so loading reads into memory currently takes a very long time. Can you recommend any way to speed this up?

Thanks

add `make install` target

The usual workflow for installation is:

cmake
make
[sudo] make install

At the moment, w2rap-contigger does not provide an automated way to install the built executables:

$ make install
make: *** No rule to make target `install'.  Stop.

PR incoming, just give me a sec.
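
Until that lands, a minimal manual stand-in for the missing target (a sketch assuming the default /usr/local prefix; the built executables end up in bin/, as the build logs elsewhere on this page show):

cmake . && make
sudo cp bin/* /usr/local/bin/   # hand-rolled "make install": copy the built executables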

w2rap-contigger with 150bp pe library

Hey,
is it possible to use w2rap-contigger with a 2 x 150 bp paired-end library, or is there a minimum read length?
I tried to assemble a smaller genome with a 2 x 150 bp paired-end library, but after step 5 I always get:

Fatal error (pid=27838) at Mi Mai 03 14:17:59 2017:
Illegal value for CP_MAX_QDIFF.

I am not sure whether this problem arises because of the read length or because of an installation problem.

Improve CLI log

Make all log lines include date/time.
Make all long operations report progress, and make heuristics report their analysis stats.
Also include runtime stats for each step (memory and time spent).

Step 3 crashed

Hi team,

first, thanks for your work. My job crashes at step 3 every time. I have tried different k-mer sizes.

Here is my log:

--== Step 3: Repathing to second (large K) graph ==--
/bin/sh: 1: set: Illegal option -o pipefail
Fri Aug 17 16:50:02 2018: beginning repathing 21722531 edges from K=60 to K2=60
Fri Aug 17 16:50:02 2018: constructing places from 133868092 paths
Fri Aug 17 16:50:03 2018: 118287130 / 133868092 reads pathed, 82739270 spanning junctions
Fri Aug 17 16:51:57 2018: sorting 118287130 places
Fri Aug 17 16:54:43 2018: 41782245 unique places
Fri Aug 17 16:54:43 2018: building all
Fri Aug 17 16:58:25 2018: calling LongReadsToPaths
Killed

So first I have this error (illegal option -o pipefail), and after that the run just dies with "Killed".

I checked my system log:
Out of memory: Kill process 10849 (w2rap-contigger) score 964 or sacrifice child

Do you have an idea how to reduce the memory used in step 3?

I will try to normalize my libraries (reducing the number of reads).

Thanks,

Yann
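
On the library-normalization plan above: a quick way to reduce the read count while keeping mates in sync is seqtk with a fixed seed (a sketch; seqtk availability and the 0.7 fraction are assumptions, and the file names are hypothetical):

seqtk sample -s42 reads_1.fastq 0.7 > reads_1.sub.fastq   # same -s seed on both files keeps pairs matched
seqtk sample -s42 reads_2.fastq 0.7 > reads_2.sub.fastq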

step2 crashed

w2rap-contigger -t 60 -m 800 -p w2rap -o w2rap_output -r D2005364A-WR_L01_1.cleaned.fq.gz,D2005364A-WR_L01_2.cleaned.fq.gz

Welcome to w2rap-contigger
WARNING: you are running the code with omp_proc_bind_false, parallel performance may suffer
--== Step 1: Reading input files ==--
Fri Dec 02 11:43:16 2022: finding input files
Fri Dec 02 11:43:16 2022: reading 2 files (which may take a while)

INPUT FILES:
[1a,type=frag,sample=C,lib=1,frac=1] D2005364A-WR_L01_1.cleaned.fq.gz
[1b,type=frag,sample=C,lib=1,frac=1] D2005364A-WR_L01_2.cleaned.fq.gz

Fri Dec 02 13:10:28 2022: found 1 samples
Fri Dec 02 13:10:28 2022: starts = 0
Fri Dec 02 13:10:28 2022: data extraction complete, peak = 371.93 GB
1.45 hours used extracting reads
Reading input files DONE!

--== Step 2: Building first (small K) graph ==--
Fri Dec 02 13:10:28 2022: creating kmers from reads...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
terminate called recursively
Aborted (core dumped)
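
Possibly relevant: step 2's k-mer counting can be spread over disk batches with -d, as the "preoccupied kmers" report further down this page shows ("disk-based kmer counting with 24 batches"), which may lower peak RAM at exactly the point this run aborts. A sketch reusing the command above (the -d 60 value is an assumption to tune):

w2rap-contigger -t 60 -m 800 -d 60 -p w2rap -o w2rap_output -r D2005364A-WR_L01_1.cleaned.fq.gz,D2005364A-WR_L01_2.cleaned.fq.gz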

No error message (34045 Aborted)

Hi, I wanted to run w2rap-contigger, but it stops at step 2.

w2rap-contigger/bin/w2rap-contigger -t 48 -m 10000 -r /tmp/slurm-7878835/R1.fq,/tmp/slurm-7878835/R2.fq -o /tmp/slurm-7878835/w2rap -p w2rap --tmp_dir /tmp/slurm-7878835/temp --min_freq 10

Welcome to w2rap-contigger
--== Step 1: Reading input files ==--
Wed Dec 02 12:21:00 2020: finding input files
Wed Dec 02 12:21:00 2020: reading 2 files (which may take a while)

INPUT FILES:
[1a,type=frag,sample=C,lib=1,frac=1] /tmp/slurm-7878835/R1.fq
[1b,type=frag,sample=C,lib=1,frac=1] /tmp/slurm-7878835/R2.fq

Fri Dec 04 10:10:25 2020: found 1 samples
Fri Dec 04 10:10:25 2020: starts = 0
Fri Dec 04 10:10:25 2020: data extraction complete, peak = 244.52 GB
1.91 days used extracting reads
Reading input files DONE!


--== Step 2: Building first (small K) graph ==--
Fri Dec 04 10:10:25 2020: creating kmers from reads...
terminate called recursively
/var/spool/slurm/slurmd/job7878835/slurm_script: line 28: 34045 Aborted                 w2rap-contigger/bin/w2rap-contigger -t $SLURM_CPUS_PER_TASK -m 10000 -r $TMPDIR/R1.fq,$TMPDIR/R2.fq -o $TMPDIR/w2rap -p w2rap --tmp_dir $TMPDIR/temp --min_freq 10
[ERRO] /tmp/slurm-7878835/w2rap/*.fasta: fastx: open /tmp/slurm-7878835/w2rap/*.fasta: no such file or directory

w2rap-contigger memory issues / advice on running w2rap

Hi @jonwright99 and @bjclavijo

Could you please give me some advice on running the w2rap pipeline?
I'm using the w2rap pipeline to assemble a mammal genome (~2.9 Gbp).
I'm working on a cluster with 512 GB RAM and 32 CPUs.

I have 3 short-read libraries (~800 GB total) with insert sizes between 400 and 500 bp (KAT histograms are below):
library 1
mde_l1_hist.pdf
library 2
mde_l2_hist.pdf
library 3
mde_l3_hist.pdf
All libraries
mde_l1_l2_l3 (1).pdf

All the libraries were quality controlled and trimmed with Trimmomatic.
One of these libraries, mde_l3, was made for 10x sequencing.
Since we obtained short molecule lengths, we decided to use it as short reads here and then later in the re-scaffolding phase with arcs.

Before running the command below (using this docker https://hub.docker.com/r/villegar/w2rap) I ran all the steps mentioned in your tutorial, and all the pipeline steps worked perfectly.
We are using the command:
singularity exec /share/apps/singularity/w2rap.simg w2rap-contigger -t 30 -m 400 -r mde_l1_1_1.fastq,mde_l1_2_1.fastq,mde_l2_1_1.fastq,mde_l2_2_1.fastq,mde_l3_1_1.fastq,mde_l3_2_1.fastq -o mde_contigs_200k --min_freq 20 -p mde_k200 -d 32 --dump_all 1

In step 2 I'm having the same problem as issue #27:
"terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc"

To deal with this issue I removed the 10x library, decreasing the coverage and the memory needed (performing the assembly with libraries 1 and 2, ~50x at the KAT hist peak).
Command:
singularity exec /share/apps/singularity/w2rap.simg w2rap-contigger -t 30 -m 400 -r mde_l1_1_1.fastq,mde_l1_2_1.fastq,mde_l2_1_1.fastq,mde_l2_2_1.fastq -o mde_contigs_200k --min_freq 12 -p mde_k200 -d 32 --dump_all 1
libraries 1 and 2
mde_l1_l2_hist.pdf
However, the program gave me the same error ("terminate called after throwing an instance of 'std::bad_alloc'; what(): std::bad_alloc").

Now I'm considering three other options to solve the problem:

1. Increasing --min_freq: can this be done without losing too much information? What value do you recommend for this situation?
2. Subsampling the datasets (e.g. with KAT or another program) and running the assembly.
3. Removing the polymorphic reads in the first (small) peak. Can I do this with KAT?

What do you think about this situation?

Best Regards
André
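
On option 1: re-plotting the k-mer spectrum makes it easier to see where the error peak ends and how much real content a higher --min_freq would cut. A sketch using KAT, which is already in use for the histograms above (the thread count is arbitrary):

kat hist -t 16 -o mde_l1_l2_hist mde_l1_1_1.fastq mde_l1_2_1.fastq mde_l2_1_1.fastq mde_l2_2_1.fastq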

Undefined reference to `std::__throw_logic_error(char const*)'

Hi,

While trying to install w2rap-contigger on a Red Hat cluster, I'm repeatedly getting an error during the make step, in "Linking CXX executable bin/04_patching". The initial error message is:

[ 92%] Building CXX object CMakeFiles/04_patching.dir/src/paths/long/large/Unsat.cc.o
Linking CXX executable bin/04_patching
CMakeFiles/01_unipaths.dir/src/modules/unipaths_01.cc.o: In function `_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag.isra.115':
/afs/crc.nd.edu/x86_64_linux/g/gcc/5.2.0/build/include/c++/5.2.0/bits/basic_string.tcc:216: undefined reference to `std::__throw_logic_error(char const*)'

This is followed by about half a million other lines of undefined references and similar errors. It is too large to attach in its entirety, but I've included the first 10% in case it's helpful:
w2rap-contigger_compilation_error_output.snippet_1.txt

The cmake step completes without a problem:

cmake -D CMAKE_CXX_COMPILER=gcc -D MALLOC_LIBRARY=/afs/crc.nd.edu/user/r/rlove1/local/bin/jemalloc-4.2.1/lib/libjemalloc.so .
-- The C compiler identification is GNU 4.4.7
-- The CXX compiler identification is GNU 5.2.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /afs/crc.nd.edu/x86_64_linux/g/gcc/5.2.0/build/bin/gcc
-- Check for working CXX compiler: /afs/crc.nd.edu/x86_64_linux/g/gcc/5.2.0/build/bin/gcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /afs/crc.nd.edu/user/r/rlove1/local/bin/w2rap-contigger

From the verbosity of the error messages, it seems like something very basic has gone missing or been delinked, but I'm not sure what it is.

This is with cmake 3.2.2 and gcc 5.2.0.

Thanks,
Becca
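
One thing visible in the configure log above: CMAKE_CXX_COMPILER is set to gcc rather than g++, and the C compiler found is a much older GNU 4.4.7. Linking C++ objects with the gcc driver omits -lstdc++, which produces exactly this kind of flood of undefined std:: references. A sketch of a reconfigure using the paths from the log above (assuming the 5.2.0 build tree provides a g++ alongside its gcc, and clearing the cache first, since cmake remembers the old compiler):

rm -rf CMakeCache.txt CMakeFiles/
cmake -D CMAKE_CXX_COMPILER=/afs/crc.nd.edu/x86_64_linux/g/gcc/5.2.0/build/bin/g++ -D MALLOC_LIBRARY=/afs/crc.nd.edu/user/r/rlove1/local/bin/jemalloc-4.2.1/lib/libjemalloc.so .
make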

Make error: omp_get_proc_bind

On compiling with make -j 4, I get the following errors:

w2rap-contigger/src/modules/w2rap-contigger.cc:164:27: error: ‘omp_get_proc_bind’ was not declared in this scope

w2rap-contigger/src/modules/w2rap-contigger.cc:164:30: error: ‘omp_proc_bind_false’ was not declared in this scope

How can I fix these?
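
These symbols (omp_get_proc_bind, omp_proc_bind_false) were introduced in OpenMP 4.0, so the compiler needs both OpenMP 4.0 support (GCC >= 4.9) and -fopenmp on the command line. A quick toolchain check (a sketch; the scratch path is hypothetical):

cat > /tmp/omp_check.cc <<'EOF'
#include <omp.h>
int main() { return omp_get_proc_bind() == omp_proc_bind_false ? 0 : 1; }
EOF
g++ -fopenmp /tmp/omp_check.cc -o /tmp/omp_check && echo "OpenMP 4.0 toolchain OK"

If that fails, point cmake at a newer g++ (as in other reports on this page) and rebuild.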

Some trouble with CP_MAX_QDIFF value !

Hello,
I've tried to run w2rap-contigger on my Illumina data set. The first steps (01_unipaths, 02_qgraph and 03_clean) are OK, but in the patching process I run into trouble with the CP_MAX_QDIFF value and get this message in the output: "Illegal value for CP_MAX_QDIFF". This variable is defined in the patching script and can't be modified.
Thanks in advance for your help.

Arnaud

Terminate: quality score larger than 63

Hi, recently the program stopped with the following error. Should I try changing the read quality scores?
Also, how should I set the K value if the read length is 100 bp?
Thanks!
--== Step 1: Reading input files ==--
Thu Apr 20 22:48:23 2017: finding input files
Thu Apr 20 22:48:23 2017: reading 2 files (which may take a while)

Your input reads are funny. I found a quality score of 66.
The maximum value that I allow is 63.
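
A score of 66 suggests phred+64 (older Illumina) input rather than broken reads. If that is the case, seqtk can rebase the qualities to Sanger phred+33 (a sketch; assumes seqtk is available and the input really is phred+64, which is worth confirming first):

seqtk seq -Q64 -V reads_1.fq > reads_1.phred33.fq   # -Q64: input offset; -V: shift to phred+33
seqtk seq -Q64 -V reads_2.fq > reads_2.phred33.fq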

incomplete record

Hi all,
I am trying to run w2rap-contigger, with the following command line:

w2rap-contigger -t 36 -m 800 --from_step 1 --to_step 7 -r file_1.fastq,file_2.fastq -o Assembly/w2rap/EC -p Assembly_K200 -s 200 --dump_perf 1

but I get the following error at the start of the assembly:

"See incomplete record in faile1.fastq ot file2.fastq"

I have checked both files and there are no empty lines or incomplete records in them. Both files have the same length and the same names for paired reads. What else could cause the getline to fail?

I look forward to hearing from you.

Best regards,

Juan Montenegro
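
A few generic fastq integrity checks that can catch problems plain eyeballing misses (a sketch with standard tools; the file names follow the command above):

wc -l file_1.fastq file_2.fastq                 # each count should be a multiple of 4
awk 'NR % 4 == 1 && !/^@/ { print "bad header at line " NR; exit 1 }' file_1.fastq
grep -c $'\r' file_1.fastq                      # CRLF line endings can also trip getline-based parsers
tail -c 64 file_1.fastq | od -c | tail          # confirm the file ends with a complete record and a newline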

denoising Illumina reads

Hi,
Do you recommend denoising Illumina reads, e.g. with any of the khmer recipes or any other tool?

Thank you in advance,

Michal

Step 3 crash

I am getting the following error at step 3:

--== Step 3: Repathing to second (large K) graph ==--
Tue Mar 02 13:15:06 2021: beginning repathing 31790560 edges from K=60 to K2=64
Tue Mar 02 13:15:06 2021: constructing places from 1308014986 paths
Tue Mar 02 13:15:12 2021: 1265069099 / 1308014986 reads pathed, 173319806 spanning junctions
Tue Mar 02 13:20:49 2021: sorting 1264197670 places
Tue Mar 02 14:03:45 2021: 77032864 unique places
Tue Mar 02 14:03:45 2021: building all
Tue Mar 02 14:08:44 2021: calling LongReadsToPaths
Fatal error (pid=16780) at Tue Mar 02 14:08:44 2021:
Illegal value 64 for K in BigK dispatcher.

What should I do?
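
The dispatcher appears to accept only a fixed set of compiled-in K values, and the message above is its rejection of 64. K values that appear in runs elsewhere on this page include 72, 180 and 200; treating any of them as valid here is an assumption. A sketch of a retry (file names and resources are placeholders):

w2rap-contigger -t 30 -m 1000 -r reads_1.fq,reads_2.fq -o out_dir -p prefix -K 72 --from_step 3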

Speeding up loading reads into memory

Hey! I have a dataset with about 1 TB of reads in fastq format. It takes about a week to load into memory in the first step. Is there any way to speed this process up?
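
One way to pay the fastq-parsing cost only once: dump step 1's output and restart later steps from the binary form (other reports on this page show resumed runs "Loading reads in fastb/qualp format"). A sketch, assuming the step-range flags behave as in those runs:

w2rap-contigger -t 48 -m 800 -r r1.fq,r2.fq -o out_dir -p prefix --from_step 1 --to_step 1 --dump_all 1
w2rap-contigger -t 48 -m 800 -r r1.fq,r2.fq -o out_dir -p prefix --from_step 2 --dump_all 1   # reloads the dump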

Illegal involution value

Hi!

At the end of Step 7, I encounter this error:

============GFA DUMP STARTING============
Graph has 23455016 edges
Dumping edges
Dumping connections
============GFA DUMP ENDED============

Forward Node Edge Classification:
nothing_fw: 222492 ( 149531677 kmers )
line_fw: 44 ( 37056 kmers )
join_fw: 7728870 ( 5211720522 kmers )
split_fw: 15460764 ( 5856529166 kmers )
join_split_fw: 42846 ( 21464707 kmers )
--== Step 7: PE-Scaffolding ==--
Sun Dec 24 22:18:04 2017: deleting 0 gaps and adding 13 gaps to force symmetry
Sun Dec 24 22:20:16 2017: done making gaps, time used = 13.7 minutes
--== PE-Scaffolding DONE!

e = 410162, inv[e] = -1, hb.EdgeObjectCount( ) = 23394717
Illegal involution value.
Abort.
Aborted (core dumped)

Why might this be happening?

Read lengths and large K

Dear developers,

I have ~40x coverage of 2x150 bp Illumina reads produced from 10x Chromium libraries for an organism with a complex genome (we also have long reads), and I would like to try w2rap-contigger. However, I don't really understand how to select a value for the large K parameter.

Is this value bounded by the read length, i.e. should it be below 150 (or 300)? In other words, what are the important factors that govern the value specified for K?

Segmentation fault in Step 5

Hi,
Using a relatively large data set (~3B reads), we are hitting a problem in step 5 with the "step7_fix" branch (usually this branch is reliable and still gives the best results compared with the latest master branch).
The machine in use had 2 TB of RAM, but w2rap-contigger peaked at only around 1.2 TB before crashing with "Segmentation fault".
These are the last couple of lines before the crash:

...
Fri Aug 07 09:05:43 2020: 2132052 blobs processed, paths found for 2127214
Fri Aug 07 09:05:43 2020: 2.74 hours spent in local assemblies.
Fri Aug 07 09:05:43 2020: patching
Fri Aug 07 09:05:52 2020: 8.5 seconds used patching
Fri Aug 07 09:09:20 2020: building hb2
1.58 minutes used in new stuff 1 test
memory in use now = 554977333248
Fri Aug 07 09:42:06 2020: back from buildBigKHBVFromReads
32.8 minutes used in new stuff 2 test
peak mem usage = 626.02 GB
2.17 minutes used in new stuff 5
Fri Aug 07 09:55:13 2020: finding interesting reads
Fri Aug 07 09:56:50 2020: building dictionary
Fri Aug 07 10:02:39 2020: reducing
We need 1 passes.
Expect 2114087 keys per batch.
Provide 5285216 keys per batch.
There were 173 buffer overflows.
Fri Aug 07 12:42:39 2020: kmerizing
We need 1 passes.
Expect 3328427 keys per batch.
Provide 8321066 keys per batch.
Fri Aug 07 14:30:53 2020: cleaning
Fri Aug 07 14:34:53 2020: finding uniquely aligning edges
/mnt/ssd1/w2rap.sh: line 15:  1487 Segmentation fault      (core dumped)

We tried lowering the input to 2.3B reads with a similar result; only at around 1.6B reads did we manage to pass step 5.
In case it helps, I've saved the core dumps.
We would really appreciate some help finding a solution to this problem and/or guidance on how to debug it further.
Thanks,

Resume checkpoints in w2rap-contigger?

Hi,

I'm running w2rap-contigger on an HPC system with limited walltime per job. Is it possible to resume from a checkpoint during the run, so that I can split the work into several smaller jobs?

Thanks.
mht
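
A sketch of such a split, assuming --from_step/--to_step checkpointing works as in the step-wise runs quoted elsewhere on this page (e.g. "--from_step 5 --to_step 5" with --dump_all 1); file names and resources are placeholders:

w2rap-contigger -t 24 -m 200 -r r1.fq,r2.fq -o out -p asm --from_step 1 --to_step 2 --dump_all 1
w2rap-contigger -t 24 -m 200 -r r1.fq,r2.fq -o out -p asm --from_step 3 --to_step 4 --dump_all 1
w2rap-contigger -t 24 -m 200 -r r1.fq,r2.fq -o out -p asm --from_step 5 --to_step 7 --dump_all 1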

More details on w2rap-contigger step 5 needed - run with 100bp PE short reads (Illumina)

Hi, dear develop team and fellow users,

I am trying to use w2rap-contigger to assemble a mammalian genome (~2.5 Gbp) using 70x coverage paired-end 100 bp short reads.
I ran this on a PBS cluster node with 40 cores and 25 GB/core. I have completed the first 4 steps (run step by step). Step 5 stopped when it ran out of its 48-hour walltime. The following are the command and output from the program:

w2rap-contigger -t 30 -m 1000 -r ../trimmed_reads/kimba_fastptrim_1.fq,../trimmed_reads/kimba_fastptrim_2.fq -o contig_dir -p Kimba --min_freq 10
-d 38 -K 72 --from_step 5 --to_step 5 --dump_all 1

Welcome to w2rap-contigger
Loading reads in fastb/qualp format...
DONE!
Reading large_K clean graph and paths...
DONE!
--== Step 5: Assembling gaps ==--
Mon Aug 31 15:12:52 2020: inverting paths
Mon Aug 31 15:39:34 2020: Finding unsatisfied path clusters
Mon Aug 31 17:03:23 2020: Merging 117884534 clusters
Wed Sep 02 14:05:41 2020: 6732306 non-inverted clusters

First of all, I'd like to know whether the program is limited to 250 bp PE reads. Has anyone run it with shorter reads successfully?
Secondly, could anyone kindly tell me how many more tasks are involved in step 5, or what kind of walltime I should set?
Third, I checked the processes running on the node: w2rap-contigger seems to be running on one thread only, with CPU at most 100%. I wonder whether I should use fewer cores with a longer walltime. Here is the info on processes:

Tasks: 1117 total, 3 running, 1114 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.4%us, 0.3%sy, 0.0%ni, 95.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1058717736k total, 644741916k used, 413975820k free, 1819956k buffers
Swap: 8388604k total, 24684k used, 8363920k free, 38329132k cached

Any help is appreciated, Thanks in advance!

Lan

Fails to build with clang on Mac OS

GCC 6.1 on Mac OS works fine. Here's the error from clang:

In file included from /tmp/w2rap-contigger-20160703-12559-bhmpw2/src/paths/MakeAlignsPathsParallelX.cc:12:
/tmp/w2rap-contigger-20160703-12559-bhmpw2/src/feudal/BinaryStream.h:176:63: error: no member named 'c_str' in 'std::__1::basic_istream<char>'
In file included from /tmp/w2rap-contigger-20160703-12559-bhmpw2/src/kmers/kmer_parcels/KmerParcelsBuilder.cc:10:
…
/tmp/w2rap-contigger-20160703-12559-bhmpw2/src/STLExtensions.h:158:10: fatal error: 'ext/functional' file not found
(a parallel job reports the same 'ext/functional' error via src/Vec.h)

cannot find -llib

Hi,
I followed the instructions on the github page. I also have jemalloc installed. When I run:
cmake -D CMAKE_CXX_COMPILER=g++ -D MALLOC_LIBRARY=/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/jemalloc-4.2.1/lib
It finishes without an error.
When I run "make -j 4" it get to about 90% then starts printing the error:
/usr/bin/ld: cannot find -llib

cmake: 3.4.1
gcc: 4.8.2
operating system: gnu/linux

It's probably something on my end, but before I bother IT I thought I would check here, and also in case someone has a similar issue in the future.
thank you!
-scott
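
Likely relevant: the cmake line above passes the jemalloc lib directory to MALLOC_LIBRARY, while the working invocation in another report on this page passes the full path to libjemalloc.so; a directory there can end up on the link line as a bare -llib. A sketch of the corrected call (same path, plus the library file):

cmake -D CMAKE_CXX_COMPILER=g++ -D MALLOC_LIBRARY=/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/jemalloc-4.2.1/lib/libjemalloc.so
make -j 4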

Having trouble with preoccupied kmers

Dear Bernardo and Gonzalo,

I am trying to use w2rap-contigger as an alternative to DISCOVAR de novo for a 2x150 bp library (mean fragment size ~350 bp, stdev 77 bp). The library was sequenced to ~60x depth for a 1.2 Gbp genome.

I am trying to run the program on a 255 GB node with 24 CPUs, and I have assigned 24 batches (before that I tried 16 and it failed, presumably because of memory) and K=200 (K=160 gave an identical result). This is the command:

w2rap-contigger -t 24 -r reads/pe400.1.fastq.gz,reads/pe400.2.fastq.gz -o test_K200 -p assembly_w2rap -K 200 -d 24 --tmp_dir $TMPDIR

After 7 hours, and with a registered memory peak of 195 GB, I get this error at step 2, "Building first (small K) graph":

--== Step 2: Building first (small K) graph ==--
Thu Feb 14 07:46:42 2019: creating kmers from reads...
Thu Feb 14 07:46:42 2019: disk-based kmer counting with 24 batches
Thu Feb 14 07:50:13 2019: batch 0 done and dumped with 1186362054 kmers
Thu Feb 14 07:52:46 2019: batch 1 done and dumped with 1161890405 kmers
Thu Feb 14 07:56:15 2019: batch 2 done and dumped with 1155712129 kmers
Thu Feb 14 07:59:48 2019: batch 3 done and dumped with 1111309960 kmers
Thu Feb 14 08:03:52 2019: batch 4 done and dumped with 1197651683 kmers
Thu Feb 14 08:06:36 2019: batch 5 done and dumped with 1123765496 kmers
Thu Feb 14 08:10:08 2019: batch 6 done and dumped with 1118262858 kmers
Thu Feb 14 08:12:45 2019: batch 7 done and dumped with 1116155144 kmers
Thu Feb 14 08:16:54 2019: batch 8 done and dumped with 1121312844 kmers
Thu Feb 14 08:21:08 2019: batch 9 done and dumped with 1168137736 kmers
Thu Feb 14 08:24:16 2019: batch 10 done and dumped with 1139106694 kmers
Thu Feb 14 08:27:49 2019: batch 11 done and dumped with 1104489449 kmers
Thu Feb 14 08:31:23 2019: batch 12 done and dumped with 1108144080 kmers
Thu Feb 14 08:34:04 2019: batch 13 done and dumped with 1129317652 kmers
Thu Feb 14 08:37:24 2019: batch 14 done and dumped with 1165429250 kmers
Thu Feb 14 08:39:31 2019: batch 15 done and dumped with 1186222570 kmers
Thu Feb 14 08:41:18 2019: batch 16 done and dumped with 1234548485 kmers
Thu Feb 14 08:43:01 2019: batch 17 done and dumped with 1159576042 kmers
Thu Feb 14 08:44:44 2019: batch 18 done and dumped with 1128593811 kmers
Thu Feb 14 08:46:27 2019: batch 19 done and dumped with 1134586648 kmers
Thu Feb 14 08:48:09 2019: batch 20 done and dumped with 1091710266 kmers
Thu Feb 14 08:49:54 2019: batch 21 done and dumped with 1137663691 kmers
Thu Feb 14 08:51:36 2019: batch 22 done and dumped with 1139345989 kmers
Thu Feb 14 08:53:18 2019: batch 23 done and dumped with 1139502133 kmers
Thu Feb 14 08:53:18 2019: merging from disk
Thu Feb 14 09:48:30 2019: 1264015212 / 6062664922 kmers with Freq >= 4
Thu Feb 14 09:48:32 2019: updating adjacencies
Thu Feb 14 09:49:26 2019: dict finished
Thu Feb 14 09:49:26 2019: finding edges (unique paths)
1553999:1 0x7faf1e44ed80 Already occupied as 1542972:16
1553999:2 0x7fae167efac0 Already occupied as 1542972:17
1553999:3 0x7fabe5aecb00 Already occupied as 1542972:18
1553999:4 0x7faf726c9338 Already occupied as 1542972:19
1553999:5 0x7fad4e2b5640 Already occupied as 1542972:20

Fatal error (pid=27986) at Thu Feb 14 09:49:45 2019:
Having trouble with preoccupied kmers.

Do you know what could be the reason for this error?

Please let me know how I can get w2rap-contigger working.

Thanks in advance,
Fernando

Not enough memory

Hello there,

Thanks for providing this tool. Is there any way to estimate how much memory the assembly will need for input files of a particular size? I'm playing with two paired-end samples that were generated on an X10 machine. Each fastq.gz file is less than 40 GB; unzipped, each is about 160 GB. So far, I have tried a fat node with 512 GB of memory, and it crashed every time at the second step.

Performing re-exec to adjust stack size.

Tue May 02 07:52:25 2017 run on cp0302, pid=8127 [Apr 13 2017 11:04:02 R52488 ]
DiscovarDeNovo READS="sample:M008 ::
HGTGYCCXX_8_160403_FR07921224_Other__R_151123_JEFWAL_M008_R{1,2
}.fastq" OUT_DIR=Discovar_Denovo NUM_THREADS=48
MAX_MEM_GB=500

SYSTEM INFO

  • OS: Linux :: 2.6.32-642.6.2.el6.x86_64 :: #1 SMP Wed Oct 26 06:52:09 UTC 2016
  • node name: cp0302
  • hardware type: x86_64
  • cache size: 512 KB
  • cpu MHz: 2200.000
  • cpu model name: AMD Opteron(tm) Processor 6174
  • physical memory: 504.75 GB

Omitting memory check. If you run into problems with memory,
you might try rerunning with MEMORY_CHECK=True.

Tue May 02 07:52:25 2017: finding input files
Tue May 02 07:52:25 2017: reading 2 files (which may take a while)

INPUT FILES:
[1a,type=frag,sample=M008,lib=1,frac=1] M008_R1.fastq
[1b,type=frag,sample=M008,lib=1,frac=1] M008_R2.fastq

Tue May 02 12:38:11 2017: found 1 samples
Tue May 02 12:38:11 2017: starts = 0
Tue May 02 13:37:30 2017: using 964,997,086 reads
Tue May 02 13:37:31 2017: data extraction complete, peak mem = 375.88 GB
5.75 hours used extracting reads
Tue May 02 13:37:46 2017: see total physical memory of 541,975,564,288 bytes
Tue May 02 13:37:46 2017: see user-imposed limit on memory of 536,870,912,000 bytes
Tue May 02 13:37:46 2017: 3.74 bytes per read base, assuming max memory available
We need 46 passes.
Expect 1343834 keys per batch.
Provide 1517886 keys per batch.
There were 21 buffer overflows.

Fatal error (pid=8127) at Tue May 02 18:25:36 2017:
Insufficient memory.

Tue May 02 18:25:36 2017. Abort. Stopping.

Generating a backtrace...

Dump of stack:

  1. CRD::exit(int), in Exit.cc:30
  2. run, in MapReduceEngine.h:408
  3. (...), in BuildReadQGraph.cc:179
  4. buildReadQGraph(...), in BuildReadQGraph.cc:1311
  5. GapToyCore(int, char**), in GapToyCore.cc:584
  6. main, in DiscovarDeNovo.cc:43

I haven't tried the trimmed files so far, but I guess it won't work with my present settings either.
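
For the sizing question, the log above gives enough for a back-of-envelope check (assuming ~151 bp X10 reads, which the post does not state):

echo "964997086 * 151 * 3.74" | bc   # reads x read length x the log's 3.74 bytes/base
# ≈ 5.45e11 bytes ≈ 545 GB, just over the user-imposed 536,870,912,000-byte cap, hence the abort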

Another question: is there a way to combine two samples? The only way I've thought of so far is concatenating the fastq files together, but that could create some issues with the library characteristics, right?

Thanks a lot for your help

JC
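
On combining samples: other runs on this page pass several libraries to a single run as a comma-separated -r list instead of concatenating fastqs; whether w2rap-contigger also keeps their library characteristics separate is an assumption to verify. A sketch with hypothetical file names:

w2rap-contigger -t 48 -m 500 -r sampleA_R1.fastq,sampleA_R2.fastq,sampleB_R1.fastq,sampleB_R2.fastq -o out_dir -p combined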

Installation error

Hi,
When installing the w2rap-contigger I get the following error:

In file included from /mnt/HD2/seq_an/w2rap-contigger/src/kmers/KmerRecord.h:15,
                 from /mnt/HD2/seq_an/w2rap-contigger/src/kmers/KmerRecord.cc:11:
/mnt/HD2/seq_an/w2rap-contigger/src/kmers/KmerShape.h: In static member function ‘static KmerShapeId kmer_shape_zebra<K>::getId()’:
/mnt/HD2/seq_an/w2rap-contigger/src/kmers/KmerShape.h:566:49: error: could not convert ‘kmer_shape_zebra<K>::getStringId()’ from ‘String’ {aka ‘FeudalString<char>’} to ‘KmerShapeId’
  566 |  static KmerShapeId getId() { return getStringId(); }
      |                                      ~~~~~~~~~~~^~
      |                                                 |
      |                                                 String {aka FeudalString<char>}
[ 33%] Building CXX object CMakeFiles/hb_base_libs.dir/src/math/HoInterval.cc.o
make[2]: *** [CMakeFiles/hb_base_libs.dir/src/kmers/KmerRecord.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/base_libs.dir/all] Error 2
make[1]: *** [CMakeFiles/hb_base_libs.dir/all] Error 2
make: *** [all] Error 2

The compiler I used:

gcc --version
gcc (GCC) 10.1.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
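
For what it's worth, the gcc 7.2.0 report at the bottom of this page notes that a GCC from the 6 series built the code cleanly, so pinning cmake to an older toolchain may be worth a try (a sketch; the versioned compiler paths are hypothetical):

cmake -D CMAKE_C_COMPILER=/usr/bin/gcc-6 -D CMAKE_CXX_COMPILER=/usr/bin/g++-6 .
make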

help with compiling on Mac OS

Hi,
I would greatly appreciate any suggestions on how I might be able to install the program on a Mac OS. I started by reviewing the very brief installation steps. Before cloning this repo, I first created a virtual environment using Conda as follows:

conda create -n w2rap_env python=2.7
conda activate w2rap_env 
conda install -c conda-forge cmake jemalloc llvm libgcc openmp

I then cloned the repo. Within the repo, I tried following your instructions, but received a series of error messages that largely amount to this:

Scanning dependencies of target specific_w2rap-contigger
Scanning dependencies of target hb_base_libs
Scanning dependencies of target base_libs
[  0%] Building CXX object CMakeFiles/specific_w2rap-contigger.dir/src/BasevectorTools.cc.o
[  0%] Building CXX object CMakeFiles/specific_w2rap-contigger.dir/src/CompressedSequence.cc.o
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
make[2]: *** [CMakeFiles/specific_w2rap-contigger.dir/src/CompressedSequence.cc.o] Error 1

As I understand it, my problem is not with your program at all, but with my inability to make Mac OS use a compiler other than clang. I believe that even though I'm following your instruction to set the g++ flag (-D CMAKE_CXX_COMPILER=g++), the cmake command on Mac OS is ignoring it:

CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- The C compiler identification is AppleClang 12.0.0.12000032
-- The CXX compiler identification is AppleClang 12.0.0.12000032
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting build type to 'Release' as none was specified.
-- Found ZLIB: /Library/Developer/CommandLineTools/SDKs/MacOSX11.1.sdk/usr/lib/libz.tbd (found version "1.2.11")
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/devonorourke/gitrepos/w2rap-contigger

Unfortunately, even after spending a fair amount of time trying to work out how to make my machine use gcc rather than clang when executing a cmake command, I have only managed to fail spectacularly at persuading it not to use clang.

This produces the same (clang) error as above, indicating to me that it's still compiling with clang, and not gcc:

cmake -DCMAKE_C_COMPILER=gcc -D CMAKE_CXX_COMPILER=g++

This also similarly failed:

cmake -D CMAKE_C_COMPILER=/usr/bin/gcc -D CMAKE_CXX_COMPILER=/usr/bin/g++ -DCMAKE_CXX_COMPILER=/usr/bin/g++  -DCMAKE_C_COMPILER=/usr/bin/gcc .

And this failed equally well:

CC=/usr/bin/gcc CXX=/usr/bin/g++ cmake -D CMAKE_C_COMPILER=/usr/bin/gcc -D CMAKE_CXX_COMPILER=/usr/bin/g++ -DCMAKE_CXX_COMPILER=/usr/bin/g++  -DCMAKE_C_COMPILER=/usr/bin/gcc .

I even tried creating a file to set aliases for these flags, similar to another post I found, but to no avail.

An old post suggested that the clang compiler is not going to work, but a more recent post/commit suggested that gcc compilation was resolved for Mac OS. If you could point to any further details that illustrate how to install the software on a Mac OS I'd greatly appreciate it.

Thanks
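
A likely wrinkle here: on Mac OS, /usr/bin/gcc and /usr/bin/g++ are Clang shims, so pointing cmake at them changes nothing, and cmake also caches the compiler from the first configure. A sketch assuming a Homebrew GCC (the versioned name varies; check with ls /usr/local/bin/g++-*):

rm -rf CMakeCache.txt CMakeFiles/    # drop the cached AppleClang choice
cmake -D CMAKE_C_COMPILER=/usr/local/bin/gcc-11 -D CMAKE_CXX_COMPILER=/usr/local/bin/g++-11 .
make -j 4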

fails to compile with gcc 7.2.0

tested compilers

The compiler I used (current Arch Linux):

$ gcc --version
gcc (GCC) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I suppose other GCC versions, i.e. 7+, are affected as well, because GCC changes default compiler flags between major releases. On another system, I compiled using a GCC from the 6 series and there were no problems:

$ gcc --version
gcc (GCC) 6.2.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

error messages

The error is this one:

$ make
Scanning dependencies of target hb_base_libs
[  0%] Building CXX object CMakeFiles/hb_base_libs.dir/src/paths/long/VariantCallTools.cc.o
/home/wookietreiber/src/idiv/w2rap-contigger/src/paths/long/VariantCallTools.cc: In member function ‘void EdgesOnRef::FilterAndModifyEdits(vec<triple<int, int, FeudalString<char> > >&, vec<std::pair<FeudalString<char>, FeudalString<char> > >&)’:
/home/wookietreiber/src/idiv/w2rap-contigger/src/paths/long/VariantCallTools.cc:1872:47: error: call of overloaded ‘abs(FeudalString<char>::size_type)’ is ambiguous
                     - change[j-1].first.size()) < MinClumpSep) {
                                               ^
In file included from /usr/include/c++/7.2.0/cstdlib:75:0,
                 from /usr/include/c++/7.2.0/ext/string_conversions.h:41,
                 from /usr/include/c++/7.2.0/bits/basic_string.h:6159,
                 from /usr/include/c++/7.2.0/string:52,
                 from /usr/include/c++/7.2.0/bits/locale_classes.h:40,
                 from /usr/include/c++/7.2.0/bits/ios_base.h:41,
                 from /usr/include/c++/7.2.0/iomanip:40,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/system/System.h:13,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/CoreTools.h:17,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/paths/long/VariantCallTools.h:11,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/paths/long/VariantCallTools.cc:12:
/usr/include/stdlib.h:722:12: note: candidate: int abs(int)
 extern int abs (int __x) __THROW __attribute__ ((__const__)) __wur;
            ^~~
In file included from /usr/include/c++/7.2.0/cstdlib:77:0,
                 from /usr/include/c++/7.2.0/ext/string_conversions.h:41,
                 from /usr/include/c++/7.2.0/bits/basic_string.h:6159,
                 from /usr/include/c++/7.2.0/string:52,
                 from /usr/include/c++/7.2.0/bits/locale_classes.h:40,
                 from /usr/include/c++/7.2.0/bits/ios_base.h:41,
                 from /usr/include/c++/7.2.0/iomanip:40,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/system/System.h:13,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/CoreTools.h:17,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/paths/long/VariantCallTools.h:11,
                 from /home/wookietreiber/src/idiv/w2rap-contigger/src/paths/long/VariantCallTools.cc:12:
/usr/include/c++/7.2.0/bits/std_abs.h:56:3: note: candidate: long int std::abs(long int)
   abs(long __i) { return __builtin_labs(__i); }
   ^~~
/usr/include/c++/7.2.0/bits/std_abs.h:61:3: note: candidate: long long int std::abs(long long int)
   abs(long long __x) { return __builtin_llabs (__x); }
   ^~~
/usr/include/c++/7.2.0/bits/std_abs.h:70:3: note: candidate: constexpr double std::abs(double)
   abs(double __x)
   ^~~
/usr/include/c++/7.2.0/bits/std_abs.h:74:3: note: candidate: constexpr float std::abs(float)
   abs(float __x)
   ^~~
/usr/include/c++/7.2.0/bits/std_abs.h:78:3: note: candidate: constexpr long double std::abs(long double)
   abs(long double __x)
   ^~~
make[2]: *** [CMakeFiles/hb_base_libs.dir/build.make:3111: CMakeFiles/hb_base_libs.dir/src/paths/long/VariantCallTools.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:68: CMakeFiles/hb_base_libs.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

analysis / probable fix

As far as I can tell, the ambiguity can be resolved, across all compiler versions, by explicitly casting the expression inside the abs call to the correct type, e.g.:

@@ -1868,8 +1868,8 @@ void EdgesOnRef::FilterAndModifyEdits( vec<triple<int,int,String>>& edits,
         bool i_is_indel = (change[i].first.size() != change[i].second.size());
         if (i_is_indel) inserted_base += change[i].second.size()-1;
         size_t j = i + 1;
-        while (j < edits.size() && abs(edits[j].second - edits[j-1].second 
-                    - change[j-1].first.size()) < MinClumpSep) {
+        while (j < edits.size() && abs((long double)(edits[j].second - edits[j-1].second 
+                    - change[j-1].first.size())) < MinClumpSep) {
             nmatch += edits[j].second - edits[j-1].second - change[j-1].first.size();
             bool j_is_indel = (change[j].first.size() != change[j].second.size());
             if (j_is_indel) 


Note: I used long double here just to get it to compile and to test #30; I don't know if that's the correct type. I didn't want to dig into the codebase to figure it out, but you surely can ;)
