
Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly

License: GNU General Public License v3.0


wtdbg2's Introduction

Getting Started

git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make
# Quick start with wtdbg2.pl
./wtdbg2.pl -t 16 -x rs -g 4.6m -o dbg reads.fa.gz

# Step-by-step command lines
# assemble long reads
./wtdbg2 -x rs -g 4.6m -i reads.fa.gz -t 16 -fo dbg

# derive consensus
./wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.raw.fa

# polish consensus, not necessary if you want to polish the assemblies using other tools
minimap2 -t16 -ax map-pb -r2k dbg.raw.fa reads.fa.gz | samtools sort -@4 >dbg.bam
samtools view -F0x900 dbg.bam | ./wtpoa-cns -t 16 -d dbg.raw.fa -i - -fo dbg.cns.fa

# Additional polishing using short reads
bwa index dbg.cns.fa
bwa mem -t 16 dbg.cns.fa sr.1.fa sr.2.fa | samtools sort -O SAM | ./wtpoa-cns -t 16 -x sam-sr -d dbg.cns.fa -i - -fo dbg.srp.fa

Introduction

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output. Wtdbg2 is able to assemble the human and even the 32Gb Axolotl genome at a speed tens of times faster than CANU and FALCON while producing contigs of comparable base accuracy.

During assembly, wtdbg2 chops reads into 1024bp segments, merges similar segments into a vertex and connects vertices based on the segment adjacency on reads. The resulting graph is called a fuzzy Bruijn graph (FBG). It is akin to a de Bruijn graph, but permits mismatches/gaps and keeps read paths when collapsing k-mers. The use of FBG distinguishes wtdbg2 from the majority of long-read assemblers.
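As a toy illustration of the chop-and-merge idea (this is not wtdbg2's code — the real FBG also tolerates mismatches and gaps when merging, and uses 1024bp segments rather than the short demo width here), one can chop reads into fixed-width segments and assign identical segments the same vertex id, so each read becomes a path of vertex ids:

```c
#include <assert.h>
#include <string.h>

#define SEG_LEN 8   /* wtdbg2 uses 1024 bp; shortened for the demo */

/* Return the vertex id for `seg`, adding it to the table if unseen. */
static int vertex_id(char table[][SEG_LEN + 1], int *n, const char *seg)
{
    for (int i = 0; i < *n; i++)
        if (strcmp(table[i], seg) == 0) return i;
    strcpy(table[*n], seg);
    return (*n)++;
}

/* Chop `read` into SEG_LEN segments, store the vertex path in `path`,
 * and return the number of segments (the short tail remainder is dropped). */
static int read_to_path(const char *read, char table[][SEG_LEN + 1],
                        int *n_vertices, int *path, int max_segs)
{
    int len = (int)strlen(read), n_segs = 0;
    for (int off = 0; off + SEG_LEN <= len && n_segs < max_segs; off += SEG_LEN) {
        char seg[SEG_LEN + 1];
        memcpy(seg, read + off, SEG_LEN);
        seg[SEG_LEN] = '\0';
        path[n_segs++] = vertex_id(table, n_vertices, seg);
    }
    return n_segs;
}
```

Two reads sharing a segment then pass through the same vertex, which is what lets the graph record their overlap.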

Installation

Wtdbg2 only works on 64-bit Linux. To compile, please type make in the source code directory. You can then copy wtdbg2 and wtpoa-cns to your PATH.

Wtdbg2 also comes with an approximate read mapper kbm, a faster but less accurate consensus tool wtdbg-cns, and many auxiliary scripts in the scripts directory.

Usage

Wtdbg2 has two key components: an assembler wtdbg2 and a consenser wtpoa-cns. Executable wtdbg2 assembles raw reads and generates the contig layout and edge sequences in a file "prefix.ctg.lay.gz". Executable wtpoa-cns takes this file as input and produces the final consensus in FASTA. A typical workflow looks like this:

./wtdbg2 -x rs -g 4.6m -t 16 -i reads.fa.gz -fo prefix
./wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa

where -g is the estimated genome size and -x specifies the sequencing technology, which could take value "rs" for PacBio RSII, "sq" for PacBio Sequel, "ccs" for PacBio CCS reads and "ont" for Oxford Nanopore. This option sets multiple parameters and should be applied before other parameters. When you are unable to get a good assembly, you may need to tune other parameters as follows.

Wtdbg2 combines normal k-mers and homopolymer-compressed (HPC) k-mers to find read overlaps. Option -k specifies the length of normal k-mers, while -p specifies the length of HPC k-mers. By default, wtdbg2 samples a fourth of all k-mers by their hashcodes. For data of relatively low coverage, you may increase this sampling rate by reducing -S, though this greatly increases peak memory as a cost. Option -e, which defaults to 3, specifies the minimum read coverage of an edge in the assembly graph. You may adjust this option according to the overall sequencing depth, too. Option -A also helps relatively low-coverage data at the cost of performance. For PacBio data, -L5000 often leads to better assemblies empirically, so it is recommended. Please run wtdbg2 --help for a complete list of available options or consult README-ori.md for more help.
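Homopolymer compression, which the -p HPC k-mers are drawn from, simply collapses each run of identical bases to a single copy; this makes matching robust to the indel-heavy homopolymer errors typical of long noisy reads. A minimal sketch (illustrative only, not wtdbg2 source):

```c
#include <assert.h>
#include <stddef.h>

/* Collapse runs of identical bases: "AAACCGTT" -> "ACGT".
 * `out` must be at least as large as `seq`. */
static void hpc_compress(const char *seq, char *out)
{
    size_t j = 0;
    for (size_t i = 0; seq[i]; i++)
        if (j == 0 || seq[i] != out[j - 1])
            out[j++] = seq[i];
    out[j] = '\0';
}
```

HPC k-mers are then ordinary k-mers taken over this compressed string, so a read with "AAAA" and a read with "AAAAA" at the same locus still share the same HPC k-mers.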

The following table shows various command lines and their resource usage for the assembly step:

Dataset  GSize  Cov  Asm options  CPU (asm)  CPU (cns)  Real (total)  Peak RAM
E. coli 4.6Mb PB x20 -x rs -g4.6m -t16 53s 8m54s 42s 1.0G
C. elegans 100Mb PB x80 -x rs -g100m -t32 1h07m 5h06m 13m42s 11.6G
D. melanogaster A4 144m PB x120 -x rs -g144m -t32 2h06m 5h11m 26m17s 19.4G
D. melanogaster ISO1 144m ONT x32 -xont -g144m -t32 5h12m 4h30m 25m59s 17.3G
A. thaliana 125Mb PB x75 -x sq -g125m -t32 11h26m 4h57m 49m35s 25.7G
Human NA12878 3Gb ONT x36 -x ont -g3g -t31 793h11m 97h46m 31h03m 221.8G
Human NA19240 3Gb ONT x35 -x ont -g3g -t31 935h31m 89h17m 35h20m 215.0G
Human HG00733 3Gb PB x93 -x sq -g3g -t47 2114h26m 152h24m 52h22m 338.1G
Human NA24385 3Gb CCS x28 -x ccs -g3g -t31 231h25m 58h48m 10h14m 112.9G
Human CHM1 3Gb PB x60 -x rs -g3g -t96 105h33m 139h24m 5h17m 225.1G
Axolotl 32Gb PB x32 -x rs -g32g -t96 2806h40m 1456h13m 110h16m 1788.1G

The timing was obtained on three local servers with different hardware configurations. There are also run-to-run fluctuations. Exact timing on your machines may differ. The assembled contigs can be found at the following FTP:

ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/

Limitations

  • For Nanopore data, wtdbg2 may produce an assembly smaller than the true genome.

  • When inputting multiple files of both fasta and fastq format, please put the fastq files first, then fasta. Otherwise, the program cannot find '>' in the fastq input and appends all fastq records into one read.
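The failure mode above comes from the parser being keyed on the record delimiter of the first format it sees; a defensive loader could instead sniff each file's format from its first record character. The helper below is hypothetical, not part of wtdbg2:

```c
#include <assert.h>

/* Hypothetical format sniffer: FASTA records start with '>',
 * FASTQ records start with '@'.  Sketch only; a robust loader
 * would also skip blank lines and handle gzip transparently. */
enum seq_format { FMT_FASTA, FMT_FASTQ, FMT_UNKNOWN };

static enum seq_format sniff_format(const char *first_line)
{
    if (first_line[0] == '>') return FMT_FASTA;
    if (first_line[0] == '@') return FMT_FASTQ;
    return FMT_UNKNOWN;
}
```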

Citing wtdbg2

If you use wtdbg2, please cite:

Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. Nat Methods doi:10.1038/s41592-019-0669-3

Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. bioRxiv. doi:10.1101/530972

Getting Help

Please use the GitHub Issues page if you have questions. You may also directly contact Jue Ruan at [email protected].

wtdbg2's People

Contributors

colindaven, jvhaarst, lh3, ruanjue, smattr


wtdbg2's Issues

wtdbg-dot2gfa.pl does not work with contig dot file

Hi there

I would like to produce a contig GFA to inspect in Bandage. I found the wtdbg-dot2gfa.pl script but it only seems to work on the sequence graph (3.dot) rather than the contig graph. Is it possible to modify it to work with contigs?

Best
Nick

wtpoa-cns not using all requested threads

I asked wtpoa-cns to use 16 threads. However, on average it only uses about 500% CPU on my machine. I changed the default memory allocator and that seems to improve the multi-thread performance.

I am using the E. coli example from PBcR:

http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz

The command lines I was using:

wtdbg2 -i ecoli.fa.gz -t 16 -fo test -L5000 -e2
wtpoa-cns -i test.ctg.lay -t 16 -fo test.ctg.fa

You can override the system allocator with LD_PRELOAD:

LD_PRELOAD=libtcmalloc.so wtpoa-cns -i test.ctg.lay -t 16 -fo test.ctg.fa

Here are some results:

Library Real time (sec) User time Sys time Max RSS (kb)
glibc-2.12 285.901 848.230 575.720 1660412.0
jemalloc 75.703 814.820 41.580 3274516.0
tcmalloc 72.275 1023.740 26.120 1765996.0
lockless 100.658 953.020 102.220 4018172.0

You can see that the default glibc allocator (I am using CentOS 6) is quite bad, spending lots of system time on thread scheduling. tcmalloc is much better. You get almost a 4-fold speedup. jemalloc is good, too, but it takes too much extra memory.

Typically, you see the effect of memory allocators when you frequently malloc/free in each thread. Bwa suffers from this problem, too. I think there are two ways to fix this:

  1. Use a custom memory allocator. tcmalloc has been quite good for the few examples I have tried. This solution doesn't require you to modify the C source code. However, it is a little difficult for general users to build performant binaries.

  2. Reorganize malloc/free calls. You allocate a buffer before spawning the workers and try to avoid frequent malloc/free in each worker. Minimap2 takes this approach with a thread-local buffer. With this buffer disabled, minimap2 will become noticeably slower on many threads.
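The second approach can be sketched as follows. Each worker gets one scratch buffer allocated before the threads are spawned and reuses it for every task, so the hot loop makes no malloc/free calls at all. Buffer size and the task payload are arbitrary demo values, not wtdbg2 code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define N_TASKS  64
#define BUF_INTS 1024

typedef struct {
    int *scratch;            /* preallocated once, reused per task   */
    int first_task, n_tasks; /* static task partition for the worker */
    long sum;                /* per-worker result, merged after join */
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = (worker_t *)arg;
    for (int t = w->first_task; t < w->first_task + w->n_tasks; t++) {
        /* fill the reused scratch buffer instead of mallocing a new one */
        for (int i = 0; i < BUF_INTS; i++) w->scratch[i] = t;
        for (int i = 0; i < BUF_INTS; i++) w->sum += w->scratch[i];
    }
    return NULL;
}

/* n_threads must divide N_TASKS in this simplified partitioning. */
static long run_pool(int n_threads)
{
    pthread_t tid[n_threads];
    worker_t w[n_threads];
    int per = N_TASKS / n_threads;
    long total = 0;
    for (int i = 0; i < n_threads; i++) {
        w[i].scratch = malloc(BUF_INTS * sizeof(int)); /* once per worker */
        w[i].first_task = i * per;
        w[i].n_tasks = per;
        w[i].sum = 0;
        pthread_create(&tid[i], NULL, worker, &w[i]);
    }
    for (int i = 0; i < n_threads; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].sum;
        free(w[i].scratch);
    }
    return total;
}
```

Because allocation happens once per thread rather than once per task, the allocator's internal locks are never contended inside the hot loop, which is exactly where the glibc numbers above lose their time.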

cannot find reads

Dear Jue,
Could you please indicate me how to solve this problem?
Thanks!

[Tue Jun 26 17:19:15 2018] loading reads
8071186 reads
[Tue Jun 26 17:28:34 2018] Done, 8071186 reads, 149999992485 bp, 581944848 bins
** PROC_STAT(0) **: real 559.139 sec, user 484.810 sec, sys 149.680 sec, maxrss 44491600.0 kB, maxvsize 157162940.0 kB
[Tue Jun 26 17:28:34 2018] loading alignments from "1-1.kbmap","1-2.kbmap","1-3.kbmap","1-4.kbmap","1-5.kbmap","1-6.kbmap","1-7.kbmap","1-8.kbmap","1-9.kbmap","1-10.kbmap","2-1.kbmap","2-2.kbmap","2-3.kbmap","2-4.kbmap","2-5.kbmap","2-6.kbmap","2-7.kbmap","2-8.kbmap","2-9.kbmap","2-10.kbmap","3-1.kbmap","3-2.kbmap","3-3.kbmap","3-4.kbmap","3-5.kbmap","3-6.kbmap","3-7.kbmap","3-8.kbmap","3-9.kbmap","3-10.kbmap","4-1.kbmap","4-2.kbmap","4-3.kbmap","4-4.kbmap","4-5.kbmap","4-6.kbmap","4-7.kbmap","4-8.kbmap","4-9.kbmap","4-10.kbmap","5-1.kbmap","5-2.kbmap","5-3.kbmap","5-4.kbmap","5-5.kbmap","5-6.kbmap","5-7.kbmap","5-8.kbmap","5-9.kbmap","5-10.kbmap","6-1.kbmap","6-2.kbmap","6-3.kbmap","6-4.kbmap","6-5.kbmap","6-6.kbmap","6-7.kbmap","6-8.kbmap","6-9.kbmap","6-10.kbmap","7-1.kbmap","7-2.kbmap","7-3.kbmap","7-4.kbmap","7-5.kbmap","7-6.kbmap","7-7.kbmap","7-8.kbmap","7-9.kbmap","7-10.kbmap","8-1.kbmap","8-2.kbmap","8-3.kbmap","8-4.kbmap","8-5.kbmap","8-6.kbmap","8-7.kbmap","8-8.kbmap","8-9.kbmap","8-10.kbmap","9-1.kbmap","9-2.kbmap","9-3.kbmap","9-4.kbmap","9-5.kbmap","9-6.kbmap","9-7.kbmap","9-8.kbmap","9-9.kbmap","9-10.kbmap","10-1.kbmap","10-2.kbmap","10-3.kbmap","10-4.kbmap","10-5.kbmap","10-6.kbmap","10-7.kbmap","10-8.kbmap","10-9.kbmap","10-10.kbmap"
62610000 -- Cannot find read "m54173_18030852/49217650/1065_26917" in LINE:62611733 in build_nodes_graph -- wtdbg.c:1034 --
63650000 -- Cannot find read "m54174_180329_082736/7826476/7300_29818" in LINE:63653665 in build_nodes_graph -- wtdbg.c:1034 --
63740000 -- Cannot find read "m54170_180315_1329554172_180322_035319/49480042/167_25553" in LINE:63740153 in build_nodes_graph -- wtdbg.c:1034 --
64370000 -- Cannot find read "m54173_180331_111252/52035731/0_17734929/33292754/0_27797" in LINE:64375142 in build_nodes_graph -- wtdbg.c:1034 --
-- Cannot find read "-" in LINE:64375513 in build_nodes_graph -- wtdbg.c:1034 --
64460000 -- Cannot find read "+" in LINE:64463256 in build_nodes_graph -- wtdbg.c:1034 --
66190000 -- Cannot find read "m54170_18030256" in LINE:66195795 in build_nodes_graph -- wtdbg.c:1034 --
67290000 -- Cannot find read "m54174_180319_171333/70254672/0_26/0_26562" in LINE:67294315 in build_nodes_graph -- wtdbg.c:1017 --
67940000 -- Cannot find read "m54173_18030_55439" in LINE:67945009 in build_nodes_graph -- wtdbg.c:1017 --
67950000 -- Cannot find read "m54174_180337/17499038/0_26853" in LINE:67957390 in build_nodes_graph -- wtdbg.c:1034 --
79350000 -- Cannot find read "m54173_180322_135952/29_27776" in LINE:79352414 in build_nodes_graph -- wtdbg.c:1034 --
83890000 -- Bad cigar '/' "3Md3m2452677/0_23760" in LINE:83890828 in build_nodes_graph -- wtdbg.c:1072 --

Portable way to get system/process information

The wtdbg2 algorithm is largely OS-independent. However, to collect system and process information, it assumes the OS is Linux. It would be good to remove this dependency. Here are some portable ways to get the real time, CPU time and peak RAM of the current process, and the number of CPUs and total RAM of the system. These work on both Mac and Linux.

#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

double cputime(void) // return in CPU seconds, including both user and system CPU time
{
	struct rusage r;
	getrusage(RUSAGE_SELF, &r);
	return r.ru_utime.tv_sec + r.ru_stime.tv_sec + 1e-6 * (r.ru_utime.tv_usec + r.ru_stime.tv_usec);
}

double realtime(void) // return in seconds
{
	struct timeval tp;
	struct timezone tzp;
	gettimeofday(&tp, &tzp);
	return tp.tv_sec + tp.tv_usec * 1e-6;
}

long peakrss(void) // return in bytes
{
	struct rusage r;
	getrusage(RUSAGE_SELF, &r);
#ifdef __linux__
	return r.ru_maxrss * 1024;
#else
	return r.ru_maxrss;
#endif
}

int ncpucore(void)
{
	return sysconf(_SC_NPROCESSORS_ONLN);
}

long totalmem(void) // return in bytes
{
	long pages = sysconf(_SC_PHYS_PAGES);
	long page_size = sysconf(_SC_PAGE_SIZE);
	return pages * page_size;
}

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int i, n = 1<<20, *p;
	p = (int*)calloc(n, sizeof(int));
	for (i = 0; i < n; ++i) p[i] = i;
	fprintf(stderr, "ncpu: %d\n", ncpucore());
	fprintf(stderr, "peakrss: %ld\n", peakrss());
	fprintf(stderr, "totalmem: %ld\n", totalmem());
	free(p);
	return 0;
}

Unable to install on linux x64

Hi

Thanks for developing wtdbg2. I cloned the repository but am unable to compile the code. I'm on 64bit linux, gcc --version is 7.3.0. The following is the output of make

error.txt

I wonder if it's a missing library problem. In your Makefile you request

GLIBS=-lm -lrt -lpthread

Not sure if these are present in my system.

Any help appreciated.

Best wishes,

a bug for --ctg-min-length?

hello,
I found there is always a single sequence in the assembly shorter than the minimum contig length (5 kb), such as 1.7 kb, 4 kb, ..., in each project.

--ctg-min-length
Min length of contigs to be output, 5000

Thanks

Compiling with clang: fatal error: 'endian.h' file not found

Any ideas?

gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DTIMESTAMP="Thu Oct 25 23:44:41 PDT 2018" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o wtpoa-cns wtpoa-cns.c ksw.c -lm -lrt -lpthread
(The error output was interleaved across parallel compiler jobs; each of wtdbg.c, wtdbg-cns.c, wtpoa-cns.c and kbm.c fails the same way:)

In file included from kbm.c:20:
In file included from ./kbm.h:23:
In file included from ./list.h:28:
./mem_share.h:26:10: fatal error: 'endian.h' file not found
#include <endian.h>
         ^~~~~~~~~~
1 error generated.

clustering overlapped reads based on their alignments

Hi Jue,
I am not sure whether it is suitable to post my question here.
I am assembling a super-large genome (20 Gb). I have finished the alignment using the parallelized kbm-1.2.8 approach (#7) and would like to cluster overlapped reads into multiple small blocks for separate assembly, because loading all of the alignment files in wtdbg-1.28 would require more memory than I have.
My question is therefore: how can I extract and cluster overlapped reads into different blocks? Any suggestion?
Thanks!
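One way to derive such blocks from the kbmap alignments is to treat each read as a node and each alignment as an edge, then take connected components with union-find; reads in the same component can be assembled together. This is only a sketch of the idea, not an endorsed wtdbg workflow — real blocking would also need to cap component sizes, since repeats tend to glue everything into one giant component:

```c
#include <assert.h>

#define MAX_READS 1024

static int parent[MAX_READS];

static void uf_init(int n) { for (int i = 0; i < n; i++) parent[i] = i; }

static int uf_find(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  /* path halving */
        x = parent[x];
    }
    return x;
}

/* Call once per alignment record (read id a overlaps read id b). */
static void uf_union(int a, int b) { parent[uf_find(a)] = uf_find(b); }
```

After a single pass over the kbmap files, reads with the same root belong to one block and can be fed to separate wtdbg runs.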

kmer distribution

Hi Jue,
When I run wtdbg it plots the k-mer distribution and suggests that, in case of a "not good" distribution, I should adjust the -k, -p and -K parameters.
The k-mer distribution depends on many factors (e.g. repeat content, ploidy, CNVs, etc.); nevertheless, can you show me plots of distributions you would consider good, with some explanations?
It would be much appreciated.
Thanks,
Lel

A 7g genome

I used wtdbg-1.28 to assemble a 7 Gb plant genome with about 70X PacBio data under default parameters (-L 5000) and got a ~6.5 Gb result.
But now I am assembling this genome with the same data using wtdbg 2.1. The software stopped at indexing k-mers. I have tried several times. I think there may be a bug in this version; could you please check it?
PS. With version 2.1 I can only get a poor result, and only when using reads longer than 15 kb.

polish error

I got an error in the polish step using the new release, version 2.2. Is it a bug, or is it caused by wrong parameters?
[@localhost canu1.8_wtdbg2]$ wtpoa-cns -t $cores -d tp.canu.wtdbg.ctg.lay1.fa -i tp.canu.wtdbg.ctg.lay2.map.sam -fo tp.canu.wtdbg.ctg.lay2.fa
-- total memory 396031892.0 kB
-- available 334581040.0 kB
-- 120 cores
-- Starting program: wtpoa-cns -t 24 -d tp.canu.wtdbg.ctg.lay1.fa -i tp.canu.wtdbg.ctg.lay2.map.sam -fo tp.canu.wtdbg.ctg.lay2.fa
-- pid 100642
-- date Mon Dec 3 15:16:59 2018
wtpoa-cns: wtpoa.h:472: init_samblock: Assertion `bstep <= bsize && 2 * bstep >= bsize' failed.

Compiling issue MACOSX. fatal error: sys/sysinfo.h

Hi. I am trying to compile in macOS Sierra 10.12.6
I keep getting an error message, do you have any idea how to solve it?

Some specs
x86_64-apple-darwin16.7.0
Apple LLVM version 9.0.0 (clang-900.0.37)
gcc (Homebrew) 7.2.0
command line tools are already installed
Xcode 9.0 Build version 9A235

$ make
gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DTIMESTAMP="Tue Oct 10 13:45:10 CEST 2017" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o kbm-1.2.8 kbm.c -lm -lrt -lpthread
In file included from list.h:28:0,
                 from kbm.h:23,
                 from kbm.c:20:
mem_share.h:33:10: fatal error: sys/sysinfo.h: No such file or directory
 #include <sys/sysinfo.h>
          ^~~~~~~~~~~~~~~
compilation terminated.
make: *** [kbm-1.2.8] Error 1

$ make CC=clang
clang -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DTIMESTAMP="Tue Oct 10 13:45:16 CEST 2017" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o kbm-1.2.8 kbm.c -lm -lrt -lpthread
clang: warning: -O4 is equivalent to -O3 [-Wdeprecated]
warning: unknown warning option '-Wno-unused-but-set-variable'; did you mean
      '-Wno-unused-const-variable'? [-Wunknown-warning-option]
In file included from kbm.c:20:
In file included from ./kbm.h:23:
In file included from ./list.h:28:
./mem_share.h:33:10: fatal error: 'sys/sysinfo.h' file not found
#include <sys/sysinfo.h>
         ^~~~~~~~~~~~~~~
1 warning and 1 error generated.
make: *** [kbm-1.2.8] Error 1

How to run wtdbg-1.2.8 in multiple nodes (the usage of --load-xxx)

Hi Jue,

Very interesting project! I'm eager to test wtdbg and interested to make use of several nodes.

You write that this can be achieved by combining kbm and wtdbg --load-alignments. Could you write a few more words or give an example of commands?

Cheers,
Iggy

how to set -K for repeats

Hi Jue,
Can you advise what is the best way of setting -K to filter possibly repeat derived kmers?
E.g. with 30X read coverage if I want to filter repeats which are present at least 10 or more copies should I set -K to 300?
Also, assuming that 40% of the genome made up by repeats I assume that the fraction of the total repeat driven kmer count should be set to 0.4?
So would the two above be captured as -K 300.4?
Do I interpret the -K parameter correctly?
Thank you,
Lel

support for ".fasta"

I noticed that if you use the flag

-o output_file.fasta

No output is generated. But if you use either of the following:

-o output_file
-o output_file.fa

Then the corresponding file is generated.

Hanging when using one thread

Hi,

I am doing a really small local assembly and noticed that the program gets stuck when I specify -t 1, and will run if I choose any other value of -t.
For example

wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 1

hangs on the last line

--
-- total memory       49415848.0 kB
-- available          45630044.0 kB
-- 12 cores
-- Starting program: wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 1
-- pid                     26606
-- date         Wed Nov  7 11:23:57 2018
--
[Wed Nov  7 11:23:57 2018] loading reads
93 reads
[Wed Nov  7 11:23:57 2018] Done, 93 reads, 1674265 bp, 6492 bins
** PROC_STAT(0) **: real 0.013 sec, user 0.000 sec, sys 0.000 sec, maxrss 20372.0 kB, maxvsize 98308.0 kB
[Wed Nov  7 11:23:57 2018] generating nodes, 1 threads
[Wed Nov  7 11:23:57 2018] indexing bins[0,6492] (1661952 bp), 1 threads
[Wed Nov  7 11:23:57 2018] - scanning kmers (K0P21 subsampling 1/4) from 6492 bins
6492 bins
** PROC_STAT(0) **: real 0.113 sec, user 0.080 sec, sys 0.020 sec, maxrss 49528.0 kB, maxvsize 258436.0 kB
[Wed Nov  7 11:23:57 2018] - Total kmers = 17559
[Wed Nov  7 11:23:57 2018] - average kmer depth = 5
[Wed Nov  7 11:23:57 2018] - 163936 low frequency kmers (<2)
[Wed Nov  7 11:23:57 2018] - 0 high frequency kmers (>1000)
[Wed Nov  7 11:23:57 2018] - indexing 17559 kmers, 91195 instances (at most)
6492 bins
[Wed Nov  7 11:23:57 2018] - indexed  17559 kmers, 91015 instances
[Wed Nov  7 11:23:57 2018] - masked 635 bins as closed
[Wed Nov  7 11:23:57 2018] - sorting
** PROC_STAT(0) **: real 0.113 sec, user 0.080 sec, sys 0.020 sec, maxrss 49528.0 kB, maxvsize 258436.0 kB
[Wed Nov  7 11:23:57 2018] Done
93 reads|total hits 2348
** PROC_STAT(0) **: real 0.317 sec, user 0.270 sec, sys 0.030 sec, maxrss 51068.0 kB, maxvsize 259104.0 kB
[Wed Nov  7 11:23:58 2018] chainning ...  412 hits into 202
[Wed Nov  7 11:23:58 2018] picking best 500 hits for each read ... 2138 hits
[Wed Nov  7 11:23:58 2018] clipping ... 0.00% bases
[Wed Nov  7 11:23:58 2018] generated 34061 regs
[Wed Nov  7 11:23:58 2018] sorting regs ...  Done
[Wed Nov  7 11:23:58 2018] generating intervals ...  1471 intervals
[Wed Nov  7 11:23:58 2018] selecting important intervals from 1471 intervals
[Wed Nov  7 11:23:58 2018] Intervals: kept 29, discarded 1442
** PROC_STAT(0) **: real 0.317 sec, user 0.270 sec, sys 0.030 sec, maxrss 51068.0 kB, maxvsize 259104.0 kB
[Wed Nov  7 11:23:58 2018] Done, 29 nodes
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.1.nodes". Done.
[Wed Nov  7 11:23:58 2018] median node depth = 16
[Wed Nov  7 11:23:58 2018] masked 0 high coverage nodes (>200 or <3)
[Wed Nov  7 11:23:58 2018] masked 1 repeat-like nodes by local subgraph analysis
[Wed Nov  7 11:23:58 2018] generating edges
[Wed Nov  7 11:23:58 2018] Done, 423 edges
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.1.reads". Done.
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.1.dot". Done.
[Wed Nov  7 11:23:58 2018] graph clean
[Wed Nov  7 11:23:58 2018] rescued 0 low cov edges
[Wed Nov  7 11:23:58 2018] deleted 0 binary edges
[Wed Nov  7 11:23:58 2018] deleted 1 isolated nodes
[Wed Nov  7 11:23:58 2018] cut 21 transitive edges
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.2.dot". Done.
[Wed Nov  7 11:23:58 2018] 2 bubbles; 2 tips; 0 yarns;
[Wed Nov  7 11:23:58 2018] deleted 1 isolated nodes
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.3.dot". Done.
[Wed Nov  7 11:23:58 2018] cut 0 branching nodes
[Wed Nov  7 11:23:58 2018] deleted 0 isolated nodes
[Wed Nov  7 11:23:58 2018] building unitigs
[Wed Nov  7 11:23:58 2018] TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.frg.nodes". Done.
[Wed Nov  7 11:23:58 2018] generating links
[Wed Nov  7 11:23:58 2018] generated 1 links
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.frg.dot". Done.
[Wed Nov  7 11:23:58 2018] rescue 0 weak links
[Wed Nov  7 11:23:58 2018] deleted 2 binary links
[Wed Nov  7 11:23:58 2018] cut 0 transitive links
[Wed Nov  7 11:23:58 2018] remove 0 boomerangs
[Wed Nov  7 11:23:58 2018] detached 0 repeat-associated paths
[Wed Nov  7 11:23:58 2018] remove 0 weak branches
[Wed Nov  7 11:23:58 2018] cut 0 tips
[Wed Nov  7 11:23:58 2018] pop 0 bubbles
[Wed Nov  7 11:23:58 2018] cut 0 tips
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.ctg.dot". Done.
[Wed Nov  7 11:23:58 2018] building contigs
[Wed Nov  7 11:23:58 2018] searched 1 contigs
[Wed Nov  7 11:23:58 2018] Estimated: TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472

However,

wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 2

completes:

--
-- total memory       49415848.0 kB
-- available          45653020.0 kB
-- 12 cores
-- Starting program: wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 2
-- pid                     29131
-- date         Wed Nov  7 11:27:23 2018
--
[Wed Nov  7 11:27:23 2018] loading reads
93 reads
[Wed Nov  7 11:27:23 2018] Done, 93 reads, 1674265 bp, 6492 bins
** PROC_STAT(0) **: real 0.020 sec, user 0.010 sec, sys 0.000 sec, maxrss 32736.0 kB, maxvsize 110772.0 kB
[Wed Nov  7 11:27:23 2018] generating nodes, 2 threads
[Wed Nov  7 11:27:23 2018] indexing bins[0,6492] (1661952 bp), 2 threads
[Wed Nov  7 11:27:23 2018] - scanning kmers (K0P21 subsampling 1/4) from 6492 bins
6492 bins
** PROC_STAT(0) **: real 0.121 sec, user 0.100 sec, sys 0.010 sec, maxrss 56000.0 kB, maxvsize 394128.0 kB
[Wed Nov  7 11:27:23 2018] - Total kmers = 17559
[Wed Nov  7 11:27:23 2018] - average kmer depth = 5
[Wed Nov  7 11:27:23 2018] - 163936 low frequency kmers (<2)
[Wed Nov  7 11:27:23 2018] - 0 high frequency kmers (>1000)
[Wed Nov  7 11:27:23 2018] - indexing 17559 kmers, 91195 instances (at most)
6492 bins
[Wed Nov  7 11:27:23 2018] - indexed  17559 kmers, 91015 instances
[Wed Nov  7 11:27:23 2018] - masked 635 bins as closed
[Wed Nov  7 11:27:23 2018] - sorting
** PROC_STAT(0) **: real 0.121 sec, user 0.100 sec, sys 0.010 sec, maxrss 56000.0 kB, maxvsize 394128.0 kB
[Wed Nov  7 11:27:23 2018] Done
93 reads|total hits 2348
** PROC_STAT(0) **: real 0.330 sec, user 0.290 sec, sys 0.020 sec, maxrss 58340.0 kB, maxvsize 394424.0 kB
[Wed Nov  7 11:27:23 2018] chainning ...  412 hits into 202
[Wed Nov  7 11:27:23 2018] picking best 500 hits for each read ... 2138 hits
[Wed Nov  7 11:27:23 2018] clipping ... 0.00% bases
[Wed Nov  7 11:27:23 2018] generated 34061 regs
[Wed Nov  7 11:27:23 2018] sorting regs ...  Done
[Wed Nov  7 11:27:23 2018] generating intervals ...  1471 intervals
[Wed Nov  7 11:27:23 2018] selecting important intervals from 1471 intervals
[Wed Nov  7 11:27:23 2018] Intervals: kept 29, discarded 1442
** PROC_STAT(0) **: real 0.330 sec, user 0.290 sec, sys 0.020 sec, maxrss 58340.0 kB, maxvsize 394424.0 kB
[Wed Nov  7 11:27:23 2018] Done, 29 nodes
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.1.nodes". Done.
[Wed Nov  7 11:27:23 2018] median node depth = 16
[Wed Nov  7 11:27:23 2018] masked 0 high coverage nodes (>200 or <3)
[Wed Nov  7 11:27:23 2018] masked 1 repeat-like nodes by local subgraph analysis
[Wed Nov  7 11:27:23 2018] generating edges
[Wed Nov  7 11:27:23 2018] Done, 423 edges
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.1.reads". Done.
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.1.dot". Done.
[Wed Nov  7 11:27:23 2018] graph clean
[Wed Nov  7 11:27:23 2018] rescued 0 low cov edges
[Wed Nov  7 11:27:23 2018] deleted 0 binary edges
[Wed Nov  7 11:27:23 2018] deleted 1 isolated nodes
[Wed Nov  7 11:27:23 2018] cut 21 transitive edges
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.2.dot". Done.
[Wed Nov  7 11:27:23 2018] 2 bubbles; 2 tips; 0 yarns;
[Wed Nov  7 11:27:23 2018] deleted 1 isolated nodes
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.3.dot". Done.
[Wed Nov  7 11:27:23 2018] cut 0 branching nodes
[Wed Nov  7 11:27:23 2018] deleted 0 isolated nodes
[Wed Nov  7 11:27:23 2018] building unitigs
[Wed Nov  7 11:27:23 2018] TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.frg.nodes". Done.
[Wed Nov  7 11:27:23 2018] generating links
[Wed Nov  7 11:27:23 2018] generated 1 links
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.frg.dot". Done.
[Wed Nov  7 11:27:23 2018] rescue 0 weak links
[Wed Nov  7 11:27:23 2018] deleted 2 binary links
[Wed Nov  7 11:27:23 2018] cut 0 transitive links
[Wed Nov  7 11:27:23 2018] remove 0 boomerangs
[Wed Nov  7 11:27:23 2018] detached 0 repeat-associated paths
[Wed Nov  7 11:27:23 2018] remove 0 weak branches
[Wed Nov  7 11:27:23 2018] cut 0 tips
[Wed Nov  7 11:27:23 2018] pop 0 bubbles
[Wed Nov  7 11:27:23 2018] cut 0 tips
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.ctg.dot". Done.
[Wed Nov  7 11:27:23 2018] building contigs
[Wed Nov  7 11:27:23 2018] searched 1 contigs
[Wed Nov  7 11:27:23 2018] Estimated: TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472
[Wed Nov  7 11:27:23 2018] output 1 contigs
[Wed Nov  7 11:27:23 2018] Program Done
** PROC_STAT(TOTAL) **: real 0.430 sec, user 0.340 sec, sys 0.020 sec, maxrss 58340.0 kB, maxvsize 394424.0 kB
---

Here is the read file:
WH.reads.fasta.gz

Thanks in advance!

abnormal node depth ???

Hello,

I used the wtdbg2 to do the assembly for a genome(~2.8Gbp) with ~30X data(PacBio, length cutoff:7000).
The parameters for kbm2: -p 0 -k 15 -S 2 -m 300
the parameters for wtdbg2: --node-drop 0.25 --node-len 1024 --node-max 100 --aln-dovetail -1

and the log information:
Done, 5992448 reads (>=0 bp), 87876681500 bp, 340291433 bins
[Mon Dec 10 15:19:57 2018] chainning ... 1796935 hits into 896135, deleted 13977831 non-best hits between two reads
[Mon Dec 10 15:20:08 2018] picking best 500 hits for each read ... 178840586 hits
[Mon Dec 10 15:20:23 2018] clipping ... 14.39% bases
[Mon Dec 10 15:24:48 2018] generated 859464418 regs
[Mon Dec 10 15:25:00 2018] sorting regs ... Done
[Mon Dec 10 15:25:32 2018] generating intervals ... 30385993 intervals
[Mon Dec 10 15:25:39 2018] selecting important intervals from 30385993 intervals
[Mon Dec 10 15:29:02 2018] Intervals: kept 1146431, discarded 29239562
[Mon Dec 10 15:29:12 2018] median node depth = 7
[Mon Dec 10 15:29:12 2018] masked 19859 high coverage nodes (>100 or <3)
[Mon Dec 10 15:29:14 2018] masked 76516 repeat-like nodes by local subgraph analysis
[Mon Dec 10 15:29:14 2018] generating edges
[Mon Dec 10 15:29:26 2018] Done, 4335269 edges

[Mon Dec 10 15:30:25 2018] Estimated: TOT 1712349952, CNT 45608, AVG 37545, MAX 6525440, N50 73728, L50 2748, N90 13312, L90 26733, Min

The average node depth is around 7, which I think is abnormal and may be responsible for the low N50.
Could you give me some advice to improve my genome assembly? Thanks!

Best

Reads longer than 256kb

Hi there

Would it be possible to implement support for reads longer than 256kb (nanopore)? I expect such reads should contribute a great deal to the assembly of difficult repeats. At the moment we have reads up to 2.5Mb, but I expect longer reads may be possible soon, so perhaps some headroom can be built into wtdbg2 to allow for technology improvements.

Thank you very much for wtdbg2!

Best
Nick

Error when input *fq and *fa files simultaneously.

wtdbg accepts files in *fa or *fq format and allows multiple files following -i. But when I input files of different formats, like -i s1.fa.gz -i s2.fq.gz, the program ends with a core dump at the "loading reads" stage. Can wtdbg support *fa and *fq files in the same run?

wtdbg get smaller genome assembly

Hi Dr. Ruan,
I am using wtdbg to assemble a 500 Mb genome with CANU corrected reads as input. My command lines are:
$ wtdbg-1.2.8 -t 32 -i canu.correctedReads.fasta -fo dbg -S 2 --edge-min 2 --rescue-low-cov-edges
$ wtdbg-cns -t 32 -i dbg.ctg.lay -o dbg.ctg.lay.fa
CANU resulted in an assembly of 512 Mb with a contig N50 of 780 kb, while wtdbg generated an assembly of only 369 Mb with a contig N50 of 4.35 Mb.
Wtdbg is much better than CANU in sequence continuity, but produces a smaller assembly. Which parameters should I adjust to get an assembly size much closer to the expected 500 Mb without sacrificing continuity?
Thanks!

low N50 for pacbio ultra-long reads

Hi Jue,
I used 40X (canu-corrected) to 67X (uncorrected) PacBio long reads (N50 33-40kb) for assembly with wtdbg. Unfortunately, I got a low contig N50 of about 80-200kb with different settings (-k, -p).
It works well on normal PacBio reads (N50 12kb).
Should I use special parameters for the assembly of ultra-long reads?
Thanks
shujun

Segmentation fault

Hello,

I used kbm2 to do the alignment and quickly got a segmentation fault. But I ran wtdbg2 on the same data and it was fine. So is there something wrong with kbm2?

1: ./20190107/wtdbg2/kbm2 -i ./reads.2.fa -fo test

-- total memory 131861660.0 kB
-- available 121149056.0 kB
-- 32 cores
-- Starting program: /20190107/wtdbg2/kbm2 -i ./reads.2.fa -fo test
-- pid 23732
-- date Mon Jan 7 16:55:47 2019

[Mon Jan 7 16:55:47 2019] loading sequences
Segmentation fault

2: ./20190107/wtdbg2/wtdbg2 -i /reads.2.fa -fo test

-- total memory 131861660.0 kB
-- available 121256172.0 kB
-- 32 cores
-- Starting program: ./20190107/wtdbg2/wtdbg2 -i ./reads.2.fa -fo test
-- pid 23280
-- date Mon Jan 7 16:55:34 2019

[Mon Jan 7 16:55:34 2019] loading reads
40000

Error rate / haplotype collapsing

Hi Jue,

  • wtdbg2 produces very interesting results. I am particularly impressed with the low level of duplication (haplotigs) in the final assemblies.
  • Is there a combination of parameters that could reduce haplotype and repeat collapsing to produce an assembly with both alleles when you have heterozygous regions or structural variants? I am trying to replicate what Canu does, which is separating haplotypes that are 1-2% divergent.

Best,
Guilherme

Minor error in shell script creation

Hi,

Would like to say firstly that I've been trying basically every assembler I can find for my genome, and so far SMARTdenovo has given one of the best assemblies from initial statistics, so I'm very interested in the development of SMARTdenovo and/or this wtdbg assembler!

On topic: When running run_wtdbg_assembly.sh with "-T > sub_script.sh", the output script has a typo which causes an error in the first uncommented line (--rescure-low-cov-edges). Manually editing this line fixes the issue.

I'll let you know how the assembly goes, unsure exactly how long it should take with the resources I have available.

Zac.

-i with multiple files doesn't work properly / fails to throw error

Awesome program, thanks! I came across the following issue, though. The help reads as if -i a.fq b.fq c.fq ... would be a valid call for multiple read files, and it doesn't throw an error. But if specified like that, wtdbg2 only reads the first file and silently ignores the others, without warning or exception.

-i a.fq -i b.fq ... works. So it would be helpful to have this documented more clearly, and/or have an error thrown on unused extra arguments.

a.fq and b.fq have one read each, but only one is read:

~software/wtdbg2/wtdbg2 -i a.fq b.fq -o foo
--
-- total memory       65922936.0 kB
-- available          42977240.0 kB
-- 16 cores
-- Starting program: /nobackup1/chisholmlab/software/wtdbg2/wtdbg2 -i a.fq -o foo b.fq
-- pid                     10006
-- date         Wed Oct 24 10:54:02 2018
--
[Wed Oct 24 10:54:03 2018] loading reads
1 reads
[Wed Oct 24 10:54:03 2018] Done, 1 reads, 5000 bp, 19 bins
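For reference, the invocation that does load both files repeats -i once per file, as noted above (paths as in the log):

```shell
# Each input file needs its own -i flag; a bare extra argument is silently ignored
~software/wtdbg2/wtdbg2 -i a.fq -i b.fq -o foo
```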

wtdbg2 hangs when creating *.1.dot.gz

Hi,

thanks for this very nice software! It is really fast and consumes (rather) little resources.

When I try the current version in the master branch (f020cb6), it seems to hang at the output step of the `*.1.dot.gz` file:

Sun Nov  4 11:53:31 CET 2018
--
-- total memory      131916384.0 kB
-- available         115516468.0 kB
-- 28 cores
-- Starting program: apps/software/wtdbg2_git/wtdbg2 -t 28 -i results/binning/kraken2/reads/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.fq -fo results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae -L 1000
-- pid                     18416
-- date         Sun Nov  4 11:53:31 2018
--
[Sun Nov  4 11:53:31 2018] loading reads
0 reads
[Sun Nov  4 11:53:31 2018] Done, 0 reads, 0 bp, 0 bins
** PROC_STAT(0) **: real 0.003 sec, user 0.000 sec, sys 0.000 sec, maxrss 972.0 kB, maxvsize 79856.0 kB
[Sun Nov  4 11:53:31 2018] generating nodes, 28 threads
0 reads|total hits 0
** PROC_STAT(0) **: real 0.003 sec, user 0.000 sec, sys 0.000 sec, maxrss 972.0 kB, maxvsize 79856.0 kB
[Sun Nov  4 11:53:31 2018] chainning ...  0 hits into 0
[Sun Nov  4 11:53:31 2018] picking best 500 hits for each read ... 0 hits
[Sun Nov  4 11:53:31 2018] clipping ... -nan% bases
[Sun Nov  4 11:53:31 2018] generated 0 regs
[Sun Nov  4 11:53:31 2018] sorting regs ...  Done
[Sun Nov  4 11:53:31 2018] generating intervals ...  0 intervals
[Sun Nov  4 11:53:31 2018] selecting important intervals from 0 intervals
[Sun Nov  4 11:53:31 2018] Intervals: kept 0, discarded 0
** PROC_STAT(0) **: real 0.003 sec, user 0.000 sec, sys 0.000 sec, maxrss 972.0 kB, maxvsize 79856.0 kB
[Sun Nov  4 11:53:31 2018] Done, 0 nodes
[Sun Nov  4 11:53:31 2018] output "results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.1.nodes". Done.
[Sun Nov  4 11:53:31 2018] median node depth = 0
[Sun Nov  4 11:53:31 2018] masked 0 high coverage nodes (>200 or <3)
[Sun Nov  4 11:53:31 2018] masked 0 repeat-like nodes by local subgraph analysis
[Sun Nov  4 11:53:31 2018] generating edges
[Sun Nov  4 11:53:31 2018] Done, 1 edges
[Sun Nov  4 11:53:31 2018] output "results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.1.reads". Done.
[Sun Nov  4 11:53:31 2018] output "results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.1.dot.gz".

Clearly, there is something strange with this input, e.g.,

[Sun Nov 4 11:53:31 2018] loading reads
0 reads

which might explain this unexpected behavior.

Yet, I would expect a program to fail gracefully with some info message rather than just hang ;)
Since the compression appears to have been added recently (after v2.2, which is not yet tagged/released?), I assume this is an easy-to-fix bug / non-robust feature.

The input is a set of reads from https://github.com/LomanLab/mockcommunity (most likely Release 1 as I downloaded it several weeks ago, and not the currently shown Release 2) binned by using Kraken2.

Hence, this is an attempt to see how per-bin assembly would work instead of a meta-assembly.
For several other bins, this step worked fine.

TIA for looking into this.

Best,

Cedric

Polish command obsolete ( wtpoa-cns missing -d parameter)

Hi,

I wish to polish the assembly with a BAM generated by minimap2, but wtpoa-cns now seems to be missing the -d option. Please advise.

$samtools view prefix.ctg.lay.map.srt.bam | /home/ijt/wtdbg2/wtpoa-cns -t 40 -d wtdbg2op1.ctg.lay.fa -i - -fo prefix.ctg.lay.2nd.fa

/home/ijt/wtdbg2/wtpoa-cns: invalid option -- 'd'
WTPOA-CNS: Consensuser for wtdbg using PO-MSA
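For reference, the -d option does exist in current builds; the README's BAM-based polishing step looks like the following (sketched with the README's file names, not the ones in the command above — assuming a wtpoa-cns build recent enough to accept -d):

```shell
# Map the raw reads back to the draft assembly and sort into a BAM
minimap2 -t16 -ax map-pb -r2k dbg.raw.fa reads.fa.gz | samtools sort -@4 > dbg.bam

# Feed primary alignments only (-F0x900 drops secondary and supplementary records)
# to wtpoa-cns; -d names the draft assembly to polish
samtools view -F0x900 dbg.bam | ./wtpoa-cns -t 16 -d dbg.raw.fa -i - -fo dbg.cns.fa
```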

more details about the option "--no-read-length-sort"

Hello,

(A) I used a single node to run wtdbg with the ath data, and I get these results:
total length: 132015737, Max_length: 1433722, N50_len: 198115

(B) I used multiple nodes to run the kbm alignment in the way mentioned before, adding the parameter --no-read-length-sort, and I get these results:
total length: 132770392, Max_length: 889975, N50_len: 178479

(C) I used the alignments from (A), generated on a single node, ran the assembly with the parameter --no-read-length-sort, and I get these results:
total length: 132675424, Max_length: 838984, N50_len: 176245

It seems that a single node usually generates better results than multiple nodes, probably because the assembly part runs without the --no-read-length-sort option.
My question is: can I run the alignment on multiple nodes, then do something with the alignments, and then run the assembly part without --no-read-length-sort?
Or can you give me more details about the --no-read-length-sort option?

Thanks

Optimisation of parameters

Hi,

I am wondering if you could give me a few tips for optimising the assembly. I have previously used your SMARTdenovo assembler and obtained a quite good assembly. Parameters were default, except that I reduced the minimum length cut-off to 2500. Stats are:

Genome size: 274,508,993
Estimated genome size [by SMARTdenovo]: 311,266,064
Number of contigs: 1,096
Shortest contig: 8,918
Longest contig: 3,655,283

N50: 621,057
Median: 89,349.5
Mean: 250,464.40967153283

I have tried three different parameter combinations with wtdbg, but haven't been able to get an assembly as contiguous. The default parameters produced the following stats:

Genome size: 308,848,834
Number of contigs: 7,249
Shortest contig: 3,148
Longest contig: 758,777

N50: 118,094
Median: 15,921
Mean: 42,605.715822872124

I next tried two variants of a "maximum sensitivity" combination (at least, as I understand it). The first variant included the arguments "-k 0 -p 17 -S 2 --edge-min 2 --rescue-low-cov-edges". The second was the same, but with --tidy-reads set to 2500. The statistics for the two, in their respective order, are below:

Genome size: 271,989,307
Number of contigs: 4,539
Shortest contig: 2,417
Longest contig: 2,258,713

N50: 458,530
Median: 11,060
Mean: 59,922.737827715355

Genome size: 269,423,173
Number of contigs: 5,036
Shortest contig: 2,038
Longest contig: 1,872,034

N50: 346,207
Median: 11,191.0
Mean: 53,499.43864177919
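For reference, the second ("maximum sensitivity") variant described above corresponds to an invocation roughly like the following — the read file name and thread count here are placeholders, not taken from the report:

```shell
# Sensitive settings: -k 0 -p 17, keep weak edges, rescue low-coverage edges,
# and trim/filter reads at 2500 bp
./wtdbg2 -t 16 -k 0 -p 17 -S 2 --edge-min 2 --rescue-low-cov-edges \
    --tidy-reads 2500 -i reads.fa.gz -fo dbg

# Derive the consensus from the layout
./wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.raw.fa
```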

I was wondering if you had any ideas for how I might be able to improve the program's performance, or if SMARTdenovo might just be better suited for my particular genome?

Thanks,
Zac.

Please tag a release

When you are ready, it would be good to tag a stable release (i.e. create a release on the releases page) – this is often a request to my projects as well. You may name it "v2.0" if you feel it is really ready for heavy public use, or "v2.0-rc1" if you are less confident. Up to you. Once you tag a release, I will create a bioconda recipe for wtdbg2. Thanks.

Is it necessary to further run consensus tools on the results of wtdbg or smartdenovo?

Hi Jue,

I'm sorry to bother you once again.

I found an evaluation paper which says (paragraph 11 of "Discussion"):

...Wtdbg assemblies, which always ranked last, mostly because no consensus procedure was executed, would need additional rounds of consensus polishing to effectively compete with other assemblers.

So I'm wondering if it is necessary to further run consensus tools, such as Racon, after running wtdbg1.1.006, wtdbg1.2.8 and smartdenovo now? I know all three tools have consensus modules, and all have been updated since this paper was published.

I'm working on a de novo genome assembly project and there are very limited genomic resources to evaluate correctness. Besides PacBio data, I also have several short-read libraries, so I want to perform scaffolding based on the wtdbg results. I don't know how errors in the contigs would affect the scaffolding.

Any suggestions or thoughts would be appreciated. Thank you!

Bests,
Yiwei Niu

question about polishing in wtpoa-cns

Hello, I have the following question. Can I use wtpoa-cns also to polish my contigs using paired end Illumina reads?

More specifically, can I use this command

minimap2 -ax sr prefix.ctg.lay.fa read1.fq read2.fq

instead of this

minimap2 -t 16 -x map-pb -a prefix.ctg.lay.fa reads.fa.gz

in the polishing step?

Thank you very much for your help.
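For comparison, the README's short-read polishing recipe uses bwa mem together with the sam-sr preset of wtpoa-cns rather than minimap2 -ax sr (file names as in the README):

```shell
# Index the draft assembly and map the paired-end short reads to it
bwa index dbg.cns.fa

# -x sam-sr tells wtpoa-cns the input alignments are short reads;
# -d names the draft assembly being polished
bwa mem -t 16 dbg.cns.fa sr.1.fa sr.2.fa | samtools sort -O SAM | \
    ./wtpoa-cns -t 16 -x sam-sr -d dbg.cns.fa -i - -fo dbg.srp.fa
```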

For the "ont" preset, change -L to 5000

For the several ONT datasets at hand, the N50 is not much longer than 10kb, so using "-L10000" would discard too many reads. BTW, I have changed the "sq" preset to "-p0 -k15". You changed the text, but didn't change the actual setting. See 9ab7df6.

Installation problem

^Cmake: *** [kbm] Interrupt

I have the installation problem shown above. I hope you can help me! Thank you.

Error-free sequences file

I see wtdbg-1.2.8 has the parameter -I <string> (Error-free sequences file, +). Can paired-end Illumina reads in fastq format be passed as an argument (in addition to the nanopore sequences passed to -i)?

new parameters for wtdbg2

Hello,
I noticed there are some new parameters in the latest version of wtdbg2:

(A) nanopore/ont: -p 19 -AS 2 -s 0.05 -L 10000
sequel/sq: -p 0 -k 15 -AS 2 -s 0.05 -L 10000

The parameter "-A" is set for Sequel and ONT reads. As mentioned before, the alignment of contained reads has little effect on the assembly results, so why do we have to keep all these alignments?

(B) -X Choose the best depth for layout (effective with -g) [50]

Does this parameter (-X 50) mean that we choose the longest reads up to 50X depth for the alignment and then perform the assembly, or do we use all the reads for the alignment and then choose the best 50X of alignments for the assembly?
How is "best" defined?

Thanks!

much worse assembly N50 in new version of wtdbg2

Hi,

I have around 30X coverage of a 1.5Gb insect genome. When trying to reassemble with the latest version, I get a much worse N50 and a bigger assembly. I was wondering if anyone can comment on the parameters I should tweak, please? Thanks.

Version: 1.1.006
Assembly Size 1.671Gb
N50 3Mb
Largest contig 16.1Mb

Version 2.2
Assembly Size 2.237Gb
N50 132kb
Largest contig 1.8Mb

parameters: -p19 -AS2 -e2 in both cases.

install error

Hi,
I got a 'warning: assignment makes pointer from integer without a cast' when compiling the software. How can I solve it?

No contig output for local assembly

Hello,

I am trying to assemble the ONT reads overlapping a specific region of the human genome, and unfortunately, in this region I end up with only 4 reads (3 of which are shorter than 5 kbp). After running wtdbg2 and wtpoa-cns, I get no contig.

The end of wtdbg2 log indicates:

searched 1 contigs
Estimated:
output 0 contigs

and wtpoa-cns log indicates:

0 contigs 1 edges

I am using the following parameters for the assembly:

-p 0 -k 15 -AS 1 --edge-min 1 --rescue-low-cov-edges

Do you know what is happening here (too low coverage?) and if I can do something to solve the issue (some parameters I didn't think about)? I also tried to decrease -k and -l without success.
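Put together, the run described above would look roughly like this — the read file and output prefix names are placeholders, and -g is omitted since the target is a short region rather than a whole genome:

```shell
# Low-coverage local assembly: short k-mers (-p 0 -k 15), keep alignments of
# contained reads (-AS 1), retain weak edges and rescue low-coverage ones
./wtdbg2 -p 0 -k 15 -AS 1 --edge-min 1 --rescue-low-cov-edges \
    -i region_reads.fq -fo region

# Derive the consensus from the layout
./wtpoa-cns -i region.ctg.lay.gz -fo region.ctg.fa
```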

Thank you for your help.

Guillaume
