
Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly

License: GNU General Public License v3.0


wtdbg2's Introduction

Getting Started

git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make
# Quick start with wtdbg2.pl
./wtdbg2.pl -t 16 -x rs -g 4.6m -o dbg reads.fa.gz

# Step-by-step command lines
# assemble long reads
./wtdbg2 -x rs -g 4.6m -i reads.fa.gz -t 16 -fo dbg

# derive consensus
./wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.raw.fa

# polish consensus, not necessary if you want to polish the assemblies using other tools
minimap2 -t16 -ax map-pb -r2k dbg.raw.fa reads.fa.gz | samtools sort -@4 >dbg.bam
samtools view -F0x900 dbg.bam | ./wtpoa-cns -t 16 -d dbg.raw.fa -i - -fo dbg.cns.fa

# Additional polishing using short reads
bwa index dbg.cns.fa
bwa mem -t 16 dbg.cns.fa sr.1.fa sr.2.fa | samtools sort -O SAM | ./wtpoa-cns -t 16 -x sam-sr -d dbg.cns.fa -i - -fo dbg.srp.fa

Introduction

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output. Wtdbg2 is able to assemble the human and even the 32Gb Axolotl genome at a speed tens of times faster than CANU and FALCON while producing contigs of comparable base accuracy.

During assembly, wtdbg2 chops reads into 1024bp segments, merges similar segments into a vertex and connects vertices based on the segment adjacency on reads. The resulting graph is called a fuzzy Bruijn graph (FBG). It is akin to a de Bruijn graph, but permits mismatches/gaps and keeps read paths when collapsing k-mers. The use of FBG distinguishes wtdbg2 from the majority of long-read assemblers.
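As a toy illustration of the chop-and-merge idea (this is not wtdbg2's code — the real FBG also tolerates mismatches and gaps when merging, and uses 1024bp segments rather than the short demo width here), one can chop reads into fixed-width segments and assign identical segments the same vertex id, so each read becomes a path of vertex ids:

```c
#include <assert.h>
#include <string.h>

#define SEG_LEN 8   /* wtdbg2 uses 1024 bp; shortened for the demo */

/* Return the vertex id for `seg`, adding it to the table if unseen. */
static int vertex_id(char table[][SEG_LEN + 1], int *n, const char *seg)
{
    for (int i = 0; i < *n; i++)
        if (strcmp(table[i], seg) == 0) return i;
    strcpy(table[*n], seg);
    return (*n)++;
}

/* Chop `read` into SEG_LEN segments, store the vertex path in `path`,
 * and return the number of segments (the short tail remainder is dropped). */
static int read_to_path(const char *read, char table[][SEG_LEN + 1],
                        int *n_vertices, int *path, int max_segs)
{
    int len = (int)strlen(read), n_segs = 0;
    for (int off = 0; off + SEG_LEN <= len && n_segs < max_segs; off += SEG_LEN) {
        char seg[SEG_LEN + 1];
        memcpy(seg, read + off, SEG_LEN);
        seg[SEG_LEN] = '\0';
        path[n_segs++] = vertex_id(table, n_vertices, seg);
    }
    return n_segs;
}
```

Two reads sharing a segment then pass through the same vertex, which is what lets the graph record their overlap.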

Installation

Wtdbg2 only works on 64-bit Linux. To compile, please type make in the source code directory. You can then copy wtdbg2 and wtpoa-cns to your PATH.

Wtdbg2 also comes with an approximate read mapper kbm, a faster but less accurate consensus tool wtdbg-cns, and many auxiliary scripts in the scripts directory.

Usage

Wtdbg2 has two key components: an assembler wtdbg2 and a consenser wtpoa-cns. Executable wtdbg2 assembles raw reads and generates the contig layout and edge sequences in a file "prefix.ctg.lay.gz". Executable wtpoa-cns takes this file as input and produces the final consensus in FASTA. A typical workflow looks like this:

./wtdbg2 -x rs -g 4.6m -t 16 -i reads.fa.gz -fo prefix
./wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa

where -g is the estimated genome size and -x specifies the sequencing technology, which could take value "rs" for PacBio RSII, "sq" for PacBio Sequel, "ccs" for PacBio CCS reads and "ont" for Oxford Nanopore. This option sets multiple parameters and should be applied before other parameters. When you are unable to get a good assembly, you may need to tune other parameters as follows.

Wtdbg2 combines normal k-mers and homopolymer-compressed (HPC) k-mers to find read overlaps. Option -k specifies the length of normal k-mers, while -p specifies the length of HPC k-mers. By default, wtdbg2 samples a fourth of all k-mers by their hashcodes. For data of relatively low coverage, you may increase this sampling rate by reducing -S, though this greatly increases peak memory as a cost. Option -e, which defaults to 3, specifies the minimum read coverage of an edge in the assembly graph. You may adjust this option according to the overall sequencing depth, too. Option -A also helps relatively low-coverage data at the cost of performance. For PacBio data, -L5000 often leads to better assemblies empirically, so it is recommended. Please run wtdbg2 --help for a complete list of available options or consult README-ori.md for more help.
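Homopolymer compression, which the -p HPC k-mers are drawn from, simply collapses each run of identical bases to a single copy; this makes matching robust to the indel-heavy homopolymer errors typical of long noisy reads. A minimal sketch (illustrative only, not wtdbg2 source):

```c
#include <assert.h>
#include <stddef.h>

/* Collapse runs of identical bases: "AAACCGTT" -> "ACGT".
 * `out` must be at least as large as `seq`. */
static void hpc_compress(const char *seq, char *out)
{
    size_t j = 0;
    for (size_t i = 0; seq[i]; i++)
        if (j == 0 || seq[i] != out[j - 1])
            out[j++] = seq[i];
    out[j] = '\0';
}
```

HPC k-mers are then ordinary k-mers taken over this compressed string, so a read with "AAAA" and a read with "AAAAA" at the same locus still share the same HPC k-mers.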

The following table shows various command lines and their resource usage for the assembly step:

Dataset  GSize  Cov  Asm options  CPU (asm)  CPU (cns)  Real (total)  Peak RAM
E. coli 4.6Mb PB x20 -x rs -g4.6m -t16 53s 8m54s 42s 1.0G
C. elegans 100Mb PB x80 -x rs -g100m -t32 1h07m 5h06m 13m42s 11.6G
D. melanogaster A4 144m PB x120 -x rs -g144m -t32 2h06m 5h11m 26m17s 19.4G
D. melanogaster ISO1 144m ONT x32 -xont -g144m -t32 5h12m 4h30m 25m59s 17.3G
A. thaliana 125Mb PB x75 -x sq -g125m -t32 11h26m 4h57m 49m35s 25.7G
Human NA12878 3Gb ONT x36 -x ont -g3g -t31 793h11m 97h46m 31h03m 221.8G
Human NA19240 3Gb ONT x35 -x ont -g3g -t31 935h31m 89h17m 35h20m 215.0G
Human HG00733 3Gb PB x93 -x sq -g3g -t47 2114h26m 152h24m 52h22m 338.1G
Human NA24385 3Gb CCS x28 -x ccs -g3g -t31 231h25m 58h48m 10h14m 112.9G
Human CHM1 3Gb PB x60 -x rs -g3g -t96 105h33m 139h24m 5h17m 225.1G
Axolotl 32Gb PB x32 -x rs -g32g -t96 2806h40m 1456h13m 110h16m 1788.1G

The timing was obtained on three local servers with different hardware configurations. There are also run-to-run fluctuations. Exact timing on your machines may differ. The assembled contigs can be found at the following FTP:

ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/

Limitations

  • For Nanopore data, wtdbg2 may produce an assembly smaller than the true genome.

  • When inputting multiple files of both fasta and fastq format, please put the fastq files first, then fasta. Otherwise, the program cannot find '>' in the fastq input and appends all fastq records into one read.
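The failure mode above comes from the parser being keyed on the record delimiter of the first format it sees; a defensive loader could instead sniff each file's format from its first record character. The helper below is hypothetical, not part of wtdbg2:

```c
#include <assert.h>

/* Hypothetical format sniffer: FASTA records start with '>',
 * FASTQ records start with '@'.  Sketch only; a robust loader
 * would also skip blank lines and handle gzip transparently. */
enum seq_format { FMT_FASTA, FMT_FASTQ, FMT_UNKNOWN };

static enum seq_format sniff_format(const char *first_line)
{
    if (first_line[0] == '>') return FMT_FASTA;
    if (first_line[0] == '@') return FMT_FASTQ;
    return FMT_UNKNOWN;
}
```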

Citing wtdbg2

If you use wtdbg2, please cite:

Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. Nat Methods doi:10.1038/s41592-019-0669-3

Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. bioRxiv. doi:10.1101/530972

Getting Help

Please use the GitHub Issues page if you have questions. You may also directly contact Jue Ruan at [email protected].

wtdbg2's People

Contributors

colindaven, jvhaarst, lh3, ruanjue, smattr


wtdbg2's Issues

wtdbg-dot2gfa.pl does not work with contig dot file

Hi there

I would like to produce a contig GFA to inspect in Bandage. I found the wtdbg-dot2gfa.pl script but it only seems to work on the sequence graph (3.dot) rather than the contig graph. Is it possible to modify it to work with contigs?

Best
Nick

wtpoa-cns not using all requested threads

I asked wtpoa-cns to use 16 threads. However, on average it only uses about 500% CPU on my machine. I changed the default memory allocator and that seems to improve the multi-thread performance.

I am using the E. coli example from PBcR:

http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz

The command lines I was using:

wtdbg2 -i ecoli.fa.gz -t 16 -fo test -L5000 -e2
wtpoa-cns -i test.ctg.lay -t 16 -fo test.ctg.fa

You can override the system allocator with LD_PRELOAD:

LD_PRELOAD=libtcmalloc.so wtpoa-cns -i test.ctg.lay -t 16 -fo test.ctg.fa

Here are some results:

Library Real time (sec) User time Sys time Max RSS (kb)
glibc-2.12 285.901 848.230 575.720 1660412.0
jemalloc 75.703 814.820 41.580 3274516.0
tcmalloc 72.275 1023.740 26.120 1765996.0
lockless 100.658 953.020 102.220 4018172.0

You can see that the default glibc allocator (I am using CentOS 6) is quite bad, spending lots of system time on thread scheduling. tcmalloc is much better. You get almost a 4-fold speedup. jemalloc is good, too, but it takes too much extra memory.

Typically, you see the effect of memory allocators when you frequently malloc/free in each thread. Bwa suffers from this problem, too. I think there are two ways to fix this:

  1. Use a custom memory allocator. tcmalloc has been quite good for the few examples I have tried. This solution doesn't require you to modify the C source code. However, it is a little difficult for general users to build performant binaries.

  2. Reorganize malloc/free calls. You allocate a buffer before spawning the workers and try to avoid frequent malloc/free in each worker. Minimap2 takes this approach with a thread-local buffer. With this buffer disabled, minimap2 will become noticeably slower on many threads.
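The second approach can be sketched as follows. Each worker gets one scratch buffer allocated before the threads are spawned and reuses it for every task, so the hot loop makes no malloc/free calls at all. Buffer size and the task payload are arbitrary demo values, not wtdbg2 code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define N_TASKS  64
#define BUF_INTS 1024

typedef struct {
    int *scratch;            /* preallocated once, reused per task   */
    int first_task, n_tasks; /* static task partition for the worker */
    long sum;                /* per-worker result, merged after join */
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = (worker_t *)arg;
    for (int t = w->first_task; t < w->first_task + w->n_tasks; t++) {
        /* fill the reused scratch buffer instead of mallocing a new one */
        for (int i = 0; i < BUF_INTS; i++) w->scratch[i] = t;
        for (int i = 0; i < BUF_INTS; i++) w->sum += w->scratch[i];
    }
    return NULL;
}

/* n_threads must divide N_TASKS in this simplified partitioning. */
static long run_pool(int n_threads)
{
    pthread_t tid[n_threads];
    worker_t w[n_threads];
    int per = N_TASKS / n_threads;
    long total = 0;
    for (int i = 0; i < n_threads; i++) {
        w[i].scratch = malloc(BUF_INTS * sizeof(int)); /* once per worker */
        w[i].first_task = i * per;
        w[i].n_tasks = per;
        w[i].sum = 0;
        pthread_create(&tid[i], NULL, worker, &w[i]);
    }
    for (int i = 0; i < n_threads; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].sum;
        free(w[i].scratch);
    }
    return total;
}
```

Because allocation happens once per thread rather than once per task, the allocator's internal locks are never contended inside the hot loop, which is exactly where the glibc numbers above lose their time.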

cannot find reads

Dear Jue,
Could you please indicate me how to solve this problem?
Thanks!

[Tue Jun 26 17:19:15 2018] loading reads
8071186 reads
[Tue Jun 26 17:28:34 2018] Done, 8071186 reads, 149999992485 bp, 581944848 bins
** PROC_STAT(0) **: real 559.139 sec, user 484.810 sec, sys 149.680 sec, maxrss 44491600.0 kB, maxvsize 157162940.0 kB
[Tue Jun 26 17:28:34 2018] loading alignments from "1-1.kbmap","1-2.kbmap","1-3.kbmap","1-4.kbmap","1-5.kbmap","1-6.kbmap","1-7.kbmap","1-8.kbmap","1-9.kbmap","1-10.kbmap","2-1.kbmap","2-2.kbmap","2-3.kbmap","2-4.kbmap","2-5.kbmap","2-6.kbmap","2-7.kbmap","2-8.kbmap","2-9.kbmap","2-10.kbmap","3-1.kbmap","3-2.kbmap","3-3.kbmap","3-4.kbmap","3-5.kbmap","3-6.kbmap","3-7.kbmap","3-8.kbmap","3-9.kbmap","3-10.kbmap","4-1.kbmap","4-2.kbmap","4-3.kbmap","4-4.kbmap","4-5.kbmap","4-6.kbmap","4-7.kbmap","4-8.kbmap","4-9.kbmap","4-10.kbmap","5-1.kbmap","5-2.kbmap","5-3.kbmap","5-4.kbmap","5-5.kbmap","5-6.kbmap","5-7.kbmap","5-8.kbmap","5-9.kbmap","5-10.kbmap","6-1.kbmap","6-2.kbmap","6-3.kbmap","6-4.kbmap","6-5.kbmap","6-6.kbmap","6-7.kbmap","6-8.kbmap","6-9.kbmap","6-10.kbmap","7-1.kbmap","7-2.kbmap","7-3.kbmap","7-4.kbmap","7-5.kbmap","7-6.kbmap","7-7.kbmap","7-8.kbmap","7-9.kbmap","7-10.kbmap","8-1.kbmap","8-2.kbmap","8-3.kbmap","8-4.kbmap","8-5.kbmap","8-6.kbmap","8-7.kbmap","8-8.kbmap","8-9.kbmap","8-10.kbmap","9-1.kbmap","9-2.kbmap","9-3.kbmap","9-4.kbmap","9-5.kbmap","9-6.kbmap","9-7.kbmap","9-8.kbmap","9-9.kbmap","9-10.kbmap","10-1.kbmap","10-2.kbmap","10-3.kbmap","10-4.kbmap","10-5.kbmap","10-6.kbmap","10-7.kbmap","10-8.kbmap","10-9.kbmap","10-10.kbmap"
62610000 -- Cannot find read "m54173_18030852/49217650/1065_26917" in LINE:62611733 in build_nodes_graph -- wtdbg.c:1034 --
63650000 -- Cannot find read "m54174_180329_082736/7826476/7300_29818" in LINE:63653665 in build_nodes_graph -- wtdbg.c:1034 --
63740000 -- Cannot find read "m54170_180315_1329554172_180322_035319/49480042/167_25553" in LINE:63740153 in build_nodes_graph -- wtdbg.c:1034 --
64370000 -- Cannot find read "m54173_180331_111252/52035731/0_17734929/33292754/0_27797" in LINE:64375142 in build_nodes_graph -- wtdbg.c:1034 --
-- Cannot find read "-" in LINE:64375513 in build_nodes_graph -- wtdbg.c:1034 --
64460000 -- Cannot find read "+" in LINE:64463256 in build_nodes_graph -- wtdbg.c:1034 --
66190000 -- Cannot find read "m54170_18030256" in LINE:66195795 in build_nodes_graph -- wtdbg.c:1034 --
67290000 -- Cannot find read "m54174_180319_171333/70254672/0_26/0_26562" in LINE:67294315 in build_nodes_graph -- wtdbg.c:1017 --
67940000 -- Cannot find read "m54173_18030_55439" in LINE:67945009 in build_nodes_graph -- wtdbg.c:1017 --
67950000 -- Cannot find read "m54174_180337/17499038/0_26853" in LINE:67957390 in build_nodes_graph -- wtdbg.c:1034 --
79350000 -- Cannot find read "m54173_180322_135952/29_27776" in LINE:79352414 in build_nodes_graph -- wtdbg.c:1034 --
83890000 -- Bad cigar '/' "3Md3m2452677/0_23760" in LINE:83890828 in build_nodes_graph -- wtdbg.c:1072 --

Portable way to get system/process information

The wtdbg2 algorithm is largely OS-independent. However, to collect system and process information, it assumes the OS is Linux. It would be good to remove this dependency. Here are some portable ways to get the real time, CPU time and peak RAM of the current process, and the number of CPUs and total RAM of the system. These work on both Mac and Linux.

#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

double cputime(void) // return in CPU seconds, including both user and system CPU time
{
	struct rusage r;
	getrusage(RUSAGE_SELF, &r);
	return r.ru_utime.tv_sec + r.ru_stime.tv_sec + 1e-6 * (r.ru_utime.tv_usec + r.ru_stime.tv_usec);
}

double realtime(void) // return in seconds
{
	struct timeval tp;
	struct timezone tzp;
	gettimeofday(&tp, &tzp);
	return tp.tv_sec + tp.tv_usec * 1e-6;
}

long peakrss(void) // return in bytes
{
	struct rusage r;
	getrusage(RUSAGE_SELF, &r);
#ifdef __linux__
	return r.ru_maxrss * 1024;
#else
	return r.ru_maxrss;
#endif
}

int ncpucore(void)
{
	return sysconf(_SC_NPROCESSORS_ONLN);
}

long totalmem(void) // return in bytes
{
	long pages = sysconf(_SC_PHYS_PAGES);
	long page_size = sysconf(_SC_PAGE_SIZE);
	return pages * page_size;
}

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int i, n = 1<<20, *p;
	p = (int*)calloc(n, sizeof(int));
	for (i = 0; i < n; ++i) p[i] = i;
	fprintf(stderr, "ncpu: %d\n", ncpucore());
	fprintf(stderr, "peakrss: %ld\n", peakrss());
	fprintf(stderr, "totalmem: %ld\n", totalmem());
	free(p);
	return 0;
}

Unable to install on linux x64

Hi

Thanks for developing wtdbg2. I cloned the repository but am unable to compile the code. I'm on 64bit linux, gcc --version is 7.3.0. The following is the output of make

error.txt

I wonder if it's a missing library problem. In your Makefile you request

GLIBS=-lm -lrt -lpthread

Not sure if these are present in my system.

Any help appreciated.

Best wishes,

a bug for --ctg-min-length?

hello,
I found there is always a single sequence in the assembly shorter than the minimum contig length (5 kb), such as 1.7 kb, 4 kb, ..., in each project.

--ctg-min-length
Min length of contigs to be output, 5000

Thanks

Compiling with clang: fatal error: 'endian.h' file not found

Any ideas?

gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DTIMESTAMP="Thu Oct 25 23:44:41 PDT 2018" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o wtpoa-cns wtpoa-cns.c ksw.c -lm -lrt -lpthread
(The error output was interleaved across parallel compiler jobs; each of wtdbg.c, wtdbg-cns.c, wtpoa-cns.c and kbm.c fails the same way:)

In file included from kbm.c:20:
In file included from ./kbm.h:23:
In file included from ./list.h:28:
./mem_share.h:26:10: fatal error: 'endian.h' file not found
#include <endian.h>
         ^~~~~~~~~~
1 error generated.

clustering overlapped reads based on their alignments

Hi Jue,
I am not sure whether it is suitable to post my question here.
I am assembling a super-large genome (20 Gb). I have finished the alignment using the parallelized kbm-1.2.8 approach (#7) and would like to cluster overlapped reads into multiple small blocks for separate assembly, because loading all of the alignment files in wtdbg-1.28 would require more memory than I have.
My question is therefore: how can I extract and cluster overlapped reads into different blocks? Any suggestion?
Thanks!
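One way to derive such blocks from the kbmap alignments is to treat each read as a node and each alignment as an edge, then take connected components with union-find; reads in the same component can be assembled together. This is only a sketch of the idea, not an endorsed wtdbg workflow — real blocking would also need to cap component sizes, since repeats tend to glue everything into one giant component:

```c
#include <assert.h>

#define MAX_READS 1024

static int parent[MAX_READS];

static void uf_init(int n) { for (int i = 0; i < n; i++) parent[i] = i; }

static int uf_find(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  /* path halving */
        x = parent[x];
    }
    return x;
}

/* Call once per alignment record (read id a overlaps read id b). */
static void uf_union(int a, int b) { parent[uf_find(a)] = uf_find(b); }
```

After a single pass over the kbmap files, reads with the same root belong to one block and can be fed to separate wtdbg runs.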

kmer distribution

Hi Jue,
When I run wtdbg it plots the k-mer distribution and suggests that, in case of a "not good" distribution, I should adjust the -k, -p and -K parameters.
The k-mer distribution depends on many factors (e.g. repeat content, ploidy, CNVs, etc.); nevertheless, can you show me plots of distributions you would consider good, with some explanations?
It would be much appreciated.
Thanks,
Lel

A 7g genome

I used wtdbg-1.28 to assemble a 7 Gb plant genome with about 70X PacBio data under default parameters (-L 5000) and got a ~6.5 Gb result.
But now I am assembling this genome with the same data using wtdbg 2.1. The software stopped at indexing k-mers. I have tried several times. I think there may be a bug in this version; could you please check it?
PS. With version 2.1 I can only get a poor result, and only when using reads longer than 15 kb.

polish error

I got an error in the polish step using the new release, version 2.2. Is it a bug, or is it caused by wrong parameters?
[@localhost canu1.8_wtdbg2]$ wtpoa-cns -t $cores -d tp.canu.wtdbg.ctg.lay1.fa -i tp.canu.wtdbg.ctg.lay2.map.sam -fo tp.canu.wtdbg.ctg.lay2.fa
-- total memory 396031892.0 kB
-- available 334581040.0 kB
-- 120 cores
-- Starting program: wtpoa-cns -t 24 -d tp.canu.wtdbg.ctg.lay1.fa -i tp.canu.wtdbg.ctg.lay2.map.sam -fo tp.canu.wtdbg.ctg.lay2.fa
-- pid 100642
-- date Mon Dec 3 15:16:59 2018
wtpoa-cns: wtpoa.h:472: init_samblock: Assertion `bstep <= bsize && 2 * bstep >= bsize' failed.

Compiling issue MACOSX. fatal error: sys/sysinfo.h

Hi. I am trying to compile in macOS Sierra 10.12.6
I keep getting an error message, do you have any idea how to solve it?

Some specs
x86_64-apple-darwin16.7.0
Apple LLVM version 9.0.0 (clang-900.0.37)
gcc (Homebrew) 7.2.0
command line tools are already installed
Xcode 9.0 Build version 9A235

$ make
gcc -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DTIMESTAMP="Tue Oct 10 13:45:10 CEST 2017" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o kbm-1.2.8 kbm.c -lm -lrt -lpthread
In file included from list.h:28:0,
                 from kbm.h:23,
                 from kbm.c:20:
mem_share.h:33:10: fatal error: sys/sysinfo.h: No such file or directory
 #include <sys/sysinfo.h>
          ^~~~~~~~~~~~~~~
compilation terminated.
make: *** [kbm-1.2.8] Error 1

$ make CC=clang
clang -g3 -W -Wall -Wno-unused-but-set-variable -O4 -DTIMESTAMP="Tue Oct 10 13:45:16 CEST 2017" -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -mpopcnt -msse4.2 -o kbm-1.2.8 kbm.c -lm -lrt -lpthread
clang: warning: -O4 is equivalent to -O3 [-Wdeprecated]
warning: unknown warning option '-Wno-unused-but-set-variable'; did you mean
      '-Wno-unused-const-variable'? [-Wunknown-warning-option]
In file included from kbm.c:20:
In file included from ./kbm.h:23:
In file included from ./list.h:28:
./mem_share.h:33:10: fatal error: 'sys/sysinfo.h' file not found
#include <sys/sysinfo.h>
         ^~~~~~~~~~~~~~~
1 warning and 1 error generated.
make: *** [kbm-1.2.8] Error 1

How to run wtdbg-1.2.8 in multiple nodes (the usage of --load-xxx)

Hi Jue,

Very interesting project! I'm eager to test wtdbg and interested to make use of several nodes.

You write that this can be achieved by combining kbm and wtdbg --load-alignments. Could you write a few more words or give an example of commands?

Cheers,
Iggy

how to set -K for repeats

Hi Jue,
Can you advise what is the best way of setting -K to filter possibly repeat derived kmers?
E.g. with 30X read coverage if I want to filter repeats which are present at least 10 or more copies should I set -K to 300?
Also, assuming that 40% of the genome made up by repeats I assume that the fraction of the total repeat driven kmer count should be set to 0.4?
So would the two above be captured as -K 300.4?
Do I interpret the -K parameter correctly?
Thank you,
Lel

support for ".fasta"

I noticed that if you use the flag

-o output_file.fasta

No output is generated. But if you use either of the following:

-o output_file
-o output_file.fa

Then the corresponding file is generated.

Hanging when using one thread

Hi,

I am doing a really small local assembly and noticed that the program gets stuck when I specify -t 1, and will run if I choose any other value of -t.
For example

wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 1

hangs on the last line

--
-- total memory       49415848.0 kB
-- available          45630044.0 kB
-- 12 cores
-- Starting program: wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 1
-- pid                     26606
-- date         Wed Nov  7 11:23:57 2018
--
[Wed Nov  7 11:23:57 2018] loading reads
93 reads
[Wed Nov  7 11:23:57 2018] Done, 93 reads, 1674265 bp, 6492 bins
** PROC_STAT(0) **: real 0.013 sec, user 0.000 sec, sys 0.000 sec, maxrss 20372.0 kB, maxvsize 98308.0 kB
[Wed Nov  7 11:23:57 2018] generating nodes, 1 threads
[Wed Nov  7 11:23:57 2018] indexing bins[0,6492] (1661952 bp), 1 threads
[Wed Nov  7 11:23:57 2018] - scanning kmers (K0P21 subsampling 1/4) from 6492 bins
6492 bins
** PROC_STAT(0) **: real 0.113 sec, user 0.080 sec, sys 0.020 sec, maxrss 49528.0 kB, maxvsize 258436.0 kB
[Wed Nov  7 11:23:57 2018] - Total kmers = 17559
[Wed Nov  7 11:23:57 2018] - average kmer depth = 5
[Wed Nov  7 11:23:57 2018] - 163936 low frequency kmers (<2)
[Wed Nov  7 11:23:57 2018] - 0 high frequency kmers (>1000)
[Wed Nov  7 11:23:57 2018] - indexing 17559 kmers, 91195 instances (at most)
6492 bins
[Wed Nov  7 11:23:57 2018] - indexed  17559 kmers, 91015 instances
[Wed Nov  7 11:23:57 2018] - masked 635 bins as closed
[Wed Nov  7 11:23:57 2018] - sorting
** PROC_STAT(0) **: real 0.113 sec, user 0.080 sec, sys 0.020 sec, maxrss 49528.0 kB, maxvsize 258436.0 kB
[Wed Nov  7 11:23:57 2018] Done
93 reads|total hits 2348
** PROC_STAT(0) **: real 0.317 sec, user 0.270 sec, sys 0.030 sec, maxrss 51068.0 kB, maxvsize 259104.0 kB
[Wed Nov  7 11:23:58 2018] chainning ...  412 hits into 202
[Wed Nov  7 11:23:58 2018] picking best 500 hits for each read ... 2138 hits
[Wed Nov  7 11:23:58 2018] clipping ... 0.00% bases
[Wed Nov  7 11:23:58 2018] generated 34061 regs
[Wed Nov  7 11:23:58 2018] sorting regs ...  Done
[Wed Nov  7 11:23:58 2018] generating intervals ...  1471 intervals
[Wed Nov  7 11:23:58 2018] selecting important intervals from 1471 intervals
[Wed Nov  7 11:23:58 2018] Intervals: kept 29, discarded 1442
** PROC_STAT(0) **: real 0.317 sec, user 0.270 sec, sys 0.030 sec, maxrss 51068.0 kB, maxvsize 259104.0 kB
[Wed Nov  7 11:23:58 2018] Done, 29 nodes
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.1.nodes". Done.
[Wed Nov  7 11:23:58 2018] median node depth = 16
[Wed Nov  7 11:23:58 2018] masked 0 high coverage nodes (>200 or <3)
[Wed Nov  7 11:23:58 2018] masked 1 repeat-like nodes by local subgraph analysis
[Wed Nov  7 11:23:58 2018] generating edges
[Wed Nov  7 11:23:58 2018] Done, 423 edges
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.1.reads". Done.
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.1.dot". Done.
[Wed Nov  7 11:23:58 2018] graph clean
[Wed Nov  7 11:23:58 2018] rescued 0 low cov edges
[Wed Nov  7 11:23:58 2018] deleted 0 binary edges
[Wed Nov  7 11:23:58 2018] deleted 1 isolated nodes
[Wed Nov  7 11:23:58 2018] cut 21 transitive edges
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.2.dot". Done.
[Wed Nov  7 11:23:58 2018] 2 bubbles; 2 tips; 0 yarns;
[Wed Nov  7 11:23:58 2018] deleted 1 isolated nodes
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.3.dot". Done.
[Wed Nov  7 11:23:58 2018] cut 0 branching nodes
[Wed Nov  7 11:23:58 2018] deleted 0 isolated nodes
[Wed Nov  7 11:23:58 2018] building unitigs
[Wed Nov  7 11:23:58 2018] TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.frg.nodes". Done.
[Wed Nov  7 11:23:58 2018] generating links
[Wed Nov  7 11:23:58 2018] generated 1 links
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.frg.dot". Done.
[Wed Nov  7 11:23:58 2018] rescue 0 weak links
[Wed Nov  7 11:23:58 2018] deleted 2 binary links
[Wed Nov  7 11:23:58 2018] cut 0 transitive links
[Wed Nov  7 11:23:58 2018] remove 0 boomerangs
[Wed Nov  7 11:23:58 2018] detached 0 repeat-associated paths
[Wed Nov  7 11:23:58 2018] remove 0 weak branches
[Wed Nov  7 11:23:58 2018] cut 0 tips
[Wed Nov  7 11:23:58 2018] pop 0 bubbles
[Wed Nov  7 11:23:58 2018] cut 0 tips
[Wed Nov  7 11:23:58 2018] output "group.0/wtdbg.assembly/asm.ctg.dot". Done.
[Wed Nov  7 11:23:58 2018] building contigs
[Wed Nov  7 11:23:58 2018] searched 1 contigs
[Wed Nov  7 11:23:58 2018] Estimated: TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472

However,

wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 2

completes:

--
-- total memory       49415848.0 kB
-- available          45653020.0 kB
-- 12 cores
-- Starting program: wtdbg2 -i group.0/WH.reads.fasta -f -o group.0/wtdbg.assembly/asm --ctg-min-length 10000 -L 5000 -t 2
-- pid                     29131
-- date         Wed Nov  7 11:27:23 2018
--
[Wed Nov  7 11:27:23 2018] loading reads
93 reads
[Wed Nov  7 11:27:23 2018] Done, 93 reads, 1674265 bp, 6492 bins
** PROC_STAT(0) **: real 0.020 sec, user 0.010 sec, sys 0.000 sec, maxrss 32736.0 kB, maxvsize 110772.0 kB
[Wed Nov  7 11:27:23 2018] generating nodes, 2 threads
[Wed Nov  7 11:27:23 2018] indexing bins[0,6492] (1661952 bp), 2 threads
[Wed Nov  7 11:27:23 2018] - scanning kmers (K0P21 subsampling 1/4) from 6492 bins
6492 bins
** PROC_STAT(0) **: real 0.121 sec, user 0.100 sec, sys 0.010 sec, maxrss 56000.0 kB, maxvsize 394128.0 kB
[Wed Nov  7 11:27:23 2018] - Total kmers = 17559
[Wed Nov  7 11:27:23 2018] - average kmer depth = 5
[Wed Nov  7 11:27:23 2018] - 163936 low frequency kmers (<2)
[Wed Nov  7 11:27:23 2018] - 0 high frequency kmers (>1000)
[Wed Nov  7 11:27:23 2018] - indexing 17559 kmers, 91195 instances (at most)
6492 bins
[Wed Nov  7 11:27:23 2018] - indexed  17559 kmers, 91015 instances
[Wed Nov  7 11:27:23 2018] - masked 635 bins as closed
[Wed Nov  7 11:27:23 2018] - sorting
** PROC_STAT(0) **: real 0.121 sec, user 0.100 sec, sys 0.010 sec, maxrss 56000.0 kB, maxvsize 394128.0 kB
[Wed Nov  7 11:27:23 2018] Done
93 reads|total hits 2348
** PROC_STAT(0) **: real 0.330 sec, user 0.290 sec, sys 0.020 sec, maxrss 58340.0 kB, maxvsize 394424.0 kB
[Wed Nov  7 11:27:23 2018] chainning ...  412 hits into 202
[Wed Nov  7 11:27:23 2018] picking best 500 hits for each read ... 2138 hits
[Wed Nov  7 11:27:23 2018] clipping ... 0.00% bases
[Wed Nov  7 11:27:23 2018] generated 34061 regs
[Wed Nov  7 11:27:23 2018] sorting regs ...  Done
[Wed Nov  7 11:27:23 2018] generating intervals ...  1471 intervals
[Wed Nov  7 11:27:23 2018] selecting important intervals from 1471 intervals
[Wed Nov  7 11:27:23 2018] Intervals: kept 29, discarded 1442
** PROC_STAT(0) **: real 0.330 sec, user 0.290 sec, sys 0.020 sec, maxrss 58340.0 kB, maxvsize 394424.0 kB
[Wed Nov  7 11:27:23 2018] Done, 29 nodes
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.1.nodes". Done.
[Wed Nov  7 11:27:23 2018] median node depth = 16
[Wed Nov  7 11:27:23 2018] masked 0 high coverage nodes (>200 or <3)
[Wed Nov  7 11:27:23 2018] masked 1 repeat-like nodes by local subgraph analysis
[Wed Nov  7 11:27:23 2018] generating edges
[Wed Nov  7 11:27:23 2018] Done, 423 edges
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.1.reads". Done.
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.1.dot". Done.
[Wed Nov  7 11:27:23 2018] graph clean
[Wed Nov  7 11:27:23 2018] rescued 0 low cov edges
[Wed Nov  7 11:27:23 2018] deleted 0 binary edges
[Wed Nov  7 11:27:23 2018] deleted 1 isolated nodes
[Wed Nov  7 11:27:23 2018] cut 21 transitive edges
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.2.dot". Done.
[Wed Nov  7 11:27:23 2018] 2 bubbles; 2 tips; 0 yarns;
[Wed Nov  7 11:27:23 2018] deleted 1 isolated nodes
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.3.dot". Done.
[Wed Nov  7 11:27:23 2018] cut 0 branching nodes
[Wed Nov  7 11:27:23 2018] deleted 0 isolated nodes
[Wed Nov  7 11:27:23 2018] building unitigs
[Wed Nov  7 11:27:23 2018] TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.frg.nodes". Done.
[Wed Nov  7 11:27:23 2018] generating links
[Wed Nov  7 11:27:23 2018] generated 1 links
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.frg.dot". Done.
[Wed Nov  7 11:27:23 2018] rescue 0 weak links
[Wed Nov  7 11:27:23 2018] deleted 2 binary links
[Wed Nov  7 11:27:23 2018] cut 0 transitive links
[Wed Nov  7 11:27:23 2018] remove 0 boomerangs
[Wed Nov  7 11:27:23 2018] detached 0 repeat-associated paths
[Wed Nov  7 11:27:23 2018] remove 0 weak branches
[Wed Nov  7 11:27:23 2018] cut 0 tips
[Wed Nov  7 11:27:23 2018] pop 0 bubbles
[Wed Nov  7 11:27:23 2018] cut 0 tips
[Wed Nov  7 11:27:23 2018] output "group.0/wtdbg.assembly/asm.ctg.dot". Done.
[Wed Nov  7 11:27:23 2018] building contigs
[Wed Nov  7 11:27:23 2018] searched 1 contigs
[Wed Nov  7 11:27:23 2018] Estimated: TOT 41472, CNT 1, AVG 41472, MAX 41472, N50 41472, L50 1, N90 41472, L90 1, Min 41472
[Wed Nov  7 11:27:23 2018] output 1 contigs
[Wed Nov  7 11:27:23 2018] Program Done
** PROC_STAT(TOTAL) **: real 0.430 sec, user 0.340 sec, sys 0.020 sec, maxrss 58340.0 kB, maxvsize 394424.0 kB
---

Here is the read file:
WH.reads.fasta.gz

Thanks in advance!

abnormal node depth ???

Hello,

I used the wtdbg2 to do the assembly for a genome(~2.8Gbp) with ~30X data(PacBio, length cutoff:7000).
The parameters for kbm2: -p 0 -k 15 -S 2 -m 300
the parameters for wtdbg2: --node-drop 0.25 --node-len 1024 --node-max 100 --aln-dovetail -1

and the log information:
Done, 5992448 reads (>=0 bp), 87876681500 bp, 340291433 bins
[Mon Dec 10 15:19:57 2018] chainning ... 1796935 hits into 896135, deleted 13977831 non-best hits between two reads
[Mon Dec 10 15:20:08 2018] picking best 500 hits for each read ... 178840586 hits
[Mon Dec 10 15:20:23 2018] clipping ... 14.39% bases
[Mon Dec 10 15:24:48 2018] generated 859464418 regs
[Mon Dec 10 15:25:00 2018] sorting regs ... Done
[Mon Dec 10 15:25:32 2018] generating intervals ... 30385993 intervals
[Mon Dec 10 15:25:39 2018] selecting important intervals from 30385993 intervals
[Mon Dec 10 15:29:02 2018] Intervals: kept 1146431, discarded 29239562
[Mon Dec 10 15:29:12 2018] median node depth = 7
[Mon Dec 10 15:29:12 2018] masked 19859 high coverage nodes (>100 or <3)
[Mon Dec 10 15:29:14 2018] masked 76516 repeat-like nodes by local subgraph analysis
[Mon Dec 10 15:29:14 2018] generating edges
[Mon Dec 10 15:29:26 2018] Done, 4335269 edges

[Mon Dec 10 15:30:25 2018] Estimated: TOT 1712349952, CNT 45608, AVG 37545, MAX 6525440, N50 73728, L50 2748, N90 13312, L90 26733, Min

The average node depth is around 7, which I think is abnormal and may be responsible for the low N50.
Could you give me some advice to improve my genome assembly? Thanks!

Best

Reads longer than 256kb

Hi there

Would it be possible to implement support for reads longer than 256kb (nanopore)? I expect such reads should contribute a great deal to the assembly of difficult repeats. At the moment we have reads up to 2.5Mb, but I expect longer reads may be possible soon, so perhaps some headroom can be built into wtdbg2 to allow for technology improvements.

Thank you very much for wtdbg2!

Best
Nick

Error when input *fq and *fa files simultaneously.

wtdbg accepts files in *fa or *fq format and allows multiple files following -i. But when I input files of different formats, like -i s1.fa.gz -i s2.fq.gz, the program ends with a core dump at the "loading reads" stage. Can wtdbg support *fa and *fq files in the same run?

wtdbg get smaller genome assembly

Hi Dr. Ruan,
I am using wtdbg to assemble a 500 Mb genome with CANU corrected reads as input. My command lines are:
$ wtdbg-1.2.8 -t 32 -i canu.correctedReads.fasta -fo dbg -S 2 --edge-min 2 --rescue-low-cov-edges
$ wtdbg-cns -t 32 -i dbg.ctg.lay -o dbg.ctg.lay.fa
CANU resulted in an assembly of 512 Mb with a contig N50 of 780 kb, while wtdbg generated an assembly of only 369 Mb with a contig N50 of 4.35 Mb.
Wtdbg is much better than CANU in sequence continuity, but produces a smaller assembly. Which parameters should I adjust to get an assembly size much closer to the expected 500 Mb without sacrificing continuity?
Thanks!

low N50 for pacbio ultra-long reads

Hi Jue,
I used 40X (canu-corrected) to 67X (uncorrected) PacBio long reads (N50 33-40kb) for assembly with wtdbg. Unfortunately, I got a low contig N50 of about 80-200kb with different settings (-k, -p).
It works well on normal PacBio reads (N50 12kb).
Should I use special parameters for the assembly of ultra-long reads?
Thanks
shujun

Segmentation fault

Hello,

I used kbm2 to do the alignment and quickly got a segmentation fault. But I ran wtdbg2 on the same data and it was fine. So is there something wrong with kbm2?

1: ./20190107/wtdbg2/kbm2 -i ./reads.2.fa -fo test

-- total memory 131861660.0 kB
-- available 121149056.0 kB
-- 32 cores
-- Starting program: /20190107/wtdbg2/kbm2 -i ./reads.2.fa -fo test
-- pid 23732
-- date Mon Jan 7 16:55:47 2019

[Mon Jan 7 16:55:47 2019] loading sequences
Segmentation fault

2: ./20190107/wtdbg2/wtdbg2 -i /reads.2.fa -fo test

-- total memory 131861660.0 kB
-- available 121256172.0 kB
-- 32 cores
-- Starting program: ./20190107/wtdbg2/wtdbg2 -i ./reads.2.fa -fo test
-- pid 23280
-- date Mon Jan 7 16:55:34 2019

[Mon Jan 7 16:55:34 2019] loading reads
40000

Error rate / haplotype collapsing

Hi Jue,

  • wtdbg2 produces very interesting results. I am particularly impressed with the low level of duplication (haplotigs) in the final assemblies.
  • Is there a combination of parameters that could reduce haplotype and repeat collapsing to produce an assembly with both alleles when you have heterozygous regions or structural variants? I am trying to replicate what Canu does, which is separating haplotypes that are 1-2% divergent.

Best,
Guilherme

Minor error in shell script creation

Hi,

Would like to say firstly that I've been trying basically every assembler I can find for my genome, and so far SMARTdenovo has given one of the best assemblies from initial statistics, so I'm very interested in the development of SMARTdenovo and/or this wtdbg assembler!

On topic: When running run_wtdbg_assembly.sh with "-T > sub_script.sh", the output script has a typo which causes an error in the first uncommented line (--rescure-low-cov-edges). Manually editing this line fixes the issue.

I'll let you know how the assembly goes, unsure exactly how long it should take with the resources I have available.

Zac.

-i with multiple files doesn't work properly / fails to throw error

Awesome program, thanks! I came across the following issue, though. The help reads as if -i a.fq b.fq c.fq ... would be a valid call for multiple read files, and it doesn't throw an error. But if specified like that, wtdbg2 only reads the first file and silently ignores the others, without warning or exception.

-i a.fq -i b.fq ... works. So it would be helpful to have this documented more clearly, and/or have an error thrown on unused extra arguments.

a.fq and b.fq have one read each, but only one is read:

~software/wtdbg2/wtdbg2 -i a.fq b.fq -o foo
--
-- total memory       65922936.0 kB
-- available          42977240.0 kB
-- 16 cores
-- Starting program: /nobackup1/chisholmlab/software/wtdbg2/wtdbg2 -i a.fq -o foo b.fq
-- pid                     10006
-- date         Wed Oct 24 10:54:02 2018
--
[Wed Oct 24 10:54:03 2018] loading reads
1 reads
[Wed Oct 24 10:54:03 2018] Done, 1 reads, 5000 bp, 19 bins
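For reference, the invocation that does load both files repeats -i once per file, as noted above (paths as in the log):

```shell
# Each input file needs its own -i flag; a bare extra argument is silently ignored
~software/wtdbg2/wtdbg2 -i a.fq -i b.fq -o foo
```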

wtdbg2 hangs when creating *.1.dot.gz

Hi,

thanks for this very nice software! It is really fast and consumes (rather) little resources.

When I try the current version in the master branch (f020cb6), it seems to hang at the output step of the `*.1.dot.gz` file:

Sun Nov  4 11:53:31 CET 2018
--
-- total memory      131916384.0 kB
-- available         115516468.0 kB
-- 28 cores
-- Starting program: apps/software/wtdbg2_git/wtdbg2 -t 28 -i results/binning/kraken2/reads/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.fq -fo results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae -L 1000
-- pid                     18416
-- date         Sun Nov  4 11:53:31 2018
--
[Sun Nov  4 11:53:31 2018] loading reads
0 reads
[Sun Nov  4 11:53:31 2018] Done, 0 reads, 0 bp, 0 bins
** PROC_STAT(0) **: real 0.003 sec, user 0.000 sec, sys 0.000 sec, maxrss 972.0 kB, maxvsize 79856.0 kB
[Sun Nov  4 11:53:31 2018] generating nodes, 28 threads
0 reads|total hits 0
** PROC_STAT(0) **: real 0.003 sec, user 0.000 sec, sys 0.000 sec, maxrss 972.0 kB, maxvsize 79856.0 kB
[Sun Nov  4 11:53:31 2018] chainning ...  0 hits into 0
[Sun Nov  4 11:53:31 2018] picking best 500 hits for each read ... 0 hits
[Sun Nov  4 11:53:31 2018] clipping ... -nan% bases
[Sun Nov  4 11:53:31 2018] generated 0 regs
[Sun Nov  4 11:53:31 2018] sorting regs ...  Done
[Sun Nov  4 11:53:31 2018] generating intervals ...  0 intervals
[Sun Nov  4 11:53:31 2018] selecting important intervals from 0 intervals
[Sun Nov  4 11:53:31 2018] Intervals: kept 0, discarded 0
** PROC_STAT(0) **: real 0.003 sec, user 0.000 sec, sys 0.000 sec, maxrss 972.0 kB, maxvsize 79856.0 kB
[Sun Nov  4 11:53:31 2018] Done, 0 nodes
[Sun Nov  4 11:53:31 2018] output "results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.1.nodes". Done.
[Sun Nov  4 11:53:31 2018] median node depth = 0
[Sun Nov  4 11:53:31 2018] masked 0 high coverage nodes (>200 or <3)
[Sun Nov  4 11:53:31 2018] masked 0 repeat-like nodes by local subgraph analysis
[Sun Nov  4 11:53:31 2018] generating edges
[Sun Nov  4 11:53:31 2018] Done, 1 edges
[Sun Nov  4 11:53:31 2018] output "results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.1.reads". Done.
[Sun Nov  4 11:53:31 2018] output "results/assembly/wtdbg2-L_1000/per_bin/GridION-Zymo_CS_BB_LSK109.Saccharomyces_cerevisiae.1.dot.gz".

Clearly, there is something strange with this input, e.g.,

[Sun Nov 4 11:53:31 2018] loading reads
0 reads

which might explain this unexpected behavior.

Yet, I would expect a program to fail gracefully with some info message rather than just hang ;)
Since the compression appears to have been added recently (after v2.2, which is not yet tagged/released?), I assume this is an easy-to-fix bug / non-robust feature.

The input is a set of reads from https://github.com/LomanLab/mockcommunity (most likely Release 1 as I downloaded it several weeks ago, and not the currently shown Release 2) binned by using Kraken2.

Hence, this is an attempt to see how per-bin assembly would work instead of a meta-assembly.
For several other bins, this step worked fine.

TIA for looking into this.

Best,

Cedric

Polish command obsolete ( wtpoa-cns missing -d parameter)

Hi,

I wish to polish the assembly with a BAM generated by minimap2, but wtpoa-cns now seems to be missing the -d option. Please advise.

$samtools view prefix.ctg.lay.map.srt.bam | /home/ijt/wtdbg2/wtpoa-cns -t 40 -d wtdbg2op1.ctg.lay.fa -i - -fo prefix.ctg.lay.2nd.fa

/home/ijt/wtdbg2/wtpoa-cns: invalid option -- 'd'
WTPOA-CNS: Consensuser for wtdbg using PO-MSA
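For reference, the -d option does exist in current builds; the README's BAM-based polishing step looks like the following (sketched with the README's file names, not the ones in the command above — assuming a wtpoa-cns build recent enough to accept -d):

```shell
# Map the raw reads back to the draft assembly and sort into a BAM
minimap2 -t16 -ax map-pb -r2k dbg.raw.fa reads.fa.gz | samtools sort -@4 > dbg.bam

# Feed primary alignments only (-F0x900 drops secondary and supplementary records)
# to wtpoa-cns; -d names the draft assembly to polish
samtools view -F0x900 dbg.bam | ./wtpoa-cns -t 16 -d dbg.raw.fa -i - -fo dbg.cns.fa
```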

more details about the option "--no-read-length-sort"

Hello,

(A) I used a single node to run wtdbg with the ath data, and I get these results:
total length: 132015737, Max_length: 1433722, N50_len: 198115

(B) I used multiple nodes to run the kbm alignment in the way mentioned before, adding the parameter --no-read-length-sort, and I get these results:
total length: 132770392, Max_length: 889975, N50_len: 178479

(C) I used the alignments from (A), generated on a single node, ran the assembly with the parameter --no-read-length-sort, and I get these results:
total length: 132675424, Max_length: 838984, N50_len: 176245

It seems that a single node usually generates better results than multiple nodes, probably because the assembly part runs without the --no-read-length-sort option.
My question is: can I run the alignment on multiple nodes, then do something with the alignments, and then run the assembly part without --no-read-length-sort?
Or can you give me more details about the --no-read-length-sort option?

Thanks

Optimisation of parameters

Hi,

I am wondering if you could give me a few tips for optimising the assembly. I have previously used your SMARTdenovo assembler and obtained a quite good assembly. Parameters were default, except that I reduced the minimum length cut-off to 2500. Stats are:

Genome size: 274,508,993
Estimated genome size [by SMARTdenovo]: 311,266,064
Number of contigs: 1,096
Shortest contig: 8,918
Longest contig: 3,655,283

N50: 621,057
Median: 89,349.5
Mean: 250,464.40967153283

I have tried three different parameter combinations with wtdbg, but haven't been able to get an assembly as contiguous. The default parameters produced the following stats:

Genome size: 308,848,834
Number of contigs: 7,249
Shortest contig: 3,148
Longest contig: 758,777

N50: 118,094
Median: 15,921
Mean: 42,605.715822872124

I next tried two variants of a "maximum sensitivity" combination (at least, as I understand it). The first variant included the arguments "-k 0 -p 17 -S 2 --edge-min 2 --rescue-low-cov-edges". The second was the same, but with --tidy-reads set to 2500. The statistics for the two, in their respective order, are below:

Genome size: 271,989,307
Number of contigs: 4,539
Shortest contig: 2,417
Longest contig: 2,258,713

N50: 458,530
Median: 11,060
Mean: 59,922.737827715355

Genome size: 269,423,173
Number of contigs: 5,036
Shortest contig: 2,038
Longest contig: 1,872,034

N50: 346,207
Median: 11,191.0
Mean: 53,499.43864177919
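For reference, the second ("maximum sensitivity") variant described above corresponds to an invocation roughly like the following — the read file name and thread count here are placeholders, not taken from the report:

```shell
# Sensitive settings: -k 0 -p 17, keep weak edges, rescue low-coverage edges,
# and trim/filter reads at 2500 bp
./wtdbg2 -t 16 -k 0 -p 17 -S 2 --edge-min 2 --rescue-low-cov-edges \
    --tidy-reads 2500 -i reads.fa.gz -fo dbg

# Derive the consensus from the layout
./wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.raw.fa
```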

I was wondering if you had any ideas for how I might be able to improve the program's performance, or if SMARTdenovo might just be better suited for my particular genome?

Thanks,
Zac.

Please tag a release

When you are ready, it would be good to tag a stable release (i.e. create a release on the releases page) – this is often a request to my projects as well. You may name it "v2.0" if you feel it is really ready for heavy public use, or "v2.0-rc1" if you are less confident. Up to you. Once you tag a release, I will create a bioconda recipe for wtdbg2. Thanks.

Is it necessary to further run consensus tools on the results of wtdbg or smartdenovo?

Hi Jue,

I'm sorry to bother you once again.

I found an evaluation paper which says (paragraph 11 of "Discussion"):

...Wtdbg assemblies, which always ranked last, mostly because no consensus procedure was executed, would need additional rounds of consensus polishing to effectively compete with other assemblers.

So I'm wondering if it is necessary to further run consensus tools, such as Racon, after running wtdbg1.1.006, wtdbg1.2.8 and smartdenovo now? I know all three tools have consensus modules, and all have been updated since this paper was published.

I'm working on a de novo genome assembly project and there are very limited genomic resources to evaluate correctness. Besides PacBio data, I also have several short-read libraries, so I want to perform scaffolding based on the wtdbg results. I don't know how errors in the contigs would affect the scaffolding.

Any suggestions or thoughts would be appreciated. Thank you!

Bests,
Yiwei Niu

question about polishing in wtpoa-cns

Hello, I have the following question. Can I use wtpoa-cns also to polish my contigs using paired end Illumina reads?

More specifically, can I use this command

minimap2 -ax sr prefix.ctg.lay.fa read1.fq read2.fq

instead of this

minimap2 -t 16 -x map-pb -a prefix.ctg.lay.fa reads.fa.gz

in the polishing step?

Thank you very much for your help.
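For comparison, the README's short-read polishing recipe uses bwa mem together with the sam-sr preset of wtpoa-cns rather than minimap2 -ax sr (file names as in the README):

```shell
# Index the draft assembly and map the paired-end short reads to it
bwa index dbg.cns.fa

# -x sam-sr tells wtpoa-cns the input alignments are short reads;
# -d names the draft assembly being polished
bwa mem -t 16 dbg.cns.fa sr.1.fa sr.2.fa | samtools sort -O SAM | \
    ./wtpoa-cns -t 16 -x sam-sr -d dbg.cns.fa -i - -fo dbg.srp.fa
```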

For the "ont" preset, change -L to 5000

For the several ONT datasets at hand, the N50 is not much longer than 10kb, so using "-L10000" would discard too many reads. BTW, I have changed the "sq" preset to "-p0 -k15". You changed the text, but didn't change the actual setting. See 9ab7df6.

Installation problem

^Cmake: *** [kbm] Interrupt

I have the installation problem shown above. I hope you can help me! Thank you.

Error-free sequences file

I see wtdbg-1.2.8 has the parameter -I <string> (Error-free sequences file, +). Can paired-end Illumina reads in fastq format be passed as an argument (in addition to the nanopore sequences passed to -i)?

new parameters for wtdbg2

Hello,
I noticed there are some new parameters in the latest version of wtdbg2:

(A) nanopore/ont: -p 19 -AS 2 -s 0.05 -L 10000
sequel/sq: -p 0 -k 15 -AS 2 -s 0.05 -L 10000

The parameter "-A" is set for Sequel and ONT reads. As mentioned before, the alignment of contained reads has little effect on the assembly results, so why do we have to keep all these alignments?

(B) -X Choose the best depth for layout (effective with -g) [50]

Does this parameter (-X 50) mean that we choose the longest reads up to 50X depth for the alignment and then perform the assembly, or do we use all the reads for the alignment and then choose the best 50X of alignments for the assembly?
How is "best" defined?

Thanks!

much worse assembly N50 in new version of wtdbg2

Hi,

I have around 30X coverage of a 1.5Gb insect genome. When trying to reassemble with the latest version, I get a much worse N50 and a bigger assembly. I was wondering if anyone can comment on the parameters I should tweak, please? Thanks.

Version: 1.1.006
Assembly Size 1.671Gb
N50 3Mb
Largest contig 16.1Mb

Version 2.2
Assembly Size 2.237Gb
N50 132kb
Largest contig 1.8Mb

parameters: -p19 -AS2 -e2 in both cases.

install error

Hi,
I got a 'warning: assignment makes pointer from integer without a cast' when compiling the software. How can I solve it?

No contig output for local assembly

Hello,

I am trying to assemble the ONT reads overlapping a specific region of the human genome, and unfortunately, in this region I end up with only 4 reads (3 of which are shorter than 5 kbp). After running wtdbg2 and wtpoa-cns, I get no contig.

The end of wtdbg2 log indicates:

searched 1 contigs
Estimated:
output 0 contigs

and wtpoa-cns log indicates:

0 contigs 1 edges

I am using the following parameters for the assembly:

-p 0 -k 15 -AS 1 --edge-min 1 --rescue-low-cov-edges

Do you know what is happening here (too low coverage?) and if I can do something to solve the issue (some parameters I didn't think about)? I also tried to decrease -k and -l without success.
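Put together, the run described above would look roughly like this — the read file and output prefix names are placeholders, and -g is omitted since the target is a short region rather than a whole genome:

```shell
# Low-coverage local assembly: short k-mers (-p 0 -k 15), keep alignments of
# contained reads (-AS 1), retain weak edges and rescue low-coverage ones
./wtdbg2 -p 0 -k 15 -AS 1 --edge-min 1 --rescue-low-cov-edges \
    -i region_reads.fq -fo region

# Derive the consensus from the layout
./wtpoa-cns -i region.ctg.lay.gz -fo region.ctg.fa
```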

Thank you for your help.

Guillaume
