
marbl / winnowmap


Long read / genome alignment software

License: Other

Makefile 1.12% R 0.03% C 93.27% C++ 4.62% Shell 0.42% Perl 0.44% Python 0.10%
pacbio nanopore genome-analysis

winnowmap's Introduction

Winnowmap

Winnowmap is a long-read mapping algorithm optimized for mapping ONT and PacBio reads to repetitive reference sequences. Winnowmap development began on top of the minimap2 codebase, and since then we have incorporated the following two ideas to improve mapping accuracy within repeats.

  • Winnowmap implements a novel weighted minimizer sampling algorithm (>=v1.0). This optimization was motivated by the need to avoid masking frequently occurring k-mers during the seeding stage in an efficient manner, and to achieve better mapping accuracy in complex repeats (e.g., long tandem repeats) of the human genome. Using weighted minimizers, Winnowmap down-weights frequently occurring k-mers, thus reducing their chance of being selected as minimizers. Users can refer to this paper for more details. This idea helps preserve the theoretical guarantee of the minimizer sampling technique, i.e., if two sequences share a substring of a specified length, then they are guaranteed to have a matching minimizer.

  • We noticed that the highest scoring alignment doesn't necessarily correspond to the correct placement of reads in repetitive regions of T2T human chromosomes. In the presence of a non-reference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy because the standard pairwise sequence alignment scoring system penalizes true variants. This is also sometimes referred to as allelic bias. To address this bias, we introduced and implemented the idea of using minimal confidently alignable substrings (>=v2.0). These are minimal-length substrings in a read that align end-to-end to a reference with a mapping quality score above a user-specified threshold. This approach treats each read mapping as a collection of confident sub-alignments, which is more tolerant of structural variation and more sensitive to paralog-specific variants (PSVs). Our most recent paper describes this concept and benchmarking results.
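The weighted minimizer idea from the first bullet can be illustrated with a toy sketch. This is not Winnowmap's implementation: it assumes plain lexicographic order in place of a hash-based order, and it reduces down-weighting to "k-mers on the repetitive list always lose against k-mers that are not on it". The function name `weighted_minimizers` and its arguments are hypothetical.

```shell
# Toy illustration only (assumption: lexicographic order stands in for the
# hash-based ordering a real mapper would use; listed k-mers are simply
# ranked after all unlisted ones).
# Usage: weighted_minimizers SEQ K W "KMER1 KMER2 ..."
# Prints "0-based-position k-mer" for every selected minimizer.
weighted_minimizers() {
  awk -v seq="$1" -v k="$2" -v w="$3" -v rep="$4" 'BEGIN {
    n = split(rep, r, " ")
    for (i = 1; i <= n; i++) heavy[r[i]] = 1
    nk = length(seq) - k + 1
    for (i = 1; i <= nk; i++) {
      km = substr(seq, i, k)
      # Prefix "1" sorts after "0", so a repetitive k-mer is chosen only
      # when every k-mer in the window is repetitive.
      key[i] = ((km in heavy) ? "1" : "0") km
    }
    last = 0
    for (s = 1; s + w - 1 <= nk; s++) {
      best = s
      for (j = s + 1; j < s + w; j++)
        if (key[j] < key[best]) best = j
      if (best != last) { print best - 1, substr(seq, best, k); last = best }
    }
  }'
}

# Same sequence with and without down-weighting the k-mer AA
weighted_minimizers AAAAACGT 2 3 "AA"
weighted_minimizers AAAAACGT 2 3 ""
```

In this toy, every window still yields a minimizer (unlike masking, which can leave windows without seeds), while the repetitive k-mer is picked less often.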

Compile

Clone source code from master branch or download the latest release.

  git clone https://github.com/marbl/Winnowmap.git

Winnowmap compilation requires a C++ compiler with C++11 and OpenMP support, which are available by default in GCC >= 4.8.

  cd Winnowmap
  make -j8

Expect winnowmap and meryl executables in bin folder.
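Since several build reports describe only one of the two executables being produced, a quick post-build check can help. The sketch below is a hypothetical helper (`check_build` is not part of Winnowmap); it only tests that both files exist and are executable in the given folder.

```shell
# check_build BIN_DIR: report whether both expected executables are present.
# Returns 0 if both are found, 1 otherwise. (Hypothetical helper.)
check_build() {
  local dir="${1:-bin}" missing=0 exe
  for exe in winnowmap meryl; do
    if [ -x "$dir/$exe" ]; then
      echo "ok: $dir/$exe"
    else
      echo "missing: $dir/$exe"
      missing=1
    fi
  done
  return "$missing"
}
```

After `make`, running `check_build bin` from the Winnowmap folder would flag a missing meryl binary right away.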

Usage

For either mapping long reads or computing whole-genome alignments, Winnowmap requires pre-computing high frequency k-mers (e.g., top 0.02% most frequent) in a reference. Winnowmap uses the meryl k-mer counting tool for this purpose.

  • Mapping ONT or PacBio-hifi WGS reads
  meryl count k=15 output merylDB ref.fa
  meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

  winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont.fq.gz > output.sam  [OR]
  winnowmap -W repetitive_k15.txt -ax map-pb ref.fa hifi.fq.gz > output.sam
  • Mapping genome assemblies
  meryl count k=19 output merylDB asm1.fa
  meryl print greater-than distinct=0.9998 merylDB > repetitive_k19.txt

  winnowmap -W repetitive_k19.txt -ax asm20 asm1.fa asm2.fa > output.sam

For the genome-to-genome use case, it may be useful to visualize the dot plot. This perl script can be used to generate a dot plot from paf-formatted output. In both usage cases, pre-computing repetitive k-mers using meryl is quite fast, e.g., it typically takes 2-3 minutes for the human genome reference.
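The two recipes above differ only in the k-mer size and the preset, so they can be summarized by a small dry-run helper. This is a sketch: `print_winnowmap_plan` is a hypothetical name, it only prints the commands from the examples above without running them, and it deliberately refuses presets for which no example is given. The `distinct=0.9998` threshold corresponds to the "top 0.02%" mentioned earlier (1 - 0.9998 = 0.0002).

```shell
# print_winnowmap_plan PRESET REF QUERY
# Prints (does not run) the command sequence shown in the usage examples.
print_winnowmap_plan() {
  local preset="$1" ref="$2" query="$3" k
  case "$preset" in
    map-ont|map-pb) k=15 ;;   # read-mapping examples use k=15
    asm20)          k=19 ;;   # assembly-mapping example uses k=19
    *) echo "no usage example for preset: $preset" >&2; return 1 ;;
  esac
  echo "meryl count k=$k output merylDB $ref"
  echo "meryl print greater-than distinct=0.9998 merylDB > repetitive_k$k.txt"
  echo "winnowmap -W repetitive_k$k.txt -ax $preset $ref $query > output.sam"
}

print_winnowmap_plan map-ont ref.fa ont.fq.gz
```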

Benchmarking

When comparing Winnowmap (v1.0) to minimap2 (v2.17-r954), we observed a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome, and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. By avoiding masking, we show that Winnowmap maintains uniform minimizer density.


Minimizer sampling density using a human X chromosome as the reference, with the centromere positioned between 58 Mbp and 61 Mbp. ‘Standard’ method refers to the classic minimizer sampling algorithm from Roberts et al., without any masking or modification.

Publications

winnowmap's People

Contributors

1pakch, apregier, chris-rands, cjain7, cjw85, cvdelannoy, hasindu2008, hyeshik, jmarshall, kevinxchan, lh3, marcus1487, markbicknellont, martinghunt, mcshane, mvdbeek, pickettbd, piezoid, rikuu, swvondeylen-ont, tseemann, xdu-diagnoa, zingdle


winnowmap's Issues

Optimising gaps in whole genome assembly alignments

Hi Winnowmap Team,

I am trying to use winnowmap to do whole-genome alignments. In some cases, winnowmap is resulting in large unaligned regions like below:
image

However, when I align these unaligned sequences using clustal, they seem to be quite similar.
image

Do you have any suggestions on how to tweak winnowmap so that these regions can be aligned as well?

My alignment command is:

winnowmap -W TAIR10_k19.txt -x asm5 -t 10 -ac --eqx -r 10000 TAIR10_Filtered.fasta.gz qry.filtered.fa

Thanks
Manish

Compile issue: error: ‘lrint’ is not a member of ‘std’

Hi
Thank you for this tool.
I have some human WGS ONT data and want to analyze SV events.
I tried to compile Winnowmap but got an error:

utility/src/utility/types.H:162:51: error: ‘lrint’ is not a member of ‘std’

and only winnowmap ends up in the bin folder (meryl is not found).

My OS is CentOS 7 and gcc version is 5.4.0

gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

What can I do?

Best,
Gray

how to keep Mm tags in bam

Hi all,

I extracted FASTQ (samtools fastq -T 1) from an unmapped BAM to align to the genome with winnowmap. However, extra tags like qs and Mm were lost after mapping. Is there a way to keep them in the results, or to add them again?

Thank you.
Jia

Winnowmappy

This project looks very promising. Is it possible, or are there plans, to provide a python interface (winnowmappy) to this software?

For some more detail, I am the developer of the Oxford Nanopore Technologies Megalodon software. I would be interested in seeing how this mapping could improve results in repeat regions. Within Megalodon, I currently use the mappy python interface to minimap2. I am wondering if it might be possible to port the mappy interface into this project? I'm not sure how diverged this project is from minimap2 to make this possible, but this would be very helpful to assist future research.

How to map with pre-built indexes ?

Is it the same as with minimap? I ask because this leads to segfaults in my case:
./winnowmap -W {repetitive_file} -w 10 -k 15 -t 50 -x map-pb-clr -H -d {index_file} {reference_file} -> for indexing
./winnowmap -W {repetitive_file} -w 10 -k 15 -t 50 -x map-pb-clr -H -a {index_file} {read_file} > align.sam -> for mapping

Winnowmap takes much longer than Minimap2 - is this expected?

Hi,

I am using winnowmap commit d547331 and it seems like it is taking a very long time (my guess will be about 24 hours) and a lot of RAM (peak 265 GB) for mapping 1000000 simulated PacBio HiFi reads compared to minimap2-2.17 which took about 7 minutes and maybe ~9 GB RAM.

/genetics/elbers/Winnowmap/computeHighFreqKmers 19 1 1024 peregrine.fasta bad_Hk19_mers_peregrine.fasta.txt
minimap2 -t 75 -H -W bad_Hk19_mers_peregrine.fasta.txt -ax asm20 peregrine.fasta random_reads_ccs.fastq.gz > random_reads_ccs.fastq-against-peregrine.fasta.sam
/genetics/elbers/Winnowmap/winnowmap -t 75 -H -W bad_Hk19_mers_peregrine.fasta.txt -ax asm20 peregrine.fasta random_reads_ccs.fastq.gz > random_reads_ccs.fastq-against-peregrine.fasta.sam

The assembly peregrine.fasta is about 450 Mbp, and I was trying to compare the results of one round of Racon polishing with alignments made by minimap2 and winnowmap, but winnowmap seems to be taking quite a long time.

# minimap2 log

[M::mm_idx_gen::15.476*1.84] collected minimizers
[M::mm_idx_gen::17.823*2.88] sorted minimizers
[M::main::17.824*2.88] loaded/built the index for 1100 target sequence(s)
[M::mm_mapopt_update::19.182*2.75] mid_occ = 175
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 1100
[M::mm_idx_stat::20.765*2.62] distinct minimizers: 26605237 (71.77% are singletons); average occurrences: 1.995; average spacing: 8.319
[M::worker_pipeline::51.114*15.06] mapped 41606 sequences
[M::worker_pipeline::76.782*19.10] mapped 41608 sequences
[M::worker_pipeline::98.991*22.49] mapped 41605 sequences
[M::worker_pipeline::118.571*26.10] mapped 41594 sequences
[M::worker_pipeline::132.990*28.63] mapped 41595 sequences
[M::worker_pipeline::149.909*30.83] mapped 41612 sequences
[M::worker_pipeline::165.980*33.80] mapped 41612 sequences
[M::worker_pipeline::178.448*35.00] mapped 41601 sequences
[M::worker_pipeline::194.673*35.87] mapped 41605 sequences
[M::worker_pipeline::212.706*37.92] mapped 41615 sequences
[M::worker_pipeline::223.435*38.45] mapped 41612 sequences
[M::worker_pipeline::242.166*40.52] mapped 41603 sequences
[M::worker_pipeline::251.134*40.52] mapped 41601 sequences
[M::worker_pipeline::270.681*42.06] mapped 41598 sequences
[M::worker_pipeline::279.854*42.06] mapped 41611 sequences
[M::worker_pipeline::298.250*43.10] mapped 41604 sequences
[M::worker_pipeline::309.071*43.28] mapped 41589 sequences
[M::worker_pipeline::327.166*44.73] mapped 41618 sequences
[M::worker_pipeline::337.098*44.50] mapped 41618 sequences
[M::worker_pipeline::354.763*45.61] mapped 41602 sequences
[M::worker_pipeline::364.568*45.55] mapped 41601 sequences
[M::worker_pipeline::384.337*46.38] mapped 41624 sequences
[M::worker_pipeline::393.632*46.24] mapped 41599 sequences
[M::worker_pipeline::410.199*46.36] mapped 41604 sequences
[M::worker_pipeline::410.352*46.34] mapped 1463 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 75 -H -ax asm20 peregrine.fasta random_reads_ccs.fastq.gz
[M::main] Real time: 410.768 sec; CPU: 19018.148 sec; Peak RSS: 8.641 GB

real    7m0.411s
user    308m36.932s
sys     8m22.508s

# winnowmap log so far

[M::mm_idx_gen::0.000*35196.57] reading downweighted kmers
[M::mm_idx_gen::0.011*460.17] collected downweighted kmers, no. of kmers read=2052
[M::mm_idx_gen::0.011*459.31] saved the kmers in a bloom filter: hash functions=13 and size=39344
[M::mm_idx_gen::38.766*1.34] collected minimizers
[M::mm_idx_gen::41.233*1.80] sorted minimizers
[M::main::41.243*1.80] loaded/built the index for 1100 target sequence(s)
[M::mm_mapopt_update::41.243*1.80] mid_occ = 2147483647
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 1100
[M::mm_idx_stat::42.744*1.78] distinct minimizers: 26613163 (71.74% are singletons); average occurrences: 1.956; average spacing: 8.482
[M::worker_pipeline::2989.633*39.61] mapped 41606 sequences
...still going after 1.25 hours (I predict it might take at least 24 hours)

Is the much longer run time expected (obviously, run times would depend on the k-mers, assembly, and reads being used)?

uniq kmer anchoring mapping

In the T2T chrX centromere article, reads were mapped using a unique k-mer anchoring method. Do I still need to correct the locations of centromere reads with the unique k-mer method if I use Winnowmap2?

Is --split-prefix supported?

Hi, I see this option (--split-prefix) remains in the source code but has been removed from the help. --split-prefix is important to retain correct mappings on large indices that get broken into parts: lh3/minimap2#141. Is it supported and functional?

Edit: it appears it works when mapping to a FASTA. However, I also got a segfault when trying to dump the index -d and then map to the dumped index...is this implemented?

Thanks!

failed to parse the FASTA/FASTQ record

Dear @cjain7 ,
When I mapped ONT reads to GRCh38 (without alt sequences) using winnowmap version 2.0.3, I got a warning like "failed to parse the FASTA/FASTQ record next to '783361e5-baa7-44de-9c84-4d1399aa5ceb'". What does this warning mean? Does it affect the final SAM file? My command line is shown below:

winnowmap -W repetitive_k15.txt --cs --MD -L -t 32 -ax map-ont ${ref} ${fq} > out.sam

Thanks in advance
Jiao
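One way to investigate a parse warning like this is to check the input FASTQ for malformed records before mapping. The sketch below is an assumption-laden helper, not part of Winnowmap: `check_fastq` is a hypothetical name, and it assumes an uncompressed FASTQ with exactly four lines per record (multi-line sequences would be reported as errors).

```shell
# check_fastq: read uncompressed 4-line FASTQ on stdin and report records
# with a bad header, a bad separator, or a sequence/quality length mismatch.
# Exit status is 1 if any problem was found. (Hypothetical helper.)
check_fastq() {
  awk '
    function bad(n, why) { print n ": " why; nbad++ }
    NR % 4 == 1 { name = $0; if (substr($0, 1, 1) != "@") bad(name, "header does not start with @") }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 { if (substr($0, 1, 1) != "+") bad(name, "separator does not start with +") }
    NR % 4 == 0 { if (length($0) != seqlen) bad(name, "sequence/quality length mismatch") }
    END { if (NR % 4 != 0) bad(name, "file ends mid-record"); exit (nbad > 0) }
  '
}
```

For a gzipped file, something like `zcat reads.fq.gz | check_fastq` would list the suspicious read names, which can then be compared with the name quoted in the warning.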

winnowmap PacBio CCS alignment

To whom it may concern,

I am writing to inquire whether the authors of the program had compared the results from winnowmap -ax map-pb and -ax asm20 bam file.

Minimap2 recommends using asm20 for mapping CCS reads and when I compared germline mutation results from asm5, asm10 and asm20 bam files, asm20 parameter performed the best for germline mutation calling. I was also wondering if there was reason for recommending the alternative parameter (map-pb) instead of the standard parameter (asm20)

winnowmap recommendation

winnowmap -W repetitive_k15.txt -ax map-pb ref.fa hifi.fq.gz > output.sam

minimap2 recommendation

./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio CCS genomic reads

Many thanks,

Regards,
Sangjin

Build of meryl fails (sys/sysctl.h missing in gcc 7.5)

Hi,

I tried building winnowmap on a linux 5.10 with gcc 7.5.0.
The winnowmap executable is built, but the meryl build fails because 'sys/sysctl.h' is missing:
utility/src/utility/system.C:37:10: fatal error: sys/sysctl.h: No such file or directory
 #include <sys/sysctl.h>
          ^~~~~~~~~~~~~~
compilation terminated.

I could successfully build the master of https://github.com/marbl/meryl giving me a functional meryl binary.
I wonder if this will work, are there any version dependencies between winnowmap and meryl?

Thank you for your help,
Pay

required read length for winnowmap

Hi,

I am using a normal human genome sequenced by Nanopore, to run winnowmap2.
Sequencing depth is 20x, read length N50 is 14 Kb.
I am wondering what the recommended minimum read length N50 is to obtain reasonable depth after mapping with winnowmap2, like the chromosome X figure on the GitHub page.
With my current data, there is no significant difference on depth and mapping rate, between minimap2 and winnowmap2.
I'm not sure if this is because my read length N50 is too short.

Or, should I perform some filtering based on read length ? e.g. only use read > 10 Kb.

Thanks

Finding X scaffolds in a genome assembly

Hi,
I would appreciate if you can help me. I am interested in demographic inference of bats. I assembled a reference genome, and I have downloaded mouse and human X and Y sequences; the human X chromosome is by far larger than any of my scaffolds because I used a reference that lacks centromeres and telomeres. The mus X and Y are not complete, just the X is as large as my largest scaffold. I am a beginner and I would like to know what should I map with meryl and what should be my query.
How should I retrieve the mapped reads - with samtools?

Thank you very much;

Segmentation fault

Hello,

I've been trying to get winnowmap to align some HiFi reads to the axolotl genome (which is rather giant at 29G) but I keep getting some segmentation fault. Any help would be greatly appreciated!

The command is:

winnowmap -ay -t 16 -L genome_index.mmi Revio_FC011.part_005.fq.gz | samtools view -F 4 -bSu -o Revio_FC011.bam

The output:

[...]
@SQ	SN:ptg000910l	LN:23331
@SQ	SN:ptg000911l	LN:21101
@PG	ID:Winnowmap	PN:Winnowmap	VN:2.03	CL:winnowmap -ay -t 16 -L genome_index.mmi Revio_FC011.part_005.fq.gz
[M::main::36.787*0.90] loaded/built the index for 911 target sequence(s)
[M::main::36.787*0.90] running winnowmap in SV-aware mode
[M::main::36.787*0.90] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::36.787*0.90] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 911
[M::mm_idx_stat::38.317*0.90] distinct minimizers: 54141373 (21.73% are singletons); average occurrences: 21.624; average spacing: 25.188
Segmentation fault

This is the index generation command and output:

meryl count k=15 output merylDB Amex8.0.asm.bp.p_ctg.gfa.fa
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

winnowmap -W repetitive_k15.txt -x map-pb -t 3 -I 100G -d genome_index.mmi Amex8.0.asm.bp.p_ctg.gfa.fa
[M::mm_idx_gen::0.028*0.56] reading downweighted kmers
[M::mm_idx_gen::0.056*0.72] collected downweighted kmers, no. of kmers read=105766
[M::mm_idx_gen::0.056*0.72] saved the kmers in a bloom filter: hash functions=2 and size=1520672 
[M::mm_idx_gen::1127.643*1.20] collected minimizers
[M::mm_idx_gen::1169.157*1.26] sorted minimizers
[M::main::1187.256*1.25] loaded/built the index for 911 target sequence(s)
[M::main::1187.256*1.25] running winnowmap in SV-aware mode
[M::main::1187.256*1.25] stage1-specific parameters minP:1000, incP:1.99, maxP:8000, sample:1000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::1187.256*1.25] stage2-specific parameters s2_bw:1000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 911
[M::mm_idx_stat::1188.086*1.25] distinct minimizers: 54141373 (21.73% are singletons); average occurrences: 21.624; average spacing: 25.188
[M::main] Version: 2.03, pthreads=3, omp_threads=3
[M::main] CMD: winnowmap -W repetitive_k15.txt -x map-pb -t 3 -I 100G -d genome_index.mmi Amex8.0.asm.bp.p_ctg.gfa.fa
[M::main] Real time: 1189.460 sec; CPU: 1490.543 sec; Peak RSS: 35.782 GB

Not compiling on MacOS ARM

Hello Chirag,

I noticed there are sse2neon folder there in the src but I still have the error on macOS ARM:

c++ -c -msse2 -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable -fno-tree-vectorize ksw2_ll_sse.c -o ksw2_ll_sse.o
c++ -c -msse4.1 -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable -fno-tree-vectorize -DKSW_CPU_DISPATCH ksw2_extz2_sse.c -o ksw2_extz2_sse41.o
c++ -c -msse4.1 -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable -fno-tree-vectorize -DKSW_CPU_DISPATCH ksw2_extd2_sse.c -o ksw2_extd2_sse41.o
c++ -c -msse4.1 -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable -fno-tree-vectorize -DKSW_CPU_DISPATCH ksw2_exts2_sse.c -o ksw2_exts2_sse41.o
c++: error: unrecognized command-line option '-msse2'
make[1]: *** [ksw2_ll_sse.o] Error 1
make[1]: *** Waiting for unfinished jobs....
c++: error: unrecognized command-line option '-msse4.1'
c++: error: unrecognized command-line option '-msse4.1'
make[1]: *** [ksw2_extd2_sse41.o] Error 1
make[1]: *** [ksw2_exts2_sse41.o] Error 1
c++: error: unrecognized command-line option '-msse4.1'
make[1]: *** [ksw2_extz2_sse41.o] Error 1
make: *** [winnowmap] Error 2

Any idea why?

Thanks,

Jianshu

How to change the mapQ and τ?

Dear @cjain7,
I would like to ask for your help. I want to change the mapQ and τ values to improve alignment accuracy, but I cannot find how to change these parameters. Could you tell me how to modify them? Thank you very much!
Best wishes,
ZhennanWang

Hard coded OMP threads?

Hi,
I need to run winnowmap for several samples in parallel. For this I limited the number of threads to 10 via the -t parameter to avoid interference between the jobs. However, each job then used up to 30 threads. So I went through the log files and realized that -t obviously only affects pthreads, but omp_threads are additionally set to 3:

[M::main] Version: 2.03, pthreads=10, omp_threads=3

I tried to limit this by explicitly exporting OMP_NUM_THREADS=1, but it did not have an effect. So I searched this repository for omp and found a hard-coded part in /src/minimap.h:

image

Is there any way to get around this hard coding? If not, is it possible to mention in the README that winnowmap uses three times more threads than expected?

Thanks,
Johannes
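The thread arithmetic reported in this issue can be read straight off the log line. The sketch below assumes, as the report suggests, that the worst case is every pthread spawning `omp_threads` OpenMP workers; `total_threads` is a hypothetical helper name.

```shell
# total_threads: read a "[M::main] Version: ..." log line on stdin and print
# pthreads * omp_threads, i.e. the worst-case thread count under the
# assumption stated above. (Hypothetical helper.)
total_threads() {
  sed -n 's/.*pthreads=\([0-9]*\), omp_threads=\([0-9]*\).*/\1 \2/p' |
    awk '{ print $1 * $2 }'
}

echo '[M::main] Version: 2.03, pthreads=10, omp_threads=3' | total_threads   # 30
```

When scheduling several jobs side by side, budgeting cores from this product rather than from `-t` alone matches the behaviour described above.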

Using Meryl on MacOS

Hi there,

I am a newbie to bioinformatics. I am looking for software that would allow me to map RNA reads to the repeat region of a viral genome better than minimap2 does.

However, I am not sure whether I am able to run winnowmap on MacOS. I tried to use meryl but it shows the following error: exec format error: meryl

Do I need an alternative method in order to use winnowmap?

Thank you very much for your help!

Mapping setting advice

Hi,

I am trying to use Winnowmap to map long reads with an error rate around 1.5-3%. I tried both the map-ont and map-pb presets and the difference between the two can be quite big. Any recommendation on a preset or some specific settings?

Thank you.
Guillaume

Mapping using PacBio CCS BAM file

Hi, Is there any way to do this? This would be simpler than converting PacBio BAM files to fastq, and then somehow integrating the info from the original PacBio BAM file with the mapped BAM file.

Any other advice on best practices for aligning PacBio CCS data?
Thanks.

Multiple references

Hi Team,

Thank you for this great tool. I was just wondering whether Winnowmap will allow multiple references while mapping to the samples.

Detect deletion between gene and pseudogenes

We are trying to detect a deletion using Nanopore long read data. We think that the size of the deletion is something around 10-15 kb in size, although we are not sure. We have used both minimap and winnowmap for the alignment and sniffles for the variant calling. We have not been able to detect the deletion.

We believe that this difficulty might be attributable to the fact that the deletion is near a gene and its highly homologous pseudogene. We have only used standard parameters thus far. Are there parameters that would make sense to alter for this case? Is there something completely different in terms of our alignment strategy that we should be doing here?

Thanks for your help,

Mapping simulated T2T reads

TL;DR: on simulated reads, winnowmap seems to produce more mapping errors than minimap2.

I simulated reads with pbsim2:

src/pbsim --depth 1 --hmm_model data/R94.model --length-mean 20000 --accuracy-mean 0.95 --length-min 5000 CHM13v1.fa
paftools.js pbsim2fq CHM13v1.fa.fai *.maf | pigz -p8 > pbsim-CHM13-R94.fa.gz

This supposedly simulates reads similar to nanopore of the current generation. The second command line converts the pbsim2 output to the FASTA format needed by paftools.js mapeval. I then mapped these reads to CHM13v1 with the following mappers/settings:

winnowmap -W CHM13-rep-k15.txt -t4 -cxmap-ont CHM13v1.fa pbsim-CHM13-R94.fa.gz > pbsim-wm-ont.paf
winnowmap -W CHM13-rep-k15.txt -t4 -cxmap-pb  CHM13v1.fa pbsim-CHM13-R94.fa.gz > pbsim-wm-pb.paf
minimap2 -t4 -cxmap-ont CHM13v1.fa pbsim-CHM13-R94.fa.gz > pbsim-mm-ont.paf

Here are the mapeval output for pbsim-wm-ont.paf:

Q       60      149867  32      0.000213523     149867
...
Q       5       49      4       0.000536197     151064
...
Q       1       838     55      0.000947051     152051
Q       0       203     98      0.001589449     152254

for pbsim-wm-pb.paf:

Q       60      149929  89      0.000593614     149929
...
Q       5       32      4       0.001474439     151244
...
Q       1       735     93      0.002215968     152078
Q       0       176     83      0.002758548     152254

and for pbsim-mm-ont.paf:

Q       60      149294  0       0.000000000     149294
Q       4       472     1       0.000013274     150675
Q       2       266     1       0.000019875     150941
Q       1       1165    36      0.000256400     152106
Q       0       216     107     0.000958496     152322

Notably, minimap2 maps slightly more reads and has higher accuracy at the Q60, Q5, Q1 and Q0 mapping quality thresholds.

I guess the large minimizer window size used by winnowmap is affecting its accuracy.
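To compare the three mapeval outputs side by side, the Q lines can be condensed into read and error counts. This sketch assumes the usual paftools.js mapeval layout, where column 5 is the cumulative error rate and column 6 is the cumulative number of mapped reads, so their product gives the cumulative number of wrong mappings. `mapeval_summary` is a hypothetical helper name.

```shell
# mapeval_summary: read paftools.js mapeval output on stdin and print, per
# mapQ threshold, the cumulative mapped reads and cumulative wrong mappings.
# Assumes column 5 = cumulative error rate, column 6 = cumulative reads.
mapeval_summary() {
  awk '$1 == "Q" {
    printf "mapq>=%d reads=%d wrong=%d\n", $2, $6, int($5 * $6 + 0.5)
  }'
}

# Q0 line from pbsim-mm-ont.paf above
echo 'Q       0       216     107     0.000958496     152322' | mapeval_summary
```

Lines that are not Q records (including the elided `...` rows) are ignored.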

PacBio CLR reads

Hi there,

Can Winnowmap be used with PacBio CLR reads rather than HiFi reads?

Best,
Ollie

List of files / multiple files as input

Dear maintainers,

is it possible to add a possibility to specify a list of input files instead of a single file? I work with the axolotl genome and have quite a few long reads. Therefore, I have two possibilities

    1. either I zcat the input files into a single huge fastq file, which is a bit wasteful given the amount of data, OR
    2. I zcat the input files and pipe the data to winnowmap.

However, since the genome is so huge, minimap2 has to split the index. Therefore, if I pipe the data, winnowmap ends up mapping the reads only to the first 5 scaffolds, which are included in the first index chunk. Other scaffolds are processed as well afterwards, but there are no more data in the pipe.
It would be nice to be able to specify multiple input files, which all can be read multiple times if necessary.

I also tried creating the index first by setting -d scaffolds.mmi, and then running winnowmap, but in this case I get a segmentation fault.

thanks!

No -I option

When running winnowmap, the -I option is not recognized.
e.g. after generating the repetitive_k15.txt with meryl:

winnowmap -W repetitive_k15.txt -a -x map-pb -Y -L --eqx --cs -I 32G ref.fa.gz reads.fastq.gz | samtools view -hb | samtools sort -@8 > alignment_sorted.bam

Yields the following error:

[ERROR] unknown option in "-I"

The -I option is needed for a multi-part index.
Thanks.

Much more unmapped reads compared to minimap2

Hi,
I mapped a FASTQ file with both minimap2 and Winnowmap to remove host reads, but Winnowmap leaves nearly 5 times more reads than minimap2. Here are the commands:

minimap2 -ax map-ont -t 32 ref.fa seq.fq | samtools fastq -n -f 4 - > host_clean_minimap.fastq # 22,998 reads left
winnowmap -W repetitive_k15.txt -ax map-ont ref.fa seq.fq | samtools fastq -n -f 4 - > host_clean_winnow.fast # 103,014 reads left

Could winnowmap do better if changing some parameters?

Thanks.

Segmentation fault when mapping from index

Hello,

I am trying to map simulated reads to an indexed reference.
To do this I take the following steps:

# Kmer counting
meryl count k=15 output merylDB reference.fasta
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

# Index creation
winnowmap -W repetitive_k15.txt -k 15 -d reference.index reference.fasta

# mapping
winnowmap -W repetitive_k15.txt -c reference.index reads.fasta > mapping.paf

The mapping starts and then is killed by a segmentation fault, here is the log:

[M::main::0.119*0.75] loaded/built the index for 1 target sequence(s)
[M::main::0.119*0.75] running winnowmap in SV-aware mode
[M::main::0.119*0.75] stage1-specific parameters minP:1000, incP:4.00, maxP:16000, sample:1000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::0.119*0.75] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.138*0.78] distinct minimizers: 1212310 (87.49% are singletons); average occurrences: 1.474; average spacing: 25.230
zsh: segmentation fault  winnowmap -W repetitive_k15.txt -c reference.index reads.fasta > mapping.paf

I have tried several configurations for the mapping stage:

  • with or without the -W flag
  • on 1 thread or with -t 16

And I have tried on 2 different references, simulating 1000 reads with nanosim each time.

  • The T2T whole human genome assembly: CHM13 T2T v1.1
  • Only chromosome 21 of this assembly.

In all the cases I get a segmentation fault when mapping to the index.

I tried the same thing with minimap2 using the following commands and everything works as expected:

minimap2 -k 15 -d reference.minimap.index reference.fasta
minimap2 -c reference.minimap.index reads.fasta > mapping.minimap.paf

Software versions:

  • winnowmap: 2.02
  • meryl: meryl snapshot (v1.0)

Am I doing something wrong or is this a software issue ?
Thanks in advance

kmer size for meryl

I noticed that the suggested kmer size for meryl (k=15) in the README matches the map-ont preset (at least according to the minimap2 man page); however, it does not match the map-pb preset (k=19). Is this intended? I just expected that the kmer sizes would need to be matched to the preset.

Thanks!
Mitchell

no SQ lines present in the header

To whom it may concern

In the process of using winnowmap, I encountered the following problems. Can you give me some help? thanks!

winnowmap -W repetitive_k9.txt -x map-pb -a -Y -L --eqx --MD --cs -t 60 ../hifi.fasta genome.fasta |samtools view -@ 20 -hb > read_alignment.bam

[M::mm_idx_gen::0.000*72.64] reading downweighted kmers
[M::mm_idx_gen::0.001*5.14] collected downweighted kmers, no. of kmers read=2270
[M::mm_idx_gen::0.001*5.12] saved the kmers in a bloom filter: hash functions=2 and size=32640
[M::mm_idx_gen::188.709*1.04] collected minimizers
[M::mm_idx_gen::202.242*1.07] sorted minimizers
**[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.**
[M::main::202.242*1.07] loaded/built the index for 234214 target sequence(s)
[M::main::202.242*1.07] running winnowmap in SV-aware mode
[M::main::202.242*1.07] stage1-specific parameters minP:1000, incP:1.99, maxP:8000, sample:1000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::202.242*1.07] stage2-specific parameters s2_bw:1000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 234214
[M::mm_idx_stat::202.337*1.07] distinct minimizers: 7146105 (31.16% are singletons); average occurrences: 22.090; average spacing: 25.344
[E::sam_parse1] **no SQ lines present in the header**
samtools view: error reading file "-"
[E::sam_parse1] no SQ lines present in the header
samtools view: error closing "-": -5

However, when I run with --split-prefix:
winnowmap -W repetitive_k9.txt -x map-pb --split-prefix -a -Y -L --eqx --MD --cs -t 60 ../hifi.fasta genome.fasta |samtools view -@ 20 -hb > read_alignment.bam
[ERROR] --cs or --MD doesn't work with --split-prefix
Then I ran it without --cs or --MD:
winnowmap -W repetitive_k9.txt -x map-pb --split-prefix -a -Y -L --eqx -t 60 ../hifi.fasta genome.fasta |samtools view -@ 20 -hb > read_alignment.bam
[main_samview] fail to read the header from "-".

Try to understand the super long soft/hard clipping in the raw reads mapping

Hi! Thanks for this great tool.
I mapped the Nanopore raw reads to CHM13 and found that many primary alignments have very long soft/hard-clipped stretches in the output. To my understanding, clipped bases are those that can't be mapped to the reference. It would be acceptable if the clipping were within 100 bases, but it is perplexing to see so much difference between the reads and the reference.
grep 'tp:A:P' chm13raw.sam | awk '{print $6}' | grep --color '[HS]'

aligning HiFi reads to single pass long reads

Hello,

I would like to align PacBio HiFi reads to ONT reads with the goal of making a consensus of the alignment against the long read backbone (to correct the errors).
First, I wonder whether Winnowmap is suitable for that, or if I should use minimap2 instead.

Then, in this case the reference is of lower quality (median Q score 11) and the query has much higher accuracy (Q 31): will this affect the choice of the alignment parameters? Can I use -x map-ont or asm20 as presets?

Lastly, I have more than 200 Gb of raw ONT reads that I would like to error correct with ~60 Gb HiFi data: I thought of splitting the "reference" to have small jobs and shorter time to compute the index. Is this a good approach or will splitting affect the representativeness of the minimizers?

Thanks,
Dario

Seeing multiple primary alignments per read

I've just used winnowmap v2.03 (installed using conda from the bioconda channel) to align reads to a haplotype-resolved human assembly.

I used the winnowmap --sv-off option.

As perhaps expected, each read has alignments on two contigs, the two contigs being maternal/paternal homologous pairs.

What surprised me is that both of these alignments are flagged as primary alignments in the resulting bam file. Filtering the bam file with samtools view -F 2308 does not reduce the number of alignments per read to one.

If the first half of the read was aligning to one haplotype and the second half of the read to another, I'd expect one to be primary and one supplementary.

However I'm seeing that the full length of each read is aligned in two places, so I'd expect one alignment per read to be secondary.

I can post-process the bam to update the flags such that the alignment with the higher mapping score is primary and lower is secondary, but as this might catch lots of users out, it would be great if winnowmap did this.
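
For reference, the post-processing I have in mind looks roughly like this. It is a minimal sketch, not anything Winnowmap provides: it works on SAM text (e.g. from `samtools view -h`), assumes single-end long reads so that a read name identifies one read, loads everything into memory, and uses MAPQ as the "mapping score" (the `AS` tag could be substituted). The function name is hypothetical.

```python
# SAM flag bits; 2308 (the -F mask above) is the sum of these three.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800
NOT_PRIMARY = UNMAPPED | SECONDARY | SUPPLEMENTARY

def demote_extra_primaries(sam_lines):
    """Keep one primary alignment per read and flag the others as secondary.

    The record with the highest MAPQ stays primary; ties keep the first seen.
    """
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            header.append(line)
        else:
            records.append(line.split("\t"))

    best = {}  # read name -> index of the record that stays primary
    for idx, rec in enumerate(records):
        if int(rec[1]) & NOT_PRIMARY:
            continue
        name, mapq = rec[0], int(rec[4])
        if name not in best or mapq > int(records[best[name]][4]):
            best[name] = idx

    for idx, rec in enumerate(records):
        flag = int(rec[1])
        if not flag & NOT_PRIMARY and best[rec[0]] != idx:
            rec[1] = str(flag | SECONDARY)  # demote the extra primary

    return header + ["\t".join(rec) for rec in records]
```

After this pass, `samtools view -F 2308` would leave exactly one alignment per mapped read.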

paftools showing no accurate mappings after Winnowmap?

Hello,

I'm running pbsim with the following command to replicate Winnowmap results for D1 in the paper:

pbsim --data-type CLR --model_qc model_qc_clr --length-mean 15000 --accuracy-mean 0.9 chrX.fasta

Then running winnowmap with:

./winnowmap -W bad_Hk19_mers.txt -cx map-pb ~/chrX.fasta ~/simPB.fq > wOutput.paf

Then, running paftools mapeval wOutput.paf returns

Q 60 200926 200926 1.000000000 200926
Q 59 56 56 1.000000000 200982

for all quality thresholds.
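
For anyone reading these numbers, here is a small sketch of how I am interpreting them. It assumes the column layout described in minimap2's paftools documentation as I understand it (mapQ threshold, reads in that bin, wrong mappings among them, cumulative error rate, cumulative mapped reads); the class and function names are made up for illustration.

```python
from typing import NamedTuple

class QLine(NamedTuple):
    mapq: int              # mapping quality threshold
    n_reads: int           # reads reported in this bin
    n_wrong: int           # of those, mappings judged wrong
    cum_error_rate: float  # cumulative wrong / cumulative mapped
    cum_reads: int         # cumulative mapped reads

def parse_mapeval(text):
    """Parse the 'Q' lines of mapeval output; other lines are ignored."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "Q":
            rows.append(QLine(int(parts[1]), int(parts[2]), int(parts[3]),
                              float(parts[4]), int(parts[5])))
    return rows

def check_cumulative(rows):
    """Recompute the cumulative columns to confirm the assumed layout."""
    reads = wrong = 0
    for row in rows:
        reads += row.n_reads
        wrong += row.n_wrong
        rate = wrong / reads if reads else 0.0
        if reads != row.cum_reads or abs(rate - row.cum_error_rate) > 1e-6:
            return False
    return True

output = """Q 60 200926 200926 1.000000000 200926
Q 59 56 56 1.000000000 200982"""
rows = parse_mapeval(output)
print(check_cumulative(rows), sum(r.n_wrong for r in rows))
```

Under this reading the two lines are internally consistent and every one of the 200982 mappings is being counted as wrong, so it may be worth checking how the evaluation derives the true positions from the read names.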

Is this expected? Can you provide more comprehensive instructions to replicate the results in the paper?

Thanks!

not compiling on Mac OS

I have the latest GCC installed on my Mac. Any thoughts on a resolution?

ari:Winnowmap maurice$ make -j8
/Library/Developer/CommandLineTools/usr/bin/make -e -C src
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable main.c -o main.o
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable kthread.c -o kthread.o
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable kalloc.c -o kalloc.o
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable misc.c -o misc.o
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable bseq.c -o bseq.o
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable sketch.c -o sketch.o
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable sdust.c -o sdust.o
c++ -c -g -Wall -O2 -DHAVE_KALLOC -fopenmp -std=c++11 -Wno-sign-compare -Wno-write-strings -Wno-unused-but-set-variable options.c -o options.o
clangclangclang: : : clang: clang: clang: clang: clang: warningwarning: warning: : warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]

treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
warning: warningtreating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]:
treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
clangclang: errorclangclang: : : errorunsupported option '-fopenmp'error: :
unsupported option '-fopenmp'unsupported option '-fopenmp'

: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
make[1]: *** [kthread.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: *** [sketch.o] Error 1
make[1]: *** [kalloc.o] Error 1
make[1]: *** [sdust.o] Error 1
make[1]: *** [misc.o] Error 1
make[1]: *** [options.o] Error 1
make[1]: *** [bseq.o] Error 1
make[1]: *** [main.o] Error 1
make: *** [winnowmap] Error 2

Winnowmap for read overlapping

Is it possible to use Winnowmap to find overlaps between long reads, equivalent to "ava-ont" and "ava-pb" options of Minimap2? If not, do you plan to add this feature in the future?

--secondary=no

I see that your recommended commands don't include the "--secondary=no" parameter. Would Winnowmap handle repeat regions better if I added it?
