BLEND is a mechanism that can efficiently find fuzzy seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applications: 1) finding overlapping reads and 2) read mapping. Described by Firtina et al. (published in NARGAB https://doi.org/10.1093/nargab/lqad004)

License: Other


blend's Introduction

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

BLEND is a mechanism that can generate the same hash value for highly similar seeds to find fuzzy (approximate) seed matches between sequences with a single lookup from their hash values. By replacing the hash functions with BLEND, any seeding technique can integrate BLEND to enable the fuzzy seed matching mechanism.

By efficiently finding fuzzy seed matches with a single lookup, BLEND can significantly improve the performance and accuracy while reducing the memory footprint of two important applications: 1) read overlapping and 2) read mapping. Apart from these two applications, we envision that any application that uses seeds can exploit BLEND. The latest version of BLEND is described in bioRxiv.

We strongly recommend using BLEND for overlapping and mapping long and highly accurate reads (e.g., PacBio HiFi). We demonstrate in our manuscript that BLEND can run significantly faster, generate more accurate results, and use less memory space than minimap2 when using these long and accurate reads.

As a proof of concept, we integrate the BLEND mechanism into minimap2. We show the benefits of BLEND when used with the minimizer and strobemer seeding techniques. We make the following changes in the original minimap2 implementation:

  • We modify the original minimap2 implementation so that minimap2 can assign the same hash values for highly similar seeds it finds. To this end, we change the sketch.c implementation of minimap2 to 1) generate the hash values of k-mers and 2) decide the minimizer k-mer based on the hash values BLEND generates.
  • We implement a simple version of the strobemer seeds in minimap2 in three steps. First, we find minimizer k-mers using the original hash function that minimap2 uses. Second, we link every n consecutive minimizer k-mers into a strobemer seed. Third, we use the BLEND mechanism to generate the hash value of the strobemer seed based on the hash values of the linked k-mers.
  • We enable the minimap2 implementation to use seeds longer than 256 characters so that it can store longer seeds when using BLEND. The current implementation of minimap2 allocates 8 bits to store seed lengths up to 256 characters. We change this requirement in various places of the implementation (e.g., line 112 in sketch.c and line 239 in index.c) so that BLEND can use 14 bits to store seed lengths up to 16384 characters. We do this because BLEND merges many k-mers into a single seed, which can be much longer than 256 characters.
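The effect of widening the length field from 8 to 14 bits can be sketched with a little bit arithmetic. The snippet below is illustrative only: the names pack_seed and unpack_seed, and the layout (hash value in the high bits, seed length in the low bits of one 64-bit word), are assumptions made for this example and are not copied from sketch.c or index.c.

```python
# Illustrative sketch: pack a seed hash and a seed length into one 64-bit word.
# pack_seed/unpack_seed and the exact bit layout are assumptions for this example.

WORD_BITS = 64


def pack_seed(hash_value: int, length: int, length_bits: int) -> int:
    """Store `length` in the low `length_bits` bits and the hash above it."""
    if not 0 <= length < (1 << length_bits):
        raise ValueError(f"length {length} does not fit in {length_bits} bits")
    hash_bits = WORD_BITS - length_bits
    if not 0 <= hash_value < (1 << hash_bits):
        raise ValueError(f"hash value does not fit in {hash_bits} bits")
    return (hash_value << length_bits) | length


def unpack_seed(word: int, length_bits: int) -> tuple[int, int]:
    """Return (hash_value, length) from a packed word."""
    mask = (1 << length_bits) - 1
    return word >> length_bits, word & mask


if __name__ == "__main__":
    # Number of distinct values an 8-bit and a 14-bit field can represent.
    print(1 << 8, 1 << 14)
    # A 5000-character seed fits in a 14-bit field but not in an 8-bit one.
    word = pack_seed(0x1234, 5000, length_bits=14)
    print(unpack_seed(word, length_bits=14))
```

Under this assumed layout, every bit given to the length field is taken from the hash field, so a 14-bit length leaves 50 bits of a 64-bit word for the hash value.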

The code we used to generate the results in our manuscript is available at Zenodo: DOI

Installation

BLEND can be installed from its source code, Docker, or conda.

  • Download the code from its GitHub repository:
git clone https://github.com/CMU-SAFARI/BLEND.git blend

The compilation process is similar to minimap2's, which is explained in more detail here.

  • Compile (make sure you have a C compiler and GNU make):
cd blend && make

If the compilation is successful, the binary will be in bin/blend.

  • Install BLEND from the bioconda channel:
conda install -c bioconda blend-bio

Important: Your Docker version should be at least 20.10.12. Older versions may behave unexpectedly.

  • Build and run from the local Dockerfile:
#Build
docker build --rm -f "Dockerfile" -t blend "."

#Note: If your network connection is behind a proxy, you can define the following variables to set the proxy and build
# docker build --build-arg http_proxy="YOUR_HTTP_PROXY" --build-arg https_proxy="YOUR_HTTPS_PROXY" --no-cache --rm -f "Dockerfile" -t blend "."

#Example run
docker run -v $PWD/e.coli-pb-sequelii/:/input -v $PWD/output/:/output blend -x ava-hifi -o /output/output.paf /input/Ecoli.PB.HiFi.100X.fasta /input/Ecoli.PB.HiFi.100X.fasta

#You can also work from the docker image after executing the following (interactive usage):
docker run --rm -it --entrypoint /bin/bash blend
  • Pull from DockerHub:
#Pull
docker pull firtinac/blend

#Example run
docker run -v $PWD/e.coli-pb-sequelii/:/input -v $PWD/output/:/output firtinac/blend -x ava-hifi -o /output/output.paf /input/Ecoli.PB.HiFi.100X.fasta /input/Ecoli.PB.HiFi.100X.fasta

#You can also work from the docker image after executing the following (interactive usage):
docker run --rm -it --entrypoint /bin/bash firtinac/blend

Usage

You can print the help message to learn how to use blend:

blend -h

Below we show how to use blend for 1) finding overlapping reads and 2) read mapping when using the default preset parameters for each application and genome.

BLEND provides preset parameters depending on:

  • The application: 1) finding overlapping reads and 2) read mapping.
  • The sequencing technology: 1) accurate long reads (e.g., PacBio HiFi reads), 2) erroneous long reads (e.g., PacBio CLR reads), and 3) short reads (i.e., Illumina paired-end reads).

Finding Overlapping Reads

Assume that you would like to perform all-vs-all overlapping between all pairs of HiFi reads from a human genome located in file reads.fastq. To find overlapping reads and store them in the PAF file output.paf:

blend -x ava-hifi reads.fastq reads.fastq > output.paf
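Once output.paf exists, a short script can summarize the overlaps. The sketch below relies only on the 12 mandatory tab-separated PAF columns (query name, length, start, end, strand, target name, length, start, end, number of matches, alignment block length, mapping quality). The names PafRecord, read_paf, and overlaps_per_read are hypothetical helpers written for this example, and the three records in EXAMPLE_PAF are made up to keep it self-contained.

```python
import io
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns, in file order."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def read_paf(lines: Iterable[str]) -> Iterator[PafRecord]:
    """Yield one PafRecord per line, ignoring optional tags after column 12."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        yield PafRecord(
            fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
            fields[4],
            fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
            int(fields[9]), int(fields[10]), int(fields[11]),
        )


def overlaps_per_read(records: Iterable[PafRecord]) -> Counter:
    """Count, for each query read, the records that hit a different read."""
    counts: Counter = Counter()
    for rec in records:
        if rec.query_name != rec.target_name:
            counts[rec.query_name] += 1
    return counts


# Made-up records, only here so the example runs without an input file.
EXAMPLE_PAF = (
    "readA\t1000\t100\t900\t+\treadB\t1200\t0\t800\t750\t800\t60\n"
    "readA\t1000\t0\t500\t-\treadC\t900\t400\t900\t480\t500\t60\n"
    "readB\t1200\t0\t800\t+\treadA\t1000\t100\t900\t750\t800\t60\n"
)

if __name__ == "__main__":
    counts = overlaps_per_read(read_paf(io.StringIO(EXAMPLE_PAF)))
    for name, count in counts.most_common():
        print(name, count)
```

To run it on real output, replace the io.StringIO object with an open file handle for output.paf.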

Read Mapping

Assume that you would like to map PacBio CLR reads in file reads.fastq to a reference genome in file ref.fasta. To generate the read mapping with the CIGAR output in the SAM file output.sam:

blend -ax map-pb ref.fasta reads.fastq > output.sam

Getting Help

Since we integrate the BLEND mechanism into minimap2, most of the parameters are the same as those explained in the man page of minimap2 or on the public page of minimap2, which is subject to change as new versions of minimap2 roll out. We explain the parameters unique to the BLEND implementation below.

The following option (i.e., neighbors) defines the number of k-mers that BLEND uses to generate a seed.

--neighbors INT Combines INT amount of k-mers to generate a seed. [10]

The following option (i.e., fixed-bits) defines the number of bits that BLEND uses when generating the hash values of seeds. By default, it uses 2 bits per character of a k-mer and, thus, 2*k bits for the hash value of a seed. Decreasing this value increases the collision rate, so that more similar seeds are assigned the same hash value, but slightly dissimilar seeds may also start to receive the same hash value.

--fixed-bits INT BLEND uses INT number of bits when generating hash values of seeds rather than using 2*k number of bits. Useful when a collision rate different from the one obtained with 2*k bits is needed. Setting this option to 0 uses 2*k bits for hash values. [0]
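The relationship between k, the number of hash bits, and collisions can be illustrated with a small sketch. It assumes the common 2-bit nucleotide encoding (A=0, C=1, G=2, T=3) and models a smaller bit budget by simply keeping the low bits of the encoded k-mer. That truncation is a deliberate simplification for this example and is not how BLEND computes its hash values. The names encode_kmer, hash_bits, and truncate are hypothetical.

```python
# Simplified illustration of "2 bits per character" and of a reduced bit budget.
# The truncation below is an assumption for this example, not BLEND's hashing.

ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer using 2 bits per character."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


def hash_bits(k: int, fixed_bits: int = 0) -> int:
    """Number of hash bits: 2*k by default, or fixed_bits when it is non-zero."""
    return fixed_bits if fixed_bits > 0 else 2 * k


def truncate(value: int, bits: int) -> int:
    """Keep only the lowest `bits` bits of value."""
    return value & ((1 << bits) - 1)


if __name__ == "__main__":
    print(hash_bits(19), hash_bits(25), hash_bits(19, fixed_bits=32))
    a, b = encode_kmer("ACGT"), encode_kmer("CCGT")
    # With the full 2*k = 8 bits the two k-mers stay distinct ...
    print(truncate(a, hash_bits(4)) == truncate(b, hash_bits(4)))
    # ... with only 6 bits they collide, because they differ in the first base only.
    print(truncate(a, 6) == truncate(b, 6))
```

The point of the sketch is only the direction of the trade-off: fewer bits mean fewer distinct hash values, so more seeds share one.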

The following option (i.e., --strobemers) tells BLEND to link as many consecutive minimizer k-mers as the value of --neighbors to generate a strobemer sequence as a seed, and to use the hash values of these minimizer k-mers to generate a hash value for the strobemer sequence using the SimHash hashing strategy, as suggested in the BLEND paper.

--strobemers link minimizers rather than the preceding k-mers of a single minimizer. (Number of minimizers to link is defined by --neighbors.)

The following option (i.e., immediate) tells BLEND to link as many consecutive overlapping k-mers as the value of --neighbors to generate a seed sequence, and to use the hash values of these k-mers to generate a hash value for the seed sequence using the SimHash hashing strategy, as suggested in the BLEND paper.

--immediate use the hash values of consecutive k-mers to generate the hash values of seeds (default behavior).
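To give an intuition for why SimHash lets similar seeds share a hash value, here is a minimal sketch of the generic SimHash idea: each output bit is decided by a majority vote over the same bit of all input hash values. It is a textbook-style illustration under stated assumptions, not BLEND's implementation. The function name simhash, the unweighted vote, and the tie rule (a tie yields 0) are choices made for this example.

```python
from typing import Sequence


def simhash(hashes: Sequence[int], bits: int) -> int:
    """Combine several hash values into one by a per-bit majority vote."""
    if not hashes:
        raise ValueError("at least one hash value is required")
    result = 0
    for bit in range(bits):
        # +1 for every input with this bit set, -1 for every input without it.
        vote = sum(1 if (h >> bit) & 1 else -1 for h in hashes)
        if vote > 0:  # a tie (vote == 0) leaves the output bit at 0
            result |= 1 << bit
    return result


if __name__ == "__main__":
    # Two groups of k-mer hash values that differ in a single member.
    seed_a = [0b1100, 0b1010, 0b1110]
    seed_b = [0b1100, 0b1010, 0b1111]
    print(format(simhash(seed_a, 4), "04b"), format(simhash(seed_b, 4), "04b"))
```

Because a single differing input can flip an output bit only where the vote is close, groups that share most of their k-mer hash values tend to produce the same combined value, which is the property fuzzy seed matching relies on.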

BLEND provides the following preset options:

-x map-ont (-k7 -w10 --fixed-bits=32 --neighbors=11)
-x ava-ont (-k15 -w10 --fixed-bits=30 --neighbors=5 -e0 -m100 -r2k)
-x map-pb (-Hk7 -w10 --fixed-bits=32 --neighbors=15)
-x ava-pb (-Hk19 -Xw10 --fixed-bits=38 --neighbors=5 -e0 -m100)
-x map-hifi (--strobemers -k19 -w50 --fixed-bits=38 --neighbors=5 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200)
-x ava-hifi (--strobemers -k25 -Xw200 --fixed-bits=50 --neighbors=7 -e0 -m100)

Reproducing the results in the paper

The test directory explains how to reproduce the results shown in the BLEND paper.

Citing BLEND

If you use BLEND in your work, please cite:

@article{firtina_blend_2023,
  title = {{BLEND}: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis},
  volume = {5},
  issn = {2631-9268},
  doi = {10.1093/nargab/lqad004},
  number = {1},
  journal = {NAR Genomics and Bioinformatics},
  author = {Firtina, Can and Park, Jisung and Alser, Mohammed and Kim, Jeremie S and Cali, Damla Senol and Shahroodi, Taha and Ghiasi, Nika Mansouri and Singh, Gagandeep and Kanellopoulos, Konstantinos and Alkan, Can and Mutlu, Onur},
  month = {mar},
  year = {2023},
  pages = {lqad004},
}

blend's Issues

undesired 'mm_map_frag rechain' in sam file

Dear Blend development team,
I was interested in testing BLEND for short read mapping. The mapping of paired-end Illumina reads against a tomato genome worked perfectly, but the output SAM file contained 252 lines with "mm_map_frag rechain" after the PG line:

@SQ     SN:17-PSC-SL_TK14181.1.0_Chr11  LN:53848686
@SQ     SN:17-PSC-SL_TK14181.1.0_Chr12  LN:68218429
@RG     ID:var1    SM:var1     LB:Solution     PL:illumina     PU:none
@PG     ID:blend        PN:blend        VN:1.0  CL:blend -ax sr -t 4 -R @RG\tID:var1\tSM:var1\tLB:Solution\tPL:illumina\tPU:none slycopersicum.fasta.ind reads_1.fastq.gz reads_2.fastq.gz
mm_map_frag rechain
mm_map_frag rechain
...

These lines seem to be problematic for further processing with samtools:

samtools flagstat tmp.sam
[W::sam_read1_sam] Parse error at line 16
samtools flagstat: error reading from "tmp.sam"

Best regards,

Thomas
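Until the stray lines are removed at the source, one possible stopgap is to filter them out before handing the file to samtools. The sketch below assumes that every offending line consists of exactly the text "mm_map_frag rechain", as in the excerpt above. The helper name drop_stray_lines is hypothetical, the header lines in the demo are shortened placeholders, and this is a workaround rather than a fix from the BLEND developers.

```python
from typing import Iterable, Iterator

# The debug text reported in the issue; any other line is passed through untouched.
STRAY_LINE = "mm_map_frag rechain"


def drop_stray_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line except those that consist solely of STRAY_LINE."""
    for line in lines:
        if line.rstrip("\r\n") == STRAY_LINE:
            continue
        yield line


if __name__ == "__main__":
    # Shortened placeholder lines, only to demonstrate the filter.
    demo = [
        "@SQ\tSN:chr11\tLN:100\n",
        "@PG\tID:blend\tPN:blend\n",
        "mm_map_frag rechain\n",
        "mm_map_frag rechain\n",
        "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    ]
    for kept in drop_stray_lines(demo):
        print(kept, end="")
```

For a real file, the same generator can be fed an open handle for tmp.sam and its output written to a new file, which can then be passed to samtools flagstat.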

Some questions about the article

Could you please answer some questions about the article (https://arxiv.org/pdf/2112.08687.pdf):

  1. For HiFi reads you used Minimap2 with the option --ava-pb that is intended for PacBio CLR reads and not PacBio HiFi reads (Table S1). Why didn't you try Minimap2 with some other parameters? For example you could have increased the window size and the minimizer size. I suppose this will make Minimap2 faster and decrease its RAM consumption, thus reducing the difference between BLEND and Minimap2 on HiFi reads.
  2. Why did you use N50 and not NGA50 (Table 2)? N50 may be inflated due to misassemblies that result in improper sequence junctions.
  3. Why did you measure k-mer completeness and average identity using unpolished assemblies (Table 3)? Miniasm assemblies require polishing, because the accuracy of its contigs is the same as the accuracy of the reads used for the assembly. The higher accuracy of BLEND in Table 3 means that contigs made with BLEND are composed of slightly more accurate reads than contigs made with Minimap2, but the difference in accuracy may disappear after polishing.
  4. Taking into account that you used only one non-HiFi long read dataset and BLEND performed on it worse than Minimap2 (N50 in Table 2), is it correct to say that BLEND is probably fit only for HiFi long reads, and not PacBio CLR or Nanopore reads?

With best wishes,
Mikhail Schelkunov

A test of BLEND on two real datasets of PacBio CLR and Nanopore reads

In the paper https://arxiv.org/abs/2112.08687 BLEND was tested on only one non-HiFi read dataset. That was a simulated read dataset for one of the smallest eukaryotic genomes: the genome of Saccharomyces cerevisiae.

To test how well BLEND performs on real (non-simulated) datasets of genomes which have more typical sizes, I used it to assemble genomes from these two sets of reads:

  1. Caenorhabditis elegans, PacBio CLR reads used in the article https://www.sciencedirect.com/science/article/pii/S2589004220305770 . For polishing I also used Illumina reads from that article. The nematode genome size is approximately 100 Mbp.
  2. Arabidopsis thaliana, Nanopore reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR5530736 . For polishing I also used Illumina reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR2173372 . The size of arabidopsis' genome is approximately 120 Mbp.

I searched for overlaps, then assembled the genomes with Miniasm using default parameters, then polished the assemblies using long reads with Racon, and then polished the assemblies using both long and short reads with HyPo. The assemblies were compared with references using QUAST.

The search for overlaps was performed with BLEND 1.0 and, for comparison, with Minimap2 2.22, using 22 threads of Intel Xeon X5670.

For the nematode, results are as follows:

Metric                    Minimap2      BLEND
Time to find overlaps     10m           3h 37m
Maximum RAM consumption   20G           44G
N50                       2,056,511     1,915,190
NGA50                     589,675       563,498
Misassemblies             740           707
Genome fraction           99.692%       99.683%
Total length              109,516,352   108,958,103

So, the assemblies of the nematode genome made with Minimap2 and with BLEND are similar. However, BLEND required 20x more time to find overlaps and 2x more RAM.

For arabidopsis, Minimap2 found overlaps in 30 minutes using 29G RAM. I terminated BLEND because it didn't finish in 24 hours. At the moment I terminated it, BLEND was using 300G RAM.

So, it seems that on non-HiFi datasets for genomes not as small as the genome of Saccharomyces cerevisiae BLEND is slower than Minimap2 and uses more RAM. This may be so because BLEND doesn't deal efficiently with repetitive seeds.

Conda package not working in Mac M1

Hello!
I was interested in using the program. I installed the program following a step-by-step process in a Mac M1:

mamba create -n blend-bio
mamba install -c bioconda blend-bio
blend -h 

and the output was:

50665 illegal hardware instruction blend -h

I also tried using an environment with x86 architecture, but it produced the same error.

I hope you can help me.

Thanks!

Questions on running Blend on laptop

Hi Blend team, I am trying to run blend on my laptop to map recently released ONT duplex reads. I cut a single fastq.gz into small chunks (each containing 20000 reads) and ran the command below. But after generating some .tmp files, the process was killed (I believe for exceeding the max memory, ~9GB here).

blend -ax map-ont -t 6 --secondary=no -I 50M -a --split-prefix hg002 hg38.fa ont_small_chunk.fq.gz

I am not quite sure about "-I 50M" just assuming blend will map reads to part of the whole index to save memory. Am I right? Any advice to run blend on platform with restrained resources? Or maybe it should not be run this way. Thanks a lot!

Original fastq is here:
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE--UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_1_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
