ropebwt3's Introduction

Getting Started

# Compile
git clone https://github.com/lh3/ropebwt3
cd ropebwt3
make  # use "make omp=0" if your compiler doesn't support OpenMP

# Toy examples
echo -e 'AGG\nAGC' | ./ropebwt3 build -LR -
echo TGAACTCTACACAACATATTTTGTCACCAAG | ./ropebwt3 build -Ldo idx.fmd -
echo ACTCTACACAAgATATTTTGTCA | ./ropebwt3 mem -Ll10 idx.fmd -

# Download the prebuilt FM-index for 152 M. tuberculosis genomes
wget -O- https://zenodo.org/records/12803206/files/mtb152.tar.gz?download=1 | tar -zxf -

# Count super-maximal exact matches (no contig positions)
echo ACCTACAACACCGGTGGCTACAACGTGG  | ./ropebwt3 mem -L mtb152.fmd -
# Local alignment
echo ACCTACAACACCGGTaGGCTACAACGTGG | ./ropebwt3 sw -Lm20 mtb152.fmd -
# Retrieve R15311, the 46th genome in the collection. 90=(46-1)*2
./ropebwt3 get mtb152.fmd 90 > R15311.fa

Introduction

Ropebwt3 constructs the FM-index of a large DNA sequence set and searches for matches against the FM-index. It is optimized for highly redundant sequence sets such as a pangenome or sequence reads at high coverage. Ropebwt3 can losslessly compress 7.3Tb of common bacterial genomes into a 30GB run-length encoded BWT file and report supermaximal exact matches (SMEMs) or local alignments with mismatches and gaps.

Prebuilt ropebwt3 indices can be downloaded from Zenodo.

Usage

A full ropebwt3 index consists of three files:

  • <base>.fmd: run-length encoded BWT that supports the rank operation. It is generated by the build command. By default, the $i$-th sequence in the input is the $2i$-th sequence in the BWT and its reverse complement is the $(2i+1)$-th sequence. Some commands assume such ordering.

  • <base>.fmd.ssa: sampled suffix array, generated by the ssa command. For now, it is only needed for reporting coordinates in the PAF output of the sw command.

  • <base>.fmd.len.gz: list of sequence names and lengths. It is generated with third-party tools/scripts, for example, with seqtk comp input.fa | cut -f1,2 | gzip. This file is needed for reporting sequence names and lengths in the PAF output.
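The default ordering in the .fmd file amounts to simple index arithmetic. The sketch below is illustrative only: the helper names `forward_index`, `revcomp_index` and `input_index` are hypothetical and not part of ropebwt3. It assumes both strands were indexed (no -R) and that $i$ counts input sequences from 0, which matches the `get` example in Getting Started, where the 46th genome is retrieved with index 90.

```python
def forward_index(i: int) -> int:
    """BWT index of the forward strand of the i-th input sequence (0-based)."""
    if i < 0:
        raise ValueError("sequence index must be non-negative")
    return 2 * i


def revcomp_index(i: int) -> int:
    """BWT index of the reverse complement of the i-th input sequence (0-based)."""
    return forward_index(i) + 1


def input_index(bwt_index: int) -> tuple:
    """Inverse mapping: (0-based input index, strand) for a sequence in the BWT."""
    if bwt_index < 0:
        raise ValueError("BWT index must be non-negative")
    strand = "+" if bwt_index % 2 == 0 else "-"
    return bwt_index // 2, strand


# The 46th genome (1-based) is input sequence 45 (0-based).
print(forward_index(46 - 1))  # 90
print(input_index(91))        # (45, '-')
```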

Counting maximal exact matches

A maximal exact match (MEM) is an exact alignment between the index and a query that cannot be extended in either direction. A super MEM (SMEM) is a MEM that is not contained in any other MEM on the query sequence. You can find the SMEMs with

ropebwt3 mem -t4 -l31 bwt.fmd query.fa > matches.bed

In the output, the first three columns give the query sequence name, start and end of a match and the fourth column gives the number of hits. Option -l specifies the minimum MEM length. A larger value helps performance.
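To make the SMEM definition concrete, here is a brute-force sketch on the toy sequences from Getting Started. It is not how ropebwt3 works internally (ropebwt3 searches an FM-index), and it makes several simplifying assumptions: only the forward strand of the reference is searched, coordinates are 0-based and half-open in the style of BED, and the function names are hypothetical.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_smems(reference: str, query: str, min_len: int = 1) -> list:
    """Return (start, end, hits) for every SMEM of at least min_len bases."""
    reference, query = reference.upper(), query.upper()
    min_len = max(min_len, 1)
    smems = []
    prev_end = 0  # furthest query position covered by any match found so far
    for start in range(len(query)):
        # Extend to the right for as long as the substring occurs in the reference.
        end = start
        while end < len(query) and query[start:end + 1] in reference:
            end += 1
        # A match ending at or before prev_end lies inside an earlier match
        # that starts further left, so it is not super-maximal.
        if end > prev_end:
            if end - start >= min_len:
                hits = count_occurrences(reference, query[start:end])
                smems.append((start, end, hits))
            prev_end = end
    return smems


REFERENCE = "TGAACTCTACACAACATATTTTGTCACCAAG"
QUERY = "ACTCTACACAAgATATTTTGTCA"  # lowercase g marks the mismatch

for start, end, hits in find_smems(REFERENCE, QUERY, min_len=10):
    print("query", start, end, hits, sep="\t")
```

With a minimum length of 10, the single mismatch splits the query into two matches, one on each side of the mismatching base.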

Local alignment

Ropebwt3 implements a revised BWA-SW algorithm to align query sequences against an FM-index:

ropebwt3 sw -t4 -N25 -k11 bwt.fmd query.fa > aln.paf

Option -N effectively sets the bandwidth during alignment. A larger value improves alignment accuracy at the cost of performance. Option -k initiates alignments with an exact match.

Given a complete ropebwt3 index with a sampled suffix array and sequence names, the sw command outputs standard PAF. It reports only one hit per query, even when several hits are equally good. The number of hits in the BWT is written to the rh tag.

Local alignment is tens of times slower than finding SMEMs. It is not designed for aligning high-throughput sequence reads.
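Downstream scripts may want to read the hit count from the rh tag. The following is a minimal sketch of a PAF parser that relies only on the standard PAF layout (12 mandatory tab-separated columns followed by optional TAG:TYPE:VALUE fields). The example line is made up for illustration, and the assumption that rh is stored as an integer tag (rh:i:) has not been checked against real ropebwt3 output.

```python
PAF_COLUMNS = [
    "qname", "qlen", "qstart", "qend", "strand",
    "tname", "tlen", "tstart", "tend",
    "matches", "aln_len", "mapq",
]
INT_COLUMNS = {"qlen", "qstart", "qend", "tlen", "tstart", "tend",
               "matches", "aln_len", "mapq"}


def parse_paf_line(line: str) -> dict:
    """Parse one PAF line into a dict; optional tags go under the 'tags' key."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(PAF_COLUMNS):
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    record = {}
    for name, value in zip(PAF_COLUMNS, fields):
        record[name] = int(value) if name in INT_COLUMNS else value
    tags = {}
    for item in fields[len(PAF_COLUMNS):]:
        tag, type_code, value = item.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value
    record["tags"] = tags
    return record


# Hypothetical line: every value here is invented for illustration.
EXAMPLE_LINE = "\t".join([
    "query1", "29", "0", "29", "+",
    "R15311", "4400000", "1000", "1028",
    "28", "29", "60", "rh:i:152",
])

record = parse_paf_line(EXAMPLE_LINE)
print(record["qname"], record["tname"], record["tags"].get("rh"))
```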

Constructing a BWT

Ropebwt3 implements three algorithms for BWT construction. For the best performance, choose an algorithm based on the input data type.

  1. If you are not sure, use the general command line

    ropebwt3 build -t24 -bo bwt.fmr file1.fa file2.fa filen.fa

    You can also append another file to an existing index

    ropebwt3 build -t24 -bo bwt-new.fmr -i bwt-old.fmr filex.fa
  2. For a set of large genomes (e.g. a human pangenome), you may generate the BWT of each individual genome on a cluster and merge them together. This parallelizes sub-BWT construction and speeds up the overall process.

    ropebwt3 build -t8 -bo genome1.fmr genome1.fa.gz
    ropebwt3 build -t8 -bo genome2.fmr genome2.fa.gz
    ropebwt3 build -t8 -bo genomen.fmr genomen.fa.gz
    ropebwt3 merge -t24 -bo bwt.fmr genome1.fmr genome2.fmr genomen.fmr
  3. For a set of small genomes, it is better to concatenate them together:

    cat file1.fa file2.fa filen.fa | ropebwt3 build -t24 -m2g -bo bwt.fmr -
  4. For short reads, use the ropebwt2 algorithm and enable the RCLO sorting:

    ropebwt3 build -r -bo bwt.fmr reads.fq.gz
  5. Use grlBWT, which is faster than ropebwt3 for large pangenomes:

    ropebwt3 fa2line genome1.fa genome2.fa genomen.fa > all.txt
    grlbwt-cli all.txt -t 32 -T . -o bwt.grl
    grl2plain bwt.rl_bwt bwt.txt
    ropebwt3 plain2fmd -o bwt.fmd bwt.txt

These command lines construct a BWT for both strands of the input sequences. You can skip the reverse strand by adding option -R.
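As a small illustration of what "both strands" means, the sketch below pairs each input sequence with its reverse complement, interleaved in the order described for the .fmd file. It only shows which sequences enter the BWT, not how the BWT is built; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over A, C, G, T and N."""
    return seq.translate(_COMPLEMENT)[::-1]


def both_strands(sequences: list) -> list:
    """Interleave each sequence with its reverse complement."""
    out = []
    for seq in sequences:
        out.append(seq)
        out.append(reverse_complement(seq))
    return out


toy = ["AGG", "AGC"]
print(both_strands(toy))  # ['AGG', 'CCT', 'AGC', 'GCT']
```

Indexing both strands doubles the number of bases stored, which is the price for being able to match a query against either strand.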

When you use the build command for one genome, the peak memory by default is $B+11\cdot\min\{2S,7{\rm g}\}$ where $S$ is the input file size and $B$ is the final BWT size in run-length encoding. If you have more than 65536 sequences in a batch, factor 11 will be increased to 17 due to the use of a different algorithm. You can reduce the peak memory by reducing the batch size via option -m.
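The build formula can be turned into a quick estimator. This is a back-of-the-envelope sketch, not a measurement: it assumes that "g" means $2^{30}$ bytes, and it treats the $7{\rm g}$ term as a batch cap that could be lowered, which is how option -m is described above but is not verified here. The function and parameter names are hypothetical.

```python
GIB = 1024 ** 3  # assumption: "g" in the formula means 2**30 bytes


def build_peak_memory(bwt_bytes: int, input_bytes: int,
                      seqs_per_batch: int = 1,
                      batch_cap_bytes: int = 7 * GIB) -> int:
    """Estimate peak memory of build as B + factor * min(2S, batch cap)."""
    # The factor rises from 11 to 17 for batches with more than 65536 sequences.
    factor = 17 if seqs_per_batch > 65536 else 11
    return bwt_bytes + factor * min(2 * input_bytes, batch_cap_bytes)


# A 3 GiB input with a 1 GiB final BWT: 1 + 11 * min(6, 7) = 67 GiB.
print(build_peak_memory(1 * GIB, 3 * GIB) / GIB)
# A larger input is capped by the 7g term: 1 + 11 * 7 = 78 GiB.
print(build_peak_memory(1 * GIB, 10 * GIB) / GIB)
```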

The peak memory for the merge command is $B+\max\{B_1,\ldots,B_n\}+8\max\{L_1,\ldots,L_n\}$, where $B$ is the final BWT size in run-length encoding, $B_i$ is the size of the $i$-th input BWT to be merged and $L_i$ is the number of symbols in the $i$-th BWT.

If you provide multiple files on a build command line, ropebwt3 internally will run build on each input file and then incrementally merge each individual BWT to the final BWT. The peak memory will be the higher one between the build step and the merge step.
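The merge formula and the "higher of the two" rule can be sketched in the same way. The function names are hypothetical, all sizes are in bytes, and the build-step peak is passed in as a number rather than recomputed so that the example stays self-contained.

```python
def merge_peak_memory(final_bwt_bytes: int,
                      input_bwt_bytes: list,
                      input_symbol_counts: list) -> int:
    """Estimate peak memory of merge as B + max(B_i) + 8 * max(L_i)."""
    if not input_bwt_bytes or not input_symbol_counts:
        raise ValueError("at least one input BWT is required")
    return (final_bwt_bytes
            + max(input_bwt_bytes)
            + 8 * max(input_symbol_counts))


def overall_peak_memory(build_peak_bytes: int, merge_peak_bytes: int) -> int:
    """With several input files, the peak is the higher of the two steps."""
    return max(build_peak_bytes, merge_peak_bytes)


# Toy numbers: final BWT of 10 bytes, two inputs of 3 and 4 bytes
# holding 100 and 200 symbols.
merge_peak = merge_peak_memory(10, [3, 4], [100, 200])
print(merge_peak)                             # 1614
print(overall_peak_memory(2000, merge_peak))  # 2000
```

Note that the $8\max\{L_i\}$ term depends on the number of symbols, not on the compressed size, so merging one very long input can dominate the peak even when its run-length encoded BWT is small.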

Binary BWT file formats

Ropebwt3 uses two binary formats to store run-length encoded BWTs: the ropebwt2 FMR format and the fermi FMD format. The FMR format is dynamic in that you can add new sequences or merge BWTs to an existing FMR file. The same BWT does not necessarily lead to the same FMR. The FMD format is simpler in structure, faster to load, smaller in memory and can be memory-mapped. The two formats can be used interchangeably in ropebwt3, but it is recommended to use FMR for BWT construction and FMD for finding exact matches. You can explicitly convert between the two formats with:

ropebwt3 build -i in.fmd -bo out.fmr
ropebwt3 build -i in.fmr -do out.fmd
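Both formats store the BWT in run-length encoded form. The toy sketch below shows only the general idea of run-length encoding and why long runs of identical symbols, which are common in the BWT of a redundant collection, compress well. It does not reproduce the FMR or FMD on-disk layout.

```python
from itertools import groupby


def run_length_encode(bwt: str) -> list:
    """Collapse each run of identical symbols into a (symbol, length) pair."""
    return [(symbol, sum(1 for _ in run)) for symbol, run in groupby(bwt)]


def run_length_decode(runs: list) -> str:
    """Expand (symbol, length) pairs back into the original string."""
    return "".join(symbol * length for symbol, length in runs)


toy_bwt = "AAAACCGGGT"
runs = run_length_encode(toy_bwt)
print(runs)  # [('A', 4), ('C', 2), ('G', 3), ('T', 1)]
assert run_length_decode(runs) == toy_bwt
```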

Performance

The following table shows the time to construct the BWT for three datasets:

  1. human100 (300Gb): 100 human genomes assembled with long reads from the pangene paper
  2. ecoli315k (1.6Tb): 315k E. coli genomes from AllTheBacteria v0.2
  3. CommonBacteria (7.3Tb): genomes from AllTheBacteria excluding those in the "dustbin" and "unknown" categories

BWTs are constructed from both strands, so the size of each BWT is twice the number of input bases.

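| Dataset | Algorithm | Elapsed | CPU time | Peak RAM |
|---|---|---|---|---|
| human100 | rb3 build | 33.7 h | 803.6 h | 82.3 G |
| human100 | rb3 merge | 24.2 h | 757.2 h | 70.7 G |
| human100 | grlBWT | 8.3 h | 29.6 h | 84.8 G |
| human100 | pfp-thres | 51.7 h | 51.5 h | 788.1 G |
| ecoli315k | rb3 build | 128.7 h | 3826.8 h | 20.5 G |
| CommonBacteria | rb3 build | 26.5 d | 830.3 d | 67.3 G |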
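Dividing CPU time by elapsed time gives a rough estimate of how many cores each run kept busy on average. The sketch below does this arithmetic on the figures from the table; the list and function names are hypothetical, and the ratio is only an approximation since it ignores I/O waits and differences between machines.

```python
# (dataset, algorithm, elapsed hours, CPU hours), copied from the table.
# The CommonBacteria row is given in days and converted to hours here.
RUNS = [
    ("human100", "rb3 build", 33.7, 803.6),
    ("human100", "rb3 merge", 24.2, 757.2),
    ("human100", "grlBWT", 8.3, 29.6),
    ("human100", "pfp-thres", 51.7, 51.5),
    ("ecoli315k", "rb3 build", 128.7, 3826.8),
    ("CommonBacteria", "rb3 build", 26.5 * 24, 830.3 * 24),
]


def average_cores(elapsed_hours: float, cpu_hours: float) -> float:
    """Average number of busy cores, estimated as CPU time / elapsed time."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_hours / elapsed_hours


for dataset, algorithm, elapsed, cpu in RUNS:
    print(f"{dataset}\t{algorithm}\t{average_cores(elapsed, cpu):.1f}")
```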

For human100, the following methods were evaluated:

  • rb3 build: construct BWT from input FASTA files with ropebwt3 build -t48 (using up to 48 threads). This is the only method here that does not use working disk space.

  • rb3 merge: merge 100 BWTs constructed from 100 FASTA files, respectively. Constructing the BWT for one human genome takes around 10 minutes, which is not counted in the table.

  • grlBWT: construct BWT using grlBWT. We need to concatenate all input FASTA files and convert them to the one-sequence-per-line format with ropebwt3 fa2line. Conversion time is not counted.

  • pfp-thresholds: launched via the Movi indexing script. It was run on a slower machine with more RAM. The time for prepare_ref is not counted, either.

grlBWT is clearly the winner for BWT construction, and it also works for non-DNA alphabets. Ropebwt3 has acceptable performance, and its support for incremental builds may be helpful for large datasets.

Limitations

  • Ropebwt3 is slow on the "locate" operation.
