The ropebwt3 from lh3

Getting Started

# Compile
git clone https://github.com/lh3/ropebwt3
cd ropebwt3
make  # use "make omp=0" if your compiler doesn't support OpenMP

# Toy examples
echo -e 'AGG\nAGC' | ./ropebwt3 build -LR -
echo TGAACTCTACACAACATATTTTGTCACCAAG | ./ropebwt3 build -Ldo idx.fmd -
echo ACTCTACACAAgATATTTTGTCA | ./ropebwt3 mem -Ll10 idx.fmd -

# Download the prebuilt FM-index for 152 M. tuberculosis genomes
wget -O- https://zenodo.org/records/12803206/files/mtb152.tar.gz?download=1 | tar -zxf -

# Count super-maximal exact matches (no contig positions)
echo ACCTACAACACCGGTGGCTACAACGTGG  | ./ropebwt3 mem -L mtb152.fmd -
# Local alignment
echo ACCTACAACACCGGTaGGCTACAACGTGG | ./ropebwt3 sw -Lm20 mtb152.fmd -
# Retrieve R15311, the 46th genome in the collection. 90=(46-1)*2
./ropebwt3 get mtb152.fmd 90 > R15311.fa

Getting Started
Introduction
Usage
Performance
Limitations

Introduction

Ropebwt3 constructs the FM-index of a large DNA sequence set and searches for matches against the FM-index. It is optimized for highly redundant sequence sets such as a pangenome or sequence reads at high coverage. Ropebwt3 can losslessly compress 7.3Tb of common bacterial genomes into a 30GB run-length encoded BWT file and report supermaximal exact matches (SMEMs) or local alignments with mismatches and gaps.
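
To make "run-length encoded BWT" concrete, here is a minimal Python sketch on a toy string. It uses the naive sorted-rotation construction, which is not one of the algorithms ropebwt3 implements, and the names `naive_bwt` and `run_length_encode` are illustrative only. It shows why redundant input tends to produce long runs that compress well.

```python
from itertools import groupby

def naive_bwt(text, sentinel="$"):
    """Return the BWT of text via sorted rotations (toy inputs only)."""
    s = text + sentinel  # "$" sorts before A, C, G and T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def run_length_encode(bwt):
    """Collapse each run of identical symbols into a (symbol, run_length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(bwt)]

bwt = naive_bwt("ACACAC")
print(bwt)                     # CCC$AAA
print(run_length_encode(bwt))  # [('C', 3), ('$', 1), ('A', 3)]
```

The seven-symbol BWT collapses into three runs. A real index stores such runs in a structure that also supports the rank operation, which this sketch does not attempt.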

Prebuilt ropebwt3 indices can be downloaded from Zenodo.

Usage

A full ropebwt3 index consists of three files:

<base>.fmd: run-length encoded BWT that supports the rank operation. It is generated by the build command. By default, the $i$-th sequence in the input is the $2i$-th sequence in the BWT and its reverse complement is the $(2i+1)$-th sequence. Some commands assume such ordering.
<base>.fmd.ssa: sampled suffix array, generated by the ssa command. For now, it is only needed for reporting coordinates in the PAF output of the sw command.
<base>.fmd.len.gz: list of sequence names and lengths. It is generated with third-party tools/scripts, for example, with seqtk comp input.fa | cut -f1,2 | gzip. This file is needed for reporting sequence names and lengths in the PAF output.
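
If seqtk is not available, the name/length list could be produced with a short script. The sketch below assumes the file holds two tab-separated columns (sequence name, length), as implied by `seqtk comp input.fa | cut -f1,2`. The function names and paths are hypothetical, and it only handles uncompressed FASTA input.

```python
import gzip

def fasta_lengths(lines):
    """Yield (name, length) for each FASTA record in an iterable of lines."""
    name, length = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()  # the name is the first word of the header
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line.strip())
    if name is not None:
        yield name, length

def write_len_gz(fasta_path, out_path):
    """Write 'name<TAB>length' lines to a gzip file such as <base>.fmd.len.gz."""
    with open(fasta_path) as fasta, gzip.open(out_path, "wt") as out:
        for name, length in fasta_lengths(fasta):
            out.write(f"{name}\t{length}\n")
```

The records must be listed in the same order as the sequences given to the build command, otherwise names and lengths will not line up with the index.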

Counting maximal exact matches

A maximal exact match (MEM) is an exact alignment between the index and a query that cannot be extended in either direction. A super MEM (SMEM) is a MEM that is not contained in any other MEM on the query sequence. You can find the SMEMs with

ropebwt3 mem -t4 -l31 bwt.fmd query.fa > matches.bed

In the output, the first three columns give the query sequence name, start and end of a match and the fourth column gives the number of hits. Option -l specifies the minimum MEM length. A larger value helps performance.
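
The MEM and SMEM definitions above can be checked on small inputs with a brute-force Python sketch. It is not how ropebwt3 finds SMEMs (that works on the FM-index), it searches the forward strand of a single reference string only, and the names `find_smems` and `count_occurrences` are illustrative. The input is the toy example from Getting Started.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count

def find_smems(reference, query, min_len=1):
    """Return (start, end, hits) for each SMEM of query against reference."""
    reference, query = reference.upper(), query.upper()
    smems, prev_end = [], 0
    for start in range(len(query)):
        # The longest match starting here never ends before the previous one.
        end = max(start, prev_end)
        while end < len(query) and query[start:end + 1] in reference:
            end += 1
        # A strictly larger end means no earlier match contains this one.
        if end > prev_end and end > start and end - start >= min_len:
            hits = count_occurrences(reference, query[start:end])
            smems.append((start, end, hits))
        prev_end = max(prev_end, end)
    return smems

reference = "TGAACTCTACACAACATATTTTGTCACCAAG"
query = "ACTCTACACAAgATATTTTGTCA"
for start, end, hits in find_smems(reference, query, min_len=10):
    print(start, end, hits)
# 0 11 1
# 12 23 1
```

The mismatch at the lowercase `g` splits the query into two SMEMs of length 11. Because an index built on both strands also contains reverse complements, hit counts from the real tool may differ from this forward-only sketch.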

Local alignment

Ropebwt3 implements a revised BWA-SW algorithm to align query sequences against an FM-index:

ropebwt3 sw -t4 -N25 -k11 bwt.fmd query.fa > aln.paf

Option -N effectively sets the bandwidth during alignment. A larger value improves alignment accuracy at the cost of performance. Option -k initiates alignments with an exact match.

Given a complete ropebwt3 index with a sampled suffix array and sequence names, the sw command outputs standard PAF, but it reports only one hit per query even if there are multiple equally good hits. The number of hits in the BWT is written to the rh tag.
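
For downstream processing, a PAF record can be split into its 12 standard columns plus optional tags. The Python sketch below assumes the output comes from a complete index and that the rh tag is integer-typed (`rh:i:<count>`), which this README does not confirm. The sample line is invented for illustration and is not real ropebwt3 output.

```python
def parse_paf_line(line):
    """Parse one PAF record into a dict; optional tags go under 'tags'."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "qname": fields[0], "qlen": int(fields[1]),
        "qstart": int(fields[2]), "qend": int(fields[3]),
        "strand": fields[4],
        "tname": fields[5], "tlen": int(fields[6]),
        "tstart": int(fields[7]), "tend": int(fields[8]),
        "matches": int(fields[9]), "block_len": int(fields[10]),
        "mapq": int(fields[11]),
        "tags": {},
    }
    for tag in fields[12:]:
        key, type_code, value = tag.split(":", 2)
        # Integer-typed tags ("i") become ints; everything else stays a string.
        record["tags"][key] = int(value) if type_code == "i" else value
    return record

# Made-up record, used only to exercise the parser.
sample = "q1\t29\t0\t29\t+\tctg1\t1000\t100\t128\t28\t29\t0\trh:i:3"
record = parse_paf_line(sample)
print(record["tname"], record["tstart"], record["tend"], record["tags"]["rh"])
# ctg1 100 128 3
```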

Local alignment is tens of times slower than finding SMEMs. It is not designed for aligning high-throughput sequence reads.

Constructing a BWT

Ropebwt3 implements three algorithms for BWT construction. For the best performance, you need to choose an algorithm based on the input data types.

If you are not sure, use the general command line:

ropebwt3 build -t24 -bo bwt.fmr file1.fa file2.fa filen.fa

You can also append another file to an existing index:

ropebwt3 build -t24 -bo bwt-new.fmr -i bwt-old.fmr filex.fa

For a set of large genomes (e.g. a human pangenome), you may generate the BWT of each individual genome on a cluster and merge them together. This parallelizes sub-BWT construction and speeds up the overall process.
```
ropebwt3 build -t8 -bo genome1.fmr genome1.fa.gz
ropebwt3 build -t8 -bo genome2.fmr genome2.fa.gz
ropebwt3 build -t8 -bo genomen.fmr genomen.fa.gz
ropebwt3 merge -t24 -bo bwt.fmr genome1.fmr genome2.fmr genomen.fmr
```
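
For many genomes, the per-genome build commands and the final merge command above can be generated rather than typed. The Python sketch below only assembles argument lists using the flags shown in this section. It does not run anything, and the function names, the output naming scheme and the default thread counts are assumptions for illustration. Submitting each build to a cluster scheduler is left to the reader.

```python
def strip_suffix(name, suffix):
    """Remove suffix from name if present."""
    return name[:-len(suffix)] if name.endswith(suffix) else name

def plan_parallel_build(fasta_files, out_path="bwt.fmr",
                        build_threads=8, merge_threads=24):
    """Return (build_commands, merge_command) as lists of arguments."""
    build_commands, sub_bwts = [], []
    for fasta in fasta_files:
        # genome1.fa.gz -> genome1.fmr (hypothetical naming scheme)
        base = strip_suffix(strip_suffix(fasta, ".gz"), ".fa")
        sub_bwt = base + ".fmr"
        sub_bwts.append(sub_bwt)
        build_commands.append(
            ["ropebwt3", "build", f"-t{build_threads}", "-bo", sub_bwt, fasta])
    merge_command = ["ropebwt3", "merge", f"-t{merge_threads}", "-bo",
                     out_path] + sub_bwts
    return build_commands, merge_command

builds, merge = plan_parallel_build(["genome1.fa.gz", "genome2.fa.gz"])
for command in builds:
    print(" ".join(command))  # one independent job per genome
print(" ".join(merge))        # run after all build jobs have finished
```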

For a set of small genomes, it is better to concatenate them together:

cat file1.fa file2.fa filen.fa | ropebwt3 build -t24 -m2g -bo bwt.fmr -

For short reads, use the ropebwt2 algorithm and enable the RCLO sorting:
```
ropebwt3 build -r -bo bwt.fmr reads.fq.gz
```

Use grlBWT, which is faster than ropebwt3 for large pangenomes:

ropebwt3 fa2line genome1.fa genome2.fa genomen.fa > all.txt
grlbwt-cli all.txt -t 32 -T . -o bwt.grl
grl2plain bwt.rl_bwt bwt.txt
ropebwt3 plain2fmd -o bwt.fmd bwt.txt

These command lines construct a BWT for both strands of the input sequences. You can skip the reverse strand by adding option -R.
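
When both strands are indexed with the default ordering, the position of a sequence in the BWT follows from its position in the input. The Python sketch below encodes that rule, with a hypothetical helper name. It does not apply to indices built with -R or with a different sequence order.

```python
def bwt_sequence_index(input_rank, reverse_complement=False):
    """Map a 1-based input rank to its 0-based index in a both-strand BWT."""
    if input_rank < 1:
        raise ValueError("input_rank is 1-based and must be >= 1")
    i = input_rank - 1  # 0-based position in the input
    # The forward strand is sequence 2i; its reverse complement is 2i + 1.
    return 2 * i + (1 if reverse_complement else 0)

print(bwt_sequence_index(46))                           # 90
print(bwt_sequence_index(46, reverse_complement=True))  # 91
```

The first value matches the Getting Started example, where the 46th genome is retrieved with index 90.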

When you use the build command for one genome, the peak memory by default is $B+11\cdot\min\{2S,7{\rm g}\}$ where $S$ is the input file size and $B$ is the final BWT size in run-length encoding. If you have more than 65536 sequences in a batch, factor 11 will be increased to 17 due to the use of a different algorithm. You can reduce the peak memory by reducing the batch size via option -m.
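
As a worked example of the formula above, the Python sketch below estimates the peak memory of a single-genome build. It assumes that "g" means $2^{30}$ bytes and that a batch size set with -m replaces the 7g term. Both are readings of the text rather than confirmed behavior, and the function name is illustrative.

```python
GIB = 1024 ** 3  # assumption: "g" in the formula is read as 2**30 bytes

def build_peak_memory(input_size, bwt_size, batch_size=7 * GIB,
                      sequences_per_batch=1):
    """Estimate peak bytes as B + factor * min(2S, batch size)."""
    # Factor 11 by default; 17 when a batch holds more than 65536 sequences.
    factor = 17 if sequences_per_batch > 65536 else 11
    return bwt_size + factor * min(2 * input_size, batch_size)

# Illustrative sizes: a 3 GiB input and a 1 GiB final BWT.
print(build_peak_memory(3 * GIB, 1 * GIB) // GIB)                      # 67
print(build_peak_memory(3 * GIB, 1 * GIB, batch_size=2 * GIB) // GIB)  # 23
```

In this example, lowering the batch size from the default to 2 GiB cuts the estimate from 67 GiB to 23 GiB.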

The peak memory for the merge command is $B+\max\{B_1,\ldots,B_n\}+8\max\{L_1,\ldots,L_n\}$, where $B$ is the final BWT size in run-length encoding, $B_i$ is the size of the $i$-th input BWT to be merged and $L_i$ is the number of symbols in the $i$-th BWT.
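
The merge formula can be evaluated the same way. The sketch below is a direct transcription into Python with a hypothetical function name, and the sizes in the example are made up for illustration.

```python
def merge_peak_memory(final_bwt_size, input_bwt_sizes, input_symbol_counts):
    """Estimate peak bytes as B + max(B_i) + 8 * max(L_i)."""
    return (final_bwt_size
            + max(input_bwt_sizes)
            + 8 * max(input_symbol_counts))

# Made-up example: two input BWTs of 1 GB and 2 GB holding 6.0 and 6.2
# billion symbols, merged into a 10 GB final BWT.
peak = merge_peak_memory(
    final_bwt_size=10_000_000_000,
    input_bwt_sizes=[1_000_000_000, 2_000_000_000],
    input_symbol_counts=[6_000_000_000, 6_200_000_000],
)
print(peak / 1e9)  # 61.6
```

The $8\max\{L_i\}$ term dominates here, so the estimate is driven by the input BWT with the most symbols rather than by its compressed size.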

If you provide multiple files on a build command line, ropebwt3 internally will run build on each input file and then incrementally merge each individual BWT to the final BWT. The peak memory will be the higher one between the build step and the merge step.

Binary BWT file formats

Ropebwt3 uses two binary formats to store run-length encoded BWTs: the ropebwt2 FMR format and the fermi FMD format. The FMR format is dynamic in that you can add new sequences or merge BWTs to an existing FMR file. The same BWT does not necessarily lead to the same FMR. The FMD format is simpler in structure, faster to load, smaller in memory and can be memory-mapped. The two formats can be used interchangeably in ropebwt3, but it is recommended to use FMR for BWT construction and FMD for finding exact matches. You can explicitly convert between the two formats with:

ropebwt3 build -i in.fmd -bo out.fmr
ropebwt3 build -i in.fmr -do out.fmd

Performance

The following table shows the time to construct the BWT for three datasets:

human100 (300Gb): 100 human genomes assembled with long reads from the pangene paper
ecoli315k (1.6Tb): 315k E. coli genomes from AllTheBacteria v0.2
CommonBacteria (7.3Tb): genomes from AllTheBacteria excluding those in the "dustbin" and "unknown" categories

BWTs are constructed from both strands, so the number of symbols in each BWT is twice the number of input bases.

| Dataset | Algorithm | Elapsed | CPU time | Peak RAM |
|:--|:--|--:|--:|--:|
| human100 | rb3 build | 33.7 h | 803.6 h | 82.3 G |
| | rb3 merge | 24.2 h | 757.2 h | 70.7 G |
| | grlBWT | 8.3 h | 29.6 h | 84.8 G |
| | pfp-thres | 51.7 h | 51.5 h | 788.1 G |
| ecoli315k | rb3 build | 128.7 h | 3826.8 h | 20.5 G |
| CommonBacteria | rb3 build | 26.5 d | 830.3 d | 67.3 G |

For human100, the following methods were evaluated:

rb3 build: construct BWT from input FASTA files with ropebwt3 build -t48 (using up to 48 threads). This is the only method here that does not use working disk space.
rb3 merge: merge 100 BWTs constructed from 100 FASTA files, respectively. Constructing the BWT for one human genome takes around 10 minutes, which is not counted in the table.
grlBWT: construct BWT using grlBWT. We need to concatenate all input FASTA files and convert them to the one-sequence-per-line format with ropebwt3 fa2line. Conversion time is not counted.
pfp-thresholds: launched via the Movi indexing script. It was run on a slower machine with more RAM. The time for prepare_ref is not counted, either.

grlBWT is clearly the winner for BWT construction, and it also works for non-DNA alphabets. Ropebwt3 has acceptable performance, and its support for incremental builds may be helpful for large datasets.

Limitations

Ropebwt3 is slow on the "locate" operation.
