This little tool accepts a fasta file (or reads from stdin) and outputs a filtered fasta file (or writes to stdout). It is (relatively) fast; see the comparison with a simple Python implementation.
fasta_filter 0.1.4
Haogao Gu <[email protected]>
A tool for filtering fasta sequences with threshold of specific bases (e.g. 'N'), written in Rust.
USAGE:
fasta_filter [OPTIONS] --file <FILE> --num_base <NUMBER>
OPTIONS:
-f, --file <FILE>
Path of fasta file or use '-' as stdin.
-b, --base <STRING>
Bases to be accounted for. Examples: "N,-". Please note that this is case sensitive.
[default: N]
-n, --num_base <NUMBER>
Frequency of specified bases, any sequences with bases count over this threshold will
not be print out. Use 0 to skip this step if you only want to use the specified_pos
filter.
-s, --specified_pos_file <FILE>
Path to a txt file specifying genomic positions of interest, each line should contain
one integer specifying nucleotide position. Positions are 1-based rather than 0-based.
-m, --specified_num_base <NUMBER>
The num_base threshold for the specified positions.
-o, --out_file <NUMBER>
Path to write to the outfile, if "-" will write to stdout. [default: -]
-v, --verbose
Add this flap to print parameters to stderr.
-h, --help
Print help information
-V, --version
Print version information
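The position filter described for `--specified_pos_file` and `--specified_num_base` can be sketched in Python as follows. This is only an illustration of the documented behaviour, not the tool's Rust code: the function names are hypothetical, and skipping blank lines and out-of-range positions is an assumption.

```python
def parse_positions(text):
    """Parse one 1-based integer position per line; blank lines are skipped (an assumption)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def count_at_positions(seq, positions, bases):
    """Count how many of the 1-based positions hold one of `bases` (case sensitive)."""
    count = 0
    for pos in positions:
        idx = pos - 1  # convert the 1-based position to a 0-based index
        # Positions beyond the sequence end are ignored (an assumption).
        if 0 <= idx < len(seq) and seq[idx] in bases:
            count += 1
    return count


def passes_position_filter(seq, positions, bases, threshold):
    """Keep the sequence unless the count at the positions is over the threshold."""
    return count_at_positions(seq, positions, bases) <= threshold


if __name__ == "__main__":
    positions = parse_positions("1\n2\n3\n")
    print(passes_position_filter("NNNAAAC", positions, {"N"}, 2))  # False: 3 N > 2
    print(passes_position_filter("NNAAAAC", positions, {"N"}, 2))  # True: 2 N <= 2
```

Note that a count equal to the threshold is kept; only counts over it are dropped, matching the wording of `--num_base`.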
Directly download executables from Releases.
- Install Rust from here.
- Download source code by
git clone https://github.com/Koohoko/fasta_filter.git
- Install with
cargo install --path fasta_filter
- You are ready to go.
Example input:
>seq1_8N_5del
NNNAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
>seq2_20N_10del
NNNAAAAAAAAA-----CCCCCCCCCTTTTTTTTGGGGGGGNNNNNNNGGAAACCC-----AAAAAANNNNNNNNNNT
>seq3_5N_5del
TTTAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
Example usage:
- Drop the sequences with > 5 "N" bases:
✗ fasta_filter -b N -n 5 -f data/small.fasta
>seq3_5N_5del
TTTAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
- Drop the sequences with > 20 "N"+"-" bases:
✗ fasta_filter -b N,- -n 20 -f data/small.fasta
>seq1_8N_5del
NNNAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
>seq3_5N_5del
TTTAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
- Drop the sequences with > 2 "N" bases within specified positions (positions are specified in a txt file). Here we use "-n 0" to skip the full-genome filter:
✗ fasta_filter -b N -n 0 -f data/small.fasta -s ./data/mut_pos.txt -m 2
>seq2_20N_10del
NNAAAAAAAAAA-----CCCCCCCCCTTTTTTTTGGGGGGGNNNNNNNGGAAACCC-----AAAAAANNNNNNNNNNT
>seq3_5N_5del
TTTAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
- Filter on both the specified positions and the full genome, with different thresholds:
✗ fasta_filter -b N -n 10 -f data/small.fasta -s ./data/mut_pos.txt -m 2
>seq3_5N_5del
TTTAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
- Input compressed files and output to a regular fasta file, showing verbose info:
✗ fasta_filter -b N -n 10 -f data/small.fasta.xz -s ./data/mut_pos.txt -m 2 -v -o ./data/test_output.fasta
### Job started! ###
fasta file: data/small.fasta.xz
Output file: ./data/test_output.fasta
bases: ['N']
num_base: 10
allow_iupac: true
specified_pos_file: ./data/mut_pos.txt
specified_num_base: 2
### Job finished! ###
✗ cat ./data/test_output.fasta
>seq3_5N_5del
TTTAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
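To make the whole-sequence threshold concrete, here is a minimal Python sketch that applies the `-b`/`-n` logic to the three example records. It is an illustration under stated assumptions, not the tool's implementation: `read_fasta`, `keep` and `filter_names` are hypothetical helpers, and the third record is named `seq3_5N_5del` as in the outputs above.

```python
EXAMPLE = """\
>seq1_8N_5del
NNNAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
>seq2_20N_10del
NNNAAAAAAAAA-----CCCCCCCCCTTTTTTTTGGGGGGGNNNNNNNGGAAACCC-----AAAAAANNNNNNNNNNT
>seq3_5N_5del
TTTAAAAAAAAAAAAACCCCCCCCCCCTTTTTTTTGGGGGGGGGGGGGGGGAAACCC-----AAAAAANNNNNTTTTT
"""


def read_fasta(text):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def keep(seq, bases, num_base):
    """Keep unless the total count of `bases` is over num_base; 0 skips the filter."""
    if num_base == 0:
        return True
    return sum(seq.count(b) for b in bases) <= num_base


def filter_names(text, bases, num_base):
    """Return the headers of the records that survive the filter."""
    return [h for h, s in read_fasta(text) if keep(s, bases, num_base)]


if __name__ == "__main__":
    print(filter_names(EXAMPLE, ["N"], 5))        # ['seq3_5N_5del']
    print(filter_names(EXAMPLE, ["N", "-"], 20))  # ['seq1_8N_5del', 'seq3_5N_5del']
```

The third record has exactly 5 "N" bases and survives `-n 5`, because only counts over the threshold are dropped.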
- Running on a plain fasta file of SARS-CoV-2 sequences (1.5GB in file size, 50,000 sequences of ~29,900 bases each): when writing output to /dev/null, fasta_filter took ~0.6s and the python3 version took ~12s on my poor computer (Intel NUC8i5beh). Details can be found here (Rust, Python).
- Using fasta_filter with double filters on a big fasta file (302GB in plain text, a multiple sequence alignment of SARS-CoV-2 downloaded from GISAID). IO seems to be the major bottleneck.
✗ time -hl fasta_filter -f /Volumes/SSD_480G/Downloads/msa_2022-04-04/2022-04-04_unmasked.fa -b n,-,N -n 4500 -m 10 -s ./data/BA1_BA2_pos.txt -o /Users/koohoko/Downloads/2022-04-04_unmasked_filtered.fasta
10m7.61s real 2m52.32s user 4m44.53s sys
1458176 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
521 page reclaims
1 page faults
0 swaps
0 block input operations
7 block output operations
0 messages sent
0 messages received
0 signals received
27170 voluntary context switches
1002349 involuntary context switches
2358390178277 instructions retired
1400439339433 cycles elapsed
618496 peak memory footprint
- Test pipe streams. Stdin and Stdout work as expected.
- Test zip files. gz and xz inputs are also supported.
- Benchmark against python implementation.
- Add installation instruction.
- Work in multithread mode?
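The stdin and gz/xz input handling noted in the list above could look roughly like this in a Python reimplementation. This is a hedged sketch only: `open_fasta` is a hypothetical helper, and choosing the decompressor from the file extension is an assumption (the Rust tool may detect compression differently, for example from the file's leading bytes).

```python
import gzip
import lzma
import sys


def open_fasta(path):
    """Return a text-mode handle for a plain, .gz or .xz fasta file, or stdin for '-'."""
    if path == "-":
        return sys.stdin
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decompresses and decodes to text
    if path.endswith(".xz"):
        return lzma.open(path, "rt")
    return open(path, "r")


if __name__ == "__main__":
    # Count records in the file given on the command line, or in stdin by default.
    source = sys.argv[1] if len(sys.argv) > 1 else "-"
    handle = open_fasta(source)
    print(sum(1 for line in handle if line.startswith(">")))
```

Because every branch returns an ordinary text handle, the filtering code does not need to know whether the input was compressed or piped.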