This is the home of most of my code. Take a look around--something may catch your eye!
If you are extraordinarily bold and curious, check out this cool project that has been piquing my interest recently.
SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.
Home Page: https://crates.io/crates/block_aligner
License: MIT License
When including block aligner as a dependency, the file at https://github.com/Daniel-Liu-c0deb0t/block-aligner/blob/main/c/Cargo.toml gives this warning:
warning: skipping duplicate package `block-aligner` found at `~/.local/share/cargo/git/checkouts/block-aligner-253559227b640923/4014339/c`
It's probably best to rename the package to something else there.
Should not be too hard to keep track of when traceback pointers converge, saving memory during alignment.
If I try to use
let gap_penalties = Gaps { open: -1, extend: -1 };
I get an assertion error. I understand that, the way scoring works, gap-open needs to include the gap extension, since a gap of length one is scored as just the open penalty. However, I would like to be able to do alignments where e.g. a 1bp gap costs -1 and a 2bp gap costs -2, etc., which is not possible with the current check.
Is there any reason to require that open > extend instead of open >= extend?
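For reference, the scoring convention described in this issue can be written out as a small function. This is only a sketch: `gap_cost` is a hypothetical helper for illustration and not part of block-aligner's API. It shows why allowing open == extend would give linear gap costs.

```rust
/// Cost of a gap of `len` characters when the open penalty already
/// includes the first extension (the convention described in this issue).
/// `gap_cost` is a hypothetical helper, not a library function.
fn gap_cost(open: i32, extend: i32, len: u32) -> i32 {
    if len == 0 {
        0
    } else {
        // The first gap character costs `open`; each further one costs `extend`.
        open + extend * (len as i32 - 1)
    }
}
```

With open = -1 and extend = -1, a gap of length k costs -k, which is the linear scheme requested above.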
Hi!
I am new to Rust so I may be misunderstanding something, but I was wondering if using an extend penalty of -0.5 is supported. I noticed that the Gaps struct takes i8, so I am guessing this is not possible?
Also, just curious, what led you to choose the scoring setup for proteins that you did in the paper?
"For scoring, we use BLOSUM62 with (Gopen = -11, Gext = -1) for proteins"
Thanks for this library!
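A common workaround for fractional penalties with integer scores is to multiply every score (substitution scores and gap penalties alike) by the same factor, which leaves the ranking of alignments unchanged under exact dynamic programming. The sketch below only checks whether a scaled value is a whole number that fits in an i8; `scale_score` is a hypothetical helper, and whether block aligner's internal score ranges and X-drop thresholds tolerate scaled scores has not been verified here.

```rust
/// Multiply a fractional score by `factor` and return it as an `i8`.
/// Returns `None` if the result is not a whole number or is out of `i8` range.
/// `scale_score` is a hypothetical helper for illustration only.
fn scale_score(score: f64, factor: i32) -> Option<i8> {
    let scaled = score * factor as f64;
    if scaled.fract() != 0.0 {
        return None; // still fractional: a larger factor is needed
    }
    if scaled < i8::MIN as f64 || scaled > i8::MAX as f64 {
        return None; // would overflow the i8 fields
    }
    Some(scaled as i8)
}
```

For example, with a factor of 2 an extend penalty of -0.5 becomes -1 and an open penalty of -11 becomes -22; the final score would then be divided by 2 to recover the original scale.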
Hello Daniel,
I finally need to use block aligner for real-world metagenomic applications. I have long reads from PacBio (1k to 20k) that I want to align in pairs. They can be highly similar, e.g., > 90% sequence identity, but only in the overlapping region: two sequences might have only 50% aligned, with very high identity in the aligned region. This means I need semi-global alignment, and I need to know the aligned length so that I can calculate the alignment ratio. It would be nice if I could also detect identities a little below 90%, e.g., 85%. I do not recall whether block aligner can do semi-global alignment, that is, gaps opened at both ends are not penalized (or get a very small penalty, as in vsearch --all_pairs_global, where those low-penalty gaps are not in the final alignment). I also read the adaptive banded aligner paper, but I am still not sure how to compute the aligned ratio (e.g., aligned positions/query length). I guess both adaptive banded DP and block aligner can be adapted to compute the semi-global alignment ratio and the identity of the aligned region? This is what is needed in real-world applications. Any suggestions? Let me know if I am not clear.
Thanks,
Jianshu
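To make the quantities in this question concrete, here is a minimal sketch of computing identity and aligned ratio from an alignment path. It assumes the path is available as run-length encoded operations in an extended-CIGAR-like form ('=' match, 'X' mismatch, 'I' consumes only the query, 'D' consumes only the reference); these conventions, the `AlignStats` type, and the `align_stats` function are assumptions for illustration, not block-aligner's API. Leading and trailing gap runs are treated as unaligned overhang, and identity is taken as matches divided by alignment columns, which is only one of several definitions in use.

```rust
/// Identity of the aligned region and fraction of the query it covers.
struct AlignStats {
    identity: f64,
    query_aligned_ratio: f64,
}

fn is_gap(op: &(char, usize)) -> bool {
    op.0 == 'I' || op.0 == 'D'
}

/// `ops` is a run-length encoded path; `query_len` is the full query length.
/// Returns `None` for empty input, an all-gap path, or an unknown operation.
fn align_stats(ops: &[(char, usize)], query_len: usize) -> Option<AlignStats> {
    // Trim overhanging gap runs at both ends (the semi-global part).
    let start = ops.iter().position(|op| !is_gap(op))?;
    let end = ops.iter().rposition(|op| !is_gap(op))? + 1;
    let (mut matches, mut columns, mut query_used) = (0usize, 0usize, 0usize);
    for &(op, len) in &ops[start..end] {
        columns += len;
        match op {
            '=' => {
                matches += len;
                query_used += len;
            }
            'X' | 'I' => query_used += len,
            'D' => {}
            _ => return None,
        }
    }
    if columns == 0 || query_len == 0 {
        return None;
    }
    Some(AlignStats {
        identity: matches as f64 / columns as f64,
        query_aligned_ratio: query_used as f64 / query_len as f64,
    })
}
```

With such a summary, a pair could be accepted when, say, the identity is at least 0.85 and the aligned ratio exceeds a chosen threshold.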
Hi,
I was wondering if it is possible to define a custom AAMatrix from C API?
Thanks,
It should be pretty easy to add a realignment feature based on a known traceback path (can be approximate). The block will just shift to follow that path.
Profiling shows that this could save maybe 20% of time when block aligner is repeatedly created and destroyed. One way to reuse memory could be to allow an old block aligner instance to be moved into a new one, and the new one can repurpose the allocated memory regions of the old one.
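The reuse idea can be sketched with plain standard-library types. `Scratch` and `from_old` below are hypothetical names standing in for an aligner's internal buffers, not block-aligner's actual types; the point is only that moving the old instance into the new one lets the allocation be kept when the new size fits within the old capacity.

```rust
/// Stand-in for an aligner's internal working memory (hypothetical type).
struct Scratch {
    cells: Vec<i16>,
}

impl Scratch {
    fn new(len: usize) -> Self {
        Scratch { cells: vec![0; len] }
    }

    /// Build a new instance by taking over the old one's allocation.
    /// If `len` fits within the old capacity, no new allocation happens.
    fn from_old(old: Scratch, len: usize) -> Self {
        let mut cells = old.cells;
        cells.clear(); // keeps capacity, drops contents
        cells.resize(len, 0); // re-zero to the requested length
        Scratch { cells }
    }
}
```

Taking the old instance by value makes the hand-over explicit in the type system: the old aligner cannot be used after its memory has been repurposed.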
Hello Daniel,
I have a question about the global alignment score of block aligner: is it a metric (https://en.wikipedia.org/wiki/Metric_space)? In particular, does score(a, b) = score(b, a) hold for global alignment?
Also, what is the lowest identity block aligner can handle? I saw that the protein example you have goes down to ~50% identity. How about even lower, and for DNA, since it is not full dynamic programming? Wavefront is fast only when the two sequences are highly identical, e.g., 95% identity or above. I am wondering whether this is also the case for block aligner.
Another question is about semi-global alignment: both vsearch and usearch use a much smaller penalty for gaps at both ends of the two sequences compared to gaps inside the sequences to approximate "semi-global", because in practice semi-global does not penalize gaps at both ends.
Thanks,
Jianshu
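As a point of reference for the symmetry question, the sketch below is an exact (full dynamic programming) global alignment score with affine gaps, using the convention that the open penalty includes the first gap character. With a symmetric substitution score and the same gap penalties for both sequences, this exact score is symmetric in its two arguments. It is a plain textbook-style recurrence with a hypothetical name (`global_score`), not block aligner's algorithm, and it says nothing about whether block aligner's block-based heuristic preserves the symmetry.

```rust
/// Exact global alignment score with affine gaps (two-row, O(|a|*|b|) time).
/// A gap of length k costs `open + (k - 1) * extend`.
fn global_score(a: &[u8], b: &[u8], mat: i32, mis: i32, open: i32, extend: i32) -> i32 {
    const NEG: i32 = i32::MIN / 2; // "minus infinity" that cannot overflow
    let m = b.len();
    let mut h = vec![0; m + 1]; // best score for each column of the current row
    let mut f = vec![NEG; m + 1]; // best score ending in a gap that consumes `a`
    for j in 1..=m {
        h[j] = open + extend * (j as i32 - 1);
    }
    for i in 1..=a.len() {
        let mut diag = h[0]; // value of the previous row, previous column
        h[0] = open + extend * (i as i32 - 1);
        let mut e = NEG; // best score ending in a gap that consumes `b`
        for j in 1..=m {
            e = (e + extend).max(h[j - 1] + open); // h[j - 1] is already this row
            f[j] = (f[j] + extend).max(h[j] + open); // h[j] is still the previous row
            let sub = diag + if a[i - 1] == b[j - 1] { mat } else { mis };
            diag = h[j];
            h[j] = sub.max(e).max(f[j]);
        }
    }
    h[m]
}
```

Comparing `global_score(a, b, ..)` with `global_score(b, a, ..)` on a set of test pairs, and then doing the same with block aligner's scores, would be one way to check the property empirically.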
The WFA (and related BiWFA) algorithms have provided some really nice asymptotic and practical advances for pairwise alignment under a set of realistic scoring functions. Do you have any thoughts on the relation of the block alignment idea with WFA? Could ideas from these two approaches be combined fruitfully? How would we expect their performance to compare?
I'm using the C API of block-aligner to align protein sequences from the UniProt database. There are * characters in some protein sequences. Currently, using block-aligner to align sequences containing * will cause a segmentation fault. Although users can work around it by mapping * to another supported char, it would be nice if * were supported internally! :)
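The workaround mentioned in this issue can be sketched as a one-line preprocessing step. The choice of X as the replacement is an assumption (it is commonly used for "unknown residue"); check that the scoring matrix in use accepts it. `mask_stops` is a hypothetical helper name, and the same loop is straightforward to write in C before calling the C API.

```rust
/// Replace every `*` in a protein sequence with `X` before alignment.
/// Assumes `X` is a character the scoring matrix supports.
fn mask_stops(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .map(|&c| if c == b'*' { b'X' } else { c })
        .collect()
}
```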
Hello Daniel,
Can you please show an example of how to support Neon on macOS with M1 and M1 Pro/Max using this library? A related question: the simdeez library only supports AVX2 and SSE2. Which library would allow support for Neon?
Thanks,
Jianshu
It seems that rand_mutate is slightly limited in the kinds of mutations it can generate, since it decides up front whether to generate a match/substitution/insertion/deletion for each position. This way, it can never insert two or more consecutive characters.
See this line: for each generated insertion, it directly pushes the relevant character from a after pushing the inserted character.
Something like an insertion followed by a substitution also can't be generated in this model.
Probably this doesn't matter too much in practice, but it's a slight deviation from the common model of generating the mutations one by one, as done in e.g. wfa.
I'd like to ask if it would be possible to support the alignment of DNA sequences using the C API? I would like to integrate block-aligner into an existing C/C++ tool but cannot use the existing C API to do alignments of DNA sequences.
Hi Daniel,
I confess I haven't quite figured out how to customize scoring for each nucleotide character, but I think I'm getting close. As you'll recall from my issue on triple_accel, I'm looking to compute nucleotide distances for pairs of sequences that ignore characters other than A, T, G, and C. Do I have it right that you can control that in block-aligner with the Matrix trait's set method? For example:
let mut scoring_matrix = NucMatrix::new_simple(1, -1);
scoring_matrix.set(b'N', b'A', 0);
scoring_matrix.set(b'N', b'T', 0);
scoring_matrix.set(b'N', b'G', 0);
scoring_matrix.set(b'N', b'C', 0);
If so, could you provide some guidance on how to use an updated NucMatrix in this portion of your readme example?
// Note that PaddedBytes, Block, and Cigar can be initialized with sequence length
// and block size upper bounds and be reused later for shorter sequences, to avoid
// repeated allocations.
let r = PaddedBytes::from_bytes::<NucMatrix>(b"TTAAAAAAATTTTTTTTTTTT", max_block_size);
let q = PaddedBytes::from_bytes::<NucMatrix>(b"TTTTTTTTAAAAAAATTTTTTTTT", max_block_size);
Thanks for the advice and for the excellent crate!
--Nick
I have a strange use case that I was hoping to use your project for. Specifically, I would like to use it for text matching rather than DNA sequencing. I understand that this may not be a typical use case for your project, but I was wondering if you could provide a simple example for how I could use it for this purpose.
Below is an example of a similar library in python for my use case.
import parasail
# extension_penalty must be <= open_penalty
open_penalty = 1
extension_penalty = 1
def calculate_score(ayah, search_term, profile=None):
    # Use the precomputed query profile when available; otherwise align directly.
    if profile is not None:
        result = parasail.sw_scan_profile_16(profile, ayah, open_penalty, extension_penalty)
    else:
        result = parasail.sw_scan_16(ayah, search_term, open_penalty, extension_penalty, parasail.blosum62)
    return result
query = 'hello world'
search_in = 'large-file.txt'
current_profile = parasail.profile_create_16(query, parasail.blosum62)
results = []
with open(search_in) as f:
    for line in f:
        line = line.rstrip('\n')
        result = calculate_score(line, query, current_profile)
        results.append((line, result.score, result.__dict__))
sorted_results = sorted(results, key=lambda item: item[1], reverse=True)
# Print the top five matches (fewer if the file is short).
for item in sorted_results[:5]:
    print(item)
Thank you!