daniel-liu-c0deb0t / block-aligner Goto Github PK

SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.

Home Page: https://crates.io/crates/block_aligner

License: MIT License

Rust 46.28% Python 0.75% Shell 0.87% Makefile 0.05% C 3.04% Jupyter Notebook 49.02%

algorithms alignment avx2 bioinformatics neon rust simd wasm webassembly

block-aligner's People

jianshu93 schaudge milot-mirdita ragnargrootkoerkamp animesh thedegeneratedev5150

block-aligner's Issues

`Cargo.toml` in `c` directory is detected as duplicate package

When including block aligner as a dependency, the file at
https://github.com/Daniel-Liu-c0deb0t/block-aligner/blob/main/c/Cargo.toml
gives this warning:

warning: skipping duplicate package `block-aligner` found at `~/.local/share/cargo/git/checkouts/block-aligner-253559227b640923/4014339/c`

It's probably best to rename the package to something else there.

Try out convergence of traceback pointers

It should not be too hard to keep track of when traceback pointers converge, which would save memory during alignment.

Is there a reason gap_open has to be > gap_extend and not >= gap_extend?

If I try to use

let gap_penalties = Gaps { open: -1, extend: -1 };

I get an assertion error. I understand that, given the way scoring works, the gap open penalty needs to include the gap extension, since a gap of length one is scored as just the open penalty. However, I would like to be able to do alignments where, e.g., a 1 bp gap costs -1 and a 2 bp gap costs -2, which is not possible with the current check.

Is there any reason to require that open > extend instead of open >= extend?
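To make the arithmetic in the question concrete, here is a minimal sketch of the gap cost under the convention described above (a gap of length one costs just the open penalty). The function name `gap_score` is made up for illustration and is not part of block-aligner.

```rust
/// Score of one gap of `len` characters, assuming the convention from the
/// issue text: the first gap character costs `open`, and every additional
/// character costs `extend`. Penalties are negative numbers.
/// `gap_score` is a hypothetical helper, not a block-aligner function.
fn gap_score(open: i32, extend: i32, len: u32) -> i32 {
    if len == 0 {
        return 0;
    }
    open + extend * (len as i32 - 1)
}
```

With `open = -1` and `extend = -1` this gives -1 for a 1 bp gap and -2 for a 2 bp gap, which is the linear gap model the question asks for.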

Is using -0.5 as extend penalty supported?

Hi!

I am new to Rust, so I may be misunderstanding something, but I was wondering whether an extend penalty of -0.5 is supported. I noticed that the Gaps struct takes i8, so I am guessing this is not possible?

Also, just curious, what led you to choose the scoring setup for proteins that you did in the paper?
"For scoring, we use BLOSUM62 with (Gopen = -11, Gext = -1) for proteins"

Thanks for this library!
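One common workaround for integer-only scores, offered here as a sketch rather than as the library's recommendation, is to multiply every score (substitution scores, gap open, gap extend, and any score threshold) by the same positive factor so that the fractional penalty becomes a whole number. Scaling everything by one factor does not change which alignment is optimal; the final score is simply divided by the factor afterwards. The helper `scale_to_i8` is hypothetical.

```rust
/// Multiply a possibly fractional score by `factor` and return it as an i8,
/// or None if the result is not a whole number or does not fit in an i8.
/// `scale_to_i8` is a hypothetical helper, not a block-aligner function.
fn scale_to_i8(value: f64, factor: i32) -> Option<i8> {
    let scaled = value * factor as f64;
    let fits = scaled >= i8::MIN as f64 && scaled <= i8::MAX as f64;
    if scaled.fract() != 0.0 || !fits {
        return None;
    }
    Some(scaled as i8)
}
```

For example, with a factor of 2 an extend penalty of -0.5 becomes -1 and an open penalty of -11 becomes -22. Every substitution score must be doubled as well, and all scaled values still need to fit in an i8.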

semi-global alignment

Hello Daniel,

I finally need to use block aligner for real-world metagenomic applications. I have long reads from PacBio (1k to 20k) that I want to align in pairs. They can be highly similar (e.g., > 90% sequence identity), but only within the overlapping region: two sequences might have only 50% of their length aligned, with the aligned region at very high identity. This means I need semi-global alignment, and I need to know the aligned length so that I can calculate the alignment ratio. It would be nice if I could also detect identities a little below 90%, e.g., 85%.

I do not recall whether block aligner can do semi-global alignment, that is, alignment where gaps opened at either end are not penalized (or receive only a very small penalty, as in vsearch --all_pairs_global, where those low-penalty gaps do not appear in the final alignment). I also read the adaptive banded aligner paper, but I am still not sure how to compute the aligned ratio (e.g., aligned positions / query length). I guess both adaptive banded DB and block aligner could be adapted to compute the semi-global alignment ratio and the identity of the aligned region? This is what is needed in real-world applications. Any suggestions? Let me know if I am not clear.

Thanks,

Jianshu
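The aligned ratio and identity asked about above can be computed from any alignment path once end gaps are set aside. The sketch below assumes a CIGAR-style string where `=` is a match, `X` a mismatch, `I` a gap that consumes the query, and `D` a gap that consumes the reference. That letter convention, the `AlignStats` type, and `cigar_stats` are assumptions for illustration and are not taken from block-aligner.

```rust
/// Counts over the aligned region (end gaps excluded). Hypothetical type.
#[derive(Debug, PartialEq)]
struct AlignStats {
    matches: usize,       // '=' columns
    columns: usize,       // all columns between the first and last non-gap op
    query_aligned: usize, // query characters inside that region
}

impl AlignStats {
    fn identity(&self) -> f64 {
        self.matches as f64 / self.columns.max(1) as f64
    }
    fn query_coverage(&self, query_len: usize) -> f64 {
        self.query_aligned as f64 / query_len.max(1) as f64
    }
}

/// Parse a CIGAR string made of '=', 'X', 'I', 'D' runs. Leading and
/// trailing gap runs are treated as free end gaps and ignored.
/// Returns None if the string is malformed.
fn cigar_stats(cigar: &str) -> Option<AlignStats> {
    let mut ops: Vec<(usize, char)> = Vec::new();
    let mut num = 0usize;
    let mut has_digit = false;
    for c in cigar.chars() {
        if let Some(d) = c.to_digit(10) {
            num = num * 10 + d as usize;
            has_digit = true;
        } else if has_digit && matches!(c, '=' | 'X' | 'I' | 'D') {
            ops.push((num, c));
            num = 0;
            has_digit = false;
        } else {
            return None;
        }
    }
    if has_digit {
        return None; // trailing number without an operation
    }
    let is_gap = |op: &(usize, char)| op.1 == 'I' || op.1 == 'D';
    let start = ops.iter().position(|op| !is_gap(op)).unwrap_or(ops.len());
    let end = ops.iter().rposition(|op| !is_gap(op)).map_or(start, |i| i + 1);
    let mut stats = AlignStats { matches: 0, columns: 0, query_aligned: 0 };
    for &(n, op) in &ops[start..end] {
        stats.columns += n;
        if op == '=' {
            stats.matches += n;
        }
        if op != 'D' {
            stats.query_aligned += n;
        }
    }
    Some(stats)
}
```

For a path such as `5I10=2X3D8=4D` on a 25-character query, the leading `5I` and trailing `4D` are dropped, leaving 23 columns with 18 matches and 20 aligned query characters.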

Custom AAMatrix from C API?

Hi,

I was wondering if it is possible to define a custom AAMatrix from C API?

Thanks,

Realignment

It should be pretty easy to add a realignment feature based on a known traceback path (can be approximate). The block will just shift to follow that path.

Option to reuse memory when block aligner is reallocated

Profiling shows that this could save maybe 20% of time when block aligner is repeatedly created and destroyed. One way to reuse memory could be to allow an old block aligner instance to be moved into a new one, and the new one can repurpose the allocated memory regions of the old one.
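The idea of moving an old instance into a new one can be sketched with plain `Vec` buffers. The `Aligner` type and its fields below are stand-ins invented for this example; they do not reflect block-aligner's actual internals.

```rust
/// Hypothetical stand-in for an aligner that owns scratch buffers.
struct Aligner {
    scores: Vec<i16>,
    trace: Vec<u8>,
}

impl Aligner {
    /// Allocate fresh buffers sized for the given sequence lengths.
    fn new(query_len: usize, ref_len: usize) -> Self {
        let n = query_len + ref_len;
        Aligner { scores: vec![0; n], trace: vec![0; n] }
    }

    /// Build a new aligner by taking ownership of `old` and reusing its
    /// buffers. `clear` keeps the capacity, so `resize` only allocates
    /// when the new size exceeds what the old instance already had.
    fn recycle(old: Aligner, query_len: usize, ref_len: usize) -> Self {
        let n = query_len + ref_len;
        let Aligner { mut scores, mut trace } = old;
        scores.clear();
        scores.resize(n, 0);
        trace.clear();
        trace.resize(n, 0);
        Aligner { scores, trace }
    }
}
```

Because `recycle` takes `old` by value, the caller cannot accidentally keep using the old instance after its memory has been repurposed.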

question on block aligner score and lowest identity allowed to be accurate

Hello Daniel,

I have a question about the global alignment score of block aligner: is it a metric (https://en.wikipedia.org/wiki/Metric_space)? In particular, does score(a, b) = score(b, a) hold for global alignment?

Also, what is the lowest identity block aligner can handle? I saw that your protein example goes down to ~50% identity. How about even lower identities, and DNA, given that it is not full dynamic programming? Wavefront is fast only when the two sequences are highly similar, e.g., 95% identity or above. I am wondering whether this is also the case for block aligner.

Another question concerns semi-global alignment: both vsearch and usearch use a much smaller penalty for gaps at the ends of the two sequences than for gaps inside them, to approximate "semi-global" alignment, because in practice semi-global alignment does not penalize gaps at either end.

Thanks,

Jianshu

Thoughts on relation to / comparison with WFA

The WFA (and related BiWFA) algorithms have provided some really nice asymptotic and practical advances for pairwise alignment under a set of realistic scoring functions. Do you have any thoughts on the relation of the block alignment idea with WFA? Could ideas from these two approaches be combined fruitfully? How would we expect their performance to compare?

Investigate performance regressions

For some reason, the prefix scan benchmarks are slower than before. No major changes were made to it.
Traceback is a lot slower after changes to make it more correct when tracing back gap open/extend.

Support alignment of protein sequences containing "*"

I'm using the C API of block-aligner to align protein sequences from the UniProt database. Some protein sequences contain *. Currently, using block-aligner to align sequences containing * causes a segmentation fault. Although users can work around this by mapping * to another supported character, it would be nice if * were supported internally! :)
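The workaround mentioned above (mapping * to another character before alignment) can be sketched as follows. The helper name `mask_stars` is made up, and choosing `X` as the replacement is an assumption; whether it is appropriate depends on the scoring matrix in use.

```rust
/// Return a copy of `seq` with every '*' replaced by `replacement`.
/// The sequence length is unchanged, so alignment coordinates still refer
/// to positions in the original sequence.
/// `mask_stars` is a hypothetical helper, not a block-aligner function.
fn mask_stars(seq: &[u8], replacement: u8) -> Vec<u8> {
    seq.iter()
        .map(|&c| if c == b'*' { replacement } else { c })
        .collect()
}
```

For instance, `mask_stars(b"MKV*LL*", b'X')` yields the bytes of `MKVXLLX`.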

Neon and related

Hello Daniel,

Can you please show an example of how to use Neon on macOS with M1 and M1 Pro/Max? A related question: the simdeez library only supports AVX2 and SSE2. Which library would allow Neon support?

Thanks,

Jianshu

`rand_mutate` cannot generate consecutive insertions

It seems that rand_mutate is slightly limited in the kinds of mutations it can generate, since it decides up front whether to generate a match/substitution/insertion/deletion for each position. This way, it can never insert two or more consecutive characters.

See this line: For each generated insertion, it directly pushes the relevant character from a after pushing the inserted character.

Something like an insertion followed by a substitution also can't be generated in this model.

Probably this doesn't matter too much in practice, but it's a slight deviation from the common model of generating the mutations one by one, as done in e.g. wfa.
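For comparison, the one-by-one model mentioned above can be sketched like this: each of the `k` edits picks its own type and position in the current string, so consecutive insertions, or an insertion next to a substitution, can occur. This is an illustration only, not the crate's `rand_mutate` and not WFA's generator; the small xorshift generator is included just to avoid external crates.

```rust
/// Minimal xorshift64 generator (the seed must be nonzero).
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
    /// Pseudo-random index in 0..n (n must be > 0).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Apply `k` edits one at a time to a copy of `a`. Each edit is a
/// substitution, insertion, or deletion at a random position of the
/// *current* string. A substitution may pick the same character again.
fn mutate_one_by_one(a: &[u8], k: usize, alphabet: &[u8], rng: &mut XorShift) -> Vec<u8> {
    assert!(!alphabet.is_empty(), "alphabet must not be empty");
    let mut b = a.to_vec();
    for _ in 0..k {
        // 0 = substitution, 1 = insertion, 2 = deletion.
        // An empty string can only receive an insertion.
        let kind = if b.is_empty() { 1 } else { rng.below(3) };
        let c = alphabet[rng.below(alphabet.len())];
        match kind {
            0 => {
                let i = rng.below(b.len());
                b[i] = c;
            }
            1 => {
                let i = rng.below(b.len() + 1);
                b.insert(i, c);
            }
            _ => {
                let i = rng.below(b.len());
                b.remove(i);
            }
        }
    }
    b
}
```

Since every edit changes the length by at most one, the output length always lies between `a.len() - k` and `a.len() + k`.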

Support Alignment of DNA sequences with C API

I'd like to ask whether it would be possible to support the alignment of DNA sequences using the C API. I would like to integrate block-aligner into an existing C/C++ tool but cannot use the existing C API to do alignments of DNA sequences.

Question: Syntax for modifying the nucleotide scoring matrix

Hi Daniel,

I confess I haven't quite figured out how to customize scoring for each nucleotide character, but I think I'm getting close. As you'll recall from my issue on triple_accel, I'm looking to compute nucleotide distances for pairs of sequences that ignore characters other than A, T, G, and C. Do I have it right that you can control that in block-aligner with the Matrix trait's set method? For example:

let mut scoring_matrix = NucMatrix::new_simple(1, -1);
scoring_matrix.set(b'N', b'A', 0);
scoring_matrix.set(b'N', b'T', 0);
scoring_matrix.set(b'N', b'G', 0);
scoring_matrix.set(b'N', b'C', 0);

If so, could you provide some guidance on how to use an updated NucMatrix in this portion of your readme example?

// Note that PaddedBytes, Block, and Cigar can be initialized with sequence length
// and block size upper bounds and be reused later for shorter sequences, to avoid
// repeated allocations.
let r = PaddedBytes::from_bytes::<NucMatrix>(b"TTAAAAAAATTTTTTTTTTTT", max_block_size);
let q = PaddedBytes::from_bytes::<NucMatrix>(b"TTTTTTTTAAAAAAATTTTTTTTT", max_block_size);

Thanks for the advice and for the excellent crate!
--Nick

Use staticlib

fuzzy text finding

I have a strange use case that I was hoping to use your project for. Specifically, I would like to use it for text matching rather than DNA sequencing. I understand that this may not be a typical use case for your project, but I was wondering if you could provide a simple example for how I could use it for this purpose.

Below is an example of a similar library in python for my use case.

import parasail

# extension_penalty must be <= open_penalty
open_penalty = 1
extension_penalty = 1


def calculate_score(ayah, search_term, profile=None):
    # Use the precomputed query profile when one is given; otherwise align directly.
    if profile is not None:
        return parasail.sw_scan_profile_16(profile, ayah, open_penalty, extension_penalty)
    return parasail.sw_scan_16(ayah, search_term, open_penalty, extension_penalty, parasail.blosum62)


query = 'hello world'
search_in = 'large-file.txt'

# Build the query profile once and reuse it for every line.
current_profile = parasail.profile_create_16(query, parasail.blosum62)

results = []
with open(search_in) as f:
    for line in f:
        line = line.rstrip('\n')
        result = calculate_score(line, query, current_profile)
        results.append((line, result.score, result.__dict__))

sorted_results = sorted(results, key=lambda item: item[1], reverse=True)

# Print the five best-scoring lines (fewer if the file is shorter).
for item in sorted_results[:5]:
    print(item)

Thank you!
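As a point of reference for what the Python snippet computes, here is a minimal Rust sketch of a local (Smith-Waterman style) alignment score over arbitrary bytes. It uses plain dynamic programming with a simple match/mismatch score and a linear gap penalty instead of BLOSUM62, and it does not use block-aligner's API. The function name and parameters are made up for this example.

```rust
/// Best local alignment score between `text` and `pattern`.
/// `match_s` is added for equal bytes, `mismatch` for unequal bytes, and
/// `gap` for each gap character (the last two are normally negative).
/// Plain O(|text| * |pattern|) dynamic programming, kept to two rows.
fn local_score(text: &[u8], pattern: &[u8], match_s: i32, mismatch: i32, gap: i32) -> i32 {
    let mut prev = vec![0i32; pattern.len() + 1];
    let mut best = 0;
    for &t in text {
        let mut cur = vec![0i32; pattern.len() + 1];
        for (j, &p) in pattern.iter().enumerate() {
            let diag = prev[j] + if t == p { match_s } else { mismatch };
            let up = prev[j + 1] + gap; // gap in the pattern
            let left = cur[j] + gap;    // gap in the text
            // Local alignment: a cell never drops below zero.
            let s = diag.max(up).max(left).max(0);
            cur[j + 1] = s;
            best = best.max(s);
        }
        prev = cur;
    }
    best
}
```

Scoring every line of a file with `local_score(line.as_bytes(), query.as_bytes(), 1, -1, -1)` and sorting by the result mirrors the ranking loop in the Python example. A line that contains the query exactly scores the query length, and a line missing one character of it scores two less.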
