Comments (7)
I actually designed this library to support operating on arbitrary bytes! I assume you are doing "global" comparison (each line is compared end-to-end to the query). You can modify the example here, but replace NucMatrix and NW1 with a custom byte scoring matrix. I don't think you should be using blosum62 with parasail because that score is only meaningful for amino acids. For arbitrary bytes, having a simple match bonus and mismatch penalty makes more sense.
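To make the "match bonus and mismatch penalty" idea concrete, here is a minimal plain-Rust sketch of global (end-to-end) alignment scoring over raw bytes. It is not block-aligner's implementation or API: the function name `global_score`, the linear gap penalty, and the score values in the usage note are assumptions chosen for illustration only.

```rust
/// Global (end-to-end) alignment score of two byte strings.
/// Hypothetical helper: scoring values are supplied by the caller and
/// a linear gap penalty is assumed (block-aligner itself uses open/extend).
fn global_score(a: &[u8], b: &[u8], match_bonus: i32, mismatch: i32, gap: i32) -> i32 {
    // prev[j] = best score for aligning the processed prefix of `a` with b[..j].
    let mut prev: Vec<i32> = (0..=b.len() as i32).map(|j| j * gap).collect();
    let mut curr = vec![0i32; b.len() + 1];
    for i in 1..=a.len() {
        // Aligning a[..i] against an empty prefix of `b` costs i gaps.
        curr[0] = i as i32 * gap;
        for j in 1..=b.len() {
            // Any two equal bytes score the bonus; any two different bytes the penalty.
            let sub = if a[i - 1] == b[j - 1] { match_bonus } else { mismatch };
            let diag = prev[j - 1] + sub; // align a[i-1] with b[j-1]
            let up = prev[j] + gap; // a[i-1] aligned to a gap
            let left = curr[j - 1] + gap; // b[j-1] aligned to a gap
            curr[j] = diag.max(up).max(left);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

With a bonus of 1 and penalties of -1, `global_score(b"hello", b"hallo", 1, -1, -1)` gives 3 (four matches, one mismatch). Note that `global_score(b"hi", b"hello hi world", 1, -1, -1)` gives -10: in an end-to-end comparison the twelve unmatched bytes of the longer line dominate the score even though the query occurs in it exactly.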
from block-aligner.
I see. So is this how I would go about it?
I noticed it's not very fast so I'm sure I must be doing something wrong here.
If there is a more efficient way please let me know. Thank you, Daniel
use std::fs;
use std::io::stdin;

// Import paths assumed from the block-aligner README; adjust for your crate version.
use block_aligner::scan_block::*;
use block_aligner::scores::*;

fn main() {
    let filename = "raw/transliteration.txt";
    // `expect` does not interpolate `{filename}`, so build the panic message explicitly.
    let transliteration_bytes: Vec<u8> = fs::read(filename)
        .unwrap_or_else(|e| panic!("Could not read {filename}: {e}"));
    println!("File byte length : {}", transliteration_bytes.len());
    let input = get_user_input();

    // Setup the block alignment
    let block_size = 16;
    let gaps = Gaps {
        open: -2,
        extend: -1,
    };
    let query = PaddedBytes::from_bytes::<ByteMatrix>(input.as_bytes(), block_size);
    let reference = PaddedBytes::from_bytes::<ByteMatrix>(&transliteration_bytes, block_size);

    // Align with traceback, but no x drop threshold.
    let mut a = Block::<true, false>::new(query.len(), reference.len(), block_size);
    a.align(&query, &reference, &BYTES1, gaps, block_size..=block_size, 0);
    let res = a.res();
    println!("Result is {:?}", res);
}

fn get_user_input() -> String {
    let mut line = String::new();
    stdin().read_line(&mut line).expect("unable to read input");
    // Drop the trailing newline so it is not aligned as part of the query.
    line.trim_end().to_string()
}
What is the file byte length? You seem to be using block aligner correctly, but you aren't comparing each line of the file to the query input, you are comparing the whole file. Is that intentional?
Byte size : 609444.
The file has many lines. 6236 to be exact.
I guess searching line by line is the best option here. Also running in release mode significantly sped things up so I'm optimistic about using your great library to accomplish this task.
Thanks
I set it up to run line by line, but lines that include the query get the same score as lines that don't.
Am I missing a step here?
// Setup the block alignment
let block_size = 16;
let gaps = Gaps {
    open: -2,
    extend: -1,
};
let query = PaddedBytes::from_bytes::<ByteMatrix>(input.as_bytes(), block_size);
for line in &transliteration_lines {
    let reference = PaddedBytes::from_bytes::<ByteMatrix>(line.as_bytes(), block_size);
    // Align with traceback, but no x drop threshold.
    let mut a = Block::<true, false>::new(query.len(), reference.len(), block_size);
    a.align(
        &query,
        &reference,
        &BYTES1,
        gaps,
        block_size..=block_size,
        0,
    );
    let res = a.res();
    // add res score and line number to list
    ...
}
// sort result list by score and print the top results
...
Oh yes, release mode is very important in Rust.
I'm a little confused as to what you are trying to do. Do you want to search for some query within each line? Or do you want to match the entire line against the query? It seems like you want to search for some query within the line (e.g., searching for "hi" in the line "hello hi world").
Block aligner's scoring method is for global alignment (match the entire line against the query, end-to-end). If you want to search for a query that could be anywhere within the line, you should use triple_accel (my other library), which has a search feature. This search will also return the locations where the query appears in the line.
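The difference between global comparison and searching can be sketched in a few lines of plain Rust. The function below is a hypothetical helper, not triple_accel's API or implementation: it computes the smallest edit distance between the query and any substring of the line, plus where one best match ends. The only changes from an end-to-end edit distance are that a match may start at any position for free (the top cell of every column stays 0) and that the best value over all end positions is kept.

```rust
/// Smallest edit distance between `needle` and any substring of `haystack`,
/// with the exclusive end index of the first best match found.
/// Hypothetical helper for illustration; unit costs are assumed.
fn best_substring_match(needle: &[u8], haystack: &[u8]) -> (usize, usize) {
    // col[i] = fewest edits to turn needle[..i] into some substring of
    // `haystack` ending at the current position.
    let mut col: Vec<usize> = (0..=needle.len()).collect();
    let mut best = (col[needle.len()], 0);
    for (j, &h) in haystack.iter().enumerate() {
        // col[0] is never updated and stays 0: a match may start anywhere.
        let mut diag = col[0]; // previous column's value at row i - 1
        for i in 1..=needle.len() {
            let cost = if needle[i - 1] == h { 0 } else { 1 };
            let new = (diag + cost) // match or substitution
                .min(col[i] + 1) // extra byte in the haystack
                .min(col[i - 1] + 1); // needle byte with no counterpart
            diag = col[i];
            col[i] = new;
        }
        // Keep the first position where the full needle matches best.
        if col[needle.len()] < best.0 {
            best = (col[needle.len()], j + 1);
        }
    }
    best
}
```

For the example above, `best_substring_match(b"hi", b"hello hi world")` returns `(0, 8)`: an exact occurrence ending just before byte index 8, regardless of how long the rest of the line is.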
Thank you for the explanation. I tested out your triple_accel library and it's great: it's very performant. I am using this for Arabic text matching, which doesn't do very well with Levenshtein distance. Do you know of any libraries that make use of the Smith-Waterman algorithm for text matching?
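For reference, Smith-Waterman differs from global alignment in two ways: every cell is clamped at zero, so an alignment can restart anywhere, and the result is the best cell in the whole table rather than the last one. The following is a minimal sketch, not taken from any particular library. The name `smith_waterman`, the linear gap penalty, and the score values are assumptions. It compares `char`s rather than bytes so that each multi-byte Arabic letter counts as one symbol.

```rust
/// Best local alignment score between two strings, compared per `char`.
/// Hypothetical helper: linear gap penalty and caller-supplied scores.
fn smith_waterman(a: &str, b: &str, match_bonus: i32, mismatch: i32, gap: i32) -> i32 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // Row and column 0 stay at zero: a local alignment may start anywhere.
    let mut prev = vec![0i32; b.len() + 1];
    let mut curr = vec![0i32; b.len() + 1];
    let mut best = 0i32;
    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let sub = if a[i - 1] == b[j - 1] { match_bonus } else { mismatch };
            curr[j] = (prev[j - 1] + sub)
                .max(prev[j] + gap)
                .max(curr[j - 1] + gap)
                .max(0); // clamp: drop a bad prefix and restart here
            // The answer is the best cell anywhere, not the last cell.
            best = best.max(curr[j]);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    best
}
```

With a match bonus of 2 and penalties of -1, `smith_waterman("hello hi world", "hi", 2, -1, -1)` gives 4, the score of the exact two-character occurrence, independent of the line's length.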