smarco / biwfa-paper

Bidirectional WFA (Paper)

License: Other

Makefile 1.70% C 89.80% C++ 2.44% Python 0.86% R 5.14% Shell 0.06%

biwfa-paper's People

bkbonde schaudge ragnargrootkoerkamp

biwfa-paper's Issues

Alignment ranges

In the WFALM library there is an API that produces a seeded local alignment ( ends-free extension ) specified using two anchor points:

aligner_sw.wavefront_align_local_low_mem(local_seq1.c_str(), local_seq1.size(),
                                                               local_seq2.c_str(), local_seq2.size(),
                                                               anchor_begin_1, anchor_end_1,
                                                               anchor_begin_2, anchor_end_2,
                                                               false);

I assume, but haven't verified that this is roughly equivalent to the following in the BiWFA/WFA2 API:

    // Right extension from anchor ends
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = anchor_end_1;
    attributes.alignment_form.pattern_end_free = strlen(seq1);
    attributes.alignment_form.text_begin_free = anchor_end_2;
    attributes.alignment_form.text_end_free = strlen(seq2);

    // Followed by Left extension from anchor starts
    attributes.alignment_form.span = alignment_endsfree;
    attributes.alignment_form.pattern_begin_free = anchor_start_1;
    attributes.alignment_form.pattern_end_free = 0;
    attributes.alignment_form.text_begin_free = anchor_start_2;
    attributes.alignment_form.text_end_free = 0;

The use of begin/end_free is a bit confusing in your left/right extension example as it pertains to this use case. Specification aside, are these roughly equivalent operations between the libraries? Would it be possible to request that future releases include the alignment ranges in the output, as was done in wfalm? They could be gleaned from the CIGAR, but that is less efficient to process, and in many cases the alignment details may not even be needed while the range of the extension is.
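For reference, recovering the range from a CIGAR is a single pass over the string. The sketch below assumes a run-length CIGAR with SAM-style operations (M, = and X consume both sequences, I consumes only the query, D consumes only the reference). The names `cigar_span` and `cigar_span_t` are made up for this illustration and are not part of WFA2 or wfalm, and a library's own I/D convention may be the other way round.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Number of query and reference characters covered by an alignment. */
typedef struct {
    long query_len;
    long ref_len;
} cigar_span_t;

/* Hypothetical helper: walks a run-length CIGAR such as "37=128D24=".
 * Returns 0 on success, -1 if the string is malformed. */
int cigar_span(const char *cigar, cigar_span_t *span) {
    assert(cigar != NULL && span != NULL);
    span->query_len = 0;
    span->ref_len = 0;
    for (const char *p = cigar; *p != '\0';) {
        if (!isdigit((unsigned char)*p)) return -1; /* every op needs a length */
        long len = 0;
        while (isdigit((unsigned char)*p)) len = len * 10 + (*p++ - '0');
        switch (*p++) {
            case 'M': case '=': case 'X': /* consumes both sequences */
                span->query_len += len;
                span->ref_len += len;
                break;
            case 'I': /* assumed: consumes the query only */
                span->query_len += len;
                break;
            case 'D': /* assumed: consumes the reference only */
                span->ref_len += len;
                break;
            default:
                return -1;
        }
    }
    return 0;
}
```

Adding the two lengths to the anchor coordinates gives the end of the extension on each sequence, which is the information wfalm reports directly.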

Data sources

Hi! Nice to see this out in the open! Looks like some nice results, so I will now replace WFA and variants with just BiWFA in my evals.

Since you now also test on longer sequences, I'd again like to run on the same data.
I see you posted a set of instructions to run the evals given the data, but I don't see links to the original data sources, and your paper doesn't contain any accession numbers.

For purposes of reproducibility and efficiency, I would much appreciate it if you could again share your input data, ideally with some scripts to generate them from raw downloads. (I'm completely inexperienced with any kind of data manipulation tools for fasta/sam files for now, so the probability of making a mistake while I try to reproduce your work is high.)

Also, could you provide a bit more detail on the data? It would be nice to know the average error rate and error model, and the average s for the two datasets, since this matters a lot for e.g. the relative performance of WFA and edlib (looking at the table in the WFA paper).

Cheers!

Incorrect score and alignment under 2-piece penalty (specific to BiWFA)

I am using x=4, o1=4, e1=2, o2=15 and e2=1. The code to call BiWFA is here.

First example – correct alignment but incorrect score

TAGGGGCAGACTGACACCTCACACGGCCGGGTACTCCAACAGACCTGCAGCTGAGGGTCCT

TAGGGGCAGACTGACACCTCACACGGCCGGGTACTCCTCTGAGACAAAACTTCCAGAGGAACGATCAGACAGCAGCATTCGCGGTTCATGAAAATCCGCTGTTCTGCAGCCACCGCTGCTGGTACCCAGGCAAACAGGGTCTAGAGTGGACCTTTAGCAAACTCCAACAGACCTGCAGCTGAGGGTCCT

The CIGAR should be 37=128D24=. The correct alignment score should thus be 128+15=143 (the 128-base deletion scored with the second gap piece, o2 + 128*e2). BiWFA gives a score of 158. The CIGAR string is correct.

Second example – incorrect alignment and score

TAGGGGCAGACTGACACCTCACACGGCCGGGTACTCCAACAGACCTGCAGCTGAGGGTCCT

TAGGGGCAGACTGACACCTCACACGGCCGGGTACTCCTCTGAGACAAAACTTCCAGAGGAACGATCAGACAGCAGCATTCGCGGTTCATGAAAATCCGCTGTTCTGCAGCCACCGCTGCTGGTACCCAGGCAAACAGGGTCTAGAGTGGACCTTTAGCAAACTCCAACAGACCTGCAGCTGAGGGTCCT

The optimal CIGAR is 1X16=1X14=128D4=1X24= with a score of 155. BiWFA gives a score of 166 and a wrong CIGAR, 1X16M1X18M58D1M70D24M. In this example, the score is consistent with the CIGAR.

I have also seen incorrect score and/or CIGAR for other pairs. The (1-piece) affine gap mode is correct on all pairs I have seen.
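The scores quoted above can be re-derived from the CIGARs alone. The sketch below is an independent check, not BiWFA code: it assumes that a gap of length l costs min(o1 + l*e1, o2 + l*e2), that each I or D run is one gap, and that M counts as a match. The names `affine2p_t` and `cigar_score_affine2p` are made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Two-piece gap-affine penalties: mismatch x, and (open, extend) per piece. */
typedef struct {
    int x, o1, e1, o2, e2;
} affine2p_t;

/* Cost of a single gap of length len: the cheaper of the two affine pieces. */
static long gap_cost(const affine2p_t *p, long len) {
    const long c1 = p->o1 + p->e1 * len;
    const long c2 = p->o2 + p->e2 * len;
    return c1 < c2 ? c1 : c2;
}

/* Hypothetical helper: scores a run-length CIGAR. Returns -1 if malformed. */
long cigar_score_affine2p(const char *cigar, const affine2p_t *p) {
    assert(cigar != NULL && p != NULL);
    long score = 0;
    for (const char *c = cigar; *c != '\0';) {
        if (!isdigit((unsigned char)*c)) return -1;
        long len = 0;
        while (isdigit((unsigned char)*c)) len = len * 10 + (*c++ - '0');
        switch (*c++) {
            case '=': case 'M': /* assumed: M is a match and costs nothing */
                break;
            case 'X':
                score += (long)p->x * len;
                break;
            case 'I': case 'D': /* each run is scored as one gap */
                score += gap_cost(p, len);
                break;
            default:
                return -1;
        }
    }
    return score;
}
```

With x=4, o1=4, e1=2, o2=15, e2=1 this yields 143 for 37=128D24=, 155 for 1X16=1X14=128D4=1X24=, and 166 for 1X16M1X18M58D1M70D24M, matching the numbers in this report.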

By the way, per SAM spec, the character for sequence match in CIGAR is =, not M.

Possible bug in Algorithm 1

Hey @smarco @ekg ,

I was implementing BiWFA myself, and as expected I re-discovered most of the technical details you mention in section 2.3 😄 🙈.

One particular test case that I ran into is the following (first a conceptual explanation, then an example):

Conceptual case:

Suppose the mismatch cost is x=1, and the gap cost is e=10, without open cost (o=0).

The optimal path A has length 100, with an indel in the middle, from s_f=45 to s_f=55.
This path is found after expanding s_f=s_r=55 from the start and end.

There is also a suboptimal path B of length 101 with a substitution in the middle, from s_f=50 to s_f=51. This path is found after expanding s_f=51 from the start and s_r=50 from the end.

In Algorithm 1 in the paper, you run BIWFA_BREAKPOINT while s_f + s_r - o < s_b. Path B will match at s_f=51 and s_r=50, giving s_b=101. At this point, further increasing either s_f or s_r will terminate the algorithm.
However, the optimal path A of cost 100 is only found at s_f = s_r = 55.

Example:

x=1, o=0, e=3:
seqs:

CGC
CACG

The exact alignment cost is 4 via 1M1X1M1I, where the meeting is at s_f = 1, s_r = 3.

My implementation of BiWFA, when iterating till s_f+s_r-o<s_b finds cost 5 via 1M1I2X at s_f = 3 and s_r = 2.
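The exact cost for this small example can be confirmed independently of any wavefront code with a plain quadratic gap-affine DP. The sketch below is a textbook Gotoh-style recurrence, not part of BiWFA; it assumes a gap of length l costs o + l*e, and the name `affine_cost` is made up for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

static int min2(int a, int b) { return a < b ? a : b; }

/* Exact global gap-affine cost in O(n*m) time and memory.
 * x: mismatch, o: gap open, e: gap extend. */
int affine_cost(const char *a, const char *b, int x, int o, int e) {
    const size_t n = strlen(a), m = strlen(b), w = m + 1;
    const int INF = INT_MAX / 4; /* large, but safe to add penalties to */
    int *M = malloc(3 * (n + 1) * w * sizeof *M);
    assert(M != NULL);
    int *I = M + (n + 1) * w; /* best cost ending in a gap that consumes b */
    int *D = I + (n + 1) * w; /* best cost ending in a gap that consumes a */
    for (size_t i = 0; i <= n; ++i) {
        for (size_t j = 0; j <= m; ++j) {
            const size_t c = i * w + j;
            I[c] = j > 0 ? min2(M[c - 1] + o + e, I[c - 1] + e) : INF;
            D[c] = i > 0 ? min2(M[c - w] + o + e, D[c - w] + e) : INF;
            int best = (i == 0 && j == 0) ? 0 : INF;
            if (i > 0 && j > 0) {
                best = M[c - w - 1] + (a[i - 1] == b[j - 1] ? 0 : x);
            }
            M[c] = min2(best, min2(I[c], D[c])); /* best over all states */
        }
    }
    const int cost = M[n * w + m];
    free(M);
    return cost;
}
```

Calling `affine_cost("CGC", "CACG", 1, 0, 3)` returns 4, the cost of 1M1X1M1I, so the cost-5 alignment is indeed suboptimal.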

Resolving it:
Simply iterating while s_f + s_r - p < s_b (where p=max(x, o+e)) should resolve this, which also corresponds to section 2.3.

~~Actually, I think it is sufficient to iterate while s_f + s_r - floor(p/2) < s_b, but this would require a bit more formal argumentation.~~ No that's probably not true; increasing both s_f and s_r by p/2 is already captured by doing -p.
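To make the difference between the two loop conditions concrete, here is a minimal sketch of both predicates as stated in this issue. The function names are made up for this illustration and do not come from the BiWFA code base.

```c
#include <assert.h>

/* Loop condition as written in Algorithm 1 of the paper. */
int biwfa_continue_paper(int s_f, int s_r, int s_b, int o) {
    assert(s_f >= 0 && s_r >= 0);
    return s_f + s_r - o < s_b;
}

/* Loop condition proposed here, with p = max(x, o + e). */
int biwfa_continue_fixed(int s_f, int s_r, int s_b, int x, int o, int e) {
    assert(s_f >= 0 && s_r >= 0);
    const int p = x > o + e ? x : o + e;
    return s_f + s_r - p < s_b;
}
```

In the conceptual case (x=1, o=0, e=10), path B gives s_b=101 at s_f=51, s_r=50. The paper's condition evaluates 101 < 101 and stops, while the proposed one evaluates 91 < 101 and keeps going. Once path A is found at s_f = s_r = 55 with s_b=100, the proposed condition evaluates 100 < 100 and stops.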

Conclusion
Two things were wrong:

The paper was indeed missing the extra -p in the check to terminate. The code already had it though.
I was testing BiWFA without explicitly disabling heuristics, which made it give wrong results on large inputs.

Simplify meeting condition

I think the DP formula can be modified to use a gap-close cost instead of a gap-open cost. This should benefit the backwards WFA, so that you don't need the o overlap in the distance from the start and end when doing the meet-in-the-middle step.

I've quickly written this down here: https://research.curiouscoding.nl/posts/affine-gap-close-cost/

For completeness, here are the differences in the gap-close-cost formula (red: removed, green: added)

Using this, the forward and backward version will have a slightly different formula, but that could also be solved by incurring a cost of o/2 both when opening and closing the gap (see the linked post).

I haven't fully thought out the details yet of the stitching with this modified formula, but my feeling is that this should actually just work.

Since your stopping condition already seems to be to fall back to normal WFA once the score drops below 100, the total time spent computing the o extra layers won't be significant, but this could possibly reduce the theoretical complexity of the meeting condition and simplify code there. (On the other hand, you would get more code because you now need different implementations for the forward and backward pass.)

Some small remarks re the preprint introduction

Here are some remarks I had while reading your paper. I'm bundling them in a single issue. Feel free to disagree with them though, and I may well be wrong myself once or twice; some others of these will probably be more my personal opinion than objectively wrong.

abstract

demand for

I think it should be just demand here

classical pairwise alignment algorithms based on DP are limited by quadratic time and memory

I disagree. E.g. Myers&Miller'88 could be considered 'classic' and uses linear memory, (and quadratic time, but WFA doesn't solve this for a fixed relative error rate.) (edit: i.e. some definition of 'classic' would be good.)

You may be talking about implementations instead of algorithms here, but e.g. edlib I would probably not consider 'classic'.

The recently proposed WFA introduces an efficient algorithm to perform exact alignment in O(ns) time.

Sure, but Ukkonen'85 and Myers'86 already did this 35 years ago. Also, what does 'efficient' mean here?
(Edit: I think for such a claim (especially in the abstract), it would be best to qualify it with 'for gap-affine costs'. We're having to add similar qualifications ourselves and it's really annoying, but in the end it is the clearest.)

1 Mbp long

The results section doesn't actually mention this number, or the actual upper bound on sequence length.

noisy

What kind of error rate and model? 10%/20%, uniform errors or big indels?

introduction

Classical approaches ... these methods often require a matrix

Drop 'often'? depending on your definition of 'classical'. (Later you write 'the DP matrix', assuming its presence anyway.)

multiple variations have been proposed over the years

It would be nice to cite the diagonal-transition algorithms here (i.e. the other ukkonen'85 paper, and myers86), instead of (I think) more review-oriented papers.

'variations': clarify what kind of variations (implementation? algorithm?)

notable optimizations include bit parallel techniques (Myers 1986...

Myers '86 doesn't talk about bit parallel at all.

among other methods (ukkonen 85)

Again, I think you want the other Ukkonen85 paper, Algorithms for Approximate String Matching, not Finding approximate patterns in strings, which is about k-approximate-string-matching, not global pairwise alignment.

nonetheless, all these exact methods retain the quadratic requirements of the original DP algorithm

time or space requirements?
Existing diagonal-transition methods already allow for O(ns) (subquadratic?) time or O(n) space.

heuristic methods

(personal request:) consider using approximate methods. heuristic is confusing in the context of A*, e.g. https://github.com/eth-sri/astarix

O(n) (myers and miller 88)

More precise would be O(min(n,m) + lg max(n,m)) space usage.

O(s) memory

I'm always a bit on the edge about using less memory than the size of the input, O(n+m). maybe clarify this? You should highlight that the output is compressed, and that return D x m means exactly that (as in CIGAR strings), and not return DDD..DDD, containing m copies of D (which was my first interpretation), since in that case the space requirement would become O(max(n,m)) to store the output alignment.
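The distinction between the two output forms is just run-length encoding. As a minimal sketch (the name `cigar_rle` is made up for this illustration), the following turns an expanded operation string into the compressed form, which is what makes an output smaller than max(n,m) possible at all:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Run-length encodes an expanded operation string, e.g. "DDDD==X" -> "4D2=1X".
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the output buffer is too small. */
int cigar_rle(const char *ops, char *out, size_t out_size) {
    assert(ops != NULL && out != NULL);
    if (out_size == 0) return -1;
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; ops[i] != '\0';) {
        size_t run = 1;
        while (ops[i + run] == ops[i]) ++run; /* stops at the terminator */
        const int w = snprintf(out + used, out_size - used, "%zu%c", run, ops[i]);
        if (w < 0 || (size_t)w >= out_size - used) return -1;
        used += (size_t)w;
        i += run;
    }
    return (int)used;
}
```

A deletion of m characters then takes O(log m) characters to write down instead of m.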

Regarding the space analysis in section 2.6:
Currently you analyze the memory usage of algorithm 1, but don't say anything about algorithm 3.
Myers86 has a nice discussion on this, and also includes the log(s) = o(s) term, and writes that his algorithm uses O(d) working memory.

state-of-the-art

could you give one or two examples already here? It's convenient to see what are considered the best tools at the time of publication when looking back.

general

You mix BiWFA and BiWFA algorithm. I understand that WFA = WaveFront Alignment, but I'm not exactly sure what this means without it being followed by algorithm. It seems like you use just WFA for the theoretical algorithm, but then what does the 'algorithm' suffix mean?

In the discussion you write we expect the BiWFA to enable .... Is this talking about the algorithm? In general this leads (for me) to some confusion between when you mean the (theoretical) algorithm, and when you're talking about your particular implementation, also called BiWFA. (But saying the BiWFA to me implies algorithm, not your implementation.)

edit re terminology:

It would be good if we have consistent usage of the following terms:

diagonal-transition method/algorithm: the general idea of the O(ns) (or O(n+s^2)) algorithm. WFA is a diagonal transition method.
WFA: either your extension of this concept to gap-affine, or more specifically your implementation
WFA algorithm: same as WFA? Or different? Myself for the longest time I understood WFA as WaveFront Algorithm, without Alignment at all.

(Actually we're struggling with the naming problem ourselves -- not sure whether to give a name to the idea/method, the implementation, or both.)

There's also some inconsistency between the WFA and BiWFA papers now:
WFA paper: The wavefront alignment algorithm (WFA)
BiWFA paper: The wavefront alignment (WFA) algorithm

Max antidiagonal optimization

Hey,

From your code it seems that you skip computing WF_OVERLAP when the furthest forward and backward antidiagonal do not reach each other yet. I think it's worth mentioning this in the paper -- I hadn't thought of it myself yet and it's a pretty simple optimisation that I may also put in my own code.
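As I understand the idea, it can be sketched as follows. This is my reading, not the actual BiWFA code: it assumes each wavefront stores, per diagonal k, the furthest text offset h reached, with pattern position v = h - k and antidiagonal v + h, and that the backward wavefront measures the same quantities from the end of both sequences. All names are made up for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* offsets[i] is the furthest offset h on diagonal k = lo + i;
 * a negative value marks a diagonal that has not been reached.
 * Returns INT_MIN if no diagonal has been reached. */
int max_antidiagonal(const int *offsets, int lo, int hi) {
    assert(offsets != NULL && lo <= hi);
    int best = INT_MIN;
    for (int k = lo; k <= hi; ++k) {
        const int h = offsets[k - lo];
        if (h < 0) continue;
        const int ak = 2 * h - k; /* v + h with v = h - k */
        if (ak > best) best = ak;
    }
    return best;
}

/* Forward and backward wavefronts can only share a cell once their
 * furthest antidiagonals together span both sequences (n + m). */
int wavefronts_may_overlap(int fwd_max_ak, int bwd_max_ak, int n, int m) {
    if (fwd_max_ak == INT_MIN || bwd_max_ak == INT_MIN) return 0;
    return fwd_max_ak + bwd_max_ak >= n + m;
}
```

The overlap check proper would then only run when `wavefronts_may_overlap` returns nonzero, which costs two integers per wavefront to maintain.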

O(n+s^2) expected time

Myers'86 shows that the diagonal transition method has O(n+s^2) expected time on random strings, and similarly Ukkonen'85 mentions that the O(s * min(n,m)) time is a worst case and at best it uses O(s^2 + min(n,m)) time.

It would be nice if you included a similar statement in your paper, since in practice, it will be very rare to take more than O(n+s^2) time, and I would expect the random-string analysis of Myers'86 to still apply.

I spent some time thinking about this, and while I can indeed come up with cases that would actually need O(ns) time, they are extremely contrived.
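For readers who want to experiment with this, here is a minimal sketch of the diagonal-transition method for plain unit-cost edit distance. It is not the WFA implementation and handles no affine gaps. The optional `work` counter adds one per wavefront cell and one per matched character, so the two contributions to the running time can be observed separately. All names are made up for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define DT_NONE (INT_MIN / 2) /* marks a diagonal that is not reached yet */

/* Slide along diagonal k from text offset h while characters match. */
static int dt_extend(const char *a, int n, const char *b, int m,
                     int k, int h, long *work) {
    int v = h - k; /* position in a */
    while (v < n && h < m && a[v] == b[h]) {
        ++v; ++h;
        if (work) ++*work;
    }
    return h;
}

/* Keep cand if it is a valid offset on diagonal k and beats best. */
static int dt_pick(int cand, int best, int k, int n, int m) {
    const int v = cand - k;
    if (cand < 0 || cand > m || v < 0 || v > n) return best;
    return cand > best ? cand : best;
}

/* Unit-cost edit distance between a and b; work may be NULL. */
int edit_distance_dt(const char *a, const char *b, long *work) {
    const int n = (int)strlen(a), m = (int)strlen(b);
    const int width = n + m + 3; /* diagonals -(n+1) .. (m+1) */
    int *base = malloc(2 * (size_t)width * sizeof *base);
    assert(base != NULL);
    int *cur = base, *nxt = base + width;
    for (int i = 0; i < 2 * width; ++i) base[i] = DT_NONE;
    const int z = n + 1; /* array index of diagonal 0 */
    cur[z] = dt_extend(a, n, b, m, 0, 0, work);
    int s = 0;
    while (cur[z + m - n] < m) { /* until the end cell is reached */
        const int lo = (s + 1 < n) ? -(s + 1) : -n;
        const int hi = (s + 1 < m) ? (s + 1) : m;
        for (int k = lo; k <= hi; ++k) {
            int h = DT_NONE;
            h = dt_pick(cur[z + k - 1] + 1, h, k, n, m); /* insertion */
            h = dt_pick(cur[z + k] + 1, h, k, n, m);     /* mismatch */
            h = dt_pick(cur[z + k + 1], h, k, n, m);     /* deletion */
            nxt[z + k] = (h == DT_NONE) ? DT_NONE
                                        : dt_extend(a, n, b, m, k, h, work);
            if (work) ++*work;
        }
        int *tmp = cur; cur = nxt; nxt = tmp;
        ++s;
    }
    free(base);
    return s;
}
```

Running this on random strings with a fixed number of edits and plotting `work` against n and s is a quick way to see how rarely the O(ns) worst case shows up.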

Make error

Hello!
I just cloned the repository and tried it, but I hit a make error in the 'example' folder.
I have followed the instruction in README.md:

$> git clone https://github.com/smarco/BiWFA-paper
$> cd BiWFA-paper
$> make clean all

and I did:

$> cd example
$> make

the error is:

cc  -L../lib -I.. wfa_basic.c -o bin/wfa_basic -lwfa -lm
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_affine.o): in function `wavefront_compute_affine._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_affine.c:230: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_affine.c:230: undefined reference to `omp_get_thread_num'
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_affine.o): in function `wavefront_compute_affine':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_affine.c:227: undefined reference to `GOMP_parallel'
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_affine2p.o): in function `wavefront_compute_affine2p._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_affine2p.c:341: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_affine2p.c:341: undefined reference to `omp_get_thread_num'
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_affine2p.o): in function `wavefront_compute_affine2p':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_affine2p.c:338: undefined reference to `GOMP_parallel'
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_edit.o): in function `wavefront_compute_edit._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_edit.c:336: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_edit.c:336: undefined reference to `omp_get_thread_num'
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_edit.o): in function `wavefront_compute_edit':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_edit.c:333: undefined reference to `GOMP_parallel'
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_linear.o): in function `wavefront_compute_linear._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_linear.c:165: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_linear.c:165: undefined reference to `omp_get_thread_num'
/usr/bin/ld: ../lib/libwfa.a(wavefront_compute_linear.o): in function `wavefront_compute_linear':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_compute_linear.c:162: undefined reference to `GOMP_parallel'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_endsfree_check_termination':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:109: undefined reference to `GOMP_critical_start'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:109: undefined reference to `GOMP_critical_end'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_end2end._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:337: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:337: undefined reference to `omp_get_thread_num'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_end2end_max._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:290: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:290: undefined reference to `omp_get_thread_num'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:294: undefined reference to `GOMP_critical_start'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:294: undefined reference to `GOMP_critical_end'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_endsfree_check_termination':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:125: undefined reference to `GOMP_critical_start'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:125: undefined reference to `GOMP_critical_end'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_endsfree._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:383: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:383: undefined reference to `omp_get_thread_num'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_endsfree_check_termination':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:125: undefined reference to `GOMP_critical_start'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:125: undefined reference to `GOMP_critical_end'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:109: undefined reference to `GOMP_critical_start'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_custom._omp_fn.0':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:430: undefined reference to `omp_get_num_threads'
/usr/bin/ld: /mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:430: undefined reference to `omp_get_thread_num'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_end2end_max':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:287: undefined reference to `GOMP_parallel'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_end2end':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:334: undefined reference to `GOMP_parallel'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_endsfree':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:380: undefined reference to `GOMP_parallel'
/usr/bin/ld: ../lib/libwfa.a(wavefront_extend.o): in function `wavefront_extend_custom':
/mnt/d/WSL/opt/BiWFA-paper/wavefront/wavefront_extend.c:427: undefined reference to `GOMP_parallel'
collect2: error: ld returned 1 exit status
make: *** [Makefile:16: examples_c] Error 1

Could you help me solve this issue?
Thanks for your help!
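A note on the log above: the undefined `omp_*` and `GOMP_*` symbols belong to GCC's OpenMP runtime (libgomp). That suggests the static library was built with OpenMP enabled while the example's link line was not, so adding `-fopenmp` to the `cc` command is the first thing to try. This is a guess from the log, not a confirmed diagnosis. A small check like the following sketch (the function name is made up for this illustration) shows whether a given compile command actually enables OpenMP:

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads OpenMP would use,
 * or 0 if this file was compiled without OpenMP support. */
int openmp_max_threads(void) {
#ifdef _OPENMP
    const int threads = omp_get_max_threads();
    assert(threads >= 1);
    return threads;
#else
    return 0;
#endif
}
```

If this returns 0 with the flags used for the example but a positive number with `-fopenmp` added, the two builds disagree about OpenMP, which would explain the linker errors.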