Comments (12)
Good news, I think I figured out exactly how to fix the problem, without ripping out the existing logic :-) It turns out removing the op_count++ is important, but it's also important where the count is checked - before or after processing the row. I'll clean up my patch and send a PR shortly which will explain what I had to fix.
from scylladb.
CC @tgrabiec, @michoecho, @avikivity
I'm worried that this bug is just the tip of the iceberg... Could counting and stopping a batch at exactly count 100 cause additional problems of stopping in the middle of a batch (perhaps one involving the deletion of an old row and the creation of a new row), or could this bug only happen for range tombstones?
I've done some more thinking about this topic. When this "max_rows_for_view_updates" was introduced in commit bf0777e of merge commit 7d21480, it was named max rows, and the commit message talked about "large allocations and too large mutations" - all pointing to the goal of splitting the update after 100 actual output rows (to avoid the output buffer growing too large), not after 100 input rows (to theoretically prevent stalls). Yet the actual implementation calls the counter "op_count" and splits the update after 100 "operations" (basically, input rows, tombstones, etc.) - the strange _op_count++ statements in the code basically add 1 for each input element.
By the way, I now understand that #12297 (which we already fixed by #12305) was indirectly caused by this same unintentional (?) op_count confusion. When this code was written, it was evidently assumed that on_results() would continue until it collects 100 output rows, so if we got back zero rows, it meant there was nothing else to do. However, in practice, on_results() only continued until it processed 100 input rows, which meant it was possible for it to process 100 input rows, see there is nothing to do for any of these rows, and end with zero output rows - which the code misjudged to mean the entire work is done (which it wasn't).
So before I can get rid of the _op_count++, I need to convince myself that it's not needed to avoid stalls. For example, if we delete a very long partition with an old timestamp, we could have thousands of don't-need-to-be-done view-row deletions - could this cause a stall? I need to verify that it can't.
Finally, there is another thing I need to test before concluding this issue: Commit 7d21480 explains that the 100-row feature was designed for "deleting a large base partition, which results in creating a view update per each deleted row". This issue reminded me that there is another very similar case: where we have not a partition deletion but rather a range deletion - e.g., when we have two clustering columns c1 and c2 and delete where c1=..., which can delete a thousand rows with 1000 values of c2. Do we handle this case correctly? We need to 1. split the update after 100 rows (avoid allocating a huge mutation with 1000 view updates), but 2. correctly resume the iteration after the first 100 deletions to produce the next 100 deletions. I think the test test_secondary_index.py::test_range_deletion, introduced in the aforementioned commit 7d21480, checks this case, but I need to make sure.
So before I can get rid of the _op_count++ I need to convince myself that it's not needed to avoid stalls.

Luckily, it turns out that the _op_count++ is not - or, perhaps more accurately, no longer - necessary to avoid stalls:
The test test_long_skipped_view_update_delete_with_timestamp (originally written to reproduce #12297) creates a long base partition of N rows and, by manipulating the timestamp, ensures that when this partition is deleted only the last N/10 view-row deletions generate any view updates: the first 90% of the base rows get read (this is what I called "input rows" above) but generate no output rows. By adding printouts to on_results(), I confirmed that after dropping the _op_count++, on_results() keeps the "op_count" at 0 all through these iterations and doesn't stop after 100 input rows (as happened in #12297) - rather, it continues to read all 90% of the base rows, and only after reaching the final 10% of rows and starting to generate useful view updates does it stop after 100.
I confirmed by setting N to 500000 that despite this very long loop generating no output, there are no stalls. The explanation is simple: on_results() is not itself a loop - it only does one iteration and asks to do the next iteration (or to stop, if it generated 100 output rows). The code that calls it does:

while (co_await on_results() == stop_iteration::no) {};

The thing is, Seastar's co_await checks for preemption by default, since 3b8903d three years ago (you can use co_await coroutine::without_preemption_check(f) to avoid the preemption check). So this loop will NOT stall! It's inefficient to iterate over a partition this way, but this is what we have in the code.
So it seems that removing the _op_count++ (and perhaps renaming this variable, or at least adding a comment explaining it) is exactly the right fix:
- The loop will not stall anyway, because of the preemption checks after every input row.
- The large allocations are avoided by ensuring that the number of output rows in the view mutation that we're building is limited to 100 (note that _op_count is still incremented when real rows are added to the mutation).
Unfortunately, it turns out that fixing the counting not to quadruple-count range tombstones (removing the "op_count++") does not fix the underlying bug. I'm not sure how I missed this earlier, but the test I already wrote yesterday (test_many_range_tombstone_base_update) still fails: with the counting fixed, we collect 100 range tombstones instead of just 25, but we still lose the 101st range tombstone (instead of the 26th)! Clearly, there is some bug in view_update_builder::on_results() which forgets some of its state when resuming the loop later, and I'm not sure yet exactly where. The code there is super-complex and very hard to understand, and I really want to avoid rewriting it from scratch.
This is an important bug fix, and I want to backport it as far back as we can (starting with 5.4 and 5.2).
Another user discovered a similar issue - #17469 - which my patch solved as well. Some of the details of how the problem was reached in the new issue differ from what I reported here - range tombstones were not involved, and neither were "many deletions to the same view partition". Rather, the new use case had a large number of different views, and a base update that was a batch involving multiple different rows, so that the single base update translated to more than 100 view updates and caused the same problem.
Issue #17469 shows a cqlsh reproducer for that other scenario, but I haven't written a cql-pytest regression test for it yet.
@mykaul we need to decide if you want to put the backport/5.2 etc. tags on the issue (here) or the PR (not here). I'm confused how the new backport plan being proposed will work.
In any case please note that the older "backport needed" tag, from the existing workflow, is on the issue (here), not on the PR.
@mykaul we need to decide if you want to put the backport/5.2 etc. tags on the issue (here) or the PR (not here). I'm confused how the new backport plan being proposed will work. In any case please note that the older "backport needed" tag, from the existing workflow, is on the issue (here), not on the PR.
Should be on the issue, at least with the existing flow. We wish to backport 'a fix' - it may be a cherry-pick of a PR, it may be a different PR. The intention is on the issue.
Backported to next-5.4 (72e8043) and next-5.2 (6a6115c).
Should be on the issue, at least with the existing flow. We wish to backport 'a fix' - it may be a cherry-pick of a PR, it may be a different PR. The intention is on the issue.
Ok, maybe I misunderstood. I was under the impression that there is a new automated bot that will force creators of PRs (not issues) to use the backport tags on those PRs...
Should be on the issue, at least with the existing flow. We wish to backport 'a fix' - it may be a cherry-pick of a PR, it may be a different PR. The intention is on the issue.
Ok, maybe I misunderstood. I was under the impression that there is a new automated bot that will force creators of PRs (not issues) to use the backport tags on those PRs...
There is a new shiny process (or so I've heard) - but I expect it to run side-by-side the 'manual vintage' mode for a while, until we see it's mature and works well.