Comments (12)
Good news, I think I figured out exactly how to fix the problem, without ripping out the existing logic :-) It turns out removing the op_count++ is important, but it's also important where the count is checked - before or after processing the row. I'll clean up my patch and send a PR shortly which will explain what I had to fix.
from scylladb.
CC @tgrabiec, @michoecho, @avikivity
I'm worried that this bug is just the tip of the iceberg... Could counting and stopping a batch at exactly count 100 cause additional problems of stopping in the middle of a batch (perhaps one involving the deletion of an old row and the creation of a new row), or could this bug only happen for range tombstones?
I've done some more thinking about this topic. When this "max_rows_for_view_updates" was introduced in commit bf0777e of merge commit 7d21480, it was named max rows, and the commit message talked about "large allocations and too large mutations" - all pointing to the goal of splitting the update after 100 actual output rows (to avoid the output buffer growing too large), not after 100 input rows (to theoretically prevent stalls). Yet the actual implementation calls the counter "op_count" and splits the update after 100 "operations" (basically, input rows, tombstones, etc.) - the strange _op_count++ statements in the code basically add 1 for each input element.
By the way, I now understand that #12297 (which we already fixed by #12305) was indirectly caused by this same unintentional (?) op_count confusion. When this code was written, it was evidently assumed that on_results() would continue until it collects 100 output rows, so if we got back zero rows, it meant there was nothing else to do. However, in practice, on_results() only continued until it processed 100 input rows, which meant it was possible for it to process 100 input rows, see there is nothing to do for any of these rows, and end with zero output rows - which the code misjudged to mean the entire work is done (which it wasn't).
So before I can get rid of the _op_count++, I need to convince myself that it's not needed to avoid stalls. For example, if we delete a very long partition with an old timestamp, we could have thousands of don't-need-to-be-done view-row deletions - could this cause a stall? I need to verify that it can't.
Finally, there is another thing I need to test before concluding this issue: Commit 7d21480 explains that the 100-row feature was designed for "deleting a large base partition, which results in creating a view update per each deleted row". This issue reminded me that there is another very similar case: where we have not a partition deletion but rather a range deletion - e.g., when we have two clustering columns c1 and c2 and delete where c1=..., which can delete a thousand rows with 1000 values of c2. Do we handle this case correctly? We need to 1. split the update after 100 rows (avoid allocating a huge mutation with 1000 view updates), but 2. correctly resume the iteration after the first 100 deletions to produce the next 100 deletions. I think the test test_secondary_index.py::test_range_deletion, introduced in the aforementioned commit 7d21480, checks this case, but I need to make sure.
So before I can get rid of the _op_count++ I need to convince myself that it's not needed to avoid stalls.

Luckily, it turns out that the _op_count++ is not - or, perhaps more accurately, no longer - necessary to avoid stalls:
The test test_long_skipped_view_update_delete_with_timestamp (originally written to reproduce #12297) creates a long base partition of N rows and, by manipulating the timestamp, ensures that when this partition is deleted only the last N/10 view-row deletions generate any view updates: the first 90% of the base rows get read (this is what I called "input rows" above) but generate no output rows. By adding printouts to on_results(), I confirmed that after dropping the _op_count++, on_results() keeps the "op_count" at 0 all through these iterations and doesn't stop after 100 input rows (as happened in #12297) - rather, it continues to read all 90% of the base rows, and only after reaching the final 10% of rows and starting to generate useful view updates does it stop after 100.
I confirmed by setting N to 500000 that despite this very long loop generating no output, there are no stalls. The explanation is simple: on_results() is not itself a loop - it only does one iteration and asks to do the next iteration (or to stop, if it generated 100 output rows). The code that calls it does:

while (co_await on_results() == stop_iteration::no) {};

The thing is, Seastar's co_await checks for preemption by default, since 3b8903d three years ago (you can use co_await coroutine::without_preemption_check(f) to avoid the preemption check). So this loop will NOT stall! It's inefficient to iterate over a partition this way, but this is what we have in the code.
So it seems that removing the _op_count++ (and perhaps renaming this variable, or at least adding a comment explaining it) is exactly the right fix:
- The loop will not stall anyway, because of the preemption checks after every input row.
- The large allocations are avoided by ensuring that the number of output rows in the view mutation that we're building is limited to 100 (note that _op_count is still incremented when real rows are added to the mutation).
Unfortunately, it turns out that fixing the counting not to quadruple-count range tombstones (removing the "op_count++") does not fix the underlying bug. I'm not sure how I missed this earlier, but the test I already wrote yesterday (test_many_range_tombstone_base_update) still fails: with the counting fixed, we collect 100 range tombstones instead of just 25, but we still lose the 101st range tombstone (instead of the 26th)! Clearly, there is some bug in view_update_builder::on_results() which forgets some of its state when resuming the loop later, and I'm not sure yet exactly where. The code there is super-complex and very hard to understand, and I really want to avoid rewriting it from scratch.
This is an important bug fix, and I want to backport it as far back as we can (starting with 5.4 and 5.2).
Another user discovered a similar issue - #17469 - which my patch solved as well. Some of the details of how the problem was reached in the new issue differ from what I reported here - range tombstones were not involved, and neither were "many deletions to the same view partition". Rather, the new use case had a large number of different views, and a base update that was a batch involving multiple different rows, so that the single base update translated to more than 100 view updates and caused the same problem.
Issue #17469 shows a cqlsh reproducer for that other scenario, but I haven't written a cql-pytest regression test for it yet.
@mykaul we need to decide if you want to put the backport/5.2 etc. tags on the issue (here) or the PR (not here). I'm confused how the new backport plan being proposed will work.
In any case please note that the older "backport needed" tag, from the existing workflow, is on the issue (here), not on the PR.
@mykaul we need to decide if you want to put the backport/5.2 etc. tags on the issue (here) or the PR (not here). I'm confused how the new backport plan being proposed will work. In any case please note that the older "backport needed" tag, from the existing workflow, is on the issue (here), not on the PR.
Should be on the issue, at least with the existing flow. We wish to backport 'a fix' - it may be a cherry-pick of a PR, it may be a different PR. The intention is on the issue.
Backported to next-5.4 (72e8043) and next-5.2 (6a6115c).
Should be on the issue, at least with the existing flow. We wish to backport 'a fix' - it may be a cherry-pick of a PR, it may be a different PR. The intention is on the issue.
Ok, maybe I misunderstood. I was under the impression that there is a new automated bot that will force creators of PRs (not issues) to use the backport tags on those PRs...
Should be on the issue, at least with the existing flow. We wish to backport 'a fix' - it may be a cherry-pick of a PR, it may be a different PR. The intention is on the issue.
Ok, maybe I misunderstood. I was under the impression that there is a new automated bot that will force creators of PRs (not issues) to use the backport tags on those PRs...
There is a new shiny process (or so I've heard) - but I expect it to run side-by-side the 'manual vintage' mode for a while, until we see it's mature and works well.