Comments (5)
If CDC is enabled for a table, can Scylla guarantee that both the data in the base table and the CDC log table are immediately visible after a request returns successfully to the user (using Alternator for PUT or UPDATE with LWT), or is there an asynchronous flushing of the CDC log?
Both writes (base and log) are synchronous -- once the driver gets the ACK, both have been made. So, assuming that you did the write with QUORUM, for example, a QUORUM read from the CDC log will observe the entry.
If writing to the CDC log fails, will the user still receive a successful response?
No. Both the base and the log write must achieve the CL of the write for the write to get an ACK.
Before proceeding with the rest of my answer, first read:
https://opensource.docs.scylladb.com/stable/using-scylla/cdc/cdc-log-table.html#digression-write-timestamps-in-scylla
and
https://opensource.docs.scylladb.com/stable/using-scylla/cdc/cdc-log-table.html#write-timestamps-in-cdc
Indeed, as you observed, different sources of timestamps for writes cause rows to appear out of order in the CDC log table.
Our CDC connectors (for example https://github.com/scylladb/scylla-cdc-java) deal with this by introducing the notion of a confidence window.
The connector queries successive time windows in the CDC log table. So, let's say that in a world with perfectly synchronized clocks, the connector wants to query the changes from the past minute; it would query the time window [now() - 60s, now()], for example.
But we're not living in a perfect world, so we have to take into account that rows may still appear in this window even though our local clock shows we're past it.
So we assume that, with sufficiently synchronized clocks, no more data will appear with timestamps below now() - confidence_window, for some value of confidence_window.
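The windowing idea above can be sketched roughly as follows. This is only an illustration, not the code of scylla-cdc-java; the function name `next_query_window` and its parameters are hypothetical, and the caller is assumed to remember where the previous window ended.

```python
from datetime import datetime, timedelta, timezone


def next_query_window(last_end, now, window_size, confidence_window):
    """Return the next (start, end) window to read from the CDC log,
    or None if nothing is old enough to be read safely yet.

    All names here are hypothetical; this is not the connector's API.
    """
    # Rows with timestamps above this bound may still be arriving.
    safe_upper_bound = now - confidence_window
    # Never read past the safe bound, even if a full window would fit.
    end = min(last_end + window_size, safe_upper_bound)
    if end <= last_end:
        return None  # wait until the clock moves past the confidence window
    return (last_end, end)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    window = next_query_window(
        last_end=now - timedelta(seconds=120),
        now=now,
        window_size=timedelta(seconds=60),
        confidence_window=timedelta(seconds=30),
    )
    print(window)
```

The reader only advances up to now() - confidence_window, so the most recent slice of the log is left alone until it is considered stable.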
How to pick the confidence window?
Suppose that
- the max difference between any of your clocks is bounded by δ
- every write that doesn't succeed within ε is considered failed (it times out)
- clocks are monotonic
Then any write with timestamp below now() - ε - δ is already acked by now().
So if you start a read at now(), and query up to now() - ε - δ, you will (in theory) not miss any writes (other than failed writes or writes that will eventually fail).
Proof:
- suppose you read at machine M at now_M
- suppose a successful write has timestamp T <= now_M - ε - δ
- on machine M', where the timestamp of this write was generated, the read starts at now_M'
- now_M is within the window [now_M' - δ, now_M' + δ] by our clock sync assumption. In particular, now_M <= now_M' + δ
- therefore T <= now_M' + δ - ε - δ = now_M' - ε
- the write started before its timestamp was generated, so at some T0 <= T <= now_M' - ε (T0 as measured by M')
- the write was acked by T0 + ε (otherwise it would time out -- by our assumption)
- T0 + ε <= now_M' - ε + ε = now_M'
- in other words, the write was acked by the moment you start this read.
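For reference, the steps of the proof can be written as one chain of inequalities, using the same symbols (T0 and T as measured by M', now_M and now_M' being the moment the read starts on each machine's clock):

```latex
\begin{align*}
T_0 \le T &\le \mathrm{now}_M - \varepsilon - \delta \\
          &\le (\mathrm{now}_{M'} + \delta) - \varepsilon - \delta
           = \mathrm{now}_{M'} - \varepsilon, \\
\text{hence}\quad T_0 + \varepsilon &\le \mathrm{now}_{M'}.
\end{align*}
```

The first line is the assumption on the write's timestamp, the second uses the clock sync bound, and the last says the ack deadline of the write has passed on M' by the time the read starts.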
So a theoretically sufficient confidence_window is ε + δ. But we're engineers, so just in case let's take ε + δ + 5s or something ;)
ε should be more or less the write timeout that you configured, but with processes pausing, the decision to time out not being atomic with checking for timeout, etc., you never know -- hence the 5s for safety. In practice, even more might be needed.
IIRC the default confidence_window used by the Java connector is 60s, so we take a pretty large margin of safety (for healthy clusters/networks).
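As a small worked example of the ε + δ + 5s rule, here is a sketch in Python. The function name and the sample values (a 2s write timeout, 500ms of clock skew) are assumptions for illustration only, not recommended settings.

```python
from datetime import timedelta


def confidence_window(write_timeout, max_clock_skew,
                      safety_margin=timedelta(seconds=5)):
    """Compute ε + δ + margin, where ε is the write timeout and δ is the
    bound on clock differences. Hypothetical helper, not a connector API."""
    for name, value in (("write_timeout", write_timeout),
                        ("max_clock_skew", max_clock_skew),
                        ("safety_margin", safety_margin)):
        if value < timedelta(0):
            raise ValueError(f"{name} must not be negative")
    return write_timeout + max_clock_skew + safety_margin


if __name__ == "__main__":
    # Assumed example values: ε = 2s, δ = 500ms, default 5s margin.
    window = confidence_window(timedelta(seconds=2),
                               timedelta(milliseconds=500))
    print(window.total_seconds())  # 7.5
```

Note that δ has to cover the clocks of the writing clients as well as the clock of the reader, not only the Scylla nodes.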
Notice that there's a catch here: "any of your clocks" includes:
- the clocks used to generate timestamps for your writes
- and the clock used to calculate the time window that you're querying
The clock used to generate timestamps for your writes is actually, by default, not the Scylla node's. It is the client's -- specifically, that of the Scylla/Cassandra driver library which sends the INSERT/UPDATE. So one reason why you might be missing some writes is that the clock of the machine which sends these writes (not the Scylla node) is off.
It is also possible to disable client-generated timestamps; for example, the Python driver has the use_client_timestamp session property. Then the Scylla node will generate the timestamps.
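A minimal sketch of turning off client-generated timestamps with the Python driver, assuming a CQL session object obtained the usual way. The helper name is hypothetical; `use_client_timestamp` is the session property mentioned above. Connecting needs the driver installed and a reachable cluster, so that part is left as a comment with a placeholder address.

```python
def use_server_side_timestamps(session):
    """Make the given driver session stop sending client-side write
    timestamps, so the coordinator node assigns them instead.

    Hypothetical helper; it only flips the `use_client_timestamp`
    property on the session it is given.
    """
    session.use_client_timestamp = False
    return session


# Usage (assumed setup, placeholder contact point):
#
#   from cassandra.cluster import Cluster
#   session = Cluster(["127.0.0.1"]).connect()
#   use_server_side_timestamps(session)
```

With this setting the clock that matters for write timestamps becomes the Scylla node's, which is usually easier to keep in sync than every client machine.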
from scylladb.
About the confidence window:
Had you also used Alternator to read the CDC log, i.e., the DynamoDB Streams API, you would have got a confidence window defaulting to 10 seconds (configurable with alternator_streams_time_window_s). This means that the confidence window you used, 15 seconds, "should" have been enough to avoid the out-of-order problems.
You should stop reading the log when you reach data newer than this confidence window. I'm not sure if that's what you're doing?
If it really does take 60 seconds for data to move between Scylla nodes, maybe there is some other problem somewhere. Are some of your nodes or shards at 100% CPU? How is your "background writes" metric? Are you using materialized views (Alternator GSI or LSI)? Maybe some of these writes get queued for 60 seconds, causing the updates to arrive at different nodes at very different times.
from scylladb.
@nyh @kbr-scylla If I sync the cluster clocks and guarantee that the time drift is less than the confidence window, am I guaranteed not to lose any records?
If the Scylla load is too high and some requests time out or take a long time to return, will this affect the confidence window?
from scylladb.
If the clocks drift too far apart, can I find that out through Scylla?
from scylladb.
@zey1996 as I mentioned in my post, you need to sync the source of your write timestamps with the source of your query time windows. By default the source of write timestamps is not the Scylla nodes -- it's the driver.
The larger the timeouts you configured, the larger the time windows you need:
Suppose that
- the max difference between any of your clocks is bounded by δ
- every write that doesn't succeed within ε is considered failed (it times out)
- clocks are monotonic
(...) a theoretically sufficient confidence_window is ε + δ. But we're engineers, so just in case let's take ε + δ + 5s or something ;)
ε should be more or less the write timeout that you configured, but with processes pausing, the decision to time out not being atomic with checking for timeout, etc., you never know -- hence the 5s for safety. In practice, even more might be needed.
from scylladb.