Installation details: Scylla version (or git commit hash): scylla-5.2.18

Data Synchronization Error between scylla and another database due to the Out-of-Order CDC Log Data about scylladb HOT 5 OPEN

ytypy commented on July 3, 2024

Data Synchronization Error between scylla and another database due to the Out-of-Order CDC Log Data

from scylladb.

Comments (5)

kbr-scylla commented on July 3, 2024

If CDC is enabled for a table, can Scylla guarantee that both the data in the base table and the CDC log table are immediately visible after a request returns successfully to the user (using Alternator for PUT or UPDATE with LWT), or is there an asynchronous flushing of the CDC log?

Both writes (base and log) are synchronous -- once the driver gets the ACK, both have been made. So, assuming that you did the write with QUORUM, for example, a QUORUM read from the CDC log will observe the entry.

If writing to the CDC log fails, will the user still receive a successful response?

No. Both the base write and the log write must achieve the consistency level (CL) of the write for the write to get an ACK.

Before proceeding with the rest of my answer, first read:
https://opensource.docs.scylladb.com/stable/using-scylla/cdc/cdc-log-table.html#digression-write-timestamps-in-scylla
and
https://opensource.docs.scylladb.com/stable/using-scylla/cdc/cdc-log-table.html#write-timestamps-in-cdc

Indeed, as you observed, different sources of timestamps for writes cause rows to appear out of order in the CDC log table.

Our CDC connectors (for example https://github.com/scylladb/scylla-cdc-java) deal with this by introducing the notion of a confidence window.

The connector queries subsequent time windows in the CDC log table. So, let's say that in a world with perfectly synchronized clocks, the connector wants to query the changes from the past minute; it would query the time window [now() - 60s, now()], for example.

But we're not living in a perfect world, so we have to take into account that rows are still appearing in this window even though our local clock shows we're past it.
So we're making an assumption that with sufficiently synchronized clocks, no data will appear anymore with timestamps below now() - confidence_window for some value of confidence_window.
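To make the windowing idea concrete, here is a minimal sketch of how a reader could split the log into consecutive time windows that never extend past now() - confidence_window. The function name `safe_windows` and its parameters are made up for illustration; this is not the scylla-cdc-java implementation, only the idea described above.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def safe_windows(
    last_read: datetime,
    now: datetime,
    confidence_window: timedelta,
    step: timedelta,
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [start, end) windows ending no later than now - confidence_window."""
    safe_end = now - confidence_window  # newer rows may still be arriving
    start = last_read
    while start < safe_end:
        end = min(start + step, safe_end)
        yield start, end
        start = end


# Example: we last read up to 11:57:00 and the local clock shows 12:00:00.
now = datetime(2024, 7, 3, 12, 0, 0, tzinfo=timezone.utc)
windows = list(
    safe_windows(
        last_read=now - timedelta(minutes=3),
        now=now,
        confidence_window=timedelta(seconds=60),
        step=timedelta(seconds=60),
    )
)
for start, end in windows:
    print(start.time(), "->", end.time())
```

The last minute before `now` is deliberately left unread; it is picked up by a later call, once the local clock has moved past it by the confidence window.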

How to pick the confidence window?

Suppose that

the max difference between any of your clocks is bounded by δ
every write that doesn't succeed within ε is considered failed (it times out)
clocks are monotonic

Then any write with timestamp below now() - ε - δ is already acked by now().
So if you start a read at now(), and query up to now() - ε - δ, you will (in theory) not miss any writes (other than failed writes or writes that will eventually fail).

Proof:

suppose you read at machine M at now_M
suppose a successful write has timestamp T <= now_M - ε - δ
on machine M', where the timestamp of this write was generated, the read starts at now_M'
now_M is within the window [now_M' - δ, now_M' + δ] by our clock sync assumption. In particular, now_M <= now_M' + δ
therefore T <= now_M' + δ - ε - δ = now_M' - ε
the write started before its timestamp was generated, so at some T0 <= T <= now_M' - ε (T0 as measured by M')
the write was acked by T0 + ε (otherwise it would time out -- by our assumption)
T0 + ε <= now_M' - ε + ε = now_M'
in other words, the write was acked by the moment you start this read.
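For readers who prefer it in one place, the same argument can be written as a chain of inequalities. This only restates the steps above, with all times except the first line measured on the clock of M':

```latex
\begin{align*}
T &\le \mathrm{now}_M - \varepsilon - \delta
  && \text{(assumption on the write's timestamp)} \\
\mathrm{now}_M &\le \mathrm{now}_{M'} + \delta
  && \text{(clock difference bounded by } \delta\text{)} \\
\Rightarrow\quad T &\le \mathrm{now}_{M'} + \delta - \varepsilon - \delta
  = \mathrm{now}_{M'} - \varepsilon \\
T_0 &\le T
  && \text{(the write started before its timestamp was generated)} \\
\Rightarrow\quad T_0 + \varepsilon &\le \mathrm{now}_{M'} - \varepsilon + \varepsilon
  = \mathrm{now}_{M'}
  && \text{(the ack deadline has passed when the read starts)}
\end{align*}
```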

So a theoretically sufficient confidence_window is ε + δ. But we're engineers, so just in case let's take ε + δ + 5s or something ;) ε should be more or less the write timeout that you configured, but with processes pausing, the decision to timeout not being atomic with checking for timeout, etc., you never know -- hence the 5s for safety. In practice, even more might be needed.

IIRC the default confidence_window used by the Java connector is 60s, so we take a pretty large margin of safety (for healthy clusters/networks).
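As a sanity check of the ε + δ bound, here is a small simulation sketch. It models the three assumptions directly (clock offsets that differ by at most δ, successful writes acked within ε, constant offsets so clocks stay monotonic) and counts writes that fall inside the "safe" range but are not yet acked when the read starts. The function names and the integer-millisecond model are assumptions made for this illustration, not anything taken from Scylla or the connectors.

```python
import random


def confidence_window_ms(epsilon_ms: int, delta_ms: int, margin_ms: int = 5000) -> int:
    """Theoretical bound epsilon + delta, plus an engineering safety margin."""
    return epsilon_ms + delta_ms + margin_ms


def count_violations(epsilon_ms: int, delta_ms: int, window_ms: int,
                     trials: int = 100_000, seed: int = 0) -> int:
    """Count writes with timestamp <= now_reader - window that are NOT yet acked at read time."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        # Offsets of each clock from "true" time; both lie in [0, delta],
        # so the two clocks differ by at most delta.
        writer_offset = rng.randint(0, delta_ms)
        reader_offset = rng.randint(0, delta_ms)
        start = rng.randint(0, 10_000)             # true time the write starts
        latency = rng.randint(0, epsilon_ms)       # successful writes finish within epsilon
        stamp_delay = rng.randint(0, latency)      # timestamp is generated after the start
        timestamp = start + stamp_delay + writer_offset  # as seen by the writer's clock
        acked_at = start + latency                 # true time of the ack
        read_at = rng.randint(0, 20_000)           # true time the read starts
        now_reader = read_at + reader_offset       # as seen by the reader's clock
        if timestamp <= now_reader - window_ms and acked_at > read_at:
            violations += 1
    return violations


epsilon_ms, delta_ms = 2000, 500
safe = count_violations(epsilon_ms, delta_ms, window_ms=epsilon_ms + delta_ms)
unsafe = count_violations(epsilon_ms, delta_ms, window_ms=0)
print("violations with window = epsilon + delta:", safe)
print("violations with no confidence window:", unsafe)
print("suggested window with 5s margin (ms):", confidence_window_ms(epsilon_ms, delta_ms))
```

With the window set to ε + δ the count stays at zero, as the proof predicts; with no window at all, some in-flight writes are missed. Real systems break the assumptions in ways this model does not capture (process pauses, clock jumps), which is why the extra margin is recommended.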

Notice that there's a catch here: "any of your clocks" includes:

the clocks used to generate timestamps for your writes
and the clock used to calculate the time window that you're querying

The clock used to generate timestamps for your writes is actually, by default, not the Scylla node. It is the client -- specifically, the Scylla/Cassandra driver library which sends the INSERT/UPDATE. So one reason why you might be missing some writes is because the clock of the machine which sends these writes (not the Scylla node) is off.
It is also possible to disable client-generated timestamps; for example, in the Python driver there is the use_client_timestamp session property. The Scylla node will then generate the timestamps.
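A minimal sketch of what that could look like with the Python driver, assuming the `cassandra-driver` (or `scylla-driver`) package is installed and a cluster is reachable at the given contact points. The helper name `connect_with_server_side_timestamps` is made up for this example; `use_client_timestamp` is the session property mentioned above.

```python
def connect_with_server_side_timestamps(contact_points):
    """Open a session whose writes do not carry driver-generated timestamps."""
    # Imported here so the sketch loads even where the driver is not installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points)
    session = cluster.connect()
    # The driver sends client-side timestamps by default; turning this off
    # leaves timestamp generation to the node that coordinates the write.
    session.use_client_timestamp = False
    return cluster, session
```

Usage would be along the lines of `cluster, session = connect_with_server_side_timestamps(["127.0.0.1"])`. Note that a timestamp given explicitly in a CQL statement (USING TIMESTAMP) still takes precedence over this setting.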

nyh commented on July 3, 2024

About the confidence window:

Had you used Alternator also to read the CDC, i.e., the DynamoDB Streams API, you would have got the confidence window defaulting to 10 seconds (configurable by alternator_streams_time_window_s). Which means that the confidence window that you used, 15 seconds, "should" have been enough to avoid the out-of-order problems.

You should stop reading the log when you reach data newer than this confidence window. I'm not sure if that's what you're doing?

If it really does take 60 seconds for data to move between Scylla nodes, maybe there is some other problem somewhere. Are some of your nodes or shards at 100% CPU? How is your "background writes" metric? Are you using materialized views (Alternator GSI or LSI)? Maybe some of these writes do get queued for 60 seconds, causing the updates to arrive at different nodes at very different times.

zey1996 commented on July 3, 2024

@nyh @kbr-scylla If I can sync the cluster time and guarantee that the time drift is less than the confidence window, will I not lose any records?
If the Scylla load is too high and some requests time out or take a long time to return, will this affect the confidence window?

zey1996 commented on July 3, 2024

If the time drifts too far, can I find that out through Scylla?

kbr-scylla commented on July 3, 2024

@zey1996 as I mentioned in my post, you need to sync the source of your write timestamps with the source of your query time windows. By default the source of write timestamps is not the Scylla nodes -- it's the driver.

The larger the timeouts you configure, the larger the time windows you need:

Suppose that

the max difference between any of your clocks is bounded by δ

every write that doesn't succeed within ε is considered failed (it times out)

clocks are monotonic

(...) a theoretically sufficient confidence_window is ε + δ. But we're engineers, so just in case let's take ε + δ + 5s or something ;) ε should be more or less the write timeout that you configured, but with processes pausing, the decision to timeout not being atomic with checking for timeout, etc., you never know -- hence the 5s for safety. In practice, even more might be needed.
