Comments (5)

kbr-scylla commented on July 3, 2024

If CDC is enabled for a table, can Scylla guarantee that both the data in the base table and the CDC log table are immediately visible after a request returns successfully to the user (using Alternator for PUT or UPDATE with LWT), or is there an asynchronous flushing of the CDC log?

Both writes (base and log) are synchronous -- once the driver gets the ACK, both have been made. So, assuming that you did the write with QUORUM, for example, a QUORUM read from the CDC log will observe the entry.

If writing to the CDC log fails, will the user still receive a successful response?

No. Both the base write and the log write must achieve the CL (consistency level) of the write for the write to get an ACK.


Before proceeding with the rest of my answer, first read:
https://opensource.docs.scylladb.com/stable/using-scylla/cdc/cdc-log-table.html#digression-write-timestamps-in-scylla
and
https://opensource.docs.scylladb.com/stable/using-scylla/cdc/cdc-log-table.html#write-timestamps-in-cdc


Indeed, as you observed, different sources of timestamps for writes cause rows to appear out of order in the CDC log table.

Our CDC connectors (for example https://github.com/scylladb/scylla-cdc-java) deal with this by introducing the notion of a confidence window.

The connector queries subsequent time windows in the CDC log table. So, let's say that in a world with perfectly synchronized clocks, the connector wants to query the changes from the past minute; it would query the time window [now() - 60s, now()], for example.

But we're not living in a perfect world, so we have to take into account that rows may still appear in this window even though our local clock shows we're past it.
So we assume that, with sufficiently synchronized clocks, no more data will appear with timestamps below now() - confidence_window for some value of confidence_window.
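
To make the windowing idea concrete, here is a minimal sketch of how a reader could generate consecutive query windows that never reach past now() - confidence_window. The function name, its parameters and the choice of half-open windows are assumptions made for illustration; this is not the connector's actual code.

```python
import time
from typing import Iterator, Optional, Tuple


def confident_windows(
    start: float,
    window_size: float,
    confidence_window: float,
    now: Optional[float] = None,
) -> Iterator[Tuple[float, float]]:
    """Yield consecutive (begin, end) windows ending at or before now - confidence_window.

    Windows are meant to be read as half-open [begin, end), so that
    neighbouring windows do not overlap (an assumption of this sketch).
    """
    if now is None:
        now = time.time()
    # Nothing newer than this bound is considered safe to read yet.
    safe_end = now - confidence_window
    begin = start
    while begin + window_size <= safe_end:
        yield (begin, begin + window_size)
        begin += window_size
```

A reader would call this repeatedly, passing the end of the last window it processed as `start`; a window that is not yet fully behind the confidence window is simply left for a later call.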

How to pick the confidence window?

Suppose that

  • the max difference between any of your clocks is bounded by δ
  • every write that doesn't succeed within ε is considered failed (it times out)
  • clocks are monotonic

Then any write with timestamp below now() - ε - δ is already acked by now().
So if you start a read at now(), and query up to now() - ε - δ, you will (in theory) not miss any writes (other than failed writes or writes that will eventually fail).

Proof:

  • suppose you read at machine M at now_M
  • suppose a successful write has timestamp T <= now_M - ε - δ
  • let now_M' be the reading of the clock on machine M', where the timestamp of this write was generated, at the moment the read starts
  • now_M is within the window [now_M' - δ, now_M' + δ] by our clock sync assumption. In particular, now_M <= now_M' + δ
  • therefore T <= now_M' + δ - ε - δ = now_M' - ε
  • the write started before its timestamp was generated, so at some T0 <= T <= now_M' - ε (T0 as measured by M')
  • the write was acked by T0 + ε (otherwise it would time out -- by our assumption)
  • T0 + ε <= now_M' - ε + ε = now_M'
  • in other words, the write was acked by the moment you start this read.
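
As a sanity check of the argument above, the following sketch runs a small randomized simulation of the same model. Everything in it is an assumption for illustration: clocks are modelled as true time plus a constant offset (so they are monotonic), all times are integer milliseconds, and the function name `check_claim` is made up.

```python
import random


def check_claim(trials: int, eps_ms: int, delta_ms: int, seed: int = 0) -> int:
    """Return how many sampled writes fell below now_M - eps - delta.

    For each of them, assert that the write was acked before the read started.
    """
    rng = random.Random(seed)
    checked = 0
    for _ in range(trials):
        # Offsets of the writer (M') and reader (M) clocks from true time;
        # they differ by at most delta.
        off_writer = rng.randint(-1000, 1000)
        off_reader = off_writer + rng.randint(-delta_ms, delta_ms)
        start_true = rng.randint(0, 100_000)   # true time the write starts
        latency = rng.randint(0, eps_ms)       # successful writes ack within eps
        ack_true = start_true + latency
        # The timestamp is generated on the writer's clock, at or after the start.
        ts = start_true + rng.randint(0, latency) + off_writer
        read_true = rng.randint(0, 100_000 + 3 * eps_ms)
        now_reader = read_true + off_reader
        if ts <= now_reader - eps_ms - delta_ms:
            checked += 1
            assert ack_true <= read_true, "write below the bound was not yet acked"
    return checked
```

The simulation only exercises the three stated assumptions; it says nothing about what happens when a clock jumps or a write outlives its timeout.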

So a theoretically sufficient confidence_window is ε + δ. But we're engineers, so just in case let's take ε + δ + 5s or something ;) ε should be more or less the write timeout that you configured, but with processes pausing, the decision to time out not being atomic with checking for the timeout, etc., you never know -- hence the 5s for safety. In practice, even more might be needed.
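
The arithmetic can be written down as a tiny helper. This is only a sketch: the function names are made up, and the 5-second default margin is just the "or something" figure from above, not a recommended value.

```python
def confidence_window(
    write_timeout_s: float,
    max_clock_skew_s: float,
    safety_margin_s: float = 5.0,
) -> float:
    """Return eps + delta + margin, all in seconds."""
    if min(write_timeout_s, max_clock_skew_s, safety_margin_s) < 0:
        raise ValueError("timeout, skew and margin must be non-negative")
    return write_timeout_s + max_clock_skew_s + safety_margin_s


def safe_read_upper_bound(
    now_s: float,
    write_timeout_s: float,
    max_clock_skew_s: float,
    safety_margin_s: float = 5.0,
) -> float:
    """Newest timestamp (in seconds) that a read starting at now_s should query up to."""
    return now_s - confidence_window(write_timeout_s, max_clock_skew_s, safety_margin_s)
```

For example, with a 2 s write timeout and clocks within 1 s of each other, `confidence_window(2, 1)` gives 8 s, so a read starting at time 100 would query up to 92.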

IIRC the default confidence_window used by the Java connector is 60s, so we take a pretty large margin of safety (for healthy clusters/networks).

Notice that there's a catch here: "any of your clocks" includes:

  • the clocks used to generate timestamps for your writes
  • and the clock used to calculate the time window that you're querying

By default, the clock used to generate timestamps for your writes is actually not the Scylla node's. It is the client's -- specifically, that of the Scylla/Cassandra driver library which sends the INSERT/UPDATE. So one reason why you might be missing some writes is that the clock of the machine which sends these writes (not the Scylla node) is off.
It is also possible to disable client-generated timestamps; for example, the Python driver has the use_client_timestamp session property. The Scylla node will then generate the timestamps.
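
With the Python driver this could look roughly like the sketch below. `use_client_timestamp` is the session property mentioned above; the helper names and the contact points are placeholders, and the driver import is kept inside the function so the first helper can be tried without a cluster.

```python
def use_server_side_timestamps(session):
    """Turn off client-generated write timestamps on a Python driver session."""
    session.use_client_timestamp = False
    return session


def connect_with_server_timestamps(contact_points):
    """Connect and return (cluster, session) with client timestamps disabled.

    contact_points is a list of node addresses, e.g. ["127.0.0.1"] (placeholder).
    """
    # Imported lazily: requires the cassandra-driver / scylla-driver package.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points=contact_points)
    session = cluster.connect()
    return cluster, use_server_side_timestamps(session)
```

Writes sent through such a session no longer carry the client's clock, so δ then only has to cover the Scylla nodes and the machine that computes the query windows.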

nyh commented on July 3, 2024

About the confidence window:

Had you used Alternator also to read the CDC, i.e., the DynamoDB Streams API, you would have got the confidence window defaulting to 10 seconds (configurable by alternator_streams_time_window_s). Which means that the confidence window that you used, 15 seconds, "should" have been enough to avoid the out-of-order problems.

You should stop reading the log when you reach data newer than this confidence window. I'm not sure if that's what you're doing?
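
A minimal sketch of that stopping rule, under some assumptions: rows are given as (timeuuid, payload) pairs already ordered by time, as they would be when reading one stream of the CDC log ordered by its cdc$time column, and the helper names are made up for illustration.

```python
import uuid
from typing import Any, Iterable, Iterator, Tuple

# Number of 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
_UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_unix_seconds(u: uuid.UUID) -> float:
    """Convert a version 1 (time-based) UUID to seconds since the Unix epoch."""
    return (u.time - _UUID_EPOCH_OFFSET) / 1e7


def rows_within_confidence(
    rows: Iterable[Tuple[uuid.UUID, Any]],
    now_s: float,
    confidence_window_s: float,
) -> Iterator[Tuple[uuid.UUID, Any]]:
    """Yield rows until the first one newer than now_s - confidence_window_s."""
    cutoff = now_s - confidence_window_s
    for time_uuid, payload in rows:
        if timeuuid_to_unix_seconds(time_uuid) > cutoff:
            break  # too recent: older rows may still show up, so stop here
        yield time_uuid, payload
```

The rows that were skipped are not lost; they are picked up by a later read, once they have fallen behind the confidence window.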

If it really does take 60 seconds for data to move between Scylla nodes, maybe there is some other problem somewhere. Are some of your nodes or shards at 100% CPU? How is your "background writes" metric? Are you using materialized views (Alternator GSI or LSI)? Maybe some of these writes do get queued for 60 seconds, causing the updates to arrive at different nodes at very different times.

zey1996 commented on July 3, 2024

@nyh @kbr-scylla If I can sync the cluster time and guarantee that the time drift is less than the confidence window, will I not lose any records?
If the Scylla load is too high, and some requests time out or take a long time to return, will this affect the confidence window?

zey1996 commented on July 3, 2024

If the time drifts too far, can I find that out through Scylla?

kbr-scylla commented on July 3, 2024

@zey1996 as I mentioned in my post, you need to sync the source of your write timestamps with the source of the query time windows. By default the source of write timestamps is not the Scylla nodes -- it's the driver.

The larger the timeouts you configure, the larger the confidence window you need:

Suppose that

  • the max difference between any of your clocks is bounded by δ
  • every write that doesn't succeed within ε is considered failed (it times out)
  • clocks are monotonic

(...) a theoretically sufficient confidence_window is ε + δ. But we're engineers, so just in case let's take ε + δ + 5s or something ;) ε should be more or less the write timeout that you configured, but with processes pausing, the decision to time out not being atomic with checking for the timeout, etc., you never know -- hence the 5s for safety. In practice, even more might be needed.
