
scylladb / scylla-cdc-source-connector


A Kafka source connector capturing Scylla CDC changes

License: Apache License 2.0

Java 100.00%
kafka kafka-connect apache-kafka kafka-producer cdc change-data-capture debezium event-streaming scylla nosql

scylla-cdc-source-connector's Issues

Detected performance impact query

In scylla-cdc-source-connector, the following query pattern is used to retrieve the changed data from the CDC table.

SELECT * FROM test.user_scylla_cdc_log WHERE "cdc$stream_id" IN ? AND "cdc$time">? AND "cdc$time"<=?;

"cdc$stream_id" IN ?
This query pattern appears to send a lookup request to all ScyllaDB cluster nodes, so ScyllaDB experiences load spikes and can crash.

How about improving this query pattern by issuing one query per stream in parallel inside the CDC connector and merging the results?

SELECT * FROM test.user_scylla_cdc_log WHERE "cdc$stream_id" = "stream_id_A" AND "cdc$time">? AND "cdc$time"<=?;

SELECT * FROM test.user_scylla_cdc_log WHERE "cdc$stream_id" = "stream_id_B" AND "cdc$time">? AND "cdc$time"<=?;

SELECT * FROM test.user_scylla_cdc_log WHERE "cdc$stream_id" = "stream_id_C" AND "cdc$time">? AND "cdc$time"<=?;

....

SELECT * FROM test.user_scylla_cdc_log WHERE "cdc$stream_id" = "stream_id_Z" AND "cdc$time">? AND "cdc$time"<=?;


Then merge all query results in the source connector.
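
A rough sketch of the proposed pattern, for illustration only (it is not the connector's current code): one prepared single-partition query per stream, executed asynchronously with the DataStax Java driver 3.x that the connector shades, results merged on the client. The keyspace/table name is taken from the example above; method and variable names are hypothetical.

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.Futures;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class PerStreamQuerySketch {
    // Reads one query window by issuing one single-partition query per stream in parallel,
    // instead of a single "cdc$stream_id" IN ? query fanned out across the whole cluster.
    static List<Row> readWindow(Session session, List<ByteBuffer> streamIds, UUID windowStart, UUID windowEnd)
            throws Exception {
        PreparedStatement ps = session.prepare(
                "SELECT * FROM test.user_scylla_cdc_log "
                        + "WHERE \"cdc$stream_id\" = ? AND \"cdc$time\" > ? AND \"cdc$time\" <= ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (ByteBuffer streamId : streamIds) {
            // each query targets a single partition, so it is routed to the owning replicas only
            futures.add(session.executeAsync(ps.bind(streamId, windowStart, windowEnd)));
        }
        List<Row> merged = new ArrayList<>();
        for (ResultSet rs : Futures.allAsList(futures).get()) { // merge all per-stream results client-side
            rs.forEach(merged::add);
        }
        return merged;
    }
}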

Currently, ScyllaDB has failed after attaching the source connector in our production environment.


[ScyllaDB Cluster Environment]

  • Total nodes: 12
  • AWS_US_WEST_1: 6 nodes
  • AWS_EU_CENTRAL_1: 6 nodes

[Scylla CDC Source Connector Configuration]

  "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
  "tasks.max": "3",
  "scylla.cluster.ip.addresses": "'${database_url}'",
  "scylla.user": "'${database_user}'",
  "scylla.password": "'${database_password}'",
  "scylla.name": "cdc-data.test",
  "scylla.table.names": "test.a,test.b",
  "scylla.query.time.window.size": "10000",
  "scylla.confidence.window.size": "5000",
  "scylla.consistency.level": "LOCAL_QUORUM",
  "scylla.local.dc": "AWS_US_WEST_1",
  "producer.override.acks": "-1",
  "producer.override.max.in.flight.requests.per.connection": "1",
  "producer.override.compression.type": "snappy",
  "producer.override.linger.ms": "50",
  "producer.override.batch.size": "327680",
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true",
  "topic.creation.default.replication.factor": "3",
  "topic.creation.default.partitions": "11"

(Screenshot from 2021-11-05 omitted.)

Could an event be lost if the tasks.max setting conditions written in the README.md are not met?

Hi
Could an event be lost if the tasks.max setting conditions written in the README.md are not met?

  • tasks.max >= kafka connect cluster nodes (It's ok, 3 >= 3)
  • tasks.max >= number of nodes in Scylla cluster (It's not ok, 3 < 12)
In general, the tasks.max property should be greater than or equal to the number of nodes in the Kafka Connect cluster, to allow the connector to start on each node. The tasks.max property should also be greater than or equal to the number of nodes in your Scylla cluster.

[Infra]

Kafka Connect Cluster Node Size : 3
Total Multi DC ScyllaDB Cluster Node Size : 12
Each ScyllaDB Cluster Node Size : 6 (DC : AWS EU_CENTRAL_1), 6 (DC : AWS US_WEST_1)

[Scylla DB Source Connector's configuration]

"connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
"tasks.max": "3",
"scylla.cluster.ip.addresses": "'${database_url}'",
"scylla.user": "'${database_user}'",
"scylla.password": "'${database_password}'",
"scylla.name": "cdc-data.test",
"scylla.table.names": "'${table_include_list}'",
"scylla.query.time.window.size": "5000",
"scylla.confidence.window.size": "5000",
"producer.override.acks": "-1",
"producer.override.max.in.flight.requests.per.connection": "1",
"producer.override.compression.type": "snappy",
"producer.override.linger.ms": "50",
"producer.override.batch.size": "327680",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"topic.creation.default.replication.factor": "'${replication_factor}'",
"topic.creation.default.partitions": "11"

MultiDC cluster, observing 100% CPU when connecting Kafka Debezium source connector

We have a multi-DC setup with 3 nodes in DC1 and 3 nodes in DC2. Although both DCs are in the same region and the same subnet, this split exists because we require separate clusters for reading and writing data.

Our setup creates a new table with CDC enabled in DC1 every day at midnight.

At the same time we also create a Kafka source connector every day to consume the CDC logs from DC2.

Issue:
At around midnight, when creating a new source connector, we observe that the ScyllaDB servers in DC2 consume 100% CPU.

We increased the CPU from 16 cores to 32, but the behavior is the same.
Once the Kafka connector creates its topic and starts reading data from the CDC log, the ScyllaDB CPU usage cools down.

Logs:

The syslog shows reader_concurrency_semaphore messages for that time period.

Any expert thoughts are appreciated.
Thanks in advance.

Reducing data observation lag during CDC generation switches

By "data observation lag" I refer to the phenomenon where the data is present in CDC log tables but we don't read it yet. We may introduce such a lag intentionally in order to minimize the chance that an out-of-order write-to-the-past appears which our query window will miss - this is why the "confidence window" concept exists in scylla-cdc-java and scylla-cdc-go. But this lag may also appear unintentionally, as a side effect of library/application design/implementation. Such unintentional lag appears in the source connector during CDC generation switches (which most commonly happen on Scylla cluster topology changes).

Currently the design is roughly as follows. There is a number of "worker" processes and there is a "master" process.

Each worker periodically queries a subset of streams in the current generation. Roughly every 60 seconds (configurable; I'll call this offset_flush_interval), each worker saves its offsets to some kind of persistent storage (there is one offset per stream, denoting that the worker has read all changes up to this offset in that stream).

The master periodically queries the CDC generations table(s), roughly every 30 seconds (configurable; I'll call this generation_fetch_interval; in the code it's called sleepBeforeGenerationDoneMs, but I don't like that name), to check whether there are any new generations. If it sees that there is a generation succeeding the currently operating one, it queries the offsets of all workers from the persistent storage. When it sees that all offsets are >= the timestamp of the succeeding generation, it turns off the workers and starts new ones which query streams from the new generation.

This design may introduce a huge data observation lag which is unnecessary. New generations appear in the generation table(s) roughly 2 * ring_delay before they start operating, where ring_delay is a Scylla configuration parameter that nobody ever changes (except in tests) and is equal to 30s. So in practice new generations appear 60s before they start operating, speaking in terms of the clock of the Scylla node which creates the generation, and we can probably safely assume that the clocks of all our processes fit within a few-seconds interval, so we can speak in terms of the clock of our master process. This means that the master knows about a generation very early (say, 50s before it starts operating) and can take steps to get rid of the observation lag.

Consider the following example scenario with the current design. Let X be some time point.

  1. at X - 2s the master queries the generations table and sees no new generations.
  2. at X - 1s the workers store their offsets, each offset equal to X - 1s.
  3. at X a new generation appears in the tables with timestamp X + 60s (so that's when it starts operating).
  4. at X + 28s and X + 58s the master queries the generations table and sees a new generation, but does not do anything because the offsets are still less than the new generation's timestamp (X - 1s < X + 60s).
  5. at X + 59s each worker stores its offsets, each offset equal to X + 59s.
  6. at X + 58s + 30s and X + 58s + 60s the master queries the generations table and sees that there is a new generation, but as before, does nothing (X + 59s < X + 60s).
  7. at time X + 59s + 60s each worker again stores its offsets, each offset equal to X + 59s + 60s.
  8. at time X + 58s + 90s the master finally sees that all stored offsets are >= X + 60s (the generation timestamp), so it performs the switch.

So new workers are created at X + 58s + 90s, but the generation started operating at X + 60s. We get a ~90s lag (90s - epsilon, where epsilon = 2s in my example) before we start observing data from the new generation!

This doesn't have to be the case. Consider the following alternative design (and I'm sure there are many more different/better designs):

  1. As soon as the master sees a new generation in the table (at X + 28s in the above example), it tells the existing workers that they should query no further than the generation's timestamp (X + 60s).
  2. For each worker, as soon as it queries the last window (the window which intersects the X + 60s time point), it persists its offsets and informs the master.
  3. As soon as the master learns that each worker queried the last window, it creates new workers.

Then the observation lag is independent of offset_flush_interval because the workers will truncate this interval when they learn about a new generation (they'll do the last flush earlier than usual). Furthermore, if generation_fetch_interval < 2 * ring_delay = 60s, the master will learn about the new generation before it starts operating. Then the observation lag will depend only on the querying frequency of each worker, the confidence window, and the communication delays between master and workers; thus, assuming that the communication delay is small, the lag will be roughly the same as if no generation switch was performed.
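
A minimal, self-contained sketch of the alternative design (illustrative only, not the connector's code; all names such as queryUpperBound and workerLoop are hypothetical): the master publishes the new generation's timestamp as an upper query bound as soon as it sees it, and each worker flushes its offsets and reports as soon as it has read the last window intersecting that bound.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

class GenerationSwitchSketch {
    // Master side: cap the workers as soon as a new generation (timestamp T) is visible.
    static final AtomicLong queryUpperBound = new AtomicLong(Long.MAX_VALUE);
    static volatile CountDownLatch workersDone = new CountDownLatch(0);

    static void onNewGenerationSeen(long generationTimestampMs, int workerCount) throws InterruptedException {
        workersDone = new CountDownLatch(workerCount);
        queryUpperBound.set(generationTimestampMs); // workers must not query past T
        workersDone.await();                        // ... then start new workers on the new generation's streams
    }

    // Worker side: clamp each window to the bound; after the last window, flush immediately and report.
    static void workerLoop(long windowSizeMs) throws InterruptedException {
        long windowStart = System.currentTimeMillis() - windowSizeMs;
        while (true) {
            long bound = queryUpperBound.get();
            long windowEnd = Math.min(windowStart + windowSizeMs, bound);
            // queryCdcWindow(windowStart, windowEnd);  // hypothetical: read CDC rows in (start, end]
            // persistOffsets(windowEnd);               // hypothetical: flush now, not on the next 60s tick
            if (windowEnd >= bound) {                   // last window reached -> tell the master
                workersDone.countDown();
                return;
            }
            windowStart = windowEnd;
            Thread.sleep(windowSizeMs);
        }
    }
}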

CDC log stream state (cdc$time) persisted via connect topic `connect-offsets`

Hi, when looking at the data published to the connect-offsets topic, I noticed the latest window state is tracked by:

  • 'connector name'
  • array (per table..?) of tuple4
    • keyspace_name
    • table_name
    • vnode_id
    • generation_start


Why is this at the vnode_id level and where does this information come from?
When querying the table the vnode_id is not used as a query condition, right?

Further implication (maybe?):
The topic connect-offsets is created by kafka connect (not the scylla connector) and is not a compacted topic.
While running a simple test (scylla.query.time.window.size: 2000) with 1 connector, 1 task and 1 table, ~1M messages accumulated on the docker-connect-offsets topic.
@pkgonan may I ask if you've got numbers to confirm this for a more comprehensive setup?

@haaawk how is this topic consumed upon connector (re)start / task/consumer rebalancing? From the beginning?


Update 2021-12-15:

ℹ️ For reference: the part on connect-offsets has already been well described and addressed in a section of the repo README:

#### Offset (progress) storage
Scylla CDC Source Connector reads the CDC log by querying on [Vnode](https://docs.scylladb.com/architecture/ringarchitecture/) granularity level. It uses Kafka Connect to store current progress (offset) for each Vnode. By default, there are 256 Vnodes per Scylla node. Kafka Connect stores those offsets in its `connect-offsets` internal topic, but it can grow large in the case of big Scylla clusters. You can minimize this topic's size by adjusting the following configuration options on this topic:
1. `segment.bytes` or `segment.ms` - lowering them will make the compaction process trigger more often.
2. `cleanup.policy=delete` and setting `retention.ms` to at least the TTL value of your Scylla CDC table (in milliseconds; Scylla default is 24 hours). Using this configuration, older offsets will be deleted. By setting `retention.ms` to at least the TTL value of your Scylla CDC table, we make sure to delete only those offsets that have already expired in the source Scylla CDC table.

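
As a hedged illustration of the two options above, the connect-offsets topic config can be changed with Kafka's AdminClient; the bootstrap address, topic name and retention value below are assumptions to adjust for your cluster.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ConnectOffsetsRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: point at your Kafka cluster

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "connect-offsets");
            List<AlterConfigOp> ops = List.of(
                    // delete old offsets instead of compacting them
                    new AlterConfigOp(new ConfigEntry("cleanup.policy", "delete"), AlterConfigOp.OpType.SET),
                    // keep offsets at least as long as the CDC table TTL (Scylla default 24 h = 86400000 ms)
                    new AlterConfigOp(new ConfigEntry("retention.ms", "86400000"), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}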

Kafka Connect Scylla Connector tasks getting deleted from status topic

While testing Scylla CDC with Kafka Connect, tasks are automatically getting removed, leading to no CDC events being streamed even when there are write ops on the table that has CDC enabled.

Connector Config

{ "name": "cdc-platform-scylla-load-test-3", "config": { "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector", "scylla.user": "***", "auto.create.topics.enable": "true", "scylla.table.names": "scyllacdcloadtest.livestream", "tasks.max": "50", "scylla.cluster.ip.addresses": "172.19.0.103:19042,172.19.0.104:19042", "scylla.password": "***", "key.converter.schemas.enable": "false", "value.converter.schemas.enable": "false", "value.converter": "org.apache.kafka.connect.json.JsonConverter", "scylla.name": "livestream-cdc-load-testing", "key.converter": "org.apache.kafka.connect.json.JsonConverter" } }

Table details -
CREATE TABLE scyllacdcloadtest.livestream (
    livestream_id text PRIMARY KEY,
    createreceivetime bigint,
    createtime bigint,
    endreceivetime bigint,
    endtime bigint,
    status text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'IncrementalCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Image showing write ops on the table where cdc is put.

Image showing cdc kafka topic messages/sec.

Listing the Kafka Connect connectors returns:

curl --location --request GET '100.98.4.123:8083/connectors'

[ "cdc-platform-scylla-load-test-3" ]

Describing the Kafka Connect connector returns:

curl --location --request GET '100.98.1.157:8083/connectors/cdc-platform-scylla-load-test-3'

{ "name": "cdc-platform-scylla-load-test-3", "config": { "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector", "scylla.user": "***", "auto.create.topics.enable": "true", "scylla.table.names": "scyllacdcloadtest.livestream", "tasks.max": "50", "scylla.cluster.ip.addresses": "172.19.0.103:19042,172.19.0.104:19042", "scylla.password": "***", "key.converter.schemas.enable": "false", "value.converter.schemas.enable": "false", "name": "cdc-platform-scylla-load-test-3", "value.converter": "org.apache.kafka.connect.json.JsonConverter", "scylla.name": "livestream-cdc-load-testing", "key.converter": "org.apache.kafka.connect.json.JsonConverter" }, "tasks": [], "type": "source" }
which shows there are no tasks running.
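
For completeness, the per-task state can also be inspected through Kafka Connect's REST status endpoint (GET /connectors/{name}/status); below is a small sketch using the JDK HTTP client, with the host/port taken from the curl calls above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorStatusCheck {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://100.98.1.157:8083/connectors/cdc-platform-scylla-load-test-3/status"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // prints the connector state and the state of each task (an empty task list here, matching the report above)
        System.out.println(response.body());
    }
}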

There are also null values for the task keys in the Kafka Connect status topic.

Scylla CDC Connector Tests

Currently the CDC connector is severely lacking in the tests department. This results in unnecessarily long rounds of manual testing, even when introducing the smallest of changes, and even after such testing it is hard to tell whether it was extensive enough.

What kind of tests would be useful:

Unit tests – Currently there are zero unit tests. The first step should be covering at least the critical parts. This would greatly help with quickly checking that nothing important breaks with new changes.

Integration tests – Here debezium-connector-cassandra may be a good reference. We could similarly set up Scylla using Testcontainers and the java-driver, and then thoroughly check whether the CDC tables are translated into Kafka messages correctly, without yet sending them (see the sketch at the end of this issue).

E2E – The most expensive type, but it would save a lot of pain compared to setting up every component manually. It would be good to have at least one setup with a recent Scylla version and one type of Kafka cluster.

Other nice-to-haves:

Stress tests – Mainly to see if anything breaks only under load. We could also think later about tracking performance.

“Nemesis” type of tests – Scenarios where we intentionally throw a wrench in-between important operations. For example: does the connector correctly resume if we crash it after processing pre-image event but before processing the insert it is related to? Will the offset be correct and pre-image reread upon restart?

Other Debezium connectors that may be used as references:
https://github.com/debezium/debezium-connector-spanner
https://github.com/debezium/debezium-connector-jdbc (This is a sink connector)
https://github.com/debezium/debezium-connector-cassandra
https://github.com/debezium/debezium-examples (Not a connector repo, but has some end-to-end examples)
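
A minimal integration-test sketch for the Testcontainers idea above, under stated assumptions: the scylladb/scylla Docker image, Testcontainers, JUnit 5 and the DataStax Java driver 4.x are available; the image tag, keyspace and table names are placeholders, and waiting only for the CQL port may need a stricter readiness check in practice.

import com.datastax.oss.driver.api.core.CqlSession;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

import java.net.InetSocketAddress;

class ScyllaCdcLogIT {
    @Test
    void cdcLogIsPopulated() {
        try (GenericContainer<?> scylla =
                     new GenericContainer<>(DockerImageName.parse("scylladb/scylla:5.2"))
                             .withCommand("--smp", "1")
                             .withExposedPorts(9042)) {
            scylla.start();
            try (CqlSession session = CqlSession.builder()
                    .addContactPoint(new InetSocketAddress(scylla.getHost(), scylla.getMappedPort(9042)))
                    .withLocalDatacenter("datacenter1") // Scylla's default DC name with the default snitch
                    .build()) {
                session.execute("CREATE KEYSPACE ks WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE ks.t (pk int PRIMARY KEY, v int) WITH cdc = {'enabled': true}");
                session.execute("INSERT INTO ks.t (pk, v) VALUES (1, 2)");
                // the CDC log table is created automatically as ks.t_scylla_cdc_log
                Assertions.assertNotNull(session.execute("SELECT * FROM ks.t_scylla_cdc_log").one());
            }
        }
    }
}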

Pulsar compatibility

Could we add Apache Pulsar compatibility to this same CDC Source Connector, or would we need to clone and create a side-by-side project that could be compatible with Apache Pulsar?

before field is null in Debezium format message

This is an example of a Debezium-format message for an update operation, produced when I use scylla-cdc-source-connector to fetch CDC data from Scylla. The before field is null, so it cannot be used together with Flink's DebeziumJson format:

{
"schema":{
"type":"struct",
"fields":[
{
"type":"struct",
"fields":[
{
"type":"string",
"optional":false,
"field":"version"
},
{
"type":"string",
"optional":false,
"field":"connector"
},
{
"type":"string",
"optional":false,
"field":"name"
},
{
"type":"int64",
"optional":false,
"field":"ts_ms"
},
{
"type":"string",
"optional":true,
"name":"io.debezium.data.Enum",
"version":1,
"parameters":{
"allowed":"true,last,false"
},
"default":"false",
"field":"snapshot"
},
{
"type":"string",
"optional":false,
"field":"db"
},
{
"type":"string",
"optional":false,
"field":"keyspace_name"
},
{
"type":"string",
"optional":false,
"field":"table_name"
},
{
"type":"int64",
"optional":false,
"field":"ts_us"
}
],
"optional":false,
"name":"com.scylladb.cdc.debezium.connector",
"field":"source"
},
{
"type":"struct",
"fields":[
{
"type":"struct",
"fields":[
{
"type":"string",
"optional":true,
"field":"value"
}
],
"optional":true,
"name":"MyScyllaCluster.dev.test_cdc.a.Cell",
"field":"a"
},
{
"type":"int32",
"optional":true,
"field":"pk"
},
{
"type":"struct",
"fields":[
{
"type":"int32",
"optional":true,
"field":"value"
}
],
"optional":true,
"name":"MyScyllaCluster.dev.test_cdc.v.Cell",
"field":"v"
}
],
"optional":true,
"name":"MyScyllaCluster.dev.test_cdc.Before",
"field":"before"
},
{
"type":"struct",
"fields":[
{
"type":"struct",
"fields":[
{
"type":"string",
"optional":true,
"field":"value"
}
],
"optional":true,
"name":"MyScyllaCluster.dev.test_cdc.a.Cell",
"field":"a"
},
{
"type":"int32",
"optional":true,
"field":"pk"
},
{
"type":"struct",
"fields":[
{
"type":"int32",
"optional":true,
"field":"value"
}
],
"optional":true,
"name":"MyScyllaCluster.dev.test_cdc.v.Cell",
"field":"v"
}
],
"optional":true,
"name":"MyScyllaCluster.dev.test_cdc.After",
"field":"after"
},
{
"type":"string",
"optional":true,
"field":"op"
},
{
"type":"int64",
"optional":true,
"field":"ts_ms"
},
{
"type":"struct",
"fields":[
{
"type":"string",
"optional":false,
"field":"id"
},
{
"type":"int64",
"optional":false,
"field":"total_order"
},
{
"type":"int64",
"optional":false,
"field":"data_collection_order"
}
],
"optional":true,
"field":"transaction"
}
],
"optional":false,
"name":"MyScyllaCluster.dev.test_cdc.Envelope"
},
"payload":{
"source":{
"version":"1.0.1",
"connector":"scylla",
"name":"MyScyllaCluster",
"ts_ms":1676889174620,
"snapshot":"false",
"db":"dev",
"keyspace_name":"dev",
"table_name":"test_cdc",
"ts_us":1676889174620301
},
"before":null,
"after":{
"a":null,
"pk":1,
"v":{
"value":2222222
}
},
"op":"u",
"ts_ms":1676889216086,
"transaction":null
}
}

https://github.com/apache/flink/blob/release-1.15.2/flink-formats/flink-json/src/main/java/org/apache/flink/formats/json/debezium/DebeziumJsonDeserializationSchema.java#L146

feature request: support for preimage

As a consumer of my CDC event stream (Kafka topic), with table cdc preimages enabled, I'd like to also receive data of the preimage *_cdc_log record (cdc$operation=0).

This would allow me to fully utilise the change event for stream processing use cases.

Optional: either follow the CDC setting of the (source) table in question, or have scylla-cdc-source-connector explicitly configure (enable/disable) the processing of preimages.

Example use cases:

  • an UPDATE to a table account where I'd like to determine whether the column accountname has changed (and its value, e.g. from 'a' -> 'b')
  • construct the full updated record (~postimage) to have a complete new record of my document, e.g. for streaming to other systems / databases

Runtime error while starting connector

Stack trace:

Caused by: java.lang.NoSuchMethodError: 'org.apache.kafka.connect.source.SourceConnectorContext com.scylladb.cdc.debezium.connector.ScyllaConnector.context()'
at com.scylladb.cdc.debezium.connector.ScyllaConnector.buildMaster(ScyllaConnector.java:67)
at com.scylladb.cdc.debezium.connector.ScyllaConnector.validate(ScyllaConnector.java:138)
at org.apache.kafka.connect.runtime.AbstractHerder.validateConnectorConfig(AbstractHerder.java:318)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder$6.call(DistributedHerder.java:672)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder$6.call(DistributedHerder.java:669)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:299)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:248)

Better errors in case automatic topic creation is disabled

If automatic topic creation is disabled on a Kafka Connect cluster, the connector will not start successfully. We should make sure that it gives clear error messages in that case and that this scenario is documented and explained in the docs (including the names of the topics to create: the target topic and the heartbeat topic).

feature request: support for collection types (LIST, SET, MAP) and UDT

As a consumer of my CDC event stream (Kafka topic), with table CDC enabled and collection types (LIST, SET, MAP) and UDTs in use, I'd like to receive change data for all columns of the *_cdc_log record, including collection-type and UDT fields.

This would allow me to utilise the change event for stream processing as no data is omitted.

Example use cases:

  • any consumer for a table cdc event where collection type / UDT cols have changed

Kafka Connector Vulnerabilities

Confluent regularly performs security scans on Confluent Hub connectors, as per Confluent’s security policy. Unfortunately this connector has been flagged as having unacceptable vulnerabilities and our policy is to escalate the connector to removal stages, unless we receive confirmation that the issues are being addressed by the partner.

I have attached the vulnerability scan. Please note that we acknowledge two exceptions for vulnerabilities raised:

  • Partner confirms that the vulnerability is a false positive
  • Partner confirms that the issue is valid but not exploitable

Please can you urgently acknowledge receipt of this email, and as soon as possible thereafter let us know the ScyllaDB position on these vulnerabilities.

If you require further information on any of the above, please do not hesitate to get in touch.

Best regards,

Confluent CCET Team

Attachment: scylladb.csv

Any plans to support changing the consistency level when reading the CDC log?

Hi.

Just reading the CDC log on Scylla takes too long, over a few seconds. When reading the CDC log, the connector seems to query the other data center as well. Are there any plans to support a setting that changes the consistency level used when reading CDC logs, for example LOCAL_QUORUM, EACH_QUORUM, etc.?
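
For illustration only (this is not the connector's code): with the Java driver 3.x that the connector shades, the consistency level of the CDC log read could be set per statement. The table name and parameter types below follow the query pattern shown in other issues here; the class and method names are hypothetical.

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

import java.nio.ByteBuffer;
import java.util.List;
import java.util.UUID;

class LocalQuorumCdcReadSketch {
    static ResultSet readCdcLog(Session session, List<ByteBuffer> streamIds, UUID windowStart, UUID windowEnd) {
        SimpleStatement stmt = new SimpleStatement(
                "SELECT * FROM test.user_scylla_cdc_log "
                        + "WHERE \"cdc$stream_id\" IN ? AND \"cdc$time\" > ? AND \"cdc$time\" <= ?",
                streamIds, windowStart, windowEnd);
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM); // only local-DC replicas need to respond
        return session.execute(stmt);
    }
}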

[Environment]

DC US_WEST_1 => AWS EC2 i3en.6xlarge * 3
DC EU_CENTRAL_1 => AWS EC2 i3en.6xlarge * 3

[Config]

CREATE KEYSPACE IF NOT EXISTS test_service WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'AWS_US_WEST_1' : 3,
'AWS_EU_CENTRAL_1' : 3
};

[SourceConnector Config]

  "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
  "tasks.max": "3",
  "scylla.cluster.ip.addresses": "'${database_url}'",
  "scylla.user": "'${database_user}'",
  "scylla.password": "'${database_password}'",
  "scylla.name": "cdc-data.test",
  "scylla.table.names": "'${table_include_list}'",
  "scylla.query.time.window.size": "5000",
  "scylla.confidence.window.size": "5000",
  "producer.override.acks": "-1",
  "producer.override.max.in.flight.requests.per.connection": "1",
  "producer.override.compression.type": "snappy",
  "producer.override.linger.ms": "50",
  "producer.override.batch.size": "327680",
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true",
  "topic.creation.default.replication.factor": "'${replication_factor}'",
  "topic.creation.default.partitions": "11"

[Metric]
(Screenshot from 2021-04-29 omitted.)

Kafka MaskField transform problem

Hi, I have a user table with a column named ssn that holds SSNs. When the source connector publishes to the Kafka topic, I want to mask out the ssn field, so I use the MaskField transform (https://docs.confluent.io/platform/current/connect/transforms/maskfield.html#maskfield). I set up the transform with the following properties:
transforms=dataMask
transforms.dataMask.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.dataMask.fields=ssn
transforms.dataMask.replacement=
But I'm not seeing the field being masked. I wonder whether the field name should be something other than the database column name? Thanks for any help.

cdc events are not sent to kafka

Hi.
cdc events are not sent to kafka.
When we tested in the dev environment (single DC), it worked well, but in the production environment (multi DC) it did not work.

When create, update and delete commands are executed on my_table (CDC enabled), the CDC log is written to my_table_scylla_cdc_log successfully, but the CDC events are not sent to the Kafka topic. However, heartbeat events are produced to Kafka successfully (Kafka topic: __debezium-heartbeat.cdc-data.test).

If an error were logged, we could tell what the problem is, but it is difficult to diagnose because no error log appears at all.

[Versions]

Kafka Broker Version : 2.6.0
Confluent Kafka Connect Version : 6.1.1
scylla-cdc-source-connector Version : 1.0.0
ScyllaDB Open Source Version : 4.4.1

[Configs - Same in all environments.]

  "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
  "tasks.max": "3",  
  "scylla.cluster.ip.addresses": "'${database_url}'",  
  "scylla.user": "'${database_user}'",
  "scylla.password": "'${database_password}'",
  "scylla.name": "cdc-data.test",
  "scylla.table.names": "'${table_include_list}'",
  "scylla.query.time.window.size": "1000",
  "scylla.confidence.window.size": "1000",
  "producer.override.acks": "-1",
  "producer.override.max.in.flight.requests.per.connection": "1",
  "producer.override.compression.type": "snappy",
  "producer.override.linger.ms": "50",
  "producer.override.batch.size": "327680",
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true",
  "topic.creation.default.replication.factor": "'${replication_factor}'",
  "topic.creation.default.partitions": "11"
CREATE TABLE IF NOT EXISTS test_service.my_table(
user_id varchar,
blocked_user_id varchar,
type varchar,
created_at timeuuid,
PRIMARY KEY((user_id, blocked_user_id)))
with cdc={'enabled': true, 'ttl': 172800};

[Dev Environment Config - Single DC]

CREATE KEYSPACE IF NOT EXISTS test_service WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'ap-northeast-1' : 3
};

[Production Environment Config - Multi DC]

CREATE KEYSPACE IF NOT EXISTS test_service WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'us-west-1' : 3,
'eu-central-1' : 3
};

[Confluent Kafka Connect Log]

[2021-04-26 08:01:55,208] INFO    tasks.max = 3 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    scylla.table.names = test_service.my_table (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    producer.override.batch.size = 327680 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    scylla.cluster.ip.addresses = AA:9042,BB:9042,CC:9042 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    heartbeat.interval.ms = 30000 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    scylla.password = ******** (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    producer.override.linger.ms = 50 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    scylla.query.time.window.size = 1000 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    task.class = com.scylladb.cdc.debezium.connector.ScyllaConnectorTask (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    producer.override.acks = -1 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    scylla.confidence.window.size = 1000 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    topic.creation.default.replication.factor = 2 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    name = my_table_db_connector (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    errors.tolerance = all (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,208] INFO    errors.log.enable = true (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,209] INFO    scylla.name = cdc-data.test (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,209] INFO    producer.override.max.in.flight.requests.per.connection = 1 (io.debezium.connector.common.BaseSourceTask:102)
[2021-04-26 08:01:55,219] INFO Requested thread factory for connector ScyllaConnector, id = cdc-data.test named = change-event-source-coordinator (io.debezium.util.Threads:270)
[2021-04-26 08:01:55,219] INFO Creating thread debezium-scyllaconnector-cdc-data.test-change-event-source-coordinator (io.debezium.util.Threads:287)
[2021-04-26 08:01:55,220] INFO WorkerSourceTask{id=my_table_db_connector-1} Source task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2021-04-26 08:01:55,220] INFO Metrics registered (io.debezium.pipeline.ChangeEventSourceCoordinator:91)
[2021-04-26 08:01:55,220] INFO Context created (io.debezium.pipeline.ChangeEventSourceCoordinator:94)
[2021-04-26 08:01:55,220] INFO Snapshot ended with SnapshotResult [status=SKIPPED, offset=com.scylladb.cdc.debezium.connector.ScyllaOffsetContext@58cdf7fb] (io.debezium.pipeline.ChangeEventSourceCoordinator:106)
[2021-04-26 08:01:55,220] INFO Connected metrics set to 'true' (io.debezium.pipeline.metrics.StreamingChangeEventSourceMetrics:60)
[2021-04-26 08:01:55,220] INFO Starting streaming (io.debezium.pipeline.ChangeEventSourceCoordinator:139)
[2021-04-26 08:01:55,220] INFO Using native clock to generate timestamps. (shaded.com.scylladb.cdc.driver3.driver.core.ClockFactory:57)
===== Using optimized driver!!! =====
[2021-04-26 08:01:55,220] INFO ===== Using optimized driver!!! ===== (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:186)
[2021-04-26 08:01:55,222] INFO Requested thread factory for connector ScyllaConnector, id = cdc-data.test named = change-event-source-coordinator (io.debezium.util.Threads:270)
[2021-04-26 08:01:55,222] INFO Creating thread debezium-scyllaconnector-cdc-data.test-change-event-source-coordinator (io.debezium.util.Threads:287)
[2021-04-26 08:01:55,223] INFO WorkerSourceTask{id=my_table_db_connector-0} Source task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSourceTask:233)
[2021-04-26 08:01:55,253] WARN Unable to register the MBean 'debezium.scylla:type=connector-metrics,context=snapshot,server=cdc-data.test': debezium.scylla:type=connector-metrics,context=snapshot,server=cdc-data.test (io.debezium.pipeline.ChangeEventSourceCoordinator:56)
[2021-04-26 08:01:55,253] WARN Unable to register the MBean 'debezium.scylla:type=connector-metrics,context=streaming,server=cdc-data.test': debezium.scylla:type=connector-metrics,context=streaming,server=cdc-data.test (io.debezium.pipeline.ChangeEventSourceCoordinator:56)
[2021-04-26 08:01:55,253] INFO Metrics registered (io.debezium.pipeline.ChangeEventSourceCoordinator:91)
[2021-04-26 08:01:55,253] INFO Context created (io.debezium.pipeline.ChangeEventSourceCoordinator:94)
[2021-04-26 08:01:55,253] INFO Snapshot ended with SnapshotResult [status=SKIPPED, offset=com.scylladb.cdc.debezium.connector.ScyllaOffsetContext@7b13f3ae] (io.debezium.pipeline.ChangeEventSourceCoordinator:106)
[2021-04-26 08:01:55,254] INFO Connected metrics set to 'true' (io.debezium.pipeline.metrics.StreamingChangeEventSourceMetrics:60)
[2021-04-26 08:01:55,254] INFO Starting streaming (io.debezium.pipeline.ChangeEventSourceCoordinator:139)
[2021-04-26 08:01:55,254] INFO Using native clock to generate timestamps. (shaded.com.scylladb.cdc.driver3.driver.core.ClockFactory:57)
===== Using optimized driver!!! =====
[2021-04-26 08:01:55,254] INFO ===== Using optimized driver!!! ===== (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:186)
[2021-04-26 08:01:55,831] INFO Using data-center name 'us-west-1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) (shaded.com.scylladb.cdc.driver3.driver.core.policies.DCAwareRoundRobinPolicy:110)
[2021-04-26 08:01:55,832] INFO New Cassandra host /AA:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /BB:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /CC:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /DD:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO Using data-center name 'us-west-1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) (shaded.com.scylladb.cdc.driver3.driver.core.policies.DCAwareRoundRobinPolicy:110)
[2021-04-26 08:01:55,832] INFO New Cassandra host /EE:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /FF:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /AA:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /BB:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /DD:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /CC:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /EE:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)
[2021-04-26 08:01:55,832] INFO New Cassandra host /FF:9042 added (shaded.com.scylladb.cdc.driver3.driver.core.Cluster:1812)

[Below is a log message that does not occur often; it appeared only once.]

[2021-04-26 05:19:11,602] INFO Query SELECT * FROM test_service.my_table_scylla_cdc_log WHERE "cdc$stream_id" IN ? AND "cdc$time">? AND "cdc$time"<=?; is not prepared on null, preparing before retrying executing. Seeing this message a few times is fine, but seeing it a lot may be source of performance problems (shaded.com.scylladb.cdc.driver3.driver.core.RequestHandler:822)
[2021-04-26 05:19:11,604] INFO Query SELECT * FROM test_service.my_table_scylla_cdc_log WHERE "cdc$stream_id" IN ? AND "cdc$time">? AND "cdc$time"<=?; is not prepared on null, preparing before retrying executing. Seeing this message a few times is fine, but seeing it a lot may be source of performance problems (shaded.com.scylladb.cdc.driver3.driver.core.RequestHandler:822)
[2021-04-26 05:19:11,644] INFO Query SELECT * FROM test_service.my_table_scylla_cdc_log WHERE "cdc$stream_id" IN ? AND "cdc$time">? AND "cdc$time"<=?; is not prepared on null, preparing before retrying executing. Seeing this message a few times is fine, but seeing it a lot may be source of performance problems (shaded.com.scylladb.cdc.driver3.driver.core.RequestHandler:822)

feature request: support for postimage

As a consumer of my CDC event stream (Kafka topic), with table cdc postimages enabled, I'd like to also receive data of the postimage *_cdc_log record (cdc$operation=9).

This would allow me to fully utilise the change event for stream processing use cases.

Without the CDC postimage record included in the message sent to Kafka, that state of the row is lost. Enriching the record as part of stream processing would not only result in extra read operations to Scylla (network IO, latency, ...); it is also impossible to fetch the actual point-in-time row postimage of the change event, since the row might have changed again in the meantime, or might no longer exist.

Optional: either follow the CDC setting of the (source) table in question, or have scylla-cdc-source-connector explicitly configure (enable/disable) the processing of postimages.

Example use cases:

  • have the full latest record available for operations where not all columns are written, e.g. for streaming to other systems / databases.

Add / use `poll.interval.ms` config option (!= `scylla.query.time.window.size`)

Description

To allow tuning/customising the behaviour of one's source connector setup, I'd like to also have a config option poll.interval.ms in addition to scylla.query.time.window.size, which today effectively defines both the query time window size and the query interval for a 'live' / caught-up worker task.

As per my understanding/reasoning, poll.interval.ms would/should be smaller than scylla.query.time.window.size, with the latter being applied during the catch-up / init phase.

Workers (Connect tasks) would ideally scatter their queries evenly across the assigned array of streamIds / streamIdGroups (the scylla-cdc-java worker task?).

Config Field Definition

poll.interval.ms
Positive integer value that specifies the frequency in milliseconds at which the connector should poll for new data in each worker task (Vnode). Defaults to 15,000 milliseconds.

  • Type: Integer
  • Importance: Low
  • Default: 15000
  • Description: Frequency in ms to poll for new data in each table.
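
A hedged sketch of how such an option could be declared with Kafka Connect's ConfigDef; poll.interval.ms is only the proposed name here, not an existing connector option.

import org.apache.kafka.common.config.ConfigDef;

class PollIntervalConfigSketch {
    static ConfigDef configDef() {
        return new ConfigDef()
                .define("poll.interval.ms",
                        ConfigDef.Type.INT,
                        15000,                          // proposed default: 15 seconds
                        ConfigDef.Range.atLeast(1),     // positive integer value
                        ConfigDef.Importance.LOW,
                        "Frequency in milliseconds to poll for new data in each worker task.");
    }
}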

References

java.lang.NoSuchFieldError: tlm with kafka 3.*

When creating a connector, it fails with a stacktrace:

java.lang.NoSuchFieldError: tlm
              at org.apache.log4j.MDCFriend.fixForJava9(MDCFriend.java:11)
              at org.slf4j.impl.Log4jMDCAdapter.<clinit>(Log4jMDCAdapter.java:38)
              at org.slf4j.impl.StaticMDCBinder.getMDCA(StaticMDCBinder.java:59)
              at org.slf4j.MDC.bwCompatibleGetMDCAdapterFromBinder(MDC.java:99)
              at org.slf4j.MDC.<clinit>(MDC.java:108)
              at org.apache.kafka.connect.util.LoggingContext.<init>(LoggingContext.java:209)
              at org.apache.kafka.connect.util.LoggingContext.forConnector(LoggingContext.java:104)
              at org.apache.kafka.connect.runtime.Worker.startConnector(Worker.java:282)
              at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startConnector(DistributedHerder.java:1803)
              at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$getConnectorStartingCallable$37(DistributedHerder.java:1809)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
              at java.base/java.lang.Thread.run(Thread.java:833)

Happens with Kafka 3.3.1, 3.4.0 (quay.io/strimzi/kafka). Does not happen with Kafka 2.8.1.

feature request: allow to define initial ChangeAgeLimit

As a connect user/admin I'd like to be able to configure scylla.change.age.limit.

(quoting scylla-cdc-go)

When the library starts for the first time it has to start consuming changes from some point in time. This parameter defines how far in the past it needs to look. If the value of the parameter is set to an hour, then the library will only read historical changes that are no older than an hour.

Consuming from a 'connector::table::vnodeId::streamId' should start from ~
max(persistedState, now()-tableCdcTtl, now()-changeAgeLimit)
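
A minimal sketch of the starting-point computation quoted above (all names are illustrative, not the connector's API):

class ChangeAgeLimitSketch {
    // Start reading a stream from the newest of: the persisted offset,
    // now() - CDC table TTL, and now() - the configured change age limit.
    static long startTimestampMs(long persistedStateMs, long tableCdcTtlMs, long changeAgeLimitMs) {
        long now = System.currentTimeMillis();
        return Math.max(persistedStateMs, Math.max(now - tableCdcTtlMs, now - changeAgeLimitMs));
    }
}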

References
