datastax / cassandra-data-migrator

Cassandra Data Migrator - Migrate & Validate data between origin and target Apache Cassandra®-compatible clusters.

License: Apache License 2.0

Java 88.77% Scala 2.94% Shell 7.81% Dockerfile 0.33% Makefile 0.14%

cassandra-data-migrator's Introduction


cassandra-data-migrator

Migrate and Validate Tables between Origin and Target Cassandra Clusters.

⚠️ Please note this job has been tested with Spark version 3.5.1.

Install as a Container

  • Get the latest image, which includes all dependencies, from DockerHub (a minimal pull/run sketch follows this list).
    • All migration tools (cassandra-data-migrator + dsbulk + cqlsh) are available in the /assets/ folder of the container.
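A minimal sketch of pulling and exploring the image, assuming it is published on DockerHub as datastax/cassandra-data-migrator and ships with a shell (tag and entrypoint may differ):

docker pull datastax/cassandra-data-migrator:latest
docker run --rm -it datastax/cassandra-data-migrator:latest bash
# inside the container, the migration tools live under /assets/
ls /assets/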

Install as a JAR file

Prerequisite

  • Install Java 11 (minimum), as the Spark binaries are compiled with it.
  • Install Spark version 3.5.1 on a single VM (no cluster necessary) where you want to run this job. Spark can be installed by running the following commands:
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.5.1-bin-hadoop3-scala2.13.tgz

⚠️ If the above Spark and Scala versions are not properly installed, you will see an exception like the one below when running CDM jobs. You can verify the installed versions as shown after the exception.

Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.Statics.releaseFence()V
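To confirm the Spark and Scala versions before running CDM, check the version banner (a quick sketch, assuming the tarball above was extracted into the current directory):

cd spark-3.5.1-bin-hadoop3-scala2.13
./bin/spark-submit --version
# the banner should report Spark 3.5.1 built for Scala 2.13.x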

Steps for Data-Migration:

⚠️ Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.

  1. Configure the cdm.properties file for your environment. Parameter descriptions and defaults are documented in the file itself. The file can have any name; it does not need to be cdm.properties. A minimal illustration of the connection-related entries appears after the notes below.
  2. Place the properties file where it can be accessed while running the job via spark-submit.
  3. Run the job using the spark-submit command as shown below:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Note:

  • The above command redirects all log output to a file (logfile_name_*.txt) instead of the console.
  • Update the memory options (driver & executor memory) based on your use-case.
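For orientation only, a minimal sketch of the connection-related entries you would typically set in cdm.properties; the property names below are assumptions based on the spark.cdm.connect.* naming used by version 4.x, so treat the bundled cdm.properties file as the authoritative reference for names and defaults:

spark.cdm.connect.origin.host        <origin-contact-point>
spark.cdm.connect.origin.port        9042
spark.cdm.connect.origin.username    <origin-username>
spark.cdm.connect.origin.password    <origin-password>
spark.cdm.connect.target.host        <target-contact-point>
spark.cdm.connect.target.username    <target-username>
spark.cdm.connect.target.password    <target-password>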

Steps for Data-Validation:

  • To run the job in Data validation mode, use class option --class com.datastax.cdm.job.DiffData as shown below
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
  • The validation job reports differences as ERROR entries in the log file, as shown below
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999) 
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
  • Grep the output log files for ERROR to get the list of missing and mismatched records (see the example after the note below).
    • Note that it lists differences by primary-key values.
  • The Validation job can also be run in an AutoCorrect mode. This mode can
    • Add any missing records from origin to target
    • Update any mismatched records between origin and target (makes target same as origin).
  • Enable or disable this feature using one or both of the settings below in the config file. They can also be passed at submit time via --conf, as shown after the settings.
spark.cdm.autocorrect.missing                     false|true
spark.cdm.autocorrect.mismatch                    false|true
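For example, a sketch of enabling both AutoCorrect options at submit time via --conf instead of editing the properties file (the rest of the command mirrors the validation example above):

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.autocorrect.missing=true \
--conf spark.cdm.autocorrect.mismatch=true \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt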

Note:

  • The validation job never deletes records from target, i.e. it only adds or updates data on target
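A simple way to pull the reported differences out of a run's log file (the file name pattern matches the commands above):

grep 'ERROR DiffJobSession' logfile_name_*.txt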

Migrating or Validating specific partition ranges

  • You can also use the tool to migrate or validate specific partition ranges by placing a partition file named ./<keyspacename>.<tablename>_partitions.csv, in the below format, in the current folder as input
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540

Each line above represents a partition-range (min,max). Alternatively, you can pass the partition file via a command-line parameter as shown below

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

This mode is especially useful for processing a subset of partition-ranges that failed during a previous run.

A file named ./<keyspacename>.<tablename>_partitions.csv, containing any failed partition ranges in the above format, is auto-generated by the Migration & Validation jobs. No file is created if there are no failed partitions. This file can be used as input to process the failed partitions in a subsequent run. You can also specify a different output file using the spark.cdm.tokenrange.partitionFile.output option.

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
 --conf spark.cdm.tokenrange.partitionFile.output="/<path-to-file>/<csv-output-filename>" \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

For the Data-Validation step, use the conf option --conf spark.cdm.tokenrange.partitionFile.appendOnDiff as shown below. This causes the partition range to be written to the output file whenever there are differences, not just failures.

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
 --conf spark.cdm.tokenrange.partitionFile.output="/<path-to-file>/<csv-output-filename>" \
 --conf spark.cdm.tokenrange.partitionFile.appendOnDiff=true \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

If spark.cdm.tokenrange.partitionFile.input or spark.cdm.tokenrange.partitionFile.output are not specified, the system will use ./<keyspacename>.<tablename>_partitions.csv as the default file.

Perform large-field Guardrail violation checks

  • The tool can also identify large fields in a table that may break your cluster's guardrails (e.g. Astra DB has a 10MB limit for a single large field). Run it with --class com.datastax.cdm.job.GuardrailCheck as shown below
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Features

  • Auto-detects table schema (column names, types, keys, collections, UDTs, etc.)
  • Preserves writetimes and TTLs
  • Supports migration/validation of advanced data types (sets, lists, maps, UDTs)
  • Filters records from Origin using writetimes and/or CQL conditions and/or a list of token-ranges
  • Performs guardrail checks (identifies large fields)
  • Supports adding constants as new columns on Target
  • Supports expanding Map columns on Origin into multiple records on Target
  • Fully containerized (Docker and K8s friendly)
  • SSL support (including custom cipher algorithms)
  • Migrates from any Cassandra Origin (Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™) to any Cassandra Target (Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™)
  • Supports migration/validation from and to Azure Cosmos Cassandra
  • Validates migration accuracy and performance using a smaller randomized data-set
  • Supports adding a custom fixed writetime
  • Validation: logs partition-range-level exceptions; use the exceptions file as input for a rerun

Known Limitations

  • This tool does not migrate TTL & writetime at the field level (for optimization reasons). Instead, it finds the field with the highest TTL and the field with the highest writetime within an origin row and uses those values for the entire target row (see the illustrative query below).
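As an illustration only (hypothetical keyspace, table, and column names), the per-column values that CDM takes the maximum of can be inspected on the origin with a query like this:

cqlsh -e "SELECT pk, writetime(col1), ttl(col1), writetime(col2), ttl(col2) FROM my_ks.my_table WHERE pk = 'k1';"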

Building Jar for local development

  1. Clone this repo
  2. Move to the repo folder cd cassandra-data-migrator
  3. Run the build: mvn clean package (needs Maven 3.9.x)
  4. The fat jar (cassandra-data-migrator-4.x.x.jar) should now be present in the target folder. The full sequence is sketched below.
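The steps above, sketched end-to-end (repository URL assumed from the project coordinates):

git clone https://github.com/datastax/cassandra-data-migrator.git
cd cassandra-data-migrator
mvn clean package
ls target/cassandra-data-migrator-*.jar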

Contributors

Check out all our wonderful contributors here.



cassandra-data-migrator's Issues

Counts reported during retry are incorrect & error-counts are never reported

There are two issues with counts when there is an exception while processing a partition-range:

  1. When the partition-range is retried, the reads/writes keep getting added to the counts already recorded for the same range, instead of the counts being reset for that partition-range before restarting. This leads to incorrect read/write stats whether the partition-range ultimately succeeds or fails.
  2. The error count is never reported when the error for a partition-range is not resolved after retries.

Refactor CDM field-size guardrail check reporting

The CDM tool currently has a utility to report on large single fields that may break guardrail limits (e.g. Astra's 10MB limit on the size of values in a single column). This functionality is currently broken and difficult to maintain.

Make this utility work again and refactor the code to be much leaner and easier to maintain.

Astra to DSE migration not working

When the origin is an Astra cluster and the target is non-Astra, the migration fails with authentication exceptions stating that the username/password is incorrect. However, the logs indicate that the credentials (username & password) of the non-Astra cluster are being used to authenticate against the Astra DNS.

enable a Map to be exploded

When migrating a Map, allow the map to be expanded into two adjacent columns (map key/value).

  1. Indicate the source column to be expanded via index reference in the .properties file
  2. On spark.query.target, in place of the position where the map column is, specify two columns corresponding to the map key and map value, respectively. This may result in spark.query.target having one more column than spark.query.origin.

Migrate / Diff / Patch should work normally, though the compare row counts will differ.

Handle blank `timestamp` values in primary-key columns gracefully

Issue: CDM is unable to handle (it throws the exception below) blank timestamp values in primary-key columns. Note: ideally a primary-key timestamp column should never have blanks; however, Cassandra/DSE/Astra seems to allow such values for timestamp (this appears to be a bug), while it does not allow the same for date or time columns. This causes the Migration/Validation to fail midway for the partition-range (the issue is not handled gracefully).

Exception:

java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.servererrors.InvalidQueryException: Invalid null value in condition for column transactiondate
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at datastax.astra.migrate.CopyJobSession.iterateAndClearWriteResults(CopyJobSession.java:191)
	at datastax.astra.migrate.CopyJobSession.getDataAndInsert(CopyJobSession.java:148)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3(Migrate.scala:30)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3$adapted(Migrate.scala:28)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2(Migrate.scala:28)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2$adapted(Migrate.scala:27)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)

Acceptable fix: identify, log, and skip (gracefully handle) blank timestamp records; this will be the default behavior. Optionally, also allow a config to replace the blank timestamp with a fixed timestamp value, i.e. instead of skipping blank values, insert a configured dummy value in the field (in addition to logging it).

Improve readme docs to explain step-by-step container image usage in doing migrations


Automatically set .batchSize based on the primary key

Automatically set .batchSize to 1 if a table has a primary key that is also the partition key. Example tables as follows:

Example 1:

CREATE TABLE IF NOT EXISTS ks1.tbl1 (
  pk1 int,
  pk2 bigint,
  c1 text,
  c2 uuid,
  PRIMARY KEY ((pk1,pk2))
);

Example 2:

CREATE TABLE IF NOT EXISTS ks1.tbl1 (
  c1 text PRIMARY KEY,
  c2 uuid
);

Otherwise, use the default .batchSize.

Not escaping quotes on insert to target for columns that require it

CDM is not escaping column names that require quoting, such as "order", in the insert statement on the target. This produces a syntax error on insert:

com.datastax.oss.driver.api.core.servererrors.SyntaxError: line 1:121 no viable alternative at input 'order' (...document_type,order_header_key,create_ts,journals,last_upd_ts,[order]...)
        at com.datastax.oss.driver.api.core.servererrors.SyntaxError.copy(SyntaxError.java:48)
        at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
        at com.datastax.oss.driver.internal.core.cql.CqlPrepareSyncProcessor.process(CqlPrepareSyncProcessor.java:59)
        at com.datastax.oss.driver.internal.core.cql.CqlPrepareSyncProcessor.process(CqlPrepareSyncProcessor.java:31)
        at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
        at com.datastax.oss.driver.api.core.cql.SyncCqlSession.prepare(SyncCqlSession.java:224)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:43)
        at com.sun.proxy.$Proxy62.prepare(Unknown Source)

Source Schema:

CREATE TABLE ks_carrybooking_qp.order (
    extn_host_order_ref text,
    document_type text,
    order_header_key text,
    create_ts timestamp,
    journals text,
    last_upd_ts timestamp,
    "order" blob,
    PRIMARY KEY (extn_host_order_ref, document_type, order_header_key)
)

Destination Schema:

CREATE TABLE ks_carrybooking_qp.order1 (
    extn_host_order_ref text,
    document_type text,
    order_header_key text,
    create_ts timestamp,
    journals text,
    last_upd_ts timestamp,
    "order" blob,
    PRIMARY KEY (extn_host_order_ref, document_type, order_header_key)
)

Document how to override default Java Driver configuration

This is for the 3.3.0_stable branch, but it would be helpful to clarify the same in the main branch too.

Today, there is no manual/documentation explaining how to pass in Java Driver configuration to override parameters such as request timeouts, heartbeat timeouts, etc.

Original Error

Stacktrace 1:

23/05/03 11:50:06 ERROR DiffJobSession: Could not perform diff for Key: 10061674 %% 1 %% ABCPTSQDBRRLCKJ07
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.connection.HeartbeatException
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at datastax.astra.migrate.DiffJobSession.diffAndClear(DiffJobSession.java:107)
        at datastax.astra.migrate.DiffJobSession.lambda$getDataAndDiff$0(DiffJobSession.java:85)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at com.datastax.oss.driver.internal.core.cql.PagingIterableSpliterator.forEachRemaining(PagingIterableSpliterator.java:118)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
        at datastax.astra.migrate.DiffJobSession.getDataAndDiff(DiffJobSession.java:69)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3(DiffData.scala:28)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3$adapted(DiffData.scala:26)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2(DiffData.scala:26)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2$adapted(DiffData.scala:25)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1(DiffData.scala:25)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1$adapted(DiffData.scala:24)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.oss.driver.api.core.connection.HeartbeatException
        at com.datastax.oss.driver.internal.core.channel.HeartbeatHandler$HeartbeatRequest.fail(HeartbeatHandler.java:109)
        at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.fail(ChannelHandlerRequest.java:62)
        at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:108)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 more
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Heartbeat request: timed out after 5000 ms
        ... 10 more

Stacktrace 2:

23/05/02 13:16:35 ERROR CopyJobSession: Error occurred during Attempt#: 1
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT10S
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at datastax.astra.migrate.CopyJobSession.iterateAndClearWriteResults(CopyJobSession.java:197)
        at datastax.astra.migrate.CopyJobSession.getDataAndInsert(CopyJobSession.java:109)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3(Migrate.scala:30)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3$adapted(Migrate.scala:28)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2(Migrate.scala:28)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2$adapted(Migrate.scala:27)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$1(Migrate.scala:27)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$1$adapted(Migrate.scala:26)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT10S
        at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
        at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
        at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
        at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 more
23/05/02 13:16:35 ERROR CopyJobSession: Error with PartitionRange -- ThreadID: 123 Processing min: 3335171328526692640 max: 3337016002934063595 -- Attempt# 1
23/05/02 13:16:35 ERROR CopyJobSession: Error stats Read#: 47, Wrote#: 0, Skipped#: 0, Error#: 47

What options were tried

Option 1

Added spark.cassandra.connection.timeoutMS 90000 in cdm.properties. This did not increase the timeout value.

Option 2

Added --conf datastax-java-driver.advanced.heartbeat.timeout="90 seconds" --conf datastax-java-driver.basic.request.timeout="60 seconds" to the ./spark-submit command; this had no effect on the timeout.

Option 3

Attempted the following, also without success:

./spark-submit \
--files /path/to/application.conf \
--conf spark.cassandra.connection.config.profile.path=application.conf \
...

In this case, it only considers properties from application.conf and ignores everything else in cdm.properties, as we could confirm from the warning below:

23/05/02 16:20:56 WARN CassandraConnectionFactory: Ignoring all programmatic configuration, only using configuration from application.conf

Option 4

Attempted to add --driver-java-options "-Ddriver.basic.request.timeout='60 seconds'" to the ./spark-submit command; this also had no effect. (A sketch of an application.conf combining the driver settings from Option 2 follows.)
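For illustration, a minimal sketch (not a verified working configuration) of an application.conf combining the two driver settings attempted in Option 2, written here with a shell heredoc; whether CDM honors this file alongside cdm.properties is exactly the open question of this issue:

cat > /path/to/application.conf <<'EOF'
datastax-java-driver {
  basic.request.timeout = 60 seconds
  advanced.heartbeat.timeout = 90 seconds
}
EOF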

Introduce support for new vector CQL data type

With the recent introduction of the vector CQL data type, we need CDM to be able to handle it seamlessly.

Notes:

The CQL vector data type maps to the CqlVector Java class; another item to consider is mapping CQL vectors to Java arrays via custom codecs (i.e. the vector<float> CQL type to a float[] Java array).

If CDM is pulling in the Java Driver dependency via the Spark Cassandra Connector, then https://datastax-oss.atlassian.net/browse/SPARKC-706 may be a precursor to building this capability.

NPE observed in logs when running DiffData job across multiple nodes

A null pointer exception is noticed in the logs while running the DiffData Spark job on multiple containers/nodes. When attempting to print final counts, the source and destination sessions are passed as null. Although the sessions are not required to print counts, the nulls must be handled gracefully so the job does not error out.

Config file should represent base use-case

The config file should represent the base use-case. All other complex and special-case scenarios should be present in the config but disabled, with appropriate comments on when to enable them.

Error accessing spark config while running spark job across multiple nodes

A sample Spark job like the one below throws an error when trying to fetch the Spark config in the Migrate or Diff jobs. The job is run in cluster mode, where it is split across multiple containers/nodes:
spark-submit --properties-file test.properties --num-executors=3 --master yarn --deploy-mode cluster --class datastax.astra.migrate.Migrate cassandra-data-migrator.jar &> log.txt

Can’t migrate time type

Version

Cassandra 3.11.10
Spark 2.4.8
cassandra-data-migrator 2.10.3

Origin table

CREATE TYPE ks.myty (
  a int,
  b text
);

CREATE TABLE ks.tb (
   id int PRIMARY KEY, 
   tt text,
   bg bigint,
   db double,
   tm time,
   mp map<int, text>,
   lt list<text>,
   bb blob,
   st set<int>,
   uu uuid,
   bn boolean,
   tp tuple<int, int>,
   ft float,
   tn tinyint,
   dc decimal,
   dt date,
   ut myty,
   vi varint
);

INSERT INTO ks.tb (id, tt, bg, db, tm, mp, lt, bb, st, uu, bn, tp, ft, tn, dc, dt, ut, vi) VALUES (
  1, 
  'text', 
  1000000000, 
  1.2, 
  '08:12:54', 
  {1:'a'},
  ['a', 'b'],
  bigintAsBlob(123),
  {1,2,3},
  now(),
  true,
  (1,1),
  1.1,
  1,
  12,
  '2011-02-03',
  {a: 1, b: 'b'},
  0
);

Target table

The target table is the same as the origin table.

sparkConf.properties

spark.query.origin                                id,tt,bg,db,tm,mp,lt,bb,st,uu,bn,tp,ft,tn,dc,dt,ut,vi
spark.query.origin.partitionKey                   id
spark.query.target.id                             id
spark.query.types                                 1,0,2,3,4,5%1%0,6%0,7,8%1,9,10,11,12,13,14,15,16,17

Error

23/02/06 11:37:27 ERROR CopyJobSession: Error occurred retry#: 1
com.datastax.oss.driver.api.core.type.codec.CodecNotFoundException: Codec not found for requested operation: [TIME <-> java.time.LocalDate]
        at com.datastax.oss.driver.internal.core.type.codec.registry.CachingCodecRegistry.createCodec(CachingCodecRegistry.java:609)
        at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry$1.load(DefaultCodecRegistry.java:95)
        at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry$1.load(DefaultCodecRegistry.java:92)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2276)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache.get(LocalCache.java:3951)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache.getOrLoad(LocalCache.java:3973)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4957)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4963)
        at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry.getCachedCodec(DefaultCodecRegistry.java:117)
        at com.datastax.oss.driver.internal.core.type.codec.registry.CachingCodecRegistry.codecFor(CachingCodecRegistry.java:215)
        at com.datastax.oss.driver.api.core.data.GettableByIndex.get(GettableByIndex.java:126)
        at datastax.astra.migrate.BaseJobSession.getData(BaseJobSession.java:95)
        at datastax.astra.migrate.AbstractJobSession.bindInsert(AbstractJobSession.java:194)
        at datastax.astra.migrate.CopyJobSession.getDataAndInsert(CopyJobSession.java:121)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1$$anonfun$apply$2.apply(Migrate.scala:29)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1$$anonfun$apply$2.apply(Migrate.scala:27)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:105)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:122)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:104)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1.apply(Migrate.scala:27)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1.apply(Migrate.scala:26)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:105)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:122)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:104)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1.apply(Migrate.scala:26)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1.apply(Migrate.scala:25)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Other info

After I add a new type like this:

diff --git a/src/main/java/datastax/astra/migrate/MigrateDataType.java b/src/main/java/datastax/astra/migrate/MigrateDataType.java
index cd324e0..f176aba 100644
--- a/src/main/java/datastax/astra/migrate/MigrateDataType.java
+++ b/src/main/java/datastax/astra/migrate/MigrateDataType.java
@@ -8,6 +8,7 @@ import java.math.BigInteger;
 import java.nio.ByteBuffer;
 import java.time.Instant;
 import java.time.LocalDate;
+import java.time.LocalTime;
 import java.util.*;

 public class MigrateDataType {
@@ -81,6 +82,8 @@ public class MigrateDataType {
                 return UdtValue.class;
             case 17:
                 return BigInteger.class;
+            case 18:
+                return LocalTime.class;
         }

         return Object.class;

And change sparkConf.properties like this:

spark.query.origin                                id,tt,bg,db,tm,mp,lt,bb,st,uu,bn,tp,ft,tn,dc,dt,ut,vi
spark.query.origin.partitionKey                   id
spark.query.target.id                             id
spark.query.types                                 1,0,2,3,18,5%1%0,6%0,7,8%1,9,10,11,12,13,14,15,16,17

It migrates successfully.

CI/CD workflow

We need to have a good CI/CD workflow to test for any regressions with unit tests.

Rename splitSize to numSplits

Rename splitSize to numSplits, as this parameter is used to make that many splits of the Cassandra token range. The word "size" is confusing because it suggests that a small value will throttle the load down, while it actually has the opposite effect.
Also, keep backward compatibility with splitSize (for some time), but remove it from the docs.

CDM 4.1.3 fails with NPE when a target counter table column has null values

Create an Origin counter table

CREATE TABLE IF NOT EXISTS test_ks.comments_count (
	comment_id text PRIMARY KEY,
	like_count counter, 
	reply_count counter,
	report_count counter,
	seq_counter counter);

Create a Target counter table

CREATE TABLE IF NOT EXISTS test_ks.comments_count_dest (
	comment_id text PRIMARY KEY,
	like_count counter, 
	reply_count counter,
	report_count counter,
	seq_counter counter);

Add a record to the Origin table with no nulls, and add the same record on the Target with a null in one column:

token@cqlsh:test_ks> update comments_count set like_count = like_count +1, reply_count = reply_count +1, report_count = report_count + 1, seq_counter = seq_counter +1 where comment_id = 't1';
token@cqlsh:test_ks> update comments_count_dest set like_count = like_count +1, reply_count = reply_count +1, seq_counter = seq_counter +1 where comment_id = 't1';
token@cqlsh:test_ks> select * FROM comments_count;

 comment_id | like_count | reply_count | report_count | seq_counter
------------+------------+-------------+--------------+-------------
         t1 |          1 |           1 |            1 |           1

token@cqlsh:test_ks> select * FROM comments_count_dest ;

 comment_id | like_count | reply_count | report_count | seq_counter
------------+------------+-------------+--------------+-------------
         t1 |          1 |           1 |         null |           1

Now run CDM using version 4.1.3 (or an earlier 4.x version); you will get the below exception:


23/08/16 11:39:49 INFO TaskSetManager: Finished task 49.0 in stage 0.0 (TID 49) in 29 ms on pbhat-rmbp16.lan (executor driver) (49/50)
23/08/16 11:39:49 ERROR TargetUpdateStatement: Error trying to bind value:0 to column:report_count of targetDataType:COUNTER/java.lang.Long at column index:3
23/08/16 11:39:49 ERROR CopyJobSession: Error occurred during Attempt#: 1
java.lang.NullPointerException
	at com.datastax.cdm.cql.statement.TargetUpdateStatement.bind(TargetUpdateStatement.java:51)
	at com.datastax.cdm.cql.statement.TargetUpsertStatement.bindRecord(TargetUpsertStatement.java:74)
	at com.datastax.cdm.job.CopyJobSession.bind(CopyJobSession.java:175)
	at com.datastax.cdm.job.CopyJobSession.getDataAndInsert(CopyJobSession.java:98)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:52)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:22)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3(Migrate.scala:13)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3$adapted(Migrate.scala:11)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$2(Migrate.scala:11)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$2$adapted(Migrate.scala:10)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$1(Migrate.scala:10)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$1$adapted(Migrate.scala:9)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
23/08/16 11:39:49 ERROR CopyJobSession: Error with PartitionRange -- ThreadID: 94 Processing min: 4058283696216101360 max: 4242751136953196876 -- Attempt# 1
23/08/16 11:39:49 ERROR CopyJobSession: Error stats Read#: 1, Wrote#: 0, Skipped#: 0, Error#: 1

When the DiffData/Validation job has exceptions, it's difficult to rerun by partition-range (unlike Migrate)

See details provided by the customer:
We have done some initial testing of the cassandra-data-migrator-3.4.2.jar and so far the incremental timestamp fix is working as expected. We do however have one question about a particular scenario we observe when running the "DiffData"/Validator job.
Occasionally, when the job is initially starting, we observe the following errors:

23/05/08 13:11:05 ERROR DiffJobSession: Could not perform diff for Key: 17241788
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at datastax.astra.migrate.DiffJobSession.diffAndClear(DiffJobSession.java:108)
        at datastax.astra.migrate.DiffJobSession.lambda$getDataAndDiff$0(DiffJobSession.java:86)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at com.datastax.oss.driver.internal.core.cql.PagingIterableSpliterator.forEachRemaining(PagingIterableSpliterator.java:118)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at datastax.astra.migrate.DiffJobSession.getDataAndDiff(DiffJobSession.java:70)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3(DiffData.scala:28)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3$adapted(DiffData.scala:26)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2(DiffData.scala:26)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2$adapted(DiffData.scala:25)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1(DiffData.scala:25)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1$adapted(DiffData.scala:24)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query

The errors eventually stop and the job is able to carry on normally and finish.
We assume this is because the read/write throughput is too high when the job starts and the requests are not being throttled correctly. When this happens during the "Migrate" job, we are able to extract the partition key ranges and re-run them manually at a lower throughput.
However, since this job only prints out the primary key values, is there any suggested mitigation/retry strategy other than re-running the entire table at a lower throughput? For context, the throughput we select works with good read/write latencies on the server side for the entirety of the job; it just has errors at the beginning.

Migrate with UDT fails if the keyspace names are different

Migrate with UDTs fails if the keyspace names are different. This happens because a UDT uses the keyspace name as part of its full name, and the dynamic Java objects created internally to represent the UDT end up with different types that cannot be auto-cast.

Ability to stop/kill & resume the job

The tool should have the ability to resume a job from the point of a previous exit. Customers may want to kill a running job (to change the throttling speed, etc.), or a long-running job may get killed for various reasons (VM restart, etc.). The tool should record the partitions that have been processed and use that as input during a future run to exclude already-migrated partitions.

Duplicate values observed in List cells when running Migrate or Validation multiple times

Duplicate values observed in List cells when running Migrate or Validation multiple times.

Note: This issue is a bug in Cassandra (List inserts are not truly idempotent) & not really a CDM issue.

Details of the issue: As a side effect of the above C* bug, when CDM Migration or Validation (with auto-correct) is executed multiple times on the same table (this is not common, but can happen for various reasons), any existing list fields on the target are overwritten again with the same timestamp (list values and timestamp taken from the source); instead of replacing the full list, this duplicates all the entries in the list. If you run the tool N times, the values are duplicated N times.

Priority/Customer impact: We have a customer in the midst of migrating several DSE 5.1 clusters to DSE 6.8. They are using ZDM + CDM for this and are facing this issue. They understand that this is not a CDM issue; however, they have asked us to provide a possible workaround/fix.

Workaround/Fix: The issue (duplicate cell values in a list) happens only when the timestamp is exactly the same. If we increment the long epoch timestamp value by 1 (i.e. 1 microsecond, 1/1,000,000 second) and use this as the timestamp, the issue goes away. The timestamp value is effectively unchanged for humans and does not create any (known) issues with migration (ZDM and/or CDM). Hence we will conditionally allow users to add an arbitrary long value (typically 1) to the timestamp, which allows them to work around this issue.

CDM 4.1.2 fails with NPE when a counter column has null values

Consider a counter table as follows,

CREATE TABLE IF NOT EXISTS test_ks.comments_count (
	comment_id text PRIMARY KEY,
	like_count counter,                        << all counters
	reply_count counter,
	report_count counter,
	seq_counter counter);

and if it has data with null values for certain columns,

token@cqlsh:test_ks> SELECT * FROM comments_count;
 comment_id | like_count | reply_count | report_count | seq_counter
------------+------------+-------------+--------------+-------------
         t3 |          3 |          10 |         null |           5

(1 rows)

When the migrate job is run, we get an error as follows:

23/08/14 09:25:34 ERROR TargetUpdateStatement: Error trying to bind value:10 to column:report_count of targetDataType:COUNTER/java.lang.Long at column index:3
23/08/14 09:25:34 ERROR CopyJobSession: Error occurred during Attempt#: 1
java.lang.NullPointerException
	at com.datastax.cdm.cql.statement.TargetUpdateStatement.bind(TargetUpdateStatement.java:51)
	at com.datastax.cdm.cql.statement.TargetUpsertStatement.bindRecord(TargetUpsertStatement.java:74)
	at com.datastax.cdm.job.CopyJobSession.bind(CopyJobSession.java:175)
	at com.datastax.cdm.job.CopyJobSession.getDataAndInsert(CopyJobSession.java:98)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:52)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:22)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3(Migrate.scala:13)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3$adapted(Migrate.scala:11)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$2(Migrate.scala:11)

Update docs for new users

CDM works really well if you know all the corners of the codebase and configuration. It would be great to surface this knowledge in the documentation, both in the GitHub repo and in the zero-downtime migration documentation.

Maven has removed version 3.9.3 from its CDN location, causing Docker publishes to fail

The Dockerfile today relies on a hard-coded version of Maven (i.e. 3.9.3). Since Maven has removed this version from its CDN location (and replaced it with 3.9.4), we see the below error when we run the Build and Publish to Docker workflow. A sketch of the updated download step follows the error.

Error Stack Received

curl -fsSL -o /tmp/apache-maven.tar.gz https://dlcdn.apache.org/maven/maven-3/3.9.3/binaries/apache-maven-3.9.3-bin.tar.gz

0.725 curl: (22) The requested URL returned error: 404
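A minimal sketch of the fix, assuming the Dockerfile's download step looks like the curl command above: bump the hard-coded version (ideally via a single variable) to one that still exists on the CDN at build time, e.g. 3.9.4 at the time of this issue:

MAVEN_VERSION=3.9.4
curl -fsSL -o /tmp/apache-maven.tar.gz "https://dlcdn.apache.org/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz"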

progress counts are inaccurate

In the following example, reported in the logs by a single thread:

23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Read Record Count: 52600000
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Mismatch Record Count: 4
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Corrected Mismatch Record Count: 4
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Missing Record Count: 0
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Corrected Missing Record Count: 0
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Valid Record Count: 52587843
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Skipped Record Count: 0

The Valid + Corrected counts do not match the Read count.

The issue is that the threads report global totals, and these totals are not updated as an atomic group.
