datastax / cassandra-data-migrator

Cassandra Data Migrator - Migrate & Validate data between origin and target Apache Cassandra®-compatible clusters.

License: Apache License 2.0

Java 88.77% Scala 2.94% Shell 7.81% Dockerfile 0.33% Makefile 0.14%

cassandra-data-migrator's Introduction


cassandra-data-migrator

Migrate and Validate Tables between Origin and Target Cassandra Clusters.

⚠️ Please note this job has been tested with Spark version 3.5.1.

Install as a Container

  • Get the latest image, which includes all dependencies, from DockerHub (a minimal pull/run sketch follows this list).
    • All migration tools (cassandra-data-migrator + dsbulk + cqlsh) are available in the /assets/ folder of the container.
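A minimal sketch of pulling and exploring the image, assuming it is published on DockerHub as datastax/cassandra-data-migrator and ships with a shell (tag and entrypoint may differ):

docker pull datastax/cassandra-data-migrator:latest
docker run --rm -it datastax/cassandra-data-migrator:latest bash
# inside the container, the migration tools live under /assets/
ls /assets/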

Install as a JAR file

Prerequisite

  • Install Java 11 (minimum), as the Spark binaries are compiled with it.
  • Install Spark version 3.5.1 on a single VM (no cluster necessary) where you want to run this job. Spark can be installed by running the following commands:
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.5.1-bin-hadoop3-scala2.13.tgz

⚠️ If the above Spark and Scala versions are not properly installed, you will see an exception like the one below when running CDM jobs. You can verify the installed versions as shown after the exception.

Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.Statics.releaseFence()V
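To confirm the Spark and Scala versions before running CDM, check the version banner (a quick sketch, assuming the tarball above was extracted into the current directory):

cd spark-3.5.1-bin-hadoop3-scala2.13
./bin/spark-submit --version
# the banner should report Spark 3.5.1 built for Scala 2.13.x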

Steps for Data-Migration:

⚠️ Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.

  1. Configure the cdm.properties file for your environment. Parameter descriptions and defaults are documented in the file itself. The file can have any name; it does not need to be cdm.properties. A minimal illustration of the connection-related entries appears after the notes below.
  2. Place the properties file where it can be accessed while running the job via spark-submit.
  3. Run the job using the spark-submit command as shown below:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Note:

  • The above command redirects all log output to a file (logfile_name_*.txt) instead of the console.
  • Update the memory options (driver & executor memory) based on your use-case.
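For orientation only, a minimal sketch of the connection-related entries you would typically set in cdm.properties; the property names below are assumptions based on the spark.cdm.connect.* naming used by version 4.x, so treat the bundled cdm.properties file as the authoritative reference for names and defaults:

spark.cdm.connect.origin.host        <origin-contact-point>
spark.cdm.connect.origin.port        9042
spark.cdm.connect.origin.username    <origin-username>
spark.cdm.connect.origin.password    <origin-password>
spark.cdm.connect.target.host        <target-contact-point>
spark.cdm.connect.target.username    <target-username>
spark.cdm.connect.target.password    <target-password>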

Steps for Data-Validation:

  • To run the job in Data validation mode, use class option --class com.datastax.cdm.job.DiffData as shown below
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
  • The validation job reports differences as ERROR entries in the log file, as shown below
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999) 
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
  • Grep the output log files for ERROR to get the list of missing and mismatched records (see the example after the note below).
    • Note that it lists differences by primary-key values.
  • The Validation job can also be run in an AutoCorrect mode. This mode can
    • Add any missing records from origin to target
    • Update any mismatched records between origin and target (makes target same as origin).
  • Enable or disable this feature using one or both of the settings below in the config file. They can also be passed at submit time via --conf, as shown after the settings.
spark.cdm.autocorrect.missing                     false|true
spark.cdm.autocorrect.mismatch                    false|true
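For example, a sketch of enabling both AutoCorrect options at submit time via --conf instead of editing the properties file (the rest of the command mirrors the validation example above):

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.autocorrect.missing=true \
--conf spark.cdm.autocorrect.mismatch=true \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt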

Note:

  • The validation job never deletes records from target, i.e. it only adds or updates data on target
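A simple way to pull the reported differences out of a run's log file (the file name pattern matches the commands above):

grep 'ERROR DiffJobSession' logfile_name_*.txt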

Migrating or Validating specific partition ranges

  • You can also use the tool to migrate or validate specific partition ranges by placing a partition file named ./<keyspacename>.<tablename>_partitions.csv, in the below format, in the current folder as input
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540

Each line above represents a partition-range (min,max). Alternatively, you can pass the partition file via a command-line parameter as shown below

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

This mode is especially useful for processing a subset of partition-ranges that failed during a previous run.

A file named ./<keyspacename>.<tablename>_partitions.csv, containing any failed partition ranges in the above format, is auto-generated by the Migration & Validation jobs. No file is created if there are no failed partitions. This file can be used as input to process the failed partitions in a subsequent run. You can also specify a different output file using the spark.cdm.tokenrange.partitionFile.output option.

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
 --conf spark.cdm.tokenrange.partitionFile.output="/<path-to-file>/<csv-output-filename>" \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

For the Data-Validation step, use the conf option --conf spark.cdm.tokenrange.partitionFile.appendOnDiff as shown below. This causes the partition range to be written to the output file whenever there are differences, not just failures.

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
 --conf spark.cdm.tokenrange.partitionFile.output="/<path-to-file>/<csv-output-filename>" \
 --conf spark.cdm.tokenrange.partitionFile.appendOnDiff=true \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

If spark.cdm.tokenrange.partitionFile.input or spark.cdm.tokenrange.partitionFile.output are not specified, the system will use ./<keyspacename>.<tablename>_partitions.csv as the default file.

Perform large-field Guardrail violation checks

  • The tool can also identify large fields in a table that may break your cluster's guardrails (e.g. Astra DB has a 10MB limit for a single large field). Run it with --class com.datastax.cdm.job.GuardrailCheck as shown below
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Features

  • Auto-detects table schema (column names, types, keys, collections, UDTs, etc.)
  • Preserves writetimes and TTLs
  • Supports migration/validation of advanced data types (sets, lists, maps, UDTs)
  • Filters records from Origin using writetimes and/or CQL conditions and/or a list of token-ranges
  • Performs guardrail checks (identifies large fields)
  • Supports adding constants as new columns on Target
  • Supports expanding Map columns on Origin into multiple records on Target
  • Fully containerized (Docker and K8s friendly)
  • SSL support (including custom cipher algorithms)
  • Migrates from any Cassandra Origin (Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™) to any Cassandra Target (Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™)
  • Supports migration/validation from and to Azure Cosmos Cassandra
  • Validates migration accuracy and performance using a smaller randomized data-set
  • Supports adding a custom fixed writetime
  • Validation: logs partition-range-level exceptions; use the exceptions file as input for a rerun

Known Limitations

  • This tool does not migrate TTL & writetime at the field level (for optimization reasons). Instead, it finds the field with the highest TTL and the field with the highest writetime within an origin row and uses those values for the entire target row (see the illustrative query below).
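As an illustration only (hypothetical keyspace, table, and column names), the per-column values that CDM takes the maximum of can be inspected on the origin with a query like this:

cqlsh -e "SELECT pk, writetime(col1), ttl(col1), writetime(col2), ttl(col2) FROM my_ks.my_table WHERE pk = 'k1';"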

Building Jar for local development

  1. Clone this repo
  2. Move to the repo folder cd cassandra-data-migrator
  3. Run the build: mvn clean package (needs Maven 3.9.x)
  4. The fat jar (cassandra-data-migrator-4.x.x.jar) should now be present in the target folder. The full sequence is sketched below.
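The steps above, sketched end-to-end (repository URL assumed from the project coordinates):

git clone https://github.com/datastax/cassandra-data-migrator.git
cd cassandra-data-migrator
mvn clean package
ls target/cassandra-data-migrator-*.jar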

Contributors

Check out all our wonderful contributors here.



cassandra-data-migrator's Issues

Counts reported during retry are incorrect & error-counts are never reported

There are two issues with counts when there is an exception while processing a partition-range:

  1. When the partition-range is retried, the reads/writes keep getting added to the counts already recorded for the same range, instead of the counts being reset for that partition-range before restarting. This leads to incorrect read/write stats whether the partition-range ultimately succeeds or fails.
  2. The error count is never reported when the error for a partition-range is not resolved after retries.

Refactor CDM field-size guardrail check reporting

The CDM tool currently has a utility to report on large single fields that may break guardrail limits (e.g. Astra's 10MB limit on the size of values in a single column). This functionality is currently broken and difficult to maintain.

Make this utility work again and refactor the code to be much leaner and easier to maintain.

Astra to DSE migration not working

When the origin is an Astra cluster and the target is non-Astra, the migration fails with authentication exceptions stating that the username/password is incorrect. However, the logs indicate that the credentials (username & password) of the non-Astra cluster are being used to authenticate against the Astra DNS.

enable a Map to be exploded

When migrating a Map, allow the map to be expanded into two adjacent columns (map key/value).

  1. Indicate the source column to be expanded via index reference in the .properties file
  2. On spark.query.target, in place of the position where the map column is, specify two columns corresponding to the map key and map value, respectively. This may result in spark.query.target having one more column than spark.query.origin.

Migrate / Diff / Patch should work normally, though the compare row counts will differ.

Handle blank `timestamp` values in primary-key columns gracefully

Issue: CDM is unable to handle (it throws the exception below) blank timestamp values in primary-key columns. Note: ideally a primary-key timestamp column should never have blanks; however, Cassandra/DSE/Astra seems to allow such values for timestamp (this appears to be a bug), while it does not allow the same for date or time columns. This causes the Migration/Validation to fail midway for the partition-range (the issue is not handled gracefully).

Exception:

java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.servererrors.InvalidQueryException: Invalid null value in condition for column transactiondate
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at datastax.astra.migrate.CopyJobSession.iterateAndClearWriteResults(CopyJobSession.java:191)
	at datastax.astra.migrate.CopyJobSession.getDataAndInsert(CopyJobSession.java:148)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3(Migrate.scala:30)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3$adapted(Migrate.scala:28)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2(Migrate.scala:28)
	at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2$adapted(Migrate.scala:27)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)

Acceptable fix: identify, log, and skip (gracefully handle) blank timestamp records; this will be the default behavior. Optionally, also allow a config to replace the blank timestamp with a fixed timestamp value, i.e. instead of skipping blank values, insert a configured dummy value in the field (in addition to logging it).

Improve readme docs to explain step-by-step container image usage in doing migrations


Automatically set .batchSize based on the primary key

Automatically set .batchSize to 1 if a table has a primary key that is also the partition key. Example tables as follows:

Example 1:

CREATE TABLE IF NOT EXISTS ks1.tbl1 (
  pk1 int,
  pk2 bigint,
  c1 text,
  c2 uuid,
  PRIMARY KEY ((pk1,pk2))
);

Example 2:

CREATE TABLE IF NOT EXISTS ks1.tbl1 (
  c1 text PRIMARY KEY,
  c2 uuid
);

Otherwise, use the default .batchSize.

Not escaping quotes on insert to target for columns that require it

CDM is not escaping column names that require quoting, such as "order", in the insert statement on the target. This produces a syntax error on insert:

com.datastax.oss.driver.api.core.servererrors.SyntaxError: line 1:121 no viable alternative at input 'order' (...document_type,order_header_key,create_ts,journals,last_upd_ts,[order]...)
        at com.datastax.oss.driver.api.core.servererrors.SyntaxError.copy(SyntaxError.java:48)
        at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
        at com.datastax.oss.driver.internal.core.cql.CqlPrepareSyncProcessor.process(CqlPrepareSyncProcessor.java:59)
        at com.datastax.oss.driver.internal.core.cql.CqlPrepareSyncProcessor.process(CqlPrepareSyncProcessor.java:31)
        at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
        at com.datastax.oss.driver.api.core.cql.SyncCqlSession.prepare(SyncCqlSession.java:224)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:43)
        at com.sun.proxy.$Proxy62.prepare(Unknown Source)

Source Schema:

CREATE TABLE ks_carrybooking_qp.order (
    extn_host_order_ref text,
    document_type text,
    order_header_key text,
    create_ts timestamp,
    journals text,
    last_upd_ts timestamp,
    "order" blob,
    PRIMARY KEY (extn_host_order_ref, document_type, order_header_key)
)

Destination Schema:

CREATE TABLE ks_carrybooking_qp.order1 (
    extn_host_order_ref text,
    document_type text,
    order_header_key text,
    create_ts timestamp,
    journals text,
    last_upd_ts timestamp,
    "order" blob,
    PRIMARY KEY (extn_host_order_ref, document_type, order_header_key)
)

Document how to override default Java Driver configuration

This is for the 3.3.0_stable branch, but it would be helpful to clarify the same in the main branch too.

Today, there is no manual/documentation explaining how to pass in Java Driver configuration to override parameters such as request timeouts, heartbeat timeouts, etc.

Original Error

Stacktrace 1:

23/05/03 11:50:06 ERROR DiffJobSession: Could not perform diff for Key: 10061674 %% 1 %% ABCPTSQDBRRLCKJ07
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.connection.HeartbeatException
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at datastax.astra.migrate.DiffJobSession.diffAndClear(DiffJobSession.java:107)
        at datastax.astra.migrate.DiffJobSession.lambda$getDataAndDiff$0(DiffJobSession.java:85)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at com.datastax.oss.driver.internal.core.cql.PagingIterableSpliterator.forEachRemaining(PagingIterableSpliterator.java:118)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
        at datastax.astra.migrate.DiffJobSession.getDataAndDiff(DiffJobSession.java:69)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3(DiffData.scala:28)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3$adapted(DiffData.scala:26)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2(DiffData.scala:26)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2$adapted(DiffData.scala:25)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1(DiffData.scala:25)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1$adapted(DiffData.scala:24)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.oss.driver.api.core.connection.HeartbeatException
        at com.datastax.oss.driver.internal.core.channel.HeartbeatHandler$HeartbeatRequest.fail(HeartbeatHandler.java:109)
        at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.fail(ChannelHandlerRequest.java:62)
        at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:108)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 more
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Heartbeat request: timed out after 5000 ms
        ... 10 more

Stacktrace 2:

23/05/02 13:16:35 ERROR CopyJobSession: Error occurred during Attempt#: 1
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT10S
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at datastax.astra.migrate.CopyJobSession.iterateAndClearWriteResults(CopyJobSession.java:197)
        at datastax.astra.migrate.CopyJobSession.getDataAndInsert(CopyJobSession.java:109)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3(Migrate.scala:30)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$3$adapted(Migrate.scala:28)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2(Migrate.scala:28)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$2$adapted(Migrate.scala:27)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$1(Migrate.scala:27)
        at datastax.astra.migrate.Migrate$.$anonfun$migrateTable$1$adapted(Migrate.scala:26)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT10S
        at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
        at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
        at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
        at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
        at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 more
23/05/02 13:16:35 ERROR CopyJobSession: Error with PartitionRange -- ThreadID: 123 Processing min: 3335171328526692640 max: 3337016002934063595 -- Attempt# 1
23/05/02 13:16:35 ERROR CopyJobSession: Error stats Read#: 47, Wrote#: 0, Skipped#: 0, Error#: 47

What options were tried

Option 1

Added spark.cassandra.connection.timeoutMS 90000 in cdm.properties. This did not increase the timeout value.

Option 2

Added --conf datastax-java-driver.advanced.heartbeat.timeout="90 seconds" --conf datastax-java-driver.basic.request.timeout="60 seconds" to the ./spark-submit command; this had no effect on the timeout.

Option 3

Attempted the following, also without success:

./spark-submit \
--files /path/to/application.conf \
--conf spark.cassandra.connection.config.profile.path=application.conf \
...

In this case, it only considers properties from application.conf and ignores everything else in cdm.properties, as we could confirm from the warning below:

23/05/02 16:20:56 WARN CassandraConnectionFactory: Ignoring all programmatic configuration, only using configuration from application.conf

Option 4

Attempted to add --driver-java-options "-Ddriver.basic.request.timeout='60 seconds'" to the ./spark-submit command; this also had no effect. (A sketch of an application.conf combining the driver settings from Option 2 follows.)
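For illustration, a minimal sketch (not a verified working configuration) of an application.conf combining the two driver settings attempted in Option 2, written here with a shell heredoc; whether CDM honors this file alongside cdm.properties is exactly the open question of this issue:

cat > /path/to/application.conf <<'EOF'
datastax-java-driver {
  basic.request.timeout = 60 seconds
  advanced.heartbeat.timeout = 90 seconds
}
EOF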

Introduce support for new vector CQL data type

With the recent introduction of the vector CQL data type, we need CDM to be able to handle it seamlessly.

Notes:

The CQL vector data type maps to the CqlVector Java class; another item to consider is mapping CQL vectors to Java arrays via custom codecs (i.e. the vector<float> CQL type to a float[] Java array).

If CDM is pulling in the Java Driver dependency via the Spark Cassandra Connector, then https://datastax-oss.atlassian.net/browse/SPARKC-706 may be a precursor to building this capability.

NPE observed in logs when running DiffData job across multiple nodes

A null pointer exception is noticed in the logs while running the DiffData Spark job on multiple containers/nodes. When attempting to print final counts, the source and destination sessions are passed as null. Although the sessions are not required to print counts, the nulls must be handled gracefully so the job does not error out.

Config file should represent base use-case

The config file should represent the base use-case. All other complex and special-case scenarios should be present in the config but disabled, with appropriate comments on when to enable them.

Error accessing spark config while running spark job across multiple nodes

A sample Spark job like the one below throws an error when trying to fetch the Spark config in the Migrate or Diff jobs. The job is run in cluster mode, where it is split across multiple containers/nodes:
spark-submit --properties-file test.properties --num-executors=3 --master yarn --deploy-mode cluster --class datastax.astra.migrate.Migrate cassandra-data-migrator.jar &> log.txt

Can’t migrate time type

Version

Cassandra 3.11.10
Spark 2.4.8
cassandra-data-migrator 2.10.3

Origin table

CREATE TYPE ks.myty (
  a int,
  b text
);

CREATE TABLE ks.tb (
   id int PRIMARY KEY, 
   tt text,
   bg bigint,
   db double,
   tm time,
   mp map<int, text>,
   lt list<text>,
   bb blob,
   st set<int>,
   uu uuid,
   bn boolean,
   tp tuple<int, int>,
   ft float,
   tn tinyint,
   dc decimal,
   dt date,
   ut myty,
   vi varint
);

INSERT INTO ks.tb (id, tt, bg, db, tm, mp, lt, bb, st, uu, bn, tp, ft, tn, dc, dt, ut, vi) VALUES (
  1, 
  'text', 
  1000000000, 
  1.2, 
  '08:12:54', 
  {1:'a'},
  ['a', 'b'],
  bigintAsBlob(123),
  {1,2,3},
  now(),
  true,
  (1,1),
  1.1,
  1,
  12,
  '2011-02-03',
  {a: 1, b: 'b'},
  0
);

Target table

The target table is the same as the origin table.

sparkConf.properties

spark.query.origin                                id,tt,bg,db,tm,mp,lt,bb,st,uu,bn,tp,ft,tn,dc,dt,ut,vi
spark.query.origin.partitionKey                   id
spark.query.target.id                             id
spark.query.types                                 1,0,2,3,4,5%1%0,6%0,7,8%1,9,10,11,12,13,14,15,16,17

Error

23/02/06 11:37:27 ERROR CopyJobSession: Error occurred retry#: 1
com.datastax.oss.driver.api.core.type.codec.CodecNotFoundException: Codec not found for requested operation: [TIME <-> java.time.LocalDate]
        at com.datastax.oss.driver.internal.core.type.codec.registry.CachingCodecRegistry.createCodec(CachingCodecRegistry.java:609)
        at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry$1.load(DefaultCodecRegistry.java:95)
        at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry$1.load(DefaultCodecRegistry.java:92)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2276)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache.get(LocalCache.java:3951)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache.getOrLoad(LocalCache.java:3973)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4957)
        at com.datastax.oss.driver.shaded.guava.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4963)
        at com.datastax.oss.driver.internal.core.type.codec.registry.DefaultCodecRegistry.getCachedCodec(DefaultCodecRegistry.java:117)
        at com.datastax.oss.driver.internal.core.type.codec.registry.CachingCodecRegistry.codecFor(CachingCodecRegistry.java:215)
        at com.datastax.oss.driver.api.core.data.GettableByIndex.get(GettableByIndex.java:126)
        at datastax.astra.migrate.BaseJobSession.getData(BaseJobSession.java:95)
        at datastax.astra.migrate.AbstractJobSession.bindInsert(AbstractJobSession.java:194)
        at datastax.astra.migrate.CopyJobSession.getDataAndInsert(CopyJobSession.java:121)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1$$anonfun$apply$2.apply(Migrate.scala:29)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1$$anonfun$apply$2.apply(Migrate.scala:27)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:105)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:122)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:104)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1.apply(Migrate.scala:27)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1$$anonfun$apply$1.apply(Migrate.scala:26)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:105)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:122)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:104)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1.apply(Migrate.scala:26)
        at datastax.astra.migrate.Migrate$$anonfun$migrateTable$1.apply(Migrate.scala:25)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Other info

After I add a new type like this:

diff --git a/src/main/java/datastax/astra/migrate/MigrateDataType.java b/src/main/java/datastax/astra/migrate/MigrateDataType.java
index cd324e0..f176aba 100644
--- a/src/main/java/datastax/astra/migrate/MigrateDataType.java
+++ b/src/main/java/datastax/astra/migrate/MigrateDataType.java
@@ -8,6 +8,7 @@ import java.math.BigInteger;
 import java.nio.ByteBuffer;
 import java.time.Instant;
 import java.time.LocalDate;
+import java.time.LocalTime;
 import java.util.*;

 public class MigrateDataType {
@@ -81,6 +82,8 @@ public class MigrateDataType {
                 return UdtValue.class;
             case 17:
                 return BigInteger.class;
+            case 18:
+                return LocalTime.class;
         }

         return Object.class;

And change sparkConf.properties like this:

spark.query.origin                                id,tt,bg,db,tm,mp,lt,bb,st,uu,bn,tp,ft,tn,dc,dt,ut,vi
spark.query.origin.partitionKey                   id
spark.query.target.id                             id
spark.query.types                                 1,0,2,3,18,5%1%0,6%0,7,8%1,9,10,11,12,13,14,15,16,17

It migrates successfully.

CI/CD workflow

We need to have a good CI/CD workflow to test for any regressions with unit tests.

Rename splitSize to numSplits

Rename splitSize to numSplits, as this parameter is used to make that many splits of the Cassandra token range. The word "size" is confusing because it suggests that a small value will throttle the load down, while it actually has the opposite effect.
Also, keep backward compatibility with splitSize (for some time), but remove it from the docs.

CDM 4.1.3 fails with NPE when a target counter table column has null values

Create an Origin counter table

CREATE TABLE IF NOT EXISTS test_ks.comments_count (
	comment_id text PRIMARY KEY,
	like_count counter, 
	reply_count counter,
	report_count counter,
	seq_counter counter);

Create a Target counter table

CREATE TABLE IF NOT EXISTS test_ks.comments_count_dest (
	comment_id text PRIMARY KEY,
	like_count counter, 
	reply_count counter,
	report_count counter,
	seq_counter counter);

Add a record to the Origin table with no nulls, and add the same record on the Target with a null in one column:

token@cqlsh:test_ks> update comments_count set like_count = like_count +1, reply_count = reply_count +1, report_count = report_count + 1, seq_counter = seq_counter +1 where comment_id = 't1';
token@cqlsh:test_ks> update comments_count_dest set like_count = like_count +1, reply_count = reply_count +1, seq_counter = seq_counter +1 where comment_id = 't1';
token@cqlsh:test_ks> select * FROM comments_count;

 comment_id | like_count | reply_count | report_count | seq_counter
------------+------------+-------------+--------------+-------------
         t1 |          1 |           1 |            1 |           1

token@cqlsh:test_ks> select * FROM comments_count_dest ;

 comment_id | like_count | reply_count | report_count | seq_counter
------------+------------+-------------+--------------+-------------
         t1 |          1 |           1 |         null |           1

Now run CDM using version 4.1.3 (or an earlier 4.x version); you will get the below exception:


23/08/16 11:39:49 INFO TaskSetManager: Finished task 49.0 in stage 0.0 (TID 49) in 29 ms on pbhat-rmbp16.lan (executor driver) (49/50)
23/08/16 11:39:49 ERROR TargetUpdateStatement: Error trying to bind value:0 to column:report_count of targetDataType:COUNTER/java.lang.Long at column index:3
23/08/16 11:39:49 ERROR CopyJobSession: Error occurred during Attempt#: 1
java.lang.NullPointerException
	at com.datastax.cdm.cql.statement.TargetUpdateStatement.bind(TargetUpdateStatement.java:51)
	at com.datastax.cdm.cql.statement.TargetUpsertStatement.bindRecord(TargetUpsertStatement.java:74)
	at com.datastax.cdm.job.CopyJobSession.bind(CopyJobSession.java:175)
	at com.datastax.cdm.job.CopyJobSession.getDataAndInsert(CopyJobSession.java:98)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:52)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:22)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3(Migrate.scala:13)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3$adapted(Migrate.scala:11)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$2(Migrate.scala:11)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$2$adapted(Migrate.scala:10)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$1(Migrate.scala:10)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$1$adapted(Migrate.scala:9)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
23/08/16 11:39:49 ERROR CopyJobSession: Error with PartitionRange -- ThreadID: 94 Processing min: 4058283696216101360 max: 4242751136953196876 -- Attempt# 1
23/08/16 11:39:49 ERROR CopyJobSession: Error stats Read#: 1, Wrote#: 0, Skipped#: 0, Error#: 1

When the DiffData/Validation job has exceptions, it's difficult to rerun by partition-range (unlike Migrate)

See details provided by the customer:
We have done some initial testing of the cassandra-data-migrator-3.4.2.jar and so far the incremental timestamp fix is working as expected. We do however have one question about a particular scenario we observe when running the "DiffData"/Validator job.
Occasionally, when the job is initially starting, we observe the following errors:

23/05/08 13:11:05 ERROR DiffJobSession: Could not perform diff for Key: 17241788
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at datastax.astra.migrate.DiffJobSession.diffAndClear(DiffJobSession.java:108)
        at datastax.astra.migrate.DiffJobSession.lambda$getDataAndDiff$0(DiffJobSession.java:86)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at com.datastax.oss.driver.internal.core.cql.PagingIterableSpliterator.forEachRemaining(PagingIterableSpliterator.java:118)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at datastax.astra.migrate.DiffJobSession.getDataAndDiff(DiffJobSession.java:70)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3(DiffData.scala:28)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$3$adapted(DiffData.scala:26)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2(DiffData.scala:26)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$2$adapted(DiffData.scala:25)
        at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1(DiffData.scala:25)
        at datastax.astra.migrate.DiffData$.$anonfun$diffTable$1$adapted(DiffData.scala:24)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)
        at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2268)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query

The errors eventually stop and the job is able to carry on normally and finish.
We assume this is because the read/write throughput is too high when the job starts and the requests are not being throttled correctly. When this happens during the "Migrate" job, we are able to extract the partition key ranges and re-run them manually at a lower throughput.
However, since this job only prints out the primary key values, is there any suggested mitigation/retry strategy other than re-running the entire table at a lower throughput? For context, the throughput we select works with good read/write latencies on the server side for the entirety of the job; it just has errors at the beginning.

Migrate with UDT fails if the keyspace names are different

Migrate with UDTs fails if the keyspace names are different. This happens because a UDT uses the keyspace name as part of its full name, and the dynamic Java objects created internally to represent the UDT end up with different types that cannot be auto-cast.

Ability to stop/kill & resume the job

The tool should have the ability to resume a job from the point of a previous exit. Customers may want to kill a running job (to change the throttling speed, etc.), or a long-running job may get killed for various reasons (VM restart, etc.). The tool should record the partitions that have been processed and use that as input during a future run to exclude already-migrated partitions.

Duplicate values observed in List cells when running Migrate or Validation multiple times

Duplicate values observed in List cells when running Migrate or Validation multiple times.

Note: This issue is a bug in Cassandra (List inserts are not truly idempotent) & not really a CDM issue.

Details of the issue: As a side effect of the above C* bug, when CDM Migration or Validation (with auto-correct) is executed multiple times on the same table (this is not common, but can happen for various reasons), any existing list fields on the target are overwritten again with the same timestamp (list values and timestamp taken from the source); instead of replacing the full list, this duplicates all the entries in the list. If you run the tool N times, the values are duplicated N times.

Priority/Customer impact: We have a customer in the midst of migrating several DSE 5.1 clusters to DSE 6.8. They are using ZDM + CDM for this and are facing this issue. They understand that this is not a CDM issue; however, they have asked us to provide a possible workaround/fix.

Workaround/Fix: The issue (duplicate cell values in a list) happens only when the timestamp is exactly the same. If we increment the long epoch timestamp value by 1 (i.e. 1 microsecond, 1/1,000,000 second) and use this as the timestamp, the issue goes away. The timestamp value is effectively unchanged for humans and does not create any (known) issues with migration (ZDM and/or CDM). Hence we will conditionally allow users to add an arbitrary long value (typically 1) to the timestamp, which allows them to work around this issue.

CDM 4.1.2 fails with NPE when a counter column has null values

Consider a counter table as follows,

CREATE TABLE IF NOT EXISTS test_ks.comments_count (
	comment_id text PRIMARY KEY,
	like_count counter,                        << all counters
	reply_count counter,
	report_count counter,
	seq_counter counter);

and if it has data with null values for certain columns,

token@cqlsh:test_ks> SELECT * FROM comments_count;
 comment_id | like_count | reply_count | report_count | seq_counter
------------+------------+-------------+--------------+-------------
         t3 |          3 |          10 |         null |           5

(1 rows)

When the migrate job is run, we get an error as follows:

23/08/14 09:25:34 ERROR TargetUpdateStatement: Error trying to bind value:10 to column:report_count of targetDataType:COUNTER/java.lang.Long at column index:3
23/08/14 09:25:34 ERROR CopyJobSession: Error occurred during Attempt#: 1
java.lang.NullPointerException
	at com.datastax.cdm.cql.statement.TargetUpdateStatement.bind(TargetUpdateStatement.java:51)
	at com.datastax.cdm.cql.statement.TargetUpsertStatement.bindRecord(TargetUpsertStatement.java:74)
	at com.datastax.cdm.job.CopyJobSession.bind(CopyJobSession.java:175)
	at com.datastax.cdm.job.CopyJobSession.getDataAndInsert(CopyJobSession.java:98)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:52)
	at com.datastax.cdm.job.CopyJobSession.processSlice(CopyJobSession.java:22)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3(Migrate.scala:13)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$3$adapted(Migrate.scala:11)
	at com.datastax.spark.connector.cql.CassandraConnector.$anonfun$withSessionDo$1(CassandraConnector.scala:104)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:121)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
	at com.datastax.cdm.job.Migrate$.$anonfun$execute$2(Migrate.scala:11)

Update docs for new users

CDM works really well if you know all the corners of the codebase and configuration. It would be great to surface this knowledge in the documentation, both in the GitHub repo and in the zero-downtime migration documentation.

Maven has removed version 3.9.3 from its CDN location, causing Docker publishes to fail

The Dockerfile today relies on a hard-coded version of Maven (i.e. 3.9.3). Since Maven has removed this version from its CDN location (and replaced it with 3.9.4), we see the below error when we run the Build and Publish to Docker workflow. A sketch of the updated download step follows the error.

Error Stack Received

curl -fsSL -o /tmp/apache-maven.tar.gz https://dlcdn.apache.org/maven/maven-3/3.9.3/binaries/apache-maven-3.9.3-bin.tar.gz

0.725 curl: (22) The requested URL returned error: 404
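A minimal sketch of the fix, assuming the Dockerfile's download step looks like the curl command above: bump the hard-coded version (ideally via a single variable) to one that still exists on the CDN at build time, e.g. 3.9.4 at the time of this issue:

MAVEN_VERSION=3.9.4
curl -fsSL -o /tmp/apache-maven.tar.gz "https://dlcdn.apache.org/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz"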

progress counts are inaccurate

In the following example, reported in the logs by a single thread:

23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Read Record Count: 52600000
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Mismatch Record Count: 4
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Corrected Mismatch Record Count: 4
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Missing Record Count: 0
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Corrected Missing Record Count: 0
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Valid Record Count: 52587843
23/08/30 13:48:31 INFO DiffJobSession: ThreadID: 109 Skipped Record Count: 0

The Valid + Corrected counts do not match the Read count.

The issue is that the threads report global totals, and these totals are not updated as an atomic group.
