This project is forked from azure-samples/azure-cosmos-db-cassandra-api-spark-connector-sample.

License: MIT


---
page_type: sample
languages:
- scala
products:
- azure
description: This maven project provides samples and best practices for using the DataStax Spark Cassandra Connector against Azure Cosmos DB's Cassandra API.
urlFragment: azure-cosmos-db-cassandra-api-spark-connector-sample
---

# Azure Cosmos DB Cassandra API - Datastax Spark Connector Sample

This Maven project provides samples and best practices for using the DataStax Spark Cassandra Connector against Azure Cosmos DB's Cassandra API. To provide an end-to-end sample, we use an Azure HDInsight Spark cluster to run the Spark jobs in the example. All samples are written in Scala and built with Maven.

Note: this sample is configured against version 2.0.6 of the Spark connector.

## Running this Sample

### Prerequisites

- A Cosmos DB account configured with the Cassandra API
- A Spark cluster

### Quick Start

Submitting Spark jobs is not covered as part of this sample; please refer to Apache Spark's documentation. To run this sample, configure it for your cluster (as discussed below), build the project to generate the required jar(s), and then submit the job to your Spark cluster.
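As a rough sketch, the build-and-submit flow might look like the following. The main class and jar names below are illustrative placeholders, not the names this project actually produces; substitute the artifact generated by your own build.

```shell
# Build the project with Maven (run from the repository root).
mvn clean package

# Submit the resulting jar to the Spark cluster. The --class value and jar
# path are placeholders; use the ones from your build.
spark-submit \
  --class com.example.SampleJob \
  --master yarn \
  target/sample-job.jar
```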

### Cassandra API Connection Parameters

In order for your Spark jobs to connect to Cosmos DB's Cassandra API, you must set the following configurations:

Note: all of these values can be found on the "Connection String" blade of your Cosmos DB account.

| Property Name | Value |
| --- | --- |
| `spark.cassandra.connection.host` | Your Cassandra endpoint: `ACCOUNT_NAME.cassandra.cosmosdb.azure.com` |
| `spark.cassandra.connection.port` | `10350` |
| `spark.cassandra.connection.ssl.enabled` | `true` |
| `spark.cassandra.auth.username` | `COSMOSDB_ACCOUNTNAME` |
| `spark.cassandra.auth.password` | `COSMOSDB_KEY` |
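Taken together, these parameters can be collected into a plain key/value map and applied to the Spark configuration before the job creates its Cassandra connection. A minimal sketch; `ACCOUNT_NAME`, `COSMOSDB_ACCOUNTNAME`, and `COSMOSDB_KEY` are placeholders for the values from your own account's "Connection String" blade:

```scala
// Connection settings from the table above. Replace the placeholder
// account name and key with your Cosmos DB account's actual values.
val cosmosCassandraConf: Map[String, String] = Map(
  "spark.cassandra.connection.host"        -> "ACCOUNT_NAME.cassandra.cosmosdb.azure.com",
  "spark.cassandra.connection.port"        -> "10350",
  "spark.cassandra.connection.ssl.enabled" -> "true",
  "spark.cassandra.auth.username"          -> "COSMOSDB_ACCOUNTNAME",
  "spark.cassandra.auth.password"          -> "COSMOSDB_KEY"
)
```

The map can then be applied in one call with `new SparkConf().setAll(cosmosCassandraConf)`, or passed as individual `--conf` flags to `spark-submit`.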

### Configurations for Throughput Optimization

Because Cosmos DB follows a provisioned throughput model, it is important to tune the relevant connector configurations to optimize for this model. General information regarding these configurations can be found on the Configuration Reference page of the DataStax Spark Cassandra Connector GitHub repository.

| Property Name | Description |
| --- | --- |
| `spark.cassandra.output.batch.size.rows` | `1` — leave this at 1. This is preferred for Cosmos DB's provisioning model in order to achieve higher throughput for heavy workloads. |
| `spark.cassandra.connection.connections_per_executor_max` | `10*n` — equivalent to 10 connections per node in an n-node Cassandra cluster. Hence, if you require 5 connections per node per executor for a 5-node Cassandra cluster, set this configuration to 25. (Modify based on the degree of parallelism/number of executors that your Spark jobs are configured for.) |
| `spark.cassandra.output.concurrent.writes` | `100` — defines the number of parallel writes that can occur per executor. As `batch.size.rows` is 1, make sure to scale up this value accordingly. (Modify based on the degree of parallelism/throughput that you want to achieve for your workload.) |
| `spark.cassandra.concurrent.reads` | `512` — defines the number of parallel reads that can occur per executor. (Modify based on the degree of parallelism/throughput that you want to achieve for your workload.) |
| `spark.cassandra.output.throughput_mb_per_sec` | Defines the total write throughput per executor. This can be used as an upper cap on your Spark job's throughput; base it on the provisioned throughput of your Cosmos DB collection. |
| `spark.cassandra.input.reads_per_sec` | Defines the total read throughput per executor. This can be used as an upper cap on your Spark job's throughput; base it on the provisioned throughput of your Cosmos DB collection. |
| `spark.cassandra.output.batch.grouping.buffer.size` | `1000` |
| `spark.cassandra.connection.keep_alive_ms` | `60000` |
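The `connections_per_executor_max` guidance is simple arithmetic, sketched here as a helper to make the rule concrete (the function name is hypothetical, not part of the connector or this sample):

```scala
// Rule of thumb from the table: the per-executor connection cap is
// (connections per node) * (number of nodes). For example, 5 connections
// per node on a 5-node cluster gives a cap of 25.
def connectionsPerExecutorMax(connectionsPerNode: Int, nodes: Int): Int =
  connectionsPerNode * nodes
```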

Regarding throughput and degree of parallelism, it is important to tune these parameters based on the amount of load you expect from your upstream/downstream flows, the executors provisioned for your Spark jobs, and the throughput you have provisioned for your Cosmos DB account.

### Connection Factory Configuration and Retry Policy

As part of this sample, we have provided a connection factory and a custom retry policy for Cosmos DB. A custom connection factory is needed because that is the only way to configure a retry policy on the connector - see SPARKC-437.

- `CosmosDbConnectionFactory.scala`
- `CosmosDbMultipleRetryPolicy.scala`
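The connector picks up a custom factory through its `spark.cassandra.connection.factory` setting. A minimal sketch of wiring it in; the fully-qualified class name below is an assumption for illustration, so check the actual package declared in `CosmosDbConnectionFactory.scala`:

```scala
// The connector instantiates the class named here in place of its default
// connection factory. The package/class name is a placeholder; use the one
// declared in this sample's CosmosDbConnectionFactory.scala.
val factoryConf: Map[String, String] = Map(
  "spark.cassandra.connection.factory" ->
    "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory"
)
```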

#### Retry Policy

The retry policy for Cosmos DB is configured to handle HTTP status code 429 (Request Rate Too Large) exceptions. The Cosmos DB Cassandra API translates these exceptions into overloaded errors on the Cassandra native protocol, which we want to retry with back-offs. Because Cosmos DB follows a provisioned throughput model, this retry policy protects your Spark jobs against spikes of data ingress/egress that momentarily exceed the throughput allocated to your collection, which would otherwise surface as request rate limiting exceptions.

Note: this retry policy is meant only to protect your Spark jobs against momentary spikes. If you have not provisioned enough RUs on your collection for the intended throughput of your workload, so that the retries cannot catch up, the retry policy will end up rethrowing the exceptions.
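The back-off idea can be illustrated with a small sketch. This is not the sample's actual `CosmosDbMultipleRetryPolicy` code, just an exponential back-off schedule of the kind such a policy applies between retries of overloaded (429) requests:

```scala
// Exponential back-off schedule: each retry waits twice as long as the
// previous one, starting from baseDelayMs.
def backoffDelaysMs(baseDelayMs: Long, maxRetries: Int): Seq[Long] =
  (0 until maxRetries).map(attempt => baseDelayMs * (1L << attempt))

// backoffDelaysMs(100, 4) == Seq(100, 200, 400, 800)
```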

## Known Issues

### Tokens and Token Range Filters

Methods that use tokens for filtering data are not currently supported, so please avoid any APIs that perform table scans.

## Resources
