cassandra-spark-analytics

Supercharge your analysis of Cassandra data with Apache Spark

Introduction

Apache Cassandra is a fantastic scalable, fault-tolerant NoSQL database; however, it is notoriously hard to query your data outside of what your data model allows. Apache Spark is a fast and general engine for large-scale data processing.

In this guide I intend to demonstrate how Spark can be used to analyze your Cassandra data in ways your data model alone would not allow. It is aimed at anyone who is interested in data analysis.

Requirements

This guide has been tested with the software versions outlined below. Older versions may work; test at your own risk.

This guide was developed using CentOS 7.

Software                     Version
Cassandra                    2.2.5
Cqlsh                        4.1.1
Cassandra Spark Connector    1.5
Spark                        1.6.1
Docker                       1.11.1
Docker-compose               1.7.1
Python                       2.7.5

Setup

Throughout this guide it is assumed that your working directory is the base of this repo; adjust paths accordingly (YMMV).

Docker:

Firstly, we need Docker and docker-compose installed and running. Docker's own documentation has a good getting-started guide.

Check that Docker is functioning correctly by running docker ps.

Cassandra:

Start Cassandra container

We will start Cassandra with the included compose/cassandra.yml instruction file. This file can be edited if you know what you are doing, but the defaults are sane for this demo.

The Cassandra Thrift port (9160) will be bound to 127.0.0.1:9160, and we will use it to connect to Cassandra via cqlsh from the host.

docker-compose -f compose/cassandra.yml up -d

Check the container has started with docker ps.

Install Cqlsh

Install cqlsh with the command pip install --user cqlsh==4.1.1.

Test that you can access Cassandra correctly: cqlsh 127.0.0.1

Create schema

Create the Cassandra schema with:

cqlsh 127.0.0.1 -f schema/spark_demo.cql

Import the test dataset (adjust the relative path as needed):

echo "use spark_demo; COPY person_data (id, first_name, last_name, email, gender, ip_address) FROM 'schema/spark_demo_data.csv' WITH HEADER=true;" | cqlsh 127.0.0.1

If it is successful you should get some output like: 1000 rows imported in 0.477 seconds.

I generated the demo data using Mockaroo.

echo "SELECT * FROM spark_demo.person_data;" | cqlsh 127.0.0.1

 id  | email                  | first_name | gender | ip_address    | last_name
-----+------------------------+------------+--------+---------------+-----------
 769 | [email protected] |     Ernest |   Male | 165.66.44.126 |   Sanchez

...
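If you would rather verify the import from Python than from cqlsh, a short script using the DataStax cassandra-driver works too. Note this is an optional extra dependency (pip install cassandra-driver) that is not part of this repo, and it assumes Cassandra's native protocol port (9042) is reachable from the host; if the compose file only publishes the Thrift port, point the driver at the container's IP address instead.

from cassandra.cluster import Cluster

# Assumes the native protocol port (9042) is reachable at this address;
# substitute the container's IP address if it is not published to localhost.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("spark_demo")

# The demo data set should report 1000 rows.
rows = list(session.execute("SELECT COUNT(*) AS n FROM person_data"))
print("rows imported: %d" % rows[0].n)

cluster.shutdown()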

Spark

We will start Spark with the included compose/spark.yml instruction file. You will need to edit this file to set the relative path to cassandra-spark-analytics/scripts; this directory will be mounted into the running Spark container so that Spark has access to the processing scripts.

The Spark image build instructions are located in docker/spark/Dockerfile if you want to roll your own image.

Start Spark container

docker-compose -f compose/spark.yml up -d

Check the container has started with docker ps.

Execute a Spark job

For this demo we are going to group and count all the first names in our person_data table. The job script is cassandra-spark-analytics/scripts/count.py. You will need to replace CASSANDRA_DOCKER_IP with the IP address of the running Cassandra container:

conf.set("spark.cassandra.connection.host", "CASSANDRA_DOCKER_IP")

To get the Cassandra container's IP address run:

docker inspect --format '{{ .NetworkSettings.IPAddress }}' cassandra
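For orientation, a minimal job of roughly this shape might look like the sketch below. This is only a sketch, not necessarily the exact contents of count.py, and it assumes the spark-cassandra-connector is already available to Spark (as it is in the provided Docker image).

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Point the connector at the running Cassandra container.
conf = SparkConf().setAppName("first-name-count")
conf.set("spark.cassandra.connection.host", "CASSANDRA_DOCKER_IP")

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Load spark_demo.person_data as a DataFrame through the Cassandra connector.
people = (sqlContext.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="spark_demo", table="person_data")
          .load())

# Group by first_name, count, and show the most common names first.
people.groupBy("first_name").count().orderBy("count", ascending=False).show(25)

sc.stop()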

Once the IP address is updated you can execute the Spark job:

docker exec -ti spark /spark/bin/spark-submit /jobs/count.py

When the script finishes executing you will be left with output like the below:

+----------+-----+
|first_name|count|
+----------+-----+
|    Brenda|   12|
|    Sharon|   12|
|   Deborah|   12|
|     Kelly|   11|
|     Diana|   10|
|    Carlos|   10|
|     Irene|    9|
|   Heather|    9|
|    Louise|    9|
|    Andrea|    9|
|   Jeffrey|    9|
|   Michael|    9|
|    Victor|    9|
|     Terry|    8|
|     Steve|    8|
|    Ernest|    8|
|     Larry|    8|
|      Lois|    8|
|      John|    8|
|   Richard|    8|
|     Louis|    8|
|     Ralph|    8|
|    Joseph|    8|
|      Mary|    8|
|     Billy|    8|
+----------+-----+
only showing top 25 rows

That concludes this demo. If you come up with anything cool, submit a PR!
