voltdb / app-fastdata

VoltDB Click Stream Processing Example.

Home Page: http://voltdb.github.io/app-fastdata

License: MIT License


app-fastdata's Introduction

VoltDB Fast Data Example App

Use Case

The Fast Data app simulates real-time click stream processing. Click events are ingested into VoltDB at a high rate, then cleaned up and persisted into a data warehouse for historical analysis. Segmentation information is calculated in the data warehouse and stored back into VoltDB, which uses it to segment incoming events in real time for per-event decisioning.

The click events can come from persistent queues like Apache Kafka. For simplicity, this sample app generates random events at a constant rate. Each click event contains basic information such as the source IP address, destination URL, timestamp, HTTP method, referrer URL, and user agent string. A real-world use case could include additional fields.

The stream is ingested into VoltDB, cleaned up, then exported into a data warehouse like Hadoop for long-term persistence. The data warehouse periodically runs a machine learning algorithm on the full historical dataset to segment the click events into clusters. Each cluster is represented by its center, and the cluster centers are sent back into VoltDB.

VoltDB uses the clustering information to assign each new click event to its corresponding cluster in real time. VoltDB can use this to make per-event real-time decisions, such as which advertisement to show a user, or whether a user should be blocked for spamming.

Events are aged out of VoltDB after they are persisted in the data warehouse. This limits the dataset inside VoltDB to only include relatively hot data.

The example includes a dashboard that shows the click stream cluster distribution, top users, and top visited URLs in a moving window. All of this information is calculated from materialized views on the relevant source tables.

Code organization

The code is divided into projects:

  • "db": the database project, which contains the schema, stored procedures, and other configurations that are compiled into a catalog and run in a VoltDB database.
  • "client": a Java client that generates click events.
  • "web": a simple web server that provides the demo dashboard.
  • "hadoop": a collection of Pig/shell scripts and a Spark program.

See below for instructions on running these applications. For any questions, please contact [email protected].

Prerequisites

  1. Before running these scripts you need to have VoltDB 6.1 or later installed, and the bin subdirectory should be added to your PATH environment variable. For example, if you installed VoltDB Enterprise 6.1 in your home directory, you could add it to the PATH with the following command:

    export PATH="$PATH:$HOME/voltdb-ent-6.1/bin"
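
To confirm the tools are actually reachable after updating PATH, a quick sanity check helps; this sketch assumes the Enterprise 6.1 install directory from the example above.

```shell
# Extend PATH (assuming the install directory from the example above)
export PATH="$PATH:$HOME/voltdb-ent-6.1/bin"

# Verify the VoltDB CLI is now on PATH
if command -v voltdb >/dev/null 2>&1; then
    voltdb --version
else
    echo "voltdb not found on PATH -- check the install directory"
fi
```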

Hadoop Demo

Running with Hortonworks

This demo requires a Hortonworks Data Platform (HDP) Hadoop installation. The demo consists of a VoltDB server writing export data to files stored in Hadoop, and a set of scripts and programs that collect the exported data, compute k-means clusters on it, and store the computations back to VoltDB.

  1. On all Hortonworks cluster nodes, install Spark by following these instructions

  2. On the server where VoltDB is installed, set the environment variable WEBHDFS_ENDPOINT to a WebHDFS endpoint that matches the following pattern:

    http://[host]:[port]/webhdfs/v1/[export-base-directory]/%g/%p/%t.avro?user.name=[user]
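
Filled in with hypothetical values (host hadoop01, WebHDFS port 50070, export base directory /export/fastdata, HDFS user hdfs), the export might look like this; the %g/%p/%t placeholders stay literal.

```shell
# Hypothetical host, port, base directory, and user -- substitute your own.
# %g, %p, and %t are expanded by VoltDB's export, not by the shell, so
# single quotes keep them intact.
export WEBHDFS_ENDPOINT='http://hadoop01:50070/webhdfs/v1/export/fastdata/%g/%p/%t.avro?user.name=hdfs'

# Sanity check: the placeholders must survive quoting
echo "$WEBHDFS_ENDPOINT"
```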
    
  3. Download this archive and unpack it on a Hadoop node

    $ tar -jxf fastdata-kmeans.tar.bz2
  4. Change your working directory to fastdata-kmeans and run the hdp_compute_clusters.sh script when you want to process exported data from VoltDB (see the Demo Instructions section below)

    $ cd fastdata-kmeans
    #
    # Assuming you are exporting to /export/fastdata, and VoltDB is
    # running on volthost
    $ ./hdp_compute_clusters.sh /export/fastdata volthost

    This script:

    • Harvests the exported data by renaming the export base directory
    • Invokes the hdp.harvest.pig Pig script to write the harvested data into a Parquet database
    • Starts a Spark job that computes the k-means clusters on the data stored in the Parquet database, and creates another Parquet database with the resulting computations
    • Invokes the hdp.load.pig script that reads the k-means computation Parquet database and writes its content back to VoltDB using the VoltDB Hadoop extensions
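
The harvest step boils down to a directory rename in HDFS, so new exports keep flowing into a fresh export directory while the snapshot is processed. A rough, hypothetical sketch (the timestamped naming is an assumption; the real script does more):

```shell
# Assumed export base directory, matching the example above
EXPORT_DIR=/export/fastdata
# Snapshot name with a timestamp suffix (hypothetical convention)
HARVEST_DIR="${EXPORT_DIR}-$(date +%Y%m%d%H%M%S)"

if command -v hdfs >/dev/null 2>&1; then
    # Move the export directory aside for processing
    hdfs dfs -mv "$EXPORT_DIR" "$HARVEST_DIR"
else
    echo "hdfs CLI not available; would rename $EXPORT_DIR to $HARVEST_DIR"
fi
```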

Running with Cloudera

This demo requires a Cloudera Hadoop installation whose configured services also include Spark. The demo consists of a VoltDB server writing export data to files stored in Hadoop, and a set of scripts and programs that collect the exported data, compute k-means clusters on it, and store the computations back to VoltDB.

  1. On the server where VoltDB is installed, set the environment variable WEBHDFS_ENDPOINT to a WebHDFS endpoint that matches the following pattern:

    http://[host]:[port]/webhdfs/v1/[export-base-directory]/%g/%p/%t.avro?user.name=[user]
    
  2. Download this archive and unpack it on a Hadoop node

    $ tar -jxf fastdata-kmeans.tar.bz2
  3. Change your working directory to fastdata-kmeans and run the compute_clusters.sh script when you want to process exported data from VoltDB (see the Demo Instructions section below)

    $ cd fastdata-kmeans
    #
    # Assuming you are exporting to export/fastdata, and VoltDB is
    # running on volthost
    $ ./compute_clusters.sh export/fastdata volthost

    This script:

    • Harvests the exported data by renaming the export base directory
    • Invokes the harvest.pig Pig script to write the harvested data into a Parquet database
    • Starts a Spark job that computes the k-means clusters on the data stored in the Parquet database, and creates another Parquet database with the resulting computations
    • Invokes the load.pig script that reads the k-means computation Parquet database and writes its content back to VoltDB using the VoltDB Hadoop extensions

Vertica Demo

  1. This example also requires Vertica, installed either on the same machine as VoltDB or on a machine the VoltDB machine can reach. A Vertica database must be created with the username dbadmin and no password. Once Vertica is running, set the environment variable VERTICAIP on the VoltDB machine to the Vertica machine's address. For example, if Vertica is running on 192.168.0.1, run the following command on the VoltDB machine:

    export VERTICAIP="192.168.0.1"
  2. VoltDB uses JDBC to export click events to Vertica. Vertica's JDBC driver must be present in the classpath before starting VoltDB. If Vertica is installed in /opt/vertica, you can find the JDBC driver in /opt/vertica/java/lib/. Copy the JAR file from that directory to the VoltDB machine and put it in the lib/extension sub-directory of your VoltDB installation.
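
As a sketch, staging the driver might look like the following; both install locations are assumptions, so check the actual JAR name under /opt/vertica/java/lib/ on your system.

```shell
# Assumed locations -- adjust both to your installation
VOLTDB_HOME="$HOME/voltdb-ent-6.1"
VERTICA_JDBC_DIR=/opt/vertica/java/lib

if ls "$VERTICA_JDBC_DIR"/*.jar >/dev/null 2>&1; then
    # Copy whatever JDBC JAR Vertica ships into VoltDB's extension directory
    cp "$VERTICA_JDBC_DIR"/*.jar "$VOLTDB_HOME/lib/extension/"
else
    echo "no JDBC JAR found under $VERTICA_JDBC_DIR"
fi
```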

  3. Running the k-means clustering algorithm in Vertica requires the R language pack to be installed. Follow the instructions in the Vertica documentation to install it.

  4. Copy the vertica sub-directory of the example to the Vertica machine. This directory contains Vertica extensions for k-means clustering and for loading data back into VoltDB. The following instructions assume that the directory is copied to /tmp/vertica.

Demo Instructions

  1. Start the web server

    ./run.sh start_web
  2. Start the database and client

    # for Hadoop run
    $ ./run.sh hadoop_demo
    # for Vertica, run
    $ ./run.sh demo
  3. Open a web browser to http://hostname:8081

  4. For Vertica, execute the updatemodel.sh script on the Vertica machine to run the k-means clustering algorithm on the data there. You can rerun this command periodically to update the cluster model in VoltDB.

    $ /tmp/vertica/updatemodel.sh
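
Since the model improves as history accumulates, one simple way to rerun it on a schedule is a cron entry on the Vertica machine; the 10-minute interval here is an arbitrary example, not a recommendation from the app.

```shell
# A hypothetical crontab line: refresh the cluster model every 10 minutes
CRON_LINE='*/10 * * * * /tmp/vertica/updatemodel.sh'
echo "$CRON_LINE"

# To install it (commented out so this sketch has no side effects):
# ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```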
  5. To stop the demo:

    Stop the client (if it hasn't already completed) by pressing Ctrl-C.

    Stop the database:

    $ voltadmin shutdown

    Stop the web server:

    $ ./run.sh stop_web

Options

You can control various characteristics of the demo by modifying the parameters passed to the LogGenerator Java application in the "client" function within the run.sh script.

Speed & Duration:

--duration=120                (benchmark duration in seconds)
--ratelimit=20000             (run up to this rate of requests/second)
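
For example, to run the client for five minutes at up to 50,000 requests/second, the argument list passed to LogGenerator inside run.sh would change along these lines (the values are illustrative; the surrounding run.sh details are assumptions):

```shell
# Assemble the LogGenerator flags -- values here are illustrative
DURATION=300       # five minutes instead of the default 120 seconds
RATELIMIT=50000    # allow up to 50k requests/second
CLIENT_ARGS="--duration=$DURATION --ratelimit=$RATELIMIT"
echo "$CLIENT_ARGS"
```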

Instructions for running on a cluster

Before running this demo on a cluster, make the following changes:

  1. On each server, edit the run.sh file to set the HOST variable to the name of the first server in the cluster:

    HOST=voltserver01

  2. On each server, edit db/deployment.xml to change hostcount from 1 to the actual number of servers:

    <cluster hostcount="1" sitesperhost="3" kfactor="0" />
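
Editing the attribute by hand works fine; a scripted alternative for a hypothetical 3-node cluster is sketched below, demonstrated on a throwaway copy of the snippet so nothing real is modified (BSD sed needs `-i ''` instead of `-i`).

```shell
# Work on a throwaway copy so the real db/deployment.xml is untouched
SNIPPET=/tmp/deployment-snippet.xml
printf '<cluster hostcount="1" sitesperhost="3" kfactor="0" />\n' > "$SNIPPET"

# Bump hostcount from 1 to 3 (hypothetical cluster size)
sed -i 's/hostcount="1"/hostcount="3"/' "$SNIPPET"
cat "$SNIPPET"
```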
    
  3. On each server, start the database

    ./run.sh server
  4. On one server, edit the run.sh script to set the SERVERS variable to a comma-separated list of the servers in the cluster:

    SERVERS=voltserver01,voltserver02,voltserver03

  5. Run the client script:

    ./run.sh client

app-fastdata's People

Contributors

akhanzode, benjaminballard, jhugg, vtkstef, zeemanhe


app-fastdata's Issues

VoltDB 5.1 installer problem

The VoltDB 5.1 installer places VoltDB under /usr/.

This would not be an issue if everything were installed under /usr/voltdb, but the libraries actually end up under /usr/lib/voltdb.

This misleads run.sh: the jars are under /usr/lib/, not under /usr/ where the script expects them.

This also causes a connection-failed error, because VoltDB does not start via the ./run.sh script.
