GithubHelp home page GithubHelp logo

strata-tutorial-2016-nyc's Introduction

Strata 2016 NYC - Hadoop in the Cloud Tutorial
==============================================

Introduction
------------

This repository contains everything needed to run through a two step pipeline
that will ingest data via Spark and make it available for interactive queries
via Impala.

The Spark step is expected to be a transient cluster running only for a couple
of hours each day (potentially triggered via cron).

The Impala one is expected to be long running and elastic and have multiple
users that use Hue or the JDBC interface.

Common Settings
---------------

Start by going through the AWS Quickstart flow:

https://aws.amazon.com/quickstart/
http://docs.aws.amazon.com/quickstart/latest/cloudera/welcome.html

Relevant identifiers will be available as CloudFormation outputs.

Go to the AWS console and make note of the EC2 instance called ClusterLauncher. This is your Director server instance.

SSH to the created Director server instance.
$ ssh ec2-user@<Director server IP> -i <PEM file whose keyname was provided to AWS Quickstart>

Download the conf files from this github repo to your Director server instance.
Modify common.conf.sample by providing details specific to your AWS account.
Use the information from the CloudFormation output. Use the ClusterLauncher security-group, and not the NAT security-group.
Alternatively, look at aws.sample.conf to see values that should go into common.conf.
Save this file as common.conf.

Run validation for both configuration files to ensure everything is 
configured properly:

$ cloudera-director validate spark.conf
$ cloudera-director validate impala.conf

Create a tunnel to Director from your local machine:

$ ssh -C -L 7189:localhost:7189 ec2-user@<Cluster Launcher IP>
# Use your browser to go to http://localhost:7189/

Data ingest via Spark
---------------------

Ask Director to setup the Spark cluster for ETL:

$ cloudera-director bootstrap-remote spark.conf --lp.remote.username=admin
# Director will ask for the admin password

Progress information is also available in the Director UI.

Establish a tunnel to Cloudera Manager:

$ ssh -i cloudera.pem -CN -L 7180:<CM Private IP>:7180 ec2-user@<Cluster Launcher IP>

SSH into the master node and open the Spark shell:

$ sudo -u hdfs -i bash

$ curl -o ingest.scala https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/ingest.scala
$ spark-shell -i ingest.scala

$ curl -o schema.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/schema.hql
$ curl -o copy_to_s3.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/copy_to_s3.hql

Modify the schema.hql file to point to a new S3 bucket created via the AWS console.

$ hive -f schema.hql
$ hive -f copy_to_s3.hql

SQL via Impala on integested data
---------------------------------

Ask Director to setup the Impala cluster for interactive queries:

$ cloudera-director bootstrap-remote impala.conf --lp.remote.username=admin
# Director will ask for the admin password

Establish a tunnel to Cloudera Manager:

$ ssh -i cloudera.pem -CN -L 7180:<CM Private IP>:7180 ec2-user@<Cluster Launcher IP>

SSH into the master node and open the Impala shell:

$ sudo -u hdfs -i bash

$ curl -o schema.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/schema.hql
$ hive -f schema.hql

Start the impala-shell from a worker node. Identify the Impala worker node from Cloudera Director UI.
$ impala-shell -i <IP address of Impala worker node>
# Run some interesting queries reading from S3

strata-tutorial-2016-nyc's People

Contributors

vinithra avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.