garyfub / strata-tutorial-2016-nyc Goto Github PK
View Code? Open in Web Editor NEWThis project forked from cloudera/strata-tutorial-2016-nyc
License: Apache License 2.0
This project forked from cloudera/strata-tutorial-2016-nyc
License: Apache License 2.0
Strata 2016 NYC - Hadoop in the Cloud Tutorial ============================================== Introduction ------------ This repository contains everything needed to run through a two step pipeline that will ingest data via Spark and make it available for interactive queries via Impala. The Spark step is expected to be a transient cluster running only for a couple of hours each day (potentially triggered via cron). The Impala one is expected to be long running and elastic and have multiple users that use Hue or the JDBC interface. Common Settings --------------- Start by going through the AWS Quickstart flow: https://aws.amazon.com/quickstart/ http://docs.aws.amazon.com/quickstart/latest/cloudera/welcome.html Relevant identifiers will be available as CloudFormation outputs. Go to the AWS console and make note of the EC2 instance called ClusterLauncher. This is your Director server instance. SSH to the created Director server instance. $ ssh ec2-user@<Director server IP> -i <PEM file whose keyname was provided to AWS Quickstart> Download the conf files from this github repo to your Director server instance. Modify common.conf.sample by providing details specific to your AWS account. Use the information from the CloudFormation output. Use the ClusterLauncher security-group, and not the NAT security-group. Alternatively, look at aws.sample.conf to see values that should go into common.conf. Save this file as common.conf. Run validation for both configuration files to ensure everything is configured properly: $ cloudera-director validate spark.conf $ cloudera-director validate impala.conf Create a tunnel to Director from your local machine: $ ssh -C -L 7189:localhost:7189 ec2-user@<Cluster Launcher IP> # Use your browser to go to http://localhost:7189/ Data ingest via Spark --------------------- Ask Director to setup the Spark cluster for ETL: $ cloudera-director bootstrap-remote spark.conf --lp.remote.username=admin # Director will ask for the admin password Progress information is also available in the Director UI. Establish a tunnel to Cloudera Manager: $ ssh -i cloudera.pem -CN -L 7180:<CM Private IP>:7180 ec2-user@<Cluster Launcher IP> SSH into the master node and open the Spark shell: $ sudo -u hdfs -i bash $ curl -o ingest.scala https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/ingest.scala $ spark-shell -i ingest.scala $ curl -o schema.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/schema.hql $ curl -o copy_to_s3.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/copy_to_s3.hql Modify the schema.hql file to point to a new S3 bucket created via the AWS console. $ hive -f schema.hql $ hive -f copy_to_s3.hql SQL via Impala on integested data --------------------------------- Ask Director to setup the Impala cluster for interactive queries: $ cloudera-director bootstrap-remote impala.conf --lp.remote.username=admin # Director will ask for the admin password Establish a tunnel to Cloudera Manager: $ ssh -i cloudera.pem -CN -L 7180:<CM Private IP>:7180 ec2-user@<Cluster Launcher IP> SSH into the master node and open the Impala shell: $ sudo -u hdfs -i bash $ curl -o schema.hql https://raw.githubusercontent.com/cloudera/strata-tutorial-2016-nyc/master/schema.hql $ hive -f schema.hql Start the impala-shell from a worker node. Identify the Impala worker node from Cloudera Director UI. $ impala-shell -i <IP address of Impala worker node> # Run some interesting queries reading from S3
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.