Spark Tutorial for DSSG 2015
You definitely want to refer to the official docs from Apache Spark.
Installation
Anaconda is le shiz:
conda install -c blaze spark
This package apparently isn't public yet, but Matt Rocklin and the Blaze team are very interested in supporting Spark.
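If the conda package installs cleanly, a quick smoke test looks something like the following (a minimal sketch; it assumes the blaze package puts pyspark directly on your import path, which I haven't verified):

# Quick smoke test for a conda-installed Spark.
# Assumes pyspark is importable as-is; if not, you'll need the
# SPARK_HOME / startup-file setup described further down.
from pyspark import SparkContext

sc = SparkContext('local[*]', 'smoke-test')
print(sc.parallelize(range(100)).sum())  # should print 4950
sc.stop()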
On Linux
CDH 5 is probably the best way to go for Linux: it includes Spark 1.3.0 (which includes Spark SQL), plus Hadoop, etc. Strangely, it doesn't appear to support Postgres 9.4, and Spark SQL is "unsupported" (though it is installed). I don't know whether that's just a judgement call or whether there are CDH-specific problems with Spark SQL; note that Cloudera develops Impala, a "competitor" to Spark.
On OS X
Spark 1.3 still targets Scala 2.10. This is non-standard at this point on Homebrew, so I did:
brew install scala210
brew link --force scala210
Homebrew complains, but I won't be installing Scala 2.11 anytime soon.
Virtual Machines
Hortonworks and Cloudera both provide VMs. For now, Cloudera's is more up-to-date (Hortonworks ships Spark 1.2). Cloudera also supports more Linux flavors (it provides debs and rpms).
Setting up IPython
At a minimum, you'll need something like this in your ~/.bash_profile:
# Setup for Spark / PySpark (sadly, that IPYTHON variable is a bit generally named...)
export IPYTHON=1
export SPARK_HOME=~/anaconda/share/spark # Or wherever your anaconda dir is
# Or wherever you put the local spark install
# export SPARK_HOME=~/WHEREVER-YOU-UNPACKED-SPARK-CHANGE-THIS/spark-1.3.1-bin-hadoop2.6
# export PATH=$SPARK_HOME/bin:$PATH
# You should reduce the memory used to something reasonable for your laptop
export PYSPARK_SUBMIT_ARGS='--master local[*] --executor-memory 12g'
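After you source ~/.bash_profile (or open a new terminal), it's worth confirming the variables actually made it into your environment; this check is nothing Spark-specific, just plain Python:

import os

# Both should echo back what you exported above; a None here means
# your shell didn't pick up the new ~/.bash_profile.
print(os.environ.get('SPARK_HOME'))
print(os.environ.get('PYSPARK_SUBMIT_ARGS'))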
This is adapted from a Cloudera blog post (which I'm no longer linking because it has problems, and it covers remote, secure execution that we won't worry about today). Here are the essentials:
ipython profile create pyspark
Copy the 00-pyspark-setup.py file to your new profile's startup directory, which will be something like ~/.ipython/profile_pyspark/startup.
You'll need to modify the paths in it to reflect your installation root (under share/spark in your anaconda root, or wherever you unzipped the tarball).
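For reference, here's a minimal sketch of what 00-pyspark-setup.py typically looks like (based on the usual pattern from posts like that one; the py4j zip name varies by Spark release, so globbing for it is safer than hardcoding a version):

import glob
import os
import sys

# Find the Spark installation from the SPARK_HOME exported above.
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put PySpark and its bundled py4j on the import path.
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python/lib/py4j-*-src.zip'))[0])

# Run the PySpark shell bootstrap, which creates the SparkContext as `sc`.
# (execfile is Python 2; on Python 3 you'd use exec(open(path).read()).)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

With that in place, every session you start with ipython --profile=pyspark should come up with a ready-made SparkContext bound to sc.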