
# Data Intensive Computing - ID2221

This course consists of four lab assignments that cover different topics of the course:

  • Lab 1: In the first lab assignment you will practice the basics of data-intensive programming by setting up HDFS and Hadoop MapReduce and implementing a simple application on them.
  • Lab 2: The second lab assignment is based on Spark and Spark SQL, and you will learn how to use them to process and analyze data (a small Spark sketch follows this list).
  • Lab 3: This lab assignment focuses on processing streaming data. You will work with Spark Streaming to process data online as it flows.
  • Lab 4: In the last assignment you will work with GraphX to process graph-based data.
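
To give a flavour of what the Spark-based assignments involve, here is a minimal word-count sketch in Scala. It is only an illustration, not part of the lab skeletons; the object name, the input file `input.txt`, and the `local[2]` master setting are assumptions made for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Needed for reduceByKey on pair RDDs in older Spark 1.x versions.
import org.apache.spark.SparkContext._

// A minimal Spark batch job: count word occurrences in a local text file.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Split each line into words, pair each word with 1, and sum the counts per word.
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```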

## Lab Assignments Environment

We use IPython/Jupyter notebooks for the lab assignments. Notebooks are documents that contain both programming code (e.g., Python or Scala) and human-readable text elements (e.g., paragraphs, figures, and links). Since we use the Scala programming language and the Spark platform in the assignments, in addition to IPython/Jupyter we also need to install ISpark. Below, we first explain how to install IPython/Jupyter, and then present the ISpark installation steps. Note that these steps are given for a Linux operating system, so if you do not have Linux, you need to install it either on your machine or in VirtualBox. You can download VirtualBox from its page, and you can also find various ready-to-use Linux distribution images for VirtualBox here.

### IPython Installation

The steps to install IPython/Jupyter:

1. Install pip, the package management system used to install and manage software packages written in Python.

```bash
sudo apt-get install python-dev libncurses-dev python-pip
```


2. Use `pip` to install IPython. The following command downloads and installs IPython and its main optional dependencies for the notebook, qtconsole, tests, and other functionality.
```bash
sudo pip install ipython[all]==3.2.1
```

3. Download and install Java and Spark on your machine. The easiest way to install Spark is to download the "pre-built for CDH 4" package from here.

4. Set the environment variables.

```bash
export JAVA_HOME=<JAVA_PATH>
export SPARK_HOME=<SPARK_PATH>
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$SPARK_HOME/python
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
```


5. Create an IPython profile to support the Python programming language.

```bash
ipython profile create pyspark
```

6. Edit the created profile by copying the following lines into `00-pyspark-setup.py`. This file is at `~/.ipython/profile_pyspark/startup/00-pyspark-setup.py`.

```python
import os
import sys

# Add PySpark and Py4J to the Python path so the notebook can import them.
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Start the PySpark shell inside the notebook (execfile is Python 2 syntax).
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
```


7. Run and test the IPython notebook by running the following command.
```bash
ipython notebook --profile=pyspark --debug
```

### ISpark Installation

ISpark is an Apache Spark-shell backend for IPython.

1. Install Maven first, if you do not have it.

```bash
sudo apt-get install maven
```


2. Download ISpark from [here](https://github.com/tribbloid/ISpark/archive/master.zip).

3. ISpark needs to be compiled and packaged into a jar by Maven before being submitted and deployed. Extract the ISpark file you downloaded and run the following commands.

```bash
cd ISpark-master
./mvn-install.sh
```

4. Create an ISpark profile.

```bash
ipython profile create spark
```


5. Edit the profile by copying the following lines into `ipython_config.py`, which is at `~/.ipython/profile_spark/ipython_config.py`. Below we refer to `ispark-core-assembly-0.2.0-SNAPSHOT.jar`, which was built in step 3 and is located at `ISpark-master/core/target/scala-2.10`.

```python
import os

c = get_config()

spark_home = os.environ['SPARK_HOME']
master = 'local[2]'

# Launch the ISpark kernel through spark-submit.
c.KernelManager.kernel_cmd = [spark_home + "/bin/spark-submit",
    "--master", master,
    "--class", "org.tribbloid.ispark.Main",
    "--executor-memory", "2G",
    "--jars", "<PATH ON YOUR MACHINE>/ispark-core-assembly-0.2.0-SNAPSHOT.jar",
    "--profile", "{connection_file}",
    "--parent"]
```

6. Run and test the notebook. A short verification snippet that you can paste into a notebook cell follows below.

```bash
ipython notebook --profile=spark --debug
```
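
Once the notebook is running with the `spark` profile, you can sanity-check the ISpark kernel from a notebook cell with a short Scala snippet such as the one below. It assumes the kernel exposes a SparkContext named `sc`, as the Spark shell does; it is only a quick check, not part of the assignments.

```scala
// Quick check that the notebook's SparkContext (assumed to be `sc`) is working.
val rdd = sc.parallelize(1 to 100)

// Sum of 1..100; should print 5050 if Spark is wired up correctly.
println(rdd.reduce(_ + _))
```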


## Things To Deliver
The assignments can be done individually or in groups of two. For labs 1 and 3, you should hand in the implemented code in zip files; for the other labs, you just need to deliver the completed IPython notebook files.

