migueldvb / bigdatacourse

This project forked from asvyatkovskiy/bigdatacourse

Materials for the introduction to Big Data with Apache Spark mini-course

# Introduction to Big Data with Apache Spark

Pre-exercises

The pre-exercises are intended to build some domain knowledge in the fields covered during the course. The interactive IPython notebooks cover web mining (scraping), text processing, elements of natural language processing, machine learning (in particular, the k-nearest neighbour classifier) and some modern data structures (the Pandas DataFrame). Each pre-exercise is meant to be completed on your laptop and does not require cluster access or an installation of Apache Hadoop or Apache Spark.

The pre-exercises cover some of the topics that will be discussed in detail during the main course.

Getting started with pre-exercises

All exercises are intended to be performed on your laptop. In addition, there is an exercise to test whether you can connect to the computing cluster and start the Spark shell.

Download and install Anaconda

Please go to https://www.continuum.io/downloads, then download and install the Anaconda version for Python 2.7 for your operating system. After that, type:

conda --help

and read the manual. Once Anaconda is ready, download the following requirements file: https://github.com/ASvyatkovskiy/BigDataCourse/blob/master/preexercise/conda-requirements.txt and proceed with setting up the environment:

conda create --name alexeys_conda --file conda-requirements.txt
source activate alexeys_conda

Please feel free to change the Anaconda environment name.
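Before creating the environment, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of the course materials: `have_tool` and `check_conda_setup` are hypothetical helper names, and the script assumes the requirements file was saved as `conda-requirements.txt` in the current directory.

```shell
#!/bin/sh
# Return 0 if the named command is available on PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether the environment can be created.
# $1: path to the requirements file (assumed to be downloaded already)
check_conda_setup() {
    req_file="$1"
    status=0
    if ! have_tool conda; then
        echo "conda not found on PATH; install Anaconda first"
        status=1
    fi
    if [ ! -f "$req_file" ]; then
        echo "requirements file not found: $req_file"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "ready: conda create --name <env_name> --file $req_file"
    fi
    return "$status"
}

# Example call; the check only prints messages and changes nothing.
check_conda_setup conda-requirements.txt || true
```

If both checks pass, the script prints the `conda create` command to run; otherwise it names what is missing.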

Install git and IPython

If you do not already have git installed, install it following the instructions at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git and then check out the course repository. You will also need to install IPython: http://ipython.org/install.html

Check out the git repository with the pre-exercises

git clone https://github.com/ASvyatkovskiy/BigDataCourse
cd BigDataCourse/preexercise/

Start the interactive IPython notebook:

ipython notebook

and proceed to complete each of the pre-exercises one by one.

Connecting to the cluster

All the pre-exercises are supposed to be completed on your laptop. After that, please make sure you have an Adroit computing account, or request one following the instructions on this page: https://www.princeton.edu/researchcomputing/computational-hardware/adroit/

Once you have the account, complete the following steps:

Login to Adroit with X11 forwarding enabled:
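A typical login uses the `-X` option of `ssh`, which requests X11 forwarding. The sketch below only prints the command to run. The helper name `adroit_login_cmd`, the username and the host name `adroit.princeton.edu` are assumptions for illustration; check the Adroit page linked above for the actual login host.

```shell
#!/bin/sh
# Build the ssh command for a login with X11 forwarding (-X).
# $1: your cluster username
# $2: login host (assumed here, not stated in the text)
adroit_login_cmd() {
    printf 'ssh -X %s@%s\n' "$1" "$2"
}

# Hypothetical values; replace them with your own username and host.
adroit_login_cmd your_username adroit.princeton.edu
```

Copy the printed command into your terminal once the username and host are correct.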

Check out the course GitHub repository:

git clone https://github.com/ASvyatkovskiy/BigDataCourse 
cd BigDataCourse/preexercise

Submit a test Spark-slurm job (which only starts the Spark cluster and does nothing else):

sbatch hello_spark_slurm.cmd

Check that your job got submitted:

squeue -u <your_username>

Look for the Slurm output file in the submission folder:

ls -l slurm-*.out

and inspect it with your favourite text editor. The last line in the log file will give you the name of the Spark master node.
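If you only need the master node name, the last line can be read without opening an editor. The following is a minimal sketch; `spark_master_line` is a hypothetical helper name, and it assumes the log of interest is the most recently modified `slurm-*.out` file in the submission folder.

```shell
#!/bin/sh
# Print the last line of the most recently modified slurm-*.out file.
# $1: directory to search (defaults to the current directory)
spark_master_line() {
    dir="${1:-.}"
    # ls -t sorts by modification time, newest first.
    newest=$(ls -t "$dir"/slurm-*.out 2>/dev/null | head -n 1) || true
    if [ -z "$newest" ]; then
        echo "no slurm-*.out file found in $dir" >&2
        return 1
    fi
    tail -n 1 "$newest"
}

# Example call from the submission folder.
spark_master_line . || true
```

If the job has not produced an output file yet, the function reports that on standard error and returns a non-zero status.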

In case help is needed

If you are experiencing any problems with the installation or the pre-exercises, please email me at [email protected] or come see me at the regular CSES office hours on Tuesday.

Contributors

asvyatkovskiy
