This repository contains two notebooks that I created for my Data Science class.
The first notebook, main.ipynb, contains the following:
- General information on how Spark works
- How to create a Spark session
- How to read data into Spark
- How to create and manipulate a Spark DataFrame
- How to use Spark SQL to query data
- How to use Spark ML to create a machine learning model
The second notebook, lowLevel.ipynb, covers the following:
- Create a Spark Context
- Read data into Spark RDDs
- Manipulate Spark RDDs
- Perform MapReduce operations on Spark RDDs
These notebooks are meant to be used together with the data folder, which contains the dataset used both for querying the data and for creating the machine learning model.
Please set up your environment before running these notebooks. I used the following versions:
- Python 3.10.12
- Spark 3.4.1
- Java 1.8.0_319
- PySpark 3.4.1
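
To compare a local setup against the versions listed above, a small check along these lines may help. It is only a sketch: report_versions is a hypothetical helper name, and it merely reports what it finds rather than enforcing any particular version.

```python
import shutil
import sys
from importlib import metadata


def report_versions():
    """Return the local Python version, PySpark version, and Java availability."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    try:
        # Version of the installed pyspark package, if any.
        versions["pyspark"] = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        versions["pyspark"] = None
    # True when a java executable is on PATH; run `java -version` for details.
    versions["java_on_path"] = shutil.which("java") is not None
    return versions


print(report_versions())
```
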
If you would rather write the code yourself, feel free to use the corresponding student* files, which contain blank cells for you to fill in.