This project is the Final Project for DSAA5021. You can quickly set up the environment and try out our code by following these steps.
Please download the full dataset from MovieLens and place it in the `ml-25m` folder.
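The download step can also be scripted. Below is a minimal sketch; the download URL is an assumption based on the standard GroupLens hosting location for the 25M dataset, so verify it on the MovieLens site before running:

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumed download location for the MovieLens 25M dataset; check
# the MovieLens/GroupLens site before relying on it.
ML25M_URL = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"


def download_dataset(dest: str = "ml-25m.zip") -> str:
    """Fetch the zip archive (roughly 250 MB) if it is not already present."""
    if not Path(dest).exists():
        urllib.request.urlretrieve(ML25M_URL, dest)
    return dest


def extract_dataset(archive_path: str, dest_dir: str = ".") -> list:
    """Unpack the archive; it already contains a top-level ml-25m/ folder."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

Calling `extract_dataset(download_dataset())` from the repository root leaves the files in `ml-25m/`, matching the layout the notebooks expect.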
| Application | URL | Description |
|---|---|---|
| JupyterLab | localhost:8889 | Cluster interface with built-in Jupyter notebooks |
| Spark Driver | localhost:4041 | Spark driver web UI |
| Spark Master | localhost:8080 | Spark master node |
| Spark Worker I | localhost:8081 | Spark worker node with 1 core and 4 GB of memory (default) |
| Spark Worker II | localhost:8082 | Spark worker node with 1 core and 4 GB of memory (default) |
| Spark Worker III | localhost:8083 | Spark worker node with 1 core and 4 GB of memory (default) |
Before starting, ensure you have Docker and Docker Compose installed on your computer. Follow the guides below based on your operating system:
For Windows or Mac:
- Visit Docker Hub to download and install Docker Desktop for Windows or Mac.
- Docker Compose will be included automatically as part of Docker Desktop.
For Linux:
- Install Docker using your distribution's package manager (e.g., `apt` for Ubuntu, `dnf` for Fedora).
- Install Docker Compose separately by following the instructions on the official Docker website.
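Once installed, you can sanity-check that both tools are reachable from your shell. A small sketch (note that recent Docker Desktop versions expose Compose as the `docker compose` subcommand rather than a separate `docker-compose` binary, so one of the two names may legitimately be missing):

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable with this name is on the PATH."""
    return shutil.which(name) is not None


for tool in ("docker", "docker-compose"):
    status = "found" if tool_available(tool) else "missing"
    print(f"{tool}: {status}")
```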
To start using the Spark Performance Analysis on MovieLens, follow these steps:
1. Edit the Docker Compose file with your preferred configuration.
2. Start the cluster:

   ```
   docker-compose up
   ```

3. The directory containing the Docker Compose file is mounted into the container at `/root/local-workspace`.
4. Run Apache Spark code using the provided Jupyter notebooks with PySpark.
5. Stop the cluster by pressing `Ctrl+C` in the terminal.
6. Repeat step 2 to restart the cluster.
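For step 4, a notebook cell along the following lines works as a quick smoke test of the cluster. The master URL (`spark://spark-master:7077`) and the data path are assumptions; adjust them to match the service name in your Compose file and the `/root/local-workspace` mount described above:

```python
def describe_counts(counts):
    """Format (rating, count) pairs for printing; pure Python, no Spark needed."""
    return "\n".join(f"{rating:>4}: {count}" for rating, count in counts)


if __name__ == "__main__":
    from pyspark.sql import SparkSession

    # Assumed master URL; change "spark-master" to your Compose service name.
    spark = (
        SparkSession.builder
        .appName("movielens-smoke-test")
        .master("spark://spark-master:7077")
        .getOrCreate()
    )

    # The Compose directory is mounted at /root/local-workspace in the container.
    ratings = spark.read.csv(
        "/root/local-workspace/ml-25m/ratings.csv", header=True, inferSchema=True
    )
    rows = ratings.groupBy("rating").count().orderBy("rating").collect()
    print(describe_counts((r["rating"], r["count"]) for r in rows))
    spark.stop()
```

If the job shows up on the Spark Master UI at localhost:8080 and prints a per-star histogram, the workers and the dataset mount are both wired up correctly.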