
Spotify Analysis 🎵

HCMUS License: MIT

This project analyzes data from the Spotify platform, using the Spotify API and MongoDB for data extraction, Apache Hadoop (HDFS) for storage during ETL, PySpark for transformation, and Dremio and Power BI for visualization and in-depth data analysis.

Data Pipeline Prefect Docker Spotify Apache Hadoop Apache Spark Dremio MongoDB Power Bi

Table of contents 📌

Overview

Project Structure

Structure

Data Schema

We begin data collection by scraping a list of artist names from Spotify Artists. Using this list, we then call the Spotify API to extract comprehensive data about each artist. The raw data then goes through a series of ETL processes. Data Schema
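As an illustration of the batching step, the Spotify Web API's "Get Several Artists" endpoint accepts up to 50 IDs per request, so a scraped list is typically split into chunks before calling the API. A minimal sketch, assuming a hypothetical `artist_ids` list (the chunk size and names are illustrative, not the project's actual values):

```python
def chunk(items, size=50):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical example: 120 artist IDs become batches of 50, 50, 20.
artist_ids = [f"artist_{n}" for n in range(120)]
batches = chunk(artist_ids)
print([len(b) for b in batches])  # [50, 50, 20]
```

Each batch can then be passed to a single API call, keeping the request count (and rate-limit pressure) low.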

Demo Video

Our demo video is on YouTube; you can watch it via this Link

Prerequisite

Getting started 🚀

Set up environment

Clone this project to your machine by running the following command:

git clone https://github.com/PhongHuynh0394/Spotify-Analysis-with-PySpark.git
cd Spotify-Analysis-with-PySpark

Then create a .env file based on env_template:

cp env_template .env

Now fill in the blanks in the .env file (see the Prerequisite section for how to obtain these values):

# Spotify
SPOTIFY_CLIENT_ID=<your-api-key>
SPOTIFY_CLIENT_SECRET=<your-api-key> 

# Mongodb
MONGODB_USER=<your-user-name>
MONGODB_PASSWORD=<your-user-password>
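Once filled in, these variables are typically exported into the container environment and read at runtime. A minimal sketch of parsing .env-style text with the standard library only (the project may instead rely on Docker Compose's built-in .env support or python-dotenv; the `sample` values here are placeholders):

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# Spotify
SPOTIFY_CLIENT_ID=abc123
MONGODB_USER=admin
"""
print(parse_env(sample)["SPOTIFY_CLIENT_ID"])  # abc123
```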

OK, now it's Docker's job! Build the Docker images for this project by typing make build in your terminal.

This process might take a few minutes, so just chill and take a cup of coffee ☕

Note: if this step fails, remove the images or restart Docker and try again.
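The make targets presumably wrap Docker Compose; a hypothetical sketch of what such a Makefile might contain (the target names come from this README, but the recipe bodies are assumptions, not the project's actual Makefile):

```makefile
build:  # build all service images
	docker compose build

run:  # start the whole stack in the background
	docker compose up -d
```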

Once the images are built, it's time to run the system: just type make run

Then check your services to make sure everything works correctly:

  1. Hadoop
  2. Prefect
  3. Data Warehouse
  4. Dashboard
  5. Notebook

Run your data pipeline

We use Prefect to build our data pipelines. Open port 4200 to reach the Prefect UI, then go to the Deployments section, where you'll see two deployments corresponding to the two data pipelines.

Pipeline 1 (Ingest MongoDB Atlas flow)

This flow scrapes data from the Spotify API in batches and ingests it into MongoDB Atlas. It runs automatically every 2 minutes and 5 seconds.

pipeline1-a

pipeline1-b

Tip: this flow prepares your raw data in MongoDB; after it runs, you will see 4 collections in your database on MongoDB Atlas. Run this flow a few times before running pipeline 2.
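The "every 2 minutes and 5 seconds" cadence corresponds to a 125-second interval schedule. A minimal sketch of the interval arithmetic using only the standard library (the actual Prefect deployment configuration is in the repo; `ingest_batch` below is a hypothetical stand-in, not the project's real task):

```python
from datetime import timedelta

# Interval between automatic runs of the ingestion flow, per the README.
INGEST_INTERVAL = timedelta(minutes=2, seconds=5)

def ingest_batch(batch_number):
    """Hypothetical stand-in for one scrape-and-ingest run."""
    return f"batch {batch_number} ingested"

print(INGEST_INTERVAL.total_seconds())  # 125.0
print(ingest_batch(1))  # batch 1 ingested
```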

Pipeline 2 (ETL flow)

This flow does the ETL job: it extracts raw data from MongoDB and first fully loads it into HDFS in the bronze layer, then transforms it with PySpark into the silver and gold layers. You can trigger this flow manually by pressing the Run button in the top-right corner.

The bronze, silver, and gold layers are simply data-qualification directories in HDFS used to store successive copies of the data.
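Since the layers are just HDFS directories, the pipeline presumably writes each stage under a layer-specific path. A minimal sketch of such a path convention (the namenode host/port, base path, and table names are assumptions, not the project's actual values):

```python
# Hypothetical HDFS base URI; the real host/port depend on the cluster config.
HDFS_BASE = "hdfs://namenode:9000/spotify"

LAYERS = ("bronze", "silver", "gold")

def layer_path(layer, table):
    """Build the HDFS directory for a table in a given medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{HDFS_BASE}/{layer}/{table}"

print(layer_path("bronze", "artists"))
# hdfs://namenode:9000/spotify/bronze/artists
```

With this convention, the Spark job reads from one layer's path and writes Parquet to the next layer's path.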

pipeline2-a

pipeline2-b

Warehouse and UI

localhost:9047

We use Dremio to analyze the data in HDFS directly. Don't forget: the username is dremio and the password is dremio123. Then follow these instructions:

Log in to Dremio > Add Source > Choose HDFS

The connection window will appear; fill it in as follows:

  • Name: HDFS
  • NameNode Host: namenode

Then press Save to save your connection. Once it appears in the main window, go to the gold_layer directory and format all the .parquet directories. Then run your SQL statements and start analyzing.

You can use our SQL statements in warehouse.sql. These statements create the analytic views that Power BI uses to draw the dashboard; you can also see them in the PowerBI Dashboard.

UI

Streamlit

localhost:8501

Finally, you can open Streamlit to see the dashboard. It also uses a machine learning model to recommend the most popular songs for you.

PowerBI Dashboard

PowerBI Dashboard

You can also see it in powerbi_dashboard or in our Streamlit app.

And more

In the future, we plan to update this repo with:

  • Utilizing Deep Learning model: In the future, we plan to leverage a Deep Learning model, specifically an NLP model, to analyze the lyrics of tracks.
  • Using Flask or other frameworks: Our goal is to switch to Flask or other frameworks, replacing the Streamlit Dashboard for improved functionality.
  • Using MongoDB locally: To streamline deployment and allow for personalized configuration, we'll be transitioning to using MongoDB locally.

Contributors

  • Huỳnh Lưu Vĩnh Phong: Data Engineer, Team Lead
  • Trần Ngọc Tuấn: Data Engineer
  • Phạm Duy Sơn: Data Science
  • Mai Chiến Vĩ Thiên: Data Analyst

Finally

Feel free to use 😄
