
Spotify Analysis 🎵

HCMUS License: MIT

This project analyzes data from the Spotify platform, using the Spotify API and MongoDB for data extraction, Apache Hadoop (HDFS) for storage during ETL, PySpark for transformation, and Dremio and Power BI for visualization and in-depth data analysis.

Data Pipeline Prefect Docker Spotify Apache Hadoop Apache Spark Dremio MongoDB Power Bi

Table of contents 📌

Overview

Project Structure

Structure

Data Schema

We begin data collection by scraping a list of artist names from Spotify Artists. Using this list, we then call the Spotify API to extract comprehensive data about each artist. The raw data then goes through a series of ETL processes. Data Schema
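As an illustration of the batching step, the Spotify Web API's "Get Several Artists" endpoint accepts up to 50 IDs per request, so a scraped list is typically split into chunks before calling the API. A minimal sketch, assuming a hypothetical `artist_ids` list (the chunk size and names are illustrative, not the project's actual values):

```python
def chunk(items, size=50):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical example: 120 artist IDs become batches of 50, 50, 20.
artist_ids = [f"artist_{n}" for n in range(120)]
batches = chunk(artist_ids)
print([len(b) for b in batches])  # [50, 50, 20]
```

Each batch can then be passed to a single API call, keeping the request count (and rate-limit pressure) low.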

Demo Video

Our demo video is on YouTube; you can watch it via this Link

Prerequisite

Getting started 🚀

Set up environment

Clone this project to your machine by running the following command:

git clone https://github.com/PhongHuynh0394/Spotify-Analysis-with-PySpark.git
cd Spotify-Analysis-with-PySpark

Then create a .env file based on env_template:

cp env_template .env

Now fill in the blanks in the .env file (see the Prerequisite section for how to obtain these values):

# Spotify
SPOTIFY_CLIENT_ID=<your-api-key>
SPOTIFY_CLIENT_SECRET=<your-api-key> 

# Mongodb
MONGODB_USER=<your-user-name>
MONGODB_PASSWORD=<your-user-password>
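Once filled in, these variables are typically exported into the container environment and read at runtime. A minimal sketch of parsing .env-style text with the standard library only (the project may instead rely on Docker Compose's built-in .env support or python-dotenv; the `sample` values here are placeholders):

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# Spotify
SPOTIFY_CLIENT_ID=abc123
MONGODB_USER=admin
"""
print(parse_env(sample)["SPOTIFY_CLIENT_ID"])  # abc123
```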

OK, now it's Docker's job! Build the Docker images for this project by typing make build in your terminal.

This process might take a few minutes, so just chill and take a cup of coffee ☕

Note: if this step fails, remove the images or restart Docker and try again.
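The make targets presumably wrap Docker Compose; a hypothetical sketch of what such a Makefile might contain (the target names come from this README, but the recipe bodies are assumptions, not the project's actual Makefile):

```makefile
build:  # build all service images
	docker compose build

run:  # start the whole stack in the background
	docker compose up -d
```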

Once the images are built, it's time to run the system: just type make run

Then check your services to make sure everything works correctly:

  1. Hadoop
  2. Prefect
  3. Data Warehouse
  4. Dashboard
  5. Notebook

Run your data pipeline

We use Prefect to build our data pipelines. Open port 4200 to reach the Prefect UI, then go to the Deployments section, where you'll see two deployments corresponding to the two data pipelines.

Pipeline 1 (Ingest MongoDB Atlas flow)

This flow scrapes data from the Spotify API in batches and ingests it into MongoDB Atlas. It runs automatically every 2 minutes and 5 seconds.

pipeline1-a

pipeline1-b

Tip: this flow prepares your raw data in MongoDB; after it runs, you will see 4 collections in your database on MongoDB Atlas. Run this flow a few times before running pipeline 2.
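The "every 2 minutes and 5 seconds" cadence corresponds to a 125-second interval schedule. A minimal sketch of the interval arithmetic using only the standard library (the actual Prefect deployment configuration is in the repo; `ingest_batch` below is a hypothetical stand-in, not the project's real task):

```python
from datetime import timedelta

# Interval between automatic runs of the ingestion flow, per the README.
INGEST_INTERVAL = timedelta(minutes=2, seconds=5)

def ingest_batch(batch_number):
    """Hypothetical stand-in for one scrape-and-ingest run."""
    return f"batch {batch_number} ingested"

print(INGEST_INTERVAL.total_seconds())  # 125.0
print(ingest_batch(1))  # batch 1 ingested
```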

Pipeline 2 (ETL flow)

This flow does the ETL job: it extracts raw data from MongoDB and first fully loads it into HDFS in the bronze layer, then transforms it with PySpark into the silver and gold layers. You can trigger this flow manually by pressing the Run button in the top-right corner.

The bronze, silver, and gold layers are simply data-qualification directories in HDFS used to store successive copies of the data.
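Since the layers are just HDFS directories, the pipeline presumably writes each stage under a layer-specific path. A minimal sketch of such a path convention (the namenode host/port, base path, and table names are assumptions, not the project's actual values):

```python
# Hypothetical HDFS base URI; the real host/port depend on the cluster config.
HDFS_BASE = "hdfs://namenode:9000/spotify"

LAYERS = ("bronze", "silver", "gold")

def layer_path(layer, table):
    """Build the HDFS directory for a table in a given medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{HDFS_BASE}/{layer}/{table}"

print(layer_path("bronze", "artists"))
# hdfs://namenode:9000/spotify/bronze/artists
```

With this convention, the Spark job reads from one layer's path and writes Parquet to the next layer's path.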

pipeline2-a

pipeline2-b

Warehouse and UI

localhost:9047

We use Dremio to analyze the data in HDFS directly. Don't forget: the username is dremio and the password is dremio123. Then follow these instructions:

Log in to Dremio > Add Source > Choose HDFS

The connection window will appear; fill it in as follows:

  • Name: HDFS
  • NameNode Host: namenode

Then press Save to save your connection. Once it appears in the main window, go to the gold_layer directory and format all the .parquet directories. Then run your SQL statements and start analyzing.

You can use our SQL statements in warehouse.sql. These statements create the analytic views that Power BI uses to draw the dashboard; you can also see them in the PowerBI Dashboard.

UI

Streamlit

localhost:8501

Finally, you can open Streamlit to see the dashboard. It also uses a machine learning model to recommend the most popular songs for you.

PowerBI Dashboard

PowerBI Dashboard

You can also see it in powerbi_dashboard or in our Streamlit app.

And more

In the future, we plan to update this repo with:

  • Utilizing Deep Learning model: In the future, we plan to leverage a Deep Learning model, specifically an NLP model, to analyze the lyrics of tracks.
  • Using Flask or other frameworks: Our goal is to switch to Flask or other frameworks, replacing the Streamlit Dashboard for improved functionality.
  • Using MongoDB locally: To streamline deployment and allow for personalized configuration, we'll be transitioning to using MongoDB locally.

Contributors

  • Huỳnh Lưu Vĩnh Phong: Data Engineer, Team Lead
  • Trần Ngọc Tuấn: Data Engineer
  • Phạm Duy Sơn: Data Science
  • Mai Chiến Vĩ Thiên: Data Analyst

Finally

Feel free to use 😄
