ubc-mds / dsci_522_spotify_track_popularity_predictor

Spotify Track Popularity Predictor by Group 27

License: MIT License


dsci_522_spotify_track_popularity_predictor's Introduction

Song Popularity Predictor

Data analysis project for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.

Team Members:

  • Victor Francis
  • Reza Mirzazadeh
  • Qingqing Song
  • Jessie Wong

Project Description

This project uses the audio_features dataset, which contains information about Spotify tracks, such as performer, genre, duration, and loudness. The data is from tidytuesday and was obtained here. The research question we aim to answer is whether we can predict the popularity of a song from features such as genre, duration, energy, tempo, and acousticness.

Exploratory data analysis (EDA)

Each row of the data set represents a song, with its features and its popularity. We are interested in predicting song popularity from these features. Data wrangling was necessary to keep only the columns that are informative and relevant to our target.

Prior to analysis, we performed EDA on the features to assess the correlation between the features themselves and between each feature and song popularity. As a result, we dropped missing values and features that do not contribute to the predictive quality of the ridge model, such as spotify_track_preview_url, song_id and time_signature, and we focused on columns such as energy, danceability, speechiness, and loudness.

Some exploratory questions we will answer are which pairs of features are strongly correlated and which columns contain the largest number of missing values. One exploratory figure we will create is a correlation plot or heatmap showing which pairs of features are correlated. The exploratory data analysis can be found here.
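
As a rough illustration, such a correlation heatmap could be built with pandas and Altair along the following lines. This is a minimal sketch, not the project's actual EDA script: the input path and the exact column list are assumptions for illustration only.

import pandas as pd
import altair as alt

# Assumed input path for illustration; the real data location may differ.
audio_features = pd.read_csv("data/raw/audio_features.csv")

numeric_cols = ["danceability", "energy", "loudness", "speechiness",
                "acousticness", "instrumentalness", "liveness", "valence", "tempo"]

# Pairwise correlations reshaped to long form for plotting
corr_long = (
    audio_features[numeric_cols]
    .corr()
    .reset_index()
    .melt(id_vars="index", var_name="feature_2", value_name="correlation")
    .rename(columns={"index": "feature_1"})
)

heatmap = alt.Chart(corr_long).mark_rect().encode(
    x=alt.X("feature_1:N", title=None),
    y=alt.Y("feature_2:N", title=None),
    color="correlation:Q",
    tooltip=["feature_1", "feature_2", "correlation"],
)
heatmap.save("correlation_heatmap.html")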

Finally, after completing all necessary analysis to answer our research question, we will share the results as a table and as multiple plots, showing the predicted distribution of song popularity for each feature.

The steps of our analysis follow the flowchart below.

Report

The final report can be found here.

Usage

There are two suggested ways to run this analysis:

1. Using Docker

To replicate the analysis, install Docker. Then clone this GitHub repository and run the following command at the command line/terminal from the root directory of this project:

docker run --rm -v /$(pwd):/home/spotify qq1207/spotify_track_popularity_predictor make -C /../home/spotify all

To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:

docker run --rm -v /$(pwd):/home/spotify qq1207/spotify_track_popularity_predictor make -C /../home/spotify clean

2. Without using Docker

To replicate the analysis, clone this GitHub repository, install the dependencies listed below, and run the following command at the command line/terminal from the root directory of this project:

make all

To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:

make clean

Dependencies

  • Python 3.9.7 and Python packages:

    • docopt=0.6.2
    • pandas=1.3.4
    • numpy=1.21.4
    • sklearn=1.0.1
    • altair=4.1.0
  • R version 4.1.1 and R packages:

    • knitr=1.3
    • tidyverse=1.3.1
    • dplyr=1.0.7

References

de Jonge, Edwin. 2018. Docopt: Command-Line Interface Specification Language. https://CRAN.R-project.org/package=docopt.

Kuhn, Max. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2019. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Keleshev, Vladimir. 2014. Docopt: Command-Line Interface Description Language. https://github.com/docopt/docopt.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.


dsci_522_spotify_track_popularity_predictor's Issues

Milestone 2 Review

  1. Writing an analysis that uses multiple scripts: Accuracy [A]
  • Completed all requirements.
    1A. Code Quality [A-]
  • Lacking more detailed comments in preprocess_n_model.py.
    1B. Mechanics [A+]
  • Did have 5 scripts.
    1C. Reasoning [B]
  • Your project has a lot of results. Try breaking the data down and explaining the tables. There is a lack of interpretation in your report; quantitatively explain your results using the entries in the tables.
  • Have a distinct summary or conclusion to lay out your findings.
    1D. Viz [A+]
  • Captions or explanations are present.
    1E. Writing [A+]
  • Included documentation at the top of each script.
  2. Expectations: Mechanics [A+]
  • Git messages are meaningful. Commits are made.
    3A. Project organization and documentation expectations: Mechanics [B+]
    3B. Writing [A+]
  • Full sentences are used; less than 10% was in point form.
    3C. Reasoning [B]
  • Narrative of analysis and visualization was not present.
  • Lack of in-depth analysis of the EDA in the proposal README. Should interpret the plots or give captions.
  • Did not see the evolution of the proposal with short excerpts of major findings using your statistical method.
    4A. Submission expectations: Mechanics [A+]
  • URL was attached and release was created.

Dockerfile merge conflicts

The Dockerfile in my fork can be found as qq1207/spotify_track_popularity_predictor on Docker Hub. I tried

docker run --rm -v /$(pwd):/home/spotify qq1207/spotify_track_popularity_predictor make -C /../home/spotify clean
docker run --rm -v /$(pwd):/home/spotify qq1207/spotify_track_popularity_predictor make -C /../home/spotify all

and both commands worked. For Mac users, the commands might instead be

docker run --rm -v /$(pwd):/home/spotify qq1207/spotify_track_popularity_predictor make -C /home/spotify clean
docker run --rm -v /$(pwd):/home/spotify qq1207/spotify_track_popularity_predictor make -C /home/spotify all

I am not sure about the directory.
If you can successfully run these lines on your laptop, we can merge the PR.

We also need to change the Makefile line Rscript -e "rmarkdown::render('spotify-track-predictor-report.Rmd')" to Rscript -e "rmarkdown::render('doc/spotify-track-predictor-report.Rmd')". I think I included this change in the PR.

Milestone 3 Review

1A. Write test cases and code iteratively: Accuracy [A+]

  • Does live in the project’s root directory
  • Has 'all' and 'clean'
1B. Write test cases and code iteratively: Code Quality [A+]
  • Is well documented
2A. We can install and use your R package: Mechanics [A+]
3A. Exception handling: Code Quality [A+]
4A. GitHub Actions workflow for continuous integration for the Python project: Mechanics [A+]
5A. Python package documentation: Reasoning [A+]
6A. Specific expectations for this milestone: Mechanics [A+]
7A. Submission instructions: Mechanics [A+]

Feature Transformations Discussion

These are the column types that I have identified along with transformations for each column type.
Does everyone agree with these transformations?

# CountVectorizer
text_features = "song"

# Impute with mean, StandardScaler
numeric_features = ["spotify_track_duration_ms",
                    "danceability",
                    "energy",
                    "key",
                    "loudness",
                    "mode",
                    "speechiness",
                    "acousticness",
                    "instrumentalness",
                    "liveness",
                    "valence",
                    "tempo",
                    "time_signature"]

# OneHotEncoder
binary_features = ["spotify_track_explicit"]

# Impute with 'missing', OneHotEncoder
categorical_features = ["performer", "spotify_genre"]

# The following features are dropped because:
#   - "song_id" contains the same information as "song"
#   - "spotify_track_id" and "spotify_track_preview_url" do not contain useful information
#   - "spotify_track_album" is less important than the song name, and it would be difficult to
#     differentiate between words from the album name and the song name if we used CountVectorizer
#     on both columns
drop_features = ["song_id", "spotify_track_id", "spotify_track_preview_url", "spotify_track_album"]

target = "spotify_track_popularity"
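
For context, here is a minimal sketch of how these column lists could be wired into a scikit-learn preprocessor. This is only an assumed illustration, not the project's final code; the tuned version actually used for evaluation appears in the "Test Set Eval Code Chunk" further down.

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# Each transformer is paired with the corresponding column list defined above
preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), numeric_features),
    (OneHotEncoder(drop="if_binary", dtype=int), binary_features),
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                   OneHotEncoder(handle_unknown="ignore")), categorical_features),
    (CountVectorizer(stop_words="english"), text_features),
    ("drop", drop_features),
)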

Submission check list

  1. Add a link in README.md pointing to eda.pdf.
  2. Create a release named "0.0.1".
  3. The following URLs need to be included in the submission:
    (1) the URL of the public project repository
    (2) the URL of the release "0.0.1"

Milestone 1 Review

Good start, team! Here are some comments and your grades for the first milestone. Please address these concerns in your third milestone submission. If you have any questions, please follow up with me in lab.

Group 27

  1. Draft a Team work contract: Correctness [A+]

  2. Project set-up: Mechanics [A+]

  3. Project proposal: Reasoning [B]
    "Informative and relevant columns" and "tidy the data in a way that makes analysis possible" are vague statements; state exact methods. Explain the columns or features you deemed necessary and how they relate to the question you are trying to answer.

How would you assess your predictive model, and what are some criteria? Why are the results you are providing significant? What are some implications or hypotheses of the results?

  1. A script that downloads the data: Accuracy [A+]

  2. A script that downloads the data: Quality [A+]

  3. Exploratory data analysis in a literate code document: Quality [A]

  4. Exploratory data analysis in a literate code document: Viz [B]
    There are better ways to visualize the data than screenshots; use internal viz libraries.
    Consider explaining your EDA plot by plot or including captions.

  5. Exploratory data analysis in a literate code document: Reasoning [B]
    There is a lack of assessment of the plots: which EDA plots are effective, and which have addressed your concerns about whether the data will help you answer your questions?
    A plan for further analysis should be included. The only things I can see are that the data may be enough and that you might use ridge regression, but there is no explanation of what led to these decisions.

  6. Exploratory data analysis in a literate code document: Accuracy [A]

Expectations: Mechanics [A+]

Test Set Eval Code Chunk (Using Best Model)

# Assumes numeric_features, binary_features, categorical_features, text_features,
# most_frequent, random_search, X_train, X_test, y_train and y_test are already
# defined earlier in the notebook.
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Rebuild the preprocessor with the hyperparameters found by the randomized search
best_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), numeric_features),
    (OneHotEncoder(sparse=False, drop="if_binary", dtype=int), binary_features),
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                   OneHotEncoder(handle_unknown="ignore", sparse=False,
                                 categories=[most_frequent.index.values])),
     categorical_features),
    (CountVectorizer(stop_words="english",
                     max_features=random_search.best_params_["columntransformer__countvectorizer__max_features"],
                     binary=random_search.best_params_["columntransformer__countvectorizer__binary"]),
     text_features),
)

# Fit the best pipeline on the full training set and evaluate on the test set
best_pipe = make_pipeline(
    best_preprocessor,
    Ridge(alpha=random_search.best_params_["ridge__alpha"]),
)

best_pipe.fit(X_train, y_train)
best_pipe.score(X_test, y_test)

Code for plot of model results

import numpy as np
import pandas as pd
import altair as alt

# Placeholder data to demonstrate the plot; in the analysis these come from the test set
# and the fitted model's predictions.
y_test = np.random.randint(100, size=100)
y_predicted = np.random.randint(100, size=100)

df = pd.DataFrame({'y_test': y_test, 'y_predicted': y_predicted})
p = alt.Chart(df).mark_point(filled=True).encode(
    alt.X('y_test', title='True values of Spotify popularity'),
    alt.Y('y_predicted', title='Predicted values of Spotify popularity'),
)

# Overlay a regression line of predicted vs. true values
p2 = p + p.transform_regression('y_test', 'y_predicted').mark_line()
p2.save('test.png')

RandomizedSearchCV Code Chunk

Code chunk for RandomizedSearch

  • Need to add argument random_search_results to docopt and main function
# Assumes preprocessor, X_train and y_train are already defined earlier in the script.
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessor, Ridge())

param_grid = {
    "ridge__alpha": loguniform(1e-3, 1e3),
    "columntransformer__countvectorizer__binary": np.array([True, False]),
    "columntransformer__countvectorizer__max_features": np.arange(1000, 10000, 1000),
}

random_search = RandomizedSearchCV(
    pipe, param_distributions=param_grid, n_jobs=-1, n_iter=10, cv=5
)
random_search.fit(X_train, y_train)
random_search.best_params_

# Collect the cross-validation results, ranked by test score
random_search_results = pd.DataFrame(random_search.cv_results_)[
    [
        "mean_test_score",
        "param_ridge__alpha",
        "param_columntransformer__countvectorizer__max_features",
        "param_columntransformer__countvectorizer__binary",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index()

# random_search_results.to_csv(random_search_results, index=False)
# (the first argument should become the output file path once it is added via docopt)
