This is the capstone project for the Udacity Data Engineering nanodegree program. Its objective is to apply the learning from the Udacity Data Engineering course.
Sparkademy Awards is a fictional annual movie awards show. Sparkademy wants to identify the top movies to award based on IMDB users' votes.
As a data engineer, I was asked to build a data model from the datasets. The data model is used to analyse and identify the top-rated IMDB movies.
The datasets are downloaded into S3: the IMDB dataset, plus an additional TMDB dataset from Kaggle. The scope of the project is to build a data lake from these datasets, transforming them and writing them out as parquet files to S3.
The transformed parquet files can then be used for analysis. The following items are out of scope for this project:
- Airflow scheduling
- An automated script to download the datasets from IMDB & TMDB
- An analysis table for the top-rated movies
The CSV files from the IMDB input data, combined with the additional TMDB data in JSON form, are transformed into parquet files.
I have built a star schema in parquet. The dimension tables contain the movie titles, additional details and the movie crew.
The fact tables contain the IMDB ratings and finance data such as revenue and budget.
- movies_ratings - movie ratings from both IMDB & TMDB, split by language and subfoldered by year
- movies_finances - a movie's finance details such as revenue and budget
- movies_titles - movie titles, split by region and subfoldered by year
- movies_details - movie details such as titles, split by region and subfoldered by year
- movies_principal_crew - a movie's principal cast and crew
- movies_crew_names - names of the members of the cast and crew
I have de-normalised the fact tables to include basic information such as movie title, region, language and year.
The movie dataset runs into millions of records because it spans multiple countries (regions) and languages across many years, going back to the 1800s.
I have partitioned the data by year and region, and de-normalised the basic details into the fact tables, because join queries can be very expensive.
Analysis can therefore be run on movie ratings without any complex joins: we can query movie ratings for a specific country, language or year.
I use a join query only when I need to fetch additional details about a movie.
I have chosen Spark with parquet on S3 to build a data lake because the IMDB dataset is very large and the ratings change all the time.
Had I chosen a data warehouse like Redshift, we would need to run a lot of update queries on a daily basis, which could end up being expensive.
The challenge with the TMDB dataset is that it contains a lot of small JSON files.
It turned out that Spark is far more efficient with a few big files than with many small ones.
I was not able to use AWS Glue with the Udacity account; in the future I would use an AWS Glue crawler for this use case.
The IMDB dataset is huge: some CSV files contain as many as 50 million records. I have used only a subset of this dataset, for a single year. I could not use Spark to its full potential in the Udacity workspace because I was getting out-of-memory errors.
In the future I would run Spark clusters on EMR and scale accordingly. I would also try a serverless solution such as AWS Glue crawlers, especially to process the JSON data.
I would use Airflow to schedule a daily 7am run that downloads the IMDB and TMDB datasets (potentially using AWS Lambda) and then loads and transforms the CSVs into parquet files.
The results of queries run on the data model can be stored in a database.
Since I have already de-normalised my tables, I would use Cassandra for scenarios that are heavy on both reads and writes.
If the reads require a lot of join queries, I would use Redshift instead.
I would still keep either region (country) or year as the partition key, depending on the use case, since the IMDB dataset always spans multiple countries.
I may also drop the subfolders and flatten the partitions, since the transformed data would be copied into a database.
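As a sketch of the Cassandra option, a table for ratings lookups might use (region, year) as a compound partition key. The table and column names below are illustrative assumptions, not part of the project:

```sql
-- Partition key (region, start_year) keeps each country/year pair on
-- one partition; clustering by votes returns the top movies first.
CREATE TABLE movies_ratings_by_region_year (
    region            text,
    start_year        int,
    imdb_total_votes  int,
    imdb_title_id     text,
    title             text,
    language          text,
    imdb_avg_rating   float,
    PRIMARY KEY ((region, start_year), imdb_total_votes, imdb_title_id)
) WITH CLUSTERING ORDER BY (imdb_total_votes DESC);
```

With this layout, "top movies for one country and year" is a single-partition read with no joins.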
The purpose of this data model is to retrieve the top movies based on IMDB ratings for the Sparkademy awards. I create a temporary view from the movies_ratings parquet table and retrieve the top 10 movies for 2021.
I've restricted the scope of this project to the ETL, building the data model and extracting results from the model's parquet tables for the Sparkademy awards; I've left the analysis table out of scope.
The ETL is run on only one year (from the settings), which is why the query has no WHERE condition on the year.
If the data model were built across multiple years and countries (regions), I would query the data differently and use Cassandra or Redshift in the pipeline.
```python
movies_ratings = spark.read.parquet(output_dir + '/movies_ratings')
movies_ratings.createOrReplaceTempView("movies_ratings")
top_american_english_movies = spark.sql('''
    SELECT
        imdb_title_id,
        collect_list(title) AS title,
        imdb_total_votes,
        imdb_avg_rating,
        region,
        language,
        start_year
    FROM movies_ratings
    WHERE language IS NULL OR language = 'en'
    GROUP BY
        imdb_title_id,
        imdb_total_votes,
        imdb_avg_rating,
        region,
        language,
        start_year
    ORDER BY imdb_total_votes DESC
    LIMIT 10
''')
```
```
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
|imdb_title_id|title                                                 |imdb_total_votes|imdb_avg_rating|region|language|start_year|
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
|tt10872600   |[Spider-Man: No Way Home, Serenity Now]               |521999          |8.6            |US    |null    |2021      |
|tt1160419    |[Dune]                                                |512910          |8.1            |US    |null    |2021      |
|tt11286314   |[Don't Look Up]                                       |462155          |7.2            |US    |null    |2021      |
|tt12361974   |[The Snyder Cut, Zack Snyder's Justice League]        |364815          |8.1            |US    |null    |2021      |
|tt3480822    |[Blue Bayou, Black Widow]                             |341616          |6.7            |US    |null    |2021      |
|tt9376612    |[Steamboat, Shang-Chi and the Legend of the Ten Rings]|334902          |7.5            |US    |null    |2021      |
|tt2382320    |[No Time to Die]                                      |327480          |7.3            |US    |null    |2021      |
|tt6334354    |[The Suicide Squad]                                   |313874          |7.2            |US    |null    |2021      |
|tt6264654    |[Free Guy]                                            |308222          |7.2            |US    |null    |2021      |
|tt9032400    |[Eternals]                                            |274107          |6.4            |US    |null    |2021      |
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
```
The full description of the IMDB dataset can be found at https://www.imdb.com/interfaces/.
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set.
The available datasets are as follows:
- title.basics.tsv.gz: all types of titles available in IMDB (e.g. movie, tv series, video game, etc.), ~8.8M records
- title.akas.tsv.gz: additional information about titles (e.g. language, isOriginalTitle, etc.)
- title.principals.tsv.gz: the principal cast/crew for titles, ~49.7M records
- title.ratings.tsv.gz: the IMDb rating and votes information for titles, ~1.22M records
- name.basics.tsv.gz: information for names, ~11.4M records
The TMDB dataset is retrieved from Kaggle: https://www.kaggle.com/datasets/edgartanaka1/tmdb-movies-and-series. There are over 526,000 movie JSON files.
- Filtering out adult movies
- Checking that a data row exists
- Filtering by language and region
- Using the aggregate function collect_list for titles when retrieving the top 10 movies
Movies' working title and popular title share the same imdb id.
The ETL script's settings can be configured in the settings.cfg file.
The default year is set to 2022 to filter the data down to a minimum so that the script finishes sooner.
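A minimal sketch of what settings.cfg might contain is below; the section name and the YEAR key are assumptions based on the settings mentioned in this README, so check the actual file for the exact keys:

```ini
[SETTINGS]
OUTPUT_DIR = output_data
YEAR = 2022
```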
Run the following command in the Udacity workspace to see the results after the ETL has run.
The results table shown in this README is for 2021; to verify those results, change the OUTPUT_DIR setting in the config to s3a://udacity-dend-imdb-project/output_data and set the year to 2021.

```
python3 ./results.py
```
Run the following command in the Udacity workspace to perform the ETL of the datasets.
The ETL script can run for a long time and will overwrite existing data.

```
python3 ./etl.py
```
This is the capstone project submission by Karthik Sekar, New Zealand, as part of the Udacity Data Engineering nanodegree.
Per the Udacity Honor Code, you may use this repository to get an idea of the capstone project, but using it as-is is plagiarism.
Copyright © 2022, Karthik Sekar, New Zealand.