This is the capstone project for the Udacity Data Engineering nanodegree program. Its objective is to apply the learning from the Udacity Data Engineering course.
Sparkademy Awards is a fictional annual movie awards show. Sparkademy wants to identify the top movies to award based on IMDB users' votes.
As a data engineer, I was asked to build a data model from the datasets. The data model is used to analyse and identify the top-rated IMDB movies.
The datasets are downloaded into S3: the IMDB dataset, plus an additional TMDB dataset from Kaggle. The scope of the project is to build a data lake from these datasets, transforming them and writing them out as parquet files to S3.
The transformed parquet files can then be used for analysis. The following items are out of scope for this project:
- Airflow scheduling
- An automated script to download the datasets from IMDB & TMDB
- An analysis table for the top-rated movies
The CSV files from the IMDB input data, combined with the additional TMDB data in JSON form, are transformed into parquet files.
I have built a star schema in parquet. The dimension tables contain the movie titles, additional details and the movie crew.
The fact tables contain the IMDB ratings and finance data such as revenue and budget.
- movies_ratings - movie ratings from both IMDB & TMDB, split by language and subfoldered by year
- movies_finances - a movie's finance details such as revenue and budget
- movies_titles - movie titles, split by region and subfoldered by year
- movies_details - movie details such as titles, split by region and subfoldered by year
- movies_principal_crew - a movie's principal cast and crew
- movies_crew_names - names of the members of the cast and crew
I have de-normalised the fact tables to include basic information such as movie title, region, language and year.
The movie dataset runs into millions of records because it spans multiple countries (regions) and languages across many years, going back to the 1800s.
I have partitioned the data by year and region, and de-normalised the basic details into the fact tables, because join queries can be very expensive.
Analysis can therefore be run on movie ratings without any complex joins: we can query movie ratings for a specific country, language or year.
I use a join query only when I need to fetch additional details about a movie.
I have chosen Spark with parquet on S3 to build a data lake because the IMDB dataset is very large and the ratings change all the time.
Had I chosen a data warehouse like Redshift, we would need to run a lot of update queries on a daily basis, which could end up being expensive.
The challenge with the TMDB dataset is that it contains a lot of small JSON files.
It turned out that Spark is far more efficient with a few big files than with many small ones.
I was not able to use AWS Glue with the Udacity account; in the future I would use an AWS Glue crawler for this use case.
The IMDB dataset is huge: some CSV files contain as many as 50 million records. I have used only a subset of this dataset, for a single year. I could not use Spark to its full potential in the Udacity workspace because I was getting out-of-memory errors.
In the future I would run Spark clusters on EMR and scale accordingly. I would also try a serverless solution such as AWS Glue crawlers, especially to process the JSON data.
I would use Airflow to schedule a daily 7am run that downloads the IMDB and TMDB datasets (potentially using AWS Lambda) and then loads and transforms the CSVs into parquet files.
The results of queries run on the data model can be stored in a database.
Since I have already de-normalised my tables, I would use Cassandra for scenarios that are heavy on both reads and writes.
If the reads require a lot of join queries, I would use Redshift instead.
I would still keep either region (country) or year as the partition key, depending on the use case, since the IMDB dataset always spans multiple countries.
I may also drop the subfolders and flatten the partitions, since the transformed data would be copied into a database.
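As a sketch of the Cassandra option, a table for ratings lookups might use (region, year) as a compound partition key. The table and column names below are illustrative assumptions, not part of the project:

```sql
-- Partition key (region, start_year) keeps each country/year pair on
-- one partition; clustering by votes returns the top movies first.
CREATE TABLE movies_ratings_by_region_year (
    region            text,
    start_year        int,
    imdb_total_votes  int,
    imdb_title_id     text,
    title             text,
    language          text,
    imdb_avg_rating   float,
    PRIMARY KEY ((region, start_year), imdb_total_votes, imdb_title_id)
) WITH CLUSTERING ORDER BY (imdb_total_votes DESC);
```

With this layout, "top movies for one country and year" is a single-partition read with no joins.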
The purpose of this data model is to retrieve the top movies based on IMDB ratings for the Sparkademy awards. I create a temporary view from the movies_ratings parquet table and retrieve the top 10 movies for 2021.
I've restricted the scope of this project to the ETL, building the data model and extracting results from the model's parquet tables for the Sparkademy awards; I've left the analysis table out of scope.
The ETL is run on only one year (from the settings), which is why the query has no WHERE condition on the year.
If the data model were built across multiple years and countries (regions), I would query the data differently and use Cassandra or Redshift in the pipeline.
```python
movies_ratings = spark.read.parquet(output_dir + '/movies_ratings')
movies_ratings.createOrReplaceTempView("movies_ratings")
top_american_english_movies = spark.sql('''
    SELECT
        imdb_title_id,
        collect_list(title) AS title,
        imdb_total_votes,
        imdb_avg_rating,
        region,
        language,
        start_year
    FROM movies_ratings
    WHERE language IS NULL OR language = 'en'
    GROUP BY
        imdb_title_id,
        imdb_total_votes,
        imdb_avg_rating,
        region,
        language,
        start_year
    ORDER BY imdb_total_votes DESC
    LIMIT 10
''')
```
```
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
|imdb_title_id|title                                                 |imdb_total_votes|imdb_avg_rating|region|language|start_year|
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
|tt10872600   |[Spider-Man: No Way Home, Serenity Now]               |521999          |8.6            |US    |null    |2021      |
|tt1160419    |[Dune]                                                |512910          |8.1            |US    |null    |2021      |
|tt11286314   |[Don't Look Up]                                       |462155          |7.2            |US    |null    |2021      |
|tt12361974   |[The Snyder Cut, Zack Snyder's Justice League]        |364815          |8.1            |US    |null    |2021      |
|tt3480822    |[Blue Bayou, Black Widow]                             |341616          |6.7            |US    |null    |2021      |
|tt9376612    |[Steamboat, Shang-Chi and the Legend of the Ten Rings]|334902          |7.5            |US    |null    |2021      |
|tt2382320    |[No Time to Die]                                      |327480          |7.3            |US    |null    |2021      |
|tt6334354    |[The Suicide Squad]                                   |313874          |7.2            |US    |null    |2021      |
|tt6264654    |[Free Guy]                                            |308222          |7.2            |US    |null    |2021      |
|tt9032400    |[Eternals]                                            |274107          |6.4            |US    |null    |2021      |
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
```
The full description of the IMDB dataset can be found at https://www.imdb.com/interfaces/.
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set.
The available datasets are as follows:
- title.basics.tsv.gz: all types of titles available in IMDB (e.g. movie, tv series, video game, etc.), ~8.8M records
- title.akas.tsv.gz: additional information about titles (e.g. language, isOriginalTitle, etc.)
- title.principals.tsv.gz: the principal cast/crew for titles, ~49.7M records
- title.ratings.tsv.gz: the IMDb rating and votes information for titles, ~1.22M records
- name.basics.tsv.gz: information for names, ~11.4M records
The TMDB dataset is retrieved from Kaggle: https://www.kaggle.com/datasets/edgartanaka1/tmdb-movies-and-series. There are over 526,000 movie JSON files.
- Filtering out adult movies
- Checking that a data row exists
- Filtering by language and region
- Using the aggregate function collect_list for titles when retrieving the top 10 movies
Movies' working title and popular title share the same imdb id.
The ETL script's settings can be configured in the settings.cfg file.
The default year is set to 2022 to filter the data down to a minimum so that the script finishes sooner.
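A minimal sketch of what settings.cfg might contain is below; the section name and the YEAR key are assumptions based on the settings mentioned in this README, so check the actual file for the exact keys:

```ini
[SETTINGS]
OUTPUT_DIR = output_data
YEAR = 2022
```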
Run the following command in the Udacity workspace to see the results after the ETL has run.
The results table shown in this README is for 2021; to verify those results, change the OUTPUT_DIR setting in the config to s3a://udacity-dend-imdb-project/output_data and set the year to 2021.

```
python3 ./results.py
```
Run the following command in the Udacity workspace to perform the ETL of the datasets.
The ETL script can run for a long time and will overwrite existing data.

```
python3 ./etl.py
```
This is the capstone project submission by Karthik Sekar, New Zealand, as part of the Udacity Data Engineering nanodegree.
Per the Udacity Honor Code, you may use this repository to get an idea of the capstone project, but using it as-is is plagiarism.
Copyright © 2022, Karthik Sekar, New Zealand.