
Udacity Data Engineering capstone project

This is the capstone project for the Udacity Data Engineering nanodegree program.

Objective

The objective of this project is to apply the learnings from the Udacity Data Engineering course.

Overview

This project is built around a fictional annual movie awards show hosted by Sparkademy.
Sparkademy wants to identify the top movies to award based on IMDB users' votes.

As a data engineer, I was asked to build a data model from the datasets.
The data model is used to analyse and identify the top-rated IMDB movies.

Scope

The IMDB datasets and the TMDB dataset from Kaggle are downloaded into S3.
The scope of the project is to build an ELT data lake from these datasets and write the transformed data back to S3 as parquet files.

The transformed parquet files can then be used for analysis.

Out of scope

  • Airflow scheduling
  • Automated script to download datasets from IMDB & TMDB.
  • Analysis table for the top rated movies

Data Model

The TSV files from the IMDB input data, combined with additional data from TMDB in JSON format, are transformed into parquet files.

I have built a star schema of parquet tables. The dimension tables contain the movie titles, additional details, and the movie crew.

The fact tables contain the IMDB ratings and finance data such as revenue and budget.

Fact tables

  • movies_ratings - movie ratings from both IMDB & TMDB, partitioned by language with sub-folders per year
  • movies_finances - a movie's finance details such as revenue and budget

Dimension tables

  • movies_titles - movie titles, partitioned by region with sub-folders per year
  • movies_details - movie details such as titles, partitioned by region with sub-folders per year
  • movies_principal_crew - a movie's principal cast and crew
  • movies_crew_names - names of the members of the cast and crew

(Screenshot: star schema of the fact and dimension tables)

The following sections justify the choice of tools, technologies, and the data model.

Data model

I have de-normalised the fact tables to include basic information such as movie title, region, language and year.
The movie dataset runs into millions of records because it spans multiple countries (regions) and languages, and covers years going back to the 1800s.

I have partitioned the data by year and region because join queries can be expensive; de-normalising the fact tables lets basic details be fetched without a join.

Analysis can be run on movie ratings without any complex joins, so we can query movie ratings for a specific country, language or year.
I'd use a join query only if I need to fetch additional details about the movies.
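
As a rough sketch of how this partitioning works (the paths and the exact partition columns here are illustrative, not necessarily those used in etl.py), a partitioned write lets later queries prune down to a single region and year without a join:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# Illustrative only: write the de-normalised ratings partitioned by region and year,
# so each region/year combination lands in its own sub-folder.
ratings = spark.read.parquet("s3a://my-bucket/staging/movies_ratings")   # hypothetical path
(ratings.write
    .mode("overwrite")
    .partitionBy("region", "start_year")
    .parquet("s3a://my-bucket/output_data/movies_ratings"))              # hypothetical path

# A later read that filters on the partition columns only touches those sub-folders,
# so basic questions need neither a join nor a full scan.
us_2021 = (spark.read.parquet("s3a://my-bucket/output_data/movies_ratings")
           .where("region = 'US' AND start_year = 2021"))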

Tools

I have chosen Spark and parquet on S3 to build a data lake because the IMDB dataset is very large and the ratings keep changing all the time.
Had I chosen a data warehouse like Redshift, we would need to run a lot of update queries on a daily basis, which could quickly become expensive.

Challenges

The challenge with the TMDB dataset is that it contains a lot of small JSON files.
It turned out that Spark is more efficient with a few big files than with many small files.
I was not able to use AWS Glue with the Udacity account; I'd use an AWS Glue crawler for this use case in the future.
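
One way to mitigate the small-files problem, sketched below with illustrative paths rather than the exact logic in etl.py (and assuming an existing SparkSession named spark, as in the code further down), is to read the JSON files once and repartition them into a small number of larger parquet files:

# Illustrative sketch: consolidate many small TMDB JSON files into a few larger parquet files.
tmdb_movies = spark.read.json("s3a://my-bucket/tmdb/movies/*.json")   # hypothetical input path

(tmdb_movies
    .repartition(16)            # a handful of larger files instead of hundreds of thousands of tiny ones
    .write
    .mode("overwrite")
    .parquet("s3a://my-bucket/staging/tmdb_movies"))                  # hypothetical output path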

What if the data was increased by 100x?

The IMDB dataset is huge; some files contain as many as 50 million records. I have used only a subset of this dataset, limited to a single year. I couldn't use Spark to its full potential in the Udacity workspace because I was getting out-of-memory errors.

I'd use EMR to run Spark clusters and scale them accordingly. I'd also try a serverless solution like AWS Glue crawlers, especially to process the JSON data.

What if the pipelines had to be run on a daily basis by 7 am every day?

I'd use Airflow to schedule a run every morning at 7 am: potentially download the IMDB and TMDB datasets using AWS Lambda, then load and transform the source files into parquet files.
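
A minimal Airflow sketch of that schedule might look like the following; the task callables and the dataset-download step are placeholders, not part of this repository:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_datasets():
    # Placeholder: trigger an AWS Lambda (or a boto3 call) that downloads the IMDB/TMDB dumps to S3.
    pass


def run_etl():
    # Placeholder: run the load/transform step (e.g. the logic in etl.py) against the new files.
    pass


with DAG(
    dag_id="sparkademy_daily_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 7 * * *",   # every day at 7 am
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_datasets", python_callable=download_datasets)
    transform = PythonOperator(task_id="run_etl", python_callable=run_etl)
    download >> transform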

What if the database needed to be accessed by 100+ people?

The query results produced from the data model can be stored in a database.
I have already de-normalised my tables, so I'd use Cassandra for both read-heavy and write-heavy scenarios.
If the reads require a lot of join queries, then I'd use Redshift instead.

I would still keep either region (country) or year as the partition key, depending on the use case, since the IMDB dataset always spans multiple countries.
I may also drop the sub-folders and flatten the partitioning, since I am going to copy the transformed data into a database.
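
If the results were copied into Cassandra from Spark, a write through the DataStax spark-cassandra-connector could look roughly like this; the keyspace and table names are made up for illustration, and the connector package would need to be on the Spark classpath:

# Illustrative only: copy the de-normalised ratings into Cassandra via the spark-cassandra-connector.
movies_ratings = spark.read.parquet(output_dir + '/movies_ratings')

(movies_ratings.write
    .format("org.apache.spark.sql.cassandra")
    .mode("append")
    .options(keyspace="sparkademy", table="movies_ratings_by_region_year")   # hypothetical names
    .save())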

Process result

The purpose of this data model is to retrieve the top movies based on IMDB ratings for the Sparkademy awards. I created a temporary view from the movies_ratings parquet table and retrieved the top 10 movies for 2021.

I've restricted the scope of this project to the ETL: build the data model and extract results from the parquet tables for the Sparkademy awards. The analysis table is out of scope for this project.

The ETL was run on only one year (configured in settings), which is why the query has no WHERE condition on the year.
If the data model were built across multiple years and countries (regions), I'd query the data differently and use Cassandra or Redshift in my pipeline.

movies_ratings = spark.read.parquet(output_dir + '/movies_ratings')
movies_ratings.createOrReplaceTempView("movies_ratings")

top_american_english_movies = spark.sql('''
        SELECT 
          imdb_title_id, 
          collect_list(title) as title, 
          imdb_total_votes, 
          imdb_avg_rating, 
          region, 
          language, 
          start_year 
        FROM movies_ratings 
        WHERE language IS NULL OR language = 'en' 
        GROUP BY 
          imdb_title_id, 
          imdb_total_votes, 
          imdb_avg_rating, 
          region, 
          language, 
          start_year 
        ORDER BY imdb_total_votes DESC
''')
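
The table below was produced by displaying the first rows of that DataFrame; something along these lines (the exact call is not shown in this repository) would print it:

# Show the 10 most-voted movies without truncating the aggregated title lists.
top_american_english_movies.show(10, truncate=False)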

Query results (Top 10 movies of 2021 based on IMDB Ratings)

+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
|imdb_title_id|title                                                 |imdb_total_votes|imdb_avg_rating|region|language|start_year|
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+
|tt10872600   |[Spider-Man: No Way Home, Serenity Now]               |521999          |8.6            |US    |null    |2021      |
|tt1160419    |[Dune]                                                |512910          |8.1            |US    |null    |2021      |
|tt11286314   |[Don't Look Up]                                       |462155          |7.2            |US    |null    |2021      |
|tt12361974   |[The Snyder Cut, Zack Snyder's Justice League]        |364815          |8.1            |US    |null    |2021      |
|tt3480822    |[Blue Bayou, Black Widow]                             |341616          |6.7            |US    |null    |2021      |
|tt9376612    |[Steamboat, Shang-Chi and the Legend of the Ten Rings]|334902          |7.5            |US    |null    |2021      |
|tt2382320    |[No Time to Die]                                      |327480          |7.3            |US    |null    |2021      |
|tt6334354    |[The Suicide Squad]                                   |313874          |7.2            |US    |null    |2021      |
|tt6264654    |[Free Guy]                                            |308222          |7.2            |US    |null    |2021      |
|tt9032400    |[Eternals]                                            |274107          |6.4            |US    |null    |2021      |
+-------------+------------------------------------------------------+----------------+---------------+------+--------+----------+

Data Dictionary

IMDB dataset

The full description of the IMDB dataset can be found at https://www.imdb.com/interfaces/

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set.
The available datasets are as follows:

  • title.basics.tsv.gz
    Contains all types of titles available in IMDB (e.g. movie, TV series, video game, etc.)
    ~8.8M records
  • title.akas.tsv.gz
    Contains additional information about titles (e.g. language, isOriginalTitle, etc.)
  • title.principals.tsv.gz
    Contains the principal cast/crew for titles
    ~49.7M records
  • title.ratings.tsv.gz
    Contains the IMDB rating and votes information for titles
    ~1.22M records
  • name.basics.tsv.gz
    Contains information about names (the people behind the titles)
    ~11.4M records

TMDB data set

The TMDB dataset is retrieved from Kaggle: https://www.kaggle.com/datasets/edgartanaka1/tmdb-movies-and-series. It contains over 526,000 movie JSON files.

Data quality checks

  • Filtering out adult movies
  • Check if data row exists
  • Filter by language and region
  • Usage of the aggregate function collect_list for titles during retrieval of the top 10 movies,
    since a movie's working title and popular title share the same IMDB id (see the sketch after this list).
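
A rough sketch of the first two checks, assuming the isAdult flag from IMDB's title.basics file (the exact column handling in etl.py may differ):

def basic_quality_checks(titles_basics, movies_ratings):
    """Illustrative quality checks; column names follow IMDB's title.basics schema."""
    # Filter out adult titles using the isAdult flag.
    movies = titles_basics.where("isAdult = 0")

    # Check that the transformed ratings table actually contains rows.
    if movies_ratings.count() == 0:
        raise ValueError("Data quality check failed: movies_ratings has no rows")

    return movies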

Instruction to run the scripts

The ETL script settings can be configured using the settings.cfg file.
The default year is set to 2022 to filter minimal data so that the script finishes sooner.
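
As an illustration of how those settings can be read (apart from OUTPUT_DIR, the section and key names here are assumptions about settings.cfg rather than its exact contents):

import configparser

config = configparser.ConfigParser()
config.read("settings.cfg")

# OUTPUT_DIR is referenced in the instructions below; the section name and YEAR key are assumed.
output_dir = config["DEFAULT"].get("OUTPUT_DIR", "s3a://udacity-dend-imdb-project/output_data")
year = int(config["DEFAULT"].get("YEAR", "2022"))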

Data analysis - Run queries on existing data model

Run the following command in the Udacity workspace to see the results after the ETL has run.

The results table shown in this readme is for 2021; here are the instructions to reproduce those results.
Change the OUTPUT_DIR setting in the config to s3a://udacity-dend-imdb-project/output_data and set the year to 2021.

python3 ./results.py

ETL - Beware! This will overwrite the existing data

Run the following command in the Udacity workspace to perform the ELT of the datasets.
The ELT scripts can run for a long time and will overwrite existing data.

python3 ./etl.py

Copyright

This is the capstone project submission by Karthik Sekar, New Zealand, as part of the Udacity Data Engineering nanodegree.

Per the Udacity Honor Code, you may use this repository to get an idea of the capstone project, but submitting it as-is is plagiarism.

Copyright © 2022, Karthik Sekar, New Zealand.
