andy-pham-72 / top-rentals-cineplex


Applying data engineering techniques to create a data pipeline with Azure cloud computing

Python 97.54% Shell 2.46%
api-client azure-databricks azure-storage data-engineering selenium-webdriver slowly-changing-dimensions

top-rentals-cineplex's Introduction

TOP RENTALS CINEPLEX

best-movies-1614634680

Motivation

Emotions help us feel human again and connect us to each other. By watching the lives of different characters, we feel everything they feel through the 24 frames per second of a movie. Stirring up emotions in this way is one of the great things about movies. I am a movie lover, and I sometimes call myself a cinephile: someone who is passionate about movies and wants to know a lot about them. Moreover, a cinephile should be an educated film consumer with the toolkit to distinguish average films from outstanding ones based on particular metrics.

With that motivation, I created my data engineering capstone, which lets us quickly see the top 36 movie rentals from the Cineplex website along with the IMDB rating and the Tomatometer and audience score from Rotten Tomatoes. In addition, for the “nerdy” movie lover, we can look up the cast in detail using the curated data sets from IMDB.

Idea of the project

image

When we open the Top Rentals page on the Cineplex website, we might feel confused and indecisive about choosing a movie we will like. This is where the Top Rentals Cineplex project comes in handy: it quickly shows the essential information about the movie you want to rent:

  • IMDB rating
  • Tomatometer
  • Audience score
  • Synopsis
  • Top critics (from Rotten Tomatoes)

For more demo visualizations, see: toprentalcineplex.my.canva.site

PROJECT ETL DIAGRAM

image

Events

The project data is collected (scraped) from multiple sources using Selenium WebDriver and APIs:

  1. Top Rentals Cineplex: the page on the Cineplex website from which the titles of the Top 36 Movie Rentals are scraped.

  2. IMDB.com: official subsets of IMDB data that are available for personal and non-commercial use. The IMDB data sets contain an “imdb_id”, which is used as the primary/foreign key to link tables. The following files are downloaded:

  • title.basics.tsv.gz contains the essential information for the movie titles.
  • title.crew.tsv.gz contains the director and writer information for all the titles.
  • title.principals.tsv.gz contains the principal cast/crew for titles.
  • title.ratings.tsv.gz contains the IMDB rating and votes information for titles.
  • name.basics.tsv.gz contains information about names (actors, actresses, directors, writers, etc.).
  3. Themoviedb.org: a community-built movie and TV database with an API that is available for everyone to use. I use their API to collect the movies’ synopses along with the “imdb_id” for the Top Movie Rentals from Cineplex.

  4. Rottentomatoes.com: used to obtain the following data for the corresponding Top 36 Movie Rentals’ titles:

  • The top critics’ reviews from credible reviewers, or audience reviews (for titles that don’t have many credible reviewers).
  • Tomatometer and audience score.
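
To illustrate how the IMDB data sets listed above link together, here is a minimal sketch in plain Python (not the project's PySpark notebooks). It relies on two facts about the IMDB files: the identifier column is named `tconst` (the project refers to it as “imdb_id”), and missing values are written as `\N`. The sample rows, the helper names `read_imdb_tsv` and `join_on_imdb_id`, and the output field names are assumptions made for illustration.

```python
import csv
import gzip
import io

# Tiny made-up samples in the layout of title.basics and title.ratings
# (the real files are much larger and have more columns).
BASICS_TSV = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tmovie\tExample Movie A\t2021\n"
    "tt0000002\tmovie\tExample Movie B\t\\N\n"
)
RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.5\t1200\n"
)


def read_imdb_tsv(gz_bytes):
    """Decompress a gzipped TSV and return its rows as dicts, mapping \\N to None."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    # QUOTE_NONE: quote characters inside titles are ordinary text, not delimiters.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {key: (None if value == "\\N" else value) for key, value in row.items()}
        for row in reader
    ]


def join_on_imdb_id(basics, ratings):
    """Left-join ratings onto titles, using tconst as the imdb_id key."""
    ratings_by_id = {row["tconst"]: row for row in ratings}
    joined = []
    for title in basics:
        rating = ratings_by_id.get(title["tconst"], {})
        joined.append({
            "imdb_id": title["tconst"],
            "title": title["primaryTitle"],
            "start_year": title["startYear"],
            "imdb_rating": rating.get("averageRating"),
        })
    return joined


basics = read_imdb_tsv(gzip.compress(BASICS_TSV.encode("utf-8")))
ratings = read_imdb_tsv(gzip.compress(RATINGS_TSV.encode("utf-8")))
for row in join_on_imdb_id(basics, ratings):
    print(row)
```

All values are kept as strings here; a real pipeline would cast the rating and year columns to numeric types before saving them.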

Cloud computing solution

I use Azure Databricks, a fast, easy and collaborative Apache Spark-based big data analytics service designed for data science and data engineering.

I created 4 notebooks in Databricks and incorporated some customized libraries:

  • imdb_datasets_downloader:
    • Automates downloading the official data sets from the IMDB website (using Selenium).
    • Extracts the gz files and converts the tsv files to parquet files (using PySpark).
    • Saves the files to Azure Blob Storage (ABS), which serves as the data warehouse.
  • top_rentals_cineplex_scrapper:
    • Scrapes the titles of the Top 36 movie rentals from the Cineplex website and saves them as a parquet file to ABS (using PySpark).
    • Applies Slowly Changing Dimension Type 2 to a table that stores and manages current and historical data over time for the ranking of the top titles (e.g. Top 1, 2, 3, ..., with a flag for whether each row is current and the corresponding date and time).
  • theMovieDb_and_RottenTomatoes_data_scraper:
    • Scrapes the synopsis from themoviedb.org (using the API) and the top critics from rottentomatoes.com (using Selenium).
    • Saves the table as a parquet file to ABS (using PySpark).
  • top_rental_rating_from_imdb_and_rotten_tomatoes_data:
    • Extracts imdb_rating from the IMDB data set and merges it with the Tomatometer and Audience Score from rottentomatoes.com into one table keyed by the imdb_id of the Top 36 movie rentals (using PySpark).
    • Saves the table as a parquet file to ABS.
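
The Slowly Changing Dimension Type 2 step mentioned above can be sketched in plain Python as follows. This is an illustration of the general technique, not the notebook's actual PySpark code. The columns `title`, `ordering` and `is_current` appear in the project's query; the `start_date` and `end_date` column names, the `apply_scd2` helper and the sample titles are assumptions.

```python
from datetime import datetime


def apply_scd2(history, snapshot, now):
    """Return a new history list after applying one scrape snapshot.

    history  : list of dicts with title, ordering, is_current, start_date, end_date
    snapshot : dict mapping title -> ordering from the latest scrape
    now      : timestamp of the latest scrape
    """
    updated = []
    unchanged = set()
    for row in history:
        row = dict(row)  # copy so the caller's data is not mutated
        if row["is_current"] == 1:
            if snapshot.get(row["title"]) == row["ordering"]:
                unchanged.add(row["title"])
            else:
                # Rank changed or title dropped out: close the current version.
                row["is_current"] = 0
                row["end_date"] = now
        updated.append(row)

    # Open a new current version for every new or re-ranked title.
    for title, ordering in snapshot.items():
        if title not in unchanged:
            updated.append({
                "title": title,
                "ordering": ordering,
                "is_current": 1,
                "start_date": now,
                "end_date": None,
            })
    return updated


day1 = datetime(2022, 6, 11, 9, 0)
day2 = datetime(2022, 6, 12, 9, 0)

history = apply_scd2([], {"Movie A": 1, "Movie B": 2}, day1)
history = apply_scd2(history, {"Movie B": 1, "Movie A": 2, "Movie C": 3}, day2)

for row in sorted(history, key=lambda r: (r["start_date"], r["ordering"])):
    print(row)
```

Old rows are never overwritten, only closed, so the table keeps the full ranking history while `is_current == 1` always selects the latest Top Rentals list.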

Top rentals cineplex’s ER diagram

image

Data Management

Azure Databricks Workflows is used to schedule the jobs that automate the data scraping tasks and save the data into the data warehouse (Azure Blob Storage).

Screen Shot 2022-06-12 at 7 13 30 PM

DEMO DATA OUTPUT

Using PySpark to run the query for the fact table:

# Join the current Top Rentals rows with their ratings and synopsis on imdb_id.
(
    top_rental_cineplex
    .join(top_rental_rating, top_rental_cineplex.imdb_id == top_rental_rating.imdb_id, how="inner")
    .join(synopsis_table, synopsis_table.imdb_id == top_rental_cineplex.imdb_id, how="inner")
    .filter(top_rental_cineplex.is_current == 1)  # keep only the current rows
    .select(
        top_rental_cineplex.title,
        top_rental_cineplex.ordering,
        synopsis_table.synopsis,
        top_rental_rating.imdb_rating,
        top_rental_rating.tomato_meter,
        top_rental_rating.audience_score,
    )
    .orderBy(top_rental_cineplex.ordering)
    .show()
)

Screen Shot 2022-06-12 at 7 15 10 PM
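
To make the semantics of the fact-table query easier to check on small data, here is a plain-Python sketch of the same logic: an inner join on imdb_id, a filter on is_current, and a sort by ordering. It is not a replacement for the PySpark query. The `build_fact_rows` helper and all sample rows are made up for illustration, and the sketch assumes imdb_id is unique within the rating and synopsis tables.

```python
def build_fact_rows(top_rental_cineplex, top_rental_rating, synopsis_table):
    """Inner-join three row lists on imdb_id, keep current rows, sort by ordering."""
    # Assumes imdb_id is unique within the rating and synopsis tables.
    ratings = {row["imdb_id"]: row for row in top_rental_rating}
    synopses = {row["imdb_id"]: row for row in synopsis_table}

    fact_rows = []
    for rental in top_rental_cineplex:
        imdb_id = rental["imdb_id"]
        if rental["is_current"] != 1:
            continue  # historical row, not part of the current list
        if imdb_id not in ratings or imdb_id not in synopses:
            continue  # inner join drops rows without a match
        fact_rows.append({
            "title": rental["title"],
            "ordering": rental["ordering"],
            "synopsis": synopses[imdb_id]["synopsis"],
            "imdb_rating": ratings[imdb_id]["imdb_rating"],
            "tomato_meter": ratings[imdb_id]["tomato_meter"],
            "audience_score": ratings[imdb_id]["audience_score"],
        })
    return sorted(fact_rows, key=lambda row: row["ordering"])


# Made-up sample rows, for illustration only.
rentals = [
    {"imdb_id": "tt0000002", "title": "Movie B", "ordering": 2, "is_current": 1},
    {"imdb_id": "tt0000001", "title": "Movie A", "ordering": 1, "is_current": 1},
    {"imdb_id": "tt0000001", "title": "Movie A", "ordering": 2, "is_current": 0},
    {"imdb_id": "tt0000003", "title": "Movie C", "ordering": 3, "is_current": 1},
]
ratings = [
    {"imdb_id": "tt0000001", "imdb_rating": 7.5, "tomato_meter": 90, "audience_score": 85},
    {"imdb_id": "tt0000002", "imdb_rating": 6.1, "tomato_meter": 55, "audience_score": 70},
]
synopses = [
    {"imdb_id": "tt0000001", "synopsis": "Synopsis of Movie A."},
    {"imdb_id": "tt0000002", "synopsis": "Synopsis of Movie B."},
    {"imdb_id": "tt0000003", "synopsis": "Synopsis of Movie C."},
]

fact = build_fact_rows(rentals, ratings, synopses)
for row in fact:
    print(row["ordering"], row["title"], row["imdb_rating"])
```

Because both joins are inner joins, a current title with no rating row (Movie C in the sample) does not appear in the result.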
