
This project demonstrates an ETL pipeline using Microsoft Azure for IMDb Movie Rating Dataset analysis. It covers data extraction from Azure Blob Storage, transformation with Azure Databricks, and loading into Azure SQL using Azure Data Factory. The pipeline automates insight generation and is a practical example of cloud-based data engineering.


End-to-End ETL Pipeline with Microsoft Azure for IMDb Movie Rating Dataset

Overview

This project focuses on building a comprehensive Extract, Transform, and Load (ETL) pipeline leveraging the robust functionalities of Microsoft Azure. The pipeline is designed to efficiently fetch data from Azure Blob Storage, perform the necessary transformations using Azure Databricks orchestrated by Azure Data Factory, store the intermediate results in Azure Data Lake Storage, and finally load the analyzed data into a SQL layer for further use. This end-to-end solution aims to streamline the data processing workflow and provide actionable insights from the transformed data.

Dataset and Analysis

The dataset used in this project contains information for over 45,000 movies with 26 million ratings! Since this is primarily a data engineering project, we focus on only one of the datasets, "movies-metadata", which contains information such as budget, revenue, release dates, languages, and production countries and companies.

Analysis #1: Language Analysis

We perform language-based analysis on the movie dataset, focusing on non-numeric original-language entries. We generate a summary DataFrame that includes the number of movies, average budget, average revenue, and average popularity for each language. This analysis helps in understanding the impact of the original language on various movie metrics.
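The per-language aggregation can be sketched in plain Python (the project itself uses PySpark on Databricks). The column names `original_language`, `budget`, `revenue`, and `popularity` follow the movies-metadata file; the sample rows below are made up for illustration:

```python
from collections import defaultdict

def summarize_by_language(movies):
    """Aggregate movie count and average budget/revenue/popularity per language."""
    totals = defaultdict(lambda: {"n": 0, "budget": 0.0, "revenue": 0.0, "popularity": 0.0})
    for m in movies:
        lang = m.get("original_language")
        # Keep only non-numeric language codes (malformed rows can carry numbers here)
        if not lang or str(lang).isdigit():
            continue
        t = totals[lang]
        t["n"] += 1
        t["budget"] += float(m.get("budget") or 0)
        t["revenue"] += float(m.get("revenue") or 0)
        t["popularity"] += float(m.get("popularity") or 0)
    return {
        lang: {
            "movie_count": t["n"],
            "avg_budget": t["budget"] / t["n"],
            "avg_revenue": t["revenue"] / t["n"],
            "avg_popularity": t["popularity"] / t["n"],
        }
        for lang, t in totals.items()
    }

# Illustrative rows only
movies = [
    {"original_language": "en", "budget": 100, "revenue": 300, "popularity": 7.0},
    {"original_language": "en", "budget": 200, "revenue": 500, "popularity": 9.0},
    {"original_language": "fr", "budget": 50, "revenue": 80, "popularity": 4.0},
    {"original_language": "68", "budget": 10, "revenue": 20, "popularity": 1.0},  # dropped: numeric code
]
summary = summarize_by_language(movies)
```

In PySpark the same shape is a `filter` on the language column followed by `groupBy("original_language").agg(...)`.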

Analysis #2: Release Date Analysis

We analyze movies released after 1995 in the dataset. We calculate the number of movies released per year and perform a genre analysis to identify the top 5 genres for each year. The final output summarizes the number of movies released and the top 5 genres for each year post-1995.
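The yearly count and top-genre logic can be sketched in plain Python as follows; the `release_date` (`YYYY-MM-DD`) and `genres` fields mirror the movies-metadata columns, and the sample rows are illustrative:

```python
from collections import Counter, defaultdict

def yearly_summary(movies, after_year=1995, top_n=5):
    """Count movies per year and list the most common genres, for years after `after_year`."""
    counts = Counter()
    genre_counts = defaultdict(Counter)
    for m in movies:
        date = m.get("release_date") or ""
        if len(date) < 4 or not date[:4].isdigit():
            continue  # skip rows with missing/garbled dates
        year = int(date[:4])
        if year <= after_year:
            continue
        counts[year] += 1
        genre_counts[year].update(m.get("genres", []))
    return {
        year: {
            "movie_count": counts[year],
            "top_genres": [g for g, _ in genre_counts[year].most_common(top_n)],
        }
        for year in counts
    }

# Illustrative rows only
movies = [
    {"release_date": "1999-03-31", "genres": ["Action", "Sci-Fi"]},
    {"release_date": "1999-11-10", "genres": ["Action", "Drama"]},
    {"release_date": "1994-09-23", "genres": ["Drama"]},  # excluded: not post-1995
]
summary = yearly_summary(movies)
```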

Analysis #3: Genre Analysis

We conduct a comprehensive analysis of movie genres in the dataset. We calculate the distribution of genres and aggregate the average budget, revenue, rating, and popularity for each genre. The final output summarizes the average budget, revenue, rating, popularity, and the number of movies for each genre, providing insights into the characteristics and trends of different movie genres.
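Because each movie can belong to several genres, the aggregation first "explodes" the genre list so a movie contributes to every genre it carries. A plain-Python sketch (field names assumed from the dataset; sample rows invented):

```python
from collections import defaultdict

def summarize_by_genre(movies):
    """Explode each movie's genre list, then average budget/revenue/rating/popularity per genre."""
    agg = defaultdict(lambda: {"n": 0, "budget": 0.0, "revenue": 0.0,
                               "rating": 0.0, "popularity": 0.0})
    for m in movies:
        for genre in m.get("genres", []):
            a = agg[genre]  # one row per (movie, genre) pair
            a["n"] += 1
            a["budget"] += m["budget"]
            a["revenue"] += m["revenue"]
            a["rating"] += m["rating"]
            a["popularity"] += m["popularity"]
    return {
        genre: {
            "movie_count": a["n"],
            "avg_budget": a["budget"] / a["n"],
            "avg_revenue": a["revenue"] / a["n"],
            "avg_rating": a["rating"] / a["n"],
            "avg_popularity": a["popularity"] / a["n"],
        }
        for genre, a in agg.items()
    }

# Illustrative rows only
movies = [
    {"genres": ["Action", "Sci-Fi"], "budget": 100, "revenue": 400, "rating": 8.0, "popularity": 9.0},
    {"genres": ["Action"], "budget": 300, "revenue": 200, "rating": 6.0, "popularity": 5.0},
]
summary = summarize_by_genre(movies)
```

In PySpark this corresponds to `explode` on the genres array followed by a `groupBy("genre").agg(...)`.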

Prerequisites

  1. Microsoft Azure subscription
  2. Azure Blob Storage: Object storage service for storing large amounts of unstructured data
  3. Azure Data Lake Gen2 Storage: Scalable storage for big data analytics
  4. Azure Databricks: Analytics platform based on Apache Spark for big data processing and machine learning
  5. Azure SQL Server: Fully managed relational database service with SQL Server engine
  6. Azure SQL Database: Managed relational database with SQL Server compatibility
  7. Azure Data Factory: Data integration service that enables you to create, schedule, and orchestrate ETL and ELT workflows. ADF provides a visually intuitive interface to build data-driven workflows for orchestrating data movement and transforming data at scale

Data Flow

  1. Extract Data: Retrieve CSV data from Azure Blob Storage for processing
  2. Transform Data: Utilize Python and PySpark on Azure Databricks to analyze and process the data, storing the results in Azure Data Lake Storage Gen2
  3. Load Data: Transfer the processed data into an Azure SQL database to establish a reporting layer for dashboard creation
  4. Automation: Construct end-to-end pipelines in Azure Data Factory to automate the data flow from extraction to reporting layers

Setup and Configuration

Azure Blob Storage

  1. Create a container in your Azure Blob Storage account
  2. Upload your input data files to the container
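With the Azure CLI, the two steps above might look like the following; the storage-account and container names are placeholders, and `--auth-mode login` assumes you are signed in via `az login` with data-plane permissions:

```shell
# Create a container in the storage account (names are placeholders)
az storage container create \
    --account-name moviesstorageacct \
    --name movies-input \
    --auth-mode login

# Upload the raw CSV into the container
az storage blob upload \
    --account-name moviesstorageacct \
    --container-name movies-input \
    --name movies_metadata.csv \
    --file ./movies_metadata.csv \
    --auth-mode login
```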

Azure Data Lake Storage

  1. Create a file system in your Azure Data Lake Storage account
  2. Create a directory in the file system to store the processed Parquet files

Azure Databricks

  1. Create and configure a cluster with the necessary dependencies and settings for Python and PySpark.
  2. In this project, I created a cluster with the configuration: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
  3. Create a new notebook in your Azure Databricks workspace
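A notebook cell covering the extract and write steps might look like the sketch below. It only runs inside a Databricks cluster (it relies on the ambient `spark` session and `dbutils`), and every account, container, path, and secret-scope name is a placeholder:

```python
# Sketch of a Databricks notebook cell -- not runnable outside Databricks.
# <storage-account>, <adls-account>, container names, and the secret scope are placeholders.

# Give Spark access to the Blob container holding the raw CSV
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    dbutils.secrets.get(scope="movies", key="blob-account-key"),
)

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("wasbs://movies-input@<storage-account>.blob.core.windows.net/movies_metadata.csv")
)

# ... language / release-date / genre transformations go here ...

# Persist results to ADLS Gen2 as Parquet for the load step
# (the ADLS account key must be configured the same way as above)
raw.write.mode("overwrite").parquet(
    "abfss://processed@<adls-account>.dfs.core.windows.net/movies/summary"
)
```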

Azure SQL Database

  1. Create a SQL database in Azure
  2. Create the necessary tables in the database to store the processed data
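As a hypothetical example, a reporting table for the genre analysis could be defined as below; the column names mirror the analysis outputs, and types/keys should be adjusted to your data:

```sql
-- Hypothetical reporting table for the genre summary
CREATE TABLE dbo.genre_summary (
    genre           NVARCHAR(100) NOT NULL PRIMARY KEY,
    movie_count     INT           NOT NULL,
    avg_budget      FLOAT,
    avg_revenue     FLOAT,
    avg_rating      FLOAT,
    avg_popularity  FLOAT
);
```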

Azure Data Factory

  1. Create a new pipeline in your Azure Data Factory instance
  2. Create a linked service for each of the Azure services used. I created linked services for Azure Blob Storage, Azure Data Lake Gen2 Storage, Azure Databricks, and Azure SQL
  3. Add activities to the pipeline for each step of the workflow:
     a. Data processing in Azure Databricks
     b. Data copy from Azure Data Lake Storage to Azure SQL Database
  4. Run the pipeline:
     a. Trigger the pipeline run in Azure Data Factory
     b. Monitor the pipeline run to ensure that each step completes successfully
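The two activities and their dependency can be expressed in ADF's pipeline JSON roughly as follows; the pipeline, linked-service, dataset, and notebook names are placeholders, not the ones from this repo:

```json
{
  "name": "movies-etl-pipeline",
  "properties": {
    "activities": [
      {
        "name": "TransformInDatabricks",
        "type": "DatabricksNotebook",
        "typeProperties": { "notebookPath": "/Shared/movies_etl" },
        "linkedServiceName": {
          "referenceName": "AzureDatabricksLS",
          "type": "LinkedServiceReference"
        }
      },
      {
        "name": "LoadToSQL",
        "type": "Copy",
        "dependsOn": [
          { "activity": "TransformInDatabricks", "dependencyConditions": ["Succeeded"] }
        ],
        "inputs":  [ { "referenceName": "AdlsParquetDS", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "AzureSqlTableDS", "type": "DatasetReference" } ]
      }
    ]
  }
}
```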

By the end of it all, we have a fully automated ETL pipeline that fetches the data from Blob Storage, transforms it to generate the necessary insights, and loads the results into the reporting layer :)
