lovenui / building-modern-data-pipeline-using-python-and-aws

Python 88.34% Shell 11.66%
aws-athena aws-glue aws-iam aws-lambda aws-s3 python sql

Building Modern Data Pipeline using Python & AWS

Business Overview

Moving analytics to the cloud raises several problems. Feature differences between on-premises and cloud data platforms, security, and governance are all technical concerns. The risk of moving on-premises data into the cloud has led organizations to limit cloud analytics initiatives, especially in regulated industries where data protection is crucial. Secure cloud-based data lake solutions support rich analytics on data while organizing it into storage stages such as raw, cleansed, and analytical. This project aims to securely manage, streamline, and analyze structured and semi-structured YouTube video data based on video categories and trending metrics.

Data Pipeline

A data pipeline is a set of processes for moving data from one system to another. The data may or may not be transformed along the way, and it may be processed in real time (streaming) or in batches. A pipeline covers everything from collecting data by various methods to storing the raw data, cleaning, validating, and transforming it into a queryable format, displaying KPIs, and managing the whole process.
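The stages named above (collect, clean and validate, transform, report a KPI) can be sketched in plain Python. This is only an illustration, not the project's code: the sample rows, the column names, and the function names are all assumptions made up for the example, and a real run would read from S3 rather than an in-memory string.

```python
import csv
import io

# Made-up sample standing in for a raw CSV file; the columns are assumptions.
RAW_CSV = """video_id,title,views,likes
a1,First video,1000,100
b2,Second video,not_a_number,5
c3,Third video,500,25
"""


def extract(text):
    """Read raw CSV text into a list of dict rows (the raw stage)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Validate numeric fields, casting them to int and dropping bad rows."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append(
                {**row, "views": int(row["views"]), "likes": int(row["likes"])}
            )
        except ValueError:
            continue  # row failed validation, so it is skipped
    return cleaned


def transform(rows):
    """Add a derived metric (likes per view) to each row."""
    return [
        {**r, "like_ratio": r["likes"] / r["views"] if r["views"] else 0.0}
        for r in rows
    ]


def kpi(rows):
    """Summarize the transformed rows into a small KPI dictionary."""
    return {"videos": len(rows), "total_views": sum(r["views"] for r in rows)}


def run_pipeline(text):
    """Chain the stages: extract, clean, transform, then report."""
    return kpi(transform(clean(extract(text))))


print(run_pipeline(RAW_CSV))
```

Keeping each stage as its own function mirrors the raw, cleansed, and analytical layers: the output of one stage can be stored before the next stage runs.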

Dataset Description

This Kaggle dataset contains statistics (CSV files) on daily trending YouTube videos over the course of many months. Up to 200 trending videos are listed per day for each of many regions, and each region's data is in its own file. The data includes the video title, channel title, publication time, tags, views, likes and dislikes, description, and comment count. It also includes a category_id field, which varies by region; the categories for a region are defined in the JSON file associated with that region.
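Resolving category_id against the region's JSON file might look like the sketch below. It assumes the JSON has an "items" list whose entries carry an "id" and a "snippet" with a "title", and that the CSV has a category_id column; the sample data and the function names are invented for illustration.

```python
import csv
import io
import json

# Invented sample of a region's category JSON (structure is an assumption).
CATEGORY_JSON = """
{"items": [
  {"id": "1",  "snippet": {"title": "Film & Animation"}},
  {"id": "10", "snippet": {"title": "Music"}}
]}
"""

# Invented sample of the same region's CSV file.
VIDEOS_CSV = """video_id,title,category_id,views
a1,Some trailer,1,1000
b2,Some song,10,2500
c3,Uncategorized clip,999,10
"""


def load_categories(json_text):
    """Build a {category_id: title} lookup from the region's JSON text."""
    doc = json.loads(json_text)
    return {int(item["id"]): item["snippet"]["title"] for item in doc["items"]}


def attach_category(csv_text, categories):
    """Add a 'category' column to each CSV row using the lookup."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        category_id = int(row["category_id"])
        # IDs missing from the region's JSON are labeled "Unknown".
        rows.append({**row, "category": categories.get(category_id, "Unknown")})
    return rows


for row in attach_category(VIDEOS_CSV, load_categories(CATEGORY_JSON)):
    print(row["video_id"], row["category"])
```

Because the lookup is built per region, each region's CSV should be paired with its own JSON file.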

Tech Stack:

  • Languages: SQL, Python3
  • Services: AWS S3, AWS Glue, QuickSight, AWS Lambda, AWS Athena, AWS IAM

Key Takeaways

  • Understanding the project Overview and Architecture
  • Understanding ETL on Big Data
  • Introduction to Staging and Data Lake
  • Creating IAM Roles and Policies
  • Creating Lambda Functions
  • Setting up Glue Jobs for ETL
  • Using Glue Crawler and Glue Studio
  • Creating Glue Data Catalog
  • Converting JSON to Parquet format
  • Performing Data Transformations and Joins
  • Visualizing in QuickSight
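As a rough sketch of the Lambda and JSON-to-Parquet steps listed above, the helpers below pull the bucket and key out of an S3 event notification, flatten a nested JSON record into flat columns, and derive a destination key for the converted file. The "raw/" and "cleansed/" prefixes and all function names are hypothetical, not taken from the project. Writing the Parquet file itself is left out, since it needs a third-party library.

```python
from urllib.parse import unquote_plus


def s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification.

    Object keys arrive URL-encoded in the event, so they are decoded here.
    """
    return [
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining names with underscores."""
    flat = {}
    for name, value in record.items():
        column = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}_"))
        else:
            flat[column] = value
    return flat


def cleansed_key(raw_key, raw_prefix="raw/", cleansed_prefix="cleansed/"):
    """Map a raw JSON key to a Parquet key (hypothetical prefix layout)."""
    stem = raw_key[len(raw_prefix):] if raw_key.startswith(raw_prefix) else raw_key
    if stem.endswith(".json"):
        stem = stem[: -len(".json")]
    return f"{cleansed_prefix}{stem}.parquet"


# Made-up event with the shape S3 sends to Lambda on object creation.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "raw/region%3Dus/US_category_id.json"},
            }
        }
    ]
}

for bucket, key in s3_objects(sample_event):
    print(bucket, key, "->", cleansed_key(key))
```

Flat column names matter here because columnar formats such as Parquet and the tables queried through Athena are easier to work with when each nested field becomes its own column.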

Architecture diagram:

[Image: architecture diagram of the project]
