GithubHelp home page GithubHelp logo

rigganni / million-song-apache-airflow-aws-redshift Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 62 KB

Load data from the Million Song Dataset into a final dimensional model in RedShift utilizing Apache Airflow.

Python 91.19% TSQL 8.81%
apache-airflow aws-redshift aws-s3 elt

million-song-apache-airflow-aws-redshift's Introduction

Million Song Dataset JSON -> AWS RedShift

Load data from the Million Song Dataset into a final dimensional model in RedShift utilizing Apache Airflow.

Introduction

A fictional music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. The files in this repository will assist in loading the data into a final dimensional model in AWS RedShift.

Apache Airflow Operators

Data is extracted from S3 buckets and loaded into staging tables. Apache Airflow is utilized to coordinate data transfer and transformations. 4 operators are created:

  • StageToRedshiftOperator
    • Stages data from both event and songs source JSON files
  • LoadFactOperator
    • Loads data into songplays fact table
  • LoadDimensionOperator
    • Loads data into all dimension tables
    • Optionally truncates dimension before loading
  • DataQualityOperator
    • Generic operator to perform any SQL data quality check passed to it

Apache Airflow DAG

An Apache Airflow DAG was created utilizing the above operators.

Sparkify Airflow DAG

Installation

Clone this repository:

git clone https://github.com/rigganni/Million-Song-Apache-Airflow-AWS-RedShift

Set up the following variables in dwh.cfg:

variable description
DWH_CLUSTER_TYPE multi-node
DWH_NUM_NODES number of nodes for cluster
DWH_NODE_TYPE i.e. dc2.large
DWH_HOST will be written to in create_redshift_cluster_database.py
DWH_IAM_ROLE_NAME i.e. dwhRoleS3
DWH_CLUSTER_IDENTIFIER dwhCluster
DWH_DB dev
DWH_DB_USER aws
DWH_DB_PASSWORD Your password
DWH_PORT specify port (i.e. 5439)
DWH_REGION us-west-2 since required S3 files are stored there
DWH_CLUSTER_SUBNET_NAME use if RedShift cluster is required in specific VPC
DWH_ROLE_ARN written to in create_redshift_cluster_database.py
DWH_CLUSTER_SUBNET_GROUP_NAME use if RedShift cluster is required in specific VPC
KEY AWS key
SECRET AWS secret
ARN written to in create_redshift_cluster_database.py
LOG_DATA s3://udacity-dend/log_data
LOG_JSONPATH s3://udacity-dend/log_json_path.json
SONG_DATA s3://udacity-dend/song_data

Usage

Ensure variables are set up in dwh.cfg as described in installation above.

Create RedShift IAM role, cluster, and database:

python create_redshift_cluster_database.py

Deploy files to proper directories in your Apache Airflow setup.

million-song-apache-airflow-aws-redshift's People

Contributors

rigganni avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.