Data Warehouse Project with Redshift

Introduction

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. The data resides in S3, in a directory of JSON logs of user activity on the app and a directory of JSON metadata on the songs in the app. As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so that the analytics team can continue finding insights into what songs users are listening to. You can test your database and ETL pipeline by running queries provided by Sparkify's analytics team and comparing your results with their expected results.

Project Description

In this project, you'll apply what you've learned on data warehouses and AWS to build an ETL pipeline for a database hosted on Redshift. To complete the project, you will need to load data from S3 to staging tables on Redshift and execute SQL statements that create the analytics tables from these staging tables.

Project Datasets

You'll be working with two datasets that reside in S3. Here are the S3 links for each:

Song data: s3://udacity-dend/song_data
Log data: s3://udacity-dend/log_data
Log data json path: s3://udacity-dend/log_json_path.json

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID that follow the "TR" prefix. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
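
As a small illustration of this layout, the sketch below derives the partition directory from a track ID. It assumes every track ID starts with the two-character prefix "TR", as in the two example paths; the helper name song_path is hypothetical and not part of the project template.

```python
def song_path(track_id: str) -> str:
    """Return the relative path of a song file, assuming a 'TR' prefix."""
    if not track_id.startswith("TR") or len(track_id) < 5:
        raise ValueError(f"unexpected track ID: {track_id!r}")
    # The three characters after the prefix name the nested directories.
    first, second, third = track_id[2:5]
    return f"song_data/{first}/{second}/{third}/{track_id}.json"


print(song_path("TRABCEI128F424C983"))
# prints: song_data/A/B/C/TRABCEI128F424C983.json
```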

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings. The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
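
The year and month partitioning can be sketched the same way. The helper name log_path is hypothetical, and the naming pattern is inferred only from the two example paths.

```python
from datetime import date


def log_path(day: date) -> str:
    """Return the relative path of the log file for one day (pattern inferred from the examples)."""
    return f"log_data/{day:%Y}/{day:%m}/{day:%Y-%m-%d}-events.json"


print(log_path(date(2018, 11, 12)))
# prints: log_data/2018/11/2018-11-12-events.json
```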

Project Steps

Below are steps you can follow to complete each component of this project.

Create Table Schemas

Design schemas for your fact and dimension tables
Write a SQL CREATE statement for each of these tables in sql_queries.py
Complete the logic in create_tables.py to connect to the database and create these tables
Write SQL DROP statements to drop tables at the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
Launch a Redshift cluster and create an IAM role that has read access to S3.
Add Redshift database and IAM role info to dwh.cfg.
Test by running create_tables.py and checking the table schemas in your Redshift database. You can use Query Editor in the AWS Redshift console for this.
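
A minimal sketch of how the DROP and CREATE statements for one table could be laid out in sql_queries.py is shown below. Only the time table is used, because its fields are listed in this README; the start_time column, the column types, and the variable and list names are assumptions for illustration.

```python
# Hypothetical excerpt in the style of sql_queries.py.
time_table_drop = "DROP TABLE IF EXISTS time;"

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP NOT NULL,  -- assumed key column
    hour       SMALLINT,
    day        SMALLINT,
    week       SMALLINT,
    month      SMALLINT,
    year       SMALLINT,
    weekday    SMALLINT
);
"""

# create_tables.py can iterate over these lists: drops first, then creates.
drop_table_queries = [time_table_drop]
create_table_queries = [time_table_create]
```

Dropping with IF EXISTS before creating is what makes create_tables.py safe to rerun as a reset.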

Build ETL Pipeline

Implement the logic in etl.py to load data from S3 to staging tables on Redshift.
Implement the logic in etl.py to load data from staging tables to analytics tables on Redshift.
Test by running etl.py after running create_tables.py and running the analytic queries on your Redshift database to compare your results with the expected results.
Delete your Redshift cluster when finished.
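
Loading from S3 into a staging table is typically done with Redshift's COPY command. The sketch below only builds the SQL text, so it runs without a cluster. The function name build_copy, the staging table names, and the ARN are placeholders, and the optional REGION clause is only needed when the bucket and the cluster are in different regions.

```python
def build_copy(table, source, iam_role_arn, json_option="auto", region=None):
    """Build a Redshift COPY statement for JSON data stored in S3.

    json_option is either 'auto' or the S3 path of a JSONPaths file.
    """
    statement = (
        f"COPY {table}\n"
        f"FROM '{source}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}'"
    )
    if region:
        statement += f"\nREGION '{region}'"
    return statement + ";"


ROLE_ARN = "arn:aws:iam::123456789012:role/example-role"  # placeholder ARN

# Log data uses the JSONPaths file listed under Project Datasets.
print(build_copy("staging_events", "s3://udacity-dend/log_data", ROLE_ARN,
                 json_option="s3://udacity-dend/log_json_path.json"))
# Song data is assumed to map to columns automatically by field name.
print(build_copy("staging_songs", "s3://udacity-dend/song_data", ROLE_ARN))
```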

Document Process

Do the following steps in your README.md file.

Discuss the purpose of this database in context of the startup, Sparkify, and their analytical goals.
State and justify your database schema design and ETL pipeline.

Disclaimer

Data and project information were kindly provided by Udacity.

Process Documentation

Discuss the purpose of this database in context of the startup, Sparkify, and their analytical goals.

The purpose of the database is to provide the startup Sparkify with a cloud-based storage solution for its growing data load that also allows for efficient analytics, as the startup wants to investigate which songs its customers are listening to. For this purpose, a data warehouse on Amazon Redshift reads the customer data stored in S3 buckets and stores it in dimensional tables, which the Sparkify analytics team can query.

State and justify your database schema design and ETL pipeline.

The schema used is a star schema, which allows for efficient querying, with one fact table and multiple dimension tables:

songplays (fact table) contains records in event data associated with song plays (page = NextSong)
users (dimension table) contains users and associated information in the app
songs (dimension table) contains songs and associated information in music database
artists (dimension table) contains artists and associated information in music database
time (dimension table) contains timestamps of records in songplays broken down into hour, day, week, month, year and weekday
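
To make the time dimension concrete, the sketch below breaks one timestamp into the listed fields. It assumes the log timestamps are epoch milliseconds in UTC, which this README does not state, and the helper name time_row is hypothetical. In the pipeline itself this breakdown would more likely be done in SQL during the insert.

```python
from datetime import datetime, timezone


def time_row(ts_ms: int) -> dict:
    """Break an epoch-millisecond timestamp (assumed format) into time-table fields."""
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO week number
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),  # Python convention: Monday = 0
    }


row = time_row(1541903636796)
print(row["year"], row["month"], row["day"], row["hour"])
```

Weekday numbering differs between tools (Python's weekday() starts at Monday = 0), so the convention should match whatever the SQL side produces.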

The ETL pipeline works as follows:

Stage the song and log JSON files from the AWS S3 bucket in Redshift, using the load_staging_tables function in etl.py
Load data from the staging tables into the fact and dimension tables, using the insert_tables function in etl.py
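
The two steps above can be sketched as follows. The function names come from etl.py, but the bodies, the query list names, and the placeholder SQL comments are assumptions about how such functions are commonly written. The recording stubs stand in for a real database cursor and connection so the sketch runs without a cluster.

```python
# Hypothetical query lists; in the project these would come from sql_queries.py.
copy_table_queries = ["-- COPY for log data", "-- COPY for song data"]
insert_table_queries = ["-- INSERT INTO songplays", "-- INSERT INTO users"]


def load_staging_tables(cur, conn):
    """Run each COPY statement to load S3 data into the staging tables."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """Run each INSERT ... SELECT to fill the analytics tables from staging."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


class RecordingCursor:
    """Stub cursor that records the statements it is asked to execute."""
    def __init__(self):
        self.executed = []

    def execute(self, query):
        self.executed.append(query)


class RecordingConnection:
    """Stub connection that counts commits."""
    def __init__(self):
        self.commits = 0

    def commit(self):
        self.commits += 1


cur, conn = RecordingCursor(), RecordingConnection()
load_staging_tables(cur, conn)  # staging must finish before the inserts run
insert_tables(cur, conn)
print(len(cur.executed), "statements executed")
```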

Description of the files in the repository

dwh.cfg: contains the configuration information for the Redshift cluster, the IAM role and
sql_queries.py: Python script containing the SQL statements to create, stage and fill the data tables contained in the schema
create_tables.py: Python script for executing the table creation
etl.py: Python script for inserting data into the created tables
README.md: markdown file with documentation of the project

How to run the files

Create a Redshift cluster in AWS
Fill dwh.cfg with the required information, e.g. db_user and db_password
Run create_tables.py in the terminal using python create_tables.py to create the data tables
Run etl.py in the terminal using python etl.py to fill the created tables with data
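
For orientation, the sketch below shows one way the scripts could turn dwh.cfg values into a connection string. Only db_user and db_password are named in this README; the section name, the other keys, and all values are placeholders. The resulting string is in the key=value form accepted by PostgreSQL client libraries, assuming such a library is used to connect.

```python
import configparser

# Placeholder configuration text; real values belong in dwh.cfg.
SAMPLE_CFG = """
[CLUSTER]
host = example-cluster.abc123.us-west-2.redshift.amazonaws.com
db_name = dev
db_user = awsuser
db_password = change-me
db_port = 5439
"""


def build_dsn(cfg_text: str) -> str:
    """Parse configuration text and return a key=value connection string."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    cluster = config["CLUSTER"]  # hypothetical section name
    return "host={} dbname={} user={} password={} port={}".format(
        cluster["host"],
        cluster["db_name"],
        cluster["db_user"],
        cluster["db_password"],
        cluster["db_port"],
    )


print(build_dsn(SAMPLE_CFG))
```

Keeping credentials in dwh.cfg rather than in the scripts means the file should stay out of version control.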

Sources used

https://devopscube.com/aws-arn-guide/ for information on ARN
https://knowledge.udacity.com/questions/96309 for cluster setup
