
Data Warehouse Project with Redshift

Introduction

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app. As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to. You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and comparing your results with their expected results.

Project Description

In this project, you'll apply what you've learned on data warehouses and AWS to build an ETL pipeline for a database hosted on Redshift. To complete the project, you will need to load data from S3 to staging tables on Redshift and execute SQL statements that create the analytics tables from these staging tables.

Project Datasets

You'll be working with two datasets that reside in S3. Here are the S3 links for each:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data
  • Log data JSON path: s3://udacity-dend/log_json_path.json

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

  • song_data/A/B/C/TRABCEI128F424C983.json
  • song_data/A/A/B/TRAABJL12903CDCF1A.json

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings. The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

  • log_data/2018/11/2018-11-12-events.json
  • log_data/2018/11/2018-11-13-events.json

Project Steps

Below are steps you can follow to complete each component of this project.

Create Table Schemas

  1. Design schemas for your fact and dimension tables.
  2. Write a SQL CREATE statement for each of these tables in sql_queries.py.
  3. Complete the logic in create_tables.py to connect to the database and create these tables (a minimal sketch follows this list).
  4. Write SQL DROP statements to drop tables at the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
  5. Launch a Redshift cluster and create an IAM role that has read access to S3.
  6. Add the Redshift database and IAM role info to dwh.cfg.
  7. Test by running create_tables.py and checking the table schemas in your Redshift database. You can use the Query Editor in the AWS Redshift console for this.
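
A minimal sketch of what create_tables.py can look like, assuming sql_queries.py exposes two lists named drop_table_queries and create_table_queries and that dwh.cfg keeps the cluster connection details under a [CLUSTER] section (these names are assumptions, not requirements of the project):

```python
# create_tables.py -- minimal sketch; list names and config keys are assumptions
import configparser

import psycopg2

from sql_queries import create_table_queries, drop_table_queries


def drop_tables(cur, conn):
    """Drop every table listed in drop_table_queries so the schema can be rebuilt."""
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    """Create the staging, fact, and dimension tables from create_table_queries."""
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    cluster = config["CLUSTER"]

    # Connect to the Redshift cluster described in dwh.cfg
    conn = psycopg2.connect(
        host=cluster["HOST"],
        dbname=cluster["DB_NAME"],
        user=cluster["DB_USER"],
        password=cluster["DB_PASSWORD"],
        port=cluster["DB_PORT"],
    )
    cur = conn.cursor()

    drop_tables(cur, conn)
    create_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```

Dropping before creating keeps the script idempotent, so it can be rerun at any time to reset the database.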

Build ETL Pipeline

  1. Implement the logic in etl.py to load data from S3 to staging tables on Redshift.
  2. Implement the logic in etl.py to load data from staging tables to analytics tables on Redshift (a minimal sketch follows this list).
  3. Test by running etl.py after running create_tables.py, then run the analytic queries on your Redshift database and compare your results with the expected results.
  4. Delete your Redshift cluster when finished.
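
A matching sketch of etl.py, assuming sql_queries.py also exposes copy_table_queries (the COPY commands) and insert_table_queries (the INSERT ... SELECT statements) as lists; again, the names are illustrative:

```python
# etl.py -- minimal sketch; query list names and config keys are assumptions
import configparser

import psycopg2

from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    """COPY the raw JSON files from S3 into the staging tables."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """INSERT from the staging tables into the fact and dimension tables."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    cluster = config["CLUSTER"]

    conn = psycopg2.connect(
        host=cluster["HOST"],
        dbname=cluster["DB_NAME"],
        user=cluster["DB_USER"],
        password=cluster["DB_PASSWORD"],
        port=cluster["DB_PORT"],
    )
    cur = conn.cursor()

    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```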

Document Process

Do the following steps in your README.md file.

  1. Discuss the purpose of this database in context of the startup, Sparkify, and their analytical goals.
  2. State and justify your database schema design and ETL pipeline.

Disclaimer

Data and project information were kindly provided by Udacity.


Process Documentation

Discuss the purpose of this database in context of the startup, Sparkify, and their analytical goals.

The purpose of the database is to provide the startup Sparkify with a cloud-based storage solution for its growing data volume that also supports efficient analytics, since the startup wants to investigate which songs its customers are listening to. To this end, a data warehouse on Amazon Redshift reads the raw customer data from S3 buckets and loads it into dimensional tables that the Sparkify analytics team can query.

State and justify your database schema design and ETL pipeline.

The schema used is a star schema, which allows for efficient querying, with one fact table and multiple dimension tables (a DDL sketch for the fact table follows this list):

  1. songplays (fact table) contains records from the event data associated with song plays (page = NextSong)
  2. users (dimension table) contains users in the app and their associated information
  3. songs (dimension table) contains songs in the music database and their associated information
  4. artists (dimension table) contains artists in the music database and their associated information
  5. time (dimension table) contains timestamps of records in songplays broken down into hour, day, week, month, year and weekday
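
For illustration, the songplays fact table could be declared in sql_queries.py roughly as follows; the column list mirrors the event data fields, while the IDENTITY key, DISTKEY, and SORTKEY choices are one reasonable option rather than the project's prescribed answer:

```python
# Sketch of one entry in sql_queries.py; types and key choices are assumptions
songplay_table_create = ("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL REFERENCES time(start_time) SORTKEY,
        user_id     INT       NOT NULL REFERENCES users(user_id),
        level       VARCHAR,
        song_id     VARCHAR   REFERENCES songs(song_id) DISTKEY,
        artist_id   VARCHAR   REFERENCES artists(artist_id),
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")
```

In Redshift, PRIMARY KEY and REFERENCES constraints are informational only (not enforced), but they document the star-schema relationships for readers and for the query planner.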

The ETL pipeline works as follows (example queries follow this list):

  1. Staging the song data and log data JSON files from the S3 bucket into Redshift, using the load_staging_tables function in etl.py
  2. Pushing data from the staging tables into the dimensional tables, using the insert_tables function in etl.py
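
These two steps correspond to two kinds of SQL in sql_queries.py: COPY commands for staging and INSERT ... SELECT statements for the analytics tables. A hedged example of each is sketched below; the staging table and column names are assumptions, the S3 paths are the ones listed in the project datasets section, the IAM role ARN is read from dwh.cfg, and REGION 'us-west-2' is an assumption about the bucket's region.

```python
# Sketch of the two query styles driven by etl.py; names are assumptions.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

# COPY raw log events from S3 into a staging table, using the JSONPaths file
# to map the JSON fields onto columns.
staging_events_copy = ("""
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{}'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
""").format(config["IAM_ROLE"]["ARN"])

# Populate the fact table from the staging tables, keeping only NextSong events
# and converting the epoch-millisecond ts field to a timestamp.
songplay_table_insert = ("""
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT DISTINCT
           TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    JOIN staging_songs  s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
""")
```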

Description of the files in the repository

  1. dwh.cfg: Configuration file containing the connection information for the Redshift cluster, the IAM role, and the S3 paths to the source datasets
  2. sql_queries.py: Python script containing the SQL statements to create, stage and fill the data tables contained in the schema
  3. create_tables.py: Python script for executing the table creation
  4. etl.py: Python script for inserting data into the created tables
  5. README.md: markdown file with documentation of the project

How to run the files

  1. Create a cluster in Amazon Redshift
  2. Fill dwh.cfg with the required information, e.g. db_user and db_password (a sketch of the expected layout follows this list)
  3. Run create_tables.py in the terminal using python create_tables.py to create the data tables
  4. Run etl.py in the terminal using python etl.py to fill the created tables with data
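
Since both scripts read dwh.cfg with Python's configparser, a quick layout check like the one below can catch a half-filled config before anything touches Redshift. The section and key names are the assumptions used in the sketches above, not names mandated by the project, and real credentials should never be committed to the repository.

```python
# check_config.py -- hypothetical helper; section/key names are assumptions
import configparser

EXPECTED = {
    "CLUSTER": ["HOST", "DB_NAME", "DB_USER", "DB_PASSWORD", "DB_PORT"],
    "IAM_ROLE": ["ARN"],
    "S3": ["LOG_DATA", "LOG_JSONPATH", "SONG_DATA"],
}

config = configparser.ConfigParser()
config.read("dwh.cfg")

for section, keys in EXPECTED.items():
    if section not in config:
        print(f"dwh.cfg is missing section [{section}]")
        continue
    missing = [key for key in keys if key not in config[section]]
    if missing:
        print(f"dwh.cfg section [{section}] is missing: {', '.join(missing)}")
```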

