A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory with JSON metadata on the songs in the app. As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs their users are listening to. You can test your database and ETL pipeline by running queries given to you by the Sparkify analytics team and comparing your results with their expected results.
In this project, you'll apply what you've learned about data warehouses and AWS to build an ETL pipeline for a database hosted on Redshift. To complete the project, you will need to load data from S3 into staging tables on Redshift and execute SQL statements that create the analytics tables from these staging tables.
You'll be working with two datasets that reside in S3. Here are the S3 links for each:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
- Log data JSON path: s3://udacity-dend/log_json_path.json
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID that follow the TR prefix. For example, here are the file paths to two files in this dataset:
- song_data/A/B/C/TRABCEI128F424C983.json
- song_data/A/A/B/TRAABJL12903CDCF1A.json
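The partitioning can be sketched in a few lines of Python. Judging from the two example paths, the three directory levels are the characters that come right after the TR prefix of the track ID. The helper name song_file_path is hypothetical and only illustrates the layout.

```python
from pathlib import PurePosixPath


def song_file_path(track_id: str, root: str = "song_data") -> str:
    """Return the partitioned path for a track ID (hypothetical helper)."""
    if not track_id.startswith("TR") or len(track_id) < 5:
        raise ValueError(f"unexpected track ID: {track_id!r}")
    # The three characters after the 'TR' prefix name the directory levels.
    first, second, third = track_id[2:5]
    return str(PurePosixPath(root, first, second, third, f"{track_id}.json"))


print(song_file_path("TRABCEI128F424C983"))
# song_data/A/B/C/TRABCEI128F424C983.json
```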
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These logs simulate activity from an imaginary music streaming app based on configuration settings. The log files in the dataset you'll be working with are partitioned by year and month. For example, here are the file paths to two files in this dataset:
- log_data/2018/11/2018-11-12-events.json
- log_data/2018/11/2018-11-13-events.json
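As a small illustration of the year/month partitioning, the path of a daily log file can be derived from its date. This is only a sketch: the helper name log_file_path is hypothetical, and zero-padding the month is an assumption, since the examples only show November.

```python
from datetime import date


def log_file_path(day: date, root: str = "log_data") -> str:
    """Return the partitioned path of the log file for one day (hypothetical helper)."""
    # Directory levels are year and month; the file name carries the full ISO date.
    # Zero-padding the month is an assumption not confirmed by the examples.
    return f"{root}/{day.year}/{day.month:02d}/{day.isoformat()}-events.json"


print(log_file_path(date(2018, 11, 12)))
# log_data/2018/11/2018-11-12-events.json
```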
Below are steps you can follow to complete each component of this project.
- Design schemas for your fact and dimension tables
- Write a SQL CREATE statement for each of these tables in sql_queries.py
- Complete the logic in create_tables.py to connect to the database and create these tables
- Write SQL DROP statements to drop tables at the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
- Launch a Redshift cluster and create an IAM role that has read access to S3.
- Add Redshift database and IAM role info to dwh.cfg.
- Test by running create_tables.py and checking the table schemas in your Redshift database. You can use Query Editor in the AWS Redshift console for this.
- Implement the logic in etl.py to load data from S3 to staging tables on Redshift.
- Implement the logic in etl.py to load data from staging tables to analytics tables on Redshift.
- Test by running etl.py after running create_tables.py and running the analytic queries on your Redshift database to compare your results with the expected results.
- Delete your Redshift cluster when finished.
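The drop-then-create reset described in the steps above could look roughly like the following sketch. It is not the project's actual create_tables.py: the query lists, table columns, and function names are hypothetical, and an in-memory SQLite database stands in for Redshift so the example runs anywhere. With Redshift, the connection would presumably come from a PostgreSQL driver such as psycopg2 instead.

```python
import sqlite3

# Hypothetical stand-ins for the query lists that sql_queries.py would export.
drop_table_queries = [
    "DROP TABLE IF EXISTS songplays",
    "DROP TABLE IF EXISTS users",
]
create_table_queries = [
    "CREATE TABLE IF NOT EXISTS users (user_id INTEGER PRIMARY KEY, level TEXT)",
    "CREATE TABLE IF NOT EXISTS songplays (songplay_id INTEGER PRIMARY KEY, user_id INTEGER)",
]


def run_queries(conn, queries):
    """Execute each statement in order on a DB-API connection, then commit."""
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
    conn.commit()
    cur.close()


def reset_database(conn):
    """Drop any existing tables first, then recreate them from scratch."""
    run_queries(conn, drop_table_queries)
    run_queries(conn, create_table_queries)


# SQLite is used here only as a stand-in for the Redshift connection.
conn = sqlite3.connect(":memory:")
reset_database(conn)
reset_database(conn)  # running it again is safe because of the DROP step
```

Because the DROP statements run first, the script can be repeated as often as needed while testing the ETL pipeline.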
Complete the following in your README.md file:
- Discuss the purpose of this database in context of the startup, Sparkify, and their analytical goals.
- State and justify your database schema design and ETL pipeline.
Data and project information were kindly provided by Udacity.
Discuss the purpose of this database in context of the startup, Sparkify, and their analytical goals.
The purpose of the database is to give the startup Sparkify a cloud-based home for its growing data volume that also supports efficient analytics, since the startup wants to investigate which songs its customers are listening to. To this end, an Amazon Redshift data warehouse reads the user activity and song data stored in S3 buckets and stores it in dimensional tables that the Sparkify analytics team can query.
The schema used is a star schema, which allows for efficient querying, with one fact table and multiple dimension tables:
- songplays (fact table) contains records in event data associated with song plays (page = NextSong)
- users (dimension table) contains users and associated information in the app
- songs (dimension table) contains songs and associated information in music database
- artists (dimension table) contains artists and associated information in music database
- time (dimension table) contains timestamps of records in songplays broken down into hour, day, week, month, year and weekday
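To make the time dimension concrete, here is a minimal Python sketch of how one timestamp breaks down into the listed units. It assumes the log timestamp is a Unix epoch value in milliseconds and uses UTC; both are assumptions, and the function name time_row is hypothetical. In the actual pipeline this breakdown would more likely happen in SQL on Redshift.

```python
from datetime import datetime, timezone


def time_row(ts_ms: int) -> dict:
    """Break an epoch-milliseconds timestamp into the time table's units."""
    # Assumption: the timestamp is in milliseconds since the Unix epoch, in UTC.
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO week number
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),  # Python convention: Monday is 0
    }


row = time_row(1542024000000)  # 2018-11-12 12:00:00 UTC
print(row["hour"], row["day"], row["week"], row["month"], row["year"], row["weekday"])
# 12 12 46 11 2018 0
```

Weekday numbering differs between tools (Python counts from Monday, while some SQL dialects count from Sunday), so it is worth checking which convention the analytics queries expect.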
The ETL pipeline works as follows:
- Stage the song data and log data JSON files from the AWS S3 bucket in Redshift, using the load_staging_tables function in etl.py
- Push the data from the staging tables into the dimensional tables, using the insert_tables function in etl.py
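The staging step relies on Redshift's COPY command. The sketch below shows one way the COPY statements might be assembled as strings; it is not the project's actual sql_queries.py. The table names staging_events and staging_songs, the helper build_copy, and the IAM role ARN are placeholders. The S3 locations are the ones listed for the project datasets.

```python
def build_copy(table: str, source: str, iam_role_arn: str, json_option: str = "auto") -> str:
    """Build a Redshift COPY statement that loads JSON files from S3 (hypothetical helper)."""
    return (
        f"COPY {table}\n"
        f"FROM '{source}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# Placeholder ARN; the real value would come from dwh.cfg.
ROLE_ARN = "arn:aws:iam::123456789012:role/exampleRedshiftRole"

# Log data is mapped to columns through the provided JSONPaths file.
events_copy = build_copy(
    "staging_events",
    "s3://udacity-dend/log_data",
    ROLE_ARN,
    json_option="s3://udacity-dend/log_json_path.json",
)

# Song data uses 'auto', which matches JSON keys to column names.
songs_copy = build_copy("staging_songs", "s3://udacity-dend/song_data", ROLE_ARN)

print(events_copy)
print(songs_copy)
```

If the cluster and the bucket are in different AWS regions, a REGION clause may also be needed in the COPY statement.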
- dwh.cfg: contains the configuration information for the Redshift cluster, the IAM role and
- sql_queries.py: Python script containing the SQL statements to create, stage and fill the data tables contained in the schema
- create_tables.py: Python script for executing the table creation
- etl.py: Python script for inserting data into the created tables
- README.md: markdown file with documentation of the project
- Create a cluster in AWS-Redshift
- Fill dwh.cfg with the required information, e.g. db_user and db_password
- Run create_tables.py in the terminal to create the data tables:
  python create_tables.py
- Run etl.py in the terminal to fill the created tables with data:
  python etl.py
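For orientation, the following sketch shows how the scripts might read dwh.cfg with Python's configparser and turn it into a connection string. Only db_user and db_password are named above; the section name CLUSTER, the other keys, and all values are assumptions made for illustration.

```python
import configparser

# Hypothetical contents of dwh.cfg; the section name, host, and values are placeholders.
SAMPLE_CFG = """
[CLUSTER]
host = example-cluster.abc123.us-west-2.redshift.amazonaws.com
db_name = dev
db_user = awsuser
db_password = change-me
db_port = 5439
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # the real scripts would call config.read("dwh.cfg")
cluster = config["CLUSTER"]

# Connection string in the key=value form accepted by PostgreSQL drivers.
dsn = "host={} dbname={} user={} password={} port={}".format(
    cluster["host"],
    cluster["db_name"],
    cluster["db_user"],
    cluster["db_password"],
    cluster.getint("db_port"),
)
print(dsn)
```

Keeping credentials in dwh.cfg rather than in the scripts makes it easier to keep them out of version control.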
- https://devopscube.com/aws-arn-guide/ for information on ARN
- https://knowledge.udacity.com/questions/96309 for cluster setup