
P3: Data Lake

This project comprises the scripts required for setting up a data lake on S3 using Spark. Sparkify had been collecting data on user activity from their music streaming application and storing it as JSON files. However, this rudimentary way of storing data made it difficult to extract insights from the data.

This directory contains the ETL process, which results in parquet tables following a star schema. Using these tables, the Sparkify analytics team will access, aggregate, and generate insights from their users' data.

Design and tables

The project is structured as follows:

.
├── README.md (this file)
├── data (sample data used for development)
├── dl.cfg (AWS credentials) 
├── database_schema.jpg (graphical view of the schema of the database)
└── etl.py (python script for generating tables)
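A minimal dl.cfg might look like the following sketch. The section and key names here are assumptions; match whatever etl.py actually reads:

```ini
[AWS]
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
```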

A star schema was selected for building the tables, which are saved as parquet files. The fact and dimension tables are built as follows:

Database schema

Tables

  • Songplays: records in log data associated with song plays. Columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
  • Users: users in the app. Columns: user_id (PK), first_name, last_name, gender, level
  • Songs: songs in music database. Columns: song_id (PK), title, artist_id, year, duration
  • Artists: artists in music database. Columns: artist_id (PK), name, location, latitude, longitude
  • Time: timestamps of records in songplays broken down into specific units. Columns: start_time (PK), hour, day, week, month, year, weekday
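The breakdown behind the Time table can be illustrated without Spark; in etl.py the equivalent fields would typically come from functions in `pyspark.sql.functions` such as `hour`, `dayofmonth`, and `weekofyear`. A minimal sketch, assuming `start_time` is a millisecond epoch timestamp (the `breakdown` helper is hypothetical, for illustration only):

```python
from datetime import datetime, timezone

def breakdown(ts_ms):
    """Decompose a millisecond epoch timestamp into the Time table columns."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": ts_ms,
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],  # ISO week number
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),      # Monday = 0 ... Sunday = 6
    }

# Example: 2018-11-02 01:25:34 UTC (a Friday in ISO week 44)
ts = int(datetime(2018, 11, 2, 1, 25, 34, tzinfo=timezone.utc).timestamp() * 1000)
row = breakdown(ts)
```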

The tables are saved as parquet files in the S3 bucket.

How to use

  1. Add the required AWS credentials to the dl.cfg file.
  2. To build the database and populate the tables, go to the terminal and execute the following in the repository path:
python etl.py

This will result in the fact and dimension tables saved as parquet files as specified previously. Now start doing some analytics! :)
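A script like etl.py would typically load the credentials from dl.cfg with `configparser` and export them as the environment variables the S3 connector looks for. A minimal sketch of that step (the section and key names are assumptions):

```python
import configparser
import os
import tempfile

def load_aws_credentials(cfg_path):
    """Read AWS credentials from a dl.cfg-style file and export them
    as environment variables."""
    config = configparser.ConfigParser()
    config.read(cfg_path)
    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# Demo with a temporary config file standing in for dl.cfg
with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as f:
    f.write("[AWS]\nAWS_ACCESS_KEY_ID=test-key\nAWS_SECRET_ACCESS_KEY=test-secret\n")
    path = f.name
load_aws_credentials(path)
```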

Sample queries

# (These samples assume an active Spark session named "spark" and the
# path of the songplays parquet table in "songplays_table_path")
from pyspark.sql.functions import avg, col

songplays_table = spark.read.parquet(songplays_table_path)

# Check most popular artists (by artist_id; join with the artists table to get names)
songplays_table.groupBy("artist_id").count().sort(col("count").desc()).show()

# What was the average number of songs listened to per user?
songplays_table.groupBy("user_id").count().agg(avg("count")).show()
