
Data Engineering ZoomCamp Course Project - US Accidents

Preface

This repository contains the course project for the Data Engineering Zoomcamp (Cohort 2023) organized by the DataTalks.Club community. The project covers the main data engineering skills taught in the course:

  • Workflow Orchestration: Data Lake, Prefect tool, ETL with GCP & Prefect
  • Data Warehouse: BigQuery
  • Analytics engineering: dbt (data build tool)
  • Data Analysis: Looker Studio

US Accidents Project

Dataset

A US car crash dataset covering 49 states. Crash data was collected from February 2016 to December 2021 using several APIs that provide streaming traffic incident (or event) data. These APIs transmit traffic data captured by a variety of entities, such as US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within road networks. The dataset currently contains around 2.8 million crash records.

The dataset has 47 columns, but for this project I selected only the columns relevant to my analysis. The following columns are used:

| # | Attribute | Description |
|---|-----------|-------------|
| 1 | ID | Unique identifier of the accident record. |
| 2 | Severity | Severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., a short delay as a result of the accident) and 4 indicates a significant impact (i.e., a long delay). |
| 3 | Start_Time | Start time of the accident in the local time zone. |
| 4 | End_Time | End time of the accident in the local time zone. End time here refers to when the impact of the accident on traffic flow was dismissed. |
| 5 | Description | Natural-language description of the accident. |
| 6 | Street | Street name in the address field. |
| 7 | City | City in the address field. |
| 8 | State | State in the address field. |
| 9 | Country | Country in the address field. |
| 10 | Weather_Condition | Weather condition (rain, snow, thunderstorm, fog, etc.). |
| 11 | Sunrise_Sunset | Period of day (day or night) based on sunrise/sunset. |
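To keep ingestion lean, the column selection above can be applied at read time with pandas' `usecols`; a minimal sketch using a tiny inline sample in place of the full Kaggle CSV (the sample values are illustrative):

```python
import io

import pandas as pd

# Columns selected for the analysis (as listed in the table above)
COLUMNS = [
    "ID", "Severity", "Start_Time", "End_Time", "Description",
    "Street", "City", "State", "Country",
    "Weather_Condition", "Sunrise_Sunset",
]

# Tiny inline sample standing in for the full ~2.8M-row Kaggle CSV
sample_csv = io.StringIO(
    "ID,Severity,Start_Time,End_Time,Description,Street,City,State,Country,"
    "Weather_Condition,Sunrise_Sunset,Extra_Column\n"
    "A-1,2,2016-02-08 05:46:00,2016-02-08 11:00:00,Right lane blocked,"
    "I-70 E,Dayton,OH,US,Light Rain,Night,ignored\n"
)

# usecols drops every column not in the selection at parse time
df = pd.read_csv(sample_csv, usecols=COLUMNS)
print(list(df.columns))
```

The same `usecols=COLUMNS` argument works unchanged when pointing `read_csv` at the downloaded dataset file.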

More information about this dataset: Author blog and Kaggle

Dataset Acknowledgments

  • Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset.", 2019.
  • Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

Architecture Diagram

Technologies Used

  • Google Cloud Platform (GCP):
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure as code (IaC)
  • dbt: Data Transformation
  • Pandas: Data Analysis & Exploration
  • Prefect: Workflow Orchestration
  • Looker Studio: Visualize Data

DW Table Structure

| # | Attribute | Description |
|---|-----------|-------------|
| 1 | accident_id | Unique identifier of the accident record. |
| 2 | severity_id | Severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., a short delay as a result of the accident) and 4 indicates a significant impact (i.e., a long delay). |
| 3 | start_date | Date on which the accident started. |
| 4 | end_date | Date on which the accident ended. |
| 5 | start_time | Start time of the accident in the local time zone. |
| 6 | end_time | End time of the accident in the local time zone. End time here refers to when the impact of the accident on traffic flow was dismissed. |
| 7 | description | Natural-language description of the accident. |
| 8 | street | Street name in the address field. |
| 9 | city | City in the address field. |
| 10 | state | State in the address field. |
| 11 | country | Country in the address field. |
| 12 | weather_condition | Weather condition (rain, snow, thunderstorm, fog, etc.). |
| 13 | sunrise_sunset | Period of day (day or night) based on sunrise/sunset. |


Partitioning and Clustering:

  • Partitioned by the start_date column, specifically by year, to obtain annual granularity
  • Clustered by the country column to group rows that share the same country value
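Annual partitioning on start_date means every row is routed to the partition for its start year; a dependency-free sketch of that routing logic (the rows are illustrative stand-ins for DW records):

```python
from collections import defaultdict
from datetime import datetime


def partition_key(start_date: str) -> int:
    """Yearly partition key derived from a start_date string (YYYY-MM-DD)."""
    return datetime.strptime(start_date, "%Y-%m-%d").year


# Toy (start_date, country) pairs standing in for DW records
rows = [
    ("2016-02-08", "US"),
    ("2016-11-30", "US"),
    ("2021-12-01", "US"),
]

# Group rows the way BigQuery would bucket them into yearly partitions
partitions = defaultdict(list)
for start_date, country in rows:
    partitions[partition_key(start_date)].append((start_date, country))

print(sorted(partitions))
```

Queries filtered on a year range then only scan the matching partitions, and within each partition the country clustering keeps rows with the same country value stored together.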

Benefits of combining clustered and partitioned tables: Combining clustered and partitioned tables

Data visualization: Dashboards

Main Questions

  1. Which US state/city/street reported the most accident cases between 2016 and 2021?
  2. What were the weather conditions in most of the accident cases in the US?
  3. Did most accidents occur at night or during the day?
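Question 1 boils down to a count-per-state (or per-city/per-street) aggregation; a pandas sketch on toy data (the real aggregation runs over the DW table in BigQuery, and the values below are made up):

```python
import pandas as pd

# Toy accident records standing in for the ~2.8M-row DW table
df = pd.DataFrame({
    "state": ["CA", "CA", "FL", "TX", "CA", "FL"],
    "city": ["Los Angeles", "Sacramento", "Miami", "Houston",
             "Los Angeles", "Orlando"],
})

# Accident counts per state, highest first
by_state = df["state"].value_counts()
print(by_state.idxmax(), int(by_state.max()))  # prints: CA 3
```

Swapping `"state"` for `"city"` or `"street"` gives the other two breakdowns of question 1.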

US Crash Accidents by State, City and Street - Dashboard


US Crash Accidents by Severity, Weather Conditions, Day/Night and Date (Year and Month)


More detailed analysis of the results obtained: Data Analysis

How to reproduce this project?

Step 1: Clone this repo and install necessary requirements

  1. Clone the repo to your local machine:
```shell
git clone git@github.com:tmaferreira/DataEngineeringZoomCampProject.git
```
  2. Install all required dependencies into your environment:
```shell
pip3 install -r requirements.txt
```

Step 2: Setup of GCP

  1. Create a Google Cloud Platform (GCP) free account with your Google e-mail

  2. Create a new GCP project named dezoomcamp-finalproject (Note: save the assigned Project ID; project IDs are globally unique, so the ID assigned to you may differ from the project name)

  3. Create a Service Account:

    • Go to IAM & Admin > Service accounts > Create service account
    • Provide a service account name and grant the roles: Viewer + BigQuery Admin + Storage Admin + Storage Object Admin
    • Download the Service Account json file
    • Download SDK for local setup
    • Set an environment variable to point to your downloaded GCP keys:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
# Refresh token/session, and verify authentication
gcloud auth application-default login
```
  4. Enable the following APIs:

Step 3: Creation of a GCP Infrastructure

  1. Install Terraform
  2. Copy files (main.tf and variables.tf) for the infrastructure creation (Use files created in Zoomcamp course: Terraform files)
  3. In the file variables.tf change variable BQ_DATASET to: us_traffic_accidents_data
  4. Execute the following commands to plan and create the GCP infrastructure:
```shell
# Initialize state file (.tfstate)
terraform init

# Check changes to the new infra plan
# -var="project=<your-gcp-project-id>"
terraform plan -var="project=dezoomcamp-finalproject"

# Create the new infra
# -var="project=<your-gcp-project-id>"
terraform apply -var="project=dezoomcamp-finalproject"
```

You can verify in the GCP console that the infrastructure was created correctly.

Step 4: Setup of Kaggle API

  1. Create a Kaggle free account

  2. Create an API token:

    • Click on your avatar
    • Go to Account menu
    • Click on the option "Create New API Token"
    • Download the json file for local setup
  3. In your local setup, copy the file into the path:
```shell
~/.kaggle/
```
  4. For your security, ensure that other users of your computer do not have read access to your credentials:
```shell
chmod 600 ~/.kaggle/kaggle.json
```

To see all available API options and commands:

```shell
kaggle --help
```
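The permission tightening in step 4 can also be done from Python with the standard library; a sketch using a temporary file in place of `~/.kaggle/kaggle.json` (the file contents are dummy values):

```python
import os
import stat
import tempfile

# Stand-in for ~/.kaggle/kaggle.json with dummy credentials
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('{"username": "example", "key": "example"}')
    path = f.name

# Owner read/write only, equivalent to `chmod 600`
os.chmod(path, 0o600)

# Read back the permission bits to confirm (0o600 on POSIX systems)
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))

os.remove(path)  # clean up the temporary file
```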

Step 5: Setup orchestration using Prefect

  1. Set up the Prefect server so that you can access the UI. Run the following command in a CL terminal:
```shell
prefect orion start
```
  2. Access the UI in your browser: http://127.0.0.1:4200/
  3. For the connection with GCS buckets it is necessary to create a block:
  • In the side menu click on the option Blocks

  • Click on the '+' button and select the GCS Bucket option

  • Fill in the required fields

  • In the Gcp Credentials field click on the Add button

  • Fill in the Block Name field

  • Using the service account json file that was downloaded in step 2, copy its content and paste it in the Service Account Info field

  • Click on the Create button and you will be redirected to the previous GCS Bucket block creation page:

  • In the Gcp Credentials field select the Gcp credential created previously

  • Click on the Create button to create the block

  4. To execute the flow, run the following command in a different CL terminal from the one running the server:
```shell
python prefect/flows/api_to_gcs_to_bq.py
```
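The flow script presumably chains an extract step, a cleanup/transform step, and loads to GCS and BigQuery; a dependency-free sketch of that shape (in the real api_to_gcs_to_bq.py these functions would carry Prefect's @task and @flow decorators, and the load step would call the GCS and BigQuery clients):

```python
def extract() -> list[dict]:
    # Stand-in for the Kaggle API download; returns raw records
    return [
        {"ID": "A-1", "Severity": "2", "State": "OH"},
        {"ID": "A-2", "Severity": "4", "State": "CA"},
    ]


def transform(rows: list[dict]) -> list[dict]:
    # Stand-in for pre-load cleanup (e.g. casting Severity to int)
    return [{**row, "Severity": int(row["Severity"])} for row in rows]


def load(rows: list[dict]) -> int:
    # Stand-in for the GCS upload + BigQuery load; returns rows written
    return len(rows)


def etl_flow() -> int:
    # Orchestrated as extract -> transform -> load, mirroring the flow graph
    return load(transform(extract()))


print(etl_flow())
```

Keeping each stage as a separate function is what lets Prefect retry, log, and visualize the steps independently in the UI.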

Step 6: Running the dbt flow

  1. Create a dbt cloud free account
  2. Clone this repo
  3. In the dbt command line, run the following command:
```shell
dbt run
```

dbt lineage generated:

Validation of created tables

Production Table

Check Data in BigQuery:

  • The data will be available at dezoomcamp-finalproject.dbt_us_traffic_accidents
  • The production version will be available at dezoomcamp-finalproject.production.dim_us_traffic_accidents (dimension table) and dezoomcamp-finalproject.production.stg_us_traffic_accidents (staging table)

Improvements

  • Add unit tests
  • Add CI/CD pipeline
  • Containerize the project
  • Perform deeper data analysis
