
DTC Project: brooklyn food waste

This repo contains my code for DataTalks.Club's DE Zoomcamp Project (project assignment here).

A huge thanks to the DataTalks.Club team for creating this great DE Zoomcamp!!

Table of Contents

  1. Goal
  2. Dataset
  3. Architecture
  4. Dashboard
  5. Recreating the project
  6. Future Development

Goal

This project aims to create a data pipeline that processes the data needed to answer the following questions:

  • Which food products are most wasted in Brooklyn?
  • Which kind of food is wasted the most? (ready-to-eat, perishable, packaged, or shelf-stable)

Dataset

The dataset can be found here. It contains data used to research the relationship between food waste and the date labels found on the wasted food items. The data was collected by picking up items at random directly from retailer trash piles in the Downtown Brooklyn neighborhood of New York City.

Architecture

Technologies used for the project:

  • Cloud: GCP
  • Infrastructure as code (IaC): Terraform
  • Workflow orchestration: Airflow
  • Data Warehouse: BigQuery
  • Batch processing: Spark
  • Containerization: Docker

(Architecture diagram)

Dashboard

The dashboard can be found here. If the data is not showing, the free credits in my Google Cloud account have run out; in that case, please see the image below. The dashboard supports some conclusions on the data, such as:

  • The most wasted food product is yogurt.
  • The average price of wasted products is approximately $5.57.
  • Most products are collected on their expiration date; however, some are wasted before that date.
  • Around 60% of wasted food products are perishable.

(Dashboard screenshot)

Recreating the project

To recreate this project, please follow the instructions below.

Kaggle Credentials

The dataset used in this project comes from Kaggle, so downloading it requires the Kaggle API and passing the credentials to Docker as environment variables.

  1. Create a Kaggle account here.
  2. Log in to your account, navigate to the account settings, and create a new API token.
  3. A JSON file called kaggle.json containing your credentials will be downloaded. Please save it under ~/.kaggle/kaggle.json.
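To check that the credentials are picked up correctly before running the pipeline, you can use the Kaggle Python package directly. This is a minimal sketch; the dataset slug below is a placeholder, the real one is configured in the ingestion DAG.

# Minimal check that ~/.kaggle/kaggle.json works; the dataset slug is a placeholder.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json (or KAGGLE_USERNAME / KAGGLE_KEY env vars)
api.dataset_download_files("owner/brooklyn-food-waste", path="./data", unzip=True)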

Google Cloud Account

For this project a free GCP account is all you need.

  1. Create an account with your Google email ID
  2. Set up your first project if you haven't already
    • e.g. "dtc-project-ritaafranco"
  3. Copy the project ID to the Docker Compose file, changing both variables as shown below (lines 70 and 71).
GCP_PROJECT_ID: '<your-gcp-project-id>'
GCP_GCS_BUCKET: 'dtc-project-data_<your-gcp-project-id>'
  4. Set up a service account & authentication for this project

    • Grant the Viewer role to begin with.
    • Download the service account key (.json) for authentication and save it under ~/.google/credentials/google_credentials.json
  5. IAM roles for the service account:

    • Go to the IAM section of IAM & Admin: https://console.cloud.google.com/iam-admin/iam
    • Click the Edit principal icon for your service account.
    • Add these roles in addition to Viewer: Storage Admin, Storage Object Admin, and BigQuery Admin.
  6. Enable these APIs for your project:
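
With the service account key and IAM roles in place, you can sanity check the setup before moving on to Terraform and Airflow. The snippet below is a minimal sketch using the google-cloud-storage Python client; it only verifies that authentication works, and the project ID placeholder follows the same convention as above.

# Quick authentication check with the service account key saved above.
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser(
    "~/.google/credentials/google_credentials.json"
)

client = storage.Client(project="<your-gcp-project-id>")
# After Terraform runs, the data lake bucket should show up in this listing.
for bucket in client.list_buckets():
    print(bucket.name)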

Terraform

Terraform is used to create the infrastructure inside the GCP project. To do so, run the following commands in your terminal:

  1. Go to the terraform folder:
cd ~/dtc-project-brooklyn-food-waste/01_terraform
  2. Init Terraform:
terraform init
  3. Run terraform plan to check that the planned changes are as expected:
terraform plan -var="project=<your-gcp-project-id>"
  4. Run terraform apply to apply the changes:
terraform apply -var="project=<your-gcp-project-id>"

Wait for the command to complete and then move on to Airflow!

After the work is done, you can use terraform destroy to delete the created resources and avoid costs from any running services.

Airflow

Airflow orchestrates the whole data pipeline: data ingestion into the data lake, data transformation, and data storage in the data warehouse. To run Airflow you need Docker Compose; run the following commands in your terminal:

  1. Navigate to the airflow folder:
cd ~/dtc-project-brooklyn-food-waste/02_airfow
  2. Build the image:
docker compose build
  3. Start Airflow:
docker compose up -d

After the containers are up, navigate to localhost:8080 and log in to Airflow. You should see 3 paused DAGs. Please enable all 3 and wait for them to start (just refresh the page). Do not trigger them manually: each DAG is triggered automatically once the previous one has finished.

  • data_ingestion_to_gcs: first DAG, fetches data from Kaggle and stores it in the Data Lake (GCS bucket). Triggers the process-food-waste-data DAG (see the trigger sketch after this list).
  • process-food-waste-data: second DAG, processes all data using Spark and stores it back in the bucket. Triggers the data_to_dw_bq DAG.
  • data_to_dw_bq: final DAG, fetches the processed data from the Data Lake and creates the BigQuery (data warehouse) tables.
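
How the chaining works: the last task of each DAG starts the next DAG. The snippet below is a minimal sketch of that pattern using Airflow's TriggerDagRunOperator, assuming Airflow 2.x; the task names and the BashOperator command are illustrative placeholders, not the repo's actual code.

# Sketch of the cross-DAG trigger chain (Airflow 2.x); task names and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="data_ingestion_to_gcs",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    # Placeholder for the real ingestion work (Kaggle download + upload to GCS).
    ingest = BashOperator(
        task_id="download_and_upload_to_gcs",
        bash_command="echo 'download from Kaggle and upload to the bucket'",
    )
    # Kick off the second DAG once ingestion has finished.
    trigger_processing = TriggerDagRunOperator(
        task_id="trigger_spark_processing",
        trigger_dag_id="process-food-waste-data",
    )
    ingest >> trigger_processing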

Once the pipeline has completed, you can create a Data Studio report to explore the data.

Data Studio

You can now create a new report and add the recently created BigQuery tables as your data source.
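
To confirm the tables exist before connecting Data Studio, here is a minimal sketch with the google-cloud-bigquery client; the dataset name below is a placeholder for the dataset created by the pipeline.

# List the tables produced by the pipeline; "<your-bq-dataset>" is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="<your-gcp-project-id>")
for table in client.list_tables("<your-bq-dataset>"):
    print(table.table_id)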

Future Development

  • Add complementary data to make the analysis more interesting, e.g. CO2 emissions produced by this food waste.
  • Adjust the data schema to create an optimized data model for the dashboard.
  • Run Spark on GCP instead of inside the Airflow container, and use Airflow to trigger jobs on that Spark cluster.
