This repo contains my code for the DataTalks.Club DE Zoomcamp project (project assignment here).
A huge thanks to the DataTalks.Club team for creating this great DE Zoomcamp!
This project builds a data pipeline that processes data to help answer the following questions:
- Which food products are most wasted in Brooklyn?
- Which kind of food is wasted the most? (ready-to-eat, perishable, packaged, or shelf-stable)
The dataset can be found here; it contains data used to research the relationship between food waste and the date labels found on wasted food items. The data was collected by picking up items at random directly from retailer trash piles in the Downtown Brooklyn neighborhood of New York City.
Technologies used in this project:
- Cloud: GCP
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Airflow
- Data warehouse: BigQuery
- Batch processing: Spark
- Containerization: Docker
The dashboard can be found here. If no data is showing, the free credits in my Google Cloud account have run out; in that case, please consult the image below. This dashboard supports some conclusions about the data, such as:
- The most wasted food product is yogurt.
- The average price of wasted products is approximately $5.57.
- Most products are collected on their expiration date, though some are wasted before that date.
- Around 60% of wasted food products are perishable.
To recreate this project, please follow the instructions below.
The dataset used for this project is from Kaggle, so to download it we need to use the Kaggle API and pass environment variables to Docker.
- Create a Kaggle account here.
- Log in to your account, navigate to the account settings, and create a new API token.
- A JSON file called `kaggle.json` containing your credentials will be downloaded. Save it under `~/.kaggle/kaggle.json`.
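As a minimal sketch, the credential setup looks like this (the JSON contents below are illustrative placeholders, not real credentials — use the `kaggle.json` downloaded from your account settings instead):

```shell
# Create the Kaggle config directory (the Kaggle CLI reads ~/.kaggle/kaggle.json)
mkdir -p ~/.kaggle

# Placeholder credentials for illustration only -- replace this file with the
# real kaggle.json downloaded from your Kaggle account settings.
printf '{"username":"<your-username>","key":"<your-api-key>"}\n' > ~/.kaggle/kaggle.json

# The Kaggle CLI warns about world-readable credentials, so restrict permissions:
chmod 600 ~/.kaggle/kaggle.json
```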
For this project a free GCP account is all you need.
- Create an account with your Google email ID
- Setup your first project if you haven't already
- eg. "dtc-project-ritaafranco"
- Copy the project ID into the Docker Compose file, changing both variables as shown below (lines 70 and 71):
GCP_PROJECT_ID: '<your-gcp-project-id>'
GCP_GCS_BUCKET: 'dtc-project-data_<your-gcp-project-id>'
- Set up a service account & authentication for this project:
  - Grant the `Viewer` role to begin with.
  - Download the service account key (`.json`) for authentication and save it under `~/.google/credentials/google_credentials.json`.
- Grant IAM roles to the service account:
  - Go to the IAM section of IAM & Admin: https://console.cloud.google.com/iam-admin/iam
  - Click the Edit principal icon for your service account.
  - Add these roles in addition to `Viewer`: Storage Admin, Storage Object Admin, and BigQuery Admin.
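If you prefer the terminal, the same role grants can be made with the gcloud CLI. This is a sketch: the service account name `dtc-project-sa` is a placeholder assumption, not a name used by this project.

```shell
# Placeholder project and service-account names -- replace with your own.
PROJECT="<your-gcp-project-id>"
SA="serviceAccount:dtc-project-sa@${PROJECT}.iam.gserviceaccount.com"

# Grant the three roles listed above to the service account
gcloud projects add-iam-policy-binding "$PROJECT" --member="$SA" --role="roles/storage.admin"
gcloud projects add-iam-policy-binding "$PROJECT" --member="$SA" --role="roles/storage.objectAdmin"
gcloud projects add-iam-policy-binding "$PROJECT" --member="$SA" --role="roles/bigquery.admin"
```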
- Enable these APIs for your project:
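The APIs can also be enabled from the CLI. The exact list is an assumption based on the services this project uses for service-account authentication:

```shell
# Enable the IAM and IAM Service Account Credentials APIs (assumed list)
gcloud services enable \
  iam.googleapis.com \
  iamcredentials.googleapis.com \
  --project "<your-gcp-project-id>"
```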
Terraform is used to create the infrastructure inside the GCP project. Run the following commands in your terminal:
- Go to terraform folder:
cd ~/dtc-project-brooklyn-food-waste/01_terraform
- Init terraform
terraform init
- Run terraform plan to check that all changes are according to plan.
terraform plan -var="project=<your-gcp-project-id>"
- Run terraform apply to enforce the changes.
terraform apply -var="project=<your-gcp-project-id>"
Wait for the command to complete and then move on to Airflow!
After the work is done, you can use `terraform destroy` to delete the created services and avoid costs from any running services.
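Following the same variable pattern as the plan/apply commands above, teardown would look like:

```shell
terraform destroy -var="project=<your-gcp-project-id>"
```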
Airflow orchestrates the whole data pipeline: data ingestion into the data lake, data transformation, and data storage in the data warehouse. To run Airflow you need Docker Compose; then run the following commands in your terminal:
- Navigate to airflow folder:
cd ~/dtc-project-brooklyn-food-waste/02_airfow
- Build the image:
docker compose build
- Start Airflow:
docker compose up -d
After the containers are up, navigate to localhost:8080 and log in to Airflow. You should see 3 DAGs, all paused. Please enable all 3 and wait for them to start (just refresh the page). Do not trigger them manually: each DAG is triggered automatically once the previous one finishes.
- `data_ingestion_to_gcs`: first DAG; fetches data from Kaggle and stores it in the data lake (GCS bucket). Triggers the `process-food-waste-data` DAG.
- `process-food-waste-data`: second DAG; processes all the data using Spark and stores it in the bucket. Triggers the `data_to_dw_bq` DAG.
- `data_to_dw_bq`: final DAG; fetches the processed data from the data lake and creates the BigQuery (data warehouse) tables.
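The DAG chain can also be followed from the terminal with the Airflow CLI inside the webserver container. Note that the compose service name `airflow-webserver` is an assumption about the compose file and may differ:

```shell
# List the three DAGs and inspect the run history of the first one
docker compose exec airflow-webserver airflow dags list
docker compose exec airflow-webserver airflow dags list-runs -d data_ingestion_to_gcs
```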
Once the pipeline is completed you can create a Data Studio report to check the data.
You can now create a new report and add the newly created BigQuery tables as your data source.
- Add complementary data to make the analysis more interesting, e.g. the CO2 emissions produced by this food waste.
- Adjust the data schema to create an optimized data model for the dashboard.
- Run Spark on GCP instead of inside the Airflow container, then use Airflow to trigger that Spark cluster.