This repo contains my code for the DataTalks.Club DE Zoomcamp project (project assignment here).
A huge thanks to the DataTalks.Club team for creating this great DE Zoomcamp!
This project builds a data pipeline that processes data to help answer the following questions:
- Which food products are most wasted in Brooklyn?
- Which kind of food is wasted the most? (ready-to-eat, perishable, packaged, or shelf-stable)
The dataset can be found here; it contains data used to research the relationship between food waste and the date labels found on wasted food items. The data was collected by picking up items at random directly from retailer trash piles in the Downtown Brooklyn neighborhood of New York City.
Technologies used in this project:
- Cloud: GCP
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Airflow
- Data warehouse: BigQuery
- Batch processing: Spark
- Containerization: Docker
The dashboard can be found here. If no data is showing, the free credits in my Google Cloud account have run out; in that case, please consult the image below. This dashboard supports some conclusions about the data, such as:
- The most wasted food product is yogurt.
- The average price of wasted products is approximately $5.57.
- Most products are collected on their expiration date, though some are wasted before that date.
- Around 60% of wasted food products are perishable.
To recreate this project, please follow the instructions below.
The dataset used for this project is from Kaggle, so to download it we need to use the Kaggle API and pass environment variables to Docker.
- Create a Kaggle account here.
- Log in to your account, navigate to the account settings, and create a new API token.
- A JSON file called `kaggle.json` containing your credentials will be downloaded. Save it under `~/.kaggle/kaggle.json`.
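As a minimal sketch, the credential setup looks like this (the JSON contents below are illustrative placeholders, not real credentials — use the `kaggle.json` downloaded from your account settings instead):

```shell
# Create the Kaggle config directory (the Kaggle CLI reads ~/.kaggle/kaggle.json)
mkdir -p ~/.kaggle

# Placeholder credentials for illustration only -- replace this file with the
# real kaggle.json downloaded from your Kaggle account settings.
printf '{"username":"<your-username>","key":"<your-api-key>"}\n' > ~/.kaggle/kaggle.json

# The Kaggle CLI warns about world-readable credentials, so restrict permissions:
chmod 600 ~/.kaggle/kaggle.json
```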
For this project a free GCP account is all you need.
- Create an account with your Google email ID
- Setup your first project if you haven't already
- eg. "dtc-project-ritaafranco"
- Copy the project ID into the Docker Compose file, changing both variables as shown below (lines 70 and 71):
GCP_PROJECT_ID: '<your-gcp-project-id>'
GCP_GCS_BUCKET: 'dtc-project-data_<your-gcp-project-id>'
- Set up a service account & authentication for this project:
  - Grant the `Viewer` role to begin with.
  - Download the service account key (`.json`) for authentication and save it under `~/.google/credentials/google_credentials.json`.
- Grant IAM roles to the service account:
  - Go to the IAM section of IAM & Admin: https://console.cloud.google.com/iam-admin/iam
  - Click the Edit principal icon for your service account.
  - Add these roles in addition to `Viewer`: Storage Admin, Storage Object Admin, and BigQuery Admin.
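If you prefer the terminal, the same role grants can be made with the gcloud CLI. This is a sketch: the service account name `dtc-project-sa` is a placeholder assumption, not a name used by this project.

```shell
# Placeholder project and service-account names -- replace with your own.
PROJECT="<your-gcp-project-id>"
SA="serviceAccount:dtc-project-sa@${PROJECT}.iam.gserviceaccount.com"

# Grant the three roles listed above to the service account
gcloud projects add-iam-policy-binding "$PROJECT" --member="$SA" --role="roles/storage.admin"
gcloud projects add-iam-policy-binding "$PROJECT" --member="$SA" --role="roles/storage.objectAdmin"
gcloud projects add-iam-policy-binding "$PROJECT" --member="$SA" --role="roles/bigquery.admin"
```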
- Enable these APIs for your project:
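The APIs can also be enabled from the CLI. The exact list is an assumption based on the services this project uses for service-account authentication:

```shell
# Enable the IAM and IAM Service Account Credentials APIs (assumed list)
gcloud services enable \
  iam.googleapis.com \
  iamcredentials.googleapis.com \
  --project "<your-gcp-project-id>"
```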
Terraform is used to create the infrastructure inside the GCP project. Run the following commands in your terminal:
- Go to terraform folder:
cd ~/dtc-project-brooklyn-food-waste/01_terraform
- Init terraform
terraform init
- Run terraform plan to check that all changes are according to plan.
terraform plan -var="project=<your-gcp-project-id>"
- Run terraform apply to enforce the changes.
terraform apply -var="project=<your-gcp-project-id>"
Wait for the command to complete and then move on to Airflow!
After the work is done, you can use `terraform destroy` to delete the created services and avoid costs from any running services.
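Following the same variable pattern as the plan/apply commands above, teardown would look like:

```shell
terraform destroy -var="project=<your-gcp-project-id>"
```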
Airflow orchestrates the whole data pipeline: data ingestion into the data lake, data transformation, and data storage in the data warehouse. To run Airflow you need Docker Compose; then run the following commands in your terminal:
- Navigate to airflow folder:
cd ~/dtc-project-brooklyn-food-waste/02_airfow
- Build the image:
docker compose build
- Start Airflow:
docker compose up -d
After the containers are up, navigate to localhost:8080 and log in to Airflow. You should see 3 DAGs, all paused. Please enable all 3 and wait for them to start (just refresh the page). Do not trigger them manually: each DAG is triggered automatically once the previous one finishes.
- `data_ingestion_to_gcs`: first DAG; fetches data from Kaggle and stores it in the data lake (GCS bucket). Triggers the `process-food-waste-data` DAG.
- `process-food-waste-data`: second DAG; processes all the data using Spark and stores it in the bucket. Triggers the `data_to_dw_bq` DAG.
- `data_to_dw_bq`: final DAG; fetches the processed data from the data lake and creates the BigQuery (data warehouse) tables.
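The DAG chain can also be followed from the terminal with the Airflow CLI inside the webserver container. Note that the compose service name `airflow-webserver` is an assumption about the compose file and may differ:

```shell
# List the three DAGs and inspect the run history of the first one
docker compose exec airflow-webserver airflow dags list
docker compose exec airflow-webserver airflow dags list-runs -d data_ingestion_to_gcs
```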
Once the pipeline is completed you can create a Data Studio report to check the data.
You can now create a new report and add the newly created BigQuery tables as your data source.
- Add complementary data to make the analysis more interesting, e.g. the CO2 emissions produced by this food waste.
- Adjust the data schema to create an optimized data model for the dashboard.
- Run Spark on GCP instead of inside the Airflow container, then use Airflow to trigger that Spark cluster.