Check the dashboard ***HERE***.
One of my primary medium-term goals is to purchase my own home. Being data-driven, I actively sought out data to analyze the real estate market, exploring opportunities within my city and country to ensure I make the most informed decision tailored to my needs and budget.
Unfortunately, there are no freely accessible data sources for my country comparable to those available for the United States. So, I decided to create my own dataset by scraping a property sales website.
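To give an idea of what the scraping step looks like, here is a minimal sketch of a Scrapy spider. The selectors, field names, and URL are illustrative assumptions, not the actual site's markup; the `parse_price` helper is pure Python and usable on its own.

```python
import re

def parse_price(raw):
    """Normalize a scraped price string like '€ 1.250.000' or 'US$350,000'
    to an integer, or None when the listing has no numeric price."""
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else None

# The spider itself requires `pip install scrapy`; it is guarded so the
# helper above can be used and tested without Scrapy installed.
try:
    import scrapy

    class ListingsSpider(scrapy.Spider):
        name = "listings"
        start_urls = ["https://example.com/properties"]  # placeholder URL

        def parse(self, response):
            for card in response.css("div.listing"):  # assumed markup
                yield {
                    "title": card.css("h2::text").get(),
                    "price": parse_price(card.css(".price::text").get()),
                    "url": response.urljoin(card.css("a::attr(href)").get()),
                }
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
except ImportError:
    pass
```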
Because new properties hit the market every day, the scraping script runs every two weeks (this will be automated). The fresh data then goes through an automated processing pipeline before being stored in a data warehouse, ready to fuel an analytics dashboard.
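As a flavor of what the processing pipeline does, here is a minimal sketch of the kind of cleaning applied to each batch, shown on plain Python dicts rather than a Spark DataFrame for clarity. The column names, deduplication rule, and thresholds are assumptions for illustration.

```python
def clean_rows(rows):
    """Drop unusable listings and derive a price-per-square-metre column.
    Mirrors, on plain dicts, the kind of filtering/derivation a PySpark
    job would perform; column names and rules are assumptions."""
    cleaned = []
    seen_urls = set()
    for row in rows:
        price, area, url = row.get("price"), row.get("area_m2"), row.get("url")
        if not price or not area or area <= 0:
            continue  # listings without a price or area can't be analyzed
        if url in seen_urls:
            continue  # the same listing often reappears across scrape runs
        seen_urls.add(url)
        cleaned.append({**row, "price_per_m2": round(price / area, 2)})
    return cleaned

sample = [
    {"url": "a", "price": 200_000, "area_m2": 100},
    {"url": "a", "price": 200_000, "area_m2": 100},  # duplicate listing
    {"url": "b", "price": None, "area_m2": 80},      # missing price
]
print(clean_rows(sample))  # one surviving row, price_per_m2 == 2000.0
```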
Web Scraping:
- Python
- Scrapy

Cloud:
- Google Cloud Platform (GCP)
- Terraform (IaC)

Data Ingestion (batch):
- Python
- Google Cloud Storage

Data Warehousing:
- Google BigQuery

Data Transformations and Processing:
- PySpark (Dataproc)

Orchestration and Automation:
- Airflow (Cloud Composer)

Dashboarding:
- Google Looker Studio
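The batch ingestion step (Python to Google Cloud Storage) can be sketched as below. The bucket layout is an assumption; the upload itself uses the `google-cloud-storage` client, imported lazily so the path helper works without the library installed.

```python
from datetime import date

def dated_blob_name(run_date, filename="raw.csv"):
    """Build a date-partitioned object name so each biweekly run is stored
    separately, e.g. 'raw/2024-01-15/raw.csv' (layout is an assumption)."""
    return f"raw/{run_date.isoformat()}/{filename}"

def upload_raw_csv(bucket_name, local_path, run_date=None):
    """Upload the scraped CSV to a GCS bucket. Requires
    `pip install google-cloud-storage` and GCP credentials."""
    from google.cloud import storage  # lazy import, see lead-in
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(
        dated_blob_name(run_date or date.today())
    )
    blob.upload_from_filename(local_path)
```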
If you consider yourself a visual learner, I put together this brief animation of the architecture and workflow.
- `project_workflow.py`: Python script with the Airflow DAG and tasks that orchestrate and automate all the steps of the project.
- `pipeline.ipynb`: Jupyter Notebook for data cleaning and transformation. Its purpose is to test the transformations and run them step by step. If you want to run it, make sure you have Apache Spark installed (I used a GCP Virtual Machine).
- `pipeline.py`: Clean Python script with all the transformations and the BigQuery connector. This script is submitted as a Dataproc job.
- `clean.parquet`: Processed data, ready to be ingested into BigQuery.
- `raw.csv`: Raw data, straight out of the scraping process.
- `main.tf`: Main Terraform configuration file, defining the infrastructure as code used to provision (or destroy) all the necessary cloud resources.
- `variables.tf`: Variable definitions used in the main Terraform file (`main.tf`), making the code easier to customize and reuse.
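As a rough outline of how a DAG like the one in `project_workflow.py` can chain these steps, here is a hedged sketch. The task names and the cron expression are assumptions (the project only states that runs happen every two weeks); the Airflow wiring is guarded so the step list remains usable without Airflow installed.

```python
# Step names are assumptions for illustration, not the project's actual task IDs.
STEPS = [
    "scrape_listings",
    "ingest_raw_to_gcs",
    "run_dataproc_pipeline",
    "load_clean_to_bigquery",
]

try:
    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="real_estate_pipeline",
        schedule="0 6 1,15 * *",  # twice a month, roughly biweekly; an assumption
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        tasks = [
            BashOperator(task_id=step, bash_command=f"echo running {step}")
            for step in STEPS
        ]
        for upstream, downstream in zip(tasks, tasks[1:]):
            upstream >> downstream  # run the steps strictly in order
except ImportError:
    dag = None  # Airflow not available in this environment
```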
This folder contains all the Scrapy spiders and configuration for data extraction.
It also contains the Dockerfile used to build the image and submit it to Google Cloud Run.
For the instructions and walkthrough, please refer to TUTORIAL.md.