Check the dashboard ***HERE***.
One of my primary medium-term goals is to purchase my own home. Being data-driven, I actively sought out data to analyze the real estate market, exploring opportunities within my city and country to ensure I make the most informed decision tailored to my needs and budget.
Unfortunately, there are no freely accessible data sources for my country comparable to those available for the United States. So, I decided to create my own dataset by scraping a property sales website.
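To give an idea of what the scraping step looks like, here is a minimal sketch of a Scrapy spider. The selectors, field names, and URL are illustrative assumptions, not the actual site's markup; the `parse_price` helper is pure Python and usable on its own.

```python
import re

def parse_price(raw):
    """Normalize a scraped price string like '€ 1.250.000' or 'US$350,000'
    to an integer, or None when the listing has no numeric price."""
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else None

# The spider itself requires `pip install scrapy`; it is guarded so the
# helper above can be used and tested without Scrapy installed.
try:
    import scrapy

    class ListingsSpider(scrapy.Spider):
        name = "listings"
        start_urls = ["https://example.com/properties"]  # placeholder URL

        def parse(self, response):
            for card in response.css("div.listing"):  # assumed markup
                yield {
                    "title": card.css("h2::text").get(),
                    "price": parse_price(card.css(".price::text").get()),
                    "url": response.urljoin(card.css("a::attr(href)").get()),
                }
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
except ImportError:
    pass
```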
Because new properties hit the market every day, the scraping script runs every two weeks (this will be automated). The fresh data then goes through an automated processing pipeline before being stored in a data warehouse, ready to fuel an analytics dashboard.
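As a flavor of what the processing pipeline does, here is a minimal sketch of the kind of cleaning applied to each batch, shown on plain Python dicts rather than a Spark DataFrame for clarity. The column names, deduplication rule, and thresholds are assumptions for illustration.

```python
def clean_rows(rows):
    """Drop unusable listings and derive a price-per-square-metre column.
    Mirrors, on plain dicts, the kind of filtering/derivation a PySpark
    job would perform; column names and rules are assumptions."""
    cleaned = []
    seen_urls = set()
    for row in rows:
        price, area, url = row.get("price"), row.get("area_m2"), row.get("url")
        if not price or not area or area <= 0:
            continue  # listings without a price or area can't be analyzed
        if url in seen_urls:
            continue  # the same listing often reappears across scrape runs
        seen_urls.add(url)
        cleaned.append({**row, "price_per_m2": round(price / area, 2)})
    return cleaned

sample = [
    {"url": "a", "price": 200_000, "area_m2": 100},
    {"url": "a", "price": 200_000, "area_m2": 100},  # duplicate listing
    {"url": "b", "price": None, "area_m2": 80},      # missing price
]
print(clean_rows(sample))  # one surviving row, price_per_m2 == 2000.0
```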
Web Scraping:
- Python
- Scrapy

Cloud:
- Google Cloud Platform (GCP)
- Terraform (IaC)

Data Ingestion (batch):
- Python
- Google Cloud Storage

Data Warehousing:
- Google BigQuery

Data Transformations and Processing:
- PySpark (Dataproc)

Orchestration and Automation:
- Airflow (Cloud Composer)

Dashboarding:
- Google Looker Studio
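The batch ingestion step (Python to Google Cloud Storage) can be sketched as below. The bucket layout is an assumption; the upload itself uses the `google-cloud-storage` client, imported lazily so the path helper works without the library installed.

```python
from datetime import date

def dated_blob_name(run_date, filename="raw.csv"):
    """Build a date-partitioned object name so each biweekly run is stored
    separately, e.g. 'raw/2024-01-15/raw.csv' (layout is an assumption)."""
    return f"raw/{run_date.isoformat()}/{filename}"

def upload_raw_csv(bucket_name, local_path, run_date=None):
    """Upload the scraped CSV to a GCS bucket. Requires
    `pip install google-cloud-storage` and GCP credentials."""
    from google.cloud import storage  # lazy import, see lead-in
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(
        dated_blob_name(run_date or date.today())
    )
    blob.upload_from_filename(local_path)
```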
If you consider yourself a visual learner, I put together this brief animation of the architecture and workflow.
- `project_workflow.py`: Python script with the Airflow DAG and tasks that orchestrate and automate all the steps of the project.
- `pipeline.ipynb`: Jupyter Notebook for data cleaning and transformation. Its purpose is to test the transformations and run them step by step. If you want to run it, make sure you have Apache Spark installed (I used a GCP Virtual Machine).
- `pipeline.py`: Clean Python script with all the transformations and the BigQuery connector. This script is submitted as a Dataproc job.
- `clean.parquet`: Processed data, ready to be ingested into BigQuery.
- `raw.csv`: Raw data, straight out of the scraping process.
- `main.tf`: Main Terraform configuration file, defining the infrastructure as code used to provision (or destroy) all the necessary cloud resources.
- `variables.tf`: Variable definitions used in the main Terraform file (`main.tf`), making the code easier to customize and reuse.
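As a rough outline of how a DAG like the one in `project_workflow.py` can chain these steps, here is a hedged sketch. The task names and the cron expression are assumptions (the project only states that runs happen every two weeks); the Airflow wiring is guarded so the step list remains usable without Airflow installed.

```python
# Step names are assumptions for illustration, not the project's actual task IDs.
STEPS = [
    "scrape_listings",
    "ingest_raw_to_gcs",
    "run_dataproc_pipeline",
    "load_clean_to_bigquery",
]

try:
    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="real_estate_pipeline",
        schedule="0 6 1,15 * *",  # twice a month, roughly biweekly; an assumption
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        tasks = [
            BashOperator(task_id=step, bash_command=f"echo running {step}")
            for step in STEPS
        ]
        for upstream, downstream in zip(tasks, tasks[1:]):
            upstream >> downstream  # run the steps strictly in order
except ImportError:
    dag = None  # Airflow not available in this environment
```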
This folder contains all the Scrapy spiders and configuration for data extraction.
It also contains the Dockerfile used to build the image and submit it to Google Cloud Run.
For the instructions and walkthrough, please refer to TUTORIAL.md.