
Real Estate Market in Peru

Check the dashboard ***HERE***.

Context:

One of my primary medium-term goals is to purchase my own home. Being data-driven, I actively sought out data to analyze the real estate market, exploring opportunities within my city and country to ensure I make the most informed decision tailored to my needs and budget.

Problem:

Unfortunately, there are no freely accessible data sources for my country comparable to those available for the United States. So, I decided to create my own dataset by scraping a property sales website.

Strategy:

Because new properties hit the market every day, I'll run my scraping script every two weeks (I'll automate this). The fresh data will then go through an automated data processing pipeline before getting stored in a data warehouse, ready to fuel an analytics dashboard.

Tools:

  1. Web Scraping:
    • Python
    • Scrapy
  2. Cloud:
    • Google Cloud Platform (GCP)
    • Terraform (IaC)
  3. Data Ingestion (batch):
    • Python
    • Google Cloud Storage
  4. Data Warehousing:
    • Google BigQuery
  5. Data Transformations and Processing:
    • PySpark (Dataproc)
  6. Orchestration and Automation:
    • Airflow (Cloud Composer)
  7. Dashboarding:
    • Google Looker Studio


Architecture

If you consider yourself a visual learner, I put together this brief animation of the architecture and workflow.

Architecture

Folder navigation

airflow

- project_workflow.py

Python script containing the Airflow DAG and tasks that orchestrate and automate every step of the project.
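The DAG chains the project's stages in strict sequence: scrape, upload to Cloud Storage, then transform on Dataproc. As a standalone, Airflow-free sketch of that ordering (function names, the bucket name, and all return values are illustrative placeholders, not the contents of the real DAG):

```python
# Standalone sketch of the task ordering in the project's DAG:
# scrape -> upload to GCS -> submit the Dataproc job.
# Bodies are placeholders; the real tasks live in project_workflow.py.

def scrape_listings() -> str:
    """Run the scraping step and return the path of the raw file."""
    return "raw.csv"

def upload_to_gcs(local_path: str) -> str:
    """Stage the raw file in a Cloud Storage bucket (placeholder name)."""
    return f"gs://example-bucket/{local_path}"

def submit_dataproc_job(gcs_uri: str) -> str:
    """Submit pipeline.py as a Dataproc job reading the staged file."""
    return f"processed:{gcs_uri}"

def run_pipeline() -> str:
    # Tasks run strictly in sequence, mirroring the DAG's dependencies.
    raw = scrape_listings()
    staged = upload_to_gcs(raw)
    return submit_dataproc_job(staged)

print(run_pipeline())
```

In the actual DAG these steps become Airflow tasks wired with the same dependencies, scheduled to run every two weeks.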

batch_processing

- pipeline.ipynb

Jupyter Notebook for data cleaning and transformation. This notebook is used to test the transformations and run them step by step. If you want to run it, make sure you have Apache Spark installed (I used a GCP virtual machine).

- pipeline.py

Clean Python script with all the transformations and the BigQuery connector. This script is submitted as a Dataproc job.
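The transformations themselves are column-wise cleaning steps; Spark just distributes them. As a hedged, Spark-free sketch of the kind of cleaning such a job performs (the column names, currency format, and rules here are assumptions, not the project's actual schema):

```python
# Spark-free sketch of typical listing-cleaning steps: parse a scraped
# price string into a float and drop rows without a usable price.
# The real job in pipeline.py does this with PySpark DataFrames and then
# writes the result to BigQuery; column names here are illustrative.
from typing import Optional

def clean_price(raw: str) -> Optional[float]:
    """Parse a scraped price string like 'S/ 350,000' into a float."""
    digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    return float(digits) if digits else None

def clean_listings(rows: list) -> list:
    cleaned = []
    for row in rows:
        price = clean_price(row.get("price", ""))
        if price is None:  # drop listings without a usable price
            continue
        cleaned.append({**row, "price": price})
    return cleaned

listings = [
    {"district": "Miraflores", "price": "S/ 350,000"},
    {"district": "Surco", "price": ""},
]
print(clean_listings(listings))  # only the Miraflores row survives
```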

datasets

- clean.parquet

Processed data, ready to be ingested into BigQuery.

- raw.csv

Raw data, right out of the scraping process.

terraform

- main.tf

Main configuration file for Terraform, defining the infrastructure as code to provision (or destroy) all the necessary resources in the cloud.

- variables.tf

File containing variable definitions used in the main Terraform file (main.tf), facilitating code customization and reuse.

web_scraping

This folder contains all the Scrapy spiders and configuration for data extraction.
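At its core, each spider extracts structured fields from a listing's HTML. As a library-free illustration of that extraction step (the real spiders use Scrapy's CSS/XPath selectors; the markup, class names, and regexes below are made-up examples):

```python
# Minimal, Scrapy-free illustration of what a spider's parse step does:
# pull the price and district out of a listing's HTML. The actual spiders
# use Scrapy selectors; this regex version is only a sketch over fake markup.
import re

LISTING_HTML = """
<div class="listing">
  <span class="price">S/ 420,000</span>
  <span class="district">San Isidro</span>
</div>
"""

def parse_listing(html: str) -> dict:
    """Extract one listing's fields with regexes (illustrative only)."""
    price = re.search(r'class="price">([^<]+)<', html)
    district = re.search(r'class="district">([^<]+)<', html)
    return {
        "price": price.group(1) if price else None,
        "district": district.group(1) if district else None,
    }

print(parse_listing(LISTING_HTML))
```

In Scrapy, the same logic would live in the spider's `parse` method, yielding one item per listing page.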

It also contains the Dockerfile used to build the image and submit it to Google Cloud Run.

Tutorial

For the instructions and walkthrough, please refer to TUTORIAL.md.
