arturomtz8 / de-zoomcamp-project

My personal project for data engineering zoomcamp

Home Page: https://github.com/DataTalksClub/data-engineering-zoomcamp

Topics: big-query, data-engineering, dbt, github-actions, google-cloud-platform, google-cloud-storage, praw-reddit, prefect, python, reddit-crawler


Paranormal subreddits ELT

Objective

This project extracts, loads, and transforms the top posts and comments from four subreddits: Ghoststories, Ghosts, Paranormal, and ParanormalEncounters. By analyzing the data, I aim to determine the hourly distribution of posts, the correlation between post scores and comment counts, the average number of comments per post, the maximum post score, the number of unique authors, and the most frequently used words in the posts' text. To access the data, I use PRAW, the Python Reddit API Wrapper, which simplifies interacting with Reddit's posts, comments, and subreddits.
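For illustration, a minimal PRAW sketch that pulls the top posts of the day from the four subreddits might look like this (the credentials are placeholders; in this project they are supplied through Prefect Secrets, as described in the Prerequisites section):

    import praw

    # Placeholder credentials: the real flows load these from Prefect Secrets.
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="paranormal-elt by u/your_username",
    )

    subreddits = ["Ghoststories", "Ghosts", "Paranormal", "ParanormalEncounters"]
    for name in subreddits:
        # Top posts of the day for each subreddit
        for post in reddit.subreddit(name).top(time_filter="day", limit=25):
            print(post.id, post.score, post.num_comments, post.title)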

Architecture

Technologies used

Data pipeline

Most of the project's orchestration is done via GitHub Actions workflows located in .github/workflows/ and Prefect flows located in flows/.

GitHub Actions is mainly used to run jobs (Python scripts and dbt commands) on a cron schedule, while Prefect is responsible for creating the flows and connecting to Google Cloud services securely using Blocks and Secrets.
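The general shape of such a flow is sketched below; the task and flow names here are illustrative, not the actual ones in flows/:

    from prefect import flow, task

    @task
    def fetch_top_posts(subreddit: str) -> list:
        # Hypothetical task: the real implementation queries PRAW.
        return []

    @task
    def store_posts(posts: list) -> None:
        # Hypothetical task: the real implementation writes to GCS and BigQuery.
        pass

    @flow
    def scrape_subreddit(subreddit: str) -> None:
        posts = fetch_top_posts(subreddit)
        store_posts(posts)

    if __name__ == "__main__":
        scrape_subreddit("Paranormal")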

To keep the data consistent, each PRAW query stores its results in both Google Cloud Storage and BigQuery, and any posts or comments already stored in those locations are excluded to avoid duplication.
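A minimal sketch of that deduplication step, with hypothetical helper names (the real logic lives in the flow scripts):

    def filter_new_submissions(fetched, stored_ids):
        """Keep only submissions whose Reddit id has not been stored yet.

        `fetched` is an iterable of PRAW submissions; `stored_ids` is the set
        of ids already present in GCS/BigQuery (both hypothetical inputs).
        """
        return [s for s in fetched if s.id not in stored_ids]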

At four intervals throughout the day (3 am, 9 am, 3 pm, and 9 pm), I execute a query on the top posts from the selected subreddits.

At 3:50 am, I query the comments of each post to extract information about the most commonly used words and other details.

Finally, at 4:10 am, I use dbt to clean and prepare the data stored in BigQuery for analysis, and then serve it in Google Looker Studio.

With all the posts and comments saved in Google Cloud Storage, I generate wordclouds and graphs that display the most commonly used words found in the post titles, post text, and comment bodies, which are saved here:
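For context, wordclouds like these are commonly generated with the wordcloud package; below is a minimal sketch, assuming the text has already been downloaded from Google Cloud Storage (whether this project uses exactly this package is an assumption):

    from wordcloud import WordCloud

    # Hypothetical input: all post titles concatenated into one string,
    # already fetched from Google Cloud Storage.
    titles_text = "ghost shadow figure footsteps sleep paralysis whisper"

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate(titles_text).to_file("titles_wordcloud.png")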

Prerequisites for running the project

  • If you don't already have one, create a project in Google Cloud; you can follow the docs

  • Download Terraform, move into the terraform folder (cd terraform/), and run terraform init to initialize a working directory containing the Terraform configuration files

    • Create one environment variable in your terminal with the name TF_VAR_project that contains your project id from Google Cloud:
      $ export TF_VAR_project=your_project_id_from_google_cloud
    • Then build the infrastructure with the following commands:
      $ terraform plan -out init_infra.tfplan
      $ terraform apply init_infra.tfplan
  • Python >=3.9 and <3.11

  • To set up the dependencies required to run the Python scripts in this repository, follow the documentation to install Poetry, then run poetry install from the root directory of the repository. This creates an environment with the dependencies pinned in pyproject.toml, ensuring that you have the same library versions as I use and making it easy to run the scripts

  • You will also need to create an agent that interacts with PRAW; the steps are described here: https://praw.readthedocs.io/en/stable/getting_started/quick_start.html#prerequisites

  • Use Prefect Cloud to configure the Secrets and Blocks used by the flow scripts. Here is a guide for configuring Prefect Cloud, and here is an explanation of the concepts of Blocks and Secrets (a sketch of loading them in code appears after this list)

  • Create a project in dbt with the name dbt_reddit:

    $ poetry run dbt init dbt_reddit
  • Optional: if you want to run it in GitHub Actions, you will have to create the following secrets for your repo:

    • DBT_ENV_SECRET_GOOGLE_DATASET (the Google dataset that contains your table)
    • DBT_ENV_SECRET_PROJECT_ID (your Google project id)
    • EMAIL (the email linked to your GitHub account)
    • KEYFILE_CONTENTS (the contents of the JSON key file from your Google service account)
    • PREFECT_API_KEY (your API key for connecting to Prefect Cloud)
    • PREFECT_WORKSPACE (the name of your workspace in Prefect Cloud)
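As referenced above, here is a minimal sketch of loading Prefect Secrets to build the PRAW agent; the secret names are hypothetical, so match them to whatever you create in Prefect Cloud:

    import praw
    from prefect.blocks.system import Secret

    # Hypothetical secret names; create matching Secret blocks in Prefect Cloud.
    client_id = Secret.load("reddit-client-id").get()
    client_secret = Secret.load("reddit-client-secret").get()

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent="paranormal-elt by u/your_username",
    )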

Run the scripts

Before running anything, make sure you have completed the steps in the Prerequisites section.

  • To run the Prefect scripts and scrape the top posts and comments of the subreddits, simply run from the root directory of the repo:

    $ poetry run python flows/reddit_posts_scraper.py
    $ poetry run python flows/reddit_comments_scraper.py
  • To update the BigQuery tables with the contents of Google Cloud Storage, you will have to run the following (a sketch of this load step appears after this list):

    # for updating the BigQuery table of posts
    $ poetry run python flows/update_posts_in_bq.py
    # for updating the BigQuery table of comments
    $ poetry run python flows/update_comments_in_bq.py
  • To generate plots of the most commonly used words in the titles, text, and comments of the posts, run:

    $ poetry run python flows/create_plots.py
  • To run dbt models:

    $ cd dbt_reddit/
    $ poetry run dbt run
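As mentioned above, the GCS-to-BigQuery update step generally amounts to a load job like the sketch below; the bucket, dataset, and file format are hypothetical, and the actual logic lives in flows/update_posts_in_bq.py and flows/update_comments_in_bq.py:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical URI and table id; substitute your own bucket and dataset.
    uri = "gs://your-bucket/posts/*.parquet"
    table_id = "your_project.your_dataset.posts"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()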

The dashboard

All of the graphs are created from BigQuery data and are designed to update dynamically in response to user input via the tiles interface. However, the wordcloud of the posts' body is a static image of the most frequent words and does not change in response to the tiles.

Link to the dashboard

