
Youtube API Data Engineering Project

Problem

This is a simple project that pulls data from the YouTube API for 10 data analyst channels, transforms it, and loads it into a data warehouse.

Dataset

The chosen dataset for this project is the YouTube API. The API provides data about YouTube channels and video details such as number of views, likes, publish date, title, etc.

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Airflow: Pipeline Orchestration

Project details and implementation

This project makes use of Google Cloud Platform, particularly Google Cloud Storage (GCS) and BigQuery (BQ).

Cloud infrastructure is mostly managed with Terraform, except for Airflow.

Data ingestion is carried out by an Airflow DAG. The DAG downloads new data daily and ingests it into a Cloud Storage bucket, which acts as the Data Lake for the project. The data pulled from the API is saved in Parquet format and uploaded to GCS, and an external table is then created in BigQuery for querying the contents of the Parquet files.
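
To make the ingestion step more concrete, below is a minimal sketch of the kind of code the DAG runs against the API. It is an illustration only, not the project's actual ingest_youtube.py; it assumes the google-api-python-client, pandas and pyarrow packages, and the channel IDs and column names are placeholders.

import os

import pandas as pd
from googleapiclient.discovery import build

API_KEY = os.environ["API_KEY"]                       # the YouTube API Key from .env
CHANNEL_IDS = ["UC_channel_id_1", "UC_channel_id_2"]  # placeholder channel IDs

# Build a client for the YouTube Data API v3 and request channel statistics.
youtube = build("youtube", "v3", developerKey=API_KEY)
response = youtube.channels().list(
    part="snippet,statistics",
    id=",".join(CHANNEL_IDS),
).execute()

rows = [
    {
        "channel_id": item["id"],
        "channel_name": item["snippet"]["title"],
        "subscribers": int(item["statistics"]["subscriberCount"]),
        "views": int(item["statistics"]["viewCount"]),
        "total_videos": int(item["statistics"]["videoCount"]),
    }
    for item in response.get("items", [])
]

# Save the result in Parquet format, ready to be uploaded to the GCS bucket.
pd.DataFrame(rows).to_parquet("channels.parquet", index=False)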

Reproduce the project

Prerequisites

The following requirements are needed to reproduce the project:

  1. A Youtube API Key.
  2. A Google Cloud Platform account.
  3. (Optional) The Google Cloud SDK. Instructions for installing it are below.
    • Most instructions below will assume that you are using the SDK for simplicity.
    • If you use a VM instance on Google Cloud Platform, the Google Cloud SDK comes installed by default, so you don't have to perform this step.
  4. (Optional) An SSH client.
    • All the instructions listed below assume that you are using a Terminal and SSH.
    • I'm using Git Bash, which you can download here.
  5. (Optional) VSCode with the Remote-SSH extension.
    • Any other IDE should work, but VSCode makes it very convenient to forward ports in remote VMs.

Development and testing were carried out on a Google Cloud Compute VM instance. I strongly recommend using a VM instance to reproduce the project as well. All the instructions below assume that a VM is used.

Create a Google Cloud Project

Access the Google Cloud dashboard and create a new project from the dropdown menu on the top left of the screen, to the right of the Google Cloud Platform text.

Generate a Youtube API Key

  1. Access the Google Cloud console with your Google account.
  2. Enable the Youtube Data API v3.
  3. Go to Credentials on the left panel.
  4. Click on the Create Credentials button and select API Key.
  5. Your API Key is created. Save this key, because you will need to copy it into the project later.

Create a Service Account

After you create the project, you will need to create a Service Account with the following roles:

  • BigQuery Admin
  • Storage Admin
  • Storage Object Admin
  • Viewer
  • To create a Service Account, go to the Google Cloud Platform console and in the left panel select IAM & Admin -> Service Accounts

  • Click on Create Service Account

  • Define a Service account name and a description of what this service account will do

  • On step 2, add the roles listed above (BigQuery Admin, Storage Admin, Storage Object Admin, Viewer)

  • In the Service accounts dashboard, click on Actions -> Manage keys

  • Click on Add key -> Create new key

  • Choose key type JSON and click on Create

  • When saving the file, rename it to google_credentials.json and store it in your home folder, under $HOME/.google/credentials/.

IMPORTANT: if you're using a VM as recommended, you will have to upload this credentials file to the VM.

You will also need to activate the following APIs:

Generate a SSH Key

  • Create a .ssh directory using Git Bash if you're on a Windows environment
  • cd .ssh
  • Run the following command, replacing KEY_FILENAME and USER with your own values: ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USER -b 2048
  • A public key file named KEY_FILENAME.pub is saved in the .ssh folder
  • Now we have to add the public key to Google Cloud Platform
  • Go to Navigation Menu -> Compute Engine -> Metadata
  • Print the public key with: cat KEY_FILENAME.pub
  • Copy the value to the GCP Metadata page and save

(Optional) Install and setup Google Cloud SDK

Note: This step is only required if you don't use a Virtual Machine instance on Google Cloud Platform, because in that case the software is already installed.

  1. Download the Google Cloud SDK from this link and install it according to the instructions for your OS.
  2. Initialize the SDK following these instructions.
    1. Run gcloud init from a terminal and follow the instructions.
    2. Make sure that your project is selected with the command gcloud config list

Creating a Virtual Machine on GCP

  1. From your project's dashboard, go to Compute Engine > VM instances
  2. Create a new instance:
    • Any name of your choosing
    • Pick your favourite region. You can check out the regions in this link.

      IMPORTANT: make sure that you use the same region for all of your Google Cloud components.

    • Pick an E2 series instance. An e2-standard-4 instance is recommended (4 vCPUs, 16 GB RAM)
    • Change the boot disk to Ubuntu. The Ubuntu 20.04 LTS version is recommended. Also pick at least 30GB of storage.
    • Leave all other settings on their default value and click on Create.

Set up SSH access to the VM

  1. Start your instance from the VM instances dashboard in Google Cloud.
  2. Copy the external IP address from the VM instances dashboard.
  3. Go to the terminal and type ssh -i ~/.ssh/gcp username@external_ip, where gcp corresponds to your KEY_FILENAME.

Creating SSH config file

  1. Open a Git Bash terminal
  2. Change to the folder .ssh: cd .ssh
  3. Create a configuration file: touch config
  4. Open the configuration file with your default IDE (in my case, VSCode): code config
  5. Insert the following, changing the Host name, IP address, User and IdentityFile to your own:
Host de-zoomcamp
    Hostname 34.77.77.161
    User u10054206
    IdentityFile C:/Users/u10054206/.ssh/gcp
  6. Execute the ssh command to connect to the Virtual Machine using the alias: ssh de-zoomcamp
  • Note: when you stop the VM instance, the external IP address can change; in that case, perform steps 4-6 again with the new IP address.

(Optional) Configure VSCode to access VM in Google Cloud Platform

  1. Open VSCode
  2. Go to extensions on the left panel
  3. Search for remote ssh and install the Remote-SSH extension


  4. Open a remote window by clicking on the green button in the bottom left corner


  5. Select the Connect to Host... option

  6. Select the host you configured in the SSH config file (de-zoomcamp in this example)

Installing the required software in the VM

  1. Run this first in your SSH session: sudo apt update && sudo apt -y upgrade
    • It's a good idea to run this command often, once per day or every few days, to keep your VM up to date.

Docker:

  1. Run sudo apt install docker.io to install it.
  2. Change your settings so that you can run Docker without sudo:
    1. Run sudo groupadd docker
    2. Run sudo gpasswd -a $USER docker
    3. Log out of your SSH session and log back in.
    4. Run sudo service docker restart
    5. Test that Docker can run successfully with docker run hello-world
    6. If you want to test something more useful, try docker run -it ubuntu bash

Docker compose:

  1. Go to https://github.com/docker/compose/releases and copy the URL for the docker-compose-linux-x86_64 binary for its latest version.
  2. Create a folder for binary files for your Linux user:
    1. Create a subfolder bin in your home account with mkdir ~/bin
    2. Go to the folder with cd ~/bin
  3. Download the binary file with wget <compose_url> -O docker-compose
    • If you forget to add the -O option, you can rename the file with mv <long_filename> docker-compose
    • Make sure that the docker-compose file is in the folder with ls
  4. Make the binary executable with chmod +x docker-compose
    • Check the file with ls again; it should now be colored green. You should now be able to run it with ./docker-compose version
  5. Go back to the home folder with cd ~
  6. Run nano .bashrc to modify your path environment variable:
    1. Scroll to the end of the file
    2. Add this line at the end:
       export PATH="${HOME}/bin:${PATH}"
    3. Press CTRL + o in your keyboard and press Enter afterwards to save the file.
    4. Press CTRL + x in your keyboard to exit the Nano editor.
  7. Reload the path environment variable with source .bashrc
  8. You should now be able to run Docker compose from anywhere; test it with docker-compose version

Terraform:

  1. Run curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
  2. Run sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
  3. Run sudo apt-get update && sudo apt-get install terraform

Upload Google service account credentials file to VM instance

  1. Copy the file from your local machine using sftp:
    1. sftp de-zoomcamp
    2. put google_credentials.json
  2. On the VM, move the file to the expected location: mkdir -p ~/.google/credentials && mv google_credentials.json ~/.google/credentials/

Creating an environment variable for the credentials

Create an environment variable called GOOGLE_APPLICATION_CREDENTIALS and assign it the path to your JSON credentials file (covered in the Create a Service Account section), which should be $HOME/.google/credentials/google_credentials.json. Assuming you're running bash:

  1. Open .bashrc:
    nano ~/.bashrc
  2. At the end of the file, add the following line:
    export GOOGLE_APPLICATION_CREDENTIALS="<path/to/authkeys>.json"
  3. Exit nano with Ctrl+X. Follow the on-screen instructions to save the file and exit.
  4. Log out of your current terminal session and log back in, or run source ~/.bashrc to activate the environment variable.
  5. Refresh the token and verify the authentication with the GCP SDK:
    gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
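
As an optional extra sanity check, a short Python snippet can confirm that client libraries pick up the credentials from GOOGLE_APPLICATION_CREDENTIALS. This assumes the google-cloud-storage package is installed on the VM.

from google.cloud import storage

# The client reads GOOGLE_APPLICATION_CREDENTIALS automatically; if this
# prints your buckets, the service account key is configured correctly.
client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)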

Clone the repo in the VM

Log in to your VM instance and run the following from your $HOME folder:

git clone https://github.com/FilipeTheAnalyst/airflow_youtube.git

IMPORTANT: I strongly suggest that you fork my project and clone your copy so that you can easily perform changes on the code, because you will need to customize a few variables in order to make it run with your own infrastructure.

Set up project infrastructure with Terraform

Make sure that the credentials are updated and the environment variable is set up.

  1. Go to the terraform folder.

  2. Open variables.tf and edit line 11 under the variable "region" block so that it matches your preferred region.

  3. Initialize Terraform:

    terraform init
  4. Plan the infrastructure and make sure that you're creating a bucket in Cloud Storage as well as a dataset in BigQuery

    terraform plan
  5. If the plan details are as expected, apply the changes.

    terraform apply

You should now have a bucket called dtc_data_lake_youtube_data and a dataset called youtube_data in BigQuery.

Set up data ingestion with Airflow

  1. Go to the airflow folder.
  2. Run the following command and write down the output:
    echo -e "AIRFLOW_UID=$(id -u)"
  3. Open the .env file and change the value of AIRFLOW_UID to the output of the previous command.
  4. Also change the value of API_KEY to the YouTube API Key you generated above.
  5. Open the docker-compose.yaml file and change the values of GCP_PROJECT_ID and GCP_GCS_BUCKET on lines 65 and 66 to the correct values for your configuration.
  6. Build the custom Airflow Docker image:
    docker-compose build
  7. Initialize the Airflow configs:
    docker-compose up airflow-init
  8. Run Airflow:
    docker-compose up

You may now access the Airflow GUI by browsing to localhost:8080. Username and password are both airflow.

IMPORTANT: this is NOT a production-ready setup! The username and password for Airflow have not been modified in any way; you can find them by searching for _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD inside the docker-compose.yaml file.

  • If you can't connect to Airflow, you need to forward port 8080 to your local machine. You can do this in VSCode following these steps:
    • Open terminal
    • Click on Ports
    • Select the option Forward a port and select port 8080


Perform the data ingestion

If you performed all the steps of the previous section, you should now have a web browser with the Airflow dashboard.

The DAG is set up to download all data starting from 2022-06-06. You may change this date by modifying line 43 of airflow/dags/data_ingestion_youtube.py. Should you change the DAG date, you will have to delete the DAG in the Airflow UI and wait a couple of minutes so that Airflow can pick up the changes in the DAG.

To trigger the DAG, simply click on the switch icon next to the DAG name. The DAG will retrieve data for the YouTube channels listed in the airflow/dags/ingest_youtube.py code, along with the videos from each channel.

The DAG consists of the following tasks:

  • 1 BashOperator task to execute the Python code that collects the data from the YouTube API
  • 2 PythonOperator tasks to upload the data to a GCS bucket (one for channel data, the other for video details)
  • 2 BigQueryOperator tasks to create external tables in BigQuery with the channel data and the video details data
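
The sketch below illustrates how such a DAG can be wired together. It is a simplified illustration, not a copy of airflow/dags/data_ingestion_youtube.py: the file paths, object names and the BigQueryCreateExternalTableOperator used here are assumptions, and only the channels branch is shown (the videos branch follows the same pattern).

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateExternalTableOperator,
)
from google.cloud import storage

PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")


def upload_to_gcs(bucket: str, object_name: str, local_file: str) -> None:
    """Upload a local Parquet file to the GCS data lake bucket."""
    client = storage.Client()
    client.bucket(bucket).blob(object_name).upload_from_filename(local_file)


with DAG(
    dag_id="data_ingestion_youtube",
    start_date=datetime(2022, 6, 6),
    schedule_interval="@daily",
    catchup=True,
) as dag:

    # BashOperator: run the Python script that calls the YouTube API.
    fetch_data = BashOperator(
        task_id="fetch_youtube_data",
        bash_command="python /opt/airflow/dags/ingest_youtube.py",
    )

    # PythonOperator: upload the channels Parquet file to the GCS bucket.
    upload_channels = PythonOperator(
        task_id="upload_channels_to_gcs",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket": BUCKET,
            "object_name": "raw/channels_{{ ds }}.parquet",
            "local_file": "/opt/airflow/data/channels.parquet",
        },
    )

    # BigQuery task: expose the Parquet files as an external table.
    bq_channels = BigQueryCreateExternalTableOperator(
        task_id="bq_external_table_channels",
        table_resource={
            "tableReference": {
                "projectId": PROJECT_ID,
                "datasetId": "youtube_data",
                "tableId": "channels",
            },
            "externalDataConfiguration": {
                "sourceFormat": "PARQUET",
                "sourceUris": [f"gs://{BUCKET}/raw/channels_*.parquet"],
            },
        },
    )

    fetch_data >> upload_channels >> bq_channels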


After successfully running the Airflow workflow, you should see Parquet files created in the GCS bucket for each day.

You should also see the tables channels and videos created inside the youtube_data dataset.
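
With the external tables in place, you can explore the data with SQL, either in the BigQuery console or from Python. Below is a minimal example; it assumes the google-cloud-bigquery package is installed, and the column names follow the ingestion sketch earlier, so they may differ from the actual schema.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical exploration query against the external channels table.
query = """
    SELECT channel_name, subscribers, total_videos
    FROM `youtube_data.channels`
    ORDER BY subscribers DESC
"""

for row in client.query(query).result():
    print(row.channel_name, row.subscribers, row.total_videos)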

After the data ingestion, you may shut down Airflow by pressing Ctrl+C on the terminal running Airflow and then running docker-compose down, or you may keep Airflow running if you want to update the dataset every day.
