YouTube API Data Engineering Project

Problem

This is a simple project that pulls data from the YouTube API for 10 data analyst channels, transforms it, and loads it into a data warehouse.

Dataset

The dataset for this project comes from the YouTube API. The API exposes data about YouTube channels and video details such as number of views, likes, publish date, title, etc.
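As a rough illustration of what a request for this data looks like, the sketch below builds URLs for the `channels` and `videos` list endpoints of the YouTube Data API v3. The helper name `build_request_url` and the chosen `part` values are assumptions made for this example; the repository's own collection code may be organised differently.

```python
from urllib.parse import urlencode

# Base URL of the YouTube Data API v3.
API_BASE = "https://www.googleapis.com/youtube/v3"


def build_request_url(resource, ids, api_key, parts=("snippet", "statistics")):
    """Return the URL for a `channels` or `videos` list call (hypothetical helper)."""
    if resource not in ("channels", "videos"):
        raise ValueError(f"unsupported resource: {resource}")
    query = urlencode({
        "part": ",".join(parts),  # which sections of each resource to return
        "id": ",".join(ids),      # comma-separated channel or video IDs
        "key": api_key,           # the API key generated in the Google Cloud console
    })
    return f"{API_BASE}/{resource}?{query}"


if __name__ == "__main__":
    # Placeholder values; replace them with a real channel ID and API key.
    print(build_request_url("channels", ["CHANNEL_ID"], "YOUR_API_KEY"))
```

The function only builds the URL and makes no network call, so it can be tried without spending API quota.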

Technologies

Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
Terraform: Infrastructure-as-Code (IaC)
Docker: Containerization
SQL: Data Analysis & Exploration
Airflow: Pipeline Orchestration

Project details and implementation

This project makes use of Google Cloud Platform, particularly Google Cloud Storage (GCS) and BigQuery (BQ).

Cloud infrastructure is mostly managed with Terraform, except for Airflow.

Data ingestion is carried out by an Airflow DAG. The DAG downloads new data daily and ingests it into a Cloud Storage bucket, which acts as the Data Lake for the project. The data pulled from the API is saved in Parquet format and uploaded to GCS; the DAG then creates an external table in BigQuery for querying the contents of the Parquet files.
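To make the external table step more concrete, here is a minimal sketch of the kind of table definition BigQuery expects for Parquet files stored in GCS. The function name, the project ID `my-project`, and the `raw/channels` prefix are assumptions for illustration; the bucket, dataset, and table names are the ones mentioned later in this README.

```python
import json


def external_table_definition(project_id, dataset_id, table_id, bucket, prefix):
    """Build a BigQuery table resource pointing at Parquet files in GCS."""
    # A single * wildcard matches every Parquet file under the prefix.
    source_uri = f"gs://{bucket}/{prefix}/*.parquet"
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "externalDataConfiguration": {
            "sourceFormat": "PARQUET",
            "sourceUris": [source_uri],
        },
    }


if __name__ == "__main__":
    definition = external_table_definition(
        project_id="my-project",  # hypothetical project ID
        dataset_id="youtube_data",
        table_id="channels",
        bucket="dtc_data_lake_youtube_data",
        prefix="raw/channels",  # hypothetical object prefix
    )
    print(json.dumps(definition, indent=2))
```

Because the table is external, BigQuery reads the files in place; uploading a new day's Parquet file under the same prefix makes it queryable without reloading anything.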

Reproduce the project

Prerequisites

The following requirements are needed to reproduce the project:

A Youtube API Key.
A Google Cloud Platform account.
(Optional) The Google Cloud SDK. Instructions for installing it are below.
- Most instructions below will assume that you are using the SDK for simplicity.
- If you use a VM instance on Google Cloud Platform, the Google Cloud SDK comes installed by default, so you don't have to perform this step.
(Optional) A SSH client.
- All the instructions listed below assume that you are using a Terminal and SSH.
- I'm using Git Bash, which you can download here.
(Optional) VSCode with the Remote-SSH extension.
- Any other IDE should work, but VSCode makes it very convenient to forward ports on remote VMs.

Development and testing were carried out using a Google Cloud Compute VM instance. I strongly recommend that a VM instance is used for reproducing the project as well. All the instructions below will assume that a VM is used.

Create a Google Cloud Project

Access the Google Cloud dashboard and create a new project from the dropdown menu on the top left of the screen, to the right of the Google Cloud Platform text.

Generate a Youtube API Key

Access the Google Cloud console with your Google account.
Enable the YouTube Data API v3.
Go to Credentials on the left panel.
Click on the Create Credentials button and select API Key.
Your API Key is now created. Save it, because you will need to copy this key into the project.

Create a Service Account

After you create the project, you will need to create a Service Account with the following roles:

BigQuery Admin
Storage Admin
Storage Object Admin
Viewer

To create a Service Account, go to the Google Cloud Platform console and, in the left panel, select IAM & Admin -> Service Accounts.
Click on Create Service Account.
Define a Service account name and a description of what this service account will do.
On step 2, add the roles listed above.
In the Service account dashboard click on Actions -> Manage keys
Click on Add key -> Create new key
Choose key type JSON and click on Create
When saving the file rename it to google_credentials.json and store it in your home folder, in $HOME/.google/credentials/ .

IMPORTANT: if you're using a VM as recommended, you will have to upload this credentials file to the VM.
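Before uploading the key, it can be worth checking that the downloaded file really is a service account key. The sketch below is a hedged example: the helper names are made up for this README, and the field list covers only a few of the fields that a service account JSON key contains.

```python
import json
from pathlib import Path

# A subset of the fields found in a service account JSON key.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def key_problems(key):
    """Return a list of human-readable problems found in a parsed key file."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    if key.get("type", "service_account") != "service_account":
        problems.append(f"unexpected type: {key['type']}")
    return problems


def check_key_file(path):
    """Load the JSON key at `path` and report its problems (empty list means OK)."""
    with Path(path).expanduser().open(encoding="utf-8") as handle:
        return key_problems(json.load(handle))
```

For example, `check_key_file("~/.google/credentials/google_credentials.json")` should return an empty list for a valid key stored at the location suggested above.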

You will also need to activate the following APIs:

Generate a SSH Key

Create a .ssh directory in your home folder (use Git Bash if you're on a Windows environment)
Change into it: cd .ssh
Run the following command, replacing KEY_FILENAME and USER with your own values: ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USER -b 2048
A public key file named KEY_FILENAME.pub is saved into the .ssh folder
Now add the public key to Google Cloud Platform
Go to Navigation Menu -> Compute Engine -> Metadata
Print the key with the following command: cat KEY_FILENAME.pub
Copy the value to GCP and save

(Optional) Install and setup Google Cloud SDK

Note: This step is only required if you don't use a Virtual Machine instance on Google Cloud Platform, because in that case the software is already installed.

Download Gcloud SDK from this link and install it according to the instructions for your OS.
Initialize the SDK following these instructions.
1. Run gcloud init from a terminal and follow the instructions.
2. Make sure that your project is selected with the command gcloud config list

Creating a Virtual Machine on GCP

From your project's dashboard, go to Compute Engine > VM instances
Create a new instance:
- Any name of your choosing
- Pick your favourite region. You can check out the regions in this link.
  
  IMPORTANT: make sure that you use the same region for all of your Google Cloud components.
- Pick an E2 series instance. An e2-standard-4 instance is recommended (4 vCPUs, 16GB RAM)
- Change the boot disk to Ubuntu. The Ubuntu 20.04 LTS version is recommended. Also pick at least 30GB of storage.
- Leave all other settings on their default value and click on Create.

Set up SSH access to the VM

Start your instance from the VM instances dashboard in Google Cloud.
Copy the external IP address from the VM instances dashboard.
Go to the terminal and type ssh -i ~/.ssh/gcp username@external_ip where gcp corresponds to the key_filename.

Creating SSH config file

Open a Git Bash terminal
Change to the folder .ssh: cd .ssh
Create a configuration file: touch config
Open the configuration file with your default IDE (VSCode in my case): code config
Insert the following code, changing the name, IP address, user and IdentityFile to your own:

Host de-zoomcamp
    Hostname 34.77.77.161
    User u10054206
    IdentityFile C:/Users/u10054206/.ssh/gcp

Execute the ssh command to connect to the Virtual Machine using the alias name: ssh de-zoomcamp

Note: When you stop the VM instance, the external IP address can change. In that case, update the Hostname entry in the config file with the new IP address and connect again.
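Since only the Hostname line changes when the VM gets a new external IP, the update can be scripted. This is a minimal sketch, not part of the project: `update_hostname` is a hypothetical helper, and `203.0.113.10` is a documentation-only address standing in for the new IP.

```python
def update_hostname(config_text, alias, new_ip):
    """Return `config_text` with the Hostname of the `alias` Host block replaced."""
    updated, in_block = [], False
    for line in config_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "host":
            # A new Host block starts; remember whether it is the one we want.
            in_block = alias in words[1:]
        elif in_block and words and words[0].lower() == "hostname":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}Hostname {new_ip}"
        updated.append(line)
    return "\n".join(updated) + "\n"


if __name__ == "__main__":
    sample = (
        "Host de-zoomcamp\n"
        "    Hostname 34.77.77.161\n"
        "    User u10054206\n"
        "    IdentityFile C:/Users/u10054206/.ssh/gcp\n"
    )
    print(update_hostname(sample, "de-zoomcamp", "203.0.113.10"))
```

The function works on text rather than on the file itself, so the result can be inspected before it is written back to ~/.ssh/config.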

(Optional) Configure VSCode to access VM in Google Cloud Platform

Open VSCode
Go to extensions on the left panel
Search for remote ssh and install the Remote-SSH extension

Open the remote window by clicking the green button in the bottom-left corner

Select the Connect to host... option
Select the host that you defined in the SSH config file (de-zoomcamp in the example above)

Installing the required software in the VM

Run this first in your SSH session: sudo apt update && sudo apt -y upgrade
- It's a good idea to run this command often, once per day or every few days, to keep your VM up to date.

Docker:

Run sudo apt install docker.io to install it.
Change your settings so that you can run Docker without sudo:
1. Run sudo groupadd docker
2. Run sudo gpasswd -a $USER docker
3. Log out of your SSH session and log back in.
4. Run sudo service docker restart
5. Test that Docker can run successfully with docker run hello-world
6. If you want to test something more useful please try docker run -it ubuntu bash

Docker compose:

Go to https://github.com/docker/compose/releases and copy the URL for the docker-compose-linux-x86_64 binary for its latest version.
- At the time of writing, the last available version is v2.6.0 and the URL for it is https://github.com/docker/compose/releases/download/v2.6.0/docker-compose-linux-x86_64
Create a folder for binary files for your Linux user:
1. Create a subfolder bin in your home account with mkdir ~/bin
2. Go to the folder with cd ~/bin
Download the binary file with wget <compose_url> -O docker-compose
- If you forget to add the -O option, you can rename the file with mv <long_filename> docker-compose
- Make sure that the docker-compose file is in the folder with ls
Make the binary executable with chmod +x docker-compose
- Check the file with ls again; it should now be colored green. You should now be able to run it with ./docker-compose version
Go back to the home folder with cd ~
Run nano .bashrc to modify your path environment variable:
1. Scroll to the end of the file
2. Add this line at the end:
```
 export PATH="${HOME}/bin:${PATH}"
```
3. Press CTRL + o on your keyboard and then press Enter to save the file.
4. Press CTRL + x on your keyboard to exit the Nano editor.
Reload the path environment variable with source .bashrc
You should now be able to run Docker compose from anywhere; test it with docker-compose version

Terraform:

Run curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
Run sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
Run sudo apt-get update && sudo apt-get install terraform
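Once everything is installed, a quick check that the three tools are reachable from the PATH can save some debugging later. The sketch below uses only the Python standard library; the helper name `missing_tools` is made up for this example.

```python
import shutil


def missing_tools(names=("docker", "docker-compose", "terraform")):
    """Return the tools from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("docker, docker-compose and terraform are all available")
```

If docker-compose is reported as missing, the most likely cause is that the PATH change in .bashrc has not been reloaded in the current session.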

Upload Google service account credentials file to VM instance

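Copy the file from your local machine using sftp:
1. sftp de-zoomcamp
2. put google_credentials.json
Then, on the VM, move the file into $HOME/.google/credentials/ so that it matches the path used in the next section.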

Creating an environment variable for the credentials

Create an environment variable called GOOGLE_APPLICATION_CREDENTIALS and set it to the path of your JSON credentials file (covered in the Create a Service Account section), which should be in $HOME/.google/credentials/. Assuming you're running bash:

Open .bashrc:
```
nano ~/.bashrc
```

At the end of the file, add the following line:

export GOOGLE_APPLICATION_CREDENTIALS="<path/to/authkeys>.json"

Exit nano with Ctrl+X. Follow the on-screen instructions to save the file and exit.
Log out of your current terminal session and log back in, or run source ~/.bashrc to activate the environment variable.

Refresh the token and verify the authentication with the GCP SDK:

gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
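If the gcloud command fails, the first thing to rule out is a missing or mistyped environment variable. The following sketch checks that the variable is set and points at an existing file; `credentials_file` is a hypothetical helper written for this README.

```python
import os
from pathlib import Path


def credentials_file(env=None):
    """Return the path stored in GOOGLE_APPLICATION_CREDENTIALS, or raise."""
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"credentials file not found: {path}")
    return path
```

Calling `credentials_file()` in a fresh terminal session should return the path of the key file; an exception means that .bashrc was not reloaded or that the path is wrong.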

Clone the repo in the VM

git clone https://github.com/FilipeTheAnalyst/airflow_youtube.git

IMPORTANT: I strongly suggest that you fork my project and clone your copy so that you can easily perform changes on the code, because you will need to customize a few variables in order to make it run with your own infrastructure.

Set up project infrastructure with Terraform

Make sure that the credentials are updated and the environment variable is set up.

Go to the terraform folder.
Open variables.tf and edit line 11 under the variable "region" block so that it matches your preferred region.
Initialize Terraform:
```
terraform init
```
Plan the infrastructure and make sure that you're creating a bucket in Cloud Storage as well as a dataset in BigQuery
```
terraform plan
```
If the plan details are as expected, apply the changes.
```
terraform apply
```

You should now have a bucket called dtc_data_lake_youtube_data and a dataset called youtube_data in BigQuery.

Set up data ingestion with Airflow

Go to the airflow folder.
Run the following command and write down the output:
```
echo -e "AIRFLOW_UID=$(id -u)"
```
Open the .env file and change the value of AIRFLOW_UID to the output of the previous command.
Also change the value of API_KEY to the YouTube API Key generated above.
Open the docker-compose.yaml file and change the values of GCP_PROJECT_ID and GCP_GCS_BUCKET on lines 65 and 66 to match your configuration.
Build the custom Airflow Docker image:
```
docker-compose build
```
Initialize the Airflow configs:
```
docker-compose up airflow-init
```
Run Airflow
```
docker-compose up
```
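The .env edits from the steps above can also be scripted instead of done by hand. This is a minimal sketch under the assumption that the .env file uses plain KEY=value lines; `set_env_values` is a hypothetical helper and the sample values are placeholders.

```python
def set_env_values(text, updates):
    """Return .env `text` with the given keys set, appending any that are absent."""
    remaining = dict(updates)
    lines = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            # Replace the existing value and mark the key as handled.
            line = f"{key}={remaining.pop(key)}"
        lines.append(line)
    # Keys that were not present in the file are added at the end.
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "AIRFLOW_UID=50000\nAPI_KEY=changeme\n"
    print(set_env_values(sample, {"AIRFLOW_UID": 1001, "API_KEY": "YOUR_API_KEY"}))
```

On the VM, the value passed for AIRFLOW_UID would be the number printed by the `id -u` command shown earlier.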

You may now access the Airflow GUI by browsing to localhost:8080. The username and password are both airflow.

IMPORTANT: this is NOT a production-ready setup! The username and password for Airflow have not been modified in any way; you can find them by searching for _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD inside the docker-compose.yaml file.

If you can't connect to Airflow, you need to forward port 8080 to your local machine. You can do this in VSCode by following these steps:
- Open terminal
- Click on Ports
- Select the option Forward a port and select port 8080

Perform the data ingestion

If you performed all the steps of the previous section, you should now have a web browser with the Airflow dashboard.

The DAG is set up to download all data starting from 2022-06-06. You may change this date by modifying line 43 of airflow/dags/data_ingestion_youtube.py. Should you change the DAG date, you will have to delete the DAG in the Airflow UI and wait a couple of minutes so that Airflow can pick up the changes in the DAG.

To trigger the DAG, simply click on the switch icon next to the DAG name. The DAG will retrieve all data for the YouTube channels listed in the airflow/dags/ingest_youtube.py code, along with the videos from each channel.

The DAG consists of the following tasks:

1 BashOperator task to execute the Python code that collects the data from the YouTube API
2 PythonOperator tasks to upload the data into a GCS bucket (1 for channels data and the other for video details)
2 BigQueryOperator tasks to create external tables in BigQuery with the channels data and video details data
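The five tasks can be pictured as a small dependency graph. The sketch below models one plausible ordering with the standard library's `graphlib` (Python 3.9+) rather than with Airflow itself. The task names and the dependencies between them are assumptions for illustration; check the DAG file for the actual task IDs and wiring.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
TASK_GRAPH = {
    "collect_youtube_data": set(),                                  # BashOperator
    "upload_channels_to_gcs": {"collect_youtube_data"},             # PythonOperator
    "upload_videos_to_gcs": {"collect_youtube_data"},               # PythonOperator
    "create_channels_external_table": {"upload_channels_to_gcs"},   # BigQueryOperator
    "create_videos_external_table": {"upload_videos_to_gcs"},       # BigQueryOperator
}


def execution_order(graph=None):
    """Return one valid execution order for the task graph."""
    graph = TASK_GRAPH if graph is None else graph
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for position, task in enumerate(execution_order(), start=1):
        print(position, task)
```

In this picture the channels branch and the videos branch are independent after the collection step, so a scheduler is free to run them in parallel.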

After successfully running the Airflow workflow, you should see Parquet files created in the GCS bucket for each day, and the tables channels and videos created inside the youtube_data dataset.

After the data ingestion, you may shut down Airflow by pressing Ctrl+C on the terminal running Airflow and then running docker-compose down, or you may keep Airflow running if you want to update the dataset every day.
