This is a simple project that takes data from the YouTube API for 10 data analyst channels, transforms it, and loads it into a data warehouse.
The data source for this project is the YouTube API. The API provides data about YouTube channels and video details such as number of views, likes, publish date, title, etc.
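As a rough illustration of the video details mentioned above, the sketch below flattens one item of a videos response into a flat row. The field names (snippet.title, snippet.publishedAt, statistics.viewCount, statistics.likeCount) follow the YouTube Data API v3 video resource; the function name, the output column names, and the sample item are assumptions made for this example and are not taken from the project code.

```python
from typing import Any, Dict, Optional


def _to_int(value: Optional[str]) -> Optional[int]:
    # The API returns counters as strings; a counter may be absent (e.g. hidden likes).
    return int(value) if value is not None else None


def flatten_video_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one item of a videos response into a flat row (hypothetical column names)."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "video_id": item.get("id"),
        "title": snippet.get("title"),
        "published_at": snippet.get("publishedAt"),
        "view_count": _to_int(stats.get("viewCount")),
        "like_count": _to_int(stats.get("likeCount")),
    }


# Hand-written sample for illustration only, not real channel data.
sample_item = {
    "id": "abc123",
    "snippet": {"title": "Intro to SQL", "publishedAt": "2022-06-06T12:00:00Z"},
    "statistics": {"viewCount": "1500", "likeCount": "42"},
}

row = flatten_video_item(sample_item)
print(row)
```

Rows shaped like this are easy to collect into a table before writing them to parquet.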
- Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
- Terraform: Infrastructure-as-Code (IaC)
- Docker: Containerization
- SQL: Data Analysis & Exploration
- Airflow: Pipeline Orchestration
This project makes use of Google Cloud Platform, particularly Google Cloud Storage (GCS) and BigQuery (BQ).
Cloud infrastructure is mostly managed with Terraform, except for Airflow.
Data ingestion is carried out by an Airflow DAG. The DAG downloads new data daily and ingests it into a Cloud Storage bucket, which acts as the Data Lake for the project. The data pulled from the API is saved in parquet format and uploaded to GCS; an external table is then created in BigQuery so that the contents of the parquet files can be queried.
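One way to picture the daily layout in the Data Lake is a date-based object name per dataset. The sketch below is only an assumption about how such names could be built: the raw/ prefix, the function names, and the file naming are hypothetical, and the bucket name is the one this README expects Terraform to create. The resulting gs:// URI is the kind of value an external table would use as its source.

```python
from datetime import date

# Bucket name as stated later in this README; adjust it to your own infrastructure.
BUCKET = "dtc_data_lake_youtube_data"


def object_name(kind: str, run_date: date) -> str:
    """Build a hypothetical object name such as raw/videos/2022-06-06.parquet."""
    return f"raw/{kind}/{run_date.isoformat()}.parquet"


def gcs_uri(kind: str, run_date: date, bucket: str = BUCKET) -> str:
    """Build the gs:// URI that an external table could point at."""
    return f"gs://{bucket}/{object_name(kind, run_date)}"


for kind in ("channels", "videos"):
    print(gcs_uri(kind, date(2022, 6, 6)))
```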
The following requirements are needed to reproduce the project:
- A Youtube API Key.
- A Google Cloud Platform account.
- (Optional) The Google Cloud SDK. Instructions for installing it are below.
- Most instructions below will assume that you are using the SDK for simplicity.
- If you use a VM instance on Google Cloud Platform, the Google Cloud SDK comes installed by default, so you don't have to perform this step.
- (Optional) An SSH client.
- All the instructions listed below assume that you are using a Terminal and SSH.
- I'm using Git Bash, which you can download here.
- (Optional) VSCode with the Remote-SSH extension.
- Any other IDE should work, but VSCode makes it very convenient to forward ports on remote VMs.
Development and testing were carried out using a Google Cloud Compute VM instance. I strongly recommend that a VM instance is used for reproducing the project as well. All the instructions below will assume that a VM is used.
Access the Google Cloud dashboard and create a new project from the dropdown menu on the top left of the screen, to the right of the Google Cloud Platform text.
- Access the Google Cloud console with your google account.
- Enable the Youtube Data API v3
- Go to Credentials in the left panel.
- Click on Create Credentials button and select API Key
- Your API Key is created. You should save this information because we will need to copy this key to the project.
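To see where the API key ends up, here is a minimal sketch that builds a request URL for the channels endpoint of the YouTube Data API v3 without sending it. The endpoint and the part, id, and key parameters belong to that API; the function name, the channel IDs, and the YOUR_API_KEY placeholder are made up for illustration and are not taken from the project code.

```python
from urllib.parse import urlencode

CHANNELS_ENDPOINT = "https://www.googleapis.com/youtube/v3/channels"


def channels_url(channel_ids, api_key):
    """Build a channels request URL for the given channel IDs."""
    params = {
        "part": "snippet,statistics",  # which resource parts to return
        "id": ",".join(channel_ids),   # comma-separated channel IDs
        "key": api_key,                # the API key created above
    }
    return f"{CHANNELS_ENDPOINT}?{urlencode(params)}"


# Placeholder values only; replace them with real channel IDs and your own key.
url = channels_url(["UC_hypothetical_1", "UC_hypothetical_2"], "YOUR_API_KEY")
print(url)
```

Keeping the key in a variable or environment file, rather than hard-coding it, makes it easier to avoid committing it to a repository.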
After you create the project, you will need to create a Service Account with the following roles:
- BigQuery Admin
- Storage Admin
- Storage Object Admin
- Viewer
- To create a Service Account, go to the Google Cloud Platform console and select IAM & Admin -> Service Accounts in the left panel.
- Click on Create Service Account.
- Define a Service account name and a description of what this service account will do.
- In step 2, add the roles listed above.
- In the Service account dashboard, click on Actions -> Manage keys.
- When saving the key file, rename it to google_credentials.json and store it in your home folder, in $HOME/.google/credentials/.
IMPORTANT: if you're using a VM as recommended, you will have to upload this credentials file to the VM.
You will also need to activate the following APIs:
- https://console.cloud.google.com/apis/library/iam.googleapis.com
- https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com
- Create a .ssh directory using Git Bash if you're on a Windows environment
cd .ssh
- Run the following command, replacing KEY_FILENAME and USER with your own values:
ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USER -b 2048
- A public key file named KEY_FILENAME.pub is saved in the .ssh folder.
- Now we have to put the public key in Google Cloud Platform
- Go to Navigation Menu -> Compute Engine -> Metadata
- Print the key using bash command in your environment:
cat key_filename.pub
- Copy the value to GCP and save
Note: This step is only required if you don't use a Virtual Machine instance on Google Cloud Platform, because in that case the software is already installed.
- Download Gcloud SDK from this link and install it according to the instructions for your OS.
- Initialize the SDK following these instructions.
- Run the following from a terminal and follow the instructions:
gcloud init
- Make sure that your project is selected with the command
gcloud config list
- From your project's dashboard, go to Compute Engine > VM instances
- Create a new instance:
- Any name of your choosing
- Pick your favourite region. You can check out the regions in this link.
IMPORTANT: make sure that you use the same region for all of your Google Cloud components.
- Pick an E2 series instance. An e2-standard-4 instance is recommended (4 vCPUs, 16GB RAM)
- Change the boot disk to Ubuntu. The Ubuntu 20.04 LTS version is recommended. Also pick at least 30GB of storage.
- Leave all other settings on their default value and click on Create.
- Start your instance from the VM instances dashboard in Google Cloud.
- Copy the external IP address from the VM instances dashboard.
- Go to the terminal and type
ssh -i ~/.ssh/gcp username@external_ip
where gcp corresponds to the key_filename.
- Open a Git Bash terminal
- Change to the folder .ssh:
cd .ssh
- Create a configuration file:
touch config
- Open the configuration file with your default IDE (in my case, VSCode):
code config
- Insert the following lines, changing the Host name, IP address, User and IdentityFile to your own:
Host de-zoomcamp
Hostname 34.77.77.161
User u10054206
IdentityFile C:/Users/u10054206/.ssh/gcp
- Run the ssh command with the alias to connect to the Virtual Machine:
ssh de-zoomcamp
- Note: When you stop the VM instance, the external IP address can change. In that case, update the Hostname value in the config file with the new IP address before connecting again.
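Because the external IP can change between sessions, it can be handy to regenerate the config entry instead of editing it by hand. The helper below is a small sketch that only formats the text of one entry; the function name is hypothetical and the values are the example values shown above.

```python
def ssh_config_entry(alias, hostname, user, identity_file):
    """Return the text of one SSH config entry for the given connection details."""
    lines = [
        f"Host {alias}",
        f"    Hostname {hostname}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
    ]
    return "\n".join(lines) + "\n"


# Example values from this README; replace the IP with the one shown in the VM dashboard.
entry = ssh_config_entry("de-zoomcamp", "34.77.77.161", "u10054206", "~/.ssh/gcp")
print(entry)
```

The printed text can be pasted into the config file in the .ssh folder, replacing the previous entry for the same alias.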
- Open VSCode
- Go to extensions on the left panel
- Search for remote ssh and install the Remote-SSH extension
- Open the remote window by clicking the green button in the bottom-left corner
- Run this first in your SSH session:
sudo apt update && sudo apt -y upgrade
- It's a good idea to run this command often, once per day or every few days, to keep your VM up to date.
- Run the following to install Docker:
sudo apt install docker.io
- Change your settings so that you can run Docker without sudo:
- Run
sudo groupadd docker
- Run
sudo gpasswd -a $USER docker
- Log out of your SSH session and log back in.
- Run
sudo service docker restart
- Test that Docker can run successfully with
docker run hello-world
- If you want to test something more useful, try
docker run -it ubuntu bash
- Go to https://github.com/docker/compose/releases and copy the URL of the docker-compose-linux-x86_64 binary for the latest version.
- At the time of writing, the latest available version is v2.6.0 and its URL is https://github.com/docker/compose/releases/download/v2.6.0/docker-compose-linux-x86_64
- Create a folder for binary files for your Linux user:
- Create a subfolder bin in your home folder with
mkdir ~/bin
- Go to the folder with
cd ~/bin
- Download the binary file with
wget <compose_url> -O docker-compose
- If you forget to add the -O option, you can rename the file with
mv <long_filename> docker-compose
- Make sure that the docker-compose file is in the folder with
ls
- Make the binary executable with
chmod +x docker-compose
- Check the file with ls again; it should now be colored green. You should now be able to run it with
./docker-compose version
- Go back to the home folder with
cd ~
- Run the following to modify your PATH environment variable:
nano .bashrc
- Scroll to the end of the file.
- Add this line at the end:
export PATH="${HOME}/bin:${PATH}"
- Press CTRL+O on your keyboard and then press Enter to save the file.
- Press CTRL+X on your keyboard to exit the Nano editor.
- Reload the path environment variable with
source .bashrc
- You should now be able to run Docker compose from anywhere; test it with
docker-compose version
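If you prefer to confirm the PATH change from a script, the following sketch looks up both tools on the current PATH with the standard library. It only checks that an executable can be found and does not run it; the helper name is an assumption made for this example.

```python
import shutil


def find_tool(name, path=None):
    """Return the full path of an executable, or None if it is not found on PATH."""
    return shutil.which(name, path=path)


for tool in ("docker", "docker-compose"):
    location = find_tool(tool)
    print(f"{tool}: {location or 'not found on PATH'}")
```

If docker-compose is reported as not found, check that the export line was added to .bashrc and that the file was reloaded.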
- Run
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
- Run
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
- Run
sudo apt-get update && sudo apt-get install terraform
- Copy the credentials file from your local machine to the VM using sftp:
sftp de-zoomcamp
put google_credentials.json
Create an environment variable called GOOGLE_APPLICATION_CREDENTIALS and assign it the path of your JSON credentials file (covered in the Create a Service Account section), which should be in $HOME/.google/credentials/. Assuming you're running bash:
- Open .bashrc:
nano ~/.bashrc
- At the end of the file, add the following line:
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/authkeys>.json"
- Exit nano with Ctrl+X. Follow the on-screen instructions to save the file and exit.
- Log out of your current terminal session and log back in, or run the following to activate the environment variable:
source ~/.bashrc
- Refresh the token and verify the authentication with the GCP SDK:
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
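Before moving on, it can help to confirm that the variable really points at a usable key file. The sketch below is a local sanity check only: it reads GOOGLE_APPLICATION_CREDENTIALS, confirms that the file exists, and checks for a few fields that service account key files normally contain. The function name and the choice of fields to check are assumptions, and it does not contact Google Cloud.

```python
import json
import os
from pathlib import Path

# Fields normally present in a service account key file (not an exhaustive list).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def load_credentials(env=None):
    """Load the key file referenced by GOOGLE_APPLICATION_CREDENTIALS."""
    env = os.environ if env is None else env
    raw_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw_path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(raw_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No credentials file at {path}")
    data = json.loads(path.read_text())
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"Credentials file is missing fields: {missing}")
    return data


try:
    creds = load_credentials()
    print("Service account:", creds["client_email"])
except (RuntimeError, OSError, ValueError) as exc:
    print("Credentials problem:", exc)
```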
Log in to your VM instance and run the following from your $HOME folder:
git clone https://github.com/FilipeTheAnalyst/airflow_youtube.git
IMPORTANT: I strongly suggest that you fork my project and clone your copy so that you can easily perform changes on the code, because you will need to customize a few variables in order to make it run with your own infrastructure.
Make sure that the credentials are updated and the environment variable is set up.
- Go to the terraform folder.
- Open variables.tf and edit line 11 under the variable "region" block so that it matches your preferred region.
- Initialize Terraform:
terraform init
- Plan the infrastructure and make sure that you're creating a bucket in Cloud Storage as well as a dataset in BigQuery:
terraform plan
- If the plan details are as expected, apply the changes:
terraform apply
You should now have a bucket called dtc_data_lake_youtube_data and a dataset called youtube_data in BigQuery.
- Go to the airflow folder.
- Run the following command and write down the output:
echo -e "AIRFLOW_UID=$(id -u)"
- Open the .env file and replace the value of AIRFLOW_UID with the output of the previous command.
- Also replace the value of API_KEY with the YouTube API key generated above.
- Open the docker-compose.yaml file and set GCP_PROJECT_ID and GCP_GCS_BUCKET on lines 65 and 66 to the correct values for your configuration.
- Build the custom Airflow Docker image:
docker-compose build
- Initialize the Airflow configs:
docker-compose up airflow-init
- Run Airflow:
docker-compose up
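The .env edits in the steps above can also be scripted. The sketch below rewrites KEY=value lines in a piece of .env text and appends any key that is missing. The function name and the starting file content are hypothetical; on the VM, the AIRFLOW_UID value would be the number printed by id -u.

```python
def update_env_text(text, updates):
    """Return .env content with the given KEY=value pairs set."""
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            # Replace the existing value for this key.
            out.append(f"{key}={remaining.pop(key)}")
        else:
            # Keep comments, blank lines and unrelated keys untouched.
            out.append(line)
    # Append any keys that were not already present in the file.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Hypothetical starting content and placeholder values, for illustration only.
original = "AIRFLOW_UID=50000\nAPI_KEY=changeme\n"
updated = update_env_text(original, {"AIRFLOW_UID": 1001, "API_KEY": "YOUR_API_KEY"})
print(updated)
```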
You may now access the Airflow GUI by browsing to localhost:8080. Username and password are both airflow.
IMPORTANT: this is NOT a production-ready setup! The username and password for Airflow have not been modified in any way; you can find them by searching for _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD inside the docker-compose.yaml file.
- If you can't connect to Airflow, you need to forward port 8080 to your local machine. You can do this in VSCode by following these steps:
- Open terminal
- Click on Ports
- Select the option Forward a port and select port 8080
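To check from your local machine whether the forwarded port is reachable, a plain TCP connection test is enough. This is a minimal sketch using only the standard library; the function name is made up, and a successful connection only shows that something is listening on the port, not that Airflow itself is healthy.

```python
import socket


def port_is_open(host="localhost", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


print("Port 8080 reachable:", port_is_open())
```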
If you performed all the steps of the previous section, you should now have a web browser with the Airflow dashboard.
The DAG is set up to download all data starting from 2022-06-06. You may change this date by modifying line 43 of airflow/dags/data_ingestion_youtube.py. Should you change the DAG date, you will have to delete the DAG in the Airflow UI and wait a couple of minutes so that Airflow can pick up the changes in the DAG.
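To get a feel for how many daily runs a given start date implies, the sketch below lists every date between a start and an end date. It is plain date arithmetic, not Airflow code; whether Airflow actually creates one run per listed date depends on the DAG's schedule and catchup settings, which this README does not spell out, so treat that as an assumption.

```python
from datetime import date, timedelta

START_DATE = date(2022, 6, 6)  # start date stated above


def daily_dates(start, end):
    """Return every date from start to end, inclusive (empty if end is before start)."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


dates = daily_dates(START_DATE, date(2022, 6, 10))
print(len(dates), dates[0], dates[-1])
```

Moving the start date later shortens this list, which is one reason to adjust it before the first run.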
To trigger the DAG, simply click on the switch icon next to the DAG name. The DAG will retrieve all data for the YouTube channels listed in the airflow/dags/ingest_youtube.py code, along with the videos of each channel.
The DAG consists of the following tasks:
- 1 BashOperator to execute the python code to collect the data from Youtube API
- 2 PythonOperator tasks to upload the data into a GCS bucket (1 for channels data and the other for videos details)
- 2 BigQueryOperator tasks to create external tables into BigQuery with the channels data and video details data
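The five tasks listed above can be pictured as a small dependency graph. The sketch below models one plausible ordering with the standard library (Python 3.9+) rather than with Airflow: the task ids are hypothetical, and the assumption is that each upload waits for the collection step and each external table waits for its own upload. The actual dependencies are defined in the DAG file.

```python
from graphlib import TopologicalSorter

# Hypothetical task ids; each key maps to the set of tasks it depends on.
dependencies = {
    "collect_youtube_data": set(),
    "upload_channels_to_gcs": {"collect_youtube_data"},
    "upload_videos_to_gcs": {"collect_youtube_data"},
    "create_channels_external_table": {"upload_channels_to_gcs"},
    "create_videos_external_table": {"upload_videos_to_gcs"},
}

# static_order() yields the tasks so that every task comes after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Under these assumptions the channels and videos branches are independent of each other after the collection step, so they could run in parallel.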
After successfully running the Airflow workflow, you should see parquet files created in the GCS bucket for each day, and the tables channels and videos created inside the youtube_data dataset.
After the data ingestion, you may shut down Airflow by pressing Ctrl+C on the terminal running Airflow and then running docker-compose down, or you may keep Airflow running if you want to update the dataset every day.