vincenzogalante / magic-the-gathering

A complete pipeline to pull data from Scryfall's "Magic: The Gathering"-API, via Prefect orchestration and dbt transformation.

Home Page: https://lookerstudio.google.com/u/0/reporting/ebdf68e1-27f7-435b-8add-a4018681f801/page/BkBJD


Introduction

TL;DR: This project analyzes MTG card data. Follow the steps under How to make it work? to set it up.

What is this about?

If you have never seen the logo above, here is a very quick intro: Magic: The Gathering (Magic or MTG) is the first collectible card game of its kind, created in 1993. It can be played by two or more players. The exact rules differ a bit by the format played, but in general: players battle each other with their card decks, trying to defeat the other players by casting spells, artifacts, and creatures. Curious? Learn how to play...

With this project I aim to build an end-to-end orchestrated data pipeline. I call the Scryfall API to get a complete export of the latest card information available and save the data to Google Cloud Storage. Afterwards, I filter the data down to the needed columns, enforce data types, and push it to BigQuery. There, I use dbt to derive new columns and make sure that only the most up-to-date dataset is used for visualization in the dashboard.
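To make the extraction step concrete, here is a minimal sketch of pulling Scryfall's bulk card export and landing it in Cloud Storage (the bucket name and object path are placeholders, not the values used in this repo):

```python
import requests
from google.cloud import storage

# Scryfall publishes daily bulk exports; this public endpoint lists them.
listing = requests.get("https://api.scryfall.com/bulk-data", timeout=30).json()
default_cards = next(
    item for item in listing["data"] if item["type"] == "default_cards"
)

# Download the complete card export (a large JSON array).
cards = requests.get(default_cards["download_uri"], timeout=300).content

# Land the raw file in the data lake; "your-mtg-data-lake" is a placeholder.
bucket = storage.Client().bucket("your-mtg-data-lake")
bucket.blob("raw/default_cards.json").upload_from_string(cards)
```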

What questions am I trying to answer?

As this project looks at MTG from a meta perspective, we are diving into the following:

  1. How many cards are being released over time?
  2. What is the color distribution?
  3. Which set has the highest count of cards?
  4. Who are the most common artists?
  5. What are the most expensive cards?

What technologies are being used?

  • Cloud: Google Cloud
  • Infrastructure: Terraform
  • Orchestration: Prefect
  • Data lake: Google Cloud Storage
  • Data transformation: dbt
  • Data warehouse: BigQuery
  • Data visualization: Google Looker Studio

Dashboard example

See my Looker dashboard: https://lookerstudio.google.com/u/0/reporting/ebdf68e1-27f7-435b-8add-a4018681f801/page/BkBJD

What is the structure of the production table?

| Column | Description |
| --- | --- |
| primary_key | Unique surrogate key built from the card_id and released_at fields |
| card_id | Card ID in the database; IDs can repeat due to reprintings |
| name | The name of this card |
| released_at | The date this card was first released |
| color_identity | This card’s color identity |
| color_category | Based on the color_identity: Black, Blue, White, Green, Red, Colorless, or Mixed |
| set_name | This card’s full set name |
| artist | The name of the illustrator of this card face |
| price | Price of this card in US dollars |
| data_update | Timestamp of when the data was updated in the database |
  • Here is the dbt lineage graph
  • Partitioned on the released_at column, in favor of questions 1 and 3, assuming that in most cases cards with the same release date are from the same set
  • Clustered on the color_category column, in favor of question 2, assuming that within one set the number of colors is lower than the number of unique prices and artists; a sketch of the corresponding dbt model configuration follows this list
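As a sketch, the production model's configuration in dbt on BigQuery would look roughly like this (the staging model name is a placeholder, and the surrogate key macro assumes the dbt_utils package):

```sql
-- Illustrative dbt model: partitioned by release date, clustered by color.
{{ config(
    materialized = 'table',
    partition_by = {'field': 'released_at', 'data_type': 'date'},
    cluster_by = ['color_category']
) }}

select
    {{ dbt_utils.generate_surrogate_key(['card_id', 'released_at']) }} as primary_key,
    card_id,
    name,
    released_at,
    color_identity,
    color_category,
    set_name,
    artist,
    price,
    current_timestamp() as data_update
from {{ ref('stg_mtg_cards') }}  -- placeholder staging model
```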

How to make it work?

  1. Set up your Google Cloud environment
```bash
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
gcloud auth application-default login
```
  2. Install all required dependencies into your environment
```bash
pip install -r requirements.txt
```
  3. Set up your infrastructure
  • Assuming you are using Linux on AMD64, run the following commands to install Terraform - if you are using a different OS, please choose the correct version here and exchange the download link and zip file name
```bash
sudo apt-get install unzip
mkdir -p ~/bin && cd ~/bin
wget https://releases.hashicorp.com/terraform/1.4.1/terraform_1.4.1_linux_amd64.zip
unzip terraform_1.4.1_linux_amd64.zip
rm terraform_1.4.1_linux_amd64.zip
```
  • To initialize, plan, and apply the infrastructure, adjust and run the following Terraform commands
```bash
cd terraform/
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
```
  4. Set up your orchestration
  • If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here
  • Create the Prefect blocks via the cloud UI, or adjust the variables in /prefect/prefect_blocks.py and run
```bash
python magic-the-gathering/prefect/prefect_blocks.py
```
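The blocks registered by that script look roughly like the following sketch, assuming the prefect-gcp collection (block names, bucket name, and keyfile path are placeholders):

```python
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Register a credentials block pointing at the service-account keyfile.
GcpCredentials(
    service_account_file="/path/to/your/credentials.json"
).save("mtg-gcp-creds", overwrite=True)

# Register a bucket block that the flows use to reach the data lake.
GcsBucket(
    bucket="your-mtg-data-lake",
    gcp_credentials=GcpCredentials.load("mtg-gcp-creds"),
).save("mtg-gcs-bucket", overwrite=True)
```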
  • Adjust the keyfile location in dbt/profiles.yml to the path of your Google Cloud credentials JSON
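For orientation, a minimal dbt profiles.yml for BigQuery with a service-account keyfile looks roughly like this (the profile name, project ID, and paths are placeholders; only the keyfile line needs adjusting):

```yaml
mtg:                     # placeholder; must match the profile in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /path/to/your/credentials.json   # adjust this path
      project: your-gcp-project-id
      dataset: mtg_card_data_dbt
      threads: 4
```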
  • To execute the flow, run the following commands in two separate terminals
```bash
prefect agent start -q 'default'
python prefect/api_to_gcs_to_bq.py
```
  5. Data deep dive
  • The data will be available in BigQuery at mtg_card_data_dbt.dbt_mtg_latest_data
  • Query the data in place or build a dashboard on top of it
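For example, question 1 can be answered straight from the BigQuery console with a query along these lines:

```sql
-- Cards released per year (question 1).
select
  extract(year from released_at) as release_year,
  count(*) as cards_released
from mtg_card_data_dbt.dbt_mtg_latest_data
group by release_year
order by release_year
```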

Potential next steps

With a growing database, I would be able to further explore the following:

  • What is the color distribution over time?
  • What is the price development of specific cards / colors / sets over time?
