Welcome to my data engineering project, where I use the Lichess API to analyze chess games played on the platform. Chess has seen a historic rise in popularity in recent years, driven by quick, accessible online games that let players face opponents from all around the world. Two of the most popular online formats are bullet and blitz, where each player has only a few minutes on the clock.
Lichess, a non-profit and open-source online chess platform, has quickly become a favorite among chess enthusiasts due to its commitment to fairness, openness, and transparency. As an open-source platform, Lichess provides an API that allows developers to access a wealth of data about the games played on their platform.
In this Data Engineering project, I will use the Lichess API to collect, transform, and visualize data on chess games played on the platform. By analyzing this data, I hope to gain insights into the openings used by top players, as well as identify trends and patterns in the game that could be useful in improving player performance.
The Lichess API comes with limitations that shape how data can be pulled from it:
- Game exports are throttled to 20 games per second.
- Only one request should be made at a time; exceeding this limit (an HTTP 429 response) means waiting a full minute before resuming.
- Computer evaluation data is only available for games on which computer analysis has already been requested ("Run Computer Analysis").
By being aware of these limitations, we can better design our data ingestion process to work within these constraints and avoid running into issues with API usage.
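As a rough illustration of working within these constraints, here is a minimal sketch using only the Python standard library. It is not the project's actual script: the helper names (`build_games_url`, `parse_ndjson`, `fetch_games`), the retry count, and the choice of query parameters are assumptions made for the example. It issues one request at a time and pauses for a full minute when the API answers with HTTP 429.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "https://lichess.org/api/games/user/"
BACKOFF_SECONDS = 60  # wait a full minute after an HTTP 429 response


def build_games_url(username, perf_types, max_games, since_ms=None):
    """Build the game-export URL for one user (hypothetical helper)."""
    params = {
        "max": max_games,
        "perfType": ",".join(perf_types),
        "opening": "true",  # include opening names in each game record
    }
    if since_ms is not None:
        params["since"] = since_ms  # only games after this epoch-millisecond timestamp
    return BASE_URL + urllib.parse.quote(username) + "?" + urllib.parse.urlencode(params)


def parse_ndjson(body):
    """Turn a newline-delimited JSON body into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_games(url, opener=urllib.request.urlopen, sleep=time.sleep, max_retries=3):
    """Fetch one batch of games, one request at a time, backing off on HTTP 429.

    `opener` and `sleep` are injectable so the function can be exercised offline.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/x-ndjson"})
    for _ in range(max_retries):
        try:
            with opener(request) as response:
                return parse_ndjson(response.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # anything other than rate limiting is a real error
            sleep(BACKOFF_SECONDS)
    raise RuntimeError("still rate limited after retries")
```

Calling `fetch_games(build_games_url("penguingim1", ["bullet", "blitz"], 20))` would request up to 20 games in a single call; keeping the calls sequential is what honors the one-request-at-a-time rule.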
In the ingestion phase of the project, I've set up a Python script that fetches data from the Lichess API, focusing on the bullet, blitz, and ultrabullet games of user 'penguingim1'. The script polls in 21-second intervals (to comply with rate limits) for a duration of 10 minutes. The collected data is published to a Kafka topic and then loaded into Snowflake via a Confluent connector, setting the stage for dbt modeling and analysis.
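The polling loop described here might look roughly like the sketch below. It is an illustration under assumptions rather than the actual producer code: `run_ingestion`, its parameters, and the de-duplication by game `id` are hypothetical, and the Kafka client is abstracted behind a `send(key, value)` callable so the loop stays independent of any particular Kafka library.

```python
import json
import time

POLL_INTERVAL_SECONDS = 21   # polling interval described in the text
RUN_DURATION_SECONDS = 600   # 10 minutes


def run_ingestion(fetch, send, interval=POLL_INTERVAL_SECONDS,
                  duration=RUN_DURATION_SECONDS,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll `fetch()` until `duration` elapses and forward new games to `send`.

    fetch: callable returning a list of game dicts, each with an "id" key (assumed).
    send:  callable taking (key_bytes, value_bytes), e.g. a thin wrapper
           around a Kafka client's produce method (assumed, not shown).
    Returns the number of games sent.
    """
    seen_ids = set()  # assumption: skip games already sent during this run
    sent = 0
    deadline = clock() + duration
    while clock() < deadline:
        for game in fetch():
            game_id = game["id"]
            if game_id in seen_ids:
                continue
            seen_ids.add(game_id)
            # Key by game id and serialize the full record as JSON.
            send(game_id.encode("utf-8"), json.dumps(game).encode("utf-8"))
            sent += 1
        sleep(interval)
    return sent
```

Injecting `clock` and `sleep` makes it possible to dry-run the loop without waiting ten minutes, and keying messages by game id gives downstream models a natural way to de-duplicate.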
The infrastructure/producer folder holds the code for the custom Kafka producer I created for data ingestion. This includes the Dockerfile that I use to host the producer on AWS, separately from my dbt project.
The warehouse/snowflake folder holds the full dbt project, which is also dockerized for AWS.
Before creating a dashboard, ask yourself a few questions:
- What is the primary focus?
- What type of decisions will be made based on the data?
- How detailed does your dashboard need to be?
The primary focus of this particular dashboard is to help the viewer choose new chess openings to try out, based on what the selected top player uses.
The completed visualization includes a table and a few graphs:
- Win percentage by chess opening, sorted by play count (table)
- Chess rating over time
- Win rate by day
- Play rate of the top 10 chess openings
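
To make the first and last metrics concrete, here is a small Python sketch of the arithmetic behind them. In the project itself this logic would live in the dbt models; the function name `opening_stats` and the flattened record shape (an `opening` name plus a `result` of "win", "loss", or "draw" from the selected player's point of view) are assumptions made for the example.

```python
from collections import Counter


def opening_stats(games, top_n=10):
    """Summarize win percentage and play rate per opening.

    games: iterable of dicts with "opening" and "result" keys (assumed shape).
    Returns up to `top_n` rows, ordered by play count, most played first.
    """
    plays = Counter()
    wins = Counter()
    for game in games:
        plays[game["opening"]] += 1
        if game["result"] == "win":
            wins[game["opening"]] += 1

    total = sum(plays.values())
    rows = []
    for opening, count in plays.most_common(top_n):
        rows.append({
            "opening": opening,
            "games": count,
            # share of this opening's games that were won
            "win_pct": round(100 * wins[opening] / count, 1),
            # share of all games in which this opening was played
            "play_rate_pct": round(100 * count / total, 1),
        })
    return rows
```

Sorting by play count first keeps rarely played openings with a lucky 100% win rate from crowding out the openings the player actually relies on.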