This repository contains an Airflow DAG for orchestrating the execution of an ETL process. The ETL process extracts data from a ClickHouse database, performs necessary transformations, and loads the results into a SQLite database.
The repository is laid out as follows:

    moniepoint-etl/
    |-- dags/
    |   |-- main.py
    |-- scripts/
    |   |-- etl.py
    |-- .env
    |-- requirements.txt
    |-- README.md

- dags/main.py: Airflow DAG definition script.
- scripts/etl.py: Python script containing the ETL logic.
- .env: Configuration file for storing environment variables.
- requirements.txt: List of Python dependencies.
- README.md: Project documentation file.
- Clone the repository:

      git clone [email protected]:CliffLolo/moniepoint-etl.git

- Change into the project directory:

      cd moniepoint-etl

- Install the dependencies:

      pip install -r requirements.txt

- Create a .env file in the project root and set the following environment variables:

      CLICKHOUSE_CLOUD_HOSTNAME=your-clickhouse-hostname
      CLICKHOUSE_PORT=your-clickhouse-port
      CLICKHOUSE_USERNAME=your-clickhouse-username
      CLICKHOUSE_PASSWORD=your-clickhouse-password
      DATABASE_NAME=your-sqlite-database-name.db

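One way a script might read these settings is sketched below. It assumes the variables are already present in the process environment (for example, exported by a dotenv loader); the `Settings` class and the `load_settings` function are hypothetical names for illustration and are not taken from scripts/etl.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names come from the .env listing above.
REQUIRED_VARS = (
    "CLICKHOUSE_CLOUD_HOSTNAME",
    "CLICKHOUSE_PORT",
    "CLICKHOUSE_USERNAME",
    "CLICKHOUSE_PASSWORD",
    "DATABASE_NAME",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the ETL configuration."""

    clickhouse_host: str
    clickhouse_port: int
    clickhouse_username: str
    clickhouse_password: str
    sqlite_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if a variable is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        clickhouse_host=env["CLICKHOUSE_CLOUD_HOSTNAME"],
        clickhouse_port=int(env["CLICKHOUSE_PORT"]),  # the port is stored as text in .env
        clickhouse_username=env["CLICKHOUSE_USERNAME"],
        clickhouse_password=env["CLICKHOUSE_PASSWORD"],
        sqlite_path=env["DATABASE_NAME"],
    )
```

Passing the mapping as a parameter keeps the function easy to exercise with a plain dictionary, while calling `load_settings()` with no arguments reads from `os.environ`.
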
The ETL logic is defined in the scripts/etl.py file. It connects to a ClickHouse database, executes a SQL query, and then stores the results in a SQLite database.
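The SQL query aggregates the tripdata table by month for pickups from 2014-01-01 through 2016-12-31. For Saturdays and Sundays separately, it computes the average of a 0/1 day indicator (the share of the month's trips that fall on that day, stored as sat_mean_trip_count and sun_mean_trip_count), the mean fare, and the mean trip duration in seconds: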

    SELECT
        DATE_FORMAT(pickup_date, '%Y-%m') AS month,
        -- In ClickHouse, DAYOFWEEK (alias of toDayOfWeek) numbers Monday as 1 and Sunday as 7
        -- by default, so Saturday is 6 and Sunday is 7.
        AVG(CASE WHEN DAYOFWEEK(pickup_date) = 6 THEN 1 ELSE 0 END) AS sat_mean_trip_count,
        AVG(CASE WHEN DAYOFWEEK(pickup_date) = 6 THEN fare_amount END) AS sat_mean_fare_trip,
        AVG(CASE WHEN DAYOFWEEK(pickup_date) = 6 THEN TIMESTAMPDIFF('SECOND', pickup_datetime, dropoff_datetime) END) AS sat_mean_duration_per_trip,
        AVG(CASE WHEN DAYOFWEEK(pickup_date) = 7 THEN 1 ELSE 0 END) AS sun_mean_trip_count,
        AVG(CASE WHEN DAYOFWEEK(pickup_date) = 7 THEN fare_amount END) AS sun_mean_fare_trip,
        AVG(CASE WHEN DAYOFWEEK(pickup_date) = 7 THEN TIMESTAMPDIFF('SECOND', pickup_datetime, dropoff_datetime) END) AS sun_mean_duration_per_trip
    FROM
        tripdata
    WHERE
        pickup_date BETWEEN '2014-01-01' AND '2016-12-31'
    GROUP BY
        DATE_FORMAT(pickup_date, '%Y-%m')
    ORDER BY
        month;

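To make the aggregates concrete, here is a small Python sketch that computes the same kind of per-month weekend figures over a handful of made-up trips. The sample rows and the `weekend_metrics` helper are illustrative assumptions, not data or code from this project. The sketch uses `isoweekday()`, which, like ClickHouse's toDayOfWeek in its default mode, numbers Monday as 1 and Sunday as 7.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

SATURDAY, SUNDAY = 6, 7  # isoweekday(): Monday = 1 ... Sunday = 7

# Made-up trips: (pickup_datetime, dropoff_datetime, fare_amount).
SAMPLE_TRIPS = [
    (datetime(2014, 1, 4, 10, 0), datetime(2014, 1, 4, 10, 10), 10.0),  # Saturday
    (datetime(2014, 1, 4, 12, 0), datetime(2014, 1, 4, 12, 20), 20.0),  # Saturday
    (datetime(2014, 1, 5, 9, 0), datetime(2014, 1, 5, 9, 5), 8.0),      # Sunday
    (datetime(2014, 1, 6, 8, 0), datetime(2014, 1, 6, 8, 30), 15.0),    # Monday
]


def weekend_metrics(trips):
    """Group trips by pickup month and average the Saturday and Sunday figures."""
    by_month = defaultdict(list)
    for trip in trips:
        by_month[trip[0].strftime("%Y-%m")].append(trip)

    result = {}
    for month, rows in sorted(by_month.items()):
        metrics = {}
        for label, weekday in (("sat", SATURDAY), ("sun", SUNDAY)):
            day_rows = [r for r in rows if r[0].isoweekday() == weekday]
            # Averaging a 0/1 indicator over all rows gives the share of trips on that day.
            metrics[f"{label}_mean_trip_count"] = len(day_rows) / len(rows)
            # Fare and duration are averaged over that day's trips only;
            # None is used when the month has no trips on that day.
            metrics[f"{label}_mean_fare_trip"] = (
                mean(r[2] for r in day_rows) if day_rows else None
            )
            metrics[f"{label}_mean_duration_per_trip"] = (
                mean((r[1] - r[0]).total_seconds() for r in day_rows)
                if day_rows
                else None
            )
        result[month] = metrics
    return result
```

Calling `weekend_metrics(SAMPLE_TRIPS)` returns one entry keyed by "2014-01", with the same six metric names that the query produces.
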
The results are stored in a SQLite table named moniepoint_metrics with the following schema:

    CREATE TABLE IF NOT EXISTS moniepoint_metrics (
        month TEXT,
        sat_mean_trip_count REAL,
        sat_mean_fare_trip REAL,
        sat_mean_duration_per_trip REAL,
        sun_mean_trip_count REAL,
        sun_mean_fare_trip REAL,
        sun_mean_duration_per_trip REAL
    );

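A minimal sketch of the load step, using Python's built-in sqlite3 module, might look like the following. The `load_metrics` function name and the example row are assumptions for illustration; the actual code in scripts/etl.py may differ.

```python
import sqlite3

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS moniepoint_metrics (
    month TEXT,
    sat_mean_trip_count REAL,
    sat_mean_fare_trip REAL,
    sat_mean_duration_per_trip REAL,
    sun_mean_trip_count REAL,
    sun_mean_fare_trip REAL,
    sun_mean_duration_per_trip REAL
)
"""

# One placeholder per column, in the order the schema declares them.
INSERT_SQL = "INSERT INTO moniepoint_metrics VALUES (?, ?, ?, ?, ?, ?, ?)"


def load_metrics(conn: sqlite3.Connection, rows) -> int:
    """Create the table if needed, insert the rows, and return how many were inserted."""
    rows = list(rows)
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(CREATE_TABLE_SQL)
        conn.executemany(INSERT_SQL, rows)
    return len(rows)


if __name__ == "__main__":
    # Hypothetical usage with an in-memory database and one made-up row.
    connection = sqlite3.connect(":memory:")
    load_metrics(connection, [("2014-01", 0.5, 15.0, 900.0, 0.25, 8.0, 300.0)])
    print(connection.execute("SELECT COUNT(*) FROM moniepoint_metrics").fetchone()[0])
    connection.close()
```

Because the schema declares no primary key or unique constraint, running this kind of load twice appends a second copy of each month rather than replacing the existing row.
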