Air Quality Index Prediction

This is a study project for learning some ML techniques. More information is available in the wiki (in Russian).

About project

Air contains many pollutants that worsen people's quality of life. We could monitor some or all of them individually. In this project we use the Air Quality Index (AQI) instead of tracking several pollutants separately. If you want, you can read the article in references to understand which diseases are associated with which pollutant.

Some points you should know: to calculate AQI we use ozone, SO2, NO2, CO, PM10, and PM2.5. The last two are airborne particulate matter (PM). Particles with a diameter of 10 microns or less (PM10) can be inhaled into the lungs and can cause adverse health effects. Fine particulate matter is defined as particles that are 2.5 microns or less in diameter (PM2.5). Therefore, PM2.5 is a subset of PM10.
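A minimal sketch of how an index can be derived from pollutant concentrations. It assumes a common convention: each pollutant gets a sub-index by linear interpolation inside a breakpoint band, and the overall index is the largest sub-index. The breakpoint values and the names `EXAMPLE_BREAKPOINTS`, `sub_index`, and `overall_index` are illustrative placeholders, not the official tables and not necessarily the method used in this project.

```python
from typing import Dict, List, Tuple

# (conc_low, conc_high, index_low, index_high)
# ILLUSTRATIVE placeholder values, not the breakpoints of any official standard.
Breakpoint = Tuple[float, float, float, float]

EXAMPLE_BREAKPOINTS: Dict[str, List[Breakpoint]] = {
    "PM2.5": [(0.0, 10.0, 0.0, 50.0), (10.0, 25.0, 50.0, 100.0), (25.0, 50.0, 100.0, 200.0)],
    "PM10": [(0.0, 20.0, 0.0, 50.0), (20.0, 50.0, 50.0, 100.0), (50.0, 100.0, 100.0, 200.0)],
}


def sub_index(pollutant: str, concentration: float) -> float:
    """Linearly interpolate the sub-index inside the matching breakpoint band."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    bands = EXAMPLE_BREAKPOINTS[pollutant]
    for c_lo, c_hi, i_lo, i_hi in bands:
        if c_lo <= concentration <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (concentration - c_lo) + i_lo
    # Above the last band: clamp to the top index value of the table.
    return bands[-1][3]


def overall_index(concentrations: Dict[str, float]) -> float:
    """Take the overall index as the worst (largest) pollutant sub-index."""
    return max(sub_index(p, c) for p, c in concentrations.items())
```

With these placeholder bands, `overall_index({"PM2.5": 5.0, "PM10": 35.0})` is driven by PM10, because its sub-index (75) is larger than the PM2.5 sub-index (25).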

Structure

So what do we do? We use the API from discomap.eea.europa.eu to get historical data and also up-to-date data for forecasting AQI. The request configurations are stored in metadata: country, station, pollutant, and period for the API requests.
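A sketch of what one metadata entry could look like in code, assuming it holds exactly the four things listed above. The class `RequestConfig`, the function `build_query`, and all field and parameter names are hypothetical. The real parameter names expected by the discomap.eea.europa.eu API are not shown here and should be taken from its documentation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RequestConfig:
    """One metadata entry; all field names are hypothetical."""

    country: str
    station: str
    pollutant: str
    year_from: int
    year_to: int


def build_query(cfg: RequestConfig) -> Dict[str, str]:
    """Turn a config entry into query parameters (names are placeholders)."""
    if cfg.year_from > cfg.year_to:
        raise ValueError("year_from must not be later than year_to")
    return {
        "country": cfg.country,
        "station": cfg.station,
        "pollutant": cfg.pollutant,
        "year_from": str(cfg.year_from),
        "year_to": str(cfg.year_to),
    }
```

The resulting dictionary can be passed to `urllib.parse.urlencode` to build a query string once the real endpoint and parameter names are known.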

DVC runs the pipeline with the following stages:

data filtering by station
merging with new data if needed
cleaning data
calculation
train model
evaluate
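The first three stages can be sketched in plain Python as follows. This is only an illustration of the idea, not the project's scripts: the record layout (`station`, `date`, `value`), the function names, and the cleaning rule (drop missing or negative values) are assumptions.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]  # assumed keys: "station", "date" (ISO string), "value"


def filter_by_station(rows: List[Record], station: str) -> List[Record]:
    """Keep only the records measured at the given station."""
    return [r for r in rows if r["station"] == station]


def merge_with_new(historical: List[Record], new: List[Record]) -> List[Record]:
    """Merge by date; a new record replaces a historical one for the same date.

    Assumes both inputs were already filtered to a single station.
    """
    by_date = {r["date"]: r for r in historical}
    by_date.update({r["date"]: r for r in new})
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return [by_date[d] for d in sorted(by_date)]


def clean(rows: List[Record]) -> List[Record]:
    """Drop records with missing or negative values (assumed cleaning rule)."""
    return [r for r in rows if r["value"] is not None and r["value"] >= 0]
```

Chaining them as `clean(merge_with_new(filter_by_station(raw, code), filter_by_station(fresh, code)))` mirrors the stage order listed above.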

Repo structure:

(cookiecutter style)

data
- external - up-to-date data
- interim - intermediate data that has been transformed
  - cleaned - cleaned data after merging
  - filtered - filtered data by station code
  - updated - historical data merged with current data
- processed - the final, canonical data sets for modeling
- raw - historical data
metadata - meta information for scripts
models - trained and serialized models, model predictions, or model summaries
notebooks - jupyter notebooks
references - paper for research
reports - generated analysis
src - source code for use in this project
- app - script for model service with fastapi
- data - scripts to download or generate data
- features - scripts to turn raw data into features for modeling
- models - scripts to train models and then use trained models to make predictions

Information about Docker containers, the model service, and experiments can be found in the wiki.

How to use:

Full way (if you want to reproduce the whole pipeline)

Download the data (about 9 GB unzipped) from the link and put it in data/raw.
After creating a venv, install all libraries

poetry install

If you are not using conda, then run

poetry shell

Run pipeline

dvc repro

python service/main.py

Go to http://127.0.0.1:8000/ and look at statistics and prediction.

Easy way to run

Install several libs

pip install -r service/requirements.txt

python service/main.py

Go to http://127.0.0.1:8000/ and look at the statistics and prediction. This works without running the pipeline because a saved model is included in the repo.

Here you will see previous data and the prediction for the next day.

Experiments

This is an old project. The base models were tested some time ago. We have now added CatBoost and some DL models. Two widely used error measures are Mean Squared Error (MSE) and Root Mean Square Error (RMSE). Both give greater weight to large errors than to small ones. Another widely used measure, the Mean Absolute Error (MAE), weights all errors equally and so does not have this property. We therefore report both RMSE and MAE.
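A small illustration of the difference between the two measures. The numbers are made up for the example and are not project data. Both forecasts have the same total absolute error, but one spreads it evenly and the other concentrates it in a single day.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error: squares errors before averaging."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: averages error magnitudes directly."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [50.0, 50.0, 50.0, 50.0]
even_errors = [45.0, 55.0, 45.0, 55.0]  # four errors of 5
one_outlier = [50.0, 50.0, 50.0, 70.0]  # a single error of 20

print(mae(y_true, even_errors), rmse(y_true, even_errors))
print(mae(y_true, one_outlier), rmse(y_true, one_outlier))
```

MAE is 5 for both forecasts, while RMSE is 5 for the even errors and 10 for the single large one, which is why the two measures are reported together.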

The tables below present the best models. The notebooks contain many experiments with the number of layers, learning rate, and other parameters (depending on the model).

For one-day prediction

Model	Features	Days of history used	RMSE	MAE
Naive (baseline)	Index	One day	13.8	7.7
SARIMAX	Index	All train data	12.8	7.4
RandomForest	Pollutants	5 days	13.8	8.4
SVR	Pollutants	5 days	12.7	6.9
XGBRegressor	Pollutants	5 days	10.9	6.8
CatBoostRegressor	Pollutants	5 days	13.1	7.8
LSTM (keras, custom architecture)	Index	All train data	12.2	7.2
PyTorch Forecasting	Index	All train data	16.3	11.4
FEDOT	Index + PM2.5	All train data	15.7	9.8
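To make the "Naive" row and the "5 days" column concrete, here is a hedged sketch. It assumes the naive baseline predicts each day as the previous day's value and that the 5-day models are fed a sliding window of the 5 preceding values. The function names are illustrative, and the window is shown for a single series, while the models in the table use several pollutant series.

```python
from typing import List, Sequence, Tuple


def naive_forecast(series: Sequence[float]) -> List[float]:
    """Predict each day as the previous day's value (forecasts for days 1..n-1)."""
    return list(series[:-1])


def make_windows(
    series: Sequence[float], n_lags: int = 5
) -> List[Tuple[List[float], float]]:
    """Build (features, target) pairs: the n_lags previous values -> the next value."""
    return [
        (list(series[i - n_lags:i]), series[i])
        for i in range(n_lags, len(series))
    ]
```

For a 7-day series, `make_windows(series, 5)` yields two training pairs: days 1 to 5 predicting day 6, and days 2 to 6 predicting day 7.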

As we can see, the best model for one-day prediction is XGBoost. In our experiment on the whole test set, FEDOT made predictions with very low errors: RMSE 6.4, MAE 5.1. But for one-day prediction its errors are very high.

For 5-day prediction:

Model	Features	Days of history used	RMSE	MAE
SARIMAX	Index	All train data	18.7	11.5
PyTorch Forecasting	Index	All train data	16.4	11.5
FEDOT	Index + PM2.5	All train data	18.2	11.7

For multi-day forecasts the results are not very good: RMSE is about equal to or greater than the standard deviation of the index.
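Why comparing RMSE with the standard deviation is meaningful: a forecast that always predicts the mean of the series has an RMSE equal to the (population) standard deviation of that series. The sketch below checks this on made-up values, not project data. A model whose RMSE is near or above the standard deviation is therefore doing no better than that constant forecast.

```python
import math
import statistics
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


index = [20.0, 35.0, 50.0, 30.0, 65.0]  # illustrative index values
mean_value = statistics.fmean(index)

# Constant forecast: always predict the mean of the series.
constant_forecast = [mean_value] * len(index)

baseline_rmse = rmse(index, constant_forecast)
index_std = statistics.pstdev(index)  # population standard deviation

print(baseline_rmse, index_std)  # the two values coincide
```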

Speed

Measured on a MacBook Air M1 (8 cores, 8 GB RAM): mean time 1e-4, std 2e-5.
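The README does not say how the timing was taken. One possible way to obtain a mean and standard deviation like the ones above is sketched here. `time_calls` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in for a real model prediction call.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_calls(fn: Callable[[], object], n_runs: int = 100) -> Tuple[float, float]:
    """Return (mean, sample std) of the wall-clock duration of fn, in seconds."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2 to compute a standard deviation")
    durations: List[float] = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.fmean(durations), statistics.stdev(durations)


def fake_predict() -> float:
    """Hypothetical stand-in for a model prediction call."""
    return sum(x * 0.5 for x in range(5))


mean_s, std_s = time_calls(fake_predict, n_runs=100)
print(f"mean={mean_s:.2e} s, std={std_s:.2e} s")
```

Replacing `fake_predict` with a call to the trained model's prediction function would time a single forecast.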

Code style

We use CI to check code style. There are several checks:

black
flake8

Tests are run locally.
