mlops_project's Introduction

Air Quality Index Prediction

This is a study project for learning some ML techniques. More information is available in the wiki (in Russian).

About project

The air contains many pollutants that worsen people's quality of life. We could monitor some or all of them individually; instead, we use a single Air Quality Index (AQI) that summarizes several pollutants. If you want, you can read the article in the references to understand which diseases are associated with each pollutant.

Some points you should know: to calculate the AQI we use ozone, SO2, NO2, CO, PM10, and PM2.5. The last two are airborne particulate matter (PM). Particles with a diameter of 10 microns or less (PM10) can be inhaled into the lungs and can cause adverse health effects. Fine particulate matter is defined as particles that are 2.5 microns or less in diameter (PM2.5), so PM2.5 is a subset of PM10.
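The source does not say which AQI scheme the project follows. As a rough sketch only, many index schemes map each pollutant concentration to a sub-index by linear interpolation between breakpoints and then take the worst sub-index as the overall value. The breakpoint table and the function names below are hypothetical and are not taken from this repo.

```python
from bisect import bisect_right
from typing import Dict, List, Mapping, Tuple

# HYPOTHETICAL breakpoints, for illustration only: (concentration, index) pairs.
# A real AQI scheme publishes its own table for every pollutant.
BREAKPOINTS: Dict[str, List[Tuple[float, float]]] = {
    "PM2.5": [(0.0, 0.0), (10.0, 25.0), (25.0, 50.0), (50.0, 75.0), (100.0, 100.0)],
    "PM10": [(0.0, 0.0), (25.0, 25.0), (50.0, 50.0), (90.0, 75.0), (180.0, 100.0)],
}


def sub_index(pollutant: str, concentration: float) -> float:
    """Linearly interpolate a pollutant's sub-index between its breakpoints."""
    points = BREAKPOINTS[pollutant]
    concentrations = [c for c, _ in points]
    # Clamp values outside the table to its first or last index value.
    if concentration <= concentrations[0]:
        return points[0][1]
    if concentration >= concentrations[-1]:
        return points[-1][1]
    upper = bisect_right(concentrations, concentration)
    (c_lo, i_lo), (c_hi, i_hi) = points[upper - 1], points[upper]
    return i_lo + (i_hi - i_lo) * (concentration - c_lo) / (c_hi - c_lo)


def overall_index(measurements: Mapping[str, float]) -> float:
    """Overall index = worst (largest) sub-index among the measured pollutants."""
    return max(sub_index(name, value) for name, value in measurements.items())
```

With this illustrative table, `overall_index({"PM2.5": 5.0, "PM10": 50.0})` returns 50.0, because the PM10 sub-index (50.0) is worse than the PM2.5 sub-index (12.5).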

Structure

So what do we do? We use the API at discomap.eea.europa.eu to get historical data and also up-to-date data for forecasting the AQI. The request configurations are saved in metadata: the country, station, pollutant, and period used for each API request.
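A request configuration of this kind could be modeled roughly as follows. This is a minimal sketch: the class name, field names, and JSON layout are assumptions for illustration, and the actual files in metadata may look different.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class RequestConfig:
    """Hypothetical shape of one saved request configuration."""

    country: str
    station: str
    pollutant: str
    start: date
    end: date

    def __post_init__(self) -> None:
        # Reject an empty or reversed period early.
        if self.start > self.end:
            raise ValueError("start must not be after end")

    def to_json(self) -> str:
        """Serialize to JSON, storing dates as ISO strings."""
        payload = asdict(self)
        payload["start"] = self.start.isoformat()
        payload["end"] = self.end.isoformat()
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "RequestConfig":
        """Rebuild a configuration from the JSON produced by to_json."""
        raw = json.loads(text)
        return cls(
            country=raw["country"],
            station=raw["station"],
            pollutant=raw["pollutant"],
            start=date.fromisoformat(raw["start"]),
            end=date.fromisoformat(raw["end"]),
        )
```

Keeping the period as real dates rather than strings lets the configuration validate itself before any request is made.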

DVC runs a pipeline with the following stages:

  • data filtering by station
  • merging with new data if needed
  • cleaning data
  • calculation
  • train model
  • evaluate
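The first three stages above (filtering by station, merging with new data, cleaning) can be sketched in plain Python as follows. The row layout (`station`, `date`, `value`), the function names, and the cleaning rule are assumptions for illustration; the real scripts in src may work differently.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]  # assumed keys: "station", "date" (ISO string), "value"


def filter_by_station(rows: List[Row], station: str) -> List[Row]:
    """Keep only the rows recorded at the given station."""
    return [row for row in rows if row["station"] == station]


def merge_with_new(historical: List[Row], new: List[Row]) -> List[Row]:
    """Merge two row sets; a new row replaces a historical row for the same day."""
    merged = {(row["station"], row["date"]): row for row in historical}
    merged.update({(row["station"], row["date"]): row for row in new})
    # ISO date strings sort chronologically, so a plain sort is enough.
    return sorted(merged.values(), key=lambda row: (row["station"], row["date"]))


def clean(rows: List[Row]) -> List[Row]:
    """Drop rows with a missing or negative measurement (assumed cleaning rule)."""
    return [row for row in rows if row["value"] is not None and row["value"] >= 0]


if __name__ == "__main__":
    historical = [
        {"station": "A", "date": "2021-01-01", "value": 10.0},
        {"station": "A", "date": "2021-01-02", "value": None},
    ]
    new = [{"station": "A", "date": "2021-01-02", "value": 12.0}]
    print(clean(merge_with_new(filter_by_station(historical, "A"), new)))
```

Keeping each stage a small pure function mirrors how a DVC stage consumes one file and produces another, which makes the stages easy to rerun independently.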

Repo structure:

(cookiecutter style)

  • data
    • external - up-to-date data
    • interim - intermediate data that has been transformed
      • cleaned - cleaned data after merging
      • filtered - filtered data by station code
      • updated - merged data historical with current
    • processed - the final, canonical data sets for modeling
    • raw - historical data
  • metadata - meta information for scripts
  • models - trained and serialized models, model predictions, or model summaries
  • notebooks - jupyter notebooks
  • references - paper for research
  • reports - generated analysis
  • src - source code for use in this project
    • app - script for model service with fastapi
    • data - scripts to download or generate data
    • features - scripts to turn raw data into features for modeling
    • models - scripts to train models and then use trained models to make predictions

Information about Docker containers, the model service, and experiments can be found in the wiki.

How to use:

Full way (if you want to reproduce the whole pipeline)

  1. Download the data (about 9 GB unzipped) from the link and put it in data/raw.
  2. After creating a virtual environment, install all libraries:
     poetry install
  3. If you are not using conda, run:
     poetry shell
  4. Run the pipeline:
     dvc repro
  5. Run the service:
     python service/main.py
  6. Go to http://127.0.0.1:8000/ and look at the statistics and prediction.

Easy way to run

  1. Install the required libraries:
     pip install -r service/requirements.txt
  2. Run the service:
     python service/main.py
  3. Go to http://127.0.0.1:8000/ and look at the statistics and prediction. This works without running the pipeline because a saved model is included in the repo.

Here you will see the previous data and the prediction for the next day.

Experiments

This is an old project; the base models were tested some time ago. We have since added CatBoost and some DL models. Two widely used error measures are Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Both give greater weight to large errors than to small ones. Another widely used measure, the Mean Absolute Error (MAE), weights every error in proportion to its size and so is less dominated by a few large misses. We therefore report both RMSE and MAE.
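The difference between these measures is easy to see on a small example. The sketch below uses only the standard definitions of MSE, RMSE, and MAE; the data is made up for illustration and is not from the project.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    actual = [0.0, 0.0, 0.0, 0.0]
    evenly_off = [1.0, 1.0, 1.0, 1.0]  # four small errors
    one_big_miss = [0.0, 0.0, 0.0, 4.0]  # same total error, concentrated in one day
    print(mae(actual, evenly_off), rmse(actual, evenly_off))      # 1.0 1.0
    print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 1.0 2.0
```

Both forecasts have the same MAE, but the RMSE doubles when the error is concentrated in a single large miss.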

The tables below present the best models. The notebooks contain many experiments with the number of layers, learning rate, and other parameters (depending on the model).

For one day prediction

| Model | Features | Days of history used | RMSE | MAE |
|---|---|---|---|---|
| Naive (baseline) | Index | One day | 13.8 | 7.7 |
| SARIMAX | Index | All train data | 12.8 | 7.4 |
| RandomForest | Pollutants | 5 days | 13.8 | 8.4 |
| SVR | Pollutants | 5 days | 12.7 | 6.9 |
| XGBRegressor | Pollutants | 5 days | 10.9 | 6.8 |
| CatBoostRegressor | Pollutants | 5 days | 13.1 | 7.8 |
| LSTM (keras, custom architecture) | Index | All train data | 12.2 | 7.2 |
| PyTorch Forecasting | Index | All train data | 16.3 | 11.4 |
| FEDOT | Index + PM2.5 | All train data | 15.7 | 9.8 |

As we can see, the best model for one-day prediction is XGBoost. In our experiment on the whole test set, FEDOT made predictions with very low errors (RMSE 6.4, MAE 5.1), but its errors for one-day prediction are very high.
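To make the "one day" and "5 days" history settings concrete, here is a rough sketch of a naive previous-day baseline and of turning a daily series into fixed-length lag windows. It uses a single series for simplicity; the function names are hypothetical, and the project's real feature code (which uses several pollutants) may differ.

```python
from typing import List, Sequence, Tuple


def naive_forecast(series: Sequence[float]) -> List[float]:
    """Predict each day as the previous day's value.

    The result aligns with series[1:], since the first day has no predecessor.
    """
    return list(series[:-1])


def make_windows(
    series: Sequence[float], n_lags: int
) -> Tuple[List[List[float]], List[float]]:
    """Build (features, target) pairs: the previous n_lags days predict the next day."""
    features: List[List[float]] = []
    targets: List[float] = []
    for i in range(n_lags, len(series)):
        features.append(list(series[i - n_lags:i]))
        targets.append(series[i])
    return features, targets


if __name__ == "__main__":
    daily_index = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # made-up values
    X, y = make_windows(daily_index, n_lags=5)
    print(X)  # [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0]]
    print(y)  # [6.0, 7.0]
```

Each row of `X` can be fed to a tabular regressor, with the matching entry of `y` as the target.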

For 5 days prediction:

| Model | Features | Days of history used | RMSE | MAE |
|---|---|---|---|---|
| SARIMAX | Index | All train data | 18.7 | 11.5 |
| PyTorch Forecasting | Index | All train data | 16.4 | 11.5 |
| FEDOT | Index + PM2.5 | All train data | 18.2 | 11.7 |

For multi-day prediction the results are not very good: the RMSE is about equal to or greater than the standard deviation of the Index.
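The comparison with the standard deviation has a simple reading: a model that always predicts the mean of the evaluated values has an RMSE equal to their (population) standard deviation, so an RMSE at or above that level is no better than a constant forecast. A minimal sketch with made-up numbers and a hypothetical helper name:

```python
import math
import statistics
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_to_std_ratio(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Ratio below 1 means the forecast beats a constant mean forecast."""
    return rmse(y_true, y_pred) / statistics.pstdev(y_true)


if __name__ == "__main__":
    observed = [10.0, 20.0, 30.0, 40.0]  # made-up values
    constant = [statistics.fmean(observed)] * len(observed)
    # The constant mean forecast lands at a ratio of 1 (up to rounding).
    print(rmse_to_std_ratio(observed, constant))
```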

Speed

Measured on a MacBook Air M1 (8 cores, 8 GB RAM): time 1e-4, std 2e-5.
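The source does not say how the timing was taken or what exactly was timed. One common way to obtain a mean and a standard deviation like the ones above is to repeat a call many times with a monotonic clock. The helper below is a hypothetical sketch, and the timed function is a stand-in, not the project's model.

```python
import statistics
import time
from typing import Callable, Tuple


def time_calls(func: Callable[[], object], repeats: int = 1000) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-call durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.fmean(durations), statistics.stdev(durations)


if __name__ == "__main__":
    # Stand-in workload; replace with the call you actually want to measure.
    mean_s, std_s = time_calls(lambda: sum(range(1000)))
    print(f"time: {mean_s:.1e} s, std: {std_s:.1e} s")
```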

Code style

We use CI to check code style. There are several checks:

  • black
  • flake8

Tests are run locally.

Contributors

xenan, artjanichev, qnbhd
