
License: MIT License


3rd place solution to the OTTO – Multi-Objective Recommender System Kaggle Competition - Theo's Part

Status:

  • Document code: Done ✅
  • Clean notebooks: Done ✅
  • Make ReadMe: Done ✅
  • Rerun full pipeline to make sure everything works: To do 📝

Introduction - Adapted from Kaggle

The pipeline follows the classical candidates extraction & reranker scheme.

  • CV = 0.5917 - [clicks 0.5621, carts 0.4438, orders 0.6706] -> LB 0.6028

Clicks uses a single model; I blend a few XGBs for carts & orders, but the boost is small. Blending with my teammates' models gave our Public 0.60437 / Private 0.60382 LB!
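As a sanity check, the overall CV above follows from the competition's fixed weighted recall@20 (0.10 for clicks, 0.30 for carts, 0.60 for orders) applied to the per-type recalls:

```python
# OTTO competition metric: weighted recall@20 over the three event types.
# Weights are fixed by the competition; recalls are the CV values reported above.
WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}
recalls = {"clicks": 0.5621, "carts": 0.4438, "orders": 0.6706}

score = sum(WEIGHTS[t] * recalls[t] for t in WEIGHTS)
print(round(score, 4))  # → 0.5917
```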

Candidates

I use the candidates from Chris (link), as well as a slightly modified version of the ones from his public kernel. This results in approx. 80 candidates per session.
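A minimal sketch of the covisitation-based retrieval idea, assuming the matrix is stored as a dict of co-occurrence weights (data and names here are hypothetical; the real code works on the full matrices built by the 2-Matrices notebooks):

```python
from collections import Counter

# Hypothetical covisitation matrix: aid -> {co-visited aid: weight}.
covisit = {
    1: {2: 5.0, 3: 2.0},
    2: {3: 4.0, 4: 1.0},
}

def suggest_candidates(session_aids, covisit, top_n=80):
    """Score every item co-visited with the session's items, keep the top_n."""
    scores = Counter()
    for aid in session_aids:
        for cand, w in covisit.get(aid, {}).items():
            scores[cand] += w
    return [aid for aid, _ in scores.most_common(top_n)]

print(suggest_candidates([1, 2], covisit, top_n=3))  # → [3, 2, 4]
```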

Feature engineering

Most of my (744) features come from the following process:

  • Compute item-item scores (such as w2v similarities, matrix factorization similarities, Chris' covisitation matrix coefficients) between the candidate and the items in the session
  • Compute a weight encoding the session item's position, timestamp, and type
  • Aggregate!
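The three steps above can be sketched as follows. The real pipeline runs on RAPIDS cuDF; plain pandas is used here for illustration (the groupby API is largely the same), and the exact positional weighting scheme is an assumption:

```python
import pandas as pd

# Toy (session, candidate, session item) pairs with an item-item score
# (e.g. a w2v similarity) and the session item's position from the end.
df = pd.DataFrame({
    "session": [1, 1, 1, 1],
    "candidate": [10, 10, 20, 20],
    "score": [0.9, 0.4, 0.2, 0.8],
    "pos_from_end": [0, 2, 0, 1],
})

# Step 2: weight recent items more (the 1/(1+pos) decay is an assumption).
df["weighted"] = df["score"] * (1.0 / (1 + df["pos_from_end"]))

# Step 3: aggregate into per-(session, candidate) features.
feats = (
    df.groupby(["session", "candidate"])["weighted"]
      .agg(["sum", "max", "mean"])
      .reset_index()
)
print(feats)
```

Each item-item score and each aggregation (sum/max/mean) yields another feature column, which is how the count climbs into the hundreds.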

Features are computed in batches on a 32GB V100 using RAPIDS. It's fast :)

Overall pipeline

I run an Optuna search for each fold (which is not good practice, but I had a really reliable CV setup). The pipeline can be a bit long to run, but the actual bottleneck is reading huge parquet files. Heavy downsampling makes it possible to keep everything in RAM and to train on GPU using the tricks Chris shared publicly.

How to use the repository

Prerequisites

  • Clone the repository

  • Requirements :

    • RAPIDS! The latest stable version should work.
    • pip install -r requirements.txt
    • Bunch of stuff that doesn't really matter that much
  • Download the data :

    • Put the competition data from Kaggle in the input folder

Run the pipeline

Most of the pipeline is handled in notebooks. The order in which they should be run is specified in their names. The pipeline should run fine on a machine with 32GB of RAM.

  • Prepare the data using 1-Preparation.ipynb.
  • Create covisitation matrices using 2-Matrices_Chris.ipynb and 2-Matrices_Theo.ipynb. These notebooks have to be run with MODE="val" and MODE="test".
  • Create candidates using 3-Candidates.ipynb. This notebook has to be run with MODE="val", MODE="test" and MODE="extra".
  • Create embeddings using 4-Matrix_Factorization.ipynb, 4-Seq2Seq_Giba.ipynb and 4-Word2Vec.ipynb. These notebooks have to be run with MODE="val" and MODE="test".
  • Create features using the fe_main.py script in the src folder. Use python fe_main.py --mode MODE with modes val, test and extra.
  • Train an XGBoost model using 6-XGB.ipynb. You need to train a model for each of the 3 targets; the main parameter to tweak is POS_RATIO.
  • Evaluate your ensembles and generate submission files using 7-Blend.ipynb.
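The blending step can be sketched as a weighted average of per-candidate scores followed by a top-k cut per session. This is a generic sketch, not the exact 7-Blend.ipynb logic; the frames and weights are hypothetical:

```python
import pandas as pd

# Two hypothetical model outputs: one score per (session, aid) candidate.
preds_a = pd.DataFrame({"session": [1, 1, 1], "aid": [10, 20, 30],
                        "score": [0.9, 0.5, 0.1]})
preds_b = pd.DataFrame({"session": [1, 1, 1], "aid": [10, 20, 30],
                        "score": [0.2, 0.8, 0.3]})

def blend(preds_list, weights, top_k=20):
    """Weighted average of scores, then the top_k aids per session.

    Assumes all frames share the same (session, aid) rows in the same order.
    """
    merged = preds_list[0][["session", "aid"]].copy()
    merged["score"] = 0.0
    for preds, w in zip(preds_list, weights):
        merged["score"] += w * preds["score"].values
    merged = merged.sort_values("score", ascending=False)
    return merged.groupby("session")["aid"].apply(lambda s: list(s[:top_k]))

print(blend([preds_a, preds_b], weights=[0.5, 0.5], top_k=2))
```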

If you run into memory issues:

  • For matrix computation, increase the PIECES values.
  • For candidates, Chris' candidates use a lot of RAM, but you can refactor the code to work in chunks (not implemented).
  • For feature engineering, reduce CHUNK_SIZE.
  • For training, the validation data can be downsampled. I already downsample it for carts and clicks in the utils/load/load_parquets_cudf_folds function, but you can downsample more.
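A minimal sketch of POS_RATIO-style downsampling: keep all positives and sample just enough negatives to hit a target positive ratio. The exact semantics of POS_RATIO in the repo are an assumption, and the toy frame is hypothetical:

```python
import pandas as pd

# Toy candidate frame: 'gt' marks positives (candidate is a ground-truth item).
df = pd.DataFrame({"session": [1] * 6, "aid": range(6),
                   "gt": [1, 0, 0, 0, 0, 0]})

def downsample(df, pos_ratio=0.5, seed=0):
    """Keep all positives; sample negatives so positives make up ~pos_ratio."""
    pos = df[df["gt"] == 1]
    neg = df[df["gt"] == 0]
    n_neg = int(len(pos) * (1 - pos_ratio) / pos_ratio)
    neg = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    return pd.concat([pos, neg]).sort_index()

sampled = downsample(df, pos_ratio=0.5)
print(len(sampled))  # → 2 (1 positive + 1 negative)
```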

Code structure

If you wish to dive into the code, the repository's naming should be straightforward. Each function is documented. The structure is the following:

src
├── data
│   ├── candidates_chris.py         # Chris' candidates utils
│   ├── candidates.py               # Theo's candidates utils
│   ├── covisitation.py             # Theo's covisitation matrices
│   ├── fe.py                       # Feature engineering
│   └── preparation.py              # Data preparation utils
├── inference           
│   ├── boosting.py                 # Main file
│   └── predict.py                  # Predict function
├── model_zoo 
│   ├── __init__.py
│   ├── lgbm.py                     # LGBM Ranker kept for legacy
│   └── xgb.py                      # XGBoost classifier
├── otto_src                        
│   ├── evaluate.py                 # From the competition repo
│   ├── labels.py                   # From the competition repo
│   ├── my_split.py                 # My custom splitting functions
│   └── testset.py                  # From the competition repo
├── training           
│   └── boosting.py                 # Trains a boosting model
├── utils          
│   ├── load.py                     # Data loading utils 
│   ├── logger.py                   # Logging utils
│   ├── metrics.py                  # Metrics for the competition
│   ├── plot.py                     # Plotting utils
│   └── torch.py                    # Torch utils
│
├── fe_main.py                      # Main for feature engineering
└── params.py                       # Main parameters


kaggle_otto_rs's Issues

How is the file `input/folds_4.csv` generated?

Hi, thank you very much for the code. I'm having a bit of a problem running it and would appreciate it if you could help!

In the 6-XGB.ipynb notebook, the fold file input/folds_4.csv is used in the code, but how was this file generated, and what data was used to generate it?

class Config:
    ...
    folds_file = "../input/folds_4.csv"
    ...

click_df reference

Thank you very much for your work!
In 3-Candidates.ipynb, the click_df that appears in suggest_clicks and suggest_orders is not defined. Could you kindly show me how to create it?
