kweonwooj / kaggle_santander_value_prediction_challenge


28th place in the Kaggle Santander Value Prediction Challenge

Home Page: https://www.kaggle.com/c/santander-value-prediction-challenge



Kaggle/Santander Value Prediction



Abstract

Kaggle Santander Value Prediction Competition

  • Host : Santander, a British bank wholly owned by the Spanish Santander Group
  • Prize : $60,000
  • Problem : Regression
  • Evaluation : Root Mean Squared Logarithmic Error (RMSLE)
  • Period : June 19 2018 ~ Aug 21 2018 (63 days)
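
The evaluation metric listed above, Root Mean Squared Log Error (commonly abbreviated RMSLE), can be written as sqrt(mean((log(1 + prediction) - log(1 + target))^2)). Below is a minimal sketch of that standard definition in plain Python. The function name `rmsle` and the input checks are illustrative choices, not code taken from this repository.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values.

    Illustrative helper (not from the repository): it compares
    log(1 + prediction) with log(1 + target), so errors are measured
    on a relative scale rather than an absolute one.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on the log scale, a common approach with this metric is to train a regressor on log1p(target) with an ordinary squared-error loss and convert predictions back with expm1.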

Santander Bank aims to predict the value of transactions for each potential customer.

The competition data is completely anonymized, and the train set is quite small (~4k rows). Given the task, the anonymized data is evidently time-series data that has been obfuscated in a specific way. Kagglers identified a data leak (more precisely, how the data had been obfuscated) and used the recovered lag values, which are often strong predictors in time series. Top-scoring methods must use the leak; without it, the score is too low to compete.
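
To make the idea of "lag data" concrete, here is a toy sketch in plain Python. It assumes, purely for illustration, that each row is a window over one underlying series (most recent value first), so that a row taken one step later contains the earlier row's values shifted by one position. The function name `find_lagged_values`, the row layout, and the sample numbers are all hypothetical; the actual column ordering and matching logic used in the competition are not described here.

```python
def find_lagged_values(rows, lag=1):
    """Recover the "next" value of each row from a shifted copy of it.

    rows: dict mapping a row id to a tuple of values, most recent first
          (an assumed layout for this toy example).
    Returns a dict mapping row id -> value found in the matching later row.
    """
    # Index every row by its tail (everything after the first `lag` values).
    tail_index = {}
    for row_id, values in rows.items():
        tail_index.setdefault(values[lag:], []).append(row_id)

    recovered = {}
    for row_id, values in rows.items():
        head = values[:-lag]  # should reappear as the tail of a later row
        matches = [m for m in tail_index.get(head, []) if m != row_id]
        if len(matches) == 1:  # ignore ambiguous matches (e.g. all-zero rows)
            later_row = rows[matches[0]]
            recovered[row_id] = later_row[lag - 1]
    return recovered


# Toy windows over the series 10, 20, 30, 40, 50, 60, 70 plus one unrelated row.
toy_rows = {
    "a": (50, 40, 30, 20, 10),
    "b": (60, 50, 40, 30, 20),
    "c": (70, 60, 50, 40, 30),
    "d": (5, 6, 7, 8, 9),
}
lagged = find_lagged_values(toy_rows)
```

In this toy data, row "a" matches row "b" and row "b" matches row "c", so the value that follows each window can be read off the matching row. Rows without a unique match get no recovered value and would have to be handled by a regular model.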

I share a baseline method with no feature engineering and a simple RandomForest regressor. The next version adds simple feature engineering ideas and a LightGBM model. Additional feature engineering ideas, together with XGBoost and CatBoost, push the score further to around 1.37 on the Private LB. Finally, I use the leakage data to obtain better Private LB scores.

I have decided to accept data leakage as part of Kaggle competitions. Instead of avoiding competitions that include leakage, I would like to learn how Kagglers found and exploited the leak, since such discoveries are the product of extensive data exploration, a skill I admire.

Result

| Submission | CV LogLoss | Public LB | Public Rank | Private LB | Private Rank |
|---|---|---|---|---|---|
| baseline | 1.99915 | 1.93257 | 4,125 | 1.87086 | 4,106 |
| [Exp 01] Feature Selection & Feature Interaction + LightGBM | 1.54058 | 1.57676 | 3,377 | 1.53769 | 3,379 |
| [Exp 02] Feature Selection & PCA & Statistical features + CatBoost/XGBoost/LightGBM | 1.33945 | 1.41484 | 2,211 | 1.37273 | 2,206 |
| leakage model (Gold medal) | - | 0.48785 | 58 | 0.53032 | 28 |

How to Run

  • Python 2.7
# install pre-requisites
pip install -r requirements.txt

# for baseline,
python code/baseline.py

# for [Exp 01], follow
code/[LB 1.53769] [FE] feature selection, feature interaction [Model] LightGBM.ipynb

# for [Exp 02], follow
code/[LB 1.37246] [FE] feature selection, pca, statistical features [Model] Catboost, XGBoost, LightGBM.ipynb

# for leakage model,
# (the Python version of the leakage model is very slow (about 8+ hours) due to the pandas merge op over [40k, 4k] x [40k, 4k] at line 68)
python code/leakage_model.py

What I've learnt

  • Gained experience porting R code to Python. The R code is faster at merge operations. I profiled the Python code to optimize it, but could not resolve the bottleneck of a pandas merge on multiple columns. Kudos to Jack for his elegant and efficient code!
  • Improved my pandas skills while porting to Python, especially with pd.merge().
  • Lesson learnt: anonymized data with leakage leaves little room for feature engineering. I am still amazed at how Giba and other Kagglers found the initial clue to the leak.
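
For readers unfamiliar with merging on multiple columns, the sketch below shows the general shape of such a pd.merge() call on a tiny frame. It is not the code at line 68 of code/leakage_model.py: the helper name `merge_on_shifted_columns`, the column names `f0`..`f2`, and the `_r` suffix are assumptions made for this example. It joins each row's leading columns against every row's trailing columns, which is one way a shifted-row match could be expressed in pandas.

```python
import pandas as pd


def merge_on_shifted_columns(df, cols, lag=1):
    """Pair rows whose first columns equal another row's last columns.

    df:   DataFrame with an "ID" column plus the value columns in `cols`.
    cols: ordered list of value column names (hypothetical layout).
    Returns a DataFrame with columns "ID" and "matched_ID".
    """
    left_keys = cols[:-lag]   # leading columns of the row being matched
    right_keys = cols[lag:]   # trailing columns of the candidate rows

    left = df[["ID"] + left_keys]
    # Rename the right-hand columns so they cannot clash with the left side.
    renames = {c: c + "_r" for c in right_keys}
    renames["ID"] = "matched_ID"
    right = df[["ID"] + right_keys].rename(columns=renames)

    merged = pd.merge(
        left,
        right,
        left_on=left_keys,
        right_on=[c + "_r" for c in right_keys],
        how="inner",
    )
    # Drop rows that only matched themselves (e.g. constant rows).
    merged = merged[merged["ID"] != merged["matched_ID"]]
    return merged[["ID", "matched_ID"]].reset_index(drop=True)


toy = pd.DataFrame(
    {
        "ID": ["a", "b", "c"],
        "f0": [50, 60, 5],
        "f1": [40, 50, 6],
        "f2": [30, 40, 7],
    }
)
pairs = merge_on_shifted_columns(toy, ["f0", "f1", "f2"])
```

With thousands of key columns and tens of thousands of rows, a join like this has to hash or compare very wide keys, which is consistent with the slow runtime reported above. Hashing each row's key columns into a single value before merging is one commonly suggested workaround, though it was not evaluated here.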
