GithubHelp home page GithubHelp logo

gabeochieng / first_place_tic-heap-cirta-particle_solution Goto Github PK

View Code? Open in Web Editor NEW

This project forked from jedidimohamed/first_place_tic-heap-cirta-particle_solution

0.0 1.0 0.0 9 KB

Python 2.99% Jupyter Notebook 97.01%

first_place_tic-heap-cirta-particle_solution's Introduction

TIC-HEAP Cirta Particle Classification Challenge

This challenge aimed to predict which type of particle is present in 10x10 images. The images are collected using detectors in the collider following many events ( collisions ). the metric for this challenge is : Log Loss And this is a description of our classification solution that achieved first place with a score of 1.521101146 on private leaderboard.

Final solution :

The final solution which we used to reach such a score was a fine-tuned Catboost Classifier. We used an ensemble of catboosts each train on a different portion of the data for some classes ( the majority ones ).

Approach :

The main problem in this challenge was the class imbalance. You can quickly notice that there are 3 dominant classes with tens of thousands of samples, while 2 classes were having over 1300 samples and the other one around 3k samples. It is also mentioned that the test set was made to be balanced. If you train a model with the data as it is, your model will be predicting the majority classes only due to the severe class imbalance.

Data Preprocessing : the images' pixels were flattened and stacked in a dataframe, and each column represented a single pixel, and each row represented an image. We then were able to use sklearn and boosting algorithms on the newly created tabular data to perform classification.

1st approach : Oversampling

The oversampling approach didn't yield good results since there were not too many samples in the minority classes to be able to match the number of samples in the majority classes.

2nd approach : Undersampling

By undersampling all the classes to 1300 samples ( equal to least populated class ) , you get a score of 1.56 with a default catboost classifier. It is a good score compared to what people achieved back then, but there was another way to improve that score a lot more. Undersampling to the 2nd least populated class. Setting the data to 3k samples per class except for the 1k3 one yielded much better results. A default classifier with this approach yielded a 1.535 score.

Conclusion : 1300 samples weren't enough for the other 4 classes to be well classified by the model. Setting the undersampling threshold to 3k helped the model classify these 4 classes better than the first approach, but you definitely lose some power in predicting the least populated class.

first_place_tic-heap-cirta-particle_solution's People

Contributors

jedidimohamed avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.