
Kaggle VSB power line fault detection

My solution is very standard and consists of manually extracting features before feeding them to LightGBM. This worked quite well at first and I managed to reach the top 10 of the competition. However, RNNs seem to be the right way to go, and I'm not very interested in deep learning. I usually don't upload Kaggle solutions that didn't do well, but I'm making an exception for this one because I'm quite satisfied with the feature extraction pipeline I put in place. If you want to run the code, make sure you are using Python 3 and have installed the dependencies listed in the requirements.txt file.

Splitting the signals

>>> python scripts/split_signals.py

The data provided by the competition is stored in an HDF5 file. Reading from the HDF5 file is anything but fast. My idea was to first split the signals into separate numpy files using the numpy.save method. Loading an individual signal with numpy.load then takes something in the range of microseconds. This is extremely important because throughout the competition the data will be loaded into memory many times.
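For illustration, here is a minimal sketch of that splitting step. It is not the actual split_signals.py script: the data/train.h5 path, the data/signals output directory and the one-signal-per-column layout are assumptions.

```python
import os

import numpy as np
import pandas as pd

# Reading the HDF5 file is slow, but it only has to be done once.
# Assumes the signals are stored with one signal per column.
signals = pd.read_hdf('data/train.h5')

os.makedirs('data/signals', exist_ok=True)

# Save each signal to its own .npy file.
for signal_id, signal in signals.items():
    np.save(f'data/signals/{signal_id}.npy', signal.to_numpy())

# Reloading a single signal afterwards only takes microseconds.
signal = np.load('data/signals/0.npy')
```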

Aligning the signals

>>> python scripts/find_signal_origins.py

Although each signal represents one period of an electrical sine wave, they don't all start at the same time. I decided to align them so that they all start at 0 and begin by going upwards. This could be useful as some features could be based on a particular region of the signal. I didn't really exploit this as I gave up on the competition when RNNs arrived. To align the signals I used a simple method which starts by searching for the two points where the signal crosses 0. Because there is a lot of noise, I used a k-means clustering scheme with k = 2 to approximate the two positions. I then decided which of the two crossings was the one I wanted by looking left and right from each crossing.
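The following is a rough sketch of that alignment idea, not the actual find_signal_origins.py script; the window size used to look left and right of each crossing is a made-up parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def find_origin(signal, window=1000):
    """Return the index of the upward zero crossing of a (noisy) sine period."""
    # Indices where the signal changes sign are candidate zero crossings; the
    # noise produces many of them.
    crossings = np.where(np.diff(np.sign(signal)) != 0)[0]
    # k-means with k = 2 approximates the positions of the two true crossings.
    kmeans = KMeans(n_clusters=2, n_init=10).fit(crossings.reshape(-1, 1))
    candidates = sorted(int(c) for c in kmeans.cluster_centers_.ravel())
    # Keep the crossing where the signal goes upwards: negative just before it,
    # positive just after it.
    for i in candidates:
        left = signal[max(i - window, 0):i]
        right = signal[i:i + window]
        if len(left) and left.mean() < 0 < right.mean():
            return i
    return candidates[0]

def align(signal):
    """Shift the signal so that it starts at its upward zero crossing."""
    return np.roll(signal, -find_origin(signal))
```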

Extracting features

>>> python scripts/extract_solo_features.py

I won't go into detail about which features I extracted as I'm sure some people did better and will talk about it when the competition is over. The only thing I want to mention is how I extracted the features. As mentioned above, the signals were split into separate .npy files using numpy and stored in the data directory. I then simply looped over the files and extracted the features in parallel using a ThreadPoolExecutor from the concurrent.futures module in Python's standard library. The trick is that before computing the features I first loaded the ones that had already been computed, so that I didn't recompute them unnecessarily. This is definitely not rocket science, but I found the code quite concise and rather readable, so I deemed it worthy of being shared online.
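As a sketch of what that loop might look like: the feature functions, file paths and cache format below are placeholders, not the ones used in extract_solo_features.py.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

FEATURES_PATH = 'data/solo_features.csv'  # hypothetical cache location

def extract_features(path):
    """Compute a few features for a single signal stored as a .npy file."""
    signal = np.load(path)
    return {
        'signal_id': os.path.splitext(os.path.basename(path))[0],
        'std': float(signal.std()),            # placeholder features
        'max_abs': float(np.abs(signal).max()),
    }

# Load the features that have already been computed so they aren't recomputed.
if os.path.exists(FEATURES_PATH):
    done = pd.read_csv(FEATURES_PATH, dtype={'signal_id': str})
    done_ids = set(done['signal_id'])
else:
    done = pd.DataFrame()
    done_ids = set()

# Only process the signals whose features haven't been computed yet.
todo = [
    path for path in glob.glob('data/signals/*.npy')
    if os.path.splitext(os.path.basename(path))[0] not in done_ids
]

with ThreadPoolExecutor() as executor:
    new = pd.DataFrame(list(executor.map(extract_features, todo)))

pd.concat([done, new], ignore_index=True).to_csv(FEATURES_PATH, index=False)
```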

Cross-validation folds

>>> python scripts/make_folds.py

I like generating CV folds before doing the machine learning. I save them as a JSON file called folds.json in the oof directory and load them during the machine learning phase. This is practical because you can share the folds with others and use them with multiple models. I made sure that the folds didn't "leak" by ensuring that signals from the same measurement never ended up in both the training fold and the validation fold.
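A minimal sketch of how such group-aware folds could be generated and saved, assuming the competition metadata with its id_measurement column is available as a CSV; the number of splits and the JSON layout are assumptions, not necessarily what make_folds.py does.

```python
import json

import pandas as pd
from sklearn.model_selection import GroupKFold

meta = pd.read_csv('data/metadata_train.csv')

folds = []
# Grouping by measurement guarantees that the three signals of a measurement
# never end up in both the training fold and the validation fold.
for fit_idx, val_idx in GroupKFold(n_splits=5).split(meta, groups=meta['id_measurement']):
    folds.append({'fit': fit_idx.tolist(), 'val': val_idx.tolist()})

with open('oof/folds.json', 'w') as f:
    json.dump(folds, f)
```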

Machine learning

I started by trying to learn the labels of each signal individually. I then converted the problem to a multi-class classification problem by joining the signals of each measurement together. Because each signal has a binary label and there are 3 signals per measurement, this resulted in a 2^3 = 8 class problem. What's more, by permuting the 3 signals I was able to augment the multi-class dataset by a factor of 3! = 6. The code is available in the Solution.ipynb notebook. I didn't comment it but it should be readable. In the end this will produce a submission in the submissions directory and out-of-fold predictions in the oof directory.
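Here is a small sketch of the class encoding and the permutation trick; the variable names and array shapes are made up for illustration and differ from the notebook.

```python
import itertools

import numpy as np

def to_class(labels):
    """Encode the three binary phase labels as a single class in [0, 7]."""
    return int(labels[0]) * 4 + int(labels[1]) * 2 + int(labels[2])

def augment(X, y):
    """Permute the 3 phases of each measurement to multiply the data by 3! = 6.

    Assumes X has shape (n_measurements, 3, n_features) and y has shape
    (n_measurements, 3) with one binary label per phase.
    """
    Xs, ys = [], []
    for perm in itertools.permutations(range(3)):
        p = list(perm)
        Xs.append(X[:, p])  # reorder the per-phase feature blocks
        ys.append(np.array([to_class(row[p]) for row in y]))
    return np.concatenate(Xs), np.concatenate(ys)
```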
