nick917 / numerai

This project forked from pangyuteng/numerai

A baseline project for the numer.ai ML competition

License: GNU General Public License v3.0

Numerai Baseline Project

About

This project is what has evolved from my participation in the numer.ai machine learning competitions. It's nothing more than experimentation and thus it only contains some scripts and not a full-fledged project.

Feel free to use this project as a baseline to continue with! I only ask that, if you do, you release the source under GPLv3 as I have done, so improvements to it can help others learn more. If you only wish to copy snippets, feel free to do so without publishing your work, but please refer to this repository if you do.

Pull-requests are more than welcome, since I personally am not particularly skilled in the ML field, but wish to learn more.

Installation

Miniconda3 has been my go-to for a while, so I'll include a setup example for that:

sovaa@stink ~ $ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sovaa@stink ~ $ bash Miniconda3-latest-Linux-x86_64.sh # assuming defaults accepted and path added to .bashrc
sovaa@stink ~ $ source ~/.bashrc

Clone the project:

sovaa@stink ~ $ git clone https://github.com/sovaa/numerai.git
sovaa@stink ~ $ cd numerai/

Create your environment and install the requirements:

sovaa@stink ~/numerai $ conda create -n numerai python=3.5
sovaa@stink ~/numerai $ source activate numerai
(numerai) sovaa@stink ~/numerai $ pip install -r requirements.txt

The script assumes the data files are named data_train.csv and data_predict.csv:

(numerai) sovaa@stink ~/numerai $ mv numerai_training_data.csv data_train.csv
(numerai) sovaa@stink ~/numerai $ mv numerai_tournament_data.csv data_predict.csv

Model Description

The model is a 2-level stack. The first level uses several classifiers, which (as of writing) are:

  • XGBoost 1
  • XGBoost 2
  • Random Forest
  • AdaBoost (with Extra Randomized Trees as base classifier)

The level 2 blender is:

  • XGBoost

The first level is trained by splitting the training data by era: each of the above L1 classifiers is trained on one era and then predicts all the validation examples. Because every era contributes its own set of predictions, the resulting validation prediction matrix is much larger than the original training data.
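The per-era loop can be sketched roughly as follows. This is a minimal illustration, not the code in stacking.py: it assumes the "validation examples" for an era are the rows belonging to all other eras, uses synthetic data, and uses only two scikit-learn classifiers as stand-ins for the four L1 models. The function name level1_predictions and the column names id, era, target and feature1..feature3 are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier


def level1_predictions(train, feature_cols, classifiers):
    """Fit each classifier on one era at a time and predict the rows of all other eras."""
    parts = []
    for era, era_rows in train.groupby("era"):
        # Assumption: the validation set for this era is every row outside it.
        held_out = train[train["era"] != era]
        part = pd.DataFrame({"id": held_out["id"].to_numpy()})
        for name, clf in classifiers.items():
            clf.fit(era_rows[feature_cols], era_rows["target"])
            # Keep the predicted probability of the positive class.
            part[name] = clf.predict_proba(held_out[feature_cols])[:, 1]
        parts.append(part)
    # One block of predictions per era, stacked vertically.
    return pd.concat(parts, ignore_index=True)


# Synthetic stand-in for data_train.csv (hypothetical column names).
rng = np.random.default_rng(0)
n_rows, n_eras = 300, 3
feature_cols = ["feature1", "feature2", "feature3"]
train = pd.DataFrame(rng.random((n_rows, 3)), columns=feature_cols)
train["id"] = np.arange(n_rows)
train["era"] = np.repeat(np.arange(n_eras), n_rows // n_eras)
train["target"] = (train["feature1"] + rng.normal(0, 0.3, n_rows) > 0.5).astype(int)

classifiers = {
    "rf": RandomForestClassifier(n_estimators=20, random_state=0),
    "et": ExtraTreesClassifier(n_estimators=20, random_state=0),
}
l1 = level1_predictions(train, feature_cols, classifiers)
print(l1.shape)  # more rows than the original training frame
```

Under this assumption every training ID receives one prediction per era it does not belong to, which is why the stacked matrix grows with the number of eras.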

This larger matrix is preprocessed by averaging the predictions per training ID, which restores the original number of training examples. The averaged predictions are then expanded into polynomial features and reduced with a 5-component PCA transformation.
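A sketch of this preprocessing step with pandas and scikit-learn might look like the following. The polynomial degree of 2, the choice to fit the transformers on the training predictions only, and all names (blend_features, fake_predictions, the column names) are assumptions made for the example; the actual script may differ.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures


def blend_features(l1_train, l1_test, pred_cols, n_components=5, degree=2):
    """Average L1 predictions per id, then apply polynomial features and PCA."""
    # Collapse the repeated predictions back to one row per id.
    train_avg = l1_train.groupby("id")[pred_cols].mean()
    test_avg = l1_test.groupby("id")[pred_cols].mean()
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    pca = PCA(n_components=n_components)
    # Assumption: fit on the training side only, then reuse for the test side.
    x_train = pca.fit_transform(poly.fit_transform(train_avg.to_numpy()))
    x_test = pca.transform(poly.transform(test_avg.to_numpy()))
    return x_train, x_test


rng = np.random.default_rng(0)
pred_cols = ["xgb1", "xgb2", "rf", "ada"]  # one column per L1 classifier


def fake_predictions(n_ids, repeats):
    """Synthetic L1 output: each id appears `repeats` times."""
    frame = pd.DataFrame(rng.random((n_ids * repeats, len(pred_cols))), columns=pred_cols)
    frame["id"] = np.tile(np.arange(n_ids), repeats)
    return frame


l1_train = fake_predictions(n_ids=50, repeats=3)
l1_test = fake_predictions(n_ids=20, repeats=3)
blend_x_train, blend_x_test = blend_features(l1_train, l1_test, pred_cols)
print(blend_x_train.shape, blend_x_test.shape)
```

With four prediction columns, degree-2 polynomial features without the bias term give 14 columns, which PCA then reduces to 5.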

Lastly, the L2 blender trains on the training PCA data and predicts on the prediction PCA data.
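The final step could look something like this sketch. To keep it dependency-light, it uses scikit-learn's LogisticRegression as a stand-in for the XGBoost blender, and random matrices in place of the real PCA features; the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the 5-component PCA features and the training targets.
rng = np.random.default_rng(0)
blend_x_train = rng.normal(size=(200, 5))
y_train = (blend_x_train[:, 0] + rng.normal(0, 0.5, 200) > 0).astype(int)
blend_x_test = rng.normal(size=(50, 5))

# Stand-in blender: train on the training PCA data ...
blender = LogisticRegression()
blender.fit(blend_x_train, y_train)

# ... and predict probabilities for the prediction PCA data.
probabilities = blender.predict_proba(blend_x_test)[:, 1]
print(probabilities[:5])
```

Any classifier exposing fit and predict_proba could be swapped in for the stand-in here.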

Example Run

Train and output the tournament_results.csv file:

(numerai) sovaa@stink ~/numerai $ time python stacking.py

After the stacking.py script has run, you'll have the output from level 1 of the stacker in these files:

l2_x_train.csv
l2_y_train.csv
l2_x_test.csv
blend_x_train.csv
blend_x_test.csv

The level 2 blender is, for example, LR or XGBoost; its X inputs are blend_x_train.csv and blend_x_test.csv. These two matrices are transformations of the l2_x_* files, as this pseudocode describes:

blend_x_train = PCA(polynomial_features(l2_x_train))

The l2_x_* files are saved because the second script, keras_for_l2.py, uses them as input for a neural network:

(numerai) sovaa@stink ~/numerai $ time python keras_for_l2.py

Resource

Some resources to learn more:

Contributors

sovaa
