
combineml.jl's Introduction

Copyright for portions of project CombineML.jl is held by Samuel Jenkins, 2014, as part of project Orchestra.jl. All other copyright for project CombineML.jl is held by Paulito Palmes, 2016.

The CombineML.jl package is licensed under the MIT "Expat" License.

You may also be interested in the TSML (Time Series Machine Learning) package.

CombineML

Join the chat at https://gitter.im/CombineML-jl/Lobby

CombineML is a heterogeneous ensemble learning package for the Julia programming language. It is driven by a uniform machine learner API designed for learner composition.

Getting Started

See the notebook for a demo: MetaModeling.ipynb

We will cover how to predict on a dataset using CombineML.

Obtain Data

A tabular dataset will be used to obtain our features and labels.

It will be split into a training set and a test set using the holdout method.

import CombineML
using CombineML.Util
using CombineML.Transformers

try
  import RDatasets
catch
  using Pkg
  Pkg.add("RDatasets")
  import RDatasets
end

# use shorter module names
CU=CombineML.Util
CT=CombineML.Transformers
RD=RDatasets

# Obtain features and labels
dataset = RD.dataset("datasets", "iris")
features = Matrix(dataset[:, 1:(end-1)])
labels = Vector(dataset[:, end])

# Split into training and test sets
(train_ind, test_ind) = CU.holdout(size(features, 1), 0.3)
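
As an alternative to a single holdout split, the same utility module exposes kfold (used later by BestLearner's partition generator). A minimal sketch, under the assumption that kfold(n, k) returns k partitions of training indices:

# Hedged sketch: derive one train/test split from k-fold partitions,
# assuming each partition holds the training indices for that fold.
folds = CU.kfold(size(features, 1), 5)
fold_train = folds[1]
fold_test = setdiff(1:size(features, 1), fold_train)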

Create a Learner

A transformer processes features in some form. In CombineML, a learner is itself a subtype of transformer.

A transformer is created by instantiating it, optionally passing an options dictionary.

All transformers, including learners, are called in the same way.

# Learner with default settings
learner = CT.PrunedTree()

# Learner with some of the default settings overridden
learner = CT.PrunedTree(Dict(
  :impl_options => Dict(
    :purity_threshold => 1.0
  )
))

# Ensemble learners are instantiated the same way.
learner = CT.StackEnsemble(Dict(
  :learners => [
    CT.PrunedTree(), 
    CT.RandomForest(),
    CT.DecisionStumpAdaboost()
  ], 
  :stacker => CT.RandomForest()
))

Create a Pipeline

Typically, the features require pre-processing before they are passed to the learner.

We use a pipeline transformer to chain many transformers in sequence.

In this case we one-hot encode categorical features, impute NA values, and numerically standardize them before calling the learner.

# Create pipeline
pipeline = CT.Pipeline(Dict(
  :transformers => [
    CT.OneHotEncoder(), # Encodes nominal features into numeric
    CT.Imputer(), # Imputes NA values
    CT.StandardScaler(), # Standardizes features 
    learner # Predicts labels on features
  ]
))

Train and Predict

Training is done via the fit! function, prediction via transform!.

All transformers provide these two functions, and they are always called the same way.

# Train
CT.fit!(pipeline, features[train_ind, :], labels[train_ind])

# Predict
predictions = CT.transform!(pipeline, features[test_ind, :])

Assess

Finally we assess how well our learner performed.

# Assess predictions
result = CU.score(:accuracy, labels[test_ind], predictions)
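
As a sanity check, :accuracy should agree with how often predictions match the test labels exactly. A hedged sketch (whether score reports a fraction or a percentage is an assumption here; rescale accordingly):

import Statistics
# Fraction of test labels predicted exactly; multiply by 100 if
# score(:accuracy, ...) reports a percentage instead.
manual_accuracy = Statistics.mean(labels[test_ind] .== predictions)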

Available Transformers

Outlined below are all the transformers currently available in CombineML.

CombineML

Baseline (CombineML.jl Learner)

Baseline learner that by default assigns the most frequent label.

try
  import StatsBase
catch
  using Pkg
  Pkg.add("StatsBase")
  import StatsBase
end

learner = CT.Baseline(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  # Label assignment strategy.
  # Function that takes a label vector and returns the required output.
  :strategy => StatsBase.mode
))
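
Since :strategy is just a function of the label vector, any summary function can stand in for StatsBase.mode. A hedged illustration (the constant-label strategy below is hypothetical, for demonstration only):

# Always predict the first label seen during training (illustrative only).
learner = CT.Baseline(Dict(
  :output => :class,
  :strategy => labels -> first(labels)
))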

Identity (CombineML.jl Transformer)

Identity transformer passes the features as is.

transformer = CT.Identity()
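
Because Identity follows the same fit!/transform! contract as every other transformer, it is a convenient way to exercise the uniform API in isolation. A minimal sketch reusing the features from Getting Started:

CT.fit!(transformer, features[train_ind, :], labels[train_ind])
passed = CT.transform!(transformer, features[test_ind, :])
# passed should equal features[test_ind, :] (possibly as a copy).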

VoteEnsemble (CombineML.jl Learner)

Set of machine learners that majority vote to decide prediction.

learner = CT.VoteEnsemble(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  # Learners in voting committee.
  :learners => [CT.PrunedTree(), CT.DecisionStumpAdaboost(), CT.RandomForest()]
))

StackEnsemble (CombineML.jl Learner)

Ensemble where a 'stack' learner learns on a set of learners' predictions.

learner = CT.StackEnsemble(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  # Set of learners that produce feature space for stacker.
  :learners => [CT.PrunedTree(), CT.DecisionStumpAdaboost(), CT.RandomForest()],
  # Machine learner that trains on set of learners' outputs.
  :stacker => CT.RandomForest(),
  # Proportion of training set left to train stacker itself.
  :stacker_training_proportion => 0.3,
  # Provide original features on top of learner outputs to stacker.
  :keep_original_features => false
))

BestLearner (CombineML.jl Learner)

Selects the best learner out of a set. Performs a grid search on the learners if an options grid is provided.

learner = CT.BestLearner(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  # Function to return partitions of instance indices.
  :partition_generator => (features, labels) -> CU.kfold(size(features, 1), 5),
  # Function that selects the best learner by index.
  # Arg learner_partition_scores is a (learner, partition) score matrix.
  # (`import Statistics` is required for `mean` on Julia >= 0.7.)
  :selection_function => (learner_partition_scores) -> findmax(vec(Statistics.mean(learner_partition_scores, dims=2)))[2],
  # Score type returned by score() using respective output.
  :score_type => Real,
  # Candidate learners.
  :learners => [CT.PrunedTree(), CT.DecisionStumpAdaboost(), CT.RandomForest()],
  # Options grid for learners, to search through by BestLearner.
  # Format is [learner_1_options, learner_2_options, ...]
  # where learner_options is same as a learner's options but
  # with a list of values instead of scalar.
  :learner_options_grid => nothing
))
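
For illustration, a hedged options grid in the format described above: one options dictionary per candidate learner, with lists of values in place of scalars (the specific values are arbitrary):

# Hypothetical grid matching the three candidate learners above, in order.
grid = [
  Dict(:impl_options => Dict(:purity_threshold => [0.5, 0.75, 1.0])), # PrunedTree
  Dict(:impl_options => Dict(:num_iterations => [5, 7, 10])),         # DecisionStumpAdaboost
  Dict(:impl_options => Dict(:num_trees => [10, 20, 50]))             # RandomForest
]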

OneHotEncoder (CombineML.jl Transformer)

Transforms nominal features into one-hot form and coerces the instance matrix to be of element type Float64.

transformer = CT.OneHotEncoder(Dict(
  # Nominal columns
  :nominal_columns => nothing,
  # Nominal column values map. Key is column index, value is list of
  # possible values for that column.
  :nominal_column_values_map => nothing
))
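
For illustration, a hedged configuration that declares the nominal columns explicitly instead of leaving them to be inferred (the column index and values here are hypothetical):

transformer = CT.OneHotEncoder(Dict(
  # Treat column 3 as nominal (hypothetical index).
  :nominal_columns => [3],
  # Enumerate the possible values of column 3 up front.
  :nominal_column_values_map => Dict(3 => ["low", "medium", "high"])
))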

Imputer (CombineML.jl Transformer)

Imputes NaN values from Float64 features.

transformer = CT.Imputer(Dict(
  # Imputation strategy.
  # Statistic that takes a vector, such as mean or median.
  # (`import Statistics` is required for `mean` on Julia >= 0.7.)
  :strategy => Statistics.mean
))

Pipeline (CombineML.jl Transformer)

Chains multiple transformers in sequence.

transformer = CT.Pipeline(Dict(
  # Transformers as list to chain in sequence.
  :transformers => [CT.OneHotEncoder(), CT.Imputer()],
  # Transformer options as list applied to same index transformer.
  :transformer_options => nothing
))

Wrapper (CombineML.jl Transformer)

Wraps around a CombineML transformer.

transformer = CT.Wrapper(Dict(
  # Transformer to call.
  :transformer => CT.OneHotEncoder(),
  # Transformer options.
  :transformer_options => nothing
))

Julia

PrunedTree (DecisionTree.jl Learner)

Pruned CART decision tree.

learner = CT.PrunedTree(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  # Options specific to this implementation.
  :impl_options => Dict(
    # Merge leaves having >= purity_threshold combined purity.
    :purity_threshold => 1.0,
    # Maximum depth of the decision tree (default: no maximum).
    :max_depth => -1,
    # Minimum number of samples each leaf needs to have.
    :min_samples_leaf => 1,
    # Minimum number of samples in needed for a split.
    :min_samples_split => 2,
    # Minimum purity increase needed for a split.
    :min_purity_increase => 0.0
  ) 
))

RandomForest (DecisionTree.jl Learner)

Random forest (CART).

learner = CT.RandomForest(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  # Options specific to this implementation.
  :impl_options => Dict(
    # Number of features to train on per tree (default: 0, keep all).
    # Good values are the square root or log2 of the total number of features, rounded.
    :num_subfeatures => 0,
    # Number of trees in the forest.
    :num_trees => 10,
    # Proportion of the training set to be used per tree.
    :partial_sampling => 0.7,
    # Maximum depth of each decision tree (default: no maximum).
    :max_depth => -1
  )
))

DecisionStumpAdaboost (DecisionTree.jl Learner)

Adaboosted decision stumps.

learner = CT.DecisionStumpAdaboost(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  # Options specific to this implementation.
  :impl_options => Dict(
    # Number of boosting iterations.
    :num_iterations => 7
  )
))

PCA (DimensionalityReduction.jl Transformer)

Principal Component Analysis rotation on features. Features are ordered by descending variance.

Fails if a zero-variance feature exists. Based on the MultivariateStats PCA implementation.

transformer = CT.PCA(Dict(
  :pratio => 1.0,
  :maxoutdim => 5
))

StandardScaler (MLBase.jl Transformer)

Standardizes each feature using (X - mean) / stddev. Produces NaN if the standard deviation is zero.

transformer = CT.StandardScaler(Dict(
  # Center features
  :center => true,
  # Scale features
  :scale => true
))
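
The transformation corresponds to column-wise (x .- mean) ./ std. A hedged sketch of the arithmetic, not of the package's internals:

import Statistics
X = randn(10, 3)
# Column-wise standardization; yields NaN for any zero-variance column.
Xs = (X .- Statistics.mean(X, dims=1)) ./ Statistics.std(X, dims=1)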

Python

See the scikit-learn API for what options are available per learner.

SKLLearner (scikit-learn 0.15 Learner)

Wrapper for scikit-learn that provides access to most learners.

Options for the specific scikit-learn learner are to be passed in the options[:impl_options] dictionary.

Available Classifiers:

  • AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier, LDA, LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier, RidgeClassifierCV, SGDClassifier, KNeighborsClassifier, RadiusNeighborsClassifier, NearestCentroid, QDA, SVC, LinearSVC, NuSVC, DecisionTreeClassifier, GaussianNB, MultinomialNB, ComplementNB, BernoulliNB

Available Regressors:

  • SVR, Ridge, RidgeCV, Lasso, ElasticNet, Lars, LassoLars, OrthogonalMatchingPursuit, BayesianRidge, ARDRegression, SGDRegressor, PassiveAggressiveRegressor, KernelRidge, KNeighborsRegressor, RadiusNeighborsRegressor, GaussianProcessRegressor, DecisionTreeRegressor, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor, IsotonicRegression, MLPRegressor

# classifier example
learner = CT.SKLLearner(Dict(
  # Output to train against
  # (classification).
  :output => :class,
  :learner => "LinearSVC",
  # Options specific to this implementation.
  :impl_options => Dict()
))

# regression example
learner = CT.SKLLearner(Dict(
  # Output to train against
  # (regression).
  :output => :reg,
  :learner => "GradientBoostingRegressor",
  # Options specific to this implementation.
  :impl_options => Dict()
))
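
Keys in :impl_options are forwarded to the underlying scikit-learn learner, so learner-specific keywords can be set there. A hedged example using LinearSVC's C regularization parameter (a real scikit-learn keyword; the value is arbitrary):

# Forward scikit-learn's C parameter to LinearSVC via :impl_options.
learner = CT.SKLLearner(Dict(
  :output => :class,
  :learner => "LinearSVC",
  :impl_options => Dict(:C => 0.1)
))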

R

RCall is used to interface with caret learners.

The R 'caret' library offers more than 100 learners. See here for more details.

CRTLearner (caret 6.0 Learner)

CARET wrapper that provides access to all learners.

Options for the specific caret learner are to be passed in the options[:impl_options] dictionary.

learner = CT.CRTLearner(Dict(
  # Output to train against
  # (:class).
  :output => :class,
  :learner => "svmLinear",
  :impl_options => Dict()
))

Known Limitations

Learners have only been tested on numeric features.

Inconsistencies may result from using nominal features directly without a numeric transformation (e.g. OneHotEncoder).

Misc

The links provided below will only work if you are viewing this in the GitHub repository.

Changes

See CHANGELOG.yml.

License

MIT "Expat" License. See LICENSE.md.

combineml.jl's People

Contributors

bensadeghi, femtocleaner[bot], gitter-badger, ppalmes, tlienart


combineml.jl's Issues

IPYNB has issues

When going through the notebook, there were several issues, some I could fix, some I could not.

  1. use of pipeline is now prohibited (clashes with Base)
  2. the RandomForest should have :num_subfeatures => 0 and not => nothing afaiu
  3. I could not run the scikit-learn cell; note that both *_AVAILABLE bools are false (maybe I missed something, but I couldn't find a different way):
julia> CombineML.System.LIB_SKL_AVAILABLE
false
julia> CombineML.System.LIB_CRT_AVAILABLE
false

and so SKLLearner was not available.

  4. use of mean must now be accompanied by using Statistics
  5. I could not run @parallel; I'm assuming it comes from using Distributed but may have changed name

The fixes on the notebook: https://github.com/tlienart/CombineML.jl, now in a PR fixing (1, 2, 4): #19

Fix caret wrapper using RCall.jl instead of PyCall.jl rpy

Currently, caret support relies on rpy, a Python wrapper for R. The existing implementation uses PyCall to import rpy objects, which are in turn used to load the caret package. By removing this dependency, caret access via RCall will be easier to maintain.

pycall api update

PyCall 1.90.0 is now released, which changes o[:foo] and o["foo"] to o.foo and o."foo", respectively, for Python objects o; see also JuliaPy/PyCall.jl#629.

The old getindex methods still work but are deprecated, so you'll want to put out a new release that uses the new methods and REQUIREs PyCall 1.90.0 to avoid having zillions of deprecation messages.
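
For reference, the syntax change looks like this (a hedged sketch; numpy stands in for any imported Python module):

import PyCall
np = PyCall.pyimport("numpy")
a = np[:zeros](3) # pre-1.90.0 getindex style, now deprecated
a = np.zeros(3)   # PyCall >= 1.90.0 property style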

Test failing with DecisionTree v0.6.5+

First off, thank you for maintaining this package, awesome work!!

There has been some fruitful effort recently in improving the performance (in execution time and memory allocation) of the DecisionTree package, for both classification and regression, and we're now seeing a 4-10x speedup. This required a rewrite of the _split() functions, so the splits, and hence predictions, are no longer identical to those of DT v0.6.5.

As a result, one of CombineML's tests ("Pipeline works with fixture data.") is now failing, and I'm having trouble isolating the issue.

I need your help and guidance in troubleshooting this failing test and getting CombineML onto the soon-to-be-released DT v0.7.2.

To reproduce the test error, just pull from master: Pkg.checkout("DecisionTree")
