mlr-org / mlr

Machine Learning in R

Home Page: https://mlr.mlr-org.com

License: Other

R 97.99% C 0.31% Shell 0.05% HTML 1.65%
machine-learning data-science tuning cran r-package predictive-modeling classification regression statistics r

mlr's Introduction

mlr

Package website: release | dev

Machine learning in R.


Deprecated

{mlr} is considered retired by the mlr-org team. We will not add new features anymore and will only fix severe bugs. We suggest using the new mlr3 framework from now on and for future projects.

Not all features of {mlr} are available in {mlr3} yet. If you are missing a crucial feature, please open an issue in the respective mlr3 extension package and do not hesitate to follow up on it.

Installation

Release

install.packages("mlr")

Development

remotes::install_github("mlr-org/mlr")

Citing {mlr} in publications

Please cite our JMLR paper [bibtex].

Some parts of the package were created as part of other publications. If you use these parts, please cite the relevant work appropriately. An overview of all {mlr} related publications can be found here.

Introduction

R does not define a standardized interface for its machine-learning algorithms. Therefore, for any non-trivial experiments, you need to write lengthy, tedious and error-prone wrappers to call the different algorithms and unify their respective output.

Additionally, you need to implement infrastructure to

  • resample your models
  • optimize hyperparameters
  • select features
  • cope with pre- and post-processing of data
  • compare models in a statistically meaningful way

As all of this becomes computationally expensive, you might want to parallelize your experiments as well. Time constraints or a lack of programming expertise often force users into suboptimal trade-offs in their experiments.

{mlr} provides this infrastructure so that you can focus on your experiments! The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering. It is written in a way that you can extend it yourself or deviate from the implemented convenience methods and construct your own complex experiments or algorithms.
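
To illustrate, here is a minimal sketch of the basic workflow (task, learner, resampling); it assumes the rpart package is installed for the "classif.rpart" learner:

library(mlr)

# Define the task, the learner and a resampling strategy
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 3)

# Estimate the misclassification error via 3-fold cross-validation
res = resample(lrn, task, rdesc, measures = mmce)
res$aggr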

Furthermore, the package is nicely connected to the OpenML R package and its online platform, which aims at supporting collaborative machine learning online and makes it easy to share datasets as well as machine learning tasks, algorithms and experiments in order to support reproducible research.

Features

  • Clear S3 interface to R classification, regression, clustering and survival analysis methods
  • Abstract description of learners and tasks by properties
  • Convenience methods and generic building blocks for your machine learning experiments
  • Resampling methods like bootstrapping, cross-validation and subsampling
  • Extensive visualizations (e.g. ROC curves, predictions and partial predictions)
  • Simplified benchmarking across data sets and learners
  • Easy hyperparameter tuning using different optimization strategies (see the sketch after this list), including potent configurators like
    • iterated F-racing (irace)
    • sequential model-based optimization
  • Variable selection with filters and wrappers
  • Nested resampling of models with tuning and feature selection
  • Cost-sensitive learning, threshold tuning and imbalance correction
  • Wrapper mechanism to extend learner functionality in complex ways
  • Possibility to combine different processing steps to a complex data mining chain that can be jointly optimized
  • OpenML connector for the Open Machine Learning server
  • Built-in parallelization
  • Detailed tutorial
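
As referenced in the tuning bullet above, a minimal tuning sketch (assuming the rpart learner, a random search over its complexity parameter cp, and the ParamHelpers functions that mlr attaches):

library(mlr)

# Search space for rpart's complexity parameter
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
ps = makeParamSet(makeNumericParam("cp", lower = 0.001, upper = 0.1))
ctrl = makeTuneControlRandom(maxit = 10L)
rdesc = makeResampleDesc("CV", iters = 3)

# Evaluate random configurations via cross-validation
res = tuneParams(lrn, task, rdesc, par.set = ps, control = ctrl)
res$x  # best hyperparameter setting found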

Miscellaneous

Simple usage questions are better asked on Stack Overflow using the mlr tag.

Please note that all of us work in academia and put a lot of work into this project - simply because we like it, not because we are paid for it.

New development efforts should go into {mlr3}. We have our own style guide, which can easily be applied by using the mlr_style from the styler package. See our wiki for more information.

Talks, Workshops, etc.

mlr-outreach holds all outreach activities related to {mlr} and {mlr3}.

mlr's People

Contributors

alexengelhardt, berndbischl, bhvieira, coorsaa, danielhorn, dominikkirchhoff, florianfendt, gegznav, giuseppec, hetong007, ja-thomas, jackknifex, jakob-r, jakobbossek, karinschork, kerschke, larskotthoff, mariaerdmann, masongallo, mb706, mllg, pat-s, pfistfl, philipppro, pre-commit-ci[bot], schiffner, studerus, t-8-n, web-flow, zmjones


mlr's Issues

caret::train can shadow mlr::train

It is annoying that when caret is loaded (e.g., for one of its learners), caret's train method shadows mlr's. This only happens in the global user namespace, but it leads to an unintuitive error message for users.

I currently see no real fix except for renaming, which I dislike.
Let's think about it.

MultiClass AUC

See whether we can create an AUC measure for more than 2 classes in mlr.

Here is a hint, sent by Markus by mail.

 # Train an LDA model with probability predictions
 learner <- makeLearner('classif.lda', predict.type = "prob")
 task <- makeClassifTask(data = iris, target = "Species")
 mod <- train(learner = learner, task = task)
 pred.obj <- predict(mod, newdata = iris)

 # Hand & Till (2001) multi-class AUC on the probability matrix
 library(HandTill2001)
 predicted <- as.matrix(pred.obj$data[, paste("prob.", levels(pred.obj$data$response), sep = "")])
 colnames(predicted) <- levels(pred.obj$data$response)
 auc(multcap(response = pred.obj$data$response, predicted = predicted))
  1. Investigate

  2. Add Measure, doc. and test it

  3. Briefly describe in tutorial / ROC part.

Reread Michel's new imputation code and add a section to the tutorial

  • read code
  • read roxygen help
  • correct errors in both and extend docs a bit
  • add section in tutorial to explain how it works

these are the files:

  • Impute.R
  • ImputeMethods.R
  • PreprocImputeWrapper.R

Also, ImputeWrapper is probably a better and shorter name than PreprocImputeWrapper.
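
For orientation while reading, a minimal sketch of how the impute API is meant to be used (argument names as in Impute.R; treat this as an assumption, not final documentation):

library(mlr)

# Introduce some missing values
data = iris
data[sample(nrow(data), 10), "Sepal.Length"] = NA

# Impute numeric columns by their mean; the returned description
# can be reused via reimpute() on new data
imp = impute(data, target = "Species", classes = list(numeric = imputeMean()))
head(imp$data)

# The same behavior wrapped around a learner
lrn = makeImputeWrapper(makeLearner("classif.rpart"),
  classes = list(numeric = imputeMean()))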

Feature filtering

  1. Check that filtering is nicely explained in the tutorial

  2. Can we access the filtered features after training a filter wrapper? (see the sketch after this list)

  3. add MRMR, maybe also fmrmr
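
Regarding item 2, a sketch of how accessing the filtered features could look (assuming the filter wrapper API with makeFilterWrapper and a getFilteredFeatures accessor, and the built-in "anova.test" filter):

library(mlr)

task = makeClassifTask(data = iris, target = "Species")

# Standalone filtering: keep the top 50% of features by ANOVA score
filtered.task = filterFeatures(task, method = "anova.test", perc = 0.5)

# As a wrapper, so filtering happens inside each resampling iteration
lrn = makeFilterWrapper(makeLearner("classif.rpart"),
  fw.method = "anova.test", fw.perc = 0.5)
mod = train(lrn, task)
getFilteredFeatures(mod)  # features the filter kept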

Tutorial: set show.info=FALSE to reduce some unnecessary output

In some cases, e.g., when calling resample in later tutorial sections, show.info should be set to FALSE so that we do not get so much clutter on the page.

Only do this when the output is very long and the reader does not really gain any additional understanding from seeing it.

Observation weighting

mlr already supports weighted observations. Learners have a property that tells you whether they can be fitted in a weighted way; listLearners can give you all such learners.

  1. Better describe in the tutorial how this works, probably in the "learner" part.

  2. train and resample allow passing weights;
    tuneParams, selectFeatures and the corresponding wrappers do not.
    Discuss and then extend. Maybe one also wants to set the weights in the task? Less annoying in some cases. A sketch of the current state follows below.
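
A short sketch of the current state (weights in train and in the task are assumed to behave as in the released API):

library(mlr)

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")  # rpart supports observation weights

# Upweight the first 50 observations
w = c(rep(2, 50), rep(1, 100))
mod = train(lrn, task, weights = w)

# Alternatively, store the weights in the task itself
task.w = makeClassifTask(data = iris, target = "Species", weights = w)

# List all learners that can handle weights
listLearners("classif", properties = "weights")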

Extract probability matrix from prediction object

It seems helpful to either have a method to extract the probability matrix from a prediction object or to store it directly as a matrix / data.frame.

Example

learner <- makeLearner('classif.lda', predict.type = "prob")
task <- makeClassifTask(data = iris, target = "Species")
mod <- train(learner = learner, task = task)
pred.obj <- predict(mod, newdata = iris)
# Manually pull the per-class probability columns into a matrix
as.matrix(pred.obj$data[, paste("prob.", levels(pred.obj$data$response), sep = "")])

This does not seem like an elegant solution, neither does anything I can come up with at the moment.

I think pred.obj$pred should return a matrix. But I don't know how that would interfere with existing methods.
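
For reference, later mlr releases added an accessor for exactly this; a minimal sketch, assuming getPredictionProbabilities() is available:

learner <- makeLearner('classif.lda', predict.type = "prob")
task <- makeClassifTask(data = iris, target = "Species")
mod <- train(learner = learner, task = task)
pred.obj <- predict(mod, newdata = iris)

# Returns the per-class probabilities as a data.frame
head(getPredictionProbabilities(pred.obj))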

add "range" as a new aggregation function

Similar to this

# Aggregate test performances over resampling iterations by their range
my.range.aggr = mlr:::makeAggregation(id = "test.range",
  fun = function(task, perf.test, perf.train, measure, group, pred)
    max(perf.test) - min(perf.test))

Possibly export makeAggregation so the user can do this, too.

Also explain how to do this in the tutorial.

Add a (web / github) example to show multicriteria evaluation with mlr

Here is an example of how to simultaneously look at the mmce and the range of errors over the resampling iterations.

library(mlr)
library(mlbench)
library(ggplot2)

# Task, learner and a 2-fold CV
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 2)

# Criterion 1: mean mmce; criterion 2: range of mmce over the folds
ms1 = mmce
my.range.aggr = mlr:::makeAggregation(id = "test.range",
  fun = function(task, perf.test, perf.train, measure, group, pred)
    max(perf.test) - min(perf.test))
ms2 = setAggregation(mmce, my.range.aggr)

# Exhaustive feature selection, evaluated on both measures
res = selectFeatures(lrn, task, rdesc, measures = list(ms1, ms2),
  control = makeFeatSelControlExhaustive())

# Plot the two criteria against each other
perf.data = as.data.frame(res$opt.path)
p = ggplot(perf.data, aes(x = mmce.test.mean, y = mmce.test.range)) +
  geom_point()
print(p)

Stratified CV does not distribute observations to folds equally

This code snippet is called on each class label separately:

instantiateResampleInstance.CVDesc = function(desc, size) {
  test.inds = sample(size)
  # don't warn when we can't split evenly
  test.inds = suppressWarnings(split(test.inds, seq_len(desc$iters)))
  makeResampleInstanceInternal(desc, size, test.inds=test.inds)
}

The remaining observations are distributed to the first folds. After joining the separate per-class splits, you can end up with up to [iters] more observations in the first fold than in the others.
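
A small sketch to observe the reported behavior (class sizes chosen so they do not divide evenly by the number of folds; whether the imbalance appears depends on the version):

library(mlr)

task = makeClassifTask(data = iris, target = "Species")
rdesc = makeResampleDesc("CV", iters = 7, stratify = TRUE)
rin = makeResampleInstance(rdesc, task)

# Fold sizes: the first folds collect the per-class remainders
sapply(rin$test.inds, length)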

[survival] getTaskFormula: interface change required

Using the survival::Surv function on the LHS of the formula is the preferred way to construct the formulas required by most survival packages, as this does not force copies of the input data.
But the argument delete.env is a hindrance here: with no environment attached, the survival package is not in the search path and the function lookup will fail. On the other hand, I'd like to not carry these environments around for obvious reasons.

Is it okay to touch the interface of this function? The parameter delete.env is never used in mlr, but might be used in other projects.
I'd opt to replace it with a new parameter env defaulting to NULL or emptyenv(). I could then set this to as.environment("package:survival"), which should have a similar effect but will allow the function lookup.
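
A hypothetical sketch of the proposed behavior (the env argument and the shown call are part of the proposal, not the current API):

# Attach a caller-supplied environment so the Surv() lookup succeeds
# even when the formula carries no user environment; requires that
# library(survival) put "package:survival" on the search path
f = as.formula("Surv(time, status) ~ .",
  env = as.environment("package:survival"))
environment(f)  # the survival package environment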

Nested Resampling

How can I implement a version of nested resampling?

So far, I'm splitting my data into training and test sets (using method = "subsample").
Now I want to run a feature selection on the training sets, using cross-validation. Afterwards, I want to evaluate my results on the test sets of the subsamples.

Unfortunately, I can't find anything similar in the tutorial.
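
One way to express this in mlr, sketched under the assumption that the feature selection wrapper API applies here: wrap the learner so feature selection runs in an inner cross-validation, then resample the wrapper with an outer subsampling.

library(mlr)

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")

# Inner loop: feature selection via 3-fold cross-validation
inner = makeResampleDesc("CV", iters = 3)
ctrl = makeFeatSelControlRandom(maxit = 10L)
lrn.fs = makeFeatSelWrapper(lrn, resampling = inner, control = ctrl)

# Outer loop: evaluate the whole pipeline on subsampled test sets
outer = makeResampleDesc("Subsample", iters = 5)
res = resample(lrn.fs, task, outer, extract = getFeatSelResult)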

Describe how ROC curves can be plotted with mlr and ROCR

Construct example, 2 class problem from mlbench, 2 learners.

Cross-validate and compare the ROC curves in one plot.

Add this example to the tutorial and to the @example section of asROCRPredictions.R.

ROCR has examples showing how the plot is constructed; copy a simple one after calling asROCRPredictions.
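
A sketch of such an example (assuming the exported converter is named asROCRPrediction, and that ROCR and mlbench are installed):

library(mlr)
library(ROCR)
library(mlbench)

# Binary task from mlbench, two learners with probability output
data(Sonar)
task = makeClassifTask(data = Sonar, target = "Class")
lrns = list(
  makeLearner("classif.lda", predict.type = "prob"),
  makeLearner("classif.rpart", predict.type = "prob"))
rdesc = makeResampleDesc("CV", iters = 5)

# Cross-validate each learner and overlay the ROC curves in one plot
for (i in seq_along(lrns)) {
  res = resample(lrns[[i]], task, rdesc)
  perf = ROCR::performance(asROCRPrediction(res$pred), "tpr", "fpr")
  plot(perf, add = (i > 1), col = i)
}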

Proposition: Add methods from package 'DiscriMiner'

Methods such as plsDA and geoDA. Perhaps take a look at whether it is an interesting package in general. The last update was in November 2013.

linDA and quDA are also available, but I don't know how they differ from MASS's lda and qda, which already exist in mlr.

Unify interface of "preprocessing operations before training"

We already have a couple of those:

  • impute
  • filter features
  • over/undersample
  • what else?

We have to make a list, then make the interface consistent, like:

doTheOp(obj, data, target)   # generic
doTheOp.data.frame           # method for data frames
doTheOp.task                 # method for tasks

makeOpWrapper                # internally calls doTheOp

getOpResults(model)          # lets the user access the operation results
                             # after the wrapper was trained

Not every learner is compatible with makeBaggingWrapper()

For instance

library("mlr")
data(iris)
tsk = makeClassifTask(data=iris, target="Species")
lrn = makeLearner("classif.fnn")
bagLrn = makeBaggingWrapper(lrn, bag.iters=5, bag.replace=TRUE, bag.size=0.6, bag.feats=3/4, predict.type="prob")
rsmpl = makeResampleDesc("RepCV", reps=5, fold=2)
resample(learner=bagLrn, task=tsk, resampling=rsmpl)

[Resample] repeated cross-validation iter: 1
Error in (function (train, test, cl, k = 1, prob = FALSE, algorithm = c("kd_tree",  :
  dims of 'test' and 'train' differ

I think the predictor dislikes the fact that it gets the full dataset, including variables not used during learning (bag.feats = 3/4).
