mlr-org / mlr

Machine Learning in R

Home Page: https://mlr.mlr-org.com

License: Other

R 97.99% C 0.31% Shell 0.05% HTML 1.65%
machine-learning data-science tuning cran r-package predictive-modeling classification regression statistics r

mlr's Introduction

mlr

Package website: release | dev

Machine learning in R.


Deprecated

{mlr} is considered retired by the mlr-org team. We will not add new features anymore and will only fix severe bugs. We suggest using the new mlr3 framework from now on and for future projects.

Not all features of {mlr} are available in {mlr3} yet. If you are missing a crucial feature, please open an issue in the respective mlr3 extension package and do not hesitate to follow up on it.

Installation

Release

install.packages("mlr")

Development

remotes::install_github("mlr-org/mlr")

Citing {mlr} in publications

Please cite our JMLR paper [bibtex].

Some parts of the package were created as part of other publications. If you use these parts, please cite the relevant work appropriately. An overview of all {mlr} related publications can be found here.

Introduction

R does not define a standardized interface for its machine-learning algorithms. Therefore, for any non-trivial experiments, you need to write lengthy, tedious and error-prone wrappers to call the different algorithms and unify their respective output.

Additionally, you need to implement infrastructure to

  • resample your models
  • optimize hyperparameters
  • select features
  • cope with pre- and post-processing of data
  • compare models in a statistically meaningful way

As all of this becomes computationally expensive, you might want to parallelize your experiments as well. Time constraints or a lack of programming expertise often force users into suboptimal trade-offs in their experiments.

{mlr} provides this infrastructure so that you can focus on your experiments! The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering. It is written in a way that you can extend it yourself or deviate from the implemented convenience methods and construct your own complex experiments or algorithms.
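
To illustrate, here is a minimal sketch of the basic workflow (task, learner, resampling); it assumes the rpart package is installed for the "classif.rpart" learner:

library(mlr)

# Define the task, the learner and a resampling strategy
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 3)

# Estimate the misclassification error via 3-fold cross-validation
res = resample(lrn, task, rdesc, measures = mmce)
res$aggr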

Furthermore, the package is nicely connected to the OpenML R package and its online platform, which aims at supporting collaborative machine learning online and makes it easy to share datasets as well as machine learning tasks, algorithms and experiments in order to support reproducible research.

Features

  • Clear S3 interface to R classification, regression, clustering and survival analysis methods
  • Abstract description of learners and tasks by properties
  • Convenience methods and generic building blocks for your machine learning experiments
  • Resampling methods like bootstrapping, cross-validation and subsampling
  • Extensive visualizations (e.g. ROC curves, predictions and partial predictions)
  • Simplified benchmarking across data sets and learners
  • Easy hyperparameter tuning using different optimization strategies (see the sketch after this list), including potent configurators like
    • iterated F-racing (irace)
    • sequential model-based optimization
  • Variable selection with filters and wrappers
  • Nested resampling of models with tuning and feature selection
  • Cost-sensitive learning, threshold tuning and imbalance correction
  • Wrapper mechanism to extend learner functionality in complex ways
  • Possibility to combine different processing steps to a complex data mining chain that can be jointly optimized
  • OpenML connector for the Open Machine Learning server
  • Built-in parallelization
  • Detailed tutorial
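
As referenced in the tuning bullet above, a minimal tuning sketch (assuming the rpart learner, a random search over its complexity parameter cp, and the ParamHelpers functions that mlr attaches):

library(mlr)

# Search space for rpart's complexity parameter
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
ps = makeParamSet(makeNumericParam("cp", lower = 0.001, upper = 0.1))
ctrl = makeTuneControlRandom(maxit = 10L)
rdesc = makeResampleDesc("CV", iters = 3)

# Evaluate random configurations via cross-validation
res = tuneParams(lrn, task, rdesc, par.set = ps, control = ctrl)
res$x  # best hyperparameter setting found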

Miscellaneous

Simple usage questions are better asked on Stack Overflow using the mlr tag.

Please note that all of us work in academia and put a lot of work into this project - simply because we like it, not because we are paid for it.

New development efforts should go into {mlr3}. We have our own style guide, which can easily be applied by using the mlr_style from the styler package. See our wiki for more information.

Talks, Workshops, etc.

mlr-outreach holds all outreach activities related to {mlr} and {mlr3}.

mlr's People

Contributors

alexengelhardt, berndbischl, bhvieira, coorsaa, danielhorn, dominikkirchhoff, florianfendt, gegznav, giuseppec, hetong007, ja-thomas, jackknifex, jakob-r, jakobbossek, karinschork, kerschke, larskotthoff, mariaerdmann, masongallo, mb706, mllg, pat-s, pfistfl, philipppro, pre-commit-ci[bot], schiffner, studerus, t-8-n, web-flow, zmjones


mlr's Issues

caret::train can shadow mlr::train

It is annoying that when caret is loaded (e.g., for one of its learners), caret's train method shadows mlr's. This only happens in the global user namespace, but it leads to an unintuitive error message for users.

I currently see no real fix except for renaming, which I dislike.
Let's think about it.

MultiClass AUC

See whether we can create an AUC measure for more than 2 classes in mlr.

Here is a hint, sent by Markus by mail.

 # Train an LDA model with probability predictions
 learner <- makeLearner('classif.lda', predict.type = "prob")
 task <- makeClassifTask(data = iris, target = "Species")
 mod <- train(learner = learner, task = task)
 pred.obj <- predict(mod, newdata = iris)

 # Hand & Till (2001) multi-class AUC on the probability matrix
 library(HandTill2001)
 predicted <- as.matrix(pred.obj$data[, paste("prob.", levels(pred.obj$data$response), sep = "")])
 colnames(predicted) <- levels(pred.obj$data$response)
 auc(multcap(response = pred.obj$data$response, predicted = predicted))
  1. Investigate

  2. Add Measure, doc. and test it

  3. Briefly describe in tutorial / ROC part.

Reread Michel's new imputation code and add a section to the tutorial

  • read code
  • read roxygen help
  • correct errors in both and extend docs a bit
  • add section in tutorial to explain how it works

these are the files:

  • Impute.R
  • ImputeMethods.R
  • PreprocImputeWrapper.R

Also, ImputeWrapper is probably a better and shorter name than PreprocImputeWrapper.
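
For orientation while reading, a minimal sketch of how the impute API is meant to be used (argument names as in Impute.R; treat this as an assumption, not final documentation):

library(mlr)

# Introduce some missing values
data = iris
data[sample(nrow(data), 10), "Sepal.Length"] = NA

# Impute numeric columns by their mean; the returned description
# can be reused via reimpute() on new data
imp = impute(data, target = "Species", classes = list(numeric = imputeMean()))
head(imp$data)

# The same behavior wrapped around a learner
lrn = makeImputeWrapper(makeLearner("classif.rpart"),
  classes = list(numeric = imputeMean()))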

Feature filtering

  1. Check that filtering is nicely explained in the tutorial

  2. Can we access the filtered features after training a filter wrapper? (see the sketch after this list)

  3. add MRMR, maybe also fmrmr
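
Regarding item 2, a sketch of how accessing the filtered features could look (assuming the filter wrapper API with makeFilterWrapper and a getFilteredFeatures accessor, and the built-in "anova.test" filter):

library(mlr)

task = makeClassifTask(data = iris, target = "Species")

# Standalone filtering: keep the top 50% of features by ANOVA score
filtered.task = filterFeatures(task, method = "anova.test", perc = 0.5)

# As a wrapper, so filtering happens inside each resampling iteration
lrn = makeFilterWrapper(makeLearner("classif.rpart"),
  fw.method = "anova.test", fw.perc = 0.5)
mod = train(lrn, task)
getFilteredFeatures(mod)  # features the filter kept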

Tutorial: set show.info=FALSE to reduce some unnecessary output

In some cases, e.g., when calling resample in later tutorial sections, show.info should be set to FALSE so that we do not get so much clutter on the page.

Only do this when the output is very long and the reader does not really gain any additional understanding from seeing it.

Observation weighting

mlr already supports weighted observations. Learners have a property that tells you whether they can be fitted in a weighted way; listLearners can give you all such learners.

  1. Better describe in the tutorial how this works, probably in the "learner" part.

  2. train and resample allow passing weights;
    tuneParams, selectFeatures and the corresponding wrappers do not.
    Discuss and then extend. Maybe one also wants to set the weights in the task? Less annoying in some cases. A sketch of the current state follows below.
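
A short sketch of the current state (weights in train and in the task are assumed to behave as in the released API):

library(mlr)

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")  # rpart supports observation weights

# Upweight the first 50 observations
w = c(rep(2, 50), rep(1, 100))
mod = train(lrn, task, weights = w)

# Alternatively, store the weights in the task itself
task.w = makeClassifTask(data = iris, target = "Species", weights = w)

# List all learners that can handle weights
listLearners("classif", properties = "weights")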

Extract probability matrix from prediction object

It seems helpful to either have a method to extract the probability matrix from a prediction object or to store it directly as a matrix / data.frame.

Example

learner <- makeLearner('classif.lda', predict.type = "prob")
task <- makeClassifTask(data = iris, target = "Species")
mod <- train(learner = learner, task = task)
pred.obj <- predict(mod, newdata = iris)
# Manually pull the per-class probability columns into a matrix
as.matrix(pred.obj$data[, paste("prob.", levels(pred.obj$data$response), sep = "")])

This does not seem like an elegant solution, neither does anything I can come up with at the moment.

I think pred.obj$pred should return a matrix. But I don't know how that would interfere with existing methods.
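
For reference, later mlr releases added an accessor for exactly this; a minimal sketch, assuming getPredictionProbabilities() is available:

learner <- makeLearner('classif.lda', predict.type = "prob")
task <- makeClassifTask(data = iris, target = "Species")
mod <- train(learner = learner, task = task)
pred.obj <- predict(mod, newdata = iris)

# Returns the per-class probabilities as a data.frame
head(getPredictionProbabilities(pred.obj))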

add "range" as a new aggregation function

Similar to this

# Aggregate test performances over resampling iterations by their range
my.range.aggr = mlr:::makeAggregation(id = "test.range",
  fun = function(task, perf.test, perf.train, measure, group, pred)
    max(perf.test) - min(perf.test))

Possibly export makeAggregation so the user can do this, too.

Also explain how to do this in the tutorial.

Add a (web / github) example to show multicriteria evaluation with mlr

Here is an example of how to simultaneously look at the mmce and the range of errors over the resampling iterations.

library(mlr)
library(mlbench)
library(ggplot2)

# Task, learner and a 2-fold CV
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 2)

# Criterion 1: mean mmce; criterion 2: range of mmce over the folds
ms1 = mmce
my.range.aggr = mlr:::makeAggregation(id = "test.range",
  fun = function(task, perf.test, perf.train, measure, group, pred)
    max(perf.test) - min(perf.test))
ms2 = setAggregation(mmce, my.range.aggr)

# Exhaustive feature selection, evaluated on both measures
res = selectFeatures(lrn, task, rdesc, measures = list(ms1, ms2),
  control = makeFeatSelControlExhaustive())

# Plot the two criteria against each other
perf.data = as.data.frame(res$opt.path)
p = ggplot(perf.data, aes(x = mmce.test.mean, y = mmce.test.range)) +
  geom_point()
print(p)

Stratified CV does not distribute observations to folds equally

This code snippet is called on each class label separately:

instantiateResampleInstance.CVDesc = function(desc, size) {
  test.inds = sample(size)
  # don't warn when we can't split evenly
  test.inds = suppressWarnings(split(test.inds, seq_len(desc$iters)))
  makeResampleInstanceInternal(desc, size, test.inds=test.inds)
}

The remaining observations are distributed to the first folds. After joining the separate per-class splits, you can end up with up to [iters] more observations in the first fold than in the others.
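
A small sketch to observe the reported behavior (class sizes chosen so they do not divide evenly by the number of folds; whether the imbalance appears depends on the version):

library(mlr)

task = makeClassifTask(data = iris, target = "Species")
rdesc = makeResampleDesc("CV", iters = 7, stratify = TRUE)
rin = makeResampleInstance(rdesc, task)

# Fold sizes: the first folds collect the per-class remainders
sapply(rin$test.inds, length)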

[survival] getTaskFormula: interface change required

Using the survival::Surv function on the LHS of the formula is the preferred way to construct the formulas required by most survival packages, as this does not force copies of the input data.
But the argument delete.env is a hindrance here: with no environment attached, the survival package is not in the search path and the function lookup will fail. On the other hand, I'd like to not carry these environments around for obvious reasons.

Is it okay to touch the interface of this function? The parameter delete.env is never used in mlr, but might be used in other projects.
I'd opt to replace it with a new parameter env defaulting to NULL or emptyenv(). I could then set this to as.environment("package:survival"), which should have a similar effect but will allow the function lookup.
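
A hypothetical sketch of the proposed behavior (the env argument and the shown call are part of the proposal, not the current API):

# Attach a caller-supplied environment so the Surv() lookup succeeds
# even when the formula carries no user environment; requires that
# library(survival) put "package:survival" on the search path
f = as.formula("Surv(time, status) ~ .",
  env = as.environment("package:survival"))
environment(f)  # the survival package environment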

Nested Resampling

How can I implement a version of nested resampling?

So far, I'm splitting my data into training and test sets (using method = "subsample").
Now I want to run a feature selection on the training sets, using cross-validation. Afterwards, I want to evaluate my results on the test sets of the subsamples.

Unfortunately, I can't find anything similar in the tutorial.
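
One way to express this in mlr, sketched under the assumption that the feature selection wrapper API applies here: wrap the learner so feature selection runs in an inner cross-validation, then resample the wrapper with an outer subsampling.

library(mlr)

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")

# Inner loop: feature selection via 3-fold cross-validation
inner = makeResampleDesc("CV", iters = 3)
ctrl = makeFeatSelControlRandom(maxit = 10L)
lrn.fs = makeFeatSelWrapper(lrn, resampling = inner, control = ctrl)

# Outer loop: evaluate the whole pipeline on subsampled test sets
outer = makeResampleDesc("Subsample", iters = 5)
res = resample(lrn.fs, task, outer, extract = getFeatSelResult)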

Describe how ROC curves can be plotted with mlr and ROCR

Construct example, 2 class problem from mlbench, 2 learners.

Cross-validate and compare the ROC curves in one plot.

Add this example to the tutorial and to the @example section of asROCRPredictions.R.

ROCR has examples showing how the plot is constructed; copy a simple one after calling asROCRPredictions.
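
A sketch of such an example (assuming the exported converter is named asROCRPrediction, and that ROCR and mlbench are installed):

library(mlr)
library(ROCR)
library(mlbench)

# Binary task from mlbench, two learners with probability output
data(Sonar)
task = makeClassifTask(data = Sonar, target = "Class")
lrns = list(
  makeLearner("classif.lda", predict.type = "prob"),
  makeLearner("classif.rpart", predict.type = "prob"))
rdesc = makeResampleDesc("CV", iters = 5)

# Cross-validate each learner and overlay the ROC curves in one plot
for (i in seq_along(lrns)) {
  res = resample(lrns[[i]], task, rdesc)
  perf = ROCR::performance(asROCRPrediction(res$pred), "tpr", "fpr")
  plot(perf, add = (i > 1), col = i)
}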

Proposition: Add methods from package 'DiscriMiner'

Methods such as plsDA and geoDA. Perhaps take a look at whether it is an interesting package in general. The last update was in November 2013.

linDA and quDA are also available, but I don't know how they differ from MASS's lda and qda, which already exist in mlr.

Unify interface of "preprocessing operations before training"

We already have a couple of those:

  • impute
  • filter features
  • over/undersample
  • what else?

We have to make a list, then make the interface consistent, like:

doTheOp(obj, data, target)   # generic
doTheOp.data.frame           # method for data frames
doTheOp.task                 # method for tasks

makeOpWrapper                # internally calls doTheOp

getOpResults(model)          # lets the user access the operation results
                             # after the wrapper was trained

Not every learner is compatible with makeBaggingWrapper()

For instance

library("mlr")
data(iris)
tsk = makeClassifTask(data=iris, target="Species")
lrn = makeLearner("classif.fnn")
bagLrn = makeBaggingWrapper(lrn, bag.iters=5, bag.replace=TRUE, bag.size=0.6, bag.feats=3/4, predict.type="prob")
rsmpl = makeResampleDesc("RepCV", reps=5, fold=2)
resample(learner=bagLrn, task=tsk, resampling=rsmpl)

[Resample] repeated cross-validation iter: 1
Error in (function (train, test, cl, k = 1, prob = FALSE, algorithm = c("kd_tree",  :
  dims of 'test' and 'train' differ

I think the predictor dislikes the fact that it gets the full dataset, including variables not used during learning (bag.feats = 3/4).
