caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models. Detailed documentation is at http://topepo.github.io/caret/index.html

install.packages('caret')
pak::pak('topepo/caret/pkg/caret')

Home Page: http://topepo.github.io/caret/index.html
Apparently, devtools 1.6 makes it very easy to set up free unit testing using Travis CI. I think it integrates with GitHub to allow for automated unit testing of pull requests. That would be pretty cool.
https://travis-ci.org/
http://blog.rstudio.org/2014/10/02/devtools-1-6/
In HTML and the man page.
The resamples class is designed to compare between-model resampling distributions, so I made the choice that only one configuration of tuning parameters should show up there. For some of the other plotting functions (like xyplot.train and a few others), I wanted to plot the within-model resamples versus the tuning parameters. That is why xyplot.train requires resamples = "all" but resamples needs returnResamp = "final".
See presentation
Right now, if extra features are in newdata, it can crash if there was pre-processing. After the formula handling, use predictors(object) to filter out extra predictors. This may be NULL, so do some checking first.
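A minimal sketch of that check; filter_extra_predictors is a hypothetical helper name, and it assumes predictors(object) returns either a character vector of the model's predictor names or NULL:

```r
## Hypothetical helper: drop columns of newdata that the model does
## not use, but only when predictors() gives usable column names.
filter_extra_predictors <- function(object, newdata) {
  keep <- predictors(object)
  ## predictors() may return NULL (or NA) for some model classes,
  ## so only subset when we get names that are actually present
  if (is.character(keep) && all(keep %in% colnames(newdata))) {
    newdata <- newdata[, keep, drop = FALSE]
  }
  newdata
}
```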
So they do not conflict with the same option in train
Via Benjamin Mack (and his code).
Have a function to extract the resamples from a train, rfe or sbf model. Have an option for optimal_only or something similar.
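A rough sketch of what such an extractor might look like; get_resamples is a hypothetical name, and it assumes the object stores per-resample results in object$resample (with one column per tuning parameter) and the chosen tuning values in object$bestTune, as train objects do:

```r
## Hypothetical extractor for the resampling results of a fitted model
get_resamples <- function(object, optimal_only = TRUE) {
  res <- object$resample
  if (optimal_only && !is.null(object$bestTune)) {
    keep <- rep(TRUE, nrow(res))
    for (p in names(object$bestTune)) {
      ## retain only rows matching the optimal tuning parameters
      if (p %in% names(res)) keep <- keep & res[[p]] == object$bestTune[[p]]
    }
    res <- res[keep, , drop = FALSE]
  }
  res
}
```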
From email correspondence with Tarek Abdunabi; he provided the code below as a prototype workflow and a ton of information about the methods and possible tuning parameters.
##################################
## I. Regression Problem
## In this example, we use the gas furnace dataset, which
## contains two input variables and one output variable.
##################################
## Input data: Using the Gas Furnace dataset,
## then split the data into training and testing datasets
library(frbs)
data(frbsData)
data.train <- frbsData$GasFurnance.dt[1:204, ]
data.tst <- frbsData$GasFurnance.dt[205:292, 1:2]
real.val <- matrix(frbsData$GasFurnance.dt[205:292, 3], ncol = 1)
## Define the interval of the data
range.data <- apply(data.train, 2, range)
## Set the method and its parameters,
## for example, we use Wang and Mendel's algorithm
method.type <- "WM"
control <- list(num.labels = 15, type.mf = "GAUSSIAN", type.defuz = "WAM",
                type.tnorm = "MIN", type.snorm = "MAX",
                type.implication.func = "ZADEH", name = "sim-0")
## Learning step: Generate an FRBS model
object.reg <- frbs.learn(data.train, range.data, method.type, control)
## Predicting step: Predict for newdata
res.test <- predict(object.reg, data.tst)
## Display the FRBS model
summary(object.reg)
## Plot the membership functions
plotMF(object.reg)
##################################
## II. Classification Problem
## In this example, we use the iris dataset, which
## contains four input variables and one output variable.
##################################
## Input data: Using the Iris dataset
data(iris)
set.seed(2)
## Shuffle the data
## then split the data to be training and testing datasets
irisShuffled <- iris[sample(nrow(iris)), ]
irisShuffled[, 5] <- unclass(irisShuffled[, 5])
tra.iris <- irisShuffled[1:105, ]
tst.iris <- irisShuffled[106:nrow(irisShuffled), 1:4]
real.iris <- matrix(irisShuffled[106:nrow(irisShuffled), 5], ncol = 1)
## Define the range of the input data. Note that it is only for the input variables.
range.data.input <- apply(iris[, -ncol(iris)], 2, range)
## Set the method and its parameters. In this case we use FRBCS.W algorithm
method.type <- "FRBCS.W"
control <- list(num.labels = 7, type.mf = "GAUSSIAN", type.tnorm = "MIN",
                type.snorm = "MAX", type.implication.func = "ZADEH")
## Learning step: Generate fuzzy model
object.cls <- frbs.learn(tra.iris, range.data.input, method.type, control)
## Predicting step: Predict newdata
res.test <- predict(object.cls, tst.iris)
## Display the FRBS model
summary(object.cls)
## Plot the membership functions
plotMF(object.cls)
Including autoencoder, deepnet and darch.
The request:
Dear Max,
When running cross-validation, the function trainControl allows you to set whether predictions should be saved or not. This is done with the argument savePredictions. The result is a data.frame with observed and predicted values from each cross-validation run.

Some prediction functions (such as predict.lm, predict.ar, predict.Arima, predict.glm, predict.loess, and so on) have an argument called se.fit which controls whether the standard error of each predicted value should be saved. The current structure of the function predictionFunction does not allow for setting se.fit = TRUE, and thus the standard error of each predicted value cannot be accessed.

Is there an easy way of saving the standard errors? Overall, I believe it would be interesting to save standard errors of prediction whenever a predict function allows it.
Best,
PhD Candidate
Graduate School in Agronomy - Soil Science
Federal Rural University of Rio de Janeiro
Guest Researcher
ISRIC - World Soil Information
Wageningen, the Netherlands
Homepage: soil-scientist.net Skype: alessandrosamuel
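For reference, this is the base-R se.fit behavior the request asks caret to surface, shown here with predict.lm on a built-in dataset:

```r
## predict.lm returns a list when se.fit = TRUE: $fit holds the
## predictions and $se.fit the standard error of each one
fit <- lm(mpg ~ wt, data = mtcars)
p <- predict(fit, newdata = mtcars[1:3, ], se.fit = TRUE)
p$fit     # predicted values
p$se.fit  # one standard error per prediction
```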
It can be checked against the functions sortinghat:::errorest_632plus, ipred:::errorest and peperr:::perr.
Found here. Why are there numbers falling where there are no points?
Some things that would have to change:
- train has a component called modelType that is currently either "Regression" or "Classification". Another option, "Survival", would need to be added and a search should be made for any error traps or checks on this value.
- is.Surv should be added to make sure the outcome is correct.
- postResample too.
See here.
This should be feasible now that x and y are carried along separately through train.
You need to do caret:::cluster.resamples(object) instead.
Also, for two classes, run only once.
Right now, the footer has Created on <Sexpr session>. This needs to be changed to some searchable tag (maybe FOOTFOOT, to be consistent with the other parts of the code) and gsub'ed in.
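The mechanics would presumably look something like this (the template text is illustrative; only the FOOTFOOT tag comes from the note above):

```r
## Replace the searchable FOOTFOOT tag with the session stamp
template <- "<p>Created on FOOTFOOT</p>"
footer <- gsub("FOOTFOOT", format(Sys.time(), "%Y-%m-%d"), template)
```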
Owen,
The documentation is unclear. The score function takes two vectors (x and y) that are an individual predictor and outcome, respectively. For example, anovaScores is an example scoring function that can be used for classification:
set.seed(1)
dat <- twoClassSim(100)
anovaScores(dat$Linear01, dat$Class)
[1] 0.6580359
Internally, sbf uses apply to run this on each column of the predictor matrix and those results are pumped into the filter function.
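That internal step can be sketched in a few lines (score_all and abs_cor are hypothetical names; sbf's actual code differs in details):

```r
## Run a per-column score function over every predictor; apply()
## keeps the column names, giving the named vector sbf expects
score_all <- function(x, y, score) apply(x, 2, score, y = y)

## toy score function: absolute correlation with the outcome
abs_cor <- function(x, y) abs(cor(x, y))
```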
So it looks to me that help(caretSBF) is most accurate. I'll update the documentation for the web site.
Thanks,
Max
On Thu, Sep 25, 2014 at 3:22 PM, Owen Solberg [email protected] wrote:
Dear Dr. Kuhn,
First let me express my appreciation for your book and R package — both are rapidly becoming mainstays of my analysis.
I am wondering if there is an inconsistency in some of caret's documentation.
On one hand, it sounds like the Selection By Filter score function is supposed to take a matrix of predictors and return a named vector of results.
From the website (http://topepo.github.io/caret/featureselection.html#filter): "The score Function: This function takes as inputs the predictors and the outcome in objects called x and y, respectively. The output should be a named vector of scores where the names correspond to the column names of x."
From help(sbfControl): The score function is used to return a vector of scores with names for each predictor (such as a p-value). Inputs are:
x: the predictors for the training samples
y: the current training outcomes
Other parts of the documentation say that the score function operates on a single feature at a time:
From help(caretSBF): anovaScores fits a simple linear model between a single feature and the outcome, then the p-value for the whole-model F-test is returned.
This is also the way the code actually works:
> caretSBF$score
function (x, y)
{
    if (is.factor(y))
        anovaScores(x, y)
    else gamScores(x, y)
}
Owen Solberg
Senior Scientist, Bioinformatics
HealthTell, Inc.
T: 510.545.6936
E: [email protected]
HealthTell | LinkedIn | Facebook | Twitter
on CRAN
Hi,
When I use knnImpute with caret_6.0-34, it returns an error.
Below is code to reproduce it; the same code works with caret_6.0-30 without any problem.
with caret_6.0-34:
> suppressMessages(library(caret))
> data(iris)
>
> iris[1,1] <- NA
> head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 NA 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
>
> m = preProcess(iris[,-5], method="knnImpute")
> head(predict(m, iris[,-5]), 3)
Error in nn2(old[, cols, drop = FALSE], new[, cols, drop = FALSE], k = k) :
no points in data!
Calls: head ... predict -> predict.preProcess -> apply -> FUN -> nn2
Execution halted
with caret_6.0-30:
> suppressMessages(library(caret))
> data(iris)
>
> iris[1,1] <- NA
> head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 NA 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
>
> m = preProcess(iris[,-5], method="knnImpute")
> head(predict(m, iris[,-5]), 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 -0.7824364 1.0156020 -1.335752 -1.311052
2 -1.1444955 -0.1315388 -1.335752 -1.311052
3 -1.3858682 0.3273175 -1.392399 -1.311052
so that the tuning parameter name is in the strip. Also add to ggplot.train.
Dear Mr. Kuhn!
Recently, while using train() in the caret package, I noticed that the index variable in trainControl does not accept unnamed lists:
> n = 100;
> m = 10;
> set.seed(1);
> x = matrix(runif(n * m), nrow = n);
> y = runif(n);
> idx <- lapply(1:10, function(x) sample(1:n, 0.9 * n));
> train(x = x, y = y, method = "glm", trControl = trainControl(index = idx));
Error in { : task 1 failed - "argument is of length zero"
> names(idx)
NULL
> names(idx) <- 1:10
> train(x = x, y = y, method = "glm", trControl = trainControl(index = idx));
Generalized Linear Model
100 samples
10 predictors
No pre-processing
Resampling: Bootstrapped (10 reps)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
0.302 0.0923 0.0587 0.0825
I know that index is usually populated by the output from createDataPartition or similar functions, and they return lists with names like "Fold1", etc.
I don't know if it was intended that unnamed lists are not accepted. If so, I kindly suggest mentioning it somewhere in the documentation, as understanding what the error relates to was far from obvious, at least for me.
Thank you,
Best regards,
Andrii Maksai
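The workaround implied by the report above takes two lines: name the custom index list before passing it to trainControl (the fold-style names here just mimic what createFolds would produce):

```r
## Custom resampling indices must currently be a *named* list
n <- 100
idx <- lapply(1:10, function(i) sample(1:n, 0.9 * n))
names(idx) <- paste0("Fold", seq_along(idx))
```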
This might be a good idea for keeping docs and man pages in sync. I personally like roxygen2 a lot, as it removes some of the boring parts of writing R manual pages.
on CRAN
The original request is from here. This is the request:
I'm using R's caret package to do some grid search and model evaluation. I have a custom evaluation metric that is a weighted average of absolute error. Weights are assigned at the observation level.
X <- c(1,1,2,0,1) #feature 1
w <- c(1,2,2,1,1) #weights
Y <- 1:5 #target, continuous
#assume I run a model using X as features and Y as target and get a vector of predictions
mymetric <- function(predictions, target, weights){
v <- sum(abs(target-predictions)*weights)/sum(weights)
return(v)
}
Here an example is given on how to use summaryFunction to define a custom evaluation metric for caret's train().
To quote:
The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have these arguments: data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns in data will be present that contain the class probabilities. The names of these columns are the same as the class levels. lev is a character string that has the outcome factor levels taken from the training data. For regression, a value of NULL is passed into the function. model is a character string for the model being used (i.e., the value passed to the method argument of train).
I cannot quite figure out how to pass the observation weights to summaryFunction.
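One possible workaround, assuming the weights vector is visible in the calling scope and that the data passed to summaryFunction carries a rowIndex column (as recent caret versions include), is to look the weights up by row:

```r
## Hypothetical weighted-MAE summary function; `w` is assumed to be
## a vector of observation weights aligned with the training data
wmaeSummary <- function(data, lev = NULL, model = NULL) {
  wts <- w[data$rowIndex]  # map held-out rows back to their weights
  c(WMAE = sum(abs(data$obs - data$pred) * wts) / sum(wts))
}
```

It could then be passed via trainControl(summaryFunction = wmaeSummary) along with metric = "WMAE" and maximize = FALSE in train().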
Found here
In trainControl, think about having an option to do sampling after resampling and prior to pre-processing/model fitting, with options for upSample, downSample, SMOTE and ROSE. Make offline code available (the same way models are stored) to avoid more package dependencies.
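For reference, the down-sampling case can be sketched in base R (down_sample is a hypothetical helper; caret's own downSample differs in details):

```r
## Sample every class down to the size of the smallest class
down_sample <- function(x, y) {
  n_min <- min(table(y))
  keep <- unlist(lapply(split(seq_along(y), y),
                        function(i) sample(i, n_min)))
  list(x = x[keep, , drop = FALSE], y = y[keep])
}
```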
If tuneLength > 10, make the grid interpolate between 2^-2 and 2^-6.
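Read literally, that grid would be (sigma_grid is a hypothetical helper name):

```r
## Interpolate the parameter grid on the log2 scale from 2^-2 down to 2^-6
sigma_grid <- function(len) 2^seq(-2, -6, length.out = len)
```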
heldout as a class name? Add an option to check indices for consistency.
As suggested by Michael Benesty, this enhancement would allow the function to be used in parallel and with large data sets:
library(foreach)

test <- function(dataset){
  res <- foreach(name = colnames(dataset), .combine = rbind) %do% {
    r <- caret::nearZeroVar(dataset[[name]], saveMetrics = TRUE)
    r[, "column"] <- name
    r
  }
  res[, c(5, 1, 2, 3, 4)]
}
When probabilities are not used with train models:
library(caret)
library(mlbench)
library(doMC)
registerDoMC(cores=7)
extra <- 3
set.seed(1)
training <- twoClassSim(200, noiseVars = extra)
p <- ncol(training)
ctrl <- trainControl(method = "cv", savePredictions = TRUE)
ctrl2 <- ctrl
ctrl2$allowParallel = FALSE
set.seed(1)
svmRFE <- rfe(training[, -p], training$Class,
sizes = c(2, 6),
rfeControl = rfeControl(functions = caretFuncs,
method = "cv",
verbose = TRUE,
saveDetails = TRUE),
## pass options to train()
method = "svmRadial",
tuneLength = 3,
trControl = ctrl2,
preProc = c("center", "scale"))
You get:
> subset(svmRFE$pred, rowIndex == 1 & Variables == 2)
pred Class1 Class2 obs Variables Resample rowIndex
predictions.857 Class1 NA NA Class1 2 Fold08 1
predictions.1067 Class1 NA NA Class1 2 Fold08 1
With classProbs = TRUE it works fine.
That the resamples may not be associated with the final model (and perhaps remove them).
Point towards the github IO page and change info in the package description file.
Tuning a gbm model with the poisson distribution does not seem to work. The following code exits with an error:
> set.seed(1)
> n <- 500
> nvar <- 10
> b0 <- 1
> b1 <- 1
> x <- as.data.frame(matrix(rep(runif(n), nvar), ncol=nvar))
> mu <- with(x, exp(b0 + b1 * sin(pi * V1)))
> y <- rpois(n, lambda = mu)
>
> tr <- train(x, y, method = "gbm", distribution = "poisson", verbose = FALSE)
Error in { :
task 1 failed - "arguments imply differing number of rows: 3, 0"
>
I have narrowed it down to the result <- foreach(...) in nominalTrainWorkflow() as the source of the error.
Session Info
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
attached base packages:
[1] parallel splines stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] gbm_2.1-06 survival_2.37-7 caret_6.0-35 ggplot2_0.9.3.1
[5] lattice_0.20-29
loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-5 brglm_0.5-9 car_2.0-20
[4] codetools_0.2-8 colorspace_1.2-4 compiler_3.1.1
[7] digest_0.6.4 foreach_1.4.2 grid_3.1.1
[10] gtable_0.1.2 gtools_3.4.1 iterators_1.0.7
[13] lme4_1.1-6 MASS_7.3-33 Matrix_1.1-4
[16] minqa_1.2.3 munsell_0.4.2 nlme_3.1-117
[19] nnet_7.3-8 plyr_1.8.1 proto_0.3-10
[22] Rcpp_0.11.1 RcppEigen_0.3.2.1.2 reshape2_1.4
[25] scales_0.2.4 stringr_0.6.2 tcltk_3.1.1
[28] tools_3.1.1
Older versions of the package listed each model and their tuning parameters in the train man file. A request was made to add the information back in, but on another page. This would be updated at each new version.
To-do: ggplot method to include size and add warnings about the number of resamples.
Package located on CRAN
Try to use "knnImpute" to do knn missing value imputation
object <- preProcess(newdata, method = "knnImpute")
newdata1 <- predict(object, newdata)
Error message is : Error in nn2(old[, cols, drop = FALSE], new[, cols, drop = FALSE], k = k) : no points in data!
Have found the bug to be in the preProcess() function.