Random Forest

Random Forest is an ensemble learning method built from many decision trees. I have written about decision trees before; in essence, a decision tree is a flow chart in which each decision is made according to a set of criteria. To illustrate how a Random Forest classifier is built, consider a two-dimensional dataset with N cases (rows) and M variables (columns). The algorithm builds many independent decision trees, each from a random subset of the data. For each tree, a random sample of n cases is drawn from the N available cases; the cases not used to construct that tree are called the Out Of Bag (OOB) cases. In addition, at each node (decision point) of a tree, a random subset of m variables is considered out of the M available variables. The value of m is important and the square root of M is typically used, e.g. with 100 variables, only 10 are considered at each node.

Once a set of independent trees has been constructed (the default number of trees is 500 in the randomForest R package), a new case is classified by applying it to each tree and collecting the individual classifications. For example, if 500 trees were built, there would be 500 individual classifications. The final step is to tally these votes (called majority voting) and assign the most common classification to the new case.
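The voting step and the square-root rule can be sketched in a few lines of base R. The votes below are made up for illustration and do not come from a fitted model.

```r
# Hypothetical votes from 500 trees for a single new case (made-up data)
set.seed(1)
votes <- sample(c("A", "B", "C"), size = 500, replace = TRUE,
                prob = c(0.6, 0.3, 0.1))

# Tally the votes per class and pick the class with the most votes
vote_counts <- table(votes)
majority_class <- names(vote_counts)[which.max(vote_counts)]

# Number of variables tried at each node for M = 100 (square-root rule)
M <- 100
m <- floor(sqrt(M))
```

Here majority_class holds the label assigned to the new case and m is the number of variables considered at each node.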

More details are provided below.


Before we get into more details, make sure you understand the following terms, which were adapted from the glossary of Machine learning applications in genetics and genomics.

  • Features / predictors / independent variables are single measurements or descriptors of examples used in a machine learning task. For a person, age, height, weight, etc. are all individual features.
  • Labels / classes are the targets of a prediction task; using our person example, we may want to predict whether they like basketball. In classification, the label is discrete (for example, 'likes basketball' or 'dislikes'); in regression, the label is a real value (for example, predicting a person's height).
  • Feature selection refers to the process of choosing a smaller set of features from a larger set, either before applying a machine learning method or as part of training. Examples include removing uninformative features (features with little to no variance) or correlated features.
  • Sensitivity (also known as recall) refers to the fraction of positive examples identified; it is calculated by taking the number of positive predictions that are correct and dividing it by the total number of positive examples.
  • Precision refers to the fraction of positive predictions that are correct; it is calculated by the number of positive predictions that are correct divided by the total number of positive predictions.
  • The Precision-recall curve is a curve that plots precision (y-axis) against recall (x-axis) at various classification thresholds and is used for binary classifiers.
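As a small worked example of the sensitivity and precision definitions above, the following base R sketch uses made-up labels and predictions for a binary task.

```r
# Made-up true labels and predictions for a binary task (1 = positive)
truth     <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

tp <- sum(predicted == 1 & truth == 1)  # positive predictions that are correct
fp <- sum(predicted == 1 & truth == 0)  # positive predictions that are wrong
fn <- sum(predicted == 0 & truth == 1)  # positive examples that were missed

sensitivity <- tp / (tp + fn)  # correct positives / all positive examples
precision   <- tp / (tp + fp)  # correct positives / all positive predictions
```

With these vectors there are three true positives, one false positive and one false negative, so both sensitivity and precision come out at 0.75.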

Random Forest in more detail

The main idea behind the Random Forest approach was outlined at the start; this section provides a bit more detail.

The name Random Forest is apt because it describes the method: a forest is made up of trees, and these trees are randomly built. For a given dataset, a random sample (drawn by bootstrap resampling, i.e. sampling cases with replacement) is used to build a decision tree; because cases are drawn with replacement, each sample contains roughly two-thirds (about 63%) of the distinct cases in the dataset. The process is repeated to create a second random sample, which is used to build a second decision tree. Since the samples are random, the predictions made by the second tree could differ from those of the first. In the end, we have hundreds of trees (a forest), each built from a slightly different subset of the dataset and each generating (potentially) different predictions.
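A quick simulation shows how many distinct cases end up in a bootstrap sample of the same size as the dataset. This is a base R sketch on case indices only; the dataset size of 1,000 is an arbitrary choice.

```r
set.seed(42)
N <- 1000

# Bootstrap sample: draw N case indices with replacement
in_bag <- sample(N, size = N, replace = TRUE)

# Cases that were never drawn are out of bag for this tree
oob <- setdiff(seq_len(N), in_bag)

# Fraction of distinct cases that made it into the sample
frac_in_bag <- length(unique(in_bag)) / N

# Theoretical value: 1 - (1 - 1/N)^N, which approaches 1 - exp(-1)
expected <- 1 - (1 - 1 / N)^N
```

Both frac_in_bag and expected should be close to 0.63, which leaves a little over a third of the cases out of bag for any single tree.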

To add more randomness to the trees, only a subset of the features/predictors/variables is considered when producing a split in a decision tree. For example, if there were 100 predictors, a random subset of 10 would be considered at each node to define the best split, instead of the full set of 100. Note that a new random subset of predictors is drawn at each node; this is different from selecting one random subset of predictors and using it to build the entire tree. The number of predictors to consider at each node is a key parameter and it is recommended that empirical tests (i.e. model tuning) be conducted to find the best value; the square root of the number of available predictors is usually recommended as a good starting point. Finally, the separate predictions made by the individual trees are combined: by majority voting for classification, or by averaging for regression.
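The per-node feature sampling can be sketched as follows. This is a simplified illustration on the iris dataset, not the randomForest implementation: the helper names (gini, split_impurity, best_split) are hypothetical, and the Gini impurity is assumed as the split criterion.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted Gini impurity after splitting on x <= threshold
split_impurity <- function(x, y, threshold) {
  left  <- y[x <= threshold]
  right <- y[x > threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Best threshold for a single feature
best_split <- function(x, y) {
  # candidate thresholds: every observed value except the largest
  thresholds <- head(sort(unique(x)), -1)
  impurities <- sapply(thresholds, function(t) split_impurity(x, y, t))
  list(threshold = thresholds[which.min(impurities)],
       impurity  = min(impurities))
}

set.seed(1)
features <- iris[, 1:4]
mtry <- floor(sqrt(ncol(features)))  # square-root rule

# At a node: draw a fresh random subset of features, then search only those
candidates <- sample(names(features), mtry)
splits     <- lapply(features[candidates], best_split, y = iris$Species)
impurities <- sapply(splits, function(s) s$impurity)
chosen     <- names(impurities)[which.min(impurities)]
```

Repeating the sample() call at the next node would generally offer a different pair of candidate features, which is what distinguishes this scheme from fixing one subset per tree.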

Typically in machine learning/predictive modelling, a subset of data is "held out" and used for model validation since this holdout data was not used in training the model. Each tree in the random forest was built using a random subset and thus we automatically have holdout data for that particular tree; this is known as "Out Of Bag" (OOB) data. Every case in the full dataset will be in bag for some trees and out of bag for other trees; this can be used to evaluate the Random Forest classifier. For example, if a particular case, x, was used in 250 trees and not used in another 250 trees, we can apply x to the trees that did not use it for training. Since x was never used to generate any of the 250 trees, the result provides an assessment of the reliability of the Random Forest classifier. This can be carried out across all the cases in the dataset. Due to this OOB feature in the Random Forest algorithm, we do not need to create an additional holdout or testing dataset as pointed out in the paper describing the Random Forest approach:

Therefore, using the out-of-bag estimate removes the need for a set aside test set.
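The out-of-bag bookkeeping can be illustrated with a simulation. In this base R sketch the per-tree predictions are a stand-in (each simulated "tree" is simply correct 70% of the time), so only the bookkeeping is realistic; the names in_bag and tree_pred are hypothetical.

```r
set.seed(7)
n_cases <- 200
n_trees <- 500
truth <- sample(c(0, 1), n_cases, replace = TRUE)

# in_bag[i, t] is TRUE if case i was drawn into the bootstrap sample of tree t
in_bag <- sapply(seq_len(n_trees), function(t) {
  seq_len(n_cases) %in% sample(n_cases, replace = TRUE)
})

# Stand-in for real tree predictions: each "tree" is right 70% of the time
tree_pred <- sapply(seq_len(n_trees), function(t) {
  ifelse(runif(n_cases) < 0.7, truth, 1 - truth)
})

# For each case, take a majority vote over the trees that did NOT train on it
oob_pred <- sapply(seq_len(n_cases), function(i) {
  oob_votes <- tree_pred[i, !in_bag[i, ]]
  as.numeric(mean(oob_votes) > 0.5)
})

oob_error    <- mean(oob_pred != truth)  # OOB estimate of the error rate
oob_fraction <- mean(!in_bag)            # share of (case, tree) pairs that are OOB
```

Each case is out of bag for roughly 37% of the trees, so every case receives a prediction from a sizeable sub-forest that never saw it.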

The Random Forest method also provides a measure of how close each case is to another case in the dataset, which is known as the "proximity". The proximity of two cases is calculated by dropping the pair of records down each individual tree in a Random Forest, counting the number of times the two cases end up at the same terminal node, and dividing by the number of trees tested. By carrying out this step across all pairs of cases, a proximity matrix can be constructed. The proximity matrix provides a measure of how similar any two cases are in a dataset and can be used with hierarchical clustering to examine the underlying structure of a dataset.
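The proximity calculation can be sketched directly from a table of terminal node IDs. The node IDs below are made up for four cases and three trees; in practice they would come from a fitted forest.

```r
# Made-up terminal node IDs: rows are cases, columns are trees
nodes <- matrix(
  c(1, 1, 2, 2,   # tree 1
    3, 3, 3, 4,   # tree 2
    5, 6, 6, 6),  # tree 3
  nrow = 4,
  dimnames = list(paste0("case", 1:4), paste0("tree", 1:3))
)

n_cases <- nrow(nodes)
proximity <- matrix(0, n_cases, n_cases,
                    dimnames = list(rownames(nodes), rownames(nodes)))
for (i in seq_len(n_cases)) {
  for (j in seq_len(n_cases)) {
    # fraction of trees in which case i and case j share a terminal node
    proximity[i, j] <- mean(nodes[i, ] == nodes[j, ])
  }
}

# 1 - proximity behaves like a distance, so it can feed hierarchical clustering
hc <- hclust(as.dist(1 - proximity))
```

In this toy example case1 and case2 share a terminal node in two of the three trees, giving a proximity of 2/3, while case1 and case4 never end up together and have a proximity of 0.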

Another useful feature of the Random Forest method is its estimation of relative predictor importance. The method is based on measuring the effect on the classifier if one of the predictors was removed. This is performed by randomly scrambling the values associated with a given predictor; the scrambling is done by moving values from one row to another. The scrambling is performed one predictor at a time (the data is re-scrambled for each predictor being tested) and predictive accuracy is measured to obtain an estimate of relative predictor importance. If the performance of the classifier is unaffected by the scrambling, then the predictor is relatively unimportant. In contrast, if the scrambling causes a decrease in performance, then the predictor is relatively important.
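The scrambling idea is not specific to Random Forest. The sketch below applies it to a linear model on the mtcars dataset purely as a stand-in, and measures the increase in mean squared error rather than a drop in classification accuracy; note that randomForest evaluates the scrambled data on out-of-bag cases, which this simplified version does not do.

```r
set.seed(123)

# lm() is a stand-in model here; the idea carries over to any fitted model
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
mse <- function(model, data) {
  mean((data$mpg - predict(model, newdata = data))^2)
}
baseline <- mse(fit, mtcars)

predictors <- c("wt", "hp", "qsec")
importance <- sapply(predictors, function(p) {
  # average over several shuffles to reduce noise
  mean(replicate(50, {
    scrambled <- mtcars
    scrambled[[p]] <- sample(scrambled[[p]])  # shuffle one column only
    mse(fit, scrambled) - baseline            # increase in error
  }))
})

sort(importance, decreasing = TRUE)
```

A predictor whose scrambling barely changes the error is relatively unimportant to the fitted model; a large increase marks an important one.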

An example classifying wines

Refer to the R Markdown file, random_forest.Rmd, for more information.


# preparing the data
data_url <- ''
df <- read.table(file=url(data_url), header=FALSE, sep=",")
header <- c('class',
names(df) <- header
df$class <- as.factor(df$class)

# analysis
# install if necessary
# install.packages("randomForest")
library(randomForest)

my_sample <- sort(sample(x = 1:nrow(df), replace = FALSE, size = nrow(df)/2))
my_sample_comp <- setdiff(1:nrow(df), my_sample)

test <- df[my_sample, ]
train <- df[my_sample_comp, ]

r <- randomForest(class ~ ., data=train, importance=TRUE, do.trace=100)

# plots
# install if necessary
# install.packages("ggplot2")
library(ggplot2)
class_1_importance <- data.frame(feature=names(r$importance[,1]), importance=r$importance[,1])
ggplot(class_1_importance, aes(x=feature, y=importance)) + geom_bar(stat="identity")

class_2_importance <- data.frame(feature=names(r$importance[,2]), importance=r$importance[,2])
ggplot(class_2_importance, aes(x=feature, y=importance)) + geom_bar(stat="identity")

class_3_importance <- data.frame(feature=names(r$importance[,3]), importance=r$importance[,3])
ggplot(class_3_importance, aes(x=feature, y=importance)) + geom_bar(stat="identity")

boxplot(df$colour ~ df$class, main="Colour by class")
boxplot(df$alcohol ~ df$class, main="Alcohol by class")

ggplot(df, aes(x=alcohol, y=colour, colour=class)) + geom_point()

Scatter plot of alcohol versus colour by class

Predictor importance.


Variable Importance Plot


Random Forest can be used for regression tasks, i.e. making numerical predictions, as well as classification. The example below uses the airquality dataset that comes with R.


head(airquality)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6

# following the example from ?randomForest
ozone.rf <- randomForest(Ozone ~ ., data=airquality, mtry=3, importance=TRUE, na.action=na.omit)
# Call:
#  randomForest(formula = Ozone ~ ., data = airquality, mtry = 3,      importance = TRUE, na.action = na.omit) 
#                Type of random forest: regression
#                      Number of trees: 500
# No. of variables tried at each split: 3
#           Mean of squared residuals: 303.8304
#                     % Var explained: 72.31

# predicted values
predicted <- ozone.rf$predicted

# original values
# not all the original values were predicted
# hence we need to subset based on what was predicted
original  <- airquality$Ozone[sort(as.numeric(names(ozone.rf$predicted)))]

# fit a linear model
# Y ~ X, where Y is the dependent variable and X is the independent variable
fit       <- lm(predicted~original)
# Call:
# lm(formula = predicted ~ original)
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -51.512  -7.919  -1.074   6.660  54.777 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  13.1775     2.2126   5.956 3.22e-08 ***
# original      0.6988     0.0413  16.919  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 14.41 on 109 degrees of freedom
# Multiple R-squared:  0.7242,	Adjusted R-squared:  0.7217 
# F-statistic: 286.3 on 1 and 109 DF,  p-value: < 2.2e-16

plot(original, predicted, pch=19)
abline(fit, col=2)

Random Forest regression
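The "% Var explained" figure in the printed model summary can be related to the mean of squared residuals. The sketch below assumes it is a pseudo R-squared of the form 1 - MSE / Var(y), with the variance computed using a divisor of n; the MSE value is copied from the output above rather than recomputed.

```r
# Complete cases only, mirroring na.action = na.omit
y <- na.omit(airquality)$Ozone
n <- length(y)

# Mean of squared residuals, copied from the printed model summary
mse <- 303.8304

# Pseudo R-squared: 1 - MSE / variance of the response (divisor n, not n - 1)
var_y <- var(y) * (n - 1) / n
pct_var_explained <- 100 * (1 - mse / var_y)
round(pct_var_explained, 2)
```

Under these assumptions the result should land close to the percentage reported in the model summary.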

Random Forest on breast cancer data


data_url <- ''
df <- read.table(file=url(data_url), header=FALSE, sep=',')

colnames(df) <- c('sample_code_number',

# remove sample code number
df <- df[,-1]

# change the class
# 2 for benign will be 0 and
# 4 for malignant will be 1
df$class <- factor(ifelse(df$class==4, 1, 0))

r <- randomForest(class ~ ., data=df, importance=TRUE, proximity=TRUE)

# Call:
#  randomForest(formula = class ~ ., data = df, importance = TRUE,      proximity = TRUE, do.trace = 100) 
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 3
#         OOB estimate of  error rate: 3.15%
# Confusion matrix:
#     0   1 class.error
# 0 443  15  0.03275109
# 1   7 234  0.02904564


# use votes, which are the fraction of (OOB) votes from the random forest
# in the first row, all trees voted for class 0, which is benign
#           0          1
# 1 1.0000000 0.00000000
# 2 0.1222222 0.87777778
# 3 1.0000000 0.00000000
# 4 0.1123596 0.88764045
# 5 0.9940828 0.00591716
# 6 0.0000000 1.00000000

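# install if necessary
# install.packages("ROCR")
library(ROCR)

pred <- prediction(r$votes[,2], as.numeric(df$class)-1)
perf <- performance(pred,"tpr","fpr")
# draw the ROC curve before adding the legend
plot(perf)

# Area under the curve
auc <- performance(pred, measure = "auc")

legend('bottomright', legend = paste("AUC = ", auc@y.values[[1]]))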

ROC curve using ROCR
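To make the ROC curve and the AUC less of a black box, both can be computed by hand. This base R sketch uses made-up scores (think of them as the fraction of trees voting for class 1) and made-up labels, and obtains the AUC through the rank-sum (Mann-Whitney) identity.

```r
# Made-up scores and true labels (1 = positive)
scores <- c(0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10)
labels <- c(1,    1,    0,    1,    1,    0,    0,    0)

# TPR and FPR at every threshold, from the highest score downwards
ord <- order(scores, decreasing = TRUE)
tpr <- cumsum(labels[ord] == 1) / sum(labels == 1)
fpr <- cumsum(labels[ord] == 0) / sum(labels == 0)

# AUC: the probability that a random positive outranks a random negative
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
```

Of the 16 positive-negative pairs in this toy example, the positive case has the higher score in 14, so auc is 14/16 = 0.875.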


Proximity is calculated by tallying the number of times a pair of records ends up at the same terminal node for each tree.


data_url <- ''
df <- read.table(file=url(data_url), header=FALSE, sep=',')
colnames(df) <- c('class', 'xbox', 'ybox', 'width', 'high', 'onpix', 'xbar', 'ybar', 'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'xege', 'xegvy', 'yege', 'yegvx')
# the response must be a factor for classification
df$class <- factor(df$class)
r <- randomForest(class ~ ., data=df, importance=TRUE, proximity=TRUE)

# proximity matrix
mat <- r$proximity
# name the rows and columns
rownames(mat) <- df$class
colnames(mat) <- df$class

# use the which() function to calculate a
# matrix of which elements in the proximity matrix
# are greater than 0.5 but not 1
w <- which(mat > 0.5 & mat != 1, arr.ind = TRUE)

# function to retrieve row and column names
# from the w matrix
get_row_col <- function(x, m){
  my_row <- rownames(m)[x[1]]
  my_col <- colnames(m)[x[2]]
  paste(my_row, my_col, sep = ':')
}

table(unname(apply(w, 1, get_row_col, m=mat)))

#    A:A    A:G    B:B    B:H    C:C    D:D    D:H    D:O    E:E    E:G    E:K    E:L    E:S    E:Z    F:F    F:P    G:A 
# 191944      1  14124      1  26514  10512     18      2  28560      4      1      2      1      1  18276     13      1 
#    G:E    G:G    H:B    H:D    H:H    H:K    I:I    I:J    J:I    J:J    K:E    K:H    K:K    K:R    K:X    L:E    L:L 
#      4  10492      1     18  26694      4  81548     18     18  61394      1      4  13878      2      1      2 109546 
#    M:M    M:V    N:N    O:D    O:O    O:Q    P:F    P:P    Q:O    Q:Q    Q:Z    R:K    R:R    S:E    S:S    T:T    T:Y 
#  49488      3  41938      2  12528      7     13  37994      7   6286      1      2  13730      1  10212  42168     13 
#    U:U    V:M    V:V    V:W    V:Y    W:V    W:W    X:K    X:X    Y:T    Y:V    Y:Y    Z:E    Z:Q    Z:Z 
#  77194      3  48160      7    103      7  52370      1  18088     13    103  23840      1      1  55088

Out of all the mismatches, V's and Y's tended to end up together the most often.


There is a combine function in the randomForest package that combines two or more ensembles of trees into one. Therefore, we can train ensembles in parallel and combine them! This is useful for large datasets or for testing different parameters.
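The reason separately trained ensembles can be merged is that a forest's prediction only depends on the pooled votes of its trees. The sketch below illustrates this with made-up vote counts from two hypothetical ensembles of 200 trees each; it does not call combine itself.

```r
# Made-up vote counts (rows: cases, columns: classes) from two ensembles
votes_a <- matrix(c(150,  50,
                     90, 110,
                     20, 180),
                  ncol = 2, byrow = TRUE, dimnames = list(NULL, c("0", "1")))
votes_b <- matrix(c(140,  60,
                    120,  80,
                     30, 170),
                  ncol = 2, byrow = TRUE, dimnames = list(NULL, c("0", "1")))

# Pooling the trees of both ensembles amounts to adding up their votes
pooled <- votes_a + votes_b

# Majority vote over all 400 trees
pooled_class <- colnames(pooled)[apply(pooled, 1, which.max)]
```

The second case shows why pooling matters: the two ensembles disagree on it, and the combined vote (210 versus 190) settles on class 0.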

Install the required packages if missing and then load them.

my_packages <- c('randomForest', 'foreach', 'doParallel', 'parallel', 'doRNG')

for (my_package in my_packages){
   if(!require(my_package, character.only = TRUE)){
      install.packages(my_package, '/packages')
      library(my_package, character.only = TRUE)
   }
}

Load data

We will use the Letter Recognition Data Set which has 20,000 cases, 16 features, and 26 labels.

my_url <- ''
my_df <- read.table(file=url(my_url), header=FALSE, sep=',')
colnames(my_df) <- c('class', 'xbox', 'ybox', 'width', 'high', 'onpix', 'xbar', 'ybar', 'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'xege', 'xegvy', 'yege', 'yegvx')
my_df$class <- factor(my_df$class, levels = LETTERS)

## [1] 20000    17
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T 
## 789 766 736 805 768 775 773 734 755 747 739 761 792 783 753 803 783 758 748 796 
##   U   V   W   X   Y   Z 
## 813 764 752 787 786 734


Write a function to calculate metrics.

metric <- function(rf, type = "accuracy"){
   # not used for now
   # tab <- table(rf$y, rf$predicted)
   # accuracy <- (TP + TN) / (TP + FN + FP + TN)
   # precision <- TP / (TP + FP)
   # recall <- TP / (TP + FN)
   # specificity <- TN / (TN + FP)
   if(type == "accuracy"){
      return(sum(rf$predicted == rf$y) / length(rf$y))
   } else {
      # the original else-branch is not shown; failing loudly is an assumption
      stop("unsupported metric type: ", type)
   }
}

Train random forests model with 2,000 trees without parallelisation.

my_time <- system.time(
   my_rf <- randomForest(
      class ~ .,
      data = my_df,
      ntree = 2000
   )
)

my_time
##    user  system elapsed 
##  80.772   6.143  87.121
metric(my_rf)
## [1] 0.96935

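Train random forests with 2,000 trees in parallel by growing 10 forests of 200 trees each and merging them with combine. Note the line of code registerDoRNG(seed = 1984): it makes the random number streams used by the workers reproducible, so that repeated parallel runs train the same model.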

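cl <- makeCluster(10)
# register the cluster as the foreach backend before wrapping it with doRNG
registerDoParallel(cl)
registerDoRNG(seed = 1984)

my_time_par <- system.time(
   my_rf_par <- foreach(
      ntree = rep(200, 10),
      .combine = combine,
      .packages = 'randomForest'
   ) %dopar% {
      randomForest(class ~ ., data = my_df, ntree=ntree)
   }
)

# release the worker processes
stopCluster(cl)

my_time_par
##    user  system elapsed 
##   5.357   4.252  19.719
metric(my_rf_par)
## [1] 0.9696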

Set .multicombine = TRUE to increase the speed-up further. As per the documentation, the .multicombine argument is a:

logical flag indicating whether the .combine function can accept more than two arguments. If an arbitrary .combine function is specified, by default, that function will always be called with two arguments. If it can take more than two arguments, then setting .multicombine to TRUE could improve the performance. The default value is FALSE unless the .combine function is cbind, rbind, or c, which are known to take more than two arguments.

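cl <- makeCluster(10)
# register the cluster as the foreach backend before wrapping it with doRNG
registerDoParallel(cl)
registerDoRNG(seed = 1984)

my_time_par_mc <- system.time(
   my_rf_par_mc <- foreach(
      ntree = rep(200, 10),
      .combine = combine,
      .multicombine = TRUE,
      .packages = 'randomForest'
   ) %dopar% {
      randomForest(class ~ ., data = my_df, ntree=ntree)
   }
)

# release the worker processes
stopCluster(cl)

my_time_par_mc
##    user  system elapsed 
##   1.551   1.084  12.873
metric(my_rf_par_mc)
## [1] 0.9696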

As noted in the documentation in the combine section:

The confusion, err.rate, mse and rsq components (as well as the corresponding components in the test compnent, if exist) of the combined object will be NULL.

But we can calculate those ourselves, if we want.

# confusion matrix, not run
# table(my_df$class, my_rf_par_mc$predicted)

setdiff(names(my_rf), names(my_rf_par_mc))
## [1] "err.rate"  "confusion"

Most predictions are the same between the model trained without and with parallelisation.

table(my_rf$predicted == my_rf_par_mc$predicted)
## 
## FALSE  TRUE 
##    98 19902


Check accuracy as a function of the number of trees.

my_models <- list()
my_time_ntree <- system.time(
   for(n in seq(from = 500, to = 2000, by = 500)){
      my_rf <- randomForest(
         class ~ .,
         data = my_df,
         ntree = n
      )
      my_models[[paste0("ntree_", n)]] <- my_rf
   }
)

my_time_ntree
##    user  system elapsed 
## 199.319   7.435 207.239
sort(sapply(my_models, metric))
##  ntree_500 ntree_2000 ntree_1000 ntree_1500 
##    0.96925    0.96935    0.96945    0.96955

Check accuracy as a function of the number of trees in parallel.

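cl <- makeCluster(4)
# register the cluster as the foreach backend before wrapping it with doRNG
registerDoParallel(cl)
registerDoRNG(seed = 1984)

my_time_ntree_par <- system.time(
   my_models_par <- foreach(
      ntree = seq(from = 500, to = 2000, by = 500),
      .packages = 'randomForest'
   ) %dopar% {
      randomForest(class ~ ., data = my_df, ntree = ntree)
   }
)

# release the worker processes
stopCluster(cl)

my_time_ntree_par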
##    user  system elapsed 
##   1.136   0.784  93.234
sapply(my_models_par, metric)
## [1] 0.96825 0.97000 0.97000 0.97040

Check accuracy as a function of the number of trees and features in parallel.

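cl <- makeCluster(32)
# register the cluster as the foreach backend before wrapping it with doRNG
registerDoParallel(cl)
registerDoRNG(seed = 1984)

ntrees <- seq(from = 500, to = 2000, by = 500)
# mtry cannot exceed the number of predictors, so exclude the class column
mtrys <- 1:(ncol(my_df) - 1)
my_grid <- expand.grid(ntrees, mtrys)

my_time_ntree_mtry_par <- system.time(
   my_models_mtry_par <- foreach(ntree = my_grid$Var1, mtry = my_grid$Var2, .packages = 'randomForest') %dopar% {
      randomForest(class ~ ., data = my_df, ntree = ntree, mtry = mtry)
   }
)

# release the worker processes
stopCluster(cl)

my_time_ntree_mtry_par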
##    user  system elapsed 
##  17.752  20.017 240.348
my_grid$accuracy <- sapply(my_models_mtry_par, metric)

head(my_grid[order(my_grid$accuracy, decreasing = TRUE), ])
##    Var1 Var2 accuracy
## 8  2000    2  0.97140
## 12 2000    3  0.97125
## 11 1500    3  0.97050
## 7  1500    2  0.97040
## 10 1000    3  0.97040
## 15 1500    4  0.96995
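The grid bookkeeping can be tried out without training any models. In this sketch the score function is a hypothetical stand-in for "train a model and return its accuracy", and the grid columns are named so that they read as ntree and mtry rather than Var1 and Var2.

```r
# Named arguments give the grid readable column names
grid <- expand.grid(ntree = c(500, 1000), mtry = 1:3)

# Hypothetical scoring function standing in for model training
score <- function(ntree, mtry) 0.9 + 0.01 * mtry + ntree / 1e5

# Evaluate every (ntree, mtry) combination, one grid row at a time
grid$accuracy <- mapply(score, grid$ntree, grid$mtry)

# Row with the highest score
best <- grid[which.max(grid$accuracy), ]
```

Because expand.grid varies its first argument fastest, iterating over the two columns in parallel visits every combination exactly once.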

Session info

Time built.

## [1] "2022-11-17 04:51:48 UTC"

Session info.

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## other attached packages:
## [1] doRNG_1.8.2          rngtools_1.5.2       doParallel_1.0.17   
## [4] iterators_1.0.14     foreach_1.5.2        randomForest_4.7-1.1
## loaded via a namespace (and not attached):
##  [1] codetools_0.2-18 digest_0.6.30    magrittr_2.0.3   evaluate_0.17   
##  [5] rlang_1.0.6      stringi_1.7.8    cli_3.4.1        rstudioapi_0.14 
##  [9] rmarkdown_2.17   tools_4.2.0      stringr_1.4.1    xfun_0.34       
## [13] yaml_2.3.6       fastmap_1.1.0    compiler_4.2.0   htmltools_0.5.3 
## [17] knitr_1.40

Technical points

The question poses a scenario where there are 1,000 correlated features and one uncorrelated feature, all with the same predictive power. The potential problem is that the uncorrelated feature may rarely be used, because it is unlikely to be among the features sampled at any given node; this also highlights the importance of setting the number of features to try at each node. While the random selection of features helps mitigate the issue of multi-collinearity, highly correlated features should still be removed.
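A back-of-the-envelope calculation shows how rarely a single feature is offered as a split candidate in this scenario. The sketch assumes the square-root rule for the number of features tried per node and treats nodes as independent draws.

```r
M <- 1001               # 1,000 correlated features plus one uncorrelated feature
mtry <- floor(sqrt(M))  # candidates per node under the square-root rule

# Chance that one specific feature is among the candidates at a single node
p_node <- mtry / M

# Chance that it is offered at least once across k independent nodes
p_at_least_once <- function(k) 1 - (1 - p_node)^k
p_at_least_once(c(1, 10, 100))
```

At any single node the uncorrelated feature has only about a 3% chance of being considered, but across many nodes and many trees it will almost certainly be offered at some point.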

The short answer is no. Initially I thought we had to tune the number of trees to use, but since trees are grown to the maximum possible node depth, this helps ensure that enough features are used.


