giuseppec / iml

iml: interpretable machine learning R package

Home Page: https://giuseppec.github.io/iml/

License: Other

R 68.44% Makefile 0.11% Jupyter Notebook 29.67% TeX 1.78%

iml's Introduction


iml

iml is an R package that interprets the behavior and explains the predictions of machine learning models. It implements model-agnostic interpretability methods, meaning they can be used with any machine learning model.

Features

  • Feature importance
  • Partial dependence plots
  • Individual conditional expectation plots (ICE)
  • Accumulated local effects
  • Tree surrogate
  • LocalModel: Local Interpretable Model-agnostic Explanations
  • Shapley value for explaining single predictions

Read more about the methods in the Interpretable Machine Learning book.

Tutorial

Start an interactive notebook tutorial by clicking the Binder badge.

Installation

The package can be installed directly from CRAN and the development version from GitHub:

# Stable version
install.packages("iml")

# Development version
remotes::install_github("giuseppec/iml")

News

Changes to the package are documented in the NEWS file.

Quickstart

First, we train a random forest to predict the Boston median housing value. How does lstat influence the prediction individually and on average (accumulated local effects)?

library("iml")
library("randomForest")
data("Boston", package = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 50)
X = Boston[which(names(Boston) != "medv")]
model = Predictor$new(rf, data = X, y = Boston$medv)
effect = FeatureEffects$new(model)
effect$plot(features = c("lstat", "age", "rm"))
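
The other methods listed under Features follow the same pattern. As a short sketch (reusing the model object from above; the loss and the choice of observation are illustrative), permutation feature importance and a Shapley explanation for a single prediction:

# Permutation feature importance, measured as the increase in mean absolute error
imp = FeatureImp$new(model, loss = "mae")
plot(imp)

# Shapley values explaining the prediction for the first observation
shapley = Shapley$new(model, x.interest = X[1, ])
plot(shapley)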

Contribute

Please check the contribution guidelines.

Citation

If you use iml in a scientific publication, please cite it as:

Molnar, Christoph, Giuseppe Casalicchio, and Bernd Bischl. "iml: An R package for interpretable machine learning." Journal of Open Source Software 3.26 (2018): 786.

BibTeX:

@article{molnar2018iml,
  title={iml: An R package for interpretable machine learning},
  author={Molnar, Christoph and Casalicchio, Giuseppe and Bischl, Bernd},
  journal={Journal of Open Source Software},
  volume={3},
  number={26},
  pages={786},
  year={2018}
}

License

© 2018 - 2022 Christoph Molnar

The contents of this repository are distributed under the MIT license. See below for details:

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Funding

This work is funded by the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B).

iml's People

Contributors

aiwalter, christophm, g-rho, github-actions[bot], giuseppec, juliabrosig, markroepke, mirka-henninger, pat-s, pfistfl, qwertz11


iml's Issues

Combine TreeSurrogate and LocalModel into SurrogateModel

Create a new class SurrogateModel which allows both the estimation of a global surrogate model and a local model.
Maybe TreeSurrogate and LocalModel can also be child classes of SurrogateModel.

If no x.interest is provided, automatically estimate a global model. Allow weighting also for the global model.
If x.interest is provided, estimate a local model.

Allow both decision trees and linear models for the surrogate. Ideally, make it completely flexible and possible for the user to select a model.

Question: Should it be allowed to provide multiple x.interest values and return multiple models?
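
A minimal sketch of how the proposed interface might look (hypothetical: SurrogateModel, its learner argument, and the predictor/X objects are assumptions, not part of iml):

# global surrogate: no x.interest supplied
surrogate = SurrogateModel$new(predictor, learner = "tree")

# local surrogate: x.interest supplied, estimation weighted around that point
surrogate = SurrogateModel$new(predictor, learner = "lm", x.interest = X[1, ])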

Automatically detect data and y from formula in Predictor$new

I think for convenience it would be nice not to have to specify data and y in Predictor$new, because I already specified them in the model via the formula argument. I know this is not the case for all models, but it is for many.

In case you do want to implement this, here is a reprex showing how it could be done, based on your example from the vignette:

set.seed(42)
library("iml")
library("randomForest")
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
data("Boston", package = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 50)

X = Boston[which(names(Boston) != "medv")]
Y = Boston$medv

library("Formula")
fml = as.Formula(formula(rf))
formula(fml, lhs = 1, rhs = 0)
#> medv ~ 0
#> <environment: 0x5596bece47d0>
formula(fml, lhs = 0, rhs = 1)
#> ~crim + zn + indus + chas + nox + rm + age + dis + rad + tax + 
#>     ptratio + black + lstat
#> <environment: 0x5596bece47d0>
y = model.part(fml, lhs = 1, rhs = 0, data = Boston)
x = model.part(fml, lhs = 0, rhs = 1, data = Boston)

all.equal(Y, y[, ])
#> [1] TRUE
all.equal(X, x)
#> [1] TRUE

Incompatible with xgboost models trained on xgb.DMatrix

Is there a workaround to implement the Shapley function, for example, on xgboost models trained on xgb.DMatrix objects?

For example,

predictor <- Predictor$new(xgb.model, data = x, y = y,
                           predict.fun = function(object, newdata){
                             predict(object, newdata )})
shapley   <- Shapley$new(predictor, x.interest = x[1,], sample.size = 10, run = TRUE)

results in an error:

Error in xgb.DMatrix(newdata, missing = missing) : 
  xgb.DMatrix: does not support to construct from  list
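
One possible workaround (a sketch, not verified against this model): convert newdata to an xgb.DMatrix inside the custom predict function, since iml passes a data.frame to predict.fun:

predictor <- Predictor$new(xgb.model, data = x, y = y,
                           predict.fun = function(object, newdata) {
                             # newdata arrives as a data.frame; xgboost expects an xgb.DMatrix
                             predict(object, xgboost::xgb.DMatrix(as.matrix(newdata)))
                           })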

target variable included in feature effect

mpg shows a feature effect here; I think it should not.

library(iml)
library(mlr)
data(mtcars)
# ----------------------
tsk = makeRegrTask(data = mtcars, target = "mpg")
lrn = makeLearner("regr.lm")
mod = train(lrn, tsk)

# `prd` was not defined in the original report; presumably a Predictor built
# from the trained model and the full mtcars data (including the target mpg):
prd = Predictor$new(mod, data = mtcars)
lm = LocalModel$new(prd, mtcars[2, ])
lm

Add r.squared measure in LocalModel

The same as for TreeSurrogate, but the R squared should probably be weighted in the same way as the local model itself.

Maybe it makes sense to have two measures

  • weighted R squared
  • global, unweighted R squared
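
A minimal sketch of what the weighted variant could look like, assuming y are the black-box predictions, y_hat the surrogate predictions, and w the proximity weights used by LocalModel (all names illustrative):

weighted_rsq = function(y, y_hat, w) {
  # weighted residual and total sum of squares
  ss_res = sum(w * (y - y_hat)^2)
  ss_tot = sum(w * (y - weighted.mean(y, w))^2)
  1 - ss_res / ss_tot
}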

Add R squared measure for TreeSurrogate

Implement an R squared measure for the variance explained by the tree surrogate model. Maybe also add a measure for the performance difference on the real dataset when the user has submitted y.

The R squared should be a public variable of the TreeSurrogate class.

Bug in FeatureEffect with ordered factors

Hi there,

Thanks for releasing this package. It's really great!

Just found a bug when using the FeatureEffect function. After some digging into the source code, I've found that the issue is in the order_levels function in utils.R.

Specifically, on line 240 there is the following if statement: if(class(feature.x) == "factor"). This fails when one of the columns is an ordered factor, which results in the quantile function being evaluated on line 244, producing the following error and warning:

Error in quantile.default(feature.x, probs = seq(0, 1, length.out = 100),  : 
  'type' must be 1 or 3 for ordered factors
In addition: Warning message:
In if (class(feature.x) == "factor") { :
  the condition has length > 1 and only the first element will be used

I've provided a simple reproducible example below:

library(iml)
library(data.table)

rm(list = ls())

data(mtcars)

## make them factors
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear, ordered = TRUE)
mtcars$vs <- factor(mtcars$vs)

frm <- formula('vs ~ mpg + cyl + disp + gear + wt + carb')
rfFit <- randomForest::randomForest(frm, data = mtcars)

modelPredictorsDf = mtcars[, c("mpg", "cyl", "disp", "gear", "wt", "carb")]
predictor = Predictor$new(rfFit, data = modelPredictorsDf, y = mtcars$vs)
aleTest <- FeatureEffect$new(predictor, feature = "cyl")

A simple suggested fix might be to use if(any(class(feature.x) == "factor")) but will leave that to your more than capable hands.

Thanks.
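
For reference, a sketch of the more idiomatic check (an assumption about how order_levels could test the column type, not the actual fix in iml):

# inherits() is TRUE for factor and ordered-factor columns and always returns length 1
if (inherits(feature.x, "factor")) {
  # handle categorical feature
}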

Interaction results differ by an order of magnitude to interact.gbm results

I noticed that when I use the iml Interaction method, I get results that are about 10 times smaller than the results from gbm's interact.gbm function. Here is an example:

library(iml)
library(gbm)

# Compare interact.gbm with iml Interaction
mod <- gbm(Species ~ ., data = iris, interaction.depth = 4)
gbm.perf(mod)
int_gbm <- interact.gbm(mod, iris, i.var = c("Petal.Width", "Petal.Length"))
dotplot(int_gbm)
# versicolor interaction strength = 0.221

iml_mod <- Predictor$new(mod, iris, y = "Species", 
                         predict.fun = function(object, newdata){
                           predict(object, newdata, n.trees = 100)
                           })
iml_int <- Interaction$new(iml_mod, feature = "Petal.Width")
plot(iml_int)

# versicolor interaction strength = 0.029

Based on the documentation, they are both trying to calculate Friedman's H-statistic, so I am surprised that they are so different.
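
For reference, a rough sketch of the pairwise H-statistic from Friedman and Popescu's "Predictive Learning via Rule Ensembles", assuming pd_jk, pd_j, and pd_k are vectors of centered partial dependence values evaluated at the observed data points (names illustrative):

# squared H-statistic: variance not explained by the two main effects, relative to the joint effect
h2 = sum((pd_jk - pd_j - pd_k)^2) / sum(pd_jk^2)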

[JOSS] package outdated in Binder

In the script install.R, the package iml is installed from CRAN using install.packages(), which installs v0.5.1 of the package; however, the Jupyter notebook available via Binder includes an example with Interaction, which is not available in v0.5.1 of the package. This might prove confusing to new users, and may be resolved in one of two ways: (1) updating the CRAN version of iml to v0.5.2, or (2) installing the iml package from GitHub in install.R by including the following:

install.packages("remotes")
remotes::install_github("christophM/iml")

This is an outstanding item in the ongoing JOSS review, relating to the "Functionality" and "Example Usage" criteria. Please feel free to let me know if there are any questions about this.

Add documentation for running predictions in parallel

Would it be possible to have this package use a parallel backend when one is available? This would be really helpful when calculating interactions, since there are many independent calls to the predict method. The package pdp does this for partial dependence plots using the foreach package, and it is very helpful.
I have never worked with R6 objects before, so I haven't been able to understand the code well enough to know how this could be implemented, but I expect it would be fairly straightforward to replace an lapply or for loop with the parallel equivalent.
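
A minimal sketch of that kind of replacement, assuming a list of features and a per-feature computation (feature_list and compute_interaction are placeholders, not iml's actual internals), using the future.apply package:

library(future.apply)
plan(multisession, workers = 4)

# parallel drop-in replacement for lapply()
results = future_lapply(feature_list, function(feature) {
  compute_interaction(predictor, feature)
})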

Fix TravisCI

The problem is caused by failing libmagick installations; libmagick is needed for the lime package. Either remove lime or make Travis install libmagick.

Calculating variable importance needs a lot of memory

Currently trying with a dataset consisting of 1400 obs and 7k features.

Using the mlr procedure based on permutation works fine (one core and it takes a while).

var_imp = generateFeatureImportanceData(task, learner = lrn_xgboost, 
                                        measure = rmse, nmc = 2)

However, the iml approach quickly eats tons of RAM (> 200 GB).

Reprex:

library(mlr)
library(iml)
m = readRDS(url("http://pjs-web.de/files/xgboost.rda")) # 92 KB
task = readRDS(url("http://pjs-web.de/files/task.rda")) # 93 MB
X = task$env$data[which(names(task$env$data) != "defoliation")]
predictor = Predictor$new(m, data = X, y = task$env$data$defoliation)
imp = FeatureImp$new(predictor, loss = "rmse")

[JOSS] package licensing and contributions

As part of the ongoing JOSS review, this package is required to include a plain-text version of an approved open source license as well as guidelines for contributors.

License:
Due to CRAN's policies regarding the MIT license, the file LICENSE cannot actually include the text of the MIT license; however, it is still possible to include the text of this license in the package. For example, one might include the text somewhere in the file README.Rmd as is done, for example, here. This would then allow users to refer to the terms under which the iml package is distributed without actually looking up the text of the MIT license.

Contributions:
The package needs to include clear guidelines for interested contributors. This can take many forms --- a common solution is to include a new file CONTRIBUTING.md as is done, for example, here with an accompanying note in the README.Rmd. Of course, many other solutions are acceptable as well.

These are outstanding items with respect to the JOSS review, but can be resolved very quickly. Please let me know if there are any questions about this.

iml methods not working with xgboost

Minimal example:

library(mlr)
library(iml)

# files on slack!!
load("titanic.rda") 
source("mylearnerxgboost.R")

titanic$Survived = as.numeric(titanic$Survived)

task = makeRegrTask(data = titanic, target = "Survived")
lrn = makeLearner("regr.xgboost.mod")
lrn = makeLearner("regr.randomForest")
mod = train(lrn, task)

x = titanic[which(names(titanic) != "Survived")]
p = Predictor$new(mod, x, y = titanic$Survived)

pdp = FeatureEffect$new(p, "Sex", method = "pdp")
plot(pdp)

TreeSurrogate throws error when using a Predictor object with a y value

require("randomForest"))
# Fit a Random Forest on the Boston housing data set
data("Boston", package  = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 50)
# Create a model object
 mod = Predictor$new(rf, data = Boston, y = "medv") 

# Fit a decision tree as a surrogate for the whole random forest
dt = TreeSurrogate$new(mod)
# Error in eval(predvars, data, env) : object 'crim' not found

But this works:

require("randomForest"))
# Fit a Random Forest on the Boston housing data set
data("Boston", package  = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 50)
# Create a model object
 mod = Predictor$new(rf, data = Boston) 

# Fit a decision tree as a surrogate for the whole random forest
dt = TreeSurrogate$new(mod)

lime package vs iml

Hi Christoph

I wonder why the results obtained using the lime package and iml are different.

lime.explain = LocalModel$new(predictor, x.interest = X_test[1,], k = 5)
lime.explain$results
plot(lime.explain)

vs.

explainer <- lime(X_train, model)
explanation <- lime::explain(X_test[1,], explainer, n_features = 5)
plot_features(explanation)

Thanks

Unexpected results of function `FeatureImp()` for classification

The code is based on the examples in the documentation of the iml::FeatureImp() function. In the example, class = "virginica" was used. When I changed it to class = "versicolor", the results seemed correct. Unfortunately, the results for class = "setosa" were not what I expected and do not seem reasonable:

  1. See the plot: some lines are missing, x-axis limits are incorrect.
  2. In the data frame of results, original.error should be 0.6666666... (i.e., 50 cases of this class out of 150 in total), not 0.
  3. This leads to NaN and Inf values of importance.

Is this behavior expected or is it a bug?

library(rpart)
library(iml)

# FeatureImp also works with multiclass classification. 
# In this case, the importance measurement regards all classes
tree = rpart(Species ~ ., data = iris)
X = iris[-which(names(iris) == "Species")]
y = iris$Species
predict.fun = function(object, newdata) predict(object, newdata, type = "prob")

# For multiclass classification models, you can choose to only compute performance for one class. 
# Make sure to adapt y
mod = Predictor$new(tree, data = X, y = y == "setosa", 
  predict.fun = predict.fun, class = "setosa") 
imp = FeatureImp$new(mod, loss = "ce")
plot(imp)
#> Warning: Removed 3 rows containing missing values (geom_point).
#> Warning: Removed 3 rows containing missing values (geom_segment).

imp
#> Interpretation method:  FeatureImp 
#> error function: ce
#> 
#> Analysed predictor: 
#> Prediction task: unknown 
#> 
#> 
#> Analysed data:
#> Sampling from data.frame with 150 rows and 4 columns.
#> 
#> Head of results:
#>        feature original.error permutation.error importance
#> 1 Petal.Length              0         0.4266667        Inf
#> 2 Sepal.Length              0         0.0000000        NaN
#> 3  Sepal.Width              0         0.0000000        NaN
#> 4  Petal.Width              0         0.0000000        NaN

Created on 2018-07-10 by the reprex package (v0.2.0).

Show distribution of FeatureImp

It would be great if FeatureImp showed not just a single point per feature, but also the distribution of the feature importance values across permutations. A boxplot would be much more informative than the mean variable importance alone.
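
A rough sketch of the requested plot, assuming a data frame perm_imp with one row per feature and permutation and columns feature and importance (hypothetical; iml does not currently expose this):

library(ggplot2)
ggplot(perm_imp, aes(x = feature, y = importance)) +
  geom_boxplot() +
  coord_flip()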

Documentation for custom predict function

Hi Christoph,

The line in the docs for Predictor, for a custom predict function, is:

predict.fun: (function) The function to predict newdata. Only needed if model is not a model from mlr or caret package.

I think it should probably say that the function needs to have the signature predict(model, newdata). I couldn't find this in the docs and had to work this out from a tutorial online.

Similarly, I think it would be helpful if the options for class and type were listed, and even better if they were elaborated upon.

Thanks for creating such a great package!
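
For illustration, a short sketch of the expected shape of such a function (the type = "prob" argument is just an example for a model that predicts probabilities):

# takes the model object and newdata, returns a vector or data.frame of predictions
predict.fun = function(model, newdata) {
  predict(model, newdata, type = "prob")
}
predictor = Predictor$new(model, data = X, y = y, predict.fun = predict.fun)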

Make the DESCRIPTION file better

You have not linked to the issue tracker or the GitHub page in the respective DESCRIPTION fields.
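
For reference, DESCRIPTION has standard fields for this; assuming the repository and pkgdown site linked above, they would look like:

URL: https://github.com/giuseppec/iml, https://giuseppec.github.io/iml/
BugReports: https://github.com/giuseppec/iml/issues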

LocalModel: kernel not explained in docs

The docs make it sound like only (Gower) distances are calculated; you even say that there is then no kernel.width to tune. Nevertheless, kernel weighting still seems to be implemented. This needs to be documented under "Details".

Friedman's H-Statistic

I created a random forest model using the randomForest package. The training set I used to develop the random forest model has 500000 observations. I would like to observe two-way interactions using Friedman's H-statistic. Since my training set is large, I took a subset of it to calculate the statistic.
My code is as follows:

predictor = Predictor$new(rf, data=x_subset, y=y_subset)
interact = Interaction$new(predictor, feature = "x1")

In the above code, x_subset and y_subset are created from a subset of the training data.
But to my surprise, for some interactions I received values greater than 1. I thought the H-statistic should be between 0 and 1.

Implement feature interaction

Implement feature interaction measures from the paper "Predictive Learning via Rule Ensembles".

Implement test statistics for:

  • Interaction between chosen feature and rest of features
  • Interaction between two chosen features

Reuse the Partial class by calling the intervene and aggregation functions.

FeatureImp throws an error when using method='ranger' in caret::train

In caret::train, when method='ranger', FeatureImp throws the error message below:

library(tidyverse)
library(recipes)
library(rsample)
library(caret)
library(iml)
# data preparation ----
crd<- credit_data %>% rename_all(tolower)
spl<- initial_split(crd, prop=1/4)
tr<-training(spl)
tr<-recipe(status~.,data=tr) %>%
    step_meanimpute(all_numeric()) %>%
    step_modeimpute(all_nominal()) %>%
    step_center(all_numeric()) %>%
    step_scale(all_numeric()) %>%
    step_dummy(all_nominal(),-status) %>%
    prep(crd,retain=T) %>%
    juice()
# training----
rf1 <- train(status~., data=tr,method='rf',ntree=100,
                trControl=trainControl(method='cv',number=3))
rf2 <- train(status~., data=tr,method='ranger',num.trees=100,
                importance = 'impurity',
                trControl=trainControl(method='cv',number=3))
# Predictor ----
mod1 = Predictor$new(rf1, data=select(tr,-status), y=tr$status)
mod2 = Predictor$new(rf2, data=select(tr,-status), y=tr$status)
# FeatureImp----
imp1 = FeatureImp$new(mod1, loss='ce') # ok
imp2 = FeatureImp$new(mod2, loss='ce') # error occurs
# Error message in imp2----
Error in `[.data.frame`(out,  , obsLevels, drop = FALSE) : 
  undefined columns selected

Partial fails on recent data.table

data.table is pushing to release an update (1.11.0) to CRAN in the near future. The current development version breaks some parts of your package. In particular:

install.packages('data.table', type = 'source', repos = 'http://Rdatatable.github.io/data.table')
pkg = 'iml'
install.packages('iml', dependencies = TRUE)
library(iml)
example('Partial')
# ...

Error in [.data.table(results.ice, , list(..y.hat = mean(..y.hat)), : ..y.hat in j is looking for y.hat in calling scope, but a column '..y.hat' exists. Column names should not start with ..

See #18 in the NEWS
