
tidymodels / applicable


Quantify extrapolation of new samples given a training set

Home Page: https://applicable.tidymodels.org/

License: Other

R 100.00%

applicable's Introduction

tidymodels


Overview

tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.

It includes a core set of packages that are loaded on startup:

  • broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.

  • dials has tools to create and manage values of tuning parameters.

  • dplyr contains a grammar for data manipulation.

  • ggplot2 implements a grammar of graphics.

  • infer is a modern approach to statistical inference.

  • parsnip is a tidy, unified interface to creating models.

  • purrr is a functional programming toolkit.

  • recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other tools.

  • rsample has infrastructure for resampling data so that models can be assessed and empirically validated.

  • tibble has a modern re-imagining of the data frame.

  • tune contains the functions to optimize model hyper-parameters.

  • workflows has methods to combine pre-processing steps and models into a single object.

  • yardstick contains tools for evaluating models (e.g., accuracy, RMSE).

A list of all tidymodels functions across different CRAN packages can be found at https://www.tidymodels.org/find/.

You can install the released version of tidymodels from CRAN with:

install.packages("tidymodels")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidymodels")

When loading the package, the versions and conflicts are listed:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.2.0 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/
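
Although the README above describes the tidymodels meta-package, this repository is applicable. A minimal sketch of its core workflow, assuming the apd_pca() and score() interface that appears in the issues below:

```r
library(applicable)

# Fit a PCA-based applicability-domain model on the training predictors
training <- mtcars[1:25, -1]
mod <- apd_pca(training)

# Score new samples; distance_pctl gives the percentile of each sample's
# PCA distance relative to the training set
new_data <- mtcars[26:32, -1]
score(mod, new_data)
```

Samples with a high distance_pctl lie far from the training data in PCA space, i.e., predictions for them involve extrapolation.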

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

applicable's People

Contributors

emilhvitfeldt, hfrick, juliasilge, marlycormar, topepo

applicable's Issues

some initial notes

Based on our previous conversation

  • I think that, for unsupervised models, we can use y = NA in these calls to mold().

  • For this line, you won't need to pass in outcome. The line outcome <- processed$outcomes[[1]] won't be needed either.

  • fit-implementation.R would include the call to prcomp() and that object would be returned here (instead of the coefs thing that gets automatically populated)

  • Since we will have multiple ad_* functions, you may want to combine the fit-* files into a single file for PCA (same for the predict-* files too).

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt


Additions to the `score_ad_pca_numeric`

# add the distance column
# notes:
# te <- score(mod, test)
# diffs <- sweep(as.matrix(te), 2, means)
# sq_diff <- diffs^2
# dists <- apply(sq_diff, 1, function(x) sqrt(sum(x)))
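
The notes above can be filled out as a runnable sketch, assuming score() corresponds to projecting new data onto the PCA scores (approximated here with predict() on a plain prcomp() fit) and means holds the column means of the training scores:

```r
# Fit a PCA on the training predictors and record the training-score means
pca <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
means <- colMeans(pca$x)

# Project "new" samples and compute their distance to the score means
te <- predict(pca, mtcars[1:5, -1])
sq_diff <- sweep(te, 2, means)^2
dists <- sqrt(rowSums(sq_diff))
```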

isolation forests

We could add an ad_iso_forest() method that would use an isolation forest to find anomalies.

The isotree package has a lot of features but requires an additional serialization step to save the model. The isolation package might be the best approach.

Pinging: @kevin-m-kent
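
A minimal sketch of the idea using isotree directly; this is an illustration, not a proposal for the eventual ad_iso_forest() interface:

```r
library(isotree)

# Fit an isolation forest on the training predictors
iso <- isolation.forest(mtcars[, -1], ntrees = 100, nthreads = 1)

# Higher scores (closer to 1) flag more anomalous, less applicable samples
predict(iso, mtcars[, -1], type = "score")
```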

Hotelling T^2 for Outlier Detection

Feature: Hotelling T^2 for outlier detection

In chemometric models that use PCA/PLS or similar methods, we often use T^2 for outlier detection. This could be a nice complement to the score.pca method that is already implemented.

Here is an example taken from Chapter 6 of Process Improvement using Data by Kevin Dunn.

[figure: T^2 outlier-detection example from Chapter 6 of Process Improvement Using Data]

Is this too niche? Worthwhile to implement? I'm also interested in the isolation methods, such as implementing isolation forests from #25. I'm not as familiar with those methods, so I'd need to learn more about them first.
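
A hedged sketch of the T^2 computation from a plain prcomp() fit, using the standard F-distribution control limit from the chemometrics literature; none of this uses applicable's API:

```r
pca <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
scores <- pca$x
lambda <- pca$sdev^2   # per-component variances (eigenvalues)

# Hotelling T^2 for each observation: squared scores scaled by eigenvalues
t2 <- rowSums(sweep(scores^2, 2, lambda, "/"))

# 95% control limit based on the F distribution
n <- nrow(scores)
k <- ncol(scores)
limit <- k * (n - 1) * (n + 1) / (n * (n - k)) * qf(0.95, k, n - k)
which(t2 > limit)
```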

"pctl" output should either be reported [0, 1] or [0, 100], but not both.

Hello,

First, I love the package and the idea.

In playing around with apd_pca() and score(), I've stumbled across a potentially confusing styling for the format of the output of score().

When calculating the percentile of the PCA distance, once that percentile reaches ~100, the output is converted to "1". Is this the intended behavior?

[screenshot: score() output in which a percentile near 100 is reported as 1]

leverage computations

After our conversation with @bwlewis, let's use the QR decomposition.

Here's some example code:

options(width = 100)
# Use the QR decomposition to get (X'X)^{-1}. Fail if it doesn't work. 
get_inv <- function(X) {
  if (!is.matrix(X)) {
    X <- as.matrix(X)
  }
  XpX <- t(X) %*% X
  XpX_inv <- try(qr.solve(XpX), silent = TRUE)
  if (inherits(XpX_inv, "try-error")) {
    stop(as.character(XpX_inv), call. = FALSE)
  }
  dimnames(XpX_inv) <- NULL
  XpX_inv
}



X1 <- mtcars[, -1]
round(get_inv(X1), 3)
#>         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
#>  [1,]  0.085 -0.001 -0.001 -0.002  0.049 -0.027  0.120  0.032  0.018 -0.018
#>  [2,] -0.001  0.000  0.000 -0.001 -0.004  0.001  0.001  0.000  0.000  0.001
#>  [3,] -0.001  0.000  0.000  0.000  0.001  0.000 -0.002  0.000 -0.001 -0.001
#>  [4,] -0.002 -0.001  0.000  0.313  0.091 -0.049  0.005 -0.122 -0.086 -0.030
#>  [5,]  0.049 -0.004  0.001  0.091  0.507 -0.086  0.043  0.064  0.088 -0.158
#>  [6,] -0.027  0.001  0.000 -0.049 -0.086  0.031 -0.064  0.021 -0.036  0.031
#>  [7,]  0.120  0.001 -0.002  0.005  0.043 -0.064  0.625  0.142 -0.003  0.020
#>  [8,]  0.032  0.000  0.000 -0.122  0.064  0.021  0.142  0.570 -0.178  0.022
#>  [9,]  0.018  0.000 -0.001 -0.086  0.088 -0.036 -0.003 -0.178  0.265 -0.066
#> [10,] -0.018  0.001 -0.001 -0.030 -0.158  0.031  0.020  0.022 -0.066  0.096

bad <- cbind(int = rep(1, 150), model.matrix(~ .  + 0, data = iris))
get_inv(bad)
#> Error: Error in qr.solve(XpX) : singular matrix 'a' in solve

# A new sample: 
unk <- as.matrix(mtcars[3, -1, drop = FALSE ])

# leverage value
unk %*% get_inv(X1) %*% t(unk)
#>            Datsun 710
#> Datsun 710  0.2191155

# compare to base R:
lm_fit <- lm(mpg ~ . - 1, data = mtcars)
hatvalues(lm_fit)[3]
#> Datsun 710 
#>  0.2191155

Created on 2019-07-10 by the reprex package (v0.2.1)

We might want to have an option for including the intercept or not. I'm on the fence about it.

Breaking changes in dependency `isotree`

I am the maintainer of the isotree package, which is a dependency of {applicable} on CRAN under 'Suggests':
https://cran.r-project.org/web/packages/applicable/index.html

I would like to push an update to {isotree} which would break one of the unit tests of {applicable}.

In particular, I would like to change the default of the ndim argument to the isolation.forest function:
https://github.com/david-cortes/isotree/blob/ad49b9717b41ce9bab86f2aeebe742679f0fca58/R/isoforest.R#L996
from the current (CRAN) default of min(3, NCOL(data)) to 1.

This would generate a problem in this unit test for {applicable}:
https://github.com/cran/applicable/blob/b66153447194c71778f7c04bc258722cc5cc5257/tests/testthat/test-isolation-fit.R#L26

To get the old behavior, one would now need to pass ndim=2 in this test:

res_rec <- apd_isolation(rec, cells_tr, ntrees = 10, nthreads = 1, ndim = 2)

It would be ideal if an updated version with this change could be submitted to CRAN.

I'm leaving this as an issue instead of a PR since the code here looks out of sync with the CRAN release and doesn't have the problematic file committed.

Output an error message when column names don't match

Output a descriptive error message when the selector columns do not exist in the dataset. For example, in the code below, the apd_pca(predictors) object doesn't contain components matching "PC00[1-3]".

library(applicable)

predictors <- mtcars[, -1]
mod <- apd_pca(predictors)
autoplot(mod, matches("PC00[1-3]"))

As a result, it throws the following error.

Error: At least one layer must contain all faceting variables: component.

  • Plot is missing component
  • Layer 1 is missing component

Testing fitting & scoring functions

library(hardhat)
library(dplyr)

# ---------------------------------------------------------
# Testing model constructor
# ---------------------------------------------------------

# Run constructor.R
manual_model <- new_ad_pca("my_coef", default_xy_blueprint())
manual_model
names(manual_model)

manual_model$blueprint

# ---------------------------------------------------------
# Testing model fitting implementation
# ---------------------------------------------------------

# Run pca-fit.R
ad_pca_impl(iris %>% select(Sepal.Width))

# ---------------------------------------------------------
# Simulating user input and pass it to the fit bridge
# ---------------------------------------------------------

# Simulating formula interface
processed_1 <- mold(~., iris)
ad_pca_bridge(processed_1)

# Simulating x interface
iris_sub <- iris %>% select(-Species)
processed_2 <- mold(iris_sub, NA_real_)
ad_pca_bridge(processed_2)

# Simulating multiple outcomes. Error expected.
multi_outcome <- mold(Sepal.Width + Petal.Width ~ Sepal.Length + Species, iris)
ad_pca_bridge(multi_outcome)

# ---------------------------------------------------------
# Testing user facing fitting function
# ---------------------------------------------------------

# Using recipes
library(recipes)

predictors <- iris[c("Sepal.Width", "Petal.Width")]

# Data frame predictor
predictor <- iris['Sepal.Length']
ad_pca(predictor)

# Vector predictor.
# We should get the following error:
# "Error: `ad_pca()` is not defined for a 'numeric'."
predictor <- iris$Sepal.Length
ad_pca(predictor)

# Formula interface
ad_pca(~., iris)

# Using recipes. Fails with "Error: No variables or terms were selected."
library(recipes)
rec <- recipe(~., iris) %>%
  step_log(Sepal.Width) %>%
  step_dummy(Species, one_hot = TRUE)
ad_pca(rec, iris)


# ---------------------------------------------------------
# Testing model scoring implementation
# ---------------------------------------------------------

# Run pca-score.R
model <- ad_pca(Sepal.Width ~ Sepal.Length + Species, iris)
predictors <- forge(iris, model$blueprint)$predictors
predictors <- as.matrix(predictors)
score_ad_pca_numeric(model, predictors)


# ---------------------------------------------------------
# Testing score bridge function
# ---------------------------------------------------------

model <- ad_pca(~., iris)
predictors <- forge(iris, model$blueprint)$predictors
score_ad_pca_bridge("numeric", model, predictors)


# ---------------------------------------------------------
# Testing score interface function
# ---------------------------------------------------------

# Run 0.R
model <- ad_pca(~., iris)
score(model, iris)

# We should get an error:
# "Error: The class of `new_data`, 'factor', is not recognized."
# since `iris$Species` is not a data.frame
score(model, iris$Species)

# We should get an error:
# "Error: The following required columns are missing: 'Sepal.Length'."
# since `Sepal.Length` column is missing.
score(model, subset(iris, select = -Sepal.Length))

# The column `Species` is silently converted to a factor.
iris_character_col <- transform(iris, Species = as.character(Species))
score(model, iris_character_col)

# We should get an error:
# "Error: Can't cast `x$Species` <double> to `to$Species` <factor<12d60>>."
# since `Species` can't be forced to be a factor
iris_double_col <- transform(iris, Species = 1)
score(model, iris_double_col)

Upkeep for applicable (2023)

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_github_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    The docs for usethis::use_r() include a helpful script.
    usethis::rename_files() may be useful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.6
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions

2022

2023

Necessary:

  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples

Created on 2023-10-30 with usethis::use_tidy_upkeep_issue(), using usethis v2.2.2

Upkeep for applicable

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    usethis::rename_files() can be helpful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.4
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions
  • Add RStudio to DESCRIPTION as funder, if appropriate

Interpretation of PCA scores

First of all, thank you for this package. I find it very useful.

This question is not directly related to the package but the output interpretation.

My goal is to provide the user with the predicted class along with the applicability of the observation. Since I have continuous variables, I decided to use the PCA score and take the distance_pctl column. The percentile interpretation is simple: 95 means only 5% of training observations were more different than your query. Still, it raises the next question: what should the threshold be to separate acceptable queries from those that are so different that the prediction should be rejected?

I know this is a tricky question, but I think a categorization of the applicability value would be useful to improve the interpretation of results. I thought of setting a threshold of 95 (for instance), but I'm wondering whether there are more elegant approaches.

Thank you!!
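
One simple way to categorize distance_pctl, with the caveat that the cutoffs below are illustrative conventions, not package recommendations:

```r
library(applicable)

mod <- apd_pca(mtcars[, -1])
scored <- score(mod, mtcars[, -1])

# Bucket the percentile into coarse applicability categories
scored$applicability <- cut(
  scored$distance_pctl,
  breaks = c(0, 90, 95, 100),
  labels = c("inside", "borderline", "outside"),
  include.lowest = TRUE
)
```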

Upkeep for applicable

2023

Necessary:

  • Update copyright holder in DESCRIPTION: person(given = "Posit Software, PBC", role = c("cph", "fnd"))
  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update email addresses *@rstudio.com -> *@posit.co
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples
