
tidymodels / applicable


Quantify extrapolation of new samples given a training set

Home Page: https://applicable.tidymodels.org/

License: Other

R 100.00%

applicable's Introduction

tidymodels


Overview

tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.

It includes a core set of packages that are loaded on startup:

  • broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.

  • dials has tools to create and manage values of tuning parameters.

  • dplyr contains a grammar for data manipulation.

  • ggplot2 implements a grammar of graphics.

  • infer is a modern approach to statistical inference.

  • parsnip is a tidy, unified interface to creating models.

  • purrr is a functional programming toolkit.

  • recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other tools.

  • rsample has infrastructure for resampling data so that models can be assessed and empirically validated.

  • tibble has a modern re-imagining of the data frame.

  • tune contains the functions to optimize model hyper-parameters.

  • workflows has methods to combine pre-processing steps and models into a single object.

  • yardstick contains tools for evaluating models (e.g., accuracy, RMSE).

A list of all tidymodels functions across different CRAN packages can be found at https://www.tidymodels.org/find/.

You can install the released version of tidymodels from CRAN with:

install.packages("tidymodels")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidymodels")

When loading the package, the versions and conflicts are listed:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.2.0 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/
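
Although the README above describes the tidymodels meta-package, this repository is applicable. A minimal sketch of its core workflow, assuming the apd_pca() and score() interface that appears in the issues below:

```r
library(applicable)

# Fit a PCA-based applicability-domain model on the training predictors
training <- mtcars[1:25, -1]
mod <- apd_pca(training)

# Score new samples; distance_pctl gives the percentile of each sample's
# PCA distance relative to the training set
new_data <- mtcars[26:32, -1]
score(mod, new_data)
```

Samples with a high distance_pctl lie far from the training data in PCA space, i.e., predictions for them involve extrapolation.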

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

applicable's People

Contributors

emilhvitfeldt, hfrick, juliasilge, marlycormar, topepo

applicable's Issues

some initial notes

Based on our previous conversation

  • I think that, for unsupervised models, we can use y = NA in these calls to mold().

  • For this line, you won't need to pass in outcome. The line outcome <- processed$outcomes[[1]] won't be needed either.

  • fit-implementation.R would include the call to prcomp() and that object would be returned here (instead of the coefs thing that gets automatically populated)

  • Since we will have multiple ad_* functions, you may want to combine the fit-* files into a single file for PCA (same for the predict-* files too).

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt


Additions to the `score_ad_pca_numeric`

# add the distance column
# notes:
# te <- score(mod, test)
# diffs <- sweep(as.matrix(te), 2, means)
# sq_diff <- diffs^2
# dists <- apply(sq_diff, 1, function(x) sqrt(sum(x)))
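
The notes above can be filled out as a runnable sketch, assuming score() corresponds to projecting new data onto the PCA scores (approximated here with predict() on a plain prcomp() fit) and means holds the column means of the training scores:

```r
# Fit a PCA on the training predictors and record the training-score means
pca <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
means <- colMeans(pca$x)

# Project "new" samples and compute their distance to the score means
te <- predict(pca, mtcars[1:5, -1])
sq_diff <- sweep(te, 2, means)^2
dists <- sqrt(rowSums(sq_diff))
```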

isolation forests

We could add an ad_iso_forest() method that would use an isolation forest to find anomalies.

The isotree package has a lot of features but requires an additional serialization step to save the model. The isolation package might be the best approach.

Pinging: @kevin-m-kent
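
A minimal sketch of the idea using isotree directly; this is an illustration, not a proposal for the eventual ad_iso_forest() interface:

```r
library(isotree)

# Fit an isolation forest on the training predictors
iso <- isolation.forest(mtcars[, -1], ntrees = 100, nthreads = 1)

# Higher scores (closer to 1) flag more anomalous, less applicable samples
predict(iso, mtcars[, -1], type = "score")
```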

Hotelling T^2 for Outlier Detection

Feature: Hotelling T^2 for outlier detection

In chemometric models that use PCA/PLS or similar methods, we often use T^2 for outlier detection. This could be a nice complement to the score.pca method that is already implemented.

Here is an example taken from Chapter 6 of Process Improvement using Data by Kevin Dunn.

[figure: T^2 outlier-detection example from Chapter 6 of Process Improvement Using Data]

Is this too niche? Worthwhile to implement? I'm also interested in the isolation methods, such as implementing isolation forests from #25. I'm not as familiar with those methods, so I'd need to learn more about them first.
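
A hedged sketch of the T^2 computation from a plain prcomp() fit, using the standard F-distribution control limit from the chemometrics literature; none of this uses applicable's API:

```r
pca <- prcomp(mtcars[, -1], center = TRUE, scale. = TRUE)
scores <- pca$x
lambda <- pca$sdev^2   # per-component variances (eigenvalues)

# Hotelling T^2 for each observation: squared scores scaled by eigenvalues
t2 <- rowSums(sweep(scores^2, 2, lambda, "/"))

# 95% control limit based on the F distribution
n <- nrow(scores)
k <- ncol(scores)
limit <- k * (n - 1) * (n + 1) / (n * (n - k)) * qf(0.95, k, n - k)
which(t2 > limit)
```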

"pctl" output should either be reported [0, 1] or [0, 100], but not both.

Hello,

First, I love the package and the idea.

In playing around with apd_pca() and score(), I've stumbled across a potentially confusing styling for the format of the output of score().

When calculating the percentile of the PCA distance, once that percentile reaches ~100, the output is converted to "1". Is this the intended behavior?

[screenshot: score() output in which a percentile near 100 is reported as 1]

leverage computations

After our conversation with @bwlewis, let's use the QR decomposition.

Here's some example code:

options(width = 100)
# Use the QR decomposition to get (X'X)^{-1}. Fail if it doesn't work. 
get_inv <- function(X) {
  if (!is.matrix(X)) {
    X <- as.matrix(X)
  }
  XpX <- t(X) %*% X
  XpX_inv <- try(qr.solve(XpX), silent = TRUE)
  if (inherits(XpX_inv, "try-error")) {
    stop(as.character(XpX_inv), call. = FALSE)
  }
  dimnames(XpX_inv) <- NULL
  XpX_inv
}



X1 <- mtcars[, -1]
round(get_inv(X1), 3)
#>         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
#>  [1,]  0.085 -0.001 -0.001 -0.002  0.049 -0.027  0.120  0.032  0.018 -0.018
#>  [2,] -0.001  0.000  0.000 -0.001 -0.004  0.001  0.001  0.000  0.000  0.001
#>  [3,] -0.001  0.000  0.000  0.000  0.001  0.000 -0.002  0.000 -0.001 -0.001
#>  [4,] -0.002 -0.001  0.000  0.313  0.091 -0.049  0.005 -0.122 -0.086 -0.030
#>  [5,]  0.049 -0.004  0.001  0.091  0.507 -0.086  0.043  0.064  0.088 -0.158
#>  [6,] -0.027  0.001  0.000 -0.049 -0.086  0.031 -0.064  0.021 -0.036  0.031
#>  [7,]  0.120  0.001 -0.002  0.005  0.043 -0.064  0.625  0.142 -0.003  0.020
#>  [8,]  0.032  0.000  0.000 -0.122  0.064  0.021  0.142  0.570 -0.178  0.022
#>  [9,]  0.018  0.000 -0.001 -0.086  0.088 -0.036 -0.003 -0.178  0.265 -0.066
#> [10,] -0.018  0.001 -0.001 -0.030 -0.158  0.031  0.020  0.022 -0.066  0.096

bad <- cbind(int = rep(1, 150), model.matrix(~ .  + 0, data = iris))
get_inv(bad)
#> Error: Error in qr.solve(XpX) : singular matrix 'a' in solve

# A new sample: 
unk <- as.matrix(mtcars[3, -1, drop = FALSE ])

# leverage value
unk %*% get_inv(X1) %*% t(unk)
#>            Datsun 710
#> Datsun 710  0.2191155

# compare to base R:
lm_fit <- lm(mpg ~ . - 1, data = mtcars)
hatvalues(lm_fit)[3]
#> Datsun 710 
#>  0.2191155

Created on 2019-07-10 by the reprex package (v0.2.1)

We might want to have an option for including the intercept or not. I'm on the fence about it.

Breaking changes in dependency `isotree`

I am the maintainer of the isotree package, which is a dependency of {applicable} on CRAN under 'Suggests':
https://cran.r-project.org/web/packages/applicable/index.html

I would like to push an update to {isotree} which would break one of the unit tests of {applicable}.

In particular, I would like to change the default of the ndim argument to the isolation.forest function:
https://github.com/david-cortes/isotree/blob/ad49b9717b41ce9bab86f2aeebe742679f0fca58/R/isoforest.R#L996
from the current (CRAN) default of min(3, NCOL(data)) to 1.

This would generate a problem in this unit test for {applicable}:
https://github.com/cran/applicable/blob/b66153447194c71778f7c04bc258722cc5cc5257/tests/testthat/test-isolation-fit.R#L26

To get the old behavior, one would now need to pass ndim=2 in this test:

res_rec <- apd_isolation(rec, cells_tr, ntrees = 10, nthreads = 1, ndim = 2)

It would be ideal if an updated version with this change could be submitted to CRAN.

I'm leaving this as an issue instead of a PR since the code here looks out of sync with the CRAN release and doesn't have the problematic file committed.

Output an error message when column names don't match

Output a descriptive error message when the selector columns do not exist in the dataset. For example, in the code below, the apd_pca(predictors) object doesn't contain components matching "PC00[1-3]".

library(applicable)

predictors <- mtcars[, -1]
mod <- apd_pca(predictors)
autoplot(mod, matches("PC00[1-3]"))

As a result, it throws the following error.

Error: At least one layer must contain all faceting variables: component.

  • Plot is missing component
  • Layer 1 is missing component

Testing fitting & scoring functions

library(hardhat)
library(dplyr)

# ---------------------------------------------------------
# Testing model constructor
# ---------------------------------------------------------

# Run constructor.R
manual_model <- new_ad_pca("my_coef", default_xy_blueprint())
manual_model
names(manual_model)

manual_model$blueprint

# ---------------------------------------------------------
# Testing model fitting implementation
# ---------------------------------------------------------

# Run pca-fit.R
ad_pca_impl(iris %>% select(Sepal.Width))

# ---------------------------------------------------------
# Simulating user input and pass it to the fit bridge
# ---------------------------------------------------------

# Simulating formula interface
processed_1 <- mold(~., iris)
ad_pca_bridge(processed_1)

# Simulating x interface
iris_sub <- iris %>% select(-Species)
processed_2 <- mold(iris_sub, NA_real_)
ad_pca_bridge(processed_2)

# Simulating multiple outcomes. Error expected.
multi_outcome <- mold(Sepal.Width + Petal.Width ~ Sepal.Length + Species, iris)
ad_pca_bridge(multi_outcome)

# ---------------------------------------------------------
# Testing user facing fitting function
# ---------------------------------------------------------

# Using recipes
library(recipes)

predictors <- iris[c("Sepal.Width", "Petal.Width")]

# Data frame predictor
predictor <- iris['Sepal.Length']
ad_pca(predictor)

# Vector predictor.
# We should get the following error:
# "Error: `ad_pca()` is not defined for a 'numeric'."
predictor <- iris$Sepal.Length
ad_pca(predictor)

# Formula interface
ad_pca(~., iris)

# Using recipes. Fails with "Error: No variables or terms were selected."
library(recipes)
rec <- recipe(~., iris) %>%
  step_log(Sepal.Width) %>%
  step_dummy(Species, one_hot = TRUE)
ad_pca(rec, iris)


# ---------------------------------------------------------
# Testing model scoring implementation
# ---------------------------------------------------------

# Run pca-score.R
model <- ad_pca(Sepal.Width ~ Sepal.Length + Species, iris)
predictors <- forge(iris, model$blueprint)$predictors
predictors <- as.matrix(predictors)
score_ad_pca_numeric(model, predictors)


# ---------------------------------------------------------
# Testing score bridge function
# ---------------------------------------------------------

model <- ad_pca(~., iris)
predictors <- forge(iris, model$blueprint)$predictors
score_ad_pca_bridge("numeric", model, predictors)


# ---------------------------------------------------------
# Testing score interface function
# ---------------------------------------------------------

# Run 0.R
model <- ad_pca(~., iris)
score(model, iris)

# We should get an error:
# "Error: The class of `new_data`, 'factor', is not recognized."
# since `iris$Species` is not a data.frame
score(model, iris$Species)

# We should get an error:
# "Error: The following required columns are missing: 'Sepal.Length'."
# since `Sepal.Length` column is missing.
score(model, subset(iris, select = -Sepal.Length))

# The column `Species` is silently converted to a factor.
iris_character_col <- transform(iris, Species = as.character(Species))
score(model, iris_character_col)

# We should get an error:
# "Error: Can't cast `x$Species` <double> to `to$Species` <factor<12d60>>."
# since `Species` can't be forced to be a factor
iris_double_col <- transform(iris, Species = 1)
score(model, iris_double_col)

Upkeep for applicable (2023)

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_github_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    The docs for usethis::use_r() include a helpful script.
    usethis::rename_files() may be useful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.6
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions

2022

2023

Necessary:

  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples

Created on 2023-10-30 with usethis::use_tidy_upkeep_issue(), using usethis v2.2.2

Upkeep for applicable

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    usethis::rename_files() can be helpful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.4
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions
  • Add RStudio to DESCRIPTION as funder, if appropriate

Interpretation of PCA scores

First of all, thank you for this package. I find it very useful.

This question is not directly related to the package but the output interpretation.

My goal is to provide the user with the predicted class along with the applicability of the observation. Since I have continuous variables, I decided to use the PCA score and take the distance_pctl column. The percentile interpretation is simple: 95 means only 5% of training observations were more different than your query. Still, it raises the next question: what should the threshold be to separate acceptable queries from those that are so different that the prediction should be rejected?

I know this is a tricky question, but I think a categorization of the applicability value would be useful to improve the interpretation of results. I thought of setting a threshold of 95 (for instance), but I'm wondering whether there are more elegant approaches.

Thank you!!
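
One simple way to categorize distance_pctl, with the caveat that the cutoffs below are illustrative conventions, not package recommendations:

```r
library(applicable)

mod <- apd_pca(mtcars[, -1])
scored <- score(mod, mtcars[, -1])

# Bucket the percentile into coarse applicability categories
scored$applicability <- cut(
  scored$distance_pctl,
  breaks = c(0, 90, 95, 100),
  labels = c("inside", "borderline", "outside"),
  include.lowest = TRUE
)
```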

Upkeep for applicable

2023

Necessary:

  • Update copyright holder in DESCRIPTION: person(given = "Posit Software, PBC", role = c("cph", "fnd"))
  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update email addresses *@rstudio.com -> *@posit.co
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples
