
corybrunson / ordr


manage ordinations and render biplots in a tidyverse workflow

Home Page: https://corybrunson.github.io/ordr/

License: GNU General Public License v3.0

ordination biplot multivariate-analysis tidyverse geometric-data-analysis dimension-reduction multivariate-statistics tidymodels data-visualization grammar-of-graphics

ordr's Introduction

ordr

Lifecycle: experimental · available on CRAN

ordr integrates ordination analysis and biplot visualization into tidyverse workflows.

motivation

Wherever there is an SVD, there is a biplot.1

ordination and biplots

Ordination is a catch-all term for a variety of statistical techniques that introduce an artificial coordinate system for a data set in such a way that a few coordinates capture a large amount of the data structure.2 The branch of mathematical statistics called geometric data analysis (GDA) provides the theoretical basis for (most of) these techniques. Ordination overlaps with regression and with dimension reduction, which can be contrasted to clustering and classification in that they assign continuous rather than categorical values to data elements.3

Most ordination techniques decompose a numeric rectangular data set into the product of two matrices, often using singular value decomposition (SVD). The coordinates of the shared dimensions of these matrices (over which they are multiplied) are the artificial coordinates. In some cases, such as principal components analysis, the decomposition is exact; in others, such as non-negative matrix factorization, it is approximate. Some techniques, such as correspondence analysis, transform the data before decomposition. Ordination techniques may be supervised, like linear discriminant analysis, or unsupervised, like multidimensional scaling.
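As a quick illustration of this decomposition (a minimal base-R sketch, not ordr code), the PCA used in the example below is equivalent to an SVD of the standardized data:

# minimal base-R sketch: PCA of the standardized iris measurements is an SVD
X <- scale(iris[, 1:4])                       # center and scale the measurements
s <- svd(X)                                   # X = U D V'
pca <- prcomp(iris[, 1:4], scale. = TRUE)
# scores and loadings agree with U D and V up to column signs
all.equal(abs(unname(pca$x)), abs(s$u %*% diag(s$d)))
all.equal(abs(unname(pca$rotation)), abs(s$v))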

Analysis pipelines that use these techniques may use the artificial coordinates directly, in place of natural coordinates, to arrange and compare data elements or to predict responses. This is possible because both the rows and the columns of the original table can be located, or positioned, along these shared coordinates. The number of artificial coordinates used in an application, such as regression or visualization, is called the rank of the ordination.4 A common application is the biplot, which positions the rows and columns of the original table in a scatterplot in 1, 2, or 3 artificial coordinates, usually those that explain the most variation in the data.

implementations in R

An extensive range of ordination techniques are implemented in R, from classical multidimensional scaling (stats::cmdscale()) and principal components analysis (stats::prcomp() and stats::princomp()) in the stats package distributed with base R, across widely-used implementations of linear discriminant analysis (MASS::lda()) and correspondence analysis (ca::ca()) in general-use statistical packages, to highly specialized packages that implement cutting-edge techniques or adapt conventional techniques to challenging settings. These implementations come with their own conventions, tailored to the research communities that produced them, and it would be impractical (and probably unhelpful) to try to consolidate them.

Instead, ordr provides a streamlined process by which the models output by these methods—in particular, the matrix factors into which the original data are approximately decomposed and the artificial coordinates they share—can be inspected, annotated, tabulated, summarized, and visualized. On this last point, most biplot implementations in R provide limited customizability. ordr adopts the grammar of graphics paradigm from ggplot2 to modularize and standardize biplot elements.5 Overall, the package is designed to follow the broader syntactic conventions of the tidyverse, so that users familiar with this workflow can more easily and quickly integrate ordination models into practice.

usage

installation

ordr is now on CRAN and can be installed using base R:

install.packages("ordr")

The development version can be installed from the (default) main branch using remotes:

remotes::install_github("corybrunson/ordr")

example

Morphologically, Iris versicolor is much closer to Iris virginica than to Iris setosa, though in every character by which it differs from Iris virginica it departs in the direction of Iris setosa.6

A very common illustration of ordination in R applies principal components analysis (PCA) to Anderson’s iris measurements. These data consist of lengths and widths of the petals and surrounding sepals from 50 specimens of each of three species of iris:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#> 

ordr provides a convenience function to send a subset of columns to an ordination function, wrap the resulting model in the tibble-derived ‘tbl_ord’ class, and append both model diagnostics and other original data columns as annotations to the appropriate matrix factors:7

(iris_pca <- ordinate(iris, cols = 1:4, model = ~ prcomp(., scale. = TRUE)))
#> # A tbl_ord of class 'prcomp': (150 x 4) x (4 x 4)'
#> # 4 coordinates: PC1, PC2, ..., PC4
#> # 
#> # Rows (principal): [ 150 x 4 | 1 ]
#>     PC1    PC2     PC3 ... |   Species
#>                            |   <fct>  
#> 1 -2.26 -0.478  0.127      | 1 setosa 
#> 2 -2.07  0.672  0.234  ... | 2 setosa 
#> 3 -2.36  0.341 -0.0441     | 3 setosa 
#> 4 -2.29  0.595 -0.0910     | 4 setosa 
#> 5 -2.38 -0.645 -0.0157     | 5 setosa 
#> # ℹ 145 more rows
#> # 
#> # Columns (standard): [ 4 x 4 | 3 ]
#>      PC1     PC2    PC3 ... |   name         center scale
#>                             |   <chr>         <dbl> <dbl>
#> 1  0.521 -0.377   0.720     | 1 Sepal.Length   5.84 0.828
#> 2 -0.269 -0.923  -0.244 ... | 2 Sepal.Width    3.06 0.436
#> 3  0.580 -0.0245 -0.142     | 3 Petal.Length   3.76 1.77 
#> 4  0.565 -0.0669 -0.634     | 4 Petal.Width    1.20 0.762

Additional annotations can be added using several row- and column-specific dplyr-style verbs:

iris_meta <- data.frame(
  Species = c("setosa", "versicolor", "virginica"),
  Colony = c(1L, 1L, 2L),
  Cytotype = c("diploid", "hexaploid", "tetraploid")
)
(iris_pca <- left_join_rows(iris_pca, iris_meta, by = "Species"))
#> # A tbl_ord of class 'prcomp': (150 x 4) x (4 x 4)'
#> # 4 coordinates: PC1, PC2, ..., PC4
#> # 
#> # Rows (principal): [ 150 x 4 | 3 ]
#>     PC1    PC2     PC3 ... |   Species Colony Cytotype
#>                            |   <chr>    <int> <chr>   
#> 1 -2.26 -0.478  0.127      | 1 setosa       1 diploid 
#> 2 -2.07  0.672  0.234  ... | 2 setosa       1 diploid 
#> 3 -2.36  0.341 -0.0441     | 3 setosa       1 diploid 
#> 4 -2.29  0.595 -0.0910     | 4 setosa       1 diploid 
#> 5 -2.38 -0.645 -0.0157     | 5 setosa       1 diploid 
#> # ℹ 145 more rows
#> # 
#> # Columns (standard): [ 4 x 4 | 3 ]
#>      PC1     PC2    PC3 ... |   name         center scale
#>                             |   <chr>         <dbl> <dbl>
#> 1  0.521 -0.377   0.720     | 1 Sepal.Length   5.84 0.828
#> 2 -0.269 -0.923  -0.244 ... | 2 Sepal.Width    3.06 0.436
#> 3  0.580 -0.0245 -0.142     | 3 Petal.Length   3.76 1.77 
#> 4  0.565 -0.0669 -0.634     | 4 Petal.Width    1.20 0.762

Following the broom package, the tidy() method produces a tibble describing the model components, in this case the principal coordinates, which is suitable for scree plotting:

tidy(iris_pca) %T>% print() %>%
  ggplot(aes(x = name, y = prop_var)) +
  geom_col() +
  labs(x = "", y = "Proportion of inertia") +
  ggtitle("PCA of Anderson's iris measurements",
          "Distribution of inertia")
#> # A tibble: 4 × 5
#>   name   sdev inertia prop_var quality
#>   <fct> <dbl>   <dbl>    <dbl>   <dbl>
#> 1 PC1   1.71   435.    0.730     0.730
#> 2 PC2   0.956  136.    0.229     0.958
#> 3 PC3   0.383   21.9   0.0367    0.995
#> 4 PC4   0.144    3.09  0.00518   1

Following ggplot2, the fortify() method row-binds the factor tibbles with an additional .matrix column. This is used by ggbiplot() to redirect row- and column-specific plot layers to the appropriate subsets:8

ggbiplot(iris_pca, sec.axes = "cols", scale.factor = 2) +
  geom_rows_point(aes(color = Species, shape = Species)) +
  stat_rows_ellipse(aes(color = Species), alpha = .5, level = .99) +
  geom_cols_vector() +
  geom_cols_text_radiate(aes(label = name)) +
  expand_limits(y = c(-3.5, NA)) +
  ggtitle("PCA of Anderson's iris measurements",
          "99% confidence ellipses; variables use top & right axes")

When variables are represented in standard coordinates, as is typical in PCA, their axes can be rescaled to yield a predictive biplot:9

ggbiplot(iris_pca, axis.type = "predictive", axis.percents = FALSE) +
  theme_biplot() +
  geom_rows_point(aes(color = Species, shape = Species)) +
  stat_rows_center(
    aes(color = Species, shape = Species),
    size = 5, alpha = .5, fun.data = mean_se
  ) +
  geom_cols_axis(aes(label = name, center = center, scale = scale)) +
  ggtitle("Predictive biplot of Anderson's iris measurements",
          "Project a marker onto an axis to approximate its measurement")

aggregate(iris[, 1:4], by = iris[, "Species", drop = FALSE], FUN = mean)
#>      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     setosa        5.006       3.428        1.462       0.246
#> 2 versicolor        5.936       2.770        4.260       1.326
#> 3  virginica        6.588       2.974        5.552       2.026

more methods

The auxiliary package ordr.extra provides recovery methods for several additional ordination models—and has room for several more!

acknowledgments

contribute

Any feedback on the package is very welcome! If you encounter confusion or errors, do create an issue, with a minimal reproducible example if feasible. If you have requests, suggestions, or your own implementations for new features, feel free to create an issue or submit a pull request. Methods for additional ordination classes (see the methods-*.r scripts in the R folder) are especially welcome, as are new plot layers. Please try to follow the contributing guidelines and respect the Code of Conduct.

inspirations

This package was originally inspired by the ggbiplot extension developed by Vincent Q. Vu, Richard J Telford, and Vilmantas Gegzna, among others. It probably first brought biplots into the tidyverse framework. The motivation to unify a variety of ordination methods came from several books and articles by Michael Greenacre, in particular Biplots in Practice. Several answers at CrossValidated, in particular by amoeba and ttnphns, provided theoretical insights and informed design choices. Thomas Lin Pedersen’s tidygraph prequel to ggraph finally induced the shift from the downstream generation of scatterplots to the upstream handling and manipulating of models. Additional design elements and features have been informed by the monograph Biplots and the textbook Understanding Biplots by John C. Gower, David J. Hand, Sugnet Gardner–Lubbe, and Niel J. Le Roux, and by the volume Principal Components Analysis by I. T. Jolliffe.

exposition

This work was presented (slideshow PDF) at an invited panel on New Developments in Graphing Multivariate Data at the Joint Statistical Meetings, on 2022 August 8 in Washington DC. I’m grateful to Joyce Robbins for the invitation and for organizing such a fun first experience, to Naomi Robbins for chairing the event, and to my co-panelists Ursula Laa and Hengrui Luo for sharing and sparking such exciting ideas and conversations.

resources

Development of this package benefitted from the use of equipment and the support of colleagues at UConn Health and at UF Health.

Footnotes

  1. Greenacre MJ (2010) Biplots in Practice. Fundacion BBVA, ISBN: 978-84-923846. https://www.fbbva.es/microsite/multivariate-statistics/biplots.html

  2. The term ordination is most prevalent among ecologists; no catch-all term seems to be in common use outside ecology.

  3. This is not a hard rule: PCA is often used to compress data before clustering, and LDA uses dimension reduction to perform classification tasks.

  4. Regression and clustering models, like classical linear regression and k-means, can also be understood as matrix decomposition approximations and even visualized in biplots. Their shared coordinates, which are pre-defined rather than artificial, are the predictor coefficients and the cluster assignments, respectively. Methods for stats::lm() and stats::kmeans(), for example, are implemented for the sake of novelty and instruction, but are not widely used in practice.

  5. Biplot elements must be chosen with care, and it is useful and appropriate that many model-specific biplot methods have limited flexibility. This package adopts the trade-off articulated in Wilkinson’s The Grammar of Graphics (p. 15): “This system is capable of producing some hideous graphics. There is nothing in its design to prevent its misuse. … This system cannot produce a meaningless graphic, however.”

  6. Anderson E (1936) “The Species Problem in Iris”. Annals of the Missouri Botanical Garden 23(3), 457-469+471-483+485-501+503-509. https://doi.org/10.2307/2394164

  7. The data must be in the form of a data frame that can be understood by the modeling function. Step-by-step methods also exist to build and annotate a ‘tbl_ord’ from a fitted ordination model.

  8. The radiating text geom, like several other features, is adapted from the ggbiplot package.

  9. This is an experimental feature only available for linear methods, namely eigendecomposition, singular value decomposition, and principal components analysis.

ordr's People

Contributors

corybrunson · empaul20 · olivroy


ordr's Issues

spinoff ggplot2 extension for geometric data analysis

ordr includes several components that extend ggplot2 independently of working with ordination models:

  • GeomOrigin & GeomUnitCircle
  • GeomAxis
  • GeomLineranges & GeomPointranges
  • GeomIsoline
  • GeomVector & GeomTextRadiate
  • StatCenter & StatStar
  • StatChull
  • StatCone
  • StatSpantree

It may be convenient to spin these off into a GDA-focused extension (gggda?) that could then be imported or suggested by ordr.

NB: It would require care to be sure that ord_aes() still works with these layers without itself being built in to the extension. (Or maybe it should be so built in?)

automate titles and descriptions for methods documentation

{roxygen2} tricks exist to automatically generate boilerplate documentation with keyword substitutions, e.g. class names. See the class-specific tidiers in the {broom} repo for examples. This would be convenient for standardizing and updating the titles and descriptions of documentation for methods based on classes from other packages or new ('*_ord') classes.

allow option to calculate convex hull accounting for origin

The point (in many cases) of restricting to the convex hull is to showcase the most important elements of a factor (e.g. cases or variables). In the following example, three years clearly belong in the convex hull, but one (1940) does not:

library(ordr)
#> Loading required package: ggplot2
#> 
#> Attaching package: 'ordr'
#> The following object is masked from 'package:stats':
#> 
#>     cmdscale
# symmetric biplot with radially labeled variable vectors
USPersonalExpenditure %>%
  prcomp(center = FALSE) %>%
  as_tbl_ord() %>%
  confer_inertia(c(.5, .5)) %>%
  ggbiplot(aes(label = .name)) +
  geom_v_vector() +
  geom_v_text_radiate(stat = "chull")

Created on 2019-09-01 by the reprex package (v0.2.1)

This would be mitigated by giving stat_chull() and its derivatives an option to include the origin (if not already present) when calculating the hull and to remove it from the returned data afterward (if it was added).
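A base-R sketch of the proposed behavior (using chull() directly rather than ordr internals): appending the origin to the column (year) coordinates can change which elements end up on the hull, and the appended row would simply be dropped from the result.

# base-R sketch of the proposal, not ordr code
pts <- prcomp(USPersonalExpenditure, center = FALSE)$rotation[, 1:2]  # year coordinates
chull(pts)                                  # hull of the points alone
chull(rbind(pts, origin = c(0, 0)))         # hull after appending the origin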

stat layer to multiply by a scale factor

Consistent with the unitary aspect ratio and use of secondary axes, it would be convenient to have a stat layer that does nothing except multiply all artificial coordinates by a constant scale factor (which would default to 1). This could be used, e.g. in a row-principal biplot, to overlay one column geom that respects the secondary axes with another that is un-scaled back to the primary axes.

A possible name is StatScale.
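A minimal sketch of such a stat, following the standard ggplot2 extension pattern (the name StatScale and the mult parameter come from the proposal above, not from existing ordr code; only the x and y aesthetics are rescaled here):

StatScale <- ggplot2::ggproto(
  "StatScale", ggplot2::Stat,
  compute_group = function(data, scales, mult = 1) {
    # multiply the planar coordinates by a constant scale factor
    data$x <- data$x * mult
    data$y <- data$y * mult
    data
  }
)
stat_scale <- function(mapping = NULL, data = NULL, geom = "point",
                       position = "identity", mult = 1, ...,
                       show.legend = NA, inherit.aes = TRUE) {
  ggplot2::layer(
    stat = StatScale, geom = geom, data = data, mapping = mapping,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(mult = mult, ...)
  )
}

Whether such a layer should interact with ordr's row/column stat dispatch, or simply wrap it, would need to be worked out.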

radiating text does not respect aspect ratio

The following extreme case illustrates that the assignment of data$angle in GeomTextRadiate$draw_panel() fails to account for the non-1:1 aspect ratio produced by coord_cartesian().

library(ordr)
#> Loading required package: ggplot2
#> 
#> Attaching package: 'ordr'
#> The following object is masked from 'package:stats':
#> 
#>     cmdscale
#> The following objects are masked from 'package:base':
#> 
#>     eigen, svd
euro.cross %>%
  log() %>% abs() %>%
  cmdscale(k = 2) %>%
  as_tbl_ord() %>%
  augment() %>%
  print() -> euro_mds
#> # A tbl_ord of class 'cmds': (11 x 2) x (11 x 2)'
#> # 2 coordinates: PCo1 and PCo2
#> # 
#> # U: [ 11 x 2 | 1 ]
#>     PCo1     PCo2 |   .name
#>                   |   <chr>
#> 1  0.368  3.61e-8 | 1 ATS  
#> 2 -0.708  1.60e-8 | 2 BEF  
#> 3  2.32  -7.93e-9 | 3 DEM  
#> 4 -2.12   2.60e-8 | 4 ESP  
#> 5  1.21  -7.63e-9 | 5 FIM  
#> # … with 6 more rows
#> # 
#> # V: [ 11 x 2 | 1 ]
#>     PCo1     PCo2 |   .name
#>                   |   <chr>
#> 1  0.368  3.61e-8 | 1 ATS  
#> 2 -0.708  1.60e-8 | 2 BEF  
#> 3  2.32  -7.93e-9 | 3 DEM  
#> 4 -2.12   2.60e-8 | 4 ESP  
#> 5  1.21  -7.63e-9 | 5 FIM  
#> # … with 6 more rows
euro_mds %>%
  ggbiplot() +
  coord_cartesian() +
  geom_u_vector() +
  geom_u_text_radiate(aes(label = .name), hjust = .3)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Created on 2019-06-13 by the reprex package (v0.2.1)
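For reference, the corrective computation is to convert the direction to screen space before taking the angle. A standalone sketch follows; the panel ranges and physical sizes would have to be recovered inside draw_panel(), which is the nontrivial part, and all names here are illustrative rather than GeomTextRadiate code.

# illustrative sketch, not GeomTextRadiate code
radiate_angle <- function(x, y, x_range, y_range, panel_width, panel_height) {
  # rescale the data-space direction (x, y) to screen space, then take the angle
  dx <- x / diff(x_range) * panel_width
  dy <- y / diff(y_range) * panel_height
  atan2(dy, dx) * 180 / pi
}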

class-'lm' methods do not recognize models fitted with vectors/matrices in formula form

Neither geom_cols_vector() nor geom_cols_axis() produces graphical output when used with a linear regression object:

library(ggbiplot)
#> Loading required package: ggplot2
#> Loading required package: plyr
#> Loading required package: scales
#> Loading required package: grid
library(ordr)
#> 
#> Attaching package: 'ordr'
#> The following object is masked from 'package:ggbiplot':
#> 
#>     ggbiplot
y <- c(3, 12, 18)
x1 <- c(9, 13, 16)
x2 <- c(-4, 17, 20)
x1s <- scale(x1)
x2s <- scale(x2)
ys <- scale(y)
mod <- lm(ys ~ x2s + x1s)
mod_ordr <- as_tbl_ord(mod)
mod_ordr %>%
  augment_ord() %>%
  ggbiplot(aes(x1s, x2s)) +
  geom_rows_point() +
  geom_rows_text(aes(label = .name), size = 3, nudge_x = .1) +
  geom_cols_vector(color = "red", lwd = 1.5) +
  geom_cols_axis()

Created on 2021-08-25 by the reprex package (v2.0.1)

Desired output approximately:

library(ggplot2)
y <- c(3, 12, 18)
x1 <- c(9, 13, 16)
x2 <- c(-4, 17, 20)
x1s <- scale(x1)
x2s <- scale(x2)
ys <- scale(y)
yc <- scale(y, scale = FALSE)
mod <- lm(ys ~ x2s + x1s)
b <- mod$coefficients[2:3]
tm <- seq(0, 20, 1)
tmc <- tm - mean(y)
Xc <- cbind(x1s, x2s)
c <- calibrate::calibrate(b[2:1], yc, tmc, Xc, tmlab = tm, tl = .1,
                          graphics = FALSE, verb = FALSE)
dfpoints <- data.frame(ys, x1s, x2s)

dfaxis <- data.frame(x = c$M[1, 1], y = c$M[1, 2],
                     xend = c$M[nrow(c$M), 1], yend = c$M[nrow(c$M), 2])
dfticks <- data.frame(c$M, c$Mn, tm)
colnames(dfticks) <- c("x", "y", "xend", "yend", "label")

dfarrows <- data.frame(xend = b[2], yend = b[1])

ggplot(dfpoints, aes(x = x1s, y = x2s)) +
  geom_point() +
  geom_text(aes(x = x1s + .1, y = x2s, label = 1:3)) +
  geom_segment(data = dfticks, aes(x = x, y = y, xend = xend, yend = yend), col = "blue") +
  geom_text(data = dfticks, aes(x = xend, y = yend, label = label), size = 3, nudge_y = -.05) +
  geom_segment(data = dfaxis, aes(x = x, y = y, xend = xend, yend = yend), col = "blue") +
  geom_segment(data = dfarrows, aes(x = 0, y = 0, xend = xend, yend = yend),
               arrow = arrow(length = unit(.03, "npc")), color = "red", lwd = 1.5) +
  coord_fixed()

Created on 2021-08-25 by the reprex package (v2.0.1)

spinoff auxiliary package for specialized classes & methods

{ordr} contains a great deal of generics, tidiers, and {ggplot2} extensions. To reduce bulk, it would make sense to include methods only for the most common techniques, in terms both of distribution and of use; less commonly distributed and used techniques will be of interest to fewer users.

The natural solution is to have a spinoff package, e.g. {ordr.extra}, that provides these additional methods. It would also be the natural place for user-contributed methods, leaving {ordr} more or less feature-complete.

automated tests

Add tests for all functionality (excluding visual regression):

  • accessors
  • augmentation
  • annotation
  • conference
  • alignment
  • classes
  • methods for ordination objects
  • reconstruction
  • formatting
  • dplyr verbs
  • fortification
  • biplotting (using layer_data())
  • LRA

change primary matrix designators from u and v to row and col

The use of U and V throughout the package, e.g. as values of the .matrix column and as infixes for plot layers, to distinguish row and column perspectives on ordination data has always been problematic: It is likely both ambiguous to new users and contested (e.g. versus F and G) by seasoned users. In contrast, whatever the method being used, the designators "row" and "col" will be unambiguous. They will also be easier to visually distinguish, while only slightly longer, and more recognizable analogues to "node" and "edge" in tidygraph.

separate alignment functionality from inspection methods

Currently alignment functions like align_to() add attributes to tbl_ord objects that tell format() and fortify() methods how to transform the matrices stored in the ordination model object before printing and plotting. Unlike specifying the conference of inertia, these attributes fundamentally change the nature of the data presented to the user from its core meaning as stored in the model object. This adds complexity and confusion, rather than realizing a trade-off between the two.

Instead, the alignment functions should return a plain tibble comprising the aligned coordinates and any other variables augmented or annotated to the tbl_ord object. A .matrix argument can be used to specify which matrix, U or V, to apply the alignment to, the default being both (in which case a .matrix column will be included in the output). The result can then still be passed to ggbiplot(), but it will not remain an ordination model masquerading as a wholly different model (that in most cases would never be returned by a reasonable ordination function).

remove or repurpose tidy methods

As explained in the broom package README and elaborated in this tidymodels tutorial on custom tidier methods, the tidy() generic is intended to "summarize[] information about model components", complementarily to the augment() generic's role of "add[ing] information about observations to a data set". In ordr, some liberty is taken with augment() which treats cases and variables both as "observations", but tidy() is incompatibly defined as a wrapper for fortify(), i.e. to prepare an ordination model for a (bi)plot.

tidy() could simply be omitted from ordr. A better option, I think, is to consider the shared artificial coordinates of an ordination model its "components" and have tidy() assume the role currently played by augment_coord().

single-factor format and conference methods for eigendecomposition-based ordinations

Currently, eigendecomposition-based ordination objects like those returned by the cmdscale() mask (which should perhaps be changed to cmdscale_ord()) are formatted as though they had distinct row and column coordinates, and inertia can be doubly conferred onto either factor. There may be other related problems. The format.tbl_ord() method should be able to distinguish one- from two-factor ordinations, which will help prepare for the constrained ordinations of corybrunson/ordr.extra#6, and inertia should be conferrable only to this one factor.

measures of fit and their standard names

background

Gower et al. (2011) detail several measures of fit for biplots, most prominently

  • the quality of the $r$-dimensional biplot, measured as the proportion of variance in the plot, calculated as the quotient of the traces of $\Lambda_r = {D_r}^2$ and of $\Lambda = D^2$.
  • the adequacy of the representation of the $j$-th row (respectively, column) in the $r$-dimensional biplot, calculated as the $j$-th diagonal element of $U_r\ {U_r}^\top$ (respectively, $V_r\ {V_r}^\top$), understood as the fidelity of the projections of the standard coordinates.
  • the predictivity of the $j$-th row (respectively, column) in the $r$-dimensional biplot, measured as the quotient of the $j$-th diagonal elements of $U_r\ \Lambda_r\ {U_r}^\top$ and of $U\ \Lambda\ U^\top$ (respectively, of $V_r\ \Lambda_r\ {V_r}^\top$ and of $V\ \Lambda\ V^\top$), understood as the fidelity of the projections of the principal coordinates.

These can be calculated directly from any SVD or EVD and interpreted for any technique based on them. In some cases they may also be calculated for supplementary elements.
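A base-R sketch of these computations from a plain SVD (mirroring the definitions above rather than any ordr internals; the column versions are shown, and the row versions swap v for u):

X <- scale(iris[, 1:4])
s <- svd(X)
lambda <- s$d^2
r <- 2L
quality <- sum(lambda[1:r]) / sum(lambda)          # proportion of variance in r dimensions
V <- s$v; Vr <- V[, 1:r, drop = FALSE]
adequacy <- rowSums(Vr^2)                          # diag(V_r V_r')
predictivity <- rowSums(sweep(Vr^2, 2, lambda[1:r], `*`)) /
  rowSums(sweep(V^2, 2, lambda, `*`))              # diag(V_r L_r V_r') / diag(V L V')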

suggestions

  1. A new .quality column, calculated as cumsum(.prop_var), could be added to the output of tidy.tbl_ord().
  2. The 1- or 2-dimensional adequacy and predictivity could be computed for all wrapped classes by augment_ord(), possibly via an option measures_of_fit = TRUE. (It would not be appropriate to annotate a tbl_ord with all $n \times k$ or $p \times k$ adequacies or predictivities.)
  3. Adequacy and predictivity for a specific value of $r$ could be computed in a mutate_*() call, e.g. mutate_rows(ord, fit_std = adequacy(dimension = 2L)), where adequacy() knows and is able to recover the necessary model components (cf. computing node and edge properties in tidygraph).

The value of (1) is, I think, self-evident. Probably only one of (2) and (3) would be appropriate, and I lean toward (3). Either would be valuable both (a) for downstream analysis of rows and columns and (b) as aesthetic mappings in biplots (e.g. to increase marker/vector opacity with predictivity/adequacy).

implementation

(2) and (3) would be supported by new recovery generics, possibly for the matrices of standard and of principal coordinates. (3) would probably require registration of the underlying model object within the wrapper, as in tidygraph.

write fortify method for coordinates

Alongside fortify_u() and fortify_v(), introduce a function fortify_coord() that returns a tibble with the names of the artificial coordinates, the decomposition of variance obtained by recover_inertia() (as a .inertia.ord column), and any metadata obtained by augmentation_coord(). Allow the additional values "coord" and "coordinates" for the .matrix parameter of fortify.tbl_ord (and, by extension, of tidy.tbl_ord) to pass the ordination object to fortify_coord(), ignoring the include parameter.

Relatedly, send ... in ggbiplot() to the fortify() call rather than the ggplot() call. For completeness, also add an environment parameter to ggbiplot() and pass it to the same parameter in ggplot().

These changes will streamline the process of generating a scree plot, or any other visual summary of the artificial coordinates, from a tbl_ord object using ggplot2. Specifically, something like this should work:

library(ordr)
pca <- as_tbl_ord(prcomp(USArrests))
ggplot(pca, .matrix = "coord", aes(x = .name, y = .inertia.ord)) + geom_bar(stat = "identity")

function to send specific columns of a data frame to an ordination technique

A persistent difficulty at present is to retain original data columns as row augmentation for a 'tbl_ord' object. Two examples:

  • The ordination function does not retain row names, e.g. logisticPCA(). In this case a wrapper function solves the problem but not in the most desirable and generalizable way.
  • Only a subset of columns of a matrix or data frame are passed to the ordination function. This is most conspicuous with lda(), but there are many examples in which non-numeric columns are useful, e.g. for plot annotation. This is usually solved using left_join_rows() or bind_cols_rows() with the original object, an awkward move in a magrittr pipeline.

A solution may be to write a new function, e.g. ordinate(), with three named arguments, .data, .cols, and .f, that would look like this:

pca <- ordinate(
  .data = iris,
  .cols = 1:4,
  .f = ~ prcomp(., center = TRUE, scale. = FALSE)
)

Or like this:

pca <- ordinate(
  .data = iris,
  .cols = 1:4,
  .f = prcomp,
  center = TRUE, scale. = TRUE
)

The function would perform the ordination .f on .data subsetted to .cols, encode the result as a 'tbl_ord', and augment the rows of this object with any unused columns of .data. (Alternatively, an additional augment parameter could specify a subset of columns to augment.) The version where .f is passed a formula would enable tab completion for model parameters.

scale secondary axis to harmonize row and column inertias

Gower, Lubbe, and Le Roux (2011) recommend an optimized scale factor $\lambda$ to harmonize the row and column elements (here I'm using their symbol $\lambda$ but not exactly their definition of it): If an $n \times p$ data matrix $X$ is singular value decomposed and the plotted elements are $F = U\ D^a$ and $\lambda\ G = \lambda\ V\ D^b$ (ideally with $a + b = 1$), then $\lambda$ should be chosen so that $n^{-1} \text{tr}(F\ F^\top) = p^{-1} \text{tr}(\lambda\ G\ (\lambda\ G)^\top)$, i.e. so that the row and column inertias are harmonized. This gives $$\lambda^2 = \displaystyle\frac{p \text{tr}(U\ D^{2a}\ U^\top)}{n \text{tr}(V\ D^{2b}\ V^\top)}\text{,}$$ which is an option that should be provided, perhaps as scale.factor = "variance" or scale.factor = "inertia". The current default of harmonizing the ranges could then be made more explicit as scale.factor = "range", and the user could pass scale.factor = NULL to allow ggbiplot() to choose. (The variance method is probably a better default than the range method.)
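A base-R sketch of the proposed computation from a plain SVD (assuming a row-principal biplot, i.e. a = 1 and b = 0; this is not ggbiplot() code):

X <- scale(iris[, 1:4])
s <- svd(X)
a <- 1; b <- 0
Fmat <- s$u %*% diag(s$d^a)                          # row coordinates
Gmat <- s$v %*% diag(s$d^b)                          # column coordinates
n <- nrow(X); p <- ncol(X)
lambda <- sqrt(p * sum(Fmat^2) / (n * sum(Gmat^2)))  # so that tr(F F')/n = tr((lambda G)(lambda G)')/p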

make `augment()` consistent with other packages; prevent redundant columns

In broom, augment() returns a tibble, as fortify() does here. In ordr, augment() should perhaps be an alias for fortify() (or vice-versa), for consistency with other packages and with the matrix-specific augment_*(). This would also prevent the redundancy illustrated below.

A new function should annotate both matrices with the columns produced by augment_*(). If the additional columns generated by augment_*() duplicate existing column names, then this function should throw an error.

library(ordr)
#> Loading required package: ggplot2
#> 
#> Attaching package: 'ordr'
#> The following object is masked from 'package:stats':
#> 
#>     cmdscale
#> The following objects are masked from 'package:base':
#> 
#>     eigen, svd
test <- augment(as_tbl_ord(svd(USArrests)))
(augment(test))
#> # A tbl_ord of class 'svd': (50 x 4) x (4 x 4)'
#> # 4 coordinates: SV1, SV2, ..., SV4
#> # 
#> # U: [ 50 x 4 | 2 ]
#>      SV1      SV2     SV3 ... |   .name      .name1    
#>                               |   <chr>      <chr>     
#> 1 -0.172  0.0963   0.0652     | 1 Alabama    Alabama   
#> 2 -0.189  0.173   -0.427  ... | 2 Alaska     Alaska    
#> 3 -0.216  0.0790   0.0206     | 3 Arizona    Arizona   
#> 4 -0.139  0.0599   0.0139     | 4 Arkansas   Arkansas  
#> 5 -0.207 -0.00981 -0.176      | 5 California California
#> # ... with 45 more rows
#> # 
#> # V: [ 4 x 4 | 2 ]
#>       SV1     SV2     SV3 ... |   .name    .name1  
#>                               |   <chr>    <chr>   
#> 1 -0.0424  0.0162 -0.0659     | 1 Murder   Murder  
#> 2 -0.944   0.321   0.0666 ... | 2 Assault  Assault 
#> 3 -0.308  -0.938   0.155      | 3 UrbanPop UrbanPop
#> 4 -0.110  -0.127  -0.983      | 4 Rape     Rape

Created on 2018-09-24 by the reprex package (v0.2.0.9000).

extend ord_aes from spantree to other biplot stats

The package extends or introduces several nonlinear geometric statistical transformations (center, star, ellipse, convex hull, conical hull, minimal spanning tree) for the matrix factors of an ordination, but only the minimal spanning tree stat is equipped to handle the ord_aes() specification for all artificial coordinates from the ordination model. Each of these layers should be able to, since the result will often be different but still usefully visualizable.

move Greenacre examples to a separate repo

Early work on this package relied heavily on data and examples from Michael Greenacre's work, in particular Biplots in Practice. These data sets and examples should be moved to a separate illustrative document rather than installed with the package (they are still valuable for validation purposes), and the functionality they demonstrate should instead be illustrated using data sets from the usual R distribution and from dependency packages where possible.

Is psych::principal supported?

Hi There
I have been exploring many packages to try and find a biplot function for psych::principal output. Do you plan to add support for this?

plot principal components in descending order in scree plot

Currently, when ggplot() is used to plot a scree plot for an augmented prcomp() object, the principal components are ordered alphabetically instead of in descending order.

library(ordr)
pca <- augment(as_tbl_ord(prcomp(mtcars, center = TRUE, scale. = TRUE)))
pcacoord <- augmentation_coord(pca)
pcainertia <- recover_inertia(pca)
pcainertia <- data.frame(pcainertia = pcainertia)
pcacoord <- cbind(pcacoord, pcainertia)
ggplot(pcacoord, aes(x = .name, y = pcainertia/sum(pcainertia))) + geom_bar(stat = "identity") + scale_y_continuous(labels = scales::percent)

[Screenshot: scree plot of the mtcars PCA with principal components ordered alphabetically rather than in descending order]
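A workaround sketch, pending a fix in the package (assumption: it is acceptable to reorder the factor levels of .name before plotting):

# keep the coordinate names in their original (descending-inertia) order
pcacoord$.name <- factor(pcacoord$.name, levels = unique(pcacoord$.name))
ggplot(pcacoord, aes(x = .name, y = pcainertia / sum(pcainertia))) +
  geom_col() +
  scale_y_continuous(labels = scales::percent)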

have default grid lines on x and y axes agree

Consistent with defaulting to an aspect ratio of 1, this would make the default plot more readily interpretable. It is not trivial to do this internally, but it may involve choosing the "better" of the two defaults and assigning it to both.
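As a stopgap applied to a finished plot (a sketch only: p stands for an already-constructed biplot, and replacing the scales would discard any secondary axes), shared breaks can be imposed after building:

# impose the same break positions on both axes after the plot is built
lims <- range(layer_scales(p)$x$range$range, layer_scales(p)$y$range$range)
brks <- scales::extended_breaks()(lims)
p + scale_x_continuous(breaks = brks) + scale_y_continuous(breaks = brks)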

unit circle geom

Hopefully there can be geom_*_unitcircle() options for each matrix factor that respect the primary versus secondary axes (when present).
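Until then, a manual stopgap (a sketch using the iris_pca object from the README example and the current row/column layer names; it does not address the secondary-axis question):

# draw a unit circle by hand with annotate()
t <- seq(0, 2 * pi, length.out = 100L)
ggbiplot(iris_pca) +
  geom_cols_vector() +
  annotate("path", x = cos(t), y = sin(t), linetype = "dashed")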

pairs biplot

Because the artificial coordinates of an ordination exist on the same scale, it makes a lot of sense to be able to produce a pairs biplot consisting of pairwise plots of the artificial coordinates against one another. GGally::ggpairs() might be a good starting point, but a biplot function would ideally return a 'ggplot' object that could accept ordr layers. (A natural name would be ggbipairs().)
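In the meantime, a rough sketch using the fortified row coordinates (assumptions: GGally is installed, iris_pca is the README example object, and the .matrix column labels the row factor as "rows"):

library(GGally)
iris_rows <- subset(fortify(iris_pca), .matrix == "rows")
ggpairs(iris_rows, columns = c("PC1", "PC2", "PC3", "PC4"),
        mapping = aes(color = Species))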

extend subset parameter to all biplot layers

The experimental subset parameter of GeomIsolines$setup_data() should be extended to all plot layers.

To consistently distinguish between ggplot() and ggbiplot(), it would be good, if possible, to enable this parameter only for *_rows_*() and *_cols_*() layers. (Should it be a parameter of stat_rows(), stat_cols(), and the other new stat layers rather than of Geom*$setup_data()?)

An important decision is what inputs subset should be able to handle. Among the possibilities (a normalization sketch follows the list):

  • a positive (negative) integer vector indicating the rows to include (exclude) – understood by [.data.frame() and dplyr::slice() but not by subset(); almost definitely worth including since it will come most naturally to users
  • a logical vector of length the number of rows – understood by [.data.frame() and subset() but not by dplyr::slice(); should not cause confusion, but could just be required to be which()ed instead of handled separately
  • a character vector of row names – understood by [.data.frame() but not by subset() or dplyr::slice(); could cause confusion because print.tbl_ord() uses tibble-like printing, which does not display row names as print.data.frame() does, but might be important for large data workflows
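A sketch of the input normalization implied by the list above (resolve_subset() is a hypothetical helper, not ordr code):

resolve_subset <- function(subset, data) {
  if (is.character(subset)) {
    which(rownames(data) %in% subset)            # row names
  } else if (is.logical(subset)) {
    which(subset)                                # logical mask of length nrow(data)
  } else if (all(subset <= 0)) {
    setdiff(seq_len(nrow(data)), abs(subset))    # negative indices exclude rows
  } else {
    as.integer(subset)                           # positive indices include rows
  }
}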

determine best implementation of unit circle and of origin crosshairs

Currently, a unit circle can be rendered using a custom geom, which in principle does not need aesthetics from the data. This graphical element may be more appropriately implemented as a theme element, which can be registered as documented here. Similarly, crosshairs located at the origin are common in biplots from which (artificial) coordinate axes have been discarded, as in the book Understanding Biplots. These could also be registered as a theme element. Presumably, it should then be possible to define new themes, e.g. theme_biplot() and theme_unit_circle(), that include these elements (and, e.g., exclude grid lines).

additional specific tests

Specific additional tests are needed, based on recent experimentation:

  • augment_ord() does not increase the number of rows of either factor
  • augment_ord() returns a 'tbl_ord' object
  • glance() returns a single row
  • either both or neither augmentation_*() methods introduce an '.element' column

keep radiated labels within plotting window

Currently, geom_text_radiate() adjusts hjust internally to shift text away from arrowheads, but the plotting window does not account for this.

library(ordr)
pca <- prcomp(USPersonalExpenditure, center = FALSE)
pca_ord <- as_tbl_ord(pca)
pca_ord <- confer_inertia(pca_ord, c(.5, .5))
ggbiplot(pca_ord, aes(label = .name)) +
  theme_bw() +
  geom_v_vector() +
  geom_v_text_radiate() +
  geom_u_point()

[Plot: the radiating variable labels extend beyond the plotting window]

projection (prediction) and vector sum (interpolation) geoms

A geom layer geom_u_projection(from = i, to = j) should render a (by default dashed) line from the ordinates of the ith row of $U$ to the linear subspace containing those of the jth row of $V$, maybe with an optional cute right angle symbol. Both from and to could contain multiple indices. The projection should adopt the axes (primary or secondary) used by $U$.
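The underlying geometry, as a base-R sketch (using plain prcomp() output rather than ordr accessors; the indices i = 1 and j = 1 are arbitrary):

pca <- prcomp(iris[, 1:4], scale. = TRUE)
u_i <- pca$x[1, 1:2]                                  # planar coordinates of case i
v_j <- pca$rotation[1, 1:2]                           # loading vector of variable j
proj <- as.numeric(crossprod(u_i, v_j) / sum(v_j^2)) * v_j   # projection of u_i onto span(v_j)
segment <- data.frame(x = u_i[1], y = u_i[2], xend = proj[1], yend = proj[2])
# geom_u_projection() would render `segment` as a (dashed) line for each (from, to) pair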

allow exclusion of supplementary points in ggbiplot call

See below for an illustration. This can easily be done by passing ... from ggbiplot() to fortify(). In resolving #12 I opted not to, partly for simplicity. It's not clear that this would disrupt any functionality, so it may be worth doing in a separate branch and re-running examples to be sure.

library(ordr)
#> Loading required package: ggplot2
#> 
#> Attaching package: 'ordr'
#> The following object is masked from 'package:stats':
#> 
#>     cmdscale
# create 'tbl_ord' object with supplementary points
MASS::lda(group ~ ., heplots::Diabetes) %>%
  as_tbl_ord() %>%
  augment() %>%
  mutate_u(discriminant = ifelse(! .supplement, "centroid", "case")) ->
  lda
# exclude supplementary points in `fortify()`
lda %>%
  fortify(.supplement = FALSE) %>%
  ggbiplot() +
  geom_u_point(aes(shape = .grouping))

# want to be able to exclude supplementary points in `ggbiplot()`
lda %>%
  ggbiplot(.supplement = FALSE) +
  geom_u_point(aes(shape = .grouping))

Created on 2019-07-27 by the reprex package (v0.2.1)

vignette to showcase an example analysis using each supported class

It would probably be useful for users to be able to skim a single vignette consisting of short data analyses using ordination techniques supported by ordr. These should not be a mere biplot gallery but rather a set of self-contained studies, separated into sections, each with a clear and concise question or hypothesis, an explanation of the data, several exploratory and/or explanatory analysis steps, and a conclusion.

The vignette could be prepared gradually, one section at a time. Ideally these would not duplicate the examples, and in particular different data sets would be used, though with minimal additional package dependencies.

methods (and function?) for linear discriminant analysis

LDA is case–variable ordination based on group centroids (in the variable space) rather than individual cases. If the lda object returned by MASS::lda() doesn't contain the elements for all accessor methods in ordr, then a new function lda_ord() may be warranted.

LDA centroid and case groupings are differently placed

See the documentation for examples of consequences. The proximate cause is that the grouping variable is augmented as .grouping for the cases but (only) as .name for the centroids. It should also go to .grouping for the centroids, and .name should perhaps be changed, depending on how the original function's output is structured.

combine isoline and axis geoms

Currently, several geoms must be combined to form annotated isolines and annotated axes. As mentioned here, this can make a single plot a pain to layer. Ideally, these should be created by a single geom each. Since they involve elements of the same type (e.g. axes and tick marks) that may require different aesthetics (e.g. dashed and solid), additional aesthetics or parameters will be needed. There should be plenty of examples of how to do this effectively in other ggplot2 extensions.

debug geom ticks

geom_*_ticks() seems to work well for the first row of either matrix factor but is buggy for later rows. Examples follow. In addition to resolving this behavior, the intervals at which ticks are located should be determined similarly to how it is done for the plotting window, if possible using the same underlying functions.

library(ordr)
#> Loading required package: ggplot2
#> 
#> Attaching package: 'ordr'
#> The following object is masked from 'package:stats':
#> 
#>     cmdscale
MASS::lda(group ~ ., heplots::Diabetes) %>%
  as_tbl_ord() %>%
  augment() %>%
  mutate_u(discriminant = ifelse(! .supplement, "centroid", "case")) %>%
  print() -> lda
#> # A tbl_ord of class 'lda': (148 x 2) x (5 x 2)'
#> # 2 coordinates: LD1 and LD2
#> # 
#> # U: [ 148 x 2 | 11 ]
#>      LD1    LD2 |   .name .grouping .prior .counts
#>                 |   <chr> <fct>      <dbl>   <int>
#> 1 -1.75   0.400 | 1 Norm… Normal     0.524      76
#> 2  0.340 -1.38  | 2 Chem… Chemical…  0.248      36
#> 3  3.66   0.580 | 3 Over… Overt_Di…  0.228      33
#> 4 -1.72   0.663 | 4 1     Normal    NA          NA
#> 5 -2.85   1.30  | 5 2     Normal    NA          NA
#> # … with 143 more rows, and 7
#> #   more variables:
#> #   .centroid.relwt <dbl>,
#> #   .centroid.glufast <dbl>,
#> #   .centroid.glutest <dbl>,
#> #   .centroid.instest <dbl>,
#> #   .centroid.sspg <dbl>,
#> #   .supplement <lgl>,
#> #   discriminant <chr>
#> # 
#> # V: [ 5 x 2 | 1 ]
#>         LD1      LD2 |   .name  
#>                      |   <chr>  
#> 1  1.36     -3.78    | 1 relwt  
#> 2 -0.0336    0.0366  | 2 glufast
#> 3  0.0126   -0.00709 | 3 glutest
#> 4 -0.000102 -0.00617 | 4 instest
#> 5  0.00424   0.00113 | 5 sspg
# good rendering:
ggbiplot(lda) +
  theme_bw() +
  geom_u_point(aes(shape = .grouping, size = discriminant), alpha = .5) +
  geom_v_axis(color = "lightgrey") +
  geom_v_ticks(ids = 1)
#> Warning: Using size for a discrete variable is not advised.

# bad rendering (too few, too long):
ggbiplot(lda) +
  theme_bw() +
  geom_u_point(aes(shape = .grouping, size = discriminant), alpha = .5) +
  geom_v_axis(color = "lightgrey") +
  geom_v_ticks(ids = 2)
#> Warning: Using size for a discrete variable is not advised.

# bad rendering (remaining variables):
ggbiplot(lda) +
  theme_bw() +
  geom_u_point(aes(shape = .grouping, size = discriminant), alpha = .5) +
  geom_v_axis(color = "lightgrey") +
  geom_v_ticks(ids = 3:5)
#> Warning: Using size for a discrete variable is not advised.

Created on 2019-07-27 by the reprex package (v0.2.1)

generic + methods for ordinate

The ordinate() function currently only supports data frame input, despite multiple techniques (CMDS, CA) accepting or requiring other input types. An obvious improvement would be to make this function a generic and use method dispatch to handle data.frame, array, dist, and possibly other input.

This would also be a good opportunity to reconsider the function's arguments. data is appropriately first (from a piping perspective), but model often feels more intuitive than cols as the second, and cols would be ignored (if not omitted altogether) by matrix and dist methods. tidyr::pivot_*() and parsnip::fit() are helpful points of reference but aren't determinative. Certainly an NB should be added to the effect that parameter order may change and users should habitually name all arguments (as should the examples). Necessarily ... will continue to pass model parameters.
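A dispatch sketch (hypothetical: ordinate() is not currently generic, so a stand-in name ordinate2() and illustrative method signatures are used; the dist method assumes ordr's cmdscale mask returns an object that as_tbl_ord() accepts):

ordinate2 <- function(data, ...) UseMethod("ordinate2")
ordinate2.data.frame <- function(data, cols, model, ...) {
  ordinate(data, cols = cols, model = model, ...)   # delegate to the existing function
}
ordinate2.dist <- function(data, model = cmdscale, ...) {
  as_tbl_ord(model(data, ...))                      # no column selection for distances
}
# e.g. ordinate2(iris, cols = 1:4, model = ~ prcomp(., scale. = TRUE))
# e.g. ordinate2(dist(USArrests), k = 2L)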

Installation error "No commit found for the ref master"

Running R 3.6 and trying to install

remotes::install_github("corybrunson/ordr")

Returns


> 
> Error: Failed to install 'unknown package' from GitHub:
>   HTTP error 404.
>   No commit found for the ref master
> 
>   Did you spell the repo owner (`corybrunson`) and repo name (`ordr`) correctly?
>   - If spelling is correct, check that you have the required permissions to access the repo.

For context:

Sys.info()


>                                               sysname                                               release 
>                                               "Linux"                                    "5.4.0-72-generic" 
>                                               version                                              nodename 
> "#80~18.04.1-Ubuntu SMP Mon Apr 12 23:26:25 UTC 2021"                                          "tundraswan" 
>                                               machine                                                 login 
>                                              "x86_64"                                               "jacob" 
>                                                  user                                        effective_user 
>                                               "jacob"                                               "jacob" 

What am I missing here?

remove heplots dependency

This package Suggests heplots only in order to use the diabetes data set in several examples. Another data set should be used in its stead so that the dependency can be removed.

use 'outward' in geom-text examples

I only recently learned about the settings hjust = "outward", vjust = "outward" in geom_text(), which are very well-suited to biplots (being often star-shaped). Examples that use geom_*_text_repel() could probably be improved by instead using geom_*_text() with these settings.

Issues with next version of ggplot2

Hi

We are preparing the next release of ggplot2 and our reverse dependency checks show that your package is failing with the new version. Looking into it, we see that the issues are due to our switch to using the linewidth aesthetic for stroke width instead of size. Please see https://www.tidyverse.org/blog/2022/08/ggplot2-3-4-0-size-to-linewidth/ for more info on this.

You can install the release candidate of ggplot2 using devtools::install_github('tidyverse/[email protected]') to test this out.

We plan to submit ggplot2 by the end of October and hope you can have a fix ready before then.

Kind regards
Thomas
