dataslingers / moma

MoMA: Modern Multivariate Analysis in R

Home Page: https://DataSlingers.github.io/MoMA

License: GNU General Public License v2.0

R 60.78% C++ 39.22%
multivariate-analysis multivariate-statistics statistical-learning statistics principal-component-analysis partial-least-squares canonical-correlation-analysis sparsity smoothness r

moma's Introduction

MoMA: Modern Multivariate Analysis

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

MoMA is a penalized SVD framework that supports a wide range of sparsity-inducing penalties. For a matrix X, MoMA gives the solution to the following optimization problem:
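
Following the sparse and functional PCA formulation of Allen [21], on which MoMA builds, the rank-one problem can be written as

$$\max_{u, v} \; u^\top X v - \lambda_u P_u(u) - \lambda_v P_v(v) \quad \text{subject to} \quad u^\top S_u u \le 1, \; v^\top S_v v \le 1,$$

where $S_u = I + \alpha_u \Omega_u$ and $S_v = I + \alpha_v \Omega_v$ encode smoothness, the $P$'s are sparsity-inducing penalties, and $\lambda_u, \lambda_v, \alpha_u, \alpha_v \ge 0$ are tuning parameters; higher-rank solutions are obtained by deflation.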


The penalties (the P functions) we support so far include

  • moma_lasso(): LASSO (least absolute shrinkage and selection operator).

  • moma_scad(): SCAD (smoothly clipped absolute deviation).

  • moma_mcp(): MCP (minimax concave penalty).

  • moma_slope(): SLOPE (sorted $\ell$-one penalized estimation).

  • moma_grplasso(): Group LASSO.

  • moma_fusedlasso(): Fused LASSO.

  • moma_spfusedlasso(): Sparse fused LASSO.

  • moma_l1tf(): $\ell_1$ trend filtering.

  • moma_cluster(): Cluster penalty.

With this at hand, we can easily extend classical multivariate models:

  • moma_sfpca() performs penalized principal component analysis.

  • moma_sfcca() performs penalized canonical correlation analysis.

  • moma_sflda() performs penalized linear discriminant analysis.

We also provide Shiny App support to facilitate interaction with the results. If you are new to MoMA, the best place to start is vignette("MoMA").

Installation

The newest version of the package can be installed from GitHub:

library(devtools)
install_github("DataSlingers/MoMA", ref = "master")

Usage

Perform sparse linear discriminant analysis on the Iris data set.

library(MoMA)

## collect data
X <- iris[, 1:4]
Y_factor <- as.factor(rep(c("s", "c", "v"), rep(50, 3)))

## range of penalty
lambda <- seq(0, 1, 0.1)

## run!
a <- moma_sflda(
    X = X,
    Y_factor = Y_factor,
    x_sparse = moma_lasso(lambda = lambda),
    rank = 3
)

plot(a) # start a Shiny app and play with it!

Background

Multivariate analysis – the study of finding meaningful patterns in datasets – is a key technique in any data scientist’s toolbox. Beyond its use for Exploratory Data Analysis (“EDA”), multivariate analysis also allows for principled Data-Driven Discovery: finding meaningful, actionable, and reproducible structure in large data sets. Classical techniques for multivariate analysis have proven immensely successful through history, but modern Data-Driven Discovery requires new techniques to account for the specific complexities of modern data. This package provides a new unified framework for Modern Multivariate Analysis (“MoMA”), which will provide a unified and flexible baseline for future research in multivariate analysis. Even more importantly, we anticipate that this easy-to-use R package will increase adoption of these powerful new models by end users and, in conjunction with R’s rich graphics libraries, position R as the leading platform for modern exploratory data analysis and data-driven discovery.

Multivariate analysis techniques date back to the earliest days of statistics, pre-dating other foundational concepts like hypothesis testing by several decades. Classical techniques such as Principal Components Analysis (“PCA”) [1, 2], Partial Least Squares (“PLS”), Canonical Correlation Analysis (“CCA”) [3], and Linear Discriminant Analysis (“LDA”), have a long and distinguished history of use in statistics and are still among the most widely used methods for EDA. Their importance is reflected in the CRAN Task View dedicated to Multivariate Analysis [4], as well as the specialized implementations available for a range of application areas. Somewhat surprisingly, each of these techniques can be interpreted as a variant of the well-studied eigendecomposition problem, allowing statisticians to build upon a rich mathematical and computational literature.

In the early 2000s, researchers noted that naive extensions of classical multivariate techniques to the high-dimensional setting produced unsatisfactory results, a finding later confirmed by advances in random matrix theory [5]. In response to these findings, multivariate analysis experienced a renaissance as researchers developed a wide array of new techniques to incorporate sparsity, smoothness, and other structure into classical techniques [6,7,8,9,10,11,12,13,14 among many others], resulting in a rich literature on “modern multivariate analysis.” Around the same time, theoretical advances showed that these techniques avoided many of the pitfalls associated with naive extensions [15,16,17,18,19,20].

While this literature is vast, it relies on a single basic principle: for multivariate analysis to succeed, it is essential to adapt classical techniques to account for the known characteristics and complexities of the dataset at hand. For example, a neuroscientist investigating the brain’s response to an external stimulus may expect a response which is simultaneously spatially smooth and sparse: spatially smooth because the brain processes related stimuli in well-localized areas (e.g., the visual cortex) and sparse because not all regions of the brain are used to respond to a given stimulus. Alternatively, a statistical factor model used to understand financial returns may be significantly improved by incorporating known industry sector data, motivating a form of group sparsity. A sociologist studying how pollution leads to higher levels of respiratory illnesses may combine spatial smoothness and sparsity (indicating “pockets” of activity) with a non-negativity constraint, knowing that the association between pollution and illness is positive.

To incorporate these different forms of prior knowledge into multivariate analysis, a wide variety of algorithms and approaches have been proposed. In 2013, Allen proposed a general framework that unified existing techniques for “modern” PCA, as well as proposing a number of novel extensions [21]. The recently developed MoMA algorithm builds on this work, allowing more forms of regularization and structure, as well as supporting more forms of multivariate analysis.

The principal aim of this package is to make modern multivariate analysis available to a wide audience. This package will allow for fitting PCA, PLS, CCA, and LDA with all of the modern “bells-and-whistles:” sparsity, smoothness, ordered and unordered fusion, orthogonalization with respect to arbitrary bases, and non-negativity constraints. Uniting this wide literature under a single umbrella using the MoMA algorithm will provide a unified and flexible platform for data-driven discovery in R.

Authors

  • Michael Weylandt

    Department of Statistics, Rice University

  • Genevera Allen

    Departments of Statistics, CS, and ECE, Rice University

    Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital

  • Luofeng “Luke” Liao

    School of Data Science, Fudan University

Acknowledgements

  • MW was funded by an NSF Graduate Research Fellowship (#1450681).
  • LL was funded by Google Summer of Code 2019.

References

[1] K. Pearson. “On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, p.559-572, 1901. https://doi.org/10.1080/14786440109462720

[2] H. Hotelling. “Analysis of a Complex of Statistical Variables into Principal Components.” Journal of Educational Psychology 24(6), p.417-441, 1933. http://dx.doi.org/10.1037/h0071325

[3] H. Hotelling. “Relations Between Two Sets of Variates.” Biometrika 28(3-4), p.321-377, 1936. https://doi.org/10.1093/biomet/28.3-4.321

[4] See CRAN Task View: Multivariate Statistics

[5] I. Johnstone, A. Lu. “On Consistency and Sparsity for Principal Components Analysis in High Dimensions.” Journal of the American Statistical Association: Theory and Methods 104(486), p.682-693, 2009. https://doi.org/10.1198/jasa.2009.0121

[6] B. Silverman. “Smoothed Functional Principal Components Analysis by Choice of Norm.” Annals of Statistics 24(1), p.1-24, 1996. https://projecteuclid.org/euclid.aos/1033066196

[7] J. Huang, H. Shen, A. Buja. “Functional Principal Components Analysis via Penalized Rank One Approximation.” Electronic Journal of Statistics 2, p.678-695, 2008. https://projecteuclid.org/euclid.ejs/1217450800

[8] I.T. Jolliffe, N.T. Trendafilov, M. Uddin. “A Modified Principal Component Technique Based on the Lasso.” Journal of Computational and Graphical Statistics 12(3), p.531-547, 2003. https://doi.org/10.1198/1061860032148

[9] H. Zou, and T. Hastie, and R. Tibshirani. “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics 15(2), p.265-286, 2006. https://doi.org/10.1198/106186006X113430

[10] A. d’Aspremont, L. El Ghaoui, M.I. Jordan, G.R.G. Lanckriet. “A Direct Formulation for Sparse PCA Using Semidefinite Programming.” SIAM Review 49(3), p.434-448, 2007. https://doi.org/10.1137/050645506

[11] A. d’Aspremont, F. Bach, L. El Ghaoui. “Optimal Solutions for Sparse Principal Component Analysis.” Journal of Machine Learning Research 9, p.1269-1294, 2008. http://www.jmlr.org/papers/v9/aspremont08a.htm

[12] D. Witten, R. Tibshirani, T. Hastie. “A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis.” Biostatistics 10(3), p.515-534, 2009. https://doi.org/10.1093/biostatistics/kxp008

[13] R. Jenatton, G. Obozinski. F. Bach. “Structured Sparse Principal Component Analysis.” Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010. http://proceedings.mlr.press/v9/jenatton10a.html

[14] G.I. Allen, M. Maletic-Savatic. “Sparse Non-Negative Generalized PCA with Applications to Metabolomics.” Bioinformatics 27(21), p.3029-3035, 2011. https://doi.org/10.1093/bioinformatics/btr522

[15] A.A. Amini, M.J. Wainwright. “High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components.” Annals of Statistics 37(5B), p.2877-2921, 2009. https://projecteuclid.org/euclid.aos/1247836672

[16] S. Jung, J.S. Marron. “PCA Consistency in High-Dimension, Low Sample Size Context.” Annals of Statistics 37(6B), p.4104-4130, 2009. https://projecteuclid.org/euclid.aos/1256303538

[17] Z. Ma. “Sparse Principal Component Analysis and Iterative Thresholding.” Annals of Statistics 41(2), p.772-801, 2013. https://projecteuclid.org/euclid.aos/1368018173

[18] T.T. Cai, Z. Ma, Y. Wu. “Sparse PCA: Optimal Rates and Adaptive Estimation.” Annals of Statistics 41(6), p.3074-3110, 2013. https://projecteuclid.org/euclid.aos/1388545679

[19] V.Q. Vu, J. Lei. “Minimax Sparse Principal Subspace Estimation in High Dimensions.” Annals of Statistics 41(6), p.2905-2947, 2013. https://projecteuclid.org/euclid.aos/1388545673

[20] D. Shen, H. Shen, J.S. Marron. “Consistency of Sparse PCA in High Dimension, Low Sample Size Contexts.” Journal of Multivariate Analysis 115, p.317-333, 2013. https://doi.org/10.1016/j.jmva.2012.10.007

[21] G.I. Allen. “Sparse and Functional Principal Components Analysis.” ArXiv Pre-Print 1309.2895 (2013). https://arxiv.org/abs/1309.2895


moma's Issues

Support Correspondence Analysis

Correspondence Analysis is another multivariate analysis technique based on a (generalized) SVD which can probably be cast in the MoMA framework. I don’t know much about it (yet), but it may be worth supporting alongside PCA, PLS, CCA, LDA, etc.

Absorb parameter values into sparsity-type/smoothness-type specification

Perhaps a similar sort of thing:

u_smoothness = second_order_difference(select = TRUE)

I used a similar pattern for fusion weights in the clustRviz package: it's a bit tricky to get your head around, but check it out - it might have some useful ideas.

Related, but separate, question: if we go this route, should parameter values be in the object or as a second argument?

Originally posted by @michaelweylandt in #42

GSoC (Google Summer of Code) 2019 Report

Work done in GSoC 2019

  • Code formatting #33

  • Code coverage #49

  • PG loop argument wrapper #36

  • Design improvements #37

  • R6 PCA wrappers #42

  • Put parameter values in the sparsity / smoothness specification #48

  • Add SFLDA / SFCCA, and documentation #54

How to use it

Please refer to the GitHub Pages site of the repo. It contains detailed documentation of the functions and a couple of illustrative examples.

TODO

  • Extend the package to allow more penalty choices and multivariate methods #52, #19

  • More helper R6 methods to facilitate exploration of the results.

  • Support caching and frame smoothing in Shiny apps.

Citation File

Add a citation file so that citation("MoMA") returns something useful. For now, the DSW SFPCA paper should be fine.
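
A minimal sketch of what inst/CITATION could look like (the entry below reuses reference [21] from the README purely as a placeholder; the actual SFPCA reference intended here should be substituted):

citHeader("To cite the MoMA package in publications, use:")

bibentry(
    bibtype = "Misc",
    title   = "Sparse and Functional Principal Components Analysis",
    author  = person("Genevera", "Allen"),
    year    = "2013",
    note    = "arXiv pre-print 1309.2895",
    url     = "https://arxiv.org/abs/1309.2895"
)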

Accessor with interpolation

#37 (comment)

Longer term, we should think about making an accessor which takes alpha_u, lambda_u, etc (values not indices) and does the extraction. If we don't have exactly the right value in the saved list, we should (by default) interpolate with an option for an exact solve.

(Something similar for the coef function in Michael's ExclusiveLasso package.)
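
A rough sketch of what such an accessor could look like, with purely hypothetical names (fit$lambda_u is assumed to be the saved grid and fit$U[[i]] the solution at fit$lambda_u[i]):

get_u_at <- function(fit, lambda_u, exact = FALSE) {
    grid <- fit$lambda_u
    if (lambda_u < min(grid) || lambda_u > max(grid)) {
        stop("lambda_u is outside the saved grid")
    }
    hit <- which(abs(grid - lambda_u) < 1e-10)
    if (length(hit) > 0) {
        return(fit$U[[hit[1]]]) # the exact value was saved
    }
    if (exact) {
        stop("lambda_u is not on the saved grid; an exact re-solve would go here")
    }
    # Default: linear interpolation between the two bracketing grid points
    lo <- max(which(grid < lambda_u))
    hi <- min(which(grid > lambda_u))
    w <- (lambda_u - grid[lo]) / (grid[hi] - grid[lo])
    (1 - w) * fit$U[[lo]] + w * fit$U[[hi]]
}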

Add generalized lasso to the toolset

Generalized lasso [1] meets the requirements for a sparsity penalty in our SFPCA framework:

  1. it is of order 1;
  2. it is convex, or can be decomposed as a difference of two convex functions.

However, existing algorithms are not fast enough to solve many generalized lasso problems quickly. Some special cases do have efficient algorithms, though, e.g., ℓ1 trend filtering [2].

[1] “The Solution Path of the Generalized Lasso.” https://arxiv.org/pdf/1005.1971.pdf
[2] Koh, K., Kim, S. and Boyd, S. (2007), An interior-point method for large-scale l1-regularized logistic regression, Journal of Machine Learning Research

SLOPE Penalty

Recently, the "SLOPE" penalty (sorted L1-norm) has been shown to have good theoretical properties and attractive performance in simulations [1]. It is a first-order penalty, so it would fit within the MoMA framework. It theoretically has several tuning parameters (one per coefficient), so we might take a reduced case, e.g., the BH-type rule discussed in [1]. Algorithm 4 in [1] gives a good algorithm for the proximal operator.

[1] https://projecteuclid.org/euclid.aoas/1446488733
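
A minimal R sketch of that proximal operator, using isotonic regression for the non-increasing projection step (prox_slope is illustrative only and not part of MoMA; lambda is assumed to be non-negative and sorted in decreasing order):

prox_slope <- function(y, lambda) {
    stopifnot(length(lambda) == length(y), !is.unsorted(rev(lambda)))
    ord <- order(abs(y), decreasing = TRUE)
    z <- abs(y)[ord] - lambda
    # Project onto the non-increasing cone via isotonic regression on the reversed sequence
    fit <- rev(isoreg(seq_along(z), rev(z))$yf)
    x <- pmax(fit, 0) # then clip at zero
    out <- numeric(length(y))
    out[ord] <- x     # undo the sort
    sign(y) * out     # restore the signs
}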

Handle passing deflation scheme encoding between R and C++

Hmmm.... it looks like the answer is "sort of." The following works, but isn't totally type-safe. We could probably do a more general solution in the future. Issue?

#include "Rcpp.h"

enum DeflationMethod {
  PCA = 0, 
  PLS = 1,  
  CCA = 2
};

namespace Rcpp {
  SEXP wrap(DeflationMethod dm){
    Rcpp::IntegerVector dm_int = Rcpp::wrap(static_cast<int>(dm));
    dm_int.attr("class") = "DeflationMethod"; 
    return dm_int;
  }

  template <> DeflationMethod as(SEXP dm_sexp){
    Rcpp::IntegerVector dm_iv(dm_sexp);
    int dm_int = dm_iv(0); 
    DeflationMethod dm = static_cast<DeflationMethod>(dm_int);
    return dm; 
  }
}

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
DeflationMethod make_pls(){
  DeflationMethod pls = DeflationMethod::PLS;
  return pls;
}

// [[Rcpp::export]]
void take_pls(DeflationMethod x){
  Rcpp::Rcout << " x = " << x << std::endl; 
}

Originally posted by @michaelweylandt in #54

End-to-End Tests

Let's add some end-to-end tests where we reproduce the examples (either simulated or EEG) from the SFPCA paper.

Nothing fancy - just save the "known good" results (from GA's Matlab code) and then run our functions on the same data and check that everything (estimated values and selected BICs) is close to what we expect.

Possibly should wait until after the SFPCA wrappers are written.

Code formatting

Create a clang-format file for C++ code. Use the styler package for R code formatting.
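
On the R side, the whole package can be restyled in one call (assuming the styler package is installed):

# install.packages("styler")
styler::style_pkg() # restyles every .R file in the package, in place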

Improve BIC Search Algorithm

When doing a nested BIC search, do we need to update both U and V? (The current code holds V constant while optimizing over U, and vice versa.) Formally, in the BIC expression, v is a function of u_hat, so it seems weird not to change v. On the flip side, we're just optimizing a regression (with a constraint). My concern is that, if we're just solving the penalized regression problem and Xv is a sub-unit vector, then we can always choose u to be exactly Xv, which gives zero residual variance and stumbles into a log(0) problem. (Or at least, that's what I got out of a quick skim. Please correct me if I'm wrong @Banana1530.)

For a well-initialized V, this doesn't make a big difference, so we can usually get away with it: in particular, if V is well initialized (maybe because we're already using a full grid, or the tuning parameters are small enough that the SVD is already close to correct), V_init and the stationary point of V are close, so BIC(U, V_init) and BIC(U, V_stationary) will be similar. That seems like too strong an assumption to make in practice, though.

The experiments below indicate what I mean. If we set UPDATE_V = 0 but leave INITIALIZE_SVD = 1, we get saved by the fact that v_star and V1 are pretty close; but if we set both flags to 0, corresponding to a random initialization of v_hat without updating, things go poorly. If we set INITIALIZE_SVD = 0 but keep UPDATE_V = 1, we do better, but not nearly as well as with the SVD initialization.

UPDATE_V = 1;        % Set to 0 to fix V at initialization value
INITIALIZE_SVD = 1;  % Set to 0 to initialize U, V to random unit vector

n = 50; p = 25; s = 5; 
u_star = [ones(s, 1); zeros(n - s, 1)]; v_star = [zeros(p - s, 1); ones(s, 1)]; d = 3; 

N = randn(n, p);
S = d * u_star * v_star'; 

X = S + N;  
[U, ~, V] = svd(X); 

st = @(x, lambda) sign(x) .* max(abs(x) - lambda, 0); 

% Initialize SFPCA
U1 = U(:, 1); V1 = V(:, 1);


% SFPCA search on lambda_U
% Keep v parameters fixed for now...
lambda_u_range = linspace(0, 5, 51);
n_lambda_u     = size(lambda_u_range, 2);

sigma_hat_holder = zeros(size(lambda_u_range)); 
bic_holder     = zeros(size(lambda_u_range));
df_holder      = zeros(size(lambda_u_range)); 

for lu_ix=1:n_lambda_u
  lu = lambda_u_range(lu_ix);
  
  % Quick and dirty SFPCA - only doing sparsity in U
  
  if INITIALIZE_SVD
      u_hat = U1; 
      v_hat = V1; 
  else 
      u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat); 
      v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
  end
  
  u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000; 
  
  while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
      while norm(u_hat - u_hat_old) > 1e-6
          u_hat_old = u_hat; 
          u_hat = st(X * v_hat, lu); 
          u_hat = u_hat / norm(u_hat); 
      end
      
      if UPDATE_V
          while norm(v_hat - v_hat_old) > 1e-6
              v_hat_old = v_hat;
              v_hat = st(u_hat' * X, 0);
              v_hat = v_hat / norm(v_hat);
              v_hat = v_hat'; % Keep sizes correct
          end
      else
          v_hat_old = v_hat;
      end
  end
  
  sigma_hat_sq = mean((X * v_hat - u_hat).^2); 
  
  sigma_hat_holder(lu_ix) = sigma_hat_sq; 
  df_holder(lu_ix)  = sum(u_hat ~= 0); 
  bic_holder(lu_ix) = log(sigma_hat_sq / n) + 1 / n * log(n) * sum(u_hat ~= 0); 
end

[min_bic, min_bic_ind] = min(bic_holder);

lambda_u_optimal = lambda_u_range(min_bic_ind);
  
% Quick and dirty SFPCA - only doing sparsity in U

if INITIALIZE_SVD
    u_hat = U1; 
    v_hat = V1; 
else 
    u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat); 
    v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
end
  
  u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000;
  
while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
    while norm(u_hat - u_hat_old) > 1e-6
        u_hat_old = u_hat; 
        u_hat = st(X * v_hat, lambda_u_optimal); 
        u_hat = u_hat / norm(u_hat); 
    end
     
    if UPDATE_V
        while norm(v_hat - v_hat_old) > 1e-6
            v_hat_old = v_hat;
            v_hat = st(u_hat' * X, 0);
            v_hat = v_hat / norm(v_hat);
            v_hat = v_hat'; % Keep sizes correct
        end
    else
        v_hat_old = v_hat;
    end
end

snr = max(svd(X)) / max(svd(N)); 

u_hat_supp = u_hat ~= 0; 
u_star_supp = u_star ~= 0; 
u_star_nonsupp = u_star == 0; 

tpr = mean(u_hat_supp(u_star_supp)); 
fpr = mean(u_hat_supp(u_star_nonsupp)); 

Running 1000 replicates, I see

| Setting | TPR | FPR | SNR |
| --- | --- | --- | --- |
| UPDATE_V = 1, INITIALIZE_SVD = 1 | 96% | 0% | 1.5 |
| UPDATE_V = 1, INITIALIZE_SVD = 0 | 42% | 26% | 1.5 |
| UPDATE_V = 0, INITIALIZE_SVD = 1 | 96% | 0.1% | 1.5 |
| UPDATE_V = 0, INITIALIZE_SVD = 0 | 36% | 18% | 1.5 |

My hunch is that the UPDATE_V = 0, INITIALIZE_SVD = 1 case would suffer more in harder settings than this one: the angle between V1 (the leading right singular vector of S + N) and v_star isn't very large for this problem.

A collection of degrees of freedom

We need to find good estimates of degrees of freedom for different penalties if we are using BIC for model selection.

  • Lasso, fused lasso, sparse fused lasso: see Table 1 of “The Solution Path of the Generalized Lasso”.

  • many others
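
For the plain lasso, for example, the number of non-zero coefficients is the standard unbiased estimate of the degrees of freedom (this is also what the BIC experiment above uses, via sum(u_hat ~= 0)):

$$\widehat{\mathrm{df}}(\lambda) = \#\{\, i : \hat{u}_i(\lambda) \neq 0 \,\}.$$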

Momentum step size

I notice there are two different versions of the FISTA algorithm. To minimize $g + h$, where $h$ is non-smooth and $g$ is smooth, the FISTA update is

$$x_k = \operatorname{prox}_{t h}\!\left(y_k - t \nabla g(y_k)\right), \qquad y_{k+1} = x_k + \beta_k (x_k - x_{k-1}),$$

where $\beta_k$ is the momentum step size.

[1] uses a schedule of the form (up to indexing conventions)

$$\beta_k = \frac{k - 1}{k + 2},$$

while the original paper [2] uses

$$\beta_k = \frac{t_k - 1}{t_{k+1}}, \qquad \text{where} \quad t_1 = 1, \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}.$$

[1] http://www.seas.ucla.edu/~vandenbe/236C/lectures/fista.pdf Page 3
[2] https://people.rennes.inria.fr/Cedric.Herzet/Cedric.Herzet/Sparse_Seminar/Entrees/2012/11/12_A_Fast_Iterative_Shrinkage-Thresholding_Algorithmfor_Linear_Inverse_Problems_(A._Beck,_M._Teboulle)_files/Breck_2009.pdf
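
A tiny R illustration of the two momentum schedules (these helpers are for illustration only; MoMA's actual PG loop lives in the C++ code):

# Simple polynomial schedule of the (k - 1) / (k + 2) type
beta_simple <- function(k) (k - 1) / (k + 2)

# Original FISTA schedule [2]: t_1 = 1, t_{k+1} = (1 + sqrt(1 + 4 * t_k^2)) / 2,
# beta_k = (t_k - 1) / t_{k+1}
beta_original <- function(k) {
    t <- 1
    for (i in seq_len(k)) {
        t_next <- (1 + sqrt(1 + 4 * t^2)) / 2
        if (i == k) return((t - 1) / t_next)
        t <- t_next
    }
}

Both start at beta_1 = 0 and approach 1 as k grows.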

Clean up existing entry points

Currently we have three R wrappers. They differ in functionality, in the abstraction level of the arguments they take, and in where they are used in the test suite. Eventually their functionality will be a subset of the SFPCA wrappers', and thus they should be removed.

1. sfpca

https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/R/sfpca.R#L1

It is simply an R interface for the C++ function cpp_sfpca (https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/src/moma_R_function.cpp#L6), which repeatedly uses MoMA::solve and MoMA::deflate. We need to explicitly specify all parameters.

What it does: solves the penalized SVD for fixed alpha_u/v and lambda_u/v. It also finds the rank-k SVD by repeatedly deflating the matrix and then rerunning the algorithm. Note we don't have tests for the latter functionality yet.

Where it is used in the test suite: it is used to test the correctness of the PG algorithm. To do this, we inspect special cases where closed-form solutions exist and check the results obtained by our algorithm against them. See https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/tests/testthat/test_sfpca.R#L1.

2. moma_svd

https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/R/moma_svd.R#L61

What it does: it supports the following three use cases. Note that it cooperates with prox argument wrappers like lasso() and scad(), and with the PG loop settings wrapper (not merged yet). Essentially, what it does is a proper subset of MoMA::select_nestedBIC, described in section 3.

  1. Find the rank-k penalized SVD with fixed alpha_u/v and lambda_u/v by calling cpp_sfpca, described above;

  2. Run a nested-BIC search on 2-D grids, whose axes can be any combination of two parameters, by calling cpp_sfpca_nestedBIC, which does some sanity checks and then calls MoMA::select_nestedBIC;
    https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/src/moma_R_function.cpp#L179

  3. Run a grid search on 2-D grids by calling cpp_sfpca_grid, which uses MoMA::reset and MoMA::solve;
    https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/src/moma_R_function.cpp#L80

Where it is used in the test suite: it tests that prox arguments are correctly passed to the C++ side (see test_arguments.R, https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/tests/testthat/test_arguments.R#L1). We also test that cpp_sfpca_grid and cpp_sfpca give identical results (see test_grid.R, https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/tests/testthat/test_grid.R#L1).

3. MoMA::grid_BIC_mix

This will become the core of SFPCA wrappers (in progress). It supports finding the first k pairs of singular vectors, and the combination of nested-BIC search and grid search.

Where it is used in the test suite: we test that it gives correctly sized lists. See https://github.com/michaelweylandt/MoMA/blob/7c8fd20fbd18d9cbfe21837bacd8ad401853efa6/tests/testthat/test_BIC_gird_mixed.R#L1.

Remove MoMA::u, MoMA::v

At the very beginning, we included u and v as members of MoMA to facilitate warm starts. They are initialized from the SVD of the data matrix X, and they are updated in real time as the PG loop runs. Ideally, we want

problem = MoMA(X, lambda_u=0)
problem.solve()
arma::vec u1 = problem.u
arma::vec v1 = problem.v

problem.reset(lambda_v=0.1)
problem.solve() // warm-start
arma::vec u2 = problem.u
arma::vec v2 = problem.v

However, as the mix of BIC search and grid search enters our design, the concept of a warm start becomes intricate. Furthermore, depending on MoMA::u and MoMA::v as the solutions to the current or last penalized SVD problem might be a gotcha.

So let's remove MoMA::u/v and let code outside the MoMA class take care of warm starts.

LDA Example Data: Rockhoppers

The iris data for LDA / classification is overused and typically mis-applied [1].

Let's use a new data set for our LDA examples and include it in the package. Steinfurth et al. have a paper on classifying penguins by sex using various body measurements [2], which seems like it would make a great example.

Idea from [3]; see also [4-5].

[1] http://www.dicook.org/files/jsm19/slides#1
[2] https://www.int-res.com/abstracts/esr/v39/p293-302/
[3] https://twitter.com/dan_p_simpson/status/1164581393516527616
[4] http://www.publish.csiro.au/mu/MU16027
[5] https://figshare.com/articles/Data_from_Using_measurements_to_predict_laying_order_in_harvested_Northern_Rockhopper_Penguin_Eudyptes_moseleyi_eggs/3384109
