airoldilab / sgd

An R package for large scale estimation with stochastic gradient descent
subset: a subset of data points; can be a parameter in sgd.control
na.action: how to deal with NAs in the data; can be a parameter in sgd.control
model: logical value determining whether to output the X data frame
x, y: logical values determining whether to output the x and/or y
contrasts: a list for performing hypothesis testing on other sets of predictors; can be a parameter in sgd.control
weights
offset
The MSE is awful:
library(sgd)
# Dimensions
N <- 1e5
d <- 1e2
# Generate data.
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
sgd.theta <- sgd(y ~ ., data=dat, model="lm", sgd.control=list(lr="one-dim-eigen"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.7658
In src/sgd.cpp, it should print out progress and useful debugging info, like the release messages, e.g., "Family object released".
Set up an experiment to run them on a stock data set, and compare with coxnet.
These must allow one to specify multiple sgd objects to plot.
Available x-axis options for each of the above plots:
Find large datasets for:
gmm
glmnet
bigglm
coxnet

In R, both lm and glm support weighted observations, indicating that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions). I find that the SGD in scikit-learn also supports weighted observations. Could you point me to some references about the implementation of SGD with weighted observations?
Thanks!
Hi everyone~
I've implemented the p-dimensional learning rate just now, with the source code in the learnrate
branch. Currently, it lets the user pass either a string ('uni-dim' or 'px-dim') or an integer ({1, 2}) to specify which kind of learning rate to use in our main interface.
Several issues I've encountered for now:
static mat learning_rate(const mat& theta_old, const Imp_DataPoint& data_pt,
                         unsigned t, unsigned p,
                         score_func_type score_func) {
  mat Idiag(p, p, fill::eye);              // NOTE: re-created on every call
  mat Gi = score_func(theta_old, data_pt);
  Idiag = Idiag + diagmat(Gi * Gi.t());    // adds only the current squared score
  // Invert the diagonal entries, guarding against division by ~0.
  for (unsigned i = 0; i < p; ++i) {
    if (std::abs(Idiag.at(i, i)) > 1e-8) {
      Idiag.at(i, i) = 1. / Idiag.at(i, i);
    }
  }
  return Idiag;
}
Any idea where it might have gone wrong...? (One thing that confused me: is this Idiag
matrix initialized just once at the very beginning, or at each iteration step?)
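A likely culprit: as written, Idiag is a local variable, so the squared-score sum is rebuilt from a single gradient on every call rather than accumulated across iterations. A minimal sketch of a persistent accumulator in plain C++ with std::vector (the Armadillo types and the surrounding class are not reproduced here, so all names are hypothetical):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a p-dimensional diagonal conditioner whose accumulator
// lives OUTSIDE the per-iteration call, e.g. as part of the SGD state.
struct DiagLearningRate {
  std::vector<double> accum;  // I + running sum of squared score entries
  explicit DiagLearningRate(std::size_t p) : accum(p, 1.0) {}

  // Given the current score (gradient) vector, update the running sum
  // and return the per-coordinate learning rates 1 / accum[i].
  std::vector<double> rates(const std::vector<double>& score) {
    std::vector<double> out(accum.size(), 0.0);
    for (std::size_t i = 0; i < accum.size(); ++i) {
      accum[i] += score[i] * score[i];  // accumulates across calls
      if (std::abs(accum[i]) > 1e-8) out[i] = 1.0 / accum[i];
    }
    return out;
  }
};
```

Calling rates() repeatedly with comparable scores makes each rate shrink roughly like 1/t, which is the intended behavior; a fresh local matrix each call cannot produce that decay.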
Create a wiki page where we will discuss approaches to set the learning rate.
I will also make sure to add a few interesting papers.
Ideas
This will be done once the package goes public, or if we get approval from Github that https://github.com/airoldilab is an education-based organization and they can thus support us having private repos for free (I contacted them; the request is pending).
to do
others to try
Since we're doing scalable computation, we should work with scalable I/O, storage, etc. packages. That is, we'd like to be able to run SGD on data sets larger than a computer's RAM.
Add unit tests to the tests/ folder.
Follow similar structure as https://github.com/hadley/dplyr/tree/master/tests. See http://r-pkgs.had.co.nz/tests.html
We assume the Fisher information gives the variances, but we should also be able to check whether or not SGD is in the asymptotic region.
sgd.theta <- sgd(y ~ ., data=dat, model="glm", model.control=list(family=poisson))
gives error
Error in UseMethod("family") :
no applicable method for 'family' applied to an object of class "NULL"
sgd.theta <- sgd(y ~ ., data=dat, model="glm", model.control=list(family=poisson()))
is working.
Both inputs should be supported.
I built the package with R CMD build sgd
, then checked with R CMD check sgd_0.1.tar.gz
. There is a compiled code warning:
* checking compiled code ... WARNING
File ‘sgd/libs/sgd.so’:
Found ‘___assert_rtn’, possibly from ‘assert’ (C)
Object: ‘sgd.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the C RNG.
I don't see any asserts in our code, so I don't know how this arises.
Full log below:
* using log directory ‘/Users/dvt/temp/sgd.Rcheck’
* using R version 3.1.2 (2014-10-31)
* using platform: x86_64-apple-darwin14.1.0 (64-bit)
* using session charset: UTF-8
* checking for file ‘sgd/DESCRIPTION’ ... OK
* checking extension type ... Package
* this is package ‘sgd’ version ‘0.1’
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘sgd’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking line endings in Makefiles ... OK
* checking compilation flags in Makevars ... OK
* checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS) ... OK
* checking compiled code ... WARNING
File ‘sgd/libs/sgd.so’:
Found ‘___assert_rtn’, possibly from ‘assert’ (C)
Object: ‘sgd.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the C RNG.
See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.
* checking examples ... OK
* checking PDF version of manual ... OK
* DONE
WARNING: There was 1 warning.
See
‘/Users/dvt/temp/sgd.Rcheck/00check.log’
for details.
This cannot be efficiently implemented in Rcpp: Rcpp cannot interface with an R function in any way other than calling it in R each time. Hence, for now, we are using strings to match models specified by the user, which allows for faster computation using C++ functions. Later on, we will extend this so that the model
argument in sgd can be either a string (uses a C++ function on a case-by-case basis) or a generic function (uses the R function).
Not sure there is an analytic solution to the implicit update for GLMs other than the normal model. For example, the logistic model requires solving some kind of equation like $x + \exp(x)x = C$
. To make things more complicated, x
is a vector...
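Even without a closed form, an equation of this shape is strictly increasing in x, so a guarded one-dimensional root search suffices. A minimal sketch (a hypothetical helper, not the package's actual solver) solving $x + \exp(x)x = C$ for scalar x >= 0, C >= 0 by bracketing and bisection:

```cpp
#include <cmath>

// Solve f(x) = x + x*exp(x) - C = 0 by bisection (assumes C >= 0, so the
// unique root is nonnegative). Hypothetical helper, not package code.
double solve_implicit(double C, double tol = 1e-10) {
  auto f = [C](double x) { return x + x * std::exp(x) - C; };
  double lo = 0.0, hi = 1.0;
  while (f(hi) < 0.0) hi *= 2.0;  // expand until the root is bracketed
  while (hi - lo > tol) {         // f is increasing, so plain bisection works
    double mid = 0.5 * (lo + hi);
    (f(mid) < 0.0 ? lo : hi) = mid;
  }
  return 0.5 * (lo + hi);
}
```

A handful of bisection (or Newton) steps per observation is cheap relative to the rest of the update, which is why the lack of an analytic solution is not fatal in practice.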
It should collect as few large objects as possible (not fitted.values, for example), but of course keep coefficients
. The storage of the parameter estimates is already quite large in our setting, where we assume a large number of observations and features.
In implicit_experiment.h
, the current implementation has variables and class methods which are used only for GLMs, e.g., h_transfer
and g_link
.
Things available for example
We will need to organize the datasets which will be used for testing.
There are some datasets from economics that were brought to my attention.
Ye, could you start a wiki page here regarding the datasets?
We will update it as we go.
I find that most regression results come with summary(): for lm and glm, the summary includes the standard error and p-value of every estimated coefficient.
I think it would be hard to compute standard errors in our package. Do you think we should exclude them?
The list I pass into sgd.control
removes any of the additional default parameters.
X <- matrix(rnorm(10), ncol=1)
eps <- rnorm(10)
theta <- c(1,5)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
sgd.est <- sgd(y ~ x, data=dat, model="glm",
model.control=list(family = gaussian()),
sgd.control=list(epsilon=1e-10))
## Error in fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.00691773260806754, :
## argument "lr.type" is missing, with no default
Should we do an is.null(...)
check on every list element (or merge the user's list into the defaults, e.g. with modifyList) so the defaults are passed in instead?
We find that overflow is possible for the implicit-SGD method when x and y are large.
In the first steps, theta is initialized close to 0. A large y leads to a large r_n. As the bound is large, the guesses for ksi can be large. For the Poisson model, h(norm(x) * ksi) can go to infinity.
For example, when x = 10, y = 1000, and a_n = 0.5, we get r_n = 500. Initial guesses of ksi can easily reach 200. In the Poisson model, exp(200 * 100) goes to infinity.
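One way to avoid this is to never let the guesses roam freely: the implicit-update root lies between 0 and a_n * r_n, so searching only that interval with a clamped exponential keeps everything finite. A sketch for a scalar Poisson update ksi = a * (y - exp(eta + s * ksi)), s = ||x||^2 (hypothetical helper, not the package's code):

```cpp
#include <algorithm>
#include <cmath>

// Clamp the exponent so exp never overflows (exp(709) is near DBL_MAX).
double safe_exp(double u) { return std::exp(std::min(u, 700.0)); }

// Solve ksi = a * (y - safe_exp(eta + s * ksi)) by bisection on the
// interval between 0 and a * r, where r = y - exp(eta) is the residual
// at ksi = 0; the root is known to lie in that interval and
// g(k) = k - a*(y - exp(eta + s*k)) is increasing in k.
double poisson_implicit(double y, double eta, double s, double a) {
  double r  = y - safe_exp(eta);
  double lo = std::min(0.0, a * r), hi = std::max(0.0, a * r);
  auto g = [&](double k) { return k - a * (y - safe_exp(eta + s * k)); };
  for (int i = 0; i < 200; ++i) {
    double mid = 0.5 * (lo + hi);
    (g(mid) < 0.0 ? lo : hi) = mid;
  }
  return 0.5 * (lo + hi);
}
```

With the numbers above (y = 1000, s = 100, a = 0.5, theta near 0), the bounded search settles on a tiny ksi rather than wandering to 200, and the clamped exp keeps intermediate evaluations finite.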
Why does the square root in AdaGrad empirically get better performance? ... or does it? To be analyzed!
Implicit works but this one doesn't (it should)
library(sgd)
# Dimensions
N <- 1e5
d <- 1e2
# Generate data.
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
sgd.theta <- sgd(y ~ ., data=dat, model="lm", sgd.control=list(method="sgd"))
## NA or infinite gradient
## Error in fit(x, y, model.control, sgd.control) :
## An error has occured, program stopped.
I believe the user should have the following options for the learning rate.
I suggest we work on options 2, 3, and 4 for now.
We can add the rest as we go. Any thoughts?
This should be as fast as possible; for now, do a grid search. Possible extension:
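As a baseline, a grid search over the learning-rate scale can be sketched as follows (an illustrative 1-D least-squares stream with rates gamma/t; the names and setup are assumptions, not the package's interface):

```cpp
#include <cstddef>
#include <vector>

// Run 1-D SGD with rates gamma / t on a least-squares stream and
// return the average final loss over the data. Illustrative only.
double sgd_final_loss(const std::vector<double>& x,
                      const std::vector<double>& y, double gamma) {
  double theta = 0.0;
  for (std::size_t t = 0; t < x.size(); ++t) {
    double grad = (theta * x[t] - y[t]) * x[t];  // d/dtheta of 0.5*(theta*x - y)^2
    theta -= gamma / double(t + 1) * grad;
  }
  double loss = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double r = theta * x[i] - y[i];
    loss += 0.5 * r * r;
  }
  return loss / double(x.size());
}

// Pick the grid value with the smallest final loss.
double pick_gamma(const std::vector<double>& x, const std::vector<double>& y,
                  const std::vector<double>& grid) {
  double best = grid[0], best_loss = sgd_final_loss(x, y, grid[0]);
  for (double g : grid) {
    double l = sgd_final_loss(x, y, g);
    if (l < best_loss) { best_loss = l; best = g; }
  }
  return best;
}
```

Each grid point costs one pass over the data, so the search is embarrassingly parallel; a smarter extension would bracket the best scale and refine it.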
Our current method of determining the learning rate involves calculating the eigenvalues of X. For categorical variables, we currently split each one into several indicator variables and put those indicators as columns of X. Should we exclude categorical variables when we calculate the covariance matrix and its eigenvalues? (It takes effort!)
How should we handle feature selection, as in glmnet
? We should allow for something such as the elastic net to do L1/L2 regularization.
We would like to have a simple source file that calls the appropriate routines to run a very simple, baseline model (e.g., regression).
This serves both as a quick-start and for code testing.
It would be better to create a separate folder, "testing", where we will include the source for such tests.
lantian2012/sgd-r-package
dustinvtran/ai-sgd
ptoulis/implicit-sgd
Relevant github commands http://stackoverflow.com/a/10548919/1740602
Here is what I get when I run the "Dobson" example.
Error in rep(no, length.out = length(ans)) :
attempt to replicate an object of type 'closure'
My version is
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 0.2
year 2013
month 09
day 25
svn rev 63987
language R
version.string R version 3.0.2 (2013-09-25)
nickname Frisbee Sailing
The intercept
flag in model.control
for glm
is ignored.
The man/
folder has outdated stuff, and its running examples no longer work.
This is the one that worked for me:
library(sgd)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
sgd.D93 <- sgd(counts ~ outcome + treatment, model="glm", model.control=list(family = poisson()))
sgd.D93
There are many safety checks in glm. I plan to implement most of them, but I've bumped into several issues.
Additional applications to network models. That is, you want to aim for particular network characteristics, e.g., average number of links, average diameter, average ...
Hi,
@lantian2012 and I were doing some experiments on a Poisson-model dataset. We found that when using the p-dim
learning rate, the learning rate drops too fast. This is mainly because the Idiag
matrix is the summation of the square of score_function
over the iterations. This value soon goes up to around 1e+18 (when X is around 10, theta.true is around [0.5, 1], and Y ~ Pois(exp(theta' X))), which makes its inverse as tiny as 1e-18. At that point, the learning rate is almost 0, so in each update theta changes only by a really small value.
@ptoulis Yesterday you said that we are calculating the square of the score function, Gi^2, because it approximates the true Fisher information matrix. However, in our pset 3, it is stated that:
I(\theta) = E[h'(x_n' \theta) x_n x_n'] / \phi
That uses the expectation of the first derivative of the transfer function (instead of the square of the score function).
I've been trying to use something like h'(x_n' theta) * (x_n x_n')
as the learning rate. The result, though, was still disappointing.
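For what it's worth, the two quantities do agree at the true theta: for a correctly specified model, the information matrix equality gives E[score * score'] = I(theta), which is why summing Gi Gi' is a reasonable estimate of the Fisher information; the discrepancy comes from evaluating it at the current noisy iterate rather than at the truth. A quick Monte Carlo check for a scalar canonical Poisson model with x = 1 (illustrative, not package code):

```cpp
#include <cmath>
#include <random>

// With x = 1 and canonical link, the score at the TRUE theta is
// y - exp(theta), and the information matrix equality says
// E[score^2] = Var(y) = exp(theta) = h'(theta).
// Seed and sample size are arbitrary choices for illustration.
double mean_sq_score(double theta, int n, unsigned seed) {
  std::mt19937 gen(seed);
  double lambda = std::exp(theta);
  std::poisson_distribution<int> pois(lambda);
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    double score = pois(gen) - lambda;  // score of one observation at truth
    sum += score * score;
  }
  return sum / n;
}
```

mean_sq_score(0.5, 200000, 1) lands close to exp(0.5) ~ 1.649, i.e. to h'(theta); evaluated far from the true theta, the squared score instead blows up, which matches the 1e+18 accumulation you observed.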
Just get some rough heuristics for how it performs now
I was working on supporting each family with all of its valid link functions. I have put the theoretical derivation in reference/glm_family_conclusion.pdf
and reference/Distribution_LinkFunc_Relationship.pdf
. Basically, what we did in class was the canonical link function, and a bit more change is needed in the score function (\nabla log-likelihood) if we use other link functions.
The practical issue I've encountered: in the first few iterations, the estimate of theta is very unstable, so it may easily produce some crazy values when calculating the score function. Is there any way to avoid this issue?