airoldilab / sgd

An R package for large scale estimation with stochastic gradient descent

C++ 55.26% R 42.81% C 1.94%
gradient-descent big-data data-analysis r statistics


sgd's Issues

Flags for constructing the design matrix are ignored

  • subset: a subset of data points; can be a parameter in sgd.control
  • na.action: how to handle NA values in the data; can be a parameter in sgd.control
  • model: logical value determining whether to output the X data frame
  • x,y: logical values determining whether to output x and/or y
  • contrasts: a list for performing hypothesis testing on other sets of predictors; can be a parameter in sgd.control
  • weights
  • offset

One-dim-eigen broken

It's awful.

library(sgd)

# Dimensions
N <- 1e5
d <- 1e2

# Generate data.
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

sgd.theta <- sgd(y ~ ., data=dat, model="lm", sgd.control=list(lr="one-dim-eigen"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.7658

Plot diagnostics

These must allow one to specify multiple sgd objects to plot.

  • MSE
  • Classification error
  • Evaluation of cost function

Available x-axes for each of the above plots:

  • Runtime
  • log-Iteration

Weighted Observations

In R, both lm and glm support weighted observations to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions). I find that the SGD implementation in scikit-learn also supports weighted observations. Could you point me to some references on the implementation of SGD with weighted observations?

Thanks!
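For intuition, here is a minimal sketch (not this package's implementation) of how weights typically enter an SGD update for least squares: each observation's gradient is simply scaled by w_i, mirroring how the weights enter the lm()/glm() score function. The constants and learning-rate schedule are made up for illustration.

```r
set.seed(1)
n <- 1000; d <- 3
X <- cbind(1, matrix(rnorm(n * (d - 1)), ncol = d - 1))
theta.true <- c(1, 2, -1)
w <- runif(n, 0.5, 2)   # weights, inversely proportional to dispersions
y <- as.vector(X %*% theta.true) + rnorm(n) / sqrt(w)

theta <- rep(0, d)
for (i in 1:n) {
  lr <- 1 / (20 + i)    # simple decaying rate (illustrative)
  # weighted gradient of squared loss at observation i
  grad <- w[i] * (y[i] - sum(X[i, ] * theta)) * X[i, ]
  theta <- theta + lr * grad
}
theta  # should be near theta.true
```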

Implementation of p-dimension learning rate

Hi everyone~

I've implemented the p-dimension learning rate just now, with the source code in the learnrate branch. Currently, it lets the user pass either a string ('uni-dim' or 'px-dim') or an integer ({1, 2}) to specify which kind of learning rate they prefer to use in our main interface.

Several issues I've encountered for now:

  • @lantian2012 and I had a really long discussion about what we should pass in to the learning rate function to make it generic enough. (Previously we were only passing in t, the current iteration number. We soon discovered that this was not enough for the p-dimension case.) Right now we pass in the t-th data point, the previous theta, and t itself.
  • This p-dimension learning rate did not perform as well as the previous one. I still used the system-generated Poisson dataset to run the test. The true theta should be around (0.7, 1.4), but what we got when using the p-dimension learning rate with the 'asgd' method was (0.52, 1.29), and (0.16, 1.45) with 'implicit'. Here's the code of the learning rate:
static mat learning_rate(const mat& theta_old, const Imp_DataPoint& data_pt,
                         unsigned t, unsigned p,
                         score_func_type score_func) {
  mat Idiag(p, p, fill::eye);
  mat Gi = score_func(theta_old, data_pt);
  Idiag = Idiag + diagmat(Gi * Gi.t());

  for (unsigned i = 0; i < p; ++i) {
    if (abs(Idiag.at(i, i)) > 1e-8) {
      Idiag.at(i, i) = 1. / Idiag.at(i, i);
    }
  }
  return Idiag;
}

Any idea where it might have gone wrong? (One thing that confused me: is this Idiag matrix initialized just once at the very beginning, or in each iteration step?)

  • In #8 where @ptoulis mentioned Auto-full method, could you please offer some more information on what that is? Thanks :)

Convergence diagnostics

Ideas

  • Run chi-squared test sequentially after a batch of iterations to check convergence. This can also be used as a way to stop SGD early rather than running all of the iterations that a user may specify.
  • Bootstrap data set, or choose different initial points, to see if the point estimate converges to something else
  • Run a bunch of SGD chains, and analyze chain convergence using something like Gelman-Rubin diagnostic
  • Assess an estimated number of remaining iterations for convergence while running sgd. This is similar to the number of effective draws for MCMC.

Add compatibility to read in bigmemory data sets

Since we're doing scalable computation, we should work with scalable I/O, storage, etc. packages. That is, we'd like to be able to run SGD on data sets with memory larger than a computer's RAM.

Interval estimates

Assume the Fisher information gives the variances, but we should also be able to check whether or not SGD is in the asymptotic region.

Safe check for glm family fails

sgd.theta <- sgd(y ~ ., data=dat, model="glm", model.control=list(family=poisson))

gives error

Error in UseMethod("family") : 
  no applicable method for 'family' applied to an object of class "NULL"

sgd.theta <- sgd(y ~ ., data=dat, model="glm", model.control=list(family=poisson())) works.

Both inputs should be supported.
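A sketch of the fix, mirroring how stats::glm() itself normalizes its family argument at the top of its body, so that a string, a family function, and a family object are all accepted:

```r
normalize.family <- function(family) {
  # Accept a string ("poisson"), a family function (poisson), or a
  # family object (poisson()), as stats::glm() does.
  if (is.character(family))
    family <- get(family, mode = "function", envir = parent.frame())
  if (is.function(family))
    family <- family()
  if (is.null(family$family))
    stop("'family' not recognized")
  family
}

normalize.family(poisson)$family    # "poisson"
normalize.family(poisson())$family  # "poisson"
```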

compiled code warning in R CMD check

I built the package with R CMD build sgd, then checked with R CMD check sgd_0.1.tar.gz. There is a compiled code warning:

* checking compiled code ... WARNING
File ‘sgd/libs/sgd.so’:
  Found ‘___assert_rtn’, possibly from ‘assert’ (C)
    Object: ‘sgd.o’

Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the C RNG.

I don't see any asserts in our code, so I don't know how this arises.

Full log below:

* using log directory ‘/Users/dvt/temp/sgd.Rcheck’
* using R version 3.1.2 (2014-10-31)
* using platform: x86_64-apple-darwin14.1.0 (64-bit)
* using session charset: UTF-8
* checking for file ‘sgd/DESCRIPTION’ ... OK
* checking extension type ... Package
* this is package ‘sgd’ version ‘0.1’
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘sgd’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking line endings in Makefiles ... OK
* checking compilation flags in Makevars ... OK
* checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS) ... OK
* checking compiled code ... WARNING
File ‘sgd/libs/sgd.so’:
  Found ‘___assert_rtn’, possibly from ‘assert’ (C)
    Object: ‘sgd.o’

Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the C RNG.

See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.
* checking examples ... OK
* checking PDF version of manual ... OK
* DONE

WARNING: There was 1 warning.
See
  ‘/Users/dvt/temp/sgd.Rcheck/00check.log’
for details.

Allow user to specify a generic loss function to do SGD on

This cannot be efficiently implemented in Rcpp: Rcpp cannot interface with an R function in any way other than calling it in R each time. Hence, for now we are using strings to match models specified by the user, which allows for faster computation using C++ functions. Later on, we will extend this so that the model argument in sgd can be either a string (uses a C++ function on a case-by-case basis) or a generic function (uses the R function).
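For concreteness, a hypothetical sketch of what the generic-function path could look like: sgd.generic() knows nothing about the model, and the user supplies grad.fn, an R function returning the gradient of the loss at one observation. Every step calls back into R, which is why this path would be much slower than the C++ one.

```r
sgd.generic <- function(X, y, grad.fn) {
  theta <- rep(0, ncol(X))
  for (t in 1:nrow(X)) {
    # each step calls the user's R function once (the slow part)
    theta <- theta - (1 / (20 + t)) * grad.fn(theta, X[t, ], y[t])
  }
  theta
}

# Example user-supplied gradient: squared loss.
sq.grad <- function(theta, x, y) -(y - sum(x * theta)) * x
```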

Method to calculate update equation for Implicit-SGD

Not sure if there is an analytic solution for GLMs other than the normal distribution. For example, the logistic model requires solving an equation of the form $x + \exp(x)x = C$. To make things more complicated, x is a vector...
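One observation that helps (a sketch, not the package's internal solver): writing theta_new = theta_old + ksi * x turns the implicit update theta_new = theta_old + a_n * (y - h(x' theta_new)) * x into a one-dimensional root-finding problem in the scalar ksi, since the update moves along x. uniroot() is used here as a generic stand-in; for an increasing h the root is bracketed by a_n * |y - h(x' theta_old)|.

```r
implicit.step <- function(theta, x, y, a_n,
                          h = function(eta) 1 / (1 + exp(-eta))) {
  eta.old <- sum(x * theta)
  r_n <- a_n * abs(y - h(eta.old))  # bound on |ksi| for increasing h
  if (r_n == 0) return(theta)
  # scalar equation: ksi = a_n * (y - h(x'theta_old + ksi * ||x||^2))
  f <- function(ksi) ksi - a_n * (y - h(eta.old + ksi * sum(x^2)))
  ksi <- uniroot(f, lower = -r_n, upper = r_n)$root
  theta + ksi * x
}
```

The same reduction applies to other monotone link inverses (e.g. h = exp for Poisson), though there the search interval must be kept tight to avoid overflow.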

Output of sgd

It should collect as few large objects as possible: omit, e.g., fitted.values, but of course keep coefficients. The storage of the parameter estimates is already quite large in our setting, where we assume a large number of observations and features.

work on finalizing the datasets

We will need to organize the datasets which will be used for testing.
There are some datasets from economics that were brought to my attention.
Ye, could you try to start a wiki page here regarding the datasets?
We will update it as we go.

Output Compatibility

I find that most regression results are compatible with summary(). For lm and glm, the summary includes the standard error and p-value of all estimated coefficients.
I think it would be hard to get the standard error for our package. Do you think we should exclude it from our package?

default arguments in control list of sgd

Passing a list to sgd.control drops any of the default parameters I don't explicitly specify.

X <- matrix(rnorm(10), ncol=1)
eps <- rnorm(10)
theta <- c(1,5)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

sgd.est <- sgd(y ~ x, data=dat, model="glm",
               model.control=list(family = gaussian()),
               sgd.control=list(epsilon=1e-10))
## Error in fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.00691773260806754,  :
##  argument "lr.type" is missing, with no default

Do an if (is.null(...)) check on every list element in order to pass in the defaults instead?
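A simpler fix (a sketch): keep one full list of defaults and overlay the user's values with modifyList(), so unspecified entries such as lr.type keep their defaults. The default names and values here are illustrative, not the package's actual defaults.

```r
# illustrative defaults, not the package's real ones
sgd.control.defaults <- list(method = "implicit", lr.type = "uni-dim",
                             epsilon = 1e-5)

merge.control <- function(user = list()) {
  # entries in `user` override the defaults; the rest are kept
  modifyList(sgd.control.defaults, user)
}

merge.control(list(epsilon = 1e-10))  # lr.type and method survive
```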

Implicit overflow

We find that overflow is possible for the implicit SGD method when x and y are large.


At the first steps, theta is initialized close to 0. A large y leads to a large r_n. As the bound is large, the guesses for ksi can be large. For the Poisson model, h(norm(x) * ksi) can go to infinity.
For example, when x=10, y=1000, a_n=0.5, we get r_n=500. Initial guesses of ksi can easily reach 200, and in the Poisson model exp(200 * 100) goes to infinity.

AdaGrad vs d-dim

Why is the square root in AdaGrad empirically getting better performance? ... or is it? To be analyzed!
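A minimal experiment sketch for the comparison (squared loss, made-up constants; neither function is the package's implementation): AdaGrad divides by the square root of the accumulated squared gradients, while the d-dim rate divides by the accumulated sum itself, so the latter decays much faster.

```r
set.seed(1)
n <- 5000; d <- 10
X <- matrix(rnorm(n * d), ncol = d)
y <- as.vector(X %*% rep(1, d)) + rnorm(n)

run <- function(rate) {
  theta <- rep(0, d)
  G <- rep(0, d)  # per-coordinate running sum of squared gradients
  for (i in 1:n) {
    g <- -(y[i] - sum(X[i, ] * theta)) * X[i, ]  # grad of squared loss
    G <- G + g^2
    theta <- theta - rate(G) * g
  }
  mean((theta - 1)^2)  # MSE against the true coefficients (all 1)
}

mse.adagrad <- run(function(G) 0.5 / sqrt(G + 1e-8))  # AdaGrad-style
mse.ddim    <- run(function(G) 1 / (1 + G))           # d-dim-style
c(adagrad = mse.adagrad, ddim = mse.ddim)
```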

gradient blows up for linear regression with standard SGD

Implicit works but this one doesn't (it should)

library(sgd)

# Dimensions
N <- 1e5
d <- 1e2

# Generate data.
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)

sgd.theta <- sgd(y ~ ., data=dat, model="lm", sgd.control=list(method="sgd"))
## NA or infinite gradient
## Error in fit(x, y, model.control, sgd.control) :
##   An error has occured, program stopped.

Implement five (5) different modes for setting the learning rate

I believe the user should have the following options for the learning rate.

  • Manual: Should be possible to set the learning rate manually
  • Auto-1dim: Automatic setting of one-dimensional rate (for speed)
  • Auto-pxdim: Automatic setting of diagonal p-dimensional rate (for efficiency)
  • Auto-full: Online estimation of the full matrix.
  • Auto-QN: Use a Quasi-Newton scheme.
  • Averaging: Use averaging.

I suggest we work on 2, 3, and 4 for now.
We can add the rest as we go. Any thoughts?

Learning rate for categorical variables

Our current method of determining the learning rate involves calculating the eigenvalues of X. For categorical variables, we currently split each categorical variable into several indicator variables and put those indicator variables as columns of X. Should we exclude categorical variables when we calculate the covariance matrix and its eigenvalues? (It takes effort!)

Penalized MLE

How to handle feature selection, like glmnet? Should allow for something such as elastic net to do L1/L2 regularization.

  • GLMs (explicit)
  • GLMs (implicit)
  • Cox model (explicit)
  • Cox model (implicit)
  • GMMs (explicit)
  • GMMs (implicit)
  • M-estimation (explicit)
  • M-estimation (implicit)
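One standard way to bolt elastic-net penalties onto any of these (a sketch, loosely following glmnet's lambda/alpha parameterization) is a proximal step after each SGD update: soft-thresholding handles the L1 part, multiplicative shrinkage the L2 part.

```r
prox.elastic.net <- function(theta, lr, lambda, alpha = 1) {
  theta <- theta / (1 + lr * lambda * (1 - alpha))         # L2 shrinkage
  sign(theta) * pmax(abs(theta) - lr * lambda * alpha, 0)  # L1 soft-threshold
}

prox.elastic.net(c(1, -0.05, 0.2), lr = 1, lambda = 0.1)  # lasso case (alpha = 1)
```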

Add simple test code

We would like to have a simple source file that
calls the appropriate routines to run a very simple, baseline model (e.g. regression).

This serves both as a quick-start and as code for testing.

It is better to create a separate folder "testing" where we will include source for such tests.

Cannot run sgd example

Here is what I get when I run the "Dobson" example.

Error in rep(no, length.out = length(ans)) :
attempt to replicate an object of type 'closure'

My version is
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 0.2
year 2013
month 09
day 25
svn rev 63987
language R
version.string R version 3.0.2 (2013-09-25)
nickname Frisbee Sailing

update documentation according to the new interface

The man/ folder has outdated content, and the running examples no longer work.

This is the one that worked for me:

library(sgd)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
sgd.D93 <- sgd(counts ~ outcome + treatment, model="glm", model.control=list(family = poisson()))
sgd.D93

Safe checks

There are many safe checks in glm. I plan to implement most of them, but I bumped into several issues.

  1. glm checks the validity of the estimates on all observations in each iteration. For example, in each iteration we can get \mu = linkinv(x %*% \theta), where \theta is the current estimate. glm then checks that variance(\mu) is a number and that variance(\mu) != 0. Note x is the matrix of all observations. This might be too computationally expensive, as we may have many observations, and it also seems odd to use the whole data set in each step of an iterative algorithm. I am thinking of using only the observation of the current iteration (one row of x) for these checks, and then checking the validity of the final estimate on all observations. What do you think?
  2. glm checks whether d(mu)/d(eta) is NA or == 0, where mu = E(Y) and eta = x %*% theta. I don't know why we should check this. glm stops if there is an NA, and ignores the observation if there is a 0. I'm sorry for being bad at generalized linear models.
  3. glm changes the step size when the deviance is infinite or eta is out of the support: it uses the average of the current estimate and the old estimate as the new estimate, and if the deviance or eta is still invalid, it averages again, repeatedly. I think this strategy will also work in SGD; do you think it is a good idea to implement it this way?
  4. Do you think we should check the change in deviance over the last two iterations to determine whether the algorithm has converged, and give a warning if it doesn't?
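On point 3, the step-halving strategy can be sketched as a small helper (hypothetical; `valid` stands in for whatever deviance/eta checks we settle on):

```r
safe.step <- function(theta.old, theta.new, valid, max.halvings = 10) {
  # Repeatedly average the proposed estimate with the old one until the
  # validity checks pass, as glm does when the deviance or eta is invalid.
  for (k in seq_len(max.halvings)) {
    if (valid(theta.new)) return(theta.new)
    theta.new <- (theta.new + theta.old) / 2
  }
  theta.old  # give up and keep the old estimate
}

# e.g. for a Poisson model one might use, with X the current design rows:
# valid <- function(theta) all(is.finite(exp(X %*% theta)))
```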

Add estimating equations model

additional applications to network models. That is, you want to aim for particular network characteristics, e.g., average number of links, average diameter, average

P-dim Learning Rate Decreasing Too Fast

Hi,

@lantian2012 and I were doing some experiments on a Poisson model dataset. We found that with the p-dim learning rate, the learning rate drops too fast. This is mainly because the Idiag matrix is the summation of the square of the score function over iterations, and this value quickly grows to around 1e+18 (when X are around 10, theta.true is around [0.5, 1], and Y ~ Pois(exp(theta' X))), which means its inverse is as tiny as 1e-18. At that point the learning rate is almost 0, so in each update theta changes by only a tiny amount.

@ptoulis Yesterday you said that we are calculating the square of the score function Gi^2 because this will approximate to the real Fisher Information Matrix. However, in our pset 3, it is listed that:
I(\theta) = E(h'(x_n'theta) * x_n x_n') / phi

It uses the expectation of the first derivative of the transfer function (instead of the square of the score function).

I've been trying to use something like h'(x_n'theta) * (x_n x_n') as the learning rate. The result, though, was still disappointing.

Family for different link function

I was working on supporting each family for all of its valid link functions. I have put the theoretical derivation in reference/glm_family_conclusion.pdf and reference/Distribution_LinkFunc_Relationship.pdf. Basically, what we did in class was the canonical link function, and a bit more change is needed in the score function \nabla log-likelihood if we use some other link function.

The practical issue I've encountered is that at the first few iterations the estimate of theta is very unstable, so it can easily produce extreme values when calculating the score function. Is there any way to avoid this issue?
