airoldilab / sgd

An R package for large scale estimation with stochastic gradient descent
subset: a subset of data points; can be a parameter in sgd.control
na.action: how to deal with NAs in the data; can be a parameter in sgd.control
model: logical value determining whether to output the X data frame
x, y: logical values determining whether to output the x and/or y
contrasts: a list for performing hypothesis testing on other sets of predictors; can be a parameter in sgd.control
weights
offset
The MSE is awful:
library(sgd)
# Dimensions
N <- 1e5
d <- 1e2
# Generate data.
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
sgd.theta <- sgd(y ~ ., data=dat, model="lm", sgd.control=list(lr="one-dim-eigen"))
mean((sgd.theta$coefficients - theta)^2) # MSE
## [1] 24.7658
In src/sgd.cpp, it should print out progress and useful debugging info, like the release messages, e.g., "Family object released".
Set up an experiment to run them on a stock data set, and compare with coxnet.
These must allow one to specify multiple sgd objects to plot.
Available x-axis options for each of the above plots:
Find large datasets for:
gmm
glmnet
bigglm
coxnet

In R, both lm and glm support weighted observations, indicating that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions). I find that the SGD in scikit-learn also supports weighted observations. Could you point me to some references about the implementation of SGD with weighted observations?
Thanks!
Hi everyone~
I've implemented the p-dimensional learning rate just now, with the source code in the learnrate
branch. Currently, it lets the user pass either a string ('uni-dim' or 'px-dim') or an integer ({1, 2}) to specify which kind of learning rate to use in our main interface.
Several issues I've encountered for now:
static mat learning_rate(const mat& theta_old, const Imp_DataPoint& data_pt,
                         unsigned t, unsigned p,
                         score_func_type score_func) {
  mat Idiag(p, p, fill::eye);              // NOTE: re-created on every call
  mat Gi = score_func(theta_old, data_pt);
  Idiag = Idiag + diagmat(Gi * Gi.t());    // adds only the current squared score
  // Invert the diagonal entries, guarding against division by ~0.
  for (unsigned i = 0; i < p; ++i) {
    if (std::abs(Idiag.at(i, i)) > 1e-8) {
      Idiag.at(i, i) = 1. / Idiag.at(i, i);
    }
  }
  return Idiag;
}
Any idea where it might have gone wrong...? (One thing that confused me: is this Idiag
matrix initialized just once at the very beginning, or at each iteration step?)
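A likely culprit: as written, Idiag is a local variable, so the squared-score sum is rebuilt from a single gradient on every call rather than accumulated across iterations. A minimal sketch of a persistent accumulator in plain C++ with std::vector (the Armadillo types and the surrounding class are not reproduced here, so all names are hypothetical):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a p-dimensional diagonal conditioner whose accumulator
// lives OUTSIDE the per-iteration call, e.g. as part of the SGD state.
struct DiagLearningRate {
  std::vector<double> accum;  // I + running sum of squared score entries
  explicit DiagLearningRate(std::size_t p) : accum(p, 1.0) {}

  // Given the current score (gradient) vector, update the running sum
  // and return the per-coordinate learning rates 1 / accum[i].
  std::vector<double> rates(const std::vector<double>& score) {
    std::vector<double> out(accum.size(), 0.0);
    for (std::size_t i = 0; i < accum.size(); ++i) {
      accum[i] += score[i] * score[i];  // accumulates across calls
      if (std::abs(accum[i]) > 1e-8) out[i] = 1.0 / accum[i];
    }
    return out;
  }
};
```

Calling rates() repeatedly with comparable scores makes each rate shrink roughly like 1/t, which is the intended behavior; a fresh local matrix each call cannot produce that decay.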
Create a wiki page where we will discuss approaches to set the learning rate.
I will also make sure to add a few interesting papers.
Ideas
This will be done once the package goes public, or if we get approval from Github that https://github.com/airoldilab is an education-based organization and they can thus support us having private repos for free (I contacted them; the request is pending).
to do
others to try
Since we're doing scalable computation, we should work with scalable I/O, storage, etc. packages. That is, we'd like to be able to run SGD on data sets larger than a computer's RAM.
Add unit tests to the tests/ folder.
Follow similar structure as https://github.com/hadley/dplyr/tree/master/tests. See http://r-pkgs.had.co.nz/tests.html
We assume the Fisher information gives the variances, but we should also be able to check whether or not SGD is in the asymptotic region.
sgd.theta <- sgd(y ~ ., data=dat, model="glm", model.control=list(family=poisson))
gives error
Error in UseMethod("family") :
no applicable method for 'family' applied to an object of class "NULL"
sgd.theta <- sgd(y ~ ., data=dat, model="glm", model.control=list(family=poisson()))
is working.
Both inputs should be supported.
I built the package with R CMD build sgd
, then checked with R CMD check sgd_0.1.tar.gz
. There is a compiled code warning:
* checking compiled code ... WARNING
File ‘sgd/libs/sgd.so’:
Found ‘___assert_rtn’, possibly from ‘assert’ (C)
Object: ‘sgd.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the C RNG.
I don't see any asserts in our code, so I don't know how this arises.
Full log below:
* using log directory ‘/Users/dvt/temp/sgd.Rcheck’
* using R version 3.1.2 (2014-10-31)
* using platform: x86_64-apple-darwin14.1.0 (64-bit)
* using session charset: UTF-8
* checking for file ‘sgd/DESCRIPTION’ ... OK
* checking extension type ... Package
* this is package ‘sgd’ version ‘0.1’
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘sgd’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking line endings in Makefiles ... OK
* checking compilation flags in Makevars ... OK
* checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS) ... OK
* checking compiled code ... WARNING
File ‘sgd/libs/sgd.so’:
Found ‘___assert_rtn’, possibly from ‘assert’ (C)
Object: ‘sgd.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the C RNG.
See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.
* checking examples ... OK
* checking PDF version of manual ... OK
* DONE
WARNING: There was 1 warning.
See
‘/Users/dvt/temp/sgd.Rcheck/00check.log’
for details.
This cannot be efficiently implemented in Rcpp: Rcpp cannot interface with an R function in any way other than calling it in R each time. Hence, for now, we are using strings to match models specified by the user, which allows for faster computation using C++ functions. Later on, we will extend this so that the model
argument in sgd can be either a string (uses a C++ function on a case-by-case basis) or a generic function (uses the R function).
Not sure there is an analytic solution to the implicit update for GLMs other than the normal model. For example, the logistic model requires solving some kind of equation like $x + \exp(x)x = C$
. To make things more complicated, x
is a vector...
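Even without a closed form, an equation of this shape is strictly increasing in x, so a guarded one-dimensional root search suffices. A minimal sketch (a hypothetical helper, not the package's actual solver) solving $x + \exp(x)x = C$ for scalar x >= 0, C >= 0 by bracketing and bisection:

```cpp
#include <cmath>

// Solve f(x) = x + x*exp(x) - C = 0 by bisection (assumes C >= 0, so the
// unique root is nonnegative). Hypothetical helper, not package code.
double solve_implicit(double C, double tol = 1e-10) {
  auto f = [C](double x) { return x + x * std::exp(x) - C; };
  double lo = 0.0, hi = 1.0;
  while (f(hi) < 0.0) hi *= 2.0;  // expand until the root is bracketed
  while (hi - lo > tol) {         // f is increasing, so plain bisection works
    double mid = 0.5 * (lo + hi);
    (f(mid) < 0.0 ? lo : hi) = mid;
  }
  return 0.5 * (lo + hi);
}
```

A handful of bisection (or Newton) steps per observation is cheap relative to the rest of the update, which is why the lack of an analytic solution is not fatal in practice.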
It should collect as few large objects as possible (not fitted.values, for example), but of course keep coefficients
. The storage of the parameter estimates is already quite large in our setting, where we assume a large number of observations and features.
In implicit_experiment.h
, the current implementation has variables and class methods which are used only for GLMs, e.g., h_transfer
and g_link
.
Things available for example
We will need to organize the datasets which will be used for testing.
There are some datasets from economics that were brought to my attention.
Ye, could you start a wiki page here regarding the datasets?
We will update it as we go.
I find that most regression results come with summary(): for lm and glm, the summary includes the standard error and p-value of every estimated coefficient.
I think it would be hard to compute standard errors in our package. Do you think we should exclude them?
The list I pass into sgd.control
removes any of the additional default parameters.
X <- matrix(rnorm(10), ncol=1)
eps <- rnorm(10)
theta <- c(1,5)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
sgd.est <- sgd(y ~ x, data=dat, model="glm",
model.control=list(family = gaussian()),
sgd.control=list(epsilon=1e-10))
## Error in fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.00691773260806754, :
## argument "lr.type" is missing, with no default
Should we do an is.null(...)
check on every list element (or merge the user's list into the defaults, e.g. with modifyList) so the defaults are passed in instead?
We find that overflow is possible for the implicit-SGD method when x and y are large.
In the first steps, theta is initialized close to 0. A large y leads to a large r_n. As the bound is large, the guesses for ksi can be large. For the Poisson model, h(norm(x) * ksi) can go to infinity.
For example, when x = 10, y = 1000, and a_n = 0.5, we get r_n = 500. Initial guesses of ksi can easily reach 200. In the Poisson model, exp(200 * 100) goes to infinity.
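One way to avoid this is to never let the guesses roam freely: the implicit-update root lies between 0 and a_n * r_n, so searching only that interval with a clamped exponential keeps everything finite. A sketch for a scalar Poisson update ksi = a * (y - exp(eta + s * ksi)), s = ||x||^2 (hypothetical helper, not the package's code):

```cpp
#include <algorithm>
#include <cmath>

// Clamp the exponent so exp never overflows (exp(709) is near DBL_MAX).
double safe_exp(double u) { return std::exp(std::min(u, 700.0)); }

// Solve ksi = a * (y - safe_exp(eta + s * ksi)) by bisection on the
// interval between 0 and a * r, where r = y - exp(eta) is the residual
// at ksi = 0; the root is known to lie in that interval and
// g(k) = k - a*(y - exp(eta + s*k)) is increasing in k.
double poisson_implicit(double y, double eta, double s, double a) {
  double r  = y - safe_exp(eta);
  double lo = std::min(0.0, a * r), hi = std::max(0.0, a * r);
  auto g = [&](double k) { return k - a * (y - safe_exp(eta + s * k)); };
  for (int i = 0; i < 200; ++i) {
    double mid = 0.5 * (lo + hi);
    (g(mid) < 0.0 ? lo : hi) = mid;
  }
  return 0.5 * (lo + hi);
}
```

With the numbers above (y = 1000, s = 100, a = 0.5, theta near 0), the bounded search settles on a tiny ksi rather than wandering to 200, and the clamped exp keeps intermediate evaluations finite.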
Why does the square root in AdaGrad empirically get better performance? ... or does it? To be analyzed!
Implicit works but this one doesn't (it should)
library(sgd)
# Dimensions
N <- 1e5
d <- 1e2
# Generate data.
X <- matrix(rnorm(N*d), ncol=d)
theta <- rep(5, d+1)
eps <- rnorm(N)
y <- cbind(1, X) %*% theta + eps
dat <- data.frame(y=y, x=X)
sgd.theta <- sgd(y ~ ., data=dat, model="lm", sgd.control=list(method="sgd"))
## NA or infinite gradient
## Error in fit(x, y, model.control, sgd.control) :
## An error has occured, program stopped.
I believe the user should have the following options for the learning rate.
I suggest we work on options 2, 3, and 4 for now.
We can add the rest as we go. Any thoughts?
This should be as fast as possible; for now, do a grid search. Possible extension:
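As a baseline, a grid search over the learning-rate scale can be sketched as follows (an illustrative 1-D least-squares stream with rates gamma/t; the names and setup are assumptions, not the package's interface):

```cpp
#include <cstddef>
#include <vector>

// Run 1-D SGD with rates gamma / t on a least-squares stream and
// return the average final loss over the data. Illustrative only.
double sgd_final_loss(const std::vector<double>& x,
                      const std::vector<double>& y, double gamma) {
  double theta = 0.0;
  for (std::size_t t = 0; t < x.size(); ++t) {
    double grad = (theta * x[t] - y[t]) * x[t];  // d/dtheta of 0.5*(theta*x - y)^2
    theta -= gamma / double(t + 1) * grad;
  }
  double loss = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double r = theta * x[i] - y[i];
    loss += 0.5 * r * r;
  }
  return loss / double(x.size());
}

// Pick the grid value with the smallest final loss.
double pick_gamma(const std::vector<double>& x, const std::vector<double>& y,
                  const std::vector<double>& grid) {
  double best = grid[0], best_loss = sgd_final_loss(x, y, grid[0]);
  for (double g : grid) {
    double l = sgd_final_loss(x, y, g);
    if (l < best_loss) { best_loss = l; best = g; }
  }
  return best;
}
```

Each grid point costs one pass over the data, so the search is embarrassingly parallel; a smarter extension would bracket the best scale and refine it.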
Our current method of determining the learning rate involves calculating the eigenvalues of X. For categorical variables, we currently split each one into several indicator variables and put those indicators as columns of X. Should we exclude categorical variables when we calculate the covariance matrix and its eigenvalues? (It takes effort!)
How should we handle feature selection, as in glmnet
? We should allow for something such as the elastic net to do L1/L2 regularization.
We would like to have a simple source file that calls the appropriate routines to run a very simple, baseline model (e.g., regression).
This serves both as a quick-start and for code testing.
It would be better to create a separate folder, "testing", where we will include the source for such tests.
lantian2012/sgd-r-package
dustinvtran/ai-sgd
ptoulis/implicit-sgd
Relevant github commands http://stackoverflow.com/a/10548919/1740602
Here is what I get when I run the "Dobson" example.
Error in rep(no, length.out = length(ans)) :
attempt to replicate an object of type 'closure'
My version is
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 0.2
year 2013
month 09
day 25
svn rev 63987
language R
version.string R version 3.0.2 (2013-09-25)
nickname Frisbee Sailing
The intercept
flag in model.control
for glm
is ignored.
The man/
folder has outdated stuff, and its running examples no longer work.
This is the one that worked for me:
library(sgd)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
sgd.D93 <- sgd(counts ~ outcome + treatment, model="glm", model.control=list(family = poisson()))
sgd.D93
There are many safety checks in glm. I plan to implement most of them, but I've bumped into several issues.
Additional applications to network models. That is, you want to aim for particular network characteristics, e.g., average number of links, average diameter, average ...
Hi,
@lantian2012 and I were doing some experiments on a Poisson-model dataset. We found that when using the p-dim
learning rate, the learning rate drops too fast. This is mainly because the Idiag
matrix is the summation of the square of score_function
over the iterations. This value soon goes up to around 1e+18 (when X is around 10, theta.true is around [0.5, 1], and Y ~ Pois(exp(theta' X))), which makes its inverse as tiny as 1e-18. At that point, the learning rate is almost 0, so in each update theta changes only by a really small value.
@ptoulis Yesterday you said that we are calculating the square of the score function, Gi^2, because it approximates the true Fisher information matrix. However, in our pset 3, it is stated that:
I(\theta) = E[h'(x_n' \theta) x_n x_n'] / \phi
That uses the expectation of the first derivative of the transfer function (instead of the square of the score function).
I've been trying to use something like h'(x_n' theta) * (x_n x_n')
as the learning rate. The result, though, was still disappointing.
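For what it's worth, the two quantities do agree at the true theta: for a correctly specified model, the information matrix equality gives E[score * score'] = I(theta), which is why summing Gi Gi' is a reasonable estimate of the Fisher information; the discrepancy comes from evaluating it at the current noisy iterate rather than at the truth. A quick Monte Carlo check for a scalar canonical Poisson model with x = 1 (illustrative, not package code):

```cpp
#include <cmath>
#include <random>

// With x = 1 and canonical link, the score at the TRUE theta is
// y - exp(theta), and the information matrix equality says
// E[score^2] = Var(y) = exp(theta) = h'(theta).
// Seed and sample size are arbitrary choices for illustration.
double mean_sq_score(double theta, int n, unsigned seed) {
  std::mt19937 gen(seed);
  double lambda = std::exp(theta);
  std::poisson_distribution<int> pois(lambda);
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    double score = pois(gen) - lambda;  // score of one observation at truth
    sum += score * score;
  }
  return sum / n;
}
```

mean_sq_score(0.5, 200000, 1) lands close to exp(0.5) ~ 1.649, i.e. to h'(theta); evaluated far from the true theta, the squared score instead blows up, which matches the 1e+18 accumulation you observed.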
Just get some rough heuristics for how it performs now
I was working on supporting each family with all of its valid link functions. I have put the theoretical derivation in reference/glm_family_conclusion.pdf
and reference/Distribution_LinkFunc_Relationship.pdf
. Basically, what we did in class was the canonical link function, and a bit more change is needed in the score function (\nabla log-likelihood) if we use other link functions.
The practical issue I've encountered: in the first few iterations, the estimate of theta is very unstable, so it may easily produce some crazy values when calculating the score function. Is there any way to avoid this issue?