zhangyuc / splash
Splash Project for parallel stochastic learning
Hi @zhangyuc
I noticed that in your code, the weight update is as follows:

while (k < n) {
  values(k) *= -actual_stepsize / math.sqrt(s(k) + 1e-16f)
  k += 1
}

What troubles me is why the update takes this form, which does not seem to be the standard format, such as w = w - stepsize * gradient.
Thank you!
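For context, the loop in the question looks like an AdaGrad-style per-coordinate step rather than plain SGD: `s(k)` appears to accumulate squared gradients, so each coordinate gets its own effective stepsize. A minimal sketch contrasting the two updates (names and structure are illustrative, not Splash's actual API):

```scala
object AdaGradSketch {
  // Plain SGD: w := w - stepsize * g
  def sgdStep(w: Array[Double], g: Array[Double], stepsize: Double): Unit = {
    var k = 0
    while (k < w.length) { w(k) -= stepsize * g(k); k += 1 }
  }

  // AdaGrad-style step: accumulate squared gradients in s, then scale each
  // coordinate by 1 / sqrt(s(k)); the small constant only avoids division by zero.
  def adagradStep(w: Array[Double], g: Array[Double], s: Array[Double],
                  stepsize: Double): Unit = {
    var k = 0
    while (k < w.length) {
      s(k) += g(k) * g(k)
      w(k) -= stepsize * g(k) / math.sqrt(s(k) + 1e-16)
      k += 1
    }
  }
}
```

In the snippet from the question, `values(k)` presumably holds the gradient component, so multiplying it by `-actual_stepsize / math.sqrt(s(k) + 1e-16f)` turns it into the delta that is later added to the weight; that agrees with `w = w - stepsize * gradient` up to the per-coordinate scaling.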
In some problems like matrix/tensor factorization, for a specific input, the variables that need to be modified are known beforehand. But what if each variable is a vector, there are millions of variables, and storing the whole matrix in SharedVariableSet would take too much memory (say, > 32 GB)?
Provided we divide the input in a specific manner, the weight matrix can also be divided. Is there a way to distribute SharedVariableSet and load only the subset that is needed for a specific input chunk?
@zhangyuc how hard would it be to add this feature?
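One way to picture the requested feature: shard the weight rows by the same key used to partition the input, so each worker loads only the shards covering its chunk. A toy sketch of that idea in plain Scala (all names here are hypothetical, nothing below is Splash's actual API):

```scala
object PartitionedParams {
  // Map a row key to a shard; a worker processing input chunk p would only
  // fetch the shards that its keys hash to, not the full weight matrix.
  def shardId(rowKey: Long, numShards: Int): Int =
    ((rowKey % numShards + numShards) % numShards).toInt

  // Group weight rows (each a vector) into shards.
  def buildShards(rows: Map[Long, Array[Double]], numShards: Int)
      : Map[Int, Map[Long, Array[Double]]] =
    rows.groupBy { case (key, _) => shardId(key, numShards) }

  // The set of shards a worker needs for the keys appearing in its input chunk.
  def shardsNeeded(inputKeys: Seq[Long], numShards: Int): Set[Int] =
    inputKeys.map(shardId(_, numShards)).toSet
}
```

With this layout, memory per worker scales with the size of its shard subset rather than the full matrix, at the cost of constraining how the input may be partitioned.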
Hi, thanks for the library. From initial benchmarks on my ML pipelines, it seems to be faster than LBFGS, but the accuracy for logistic regression is worse. It would be cool to handle the regularization parameter in StochasticGradientDescent the same way Spark's mini-batch SGD does.
Hi @zhangyuc:
I have read your paper about Splash. I have a question:
Is the experiment in "Local solutions with unit-weight data" the same as Spark MLlib's current implementation? BTW, according to my experiments, the memory usage is higher than Spark MLlib's SGD. Thank you!
Hi,
Could you state what kind of license (e.g., MIT, GPL, Apache License 2.0, and so on) this project uses?
I think attaching a license would benefit many organizations and companies.
Cheers,
If I understand correctly from reading the report (http://arxiv.org/pdf/1506.07552v1.pdf), the new algorithm is not a batch-wise one. The philosophy behind it is that a batch-wise approach makes less progress than a fully sequential update.
Yet from an implementation perspective, processing a batch can be faster than processing the same number of points one by one (because of dense matrix multiplication). I guess this at least provides some room to speed up the optimization procedure.
Would adding an extra interface to support mini-batches be beneficial for further speed-ups?
Jianbo
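To make the batching point above concrete: here is a hypothetical logistic-regression gradient computed point by point and then averaged over a mini-batch (plain Scala, no BLAS; in an optimized implementation the batch loop would collapse into one dense matrix-vector product, which is where the speed-up comes from):

```scala
object MiniBatchSketch {
  def dot(w: Array[Double], x: Array[Double]): Double =
    (0 until w.length).map(i => w(i) * x(i)).sum

  // Gradient of the logistic loss for one point (x, y), with y in {0, 1}.
  def pointGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val p = 1.0 / (1.0 + math.exp(-dot(w, x)))
    x.map(_ * (p - y))
  }

  // Mini-batch gradient = average of per-point gradients.
  def batchGradient(w: Array[Double], xs: Seq[Array[Double]],
                    ys: Seq[Double]): Array[Double] = {
    val g = new Array[Double](w.length)
    for ((x, y) <- xs.zip(ys)) {
      val gi = pointGradient(w, x, y)
      for (j <- g.indices) g(j) += gi(j) / xs.size
    }
    g
  }
}
```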
Reviewing Splash's code, I notice quite a number of places where a workset is modified in an RDD#foreach or RDD#map operation. This of course works fine when every change that is made is kept in memory and all memory is retained by Spark.
However, AFAICS this is a poor match for Spark's fault tolerance guarantees. E.g., an RDD#foreach operation is assumed by Spark not to perform any mutations. This means that Spark is free to discard a copy of data that is also available on disk, whether or not a foreach loop has iterated over it. When records in the RDD have been changed "behind Spark's back", results will differ depending on whether there was a GC or, e.g., a node crashed.
Now, perhaps there's a good reason as to why this is not an issue for the approach Splash takes. I would certainly be curious to know under which conditions it is possible to do in-memory mutations without telling Spark - and still get the same fault tolerance guarantees.
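The hazard described above can be simulated without a cluster: if a cached block is mutated in place and later recomputed from its lineage (as Spark does after eviction or a node failure), the mutation silently disappears. A toy illustration in plain Scala, not actual Spark code:

```scala
object LineageSketch {
  // A "partition" derived from immutable source data via a lineage function,
  // mimicking how Spark recomputes lost RDD blocks from their lineage.
  final class Partition(source: Array[Int], lineage: Int => Int) {
    private var cache: Option[Array[Int]] = None
    def data: Array[Int] = cache match {
      case Some(d) => d
      case None =>
        val d = source.map(lineage) // recompute from lineage
        cache = Some(d)
        d
    }
    def evict(): Unit = cache = None // simulates GC pressure or a node failure
  }

  def demo(): (Int, Int) = {
    val p = new Partition(Array(1, 2, 3), _ * 10)
    p.data(0) = 999            // in-place mutation "behind Spark's back"
    val beforeEvict = p.data(0)
    p.evict()                  // block dropped, then recomputed on next access
    val afterEvict = p.data(0)
    (beforeEvict, afterEvict)  // the mutation is lost after recomputation
  }
}
```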
Hi @zhangyuc @mijordan3
According to your comparison of reweighting, the local solution (just using the average) has a high bias; but if we shuffle the data before training, I think we can avoid this problem, and if so, we simply don't need to do reweighting.
Thank you very much!