zhangyuc / splash
Splash Project for parallel stochastic learning
Hi @zhangyuc
I noticed that in your code, the weight update is as follows:

while (k < n) {
  values(k) *= -actual_stepsize / math.sqrt(s(k) + 1e-16f)
  k += 1
}

What troubles me is why the update takes this form, which does not seem to be the standard format, such as w = w - stepsize * gradient.
Thank you!
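For context, the loop in the question looks like an AdaGrad-style per-coordinate step rather than plain SGD: `s(k)` appears to accumulate squared gradients, so each coordinate gets its own effective stepsize. A minimal sketch contrasting the two updates (names and structure are illustrative, not Splash's actual API):

```scala
object AdaGradSketch {
  // Plain SGD: w := w - stepsize * g
  def sgdStep(w: Array[Double], g: Array[Double], stepsize: Double): Unit = {
    var k = 0
    while (k < w.length) { w(k) -= stepsize * g(k); k += 1 }
  }

  // AdaGrad-style step: accumulate squared gradients in s, then scale each
  // coordinate by 1 / sqrt(s(k)); the small constant only avoids division by zero.
  def adagradStep(w: Array[Double], g: Array[Double], s: Array[Double],
                  stepsize: Double): Unit = {
    var k = 0
    while (k < w.length) {
      s(k) += g(k) * g(k)
      w(k) -= stepsize * g(k) / math.sqrt(s(k) + 1e-16)
      k += 1
    }
  }
}
```

In the snippet from the question, `values(k)` presumably holds the gradient component, so multiplying it by `-actual_stepsize / math.sqrt(s(k) + 1e-16f)` turns it into the delta that is later added to the weight; that agrees with `w = w - stepsize * gradient` up to the per-coordinate scaling.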
In some problems like matrix/tensor factorization, for a specific input, the variables that need to be modified are known beforehand. But what if each variable is a vector, there are millions of variables, and storing the whole matrix in SharedVariableSet would take too much memory (say, > 32 GB)?
Provided we divide the input in a specific manner, the weight matrix can also be divided. Is there a way to distribute SharedVariableSet and load only the subset that is needed for a specific input chunk?
@zhangyuc how hard would it be to add this feature?
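One way to picture the requested feature: shard the weight rows by the same key used to partition the input, so each worker loads only the shards covering its chunk. A toy sketch of that idea in plain Scala (all names here are hypothetical, nothing below is Splash's actual API):

```scala
object PartitionedParams {
  // Map a row key to a shard; a worker processing input chunk p would only
  // fetch the shards that its keys hash to, not the full weight matrix.
  def shardId(rowKey: Long, numShards: Int): Int =
    ((rowKey % numShards + numShards) % numShards).toInt

  // Group weight rows (each a vector) into shards.
  def buildShards(rows: Map[Long, Array[Double]], numShards: Int)
      : Map[Int, Map[Long, Array[Double]]] =
    rows.groupBy { case (key, _) => shardId(key, numShards) }

  // The set of shards a worker needs for the keys appearing in its input chunk.
  def shardsNeeded(inputKeys: Seq[Long], numShards: Int): Set[Int] =
    inputKeys.map(shardId(_, numShards)).toSet
}
```

With this layout, memory per worker scales with the size of its shard subset rather than the full matrix, at the cost of constraining how the input may be partitioned.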
Hi, thanks for the library. From initial benchmarks on my ML pipelines, it seems to be faster than LBFGS, but the accuracy for logistic regression is worse. It would be cool to handle the regularization parameter in StochasticGradientDescent the same way Spark's mini-batch SGD does.
Hi @zhangyuc:
I have read your paper about Splash. I have a question:
Is the experiment in "Local solutions with unit-weight data" the same as Spark MLlib's current implementation? BTW, according to my experiments, the memory usage is higher than Spark MLlib's SGD. Thank you!
Hi,
Could you state what kind of license (e.g., MIT, GPL, Apache License 2.0, and so on) this project uses?
I think attaching a license would benefit many organizations and companies.
Cheers,
If I understand correctly from reading the report (http://arxiv.org/pdf/1506.07552v1.pdf), the new algorithm is not a batch-wise one. The philosophy behind it is that a batch-wise approach makes less progress than a fully sequential update.
Yet from an implementation perspective, processing a batch can be faster than processing the same number of points one by one (because of dense matrix multiplication). I guess this at least provides some room to speed up the optimization procedure.
Would adding an extra interface to support mini-batches be beneficial for further speed-ups?
Jianbo
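To make the batching point above concrete: here is a hypothetical logistic-regression gradient computed point by point and then averaged over a mini-batch (plain Scala, no BLAS; in an optimized implementation the batch loop would collapse into one dense matrix-vector product, which is where the speed-up comes from):

```scala
object MiniBatchSketch {
  def dot(w: Array[Double], x: Array[Double]): Double =
    (0 until w.length).map(i => w(i) * x(i)).sum

  // Gradient of the logistic loss for one point (x, y), with y in {0, 1}.
  def pointGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val p = 1.0 / (1.0 + math.exp(-dot(w, x)))
    x.map(_ * (p - y))
  }

  // Mini-batch gradient = average of per-point gradients.
  def batchGradient(w: Array[Double], xs: Seq[Array[Double]],
                    ys: Seq[Double]): Array[Double] = {
    val g = new Array[Double](w.length)
    for ((x, y) <- xs.zip(ys)) {
      val gi = pointGradient(w, x, y)
      for (j <- g.indices) g(j) += gi(j) / xs.size
    }
    g
  }
}
```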
Reviewing Splash's code, I notice quite a number of places where a workset is modified in an RDD#foreach or RDD#map operation. This of course works fine when every change that is made is kept in memory and all memory is retained by Spark.
However, AFAICS this is a poor match for Spark's fault tolerance guarantees. E.g., an RDD#foreach operation is assumed by Spark not to perform any mutations. This means that Spark is free to discard a copy of data that is also available on disk, whether or not a foreach loop has iterated over it. When records in the RDD have been changed "behind Spark's back", results will differ depending on whether there was a GC or, e.g., a node crashed.
Now, perhaps there's a good reason as to why this is not an issue for the approach Splash takes. I would certainly be curious to know under which conditions it is possible to do in-memory mutations without telling Spark - and still get the same fault tolerance guarantees.
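The hazard described above can be simulated without a cluster: if a cached block is mutated in place and later recomputed from its lineage (as Spark does after eviction or a node failure), the mutation silently disappears. A toy illustration in plain Scala, not actual Spark code:

```scala
object LineageSketch {
  // A "partition" derived from immutable source data via a lineage function,
  // mimicking how Spark recomputes lost RDD blocks from their lineage.
  final class Partition(source: Array[Int], lineage: Int => Int) {
    private var cache: Option[Array[Int]] = None
    def data: Array[Int] = cache match {
      case Some(d) => d
      case None =>
        val d = source.map(lineage) // recompute from lineage
        cache = Some(d)
        d
    }
    def evict(): Unit = cache = None // simulates GC pressure or a node failure
  }

  def demo(): (Int, Int) = {
    val p = new Partition(Array(1, 2, 3), _ * 10)
    p.data(0) = 999            // in-place mutation "behind Spark's back"
    val beforeEvict = p.data(0)
    p.evict()                  // block dropped, then recomputed on next access
    val afterEvict = p.data(0)
    (beforeEvict, afterEvict)  // the mutation is lost after recomputation
  }
}
```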
Hi @zhangyuc @mijordan3
According to your comparison of reweighting, the local solution (just using the average) has a high bias; but if we shuffle the data before training, I think we can avoid this problem, and if so, we simply don't need to do reweighting.
Thank you very much!