
splash's People

Contributors

mijordan3, zhangyuc


splash's Issues

Question about SGD update

Hi @zhangyuc
I find that in your code the weight update is as below:

    while (k < n) {
      values(k) *= -actual_stepsize / math.sqrt(s(k) + 1e-16f)
      k += 1
    }
What troubles me is why the update takes this form, which does not seem to be the standard format such as w = w - stepsize * gradient.
Thank you!
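For reference, the snippet appears to implement an AdaGrad-style update, where each coordinate's step size is scaled by the running sum of squared gradients. Below is a minimal sketch of the standard update versus that form (the names are mine, not Splash's actual variables):

    // Minimal sketch (illustrative names, not Splash's code): a plain SGD step
    // versus the AdaGrad-style step that the snippet above appears to implement.
    object SgdUpdateSketch {
      // Standard update: w = w - stepsize * gradient
      def plainStep(w: Array[Double], g: Array[Double], stepsize: Double): Unit = {
        var k = 0
        while (k < w.length) { w(k) -= stepsize * g(k); k += 1 }
      }

      // Per-coordinate step size: s(k) accumulates squared gradients, and the
      // small constant avoids division by zero, matching the 1e-16f above.
      def adagradStep(w: Array[Double], g: Array[Double], s: Array[Double], stepsize: Double): Unit = {
        var k = 0
        while (k < w.length) {
          s(k) += g(k) * g(k)
          w(k) -= stepsize * g(k) / math.sqrt(s(k) + 1e-16)
          k += 1
        }
      }
    }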

Distributed SharedVariableSet

In some problems like matrix/tensor factorization, for a specific input, the variables that need to be modified are known beforehand. What if each variable is a vector, there are millions of variables, and storing the whole matrix in SharedVariableSet would take too much memory (say, > 32 GB)?

Provided we divide the input in a specific manner, the weight matrix can also be divided. Is there a way to distribute SharedVariableSet and load only the subset that is needed for a specific input chunk?

@zhangyuc how hard would it be to add this feature?
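For what it's worth, a rough sketch of the kind of partitioning I have in mind, using plain Spark co-partitioning rather than Splash's SharedVariableSet API (all names here are illustrative):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    // Rough sketch (plain Spark, not Splash's SharedVariableSet): hash-partition
    // the factor rows and the observations by the same key, so each task only
    // loads the slice of the weight matrix that its input chunk touches.
    object CoPartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("co-partition").setMaster("local[2]"))
        val partitioner = new HashPartitioner(4)

        // (rowId, factorVector) -- millions of rows in a real factorization problem
        val factors = sc.parallelize(0L until 1000L).map(i => (i, Array.fill(8)(0.1)))
          .partitionBy(partitioner).cache()
        // (rowId, observation) -- each observation only needs the factor row with its id
        val observations = sc.parallelize(0L until 1000L).map(i => (i, (i % 5).toDouble))
          .partitionBy(partitioner)

        // Joining two co-partitioned RDDs avoids a shuffle: each partition sees
        // only its own shard of the factors.
        val updated = observations.join(factors).mapValues { case (y, vec) =>
          vec.map(v => v + 0.01 * (y - v)) // placeholder update rule
        }
        println(updated.count())
        sc.stop()
      }
    }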

Reg param in the StochasticGradientDescent optimizer

Hi, thanks for the library. From initial benchmarks on my ML pipelines it seems to be faster than L-BFGS, but the accuracy for logistic regression is worse. It would be cool to handle the regularization parameter in StochasticGradientDescent the same way Spark's mini-batch SGD does.
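Concretely, what I mean is something like the L2 update that Spark's mini-batch SGD performs through its SquaredL2Updater (a decaying step size and the regularization term folded into the gradient). A hedged sketch, not Splash's actual API:

    // Sketch of the regularized step I have in mind, mirroring the behaviour of
    // Spark's GradientDescent with SquaredL2Updater. Names are illustrative.
    object L2UpdateSketch {
      def l2Step(w: Array[Double], grad: Array[Double],
                 stepSize: Double, regParam: Double, iter: Int): Unit = {
        val eta = stepSize / math.sqrt(iter)         // decaying step size, as in Spark
        var k = 0
        while (k < w.length) {
          w(k) -= eta * (grad(k) + regParam * w(k))  // L2 penalty pulls weights toward zero
          k += 1
        }
      }
    }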

Some questions about Splash

Hi @zhangyuc:
I have read your paper about Splash, and I have a question:
Is the "local solutions with unit-weight data" baseline in the experiments the same as Spark MLlib's current implementation? By the way, according to my experiments, the memory usage is higher than Spark MLlib's SGD. Thank you!

License is missing

Hi,
Could you state what kind of license (e.g., MIT, GPL, Apache License 2.0, and so on) this product uses?
I think attaching a license would benefit many organizations and companies.

Cheers,

Gradient class support for mini-batches?

If I understand correctly from reading the report (http://arxiv.org/pdf/1506.07552v1.pdf), the new algorithm is not a batch-wise one. The philosophy behind it is that a batch-wise approach makes less progress than fully sequential updates.

Yet from an implementation perspective, processing a batch can be faster than processing the same number of points one by one (because of dense matrix multiplication). I guess this at least provides some room to speed up the optimization procedure.

I am not sure whether adding an extra interface to support mini-batches would be beneficial for further speed-ups.
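For illustration, a rough sketch of what a mini-batch gradient could look like (a Breeze-based logistic-regression gradient, not Splash's actual Gradient interface); the point is that the whole batch reduces to one dense matrix-vector product:

    import breeze.linalg.{DenseMatrix, DenseVector}
    import breeze.numerics.sigmoid

    // Rough sketch, not Splash's Gradient interface: the gradient over a whole
    // mini-batch of logistic-regression examples via dense matrix operations.
    object MiniBatchGradientSketch {
      // X: batchSize x numFeatures, y: labels in {0, 1}, w: current weights
      def logisticBatchGradient(X: DenseMatrix[Double],
                                y: DenseVector[Double],
                                w: DenseVector[Double]): DenseVector[Double] = {
        val preds = sigmoid(X * w)              // one matrix-vector product for the batch
        (X.t * (preds - y)) / X.rows.toDouble   // average gradient over the batch
      }
    }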

Jianbo

Splash consistency with Spark's RDD guarantees

Reviewing Splash's code, I notice quite a number of places where a workset is modified in an RDD#foreach or RDD#map operation. This of course works fine when every change that is made is kept in memory and all memory is retained by Spark.

However, AFAICS this is a poor match for Spark's fault tolerance guarantees. E.g. a #foreach operation is assumed by Spark not to perform any mutations. This means that Spark is free to discard a copy of data that is also available on disk, whether a foreach loop iterated over it or not. When records in the RDD have been changed "behind Spark's back", the results will differ depending on whether there was a GC or, e.g., a node crashed.

Now, perhaps there's a good reason as to why this is not an issue for the approach Splash takes. I would certainly be curious to know under which conditions it is possible to do in-memory mutations without telling Spark - and still get the same fault tolerance guarantees.
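To make the pattern concrete, here is a minimal toy illustration (my own code, not Splash's) of mutating cached records inside foreach versus deriving new records with map, which stays inside the lineage Spark can recompute:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal illustration of the concern, not Splash's code. The foreach below
    // mutates cached records "behind Spark's back": the change survives only as
    // long as Spark keeps that exact in-memory copy, and is lost if a partition
    // is evicted or recomputed after a failure. The map stays in the lineage.
    object MutationSketch {
      case class Record(var value: Double)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("mutation").setMaster("local[2]"))

        val cached = sc.parallelize(Seq(Record(1.0), Record(2.0))).cache()
        cached.foreach(r => r.value += 1.0)                   // in-place mutation

        val derived = cached.map(r => Record(r.value + 1.0))  // recomputable result
        println(derived.map(_.value).collect().mkString(","))
        sc.stop()
      }
    }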

My thoughts about the local solution

Hi @zhangyuc @mijordan3
According to your comparison with reweighting, we see that the local solution (just using the average) has a high bias. But if we shuffle the data before training, I think we can avoid this problem; if so, we would not need to do reweighting at all.
Thank you very much!
