
pydatablog / parallelkmeans.jl


Parallel & lightning fast implementation of available classic and contemporary variants of the KMeans clustering algorithm

License: MIT License

Julia 76.26% Jupyter Notebook 23.74%
clustering kmeans-clustering julia parallel-computing mlj-unsupervised mlj kmeans-clustering-algorithm

parallelkmeans.jl's People

Contributors

ablaom, anderson15, arkoniak, asinghvi17, github-actions[bot], pydatablog, xiaodaigh


parallelkmeans.jl's Issues

Brush things up

Some refactoring is needed, and we are more or less ready for these changes. I have put them all here together, but they can be split into separate issues later.

  • Refactor init
  • Decide on the algorithm type (or "the distance type can be different from that of X")
  • Document algorithms separately
  • Drop extra fields from result.

Below is a short description of these problems.

  1. Refactor init.
    Currently init handling is redundant: we have both init and init_k plus some awkward logic for deciding which one to use. The correct solution is to use multiple dispatch: add a function create_seed that accepts the init argument (and all other necessary arguments). If init is a String or Symbol, it should fall through to smart_init; if it is Nothing, use the default k-means++; otherwise return a deepcopy of init.

All of this should happen in kmeans (before kmeans! is called), so a duplicated copy is avoided. A rough sketch of the dispatch is shown below.
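
A rough sketch of how such dispatch could look (the create_seed / smart_init argument lists here are assumptions, not the final API):

smart_init(X, k; init = :kmpp) = X[:, 1:k]          # placeholder seeding, illustration only

create_seed(X, k, init::Union{String, Symbol}) = smart_init(X, k; init = init)  # named strategy
create_seed(X, k, init::Nothing)               = smart_init(X, k)               # default k-means++
create_seed(X, k, init)                        = deepcopy(init)                 # user-supplied centers

create_seed(rand(3, 100), 4, nothing)   # falls back to the default seeding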

  2. Decide on the algorithm type (the distance may have a different type)
    Currently we infer the distance type from the type of the design matrix. This can be wrong: for example, if the eltype of X is RGB or Complex, then the distance can have a different type, usually Float64 or Float32.

This can be solved by making all algorithms parametric, for example
Lloyd{Float64, Float64}, and we can define something like this:

struct Foo{T1, T2} end
Foo{T1}() where {T1} = Foo{T1, T1}()
Foo() = Foo{Float64, Float64}()

Foo{Float64}() # Foo{Float64, Float64}
Foo() # Foo{Float64, Float64}

This makes things somewhat more verbose and adds a constraint relative to the design matrix type, but on the other hand it is more Julia-like.

Alternatively, we could keep inferring everything from the matrix itself and make the distance type a kmeans argument. I think that can work, but it looks weird.

  3. Better documentation

I think it would be better for users to come to the documentation and see a separate page where all algorithms and their usage are described, especially since we will soon add stochastic algorithms (coresets and mini-batch). It could be organized as follows:

  • Full scan algorithms
    -- Lloyd
    -- Elkan
    -- Hamerly
    -- Yinyang
  • Stochastic algorithms
    -- Coresets
    -- MiniBatch
  4. Drop extra fields from results.

Currently the result has lots of redundant fields that are not used, and I think they shouldn't be there, since they can always be calculated from the data already in the result. This extra information shouldn't be computed inside kmeans; instead there should be a separate set of utility functions that can be invoked if the need arises. A hypothetical example is sketched below.
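
For instance, a derived quantity such as cluster sizes could be recomputed on demand by a small utility along these lines (a hypothetical helper, not part of the package):

# Hypothetical utility: recover cluster sizes from the assignments vector
# instead of storing them as an extra field in the result.
function cluster_counts(assignments::AbstractVector{<:Integer}, k::Integer)
    counts = zeros(Int, k)
    for a in assignments
        counts[a] += 1
    end
    return counts
end

cluster_counts([1, 2, 2, 3, 1, 1], 3)   # returns [3, 2, 1]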

Extend [compat] MLJModelInterface = "^0.2,^0.3"

Supervised classifiers in MLJ now require a lower bound of 0.3 to buy into a performance boost. So you may find users complaining that adding ParallelKMeans to their MLJ environment downgrades performance of other models unless you update.

There should be nothing breaking about this update.

Questions welcome.

GPU support

Add GPU support for our implementation of k-means.

Test failure on Julia 1.5

Currently I have commented out the coreset verbose test, since it produces inconsistent results between Julia 1.4.1 and the Julia 1.5 nightly.
See relevant issue: JuliaRandom/StableRNGs.jl#3

When it is resolved, tests should be updated appropriately.

Benchmark figure

Hi and thanks for this package.
In the benchmark figure in the readme, can you make the PK implementations easier to view & disentangle from the others?
For example, you could make them dashed, or thicker, or give them a different marker...
Or make all the PK lines the same color but with different markers... (a sketch of one option follows)
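
For instance, something along these lines in Plots.jl could make the PK series stand out (dummy data, purely to show the styling):

using Plots

# Dummy timings, only to illustrate the styling (not real benchmark numbers).
ks         = 2:10
baseline   = rand(length(ks))
pk_lloyd   = rand(length(ks))
pk_hamerly = rand(length(ks))

plot(ks, baseline; label = "baseline", color = :gray)
plot!(ks, pk_lloyd;   label = "PK Lloyd",   linestyle = :dash, linewidth = 2, marker = :circle)
plot!(ks, pk_hamerly; label = "PK Hamerly", linestyle = :dash, linewidth = 2, marker = :diamond)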

Use unpack

We have lots of manual unpacking; it would be nice to switch to the UnPack.jl library, of course after thorough benchmarking. A small illustration follows.
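
A small before/after illustration of the switch (the Options struct is made up for the example):

using UnPack

struct Options
    k::Int
    tol::Float64
    max_iters::Int
end

opts = Options(3, 1e-6, 300)

# current style: manual unpacking
k, tol, max_iters = opts.k, opts.tol, opts.max_iters

# with UnPack.jl
@unpack k, tol, max_iters = opts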

More generic type support

This is just an idea, not sure whether it makes much sense.

Currently we restrict vectors to Float64 numbers. But it would be interesting to add support for a much broader range of number types, including not only Real or Complex but any kind of number.

For reference, see this discussion and the notebooks in it: https://discourse.julialang.org/t/differentialequations-jl-and-measurements-jl/6350

So, if our clustering algorithm can support Measurements and the like, it would be a good improvement and somewhat novel. An illustration of such input is given below.
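
As an illustration of the kind of input this would enable (hypothetical usage; it is not expected to work with the current Float64-only code path):

using Measurements

# Each entry carries a value and an uncertainty.
X = [1.0 ± 0.1  2.0 ± 0.2  8.0 ± 0.1;
     3.0 ± 0.1  4.0 ± 0.3  9.0 ± 0.2]

# kmeans(X, 2)   # centers and totalcost would then propagate the uncertainties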

Release 1.0.0

Release goals:

  • Updated Documentation.
  • Updated benchmarks & cleaner benchmark plot. #79
  • Contribution guidelines. #108

Release 0.1.9

  • Bump up compat for Distances
  • Add Julia 1.5 to test

Expose the RNG as a hyper-parameter

For easier reproducibility.

Your algorithm is parallelized, so some thought must be given to that. I'm guessing random number generation happens only on the master process? Otherwise you might have to expose multiple RNGs (which MLJ does in learning_curve).

In MLJ the "standard" way to expose an RNG is to make it a hyper-parameter rng=Random.GLOBAL_RNG, which the user can set to any AbstractRNG or to an integer (meaning "use MersenneTwister(rng) as your generator"). A minimal sketch follows.
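
A minimal sketch of that convention applied here (the struct and helper below are illustrative, not ParallelKMeans' actual MLJ model definition):

using Random

mutable struct KMeansModel
    k::Int
    rng::Union{AbstractRNG, Integer}   # hyper-parameter: an RNG or an integer seed
end
KMeansModel(; k = 3, rng = Random.GLOBAL_RNG) = KMeansModel(k, rng)

# Inside fit, an integer is interpreted as a seed:
resolve_rng(rng::Integer)     = MersenneTwister(rng)
resolve_rng(rng::AbstractRNG) = rng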

Refactor `YinYang` to support non-euclidean metrics.

Currently, Yinyang works only with the Euclidean metric, since its main internal functions rely heavily on the exact form of the metric calculation. The algorithm should be generalized (everywhere you see a sqrt, it's a Euclidean-metric smell...). A sketch of the difference is given below.
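
To illustrate the difference, the generic version below sketches what a refactor with Distances.jl might look like (it is not existing package code):

using Distances

x, y = rand(3), rand(3)

# The "sqrt smell": a hard-coded Euclidean distance.
d_euclid = sqrt(sum(abs2, x .- y))

# Generic alternative: take the metric as an argument (Distances.jl API).
metric    = Euclidean()                 # could be Cityblock(), CosineDist(), ...
d_generic = evaluate(metric, x, y)      # equals d_euclid when the metric is Euclidean()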

Add Lightweight Coresets

I have a working implementation of the lightweight coresets paper (https://las.inf.ethz.ch/files/bachem18scalable.pdf) in Julia. It's not distributed yet (I only have one machine to run it on anyway) but if you guys want I can clone the repo and try to put together a pull request. I looked at the documentation but the contribution guidelines had a TODO, so not sure what exactly is expected.

Release 0.2.1

  • MiniBatch algorithm implementation #55
  • Fix upstream test failure in StableRNGs #82
  • General cleanup in anticipation of v1.0.0

Add binder support for live interactive testing

Binder makes it possible for users to try packages online in an interactive notebook session. This is a great way for users to test the package before installing it locally. Binder support on the landing page will help enhance the accessibility of the project, as well as give users a hands-on experience with the examples given in the documentation.

Not working on this example. Only gives one value

using ParallelKMeans
using DataFrames


df = DataFrame(val = rand(1_000_000))


@time multi_results = kmeans(reshape(df[!, :val], :, 1), 8)

The multi_results.assignment only gives one value and I was expecting 1_000_000.

Recursive version of multithreading

Our current implementation is rather simplistic and naive: we just split the matrix into equal chunks and hand them to different threads. But this is not how it was intended to be used: https://julialang.org/blog/2019/07/multithreading/

The general idea is to write a recursive function that splits the matrix in half and calls itself recursively. Upon hitting some limit (fewer than 1000 columns, for example?), the actual procedure should commence.

This approach has its benefits: for example, there would be no penalty for multithreading small matrices, since the algorithm wouldn't start new tasks in that case. It also lets us remove the MultiThreading/SingleThread modes completely. We should implement this approach and benchmark it properly. A rough sketch is given below.
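
A rough sketch of that idea, using a column sum as a stand-in for the per-chunk k-means work (the threshold and function names are placeholders):

using Base.Threads: @spawn

# Recursively split the column range in half; below the threshold, do the work
# in the current task instead of spawning new ones.
function colsum_recursive(X::AbstractMatrix, lo::Int = 1, hi::Int = size(X, 2))
    if hi - lo < 1000                        # small chunk: no new task is spawned
        return sum(view(X, :, lo:hi), dims = 2)
    end
    mid   = (lo + hi) ÷ 2
    left  = @spawn colsum_recursive(X, lo, mid)
    right = colsum_recursive(X, mid + 1, hi)
    return fetch(left) .+ right
end

colsum_recursive(rand(10, 100_000))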

Smart init needs to be thoroughly vetted and tested. Currently buggy and unstable

Convergence seems to be unstable at different tolerance levels, and the sum-of-squares values jump around a lot at different K values.

using Plots
using Clustering
using ParallelKMeans

X = rand(10000, 30);

@time a = [Clustering.kmeans(X', i).totalcost for i = 2:10];
@time b = [ParallelKMeans.kmeans(X, i, tol=1e-6, verbose=false)[end] for i = 2:10];

plot(a)
plot!(b)

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Faster k-means

As a future step after the implementation of point-wise parallel computations, it would make sense to improve the algorithm by using "fast k-means" techniques.

Several approaches exist; here are some inspirational links:

In the latter case one should be careful, though: there is no license, so we should contact the author about whether our MIT license is applicable.


Release 0.1.7

For this release, the goals are:

  • Add binder support
  • Distance support. #25
  • Update MLJ Interface with Yinyang & Coresets Algorithms.

Correctly set up distinct pages for the two main branches

Currently, neither of the two main branches (master & experimental) correctly points to its own docs.

The doc-generation setup should be cleaned up to address this, so that each branch gets the latest docs generated on every new merge.

possible test failure in upcoming Julia version 1.5

A PkgEval run for a Julia pull request that changes the numbers generated by rand(a:b) indicates that the tests of this package might fail on Julia 1.5 (and on Julia's current master branch).

Also, you might be interested in using the new StableRNGs.jl registered package, which provides guaranteed stable streams of random numbers across Julia releases.

Apologies if this is a false positive. Cf. https://github.com/JuliaCI/NanosoldierReports/blob/ab6676206b210325500b4f4619fa711f2d7429d2/pkgeval/by_hash/52c2272_vs_47c55db/logs/ParallelKMeans/1.5.0-DEV-87d2a04de3.log

Add MiniBatchKMeans

It would be nice to have MiniBatchKMeans, in the same way as it is done in scikit-learn:

MiniBatchKMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

Paper: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

Abstract: We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ϵ-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
