PyDataBlog / ParallelKMeans.jl

Parallel & lightning fast implementation of available classic and contemporary variants of the KMeans clustering algorithm.

License: MIT License
Some refactoring is needed, and we are more or less ready for these changes. I put them all here together, but they can be split later into separate issues. Below is a short description of these problems.

`init` handling: `.init` is redundant, since we have `init` and `init_k` and some weird logic for choosing between them. The correct solution is to use multiple dispatch: add an additional function `create_seed`, which should accept the argument `init` (and all other necessary arguments). If `init` is a `String` or `Symbol` it should fall through to `smart_init`; if it is `Nothing`, then default to kmeans++; otherwise, return a `deepcopy` of `init`. All of this should happen in `kmeans` (before `kmeans!`), so a duplicated copy is avoided.
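A minimal sketch of that dispatch, assuming a `smart_init(X, k; init)` helper with roughly this shape (the exact signatures are assumptions, not the package's current API):

```julia
# Sketch only: dispatch on the type of `init` to pick the seeding strategy.
# `smart_init` is assumed to handle the named strategies, including kmeans++.
create_seed(X, k, init::Union{String, Symbol}) = smart_init(X, k; init = init)
create_seed(X, k, ::Nothing) = smart_init(X, k; init = "k-means++")  # default
create_seed(X, k, init::AbstractMatrix) = deepcopy(init)  # user-supplied centroids
```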
Currently we infer the `distance` type from the type of the design matrix. This can be wrong: for example, if the `eltype` of `X` is `RGB` or `Complex`, then the distance can have a different type, usually `Float64` or `Float32`. This can be solved by turning all algorithms into parametric types, for example `Lloyd{Float64, Float64}`, and we can define something like this:
```julia
struct Foo{T1, T2} end
Foo{T1}() where {T1} = Foo{T1, T1}()  # one parameter given: reuse it for both
Foo() = Foo{Float64, Float64}()       # no parameters given: sensible defaults

Foo{Float64}() # Foo{Float64, Float64}
Foo()          # Foo{Float64, Float64}
```
This makes things somewhat more verbose and constrains the algorithm to the design matrix type, but on the other hand it's more Julia-like. Alternatively, we can keep inferring everything from the matrices themselves and pass the distance type as a `kmeans` argument. I think that can work, but it looks weird.
I think it would be better for users to come to the documentation and find a separate page where all algorithms and their usage are described, especially taking into account the fact that we will soon add stochastic algorithms (coresets and minibatch). It can be organized as follows:
Currently we have lots of redundant fields in the result which are not used, and I think they shouldn't be added, since they can always be calculated from the current result data. This extra information shouldn't be calculated inside `kmeans`; there should be a separate set of utility functions which can be invoked if the need arises.
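As an illustration (a hypothetical helper, not existing package code), cluster sizes can be derived on demand from the assignments instead of being stored in the result:

```julia
# Hypothetical utility: compute cluster sizes from an assignments vector.
function cluster_counts(assignments::AbstractVector{<:Integer}, k::Integer)
    c = zeros(Int, k)
    for a in assignments
        c[a] += 1
    end
    return c
end

# e.g. cluster_counts(result.assignments, k)
```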
Currently, the package uses the squared Euclidean distance as the de facto metric. Users should be given the freedom to choose other metrics. The plan is to provide support for all the distance metrics available in Distances.jl.
Supervised classifiers in MLJ now require a lower bound of 0.3 to buy into a performance boost, so you may find users complaining that adding ParallelKMeans to their MLJ environment downgrades the performance of other models unless you update. There should be nothing breaking about this update.
Questions welcome.
Add GPU support for our implementation of k-means.
As far as I can tell, only kmeans++ is currently implemented. Looking at https://www.mdpi.com/1999-4893/14/1/6, "Improving Scalable K-Means++", it looks like SRPK-means‖ could be a good method to have available 🙂.
Currently I have commented out the coreset verbose test, since it produces inconsistent results between Julia 1.4.1 and the Julia 1.5 nightly. See the relevant issue: JuliaRandom/StableRNGs.jl#3. When it is resolved, the tests should be updated appropriately.
Hi, and thanks for this package.
In the benchmark figure in the readme, can you make the PK implementations easier to view and disentangle from the others?
For example, maybe make them dashed, or thicker, or give them a different marker...
Or make all the PK lines the same color but with different markers...
We have lots of manual unpacking; it would be nice to switch to the UnPack.jl library, of course after thorough benchmarking.
It would be great to provide an interface for researchers to cite this project. Zenodo seems like a good choice but other alternatives should be explored as well.
This is just an idea; I'm not sure whether it makes much sense.
Currently we restrict vectors to `Float64` numbers, but it would be interesting to add support for a much broader class of numbers, including not only `Real` or `Complex` but any kind of number.
For a reference point, see this discussion and the notebooks in it: https://discourse.julialang.org/t/differentialequations-jl-and-measurements-jl/6350
So, if our clustering algorithm could support Measurements and the like, it would be a good improvement and somewhat novel.
For easier reproducibility.
Your algorithm is parallelized, so some thought must be given to that. I'm guessing random number generation only happens on the master process? Otherwise you might have to expose multiple RNGs (which MLJ does in `learning_curve`).
In MLJ the "standard" way to expose an RNG is to make it a hyperparameter `rng = Random.GLOBAL_RNG`, which the user can set to any `AbstractRNG` or to an integer (meaning "use `MersenneTwister(rng)` as your generator").
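A minimal sketch of that convention (the struct and helper names here are illustrative, not actual MLJ or ParallelKMeans code):

```julia
using Random

# Illustrative model following the MLJ convention described above:
# `rng` is a hyperparameter accepting an AbstractRNG or an integer seed.
mutable struct KMeansClusterer
    k::Int
    rng::Union{AbstractRNG, Integer}
end
KMeansClusterer(; k = 3, rng = Random.GLOBAL_RNG) = KMeansClusterer(k, rng)

# Resolve the hyperparameter to a concrete generator before fitting.
resolve_rng(rng::AbstractRNG) = rng
resolve_rng(seed::Integer) = MersenneTwister(seed)
```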
Currently, `YinYang` can work only with the Euclidean metric, since its main internal functions rely heavily on the exact form of the metric calculation. The algorithm should be generalized (everywhere you see a `sqrt`, it's a Euclidean-metric smell...).
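To illustrate the kind of change involved (a sketch, not actual package code): bounds that are currently recovered via `sqrt` of stored squared distances would instead be computed through the metric itself:

```julia
using Distances

# Euclidean-specific: the half-distance bound recovered from a stored
# *squared* Euclidean distance — the `sqrt` smell mentioned above.
center_bound(d2::Real) = sqrt(d2) / 2

# Generic version: compute the bound through the metric itself; valid for
# any `Metric` (the triangle inequality is required for such bounds).
center_bound(metric::Metric, a::AbstractVector, b::AbstractVector) =
    evaluate(metric, a, b) / 2
```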
I have a working implementation of the lightweight coresets paper (https://las.inf.ethz.ch/files/bachem18scalable.pdf) in Julia. It's not distributed yet (I only have one machine to run it on anyway), but if you guys want, I can clone the repo and try to put together a pull request. I looked at the documentation, but the contribution guidelines had a TODO, so I'm not sure what exactly is expected.
Binder makes it possible for users to test packages online in an interactive notebook session. This is a great way for users to test the package before installing locally. Support for Binder on the landing page will help enhance the accessibility of the project as well as give users a hands-on experience with the examples given in the documentation.
```julia
using ParallelKMeans
using DataFrames

df = DataFrame(val = rand(1_000_000))

@time multi_results = kmeans(reshape(df[!, :val], :, 1), 8)
```

`multi_results.assignments` only gives one value, and I was expecting `1_000_000` of them.
Our current implementation is rather simplistic and naive: we just split the matrix into equal chunks and hand them to different threads. But this is not how it was intended to be done: https://julialang.org/blog/2019/07/multithreading/
The general idea of how it should be implemented is to write a recursive function which splits the matrix in half and recursively calls itself. Upon hitting some limit (fewer than 1000 columns, for example?), the actual procedure should commence.
This approach has its benefits: for example, there would be no penalty for multithreading small matrices, since the algorithm wouldn't spawn new tasks in that case. It also helps to remove the `MultiThreading`/`SingleThread` modes completely. We should implement this approach and benchmark it properly.
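A minimal sketch of the recursive divide-and-conquer pattern from the blog post (`f!` stands in for the actual per-chunk k-means work; the 1000-column cutoff is the example limit mentioned above):

```julia
using Base.Threads: @spawn

# Recursively split the column range in half, spawning a task for one half
# and recursing into the other; below the cutoff, run the serial kernel.
function parallel_apply!(f!, X::AbstractMatrix, lo::Int = 1, hi::Int = size(X, 2))
    if hi - lo + 1 < 1000        # small ranges run serially: no spawn overhead
        f!(X, lo, hi)
        return nothing
    end
    mid = (lo + hi) >> 1
    task = @spawn parallel_apply!(f!, X, lo, mid)
    parallel_apply!(f!, X, mid + 1, hi)
    wait(task)
    return nothing
end
```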
Currently we implement only the `SqEuclidean` metric, but we can add support for all other metrics in `Distances` in the same manner as it is done in https://github.com/JuliaStats/Distances.jl/blob/master/src/generic.jl#L45
We should check the performance, of course. It may be possible to use our own implementation for `SqEuclidean` and the generic `Distances` implementation for all other metrics.
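One possible shape for that split (a sketch; `colwise_distance!` is a hypothetical internal name, not existing package code):

```julia
using Distances

# Hypothetical internal: fill `out[j]` with the distance from centroid `c`
# to column `j` of `X`. The generic fallback defers to Distances.jl.
colwise_distance!(metric::SemiMetric, out, c::AbstractVector, X::AbstractMatrix) =
    colwise!(out, metric, c, X)

# Specialized method: keep the hand-tuned SqEuclidean loop.
function colwise_distance!(::SqEuclidean, out, c::AbstractVector, X::AbstractMatrix)
    @inbounds for j in axes(X, 2)
        s = zero(eltype(out))
        for i in axes(X, 1)
            s += (X[i, j] - c[i])^2
        end
        out[j] = s
    end
    return out
end
```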
Convergence seems to be unstable at different tolerance levels, and the sum-of-squares values jump around a lot at different K values.
```julia
using Plots
using Clustering
using ParallelKMeans

X = rand(10000, 30);

# Total cost across k = 2:10 for both implementations
@time a = [Clustering.kmeans(X', i).totalcost for i = 2:10];
@time b = [ParallelKMeans.kmeans(X, i, tol=1e-6, verbose=false)[end] for i = 2:10];

plot(a)
plot!(b)
```
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your `TagBot.yml` to include issue comment triggers. Please see this post on Discourse for instructions and more details.
If you'd like me to do this for you, comment `TagBot fix` on this issue and I'll open a PR within a few hours. Please be patient!
As a future step after the implementation of point-wise parallel computations, it would make sense to improve the algorithm by using "fast k-means" techniques.
Several approaches exist; here are some inspirational links.
In the latter case one should be careful, though: there is no license, so we should contact the author about whether our MIT license is applicable.
For this release, the goals are:
Currently, neither of the two main branches (master & experimental) points to correct docs.
The doc generation setup should be cleaned up to address this, so that each branch gets the latest docs generated on every new merge.
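For orientation, the relevant knob in a Documenter.jl `docs/make.jl` looks roughly like this (a sketch; the branch mapping is an assumption about this repo's workflow, not its actual configuration):

```julia
using Documenter

# `devbranch` tells Documenter which branch's builds land under the "dev" docs;
# the CI workflow would run this once per branch it deploys for.
deploydocs(
    repo = "github.com/PyDataBlog/ParallelKMeans.jl.git",
    devbranch = "experimental",  # or "master", depending on the workflow
)
```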
A PkgEval run for a Julia pull request which changes the generated numbers for `rand(a:b)` indicates that the tests of this package might fail on Julia 1.5 (and on the current Julia master branch).
Also, you might be interested in using the newly registered StableRNGs.jl package, which provides guaranteed stable streams of random numbers across Julia releases.
Apologies if this is a false positive. Cf. https://github.com/JuliaCI/NanosoldierReports/blob/ab6676206b210325500b4f4619fa711f2d7429d2/pkgeval/by_hash/52c2272_vs_47c55db/logs/ParallelKMeans/1.5.0-DEV-87d2a04de3.log
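For reference, adopting StableRNGs in the tests would look roughly like this:

```julia
using StableRNGs

rng = StableRNG(123)   # seeded stream, stable across Julia releases
rand(rng, 5)           # reproducible on 1.4, 1.5, master, ...
```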
Update the MLJ Interface with the newly added algorithms and the corresponding tests.
It would be nice to have a proper comparison with other k-means implementations. It could be:
Additional links for consideration:
Currently, the GitHub Actions workflow for benchmarking pull requests is broken. It needs to be investigated and fixed.
It would be nice to have MiniBatchKMeans, in the same way as it is done in scikit-learn:
MiniBatchKMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
Paper: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Abstract: We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ϵ-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
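For orientation, a minimal sketch of the mini-batch update the paper describes (not the package's eventual implementation; columns of `X` are treated as observations here):

```julia
using Random

# Mini-batch k-means core loop (Sculley, 2010): sample a batch, assign each
# point to its nearest center, then move that center toward the point with a
# per-center learning rate of 1/count.
function minibatch_kmeans!(centers::Matrix{Float64}, X::Matrix{Float64};
                           batch = 100, iters = 100, rng = Random.GLOBAL_RNG)
    k = size(centers, 2)
    counts = zeros(Int, k)
    for _ in 1:iters
        for j in rand(rng, 1:size(X, 2), batch)
            x = view(X, :, j)
            best = argmin([sum(abs2, x .- view(centers, :, c)) for c in 1:k])
            counts[best] += 1
            η = 1 / counts[best]
            for i in axes(centers, 1)
                centers[i, best] = (1 - η) * centers[i, best] + η * x[i]
            end
        end
    end
    return centers
end
```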