PyDataBlog / ParallelKMeans.jl

Parallel & lightning fast implementation of available classic and contemporary variants of the KMeans clustering algorithm.

License: MIT License
Some refactoring is needed, and we are more or less ready for these changes. I put them all here together, but they can be split later into separate issues. Below is a short description of these problems.

`init` handling: `.init` is redundant, since we have `init` and `init_k` and some weird logic for choosing between them. The correct solution is to use multiple dispatch: add an additional function `create_seed`, which should accept the argument `init` (and all other necessary arguments). If `init` is a `String` or `Symbol` it should fall through to `smart_init`; if it is `Nothing`, then default to kmeans++; otherwise, return a `deepcopy` of `init`. All of this should happen in `kmeans` (before `kmeans!`), so a duplicated copy is avoided.
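A minimal sketch of that dispatch, assuming a `smart_init(X, k; init)` helper with roughly this shape (the exact signatures are assumptions, not the package's current API):

```julia
# Sketch only: dispatch on the type of `init` to pick the seeding strategy.
# `smart_init` is assumed to handle the named strategies, including kmeans++.
create_seed(X, k, init::Union{String, Symbol}) = smart_init(X, k; init = init)
create_seed(X, k, ::Nothing) = smart_init(X, k; init = "k-means++")  # default
create_seed(X, k, init::AbstractMatrix) = deepcopy(init)  # user-supplied centroids
```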
Currently we infer the `distance` type from the type of the design matrix. This can be wrong: for example, if the `eltype` of `X` is `RGB` or `Complex`, then the distance can have a different type, usually `Float64` or `Float32`. This can be solved by turning all algorithms into parametric types, for example `Lloyd{Float64, Float64}`, and we can define something like this:
```julia
struct Foo{T1, T2} end
Foo{T1}() where {T1} = Foo{T1, T1}()  # one parameter given: reuse it for both
Foo() = Foo{Float64, Float64}()       # no parameters given: sensible defaults

Foo{Float64}() # Foo{Float64, Float64}
Foo()          # Foo{Float64, Float64}
```
This makes things somewhat more verbose and constrains the algorithm to the design matrix type, but on the other hand it's more Julia-like. Alternatively, we can keep inferring everything from the matrices themselves and pass the distance type as a `kmeans` argument. I think that can work, but it looks weird.
I think it would be better for users to come to the documentation and find a separate page where all algorithms and their usage are described, especially taking into account the fact that we will soon add stochastic algorithms (coresets and minibatch). It can be organized as follows:
Currently we have lots of redundant fields in the result which are not used, and I think they shouldn't be added, since they can always be calculated from the current result data. This extra information shouldn't be calculated inside `kmeans`; there should be a separate set of utility functions which can be invoked if the need arises.
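As an illustration (a hypothetical helper, not existing package code), cluster sizes can be derived on demand from the assignments instead of being stored in the result:

```julia
# Hypothetical utility: compute cluster sizes from an assignments vector.
function cluster_counts(assignments::AbstractVector{<:Integer}, k::Integer)
    c = zeros(Int, k)
    for a in assignments
        c[a] += 1
    end
    return c
end

# e.g. cluster_counts(result.assignments, k)
```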
Currently, the package uses the squared Euclidean distance as the de facto metric. Users should be given the freedom to choose other metrics. The plan is to provide support for all the distance metrics available in Distances.jl.
Supervised classifiers in MLJ now require a lower bound of 0.3 to buy into a performance boost, so you may find users complaining that adding ParallelKMeans to their MLJ environment downgrades the performance of other models unless you update. There should be nothing breaking about this update.
Questions welcome.
Add GPU support for our implementation of k-means.
As far as I can tell, only kmeans++ is currently implemented. Looking at https://www.mdpi.com/1999-4893/14/1/6, "Improving Scalable K-Means++", it looks like SRPK-means‖ could be a good method to have available 🙂.
Currently I have commented out the coreset verbose test, since it produces inconsistent results between Julia 1.4.1 and the Julia 1.5 nightly. See the relevant issue: JuliaRandom/StableRNGs.jl#3. When it is resolved, the tests should be updated appropriately.
Hi, and thanks for this package.
In the benchmark figure in the readme, can you make the PK implementations easier to view and disentangle from the others?
For example, maybe make them dashed, or thicker, or give them a different marker...
Or make all the PK lines the same color but with different markers...
We have lots of manual unpacking; it would be nice to switch to the UnPack.jl library, of course after thorough benchmarking.
It would be great to provide an interface for researchers to cite this project. Zenodo seems like a good choice but other alternatives should be explored as well.
This is just an idea; I'm not sure whether it makes much sense.
Currently we restrict vectors to `Float64` numbers, but it would be interesting to add support for a much broader class of numbers, including not only `Real` or `Complex` but any kind of number.
For a reference point, see this discussion and the notebooks in it: https://discourse.julialang.org/t/differentialequations-jl-and-measurements-jl/6350
So, if our clustering algorithm could support Measurements and the like, it would be a good improvement and somewhat novel.
For easier reproducibility.
Your algorithm is parallelized, so some thought must be given to that. I'm guessing random number generation only happens on the master process? Otherwise you might have to expose multiple RNGs (which MLJ does in `learning_curve`).
In MLJ the "standard" way to expose an RNG is to make it a hyperparameter `rng = Random.GLOBAL_RNG`, which the user can set to any `AbstractRNG` or to an integer (meaning "use `MersenneTwister(rng)` as your generator").
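A minimal sketch of that convention (the struct and helper names here are illustrative, not actual MLJ or ParallelKMeans code):

```julia
using Random

# Illustrative model following the MLJ convention described above:
# `rng` is a hyperparameter accepting an AbstractRNG or an integer seed.
mutable struct KMeansClusterer
    k::Int
    rng::Union{AbstractRNG, Integer}
end
KMeansClusterer(; k = 3, rng = Random.GLOBAL_RNG) = KMeansClusterer(k, rng)

# Resolve the hyperparameter to a concrete generator before fitting.
resolve_rng(rng::AbstractRNG) = rng
resolve_rng(seed::Integer) = MersenneTwister(seed)
```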
Currently, `YinYang` can work only with the Euclidean metric, since its main internal functions rely heavily on the exact form of the metric calculation. The algorithm should be generalized (everywhere you see a `sqrt`, it's a Euclidean-metric smell...).
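To illustrate the kind of change involved (a sketch, not actual package code): bounds that are currently recovered via `sqrt` of stored squared distances would instead be computed through the metric itself:

```julia
using Distances

# Euclidean-specific: the half-distance bound recovered from a stored
# *squared* Euclidean distance — the `sqrt` smell mentioned above.
center_bound(d2::Real) = sqrt(d2) / 2

# Generic version: compute the bound through the metric itself; valid for
# any `Metric` (the triangle inequality is required for such bounds).
center_bound(metric::Metric, a::AbstractVector, b::AbstractVector) =
    evaluate(metric, a, b) / 2
```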
I have a working implementation of the lightweight coresets paper (https://las.inf.ethz.ch/files/bachem18scalable.pdf) in Julia. It's not distributed yet (I only have one machine to run it on anyway), but if you guys want, I can clone the repo and try to put together a pull request. I looked at the documentation, but the contribution guidelines had a TODO, so I'm not sure what exactly is expected.
Binder makes it possible for users to test packages online in an interactive notebook session. This is a great way for users to test the package before installing locally. Support for Binder on the landing page will help enhance the accessibility of the project as well as give users a hands-on experience with the examples given in the documentation.
```julia
using ParallelKMeans
using DataFrames

df = DataFrame(val = rand(1_000_000))

@time multi_results = kmeans(reshape(df[!, :val], :, 1), 8)
```

`multi_results.assignments` only gives one value, and I was expecting `1_000_000` of them.
Our current implementation is rather simplistic and naive: we just split the matrix into equal chunks and hand them to different threads. But this is not how it was intended to be done: https://julialang.org/blog/2019/07/multithreading/
The general idea of how it should be implemented is to write a recursive function which splits the matrix in half and recursively calls itself. Upon hitting some limit (fewer than 1000 columns, for example?), the actual procedure should commence.
This approach has its benefits: for example, there would be no penalty for multithreading small matrices, since the algorithm wouldn't spawn new tasks in that case. It also helps to remove the `MultiThreading`/`SingleThread` modes completely. We should implement this approach and benchmark it properly.
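A minimal sketch of the recursive divide-and-conquer pattern from the blog post (`f!` stands in for the actual per-chunk k-means work; the 1000-column cutoff is the example limit mentioned above):

```julia
using Base.Threads: @spawn

# Recursively split the column range in half, spawning a task for one half
# and recursing into the other; below the cutoff, run the serial kernel.
function parallel_apply!(f!, X::AbstractMatrix, lo::Int = 1, hi::Int = size(X, 2))
    if hi - lo + 1 < 1000        # small ranges run serially: no spawn overhead
        f!(X, lo, hi)
        return nothing
    end
    mid = (lo + hi) >> 1
    task = @spawn parallel_apply!(f!, X, lo, mid)
    parallel_apply!(f!, X, mid + 1, hi)
    wait(task)
    return nothing
end
```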
Currently we implement only the `SqEuclidean` metric, but we can add support for all other metrics in `Distances` in the same manner as it is done in https://github.com/JuliaStats/Distances.jl/blob/master/src/generic.jl#L45
We should check the performance, of course. It may be possible to use our own implementation for `SqEuclidean` and the generic `Distances` implementation for all other metrics.
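One possible shape for that split (a sketch; `colwise_distance!` is a hypothetical internal name, not existing package code):

```julia
using Distances

# Hypothetical internal: fill `out[j]` with the distance from centroid `c`
# to column `j` of `X`. The generic fallback defers to Distances.jl.
colwise_distance!(metric::SemiMetric, out, c::AbstractVector, X::AbstractMatrix) =
    colwise!(out, metric, c, X)

# Specialized method: keep the hand-tuned SqEuclidean loop.
function colwise_distance!(::SqEuclidean, out, c::AbstractVector, X::AbstractMatrix)
    @inbounds for j in axes(X, 2)
        s = zero(eltype(out))
        for i in axes(X, 1)
            s += (X[i, j] - c[i])^2
        end
        out[j] = s
    end
    return out
end
```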
Convergence seems to be unstable at different tolerance levels, and the sum-of-squares values jump around a lot at different K values.
```julia
using Plots
using Clustering
using ParallelKMeans

X = rand(10000, 30);

# Total cost across k = 2:10 for both implementations
@time a = [Clustering.kmeans(X', i).totalcost for i = 2:10];
@time b = [ParallelKMeans.kmeans(X, i, tol=1e-6, verbose=false)[end] for i = 2:10];

plot(a)
plot!(b)
```
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your `TagBot.yml` to include issue comment triggers. Please see this post on Discourse for instructions and more details.
If you'd like me to do this for you, comment `TagBot fix` on this issue and I'll open a PR within a few hours. Please be patient!
As a future step after the implementation of point-wise parallel computations, it would make sense to improve the algorithm by using "fast k-means" techniques.
Several approaches exist; here are some inspirational links.
In the latter case one should be careful, though: there is no license, so we should contact the author about whether our MIT license is applicable.
For this release, the goals are:
Currently, neither of the two main branches (master & experimental) points to correct docs.
The doc generation setup should be cleaned up to address this, so that each branch gets the latest docs generated on every new merge.
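For orientation, the relevant knob in a Documenter.jl `docs/make.jl` looks roughly like this (a sketch; the branch mapping is an assumption about this repo's workflow, not its actual configuration):

```julia
using Documenter

# `devbranch` tells Documenter which branch's builds land under the "dev" docs;
# the CI workflow would run this once per branch it deploys for.
deploydocs(
    repo = "github.com/PyDataBlog/ParallelKMeans.jl.git",
    devbranch = "experimental",  # or "master", depending on the workflow
)
```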
A PkgEval run for a Julia pull request which changes the generated numbers for `rand(a:b)` indicates that the tests of this package might fail on Julia 1.5 (and on the current Julia master branch).
Also, you might be interested in using the newly registered StableRNGs.jl package, which provides guaranteed stable streams of random numbers across Julia releases.
Apologies if this is a false positive. Cf. https://github.com/JuliaCI/NanosoldierReports/blob/ab6676206b210325500b4f4619fa711f2d7429d2/pkgeval/by_hash/52c2272_vs_47c55db/logs/ParallelKMeans/1.5.0-DEV-87d2a04de3.log
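For reference, adopting StableRNGs in the tests would look roughly like this:

```julia
using StableRNGs

rng = StableRNG(123)   # seeded stream, stable across Julia releases
rand(rng, 5)           # reproducible on 1.4, 1.5, master, ...
```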
Update the MLJ Interface with the newly added algorithms and the corresponding tests.
It would be nice to have a proper comparison with other k-means implementations. It could be:
Additional links for consideration:
Currently, the GitHub Actions workflow for benchmarking pull requests is broken. It needs to be investigated and fixed.
It would be nice to have MiniBatchKMeans, in the same way as it is done in scikit-learn:
MiniBatchKMeans: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
Paper: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Abstract: We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ϵ-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
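For orientation, a minimal sketch of the mini-batch update the paper describes (not the package's eventual implementation; columns of `X` are treated as observations here):

```julia
using Random

# Mini-batch k-means core loop (Sculley, 2010): sample a batch, assign each
# point to its nearest center, then move that center toward the point with a
# per-center learning rate of 1/count.
function minibatch_kmeans!(centers::Matrix{Float64}, X::Matrix{Float64};
                           batch = 100, iters = 100, rng = Random.GLOBAL_RNG)
    k = size(centers, 2)
    counts = zeros(Int, k)
    for _ in 1:iters
        for j in rand(rng, 1:size(X, 2), batch)
            x = view(X, :, j)
            best = argmin([sum(abs2, x .- view(centers, :, c)) for c in 1:k])
            counts[best] += 1
            η = 1 / counts[best]
            for i in axes(centers, 1)
                centers[i, best] = (1 - η) * centers[i, best] + η * x[i]
            end
        end
    end
    return centers
end
```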