ranjaykrishna / pyxmeans

This project forked from mynameisfiber/pyxmeans

Quick implementation of xmeans in python and C

License: MIT License

Python 27.91% C 71.39% Makefile 0.70%

pyxmeans

This is a quick implementation of XMeans, for k-means-style clustering when the number of clusters is unknown. To keep this code practical to run, I chose to use MiniBatchKMeans instead of KMeans, but the two should be swappable.

Currently MiniBatch and XMeans are supported. XMeans uses multiple MiniBatch runs, including trial runs, to infer how many clusters the data has. It does so by taking the points assigned to a given cluster center, splitting them, and checking whether the resulting labels have a better BIC (Bayesian Information Criterion) than before. This can be done successively until we find the number of clusters.
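The split test described above can be sketched in plain Python. This is only an illustration and not the code in this package: the helper names (`bic`, `two_means`, `should_split`) are hypothetical, the BIC is one common spherical-Gaussian formulation (higher is better) that may differ from the one pyxmeans computes, and a small deterministic 2-means stands in for the trial MiniBatch run.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def bic(clusters):
    """BIC (higher is better) of a hard clustering under a spherical
    Gaussian model with one shared variance. Assumed formulation."""
    k = len(clusters)
    r = sum(len(c) for c in clusters)   # total number of points
    m = len(clusters[0][0])             # dimensionality
    sse = 0.0
    for c in clusters:
        center = mean(c)
        sse += sum(sq_dist(p, center) for p in c)
    if sse == 0.0:                      # degenerate perfect fit
        return math.inf
    variance = sse / (m * (r - k))      # requires r > k
    loglik = sum(len(c) * math.log(len(c) / r) for c in clusters)
    loglik -= 0.5 * r * m * math.log(2 * math.pi * variance)
    loglik -= 0.5 * m * (r - k)
    n_params = k * (m + 1)              # weights, centers, shared variance
    return loglik - 0.5 * n_params * math.log(r)


def two_means(points, iters=10):
    """Tiny deterministic 2-means, standing in for a trial clustering run."""
    c = mean(points)
    a = max(points, key=lambda p: sq_dist(p, c))
    b = max(points, key=lambda p: sq_dist(p, a))
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in points if sq_dist(p, a) > sq_dist(p, b)]
        if not left or not right:
            break
        a, b = mean(left), mean(right)
    return left, right


def should_split(points):
    """True if splitting this cluster's points in two improves the BIC."""
    if len(points) < 4:
        return False
    left, right = two_means(points)
    if not left or not right:
        return False
    return bic([left, right]) > bic([points])


# Usage: one blob on its own, then the same blob plus a distant second blob.
rng = random.Random(0)
one_blob = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
two_blobs = one_blob + [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(100)]
print(should_split(one_blob), should_split(two_blobs))
```

Running this test on each cluster in turn, and keeping only the splits that improve the BIC, is the general shape of the successive search described above.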

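In addition to providing XMeans, the MiniBatch implementation in this package is fast: in the run below, single-threaded pyxmeans finishes in about 0.04 s, versus about 9.2 s for sklearn's MiniBatch routine. Below are benchmarks for all the provided clustering methods and sklearn's MiniBatch routine.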

$ python -m pyxmeans.benchmark
Creating data
Number of points:  10000
Number of dimensions:  2
Number of clusters:  32
initial BIC:  -50214.4559857
initial variance:  0.00164148581105
initial RMS Error:  2.31798411948

Clustering with single-threaded pyxmeans
singlethreaded pyxmeans took 0.043875s
BIC:  -50762.9827994
Variance:  0.00115765494439
RMS Error:  2.31035692593

Clustering with multi-threaded pyxmeans
multithreaded pyxmeans took 0.326129s
BIC:  -50982.8001508
Variance:  0.00104848000929
RMS Error:  2.3113536455

Clustering with multi-threaded pyxmeans (starting k at 20)
multithreaded pyxmeans took 79.005781s
Num Clusters:  30
BIC:  -50352.8461421
Variance:  0.00104986238957
RMS Error:  2.31100693171

Clustering with sklearn
scikitlearn took 9.241426s
BIC:  -50679.1763114
Variance:  0.00112580908789
RMS Error:  2.31050074192

NOTES:

* `max_no_improvement` is set to `None` for sklearn's MiniBatchKMeans so that
  per-iteration speeds can be compared fairly, since we currently do not
  support early stopping.
* RMS Error for the multi-threaded pyxmeans is higher because that function
  aims to minimize the variance of the resulting model.
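For reference, here is a minimal sketch of how the sklearn side of such a comparison could be configured with early stopping disabled. It is not the package's benchmark script: the data comes from sklearn's `make_blobs`, sized to match the benchmark above, and the `max_iter`, `n_init`, and `random_state` values are assumptions chosen for illustration.

```python
import time

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data sized like the benchmark above: 10000 points, 2 dims, 32 clusters.
X, _ = make_blobs(n_samples=10000, n_features=2, centers=32, random_state=0)

# max_no_improvement=None turns off the early-stopping heuristic that is
# based on stalled inertia, so the iteration budget is what gets timed.
model = MiniBatchKMeans(
    n_clusters=32,
    max_iter=20,              # assumed value, for illustration only
    max_no_improvement=None,
    n_init=3,                 # assumed value, for illustration only
    random_state=0,
)

start = time.perf_counter()
model.fit(X)
elapsed = time.perf_counter() - start
print("scikitlearn took %fs" % elapsed)
```

Timings from this sketch will not match the numbers above, since the data, iteration count, and hardware all differ.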

Dependencies

Todo:

  • Optimize data layout when dealing with / comparing computed clusters
  • Better memory management in XMeans (we're copying things everywhere)
  • Pool out children tests in XMeans
