php-kmeans's Introduction

PHP K-means

K-means clustering algorithm implementation in PHP.

Please also see the FAQ

Installation

You can install the package via composer:

composer require bdelespierre/php-kmeans

Usage

require "vendor/autoload.php";

// prepare 50 points of 2D space to be clustered
$points = [
    [80,55],[86,59],[19,85],[41,47],[57,58],
    [76,22],[94,60],[13,93],[90,48],[52,54],
    [62,46],[88,44],[85,24],[63,14],[51,40],
    [75,31],[86,62],[81,95],[47,22],[43,95],
    [71,19],[17,65],[69,21],[59,60],[59,12],
    [15,22],[49,93],[56,35],[18,20],[39,59],
    [50,15],[81,36],[67,62],[32,15],[75,65],
    [10,47],[75,18],[13,45],[30,62],[95,79],
    [64,11],[92,14],[94,49],[39,13],[60,68],
    [62,10],[74,44],[37,42],[97,60],[47,73],
];

// create a 2-dimensional space
$space = new KMeans\Space(2);

// add points to space
foreach ($points as $i => $coordinates) {
    $space->addPoint($coordinates);
}

// cluster these 50 points in 3 clusters
$clusters = $space->solve(3);

// display the cluster centers and attached points
foreach ($clusters as $num => $cluster) {
    $coordinates = $cluster->getCoordinates();
    printf(
        "Cluster %s [%d,%d]: %d points\n",
        $num,
        $coordinates[0],
        $coordinates[1],
        count($cluster)
    );
}

Note: the example uses points in a 2D space, but it will work with any dimension >1.

Testing

composer test

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security-related issues, please email [email protected] instead of using the issue tracker.

Credits

License

Lesser General Public License (LGPL). Please see License File for more information.

FAQ

How to get coordinates of a point/cluster:

$x = $point[0];
$y = $point[1];

// or

list($x,$y) = $point->getCoordinates();

List all points of a space/cluster:

foreach ($cluster as $point) {
    printf('[%d,%d]', $point[0], $point[1]);
}

Attach data to a point:

$point = $space->addPoint([$x, $y, $z], "user #123");

Retrieve point data:

$data = $space[$point]; // e.g. "user #123"

Watch the algorithm run

Each iteration step can be monitored using a callback function passed to KMeans\Space::solve:

$clusters = $space->solve(3, function($space, $clusters) {
    static $iterations = 0;

    printf("Iteration: %d\n", ++$iterations);

    foreach ($clusters as $i => $cluster) {
        printf("Cluster %d [%d,%d]: %d points\n", $i, $cluster[0], $cluster[1], count($cluster));
    }
});

php-kmeans's People

Contributors

battlecook, bdelespierre, ramzeng, roncemer

php-kmeans's Issues

Can you add a version?

Could not find package bdelespierre/php-kmeans at any version for your minimum-stability (stable). Check the package spelling or your minimum-stability

Can you add a version? Thank you.

Update usage and instructions for v3

v3 implementation is vastly different from v2. More elegant, robust, forward-compatible, and better tested. Also comes with new features like the ability to clusterize GPS Coordinates.

An overhaul of the usage and instructions, most notably in README.md but also in demo.php, is therefore needed.

Attaching an ID to each point

Hi there,
is there any way to attach an ID to each point, to be able to tell which is which later? The ID shouldn't be used as a cluster variable, of course. Thanks!

Algorithm run visualization

It would be great to be able to visualize graphically each step of the algorithm to watch it run (and potentially find bottlenecks)

Get the nearest coordinate from a cluster

How can I get the data of the point nearest to a cluster's centroid? For example:

Cluster 0 [50,64]: 5 points
[81,95] <----------------------- this ?
[43,95]
[17,65]
[49,93]
[30,62]

How do I get the point of cluster 0 that is nearest to the centroid?
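One way to answer this without relying on any particular library method is to compute the Euclidean distance from the centroid to each point and keep the smallest. The following is a minimal sketch that assumes the centroid and the points are available as plain coordinate arrays (for example via getCoordinates()); the function name nearestPoint is hypothetical and not part of php-kmeans.

```php
// Hypothetical helper (not part of php-kmeans): return the point closest
// to $centroid using the Euclidean distance, or null if $points is empty.
function nearestPoint(array $centroid, array $points): ?array
{
    $nearest = null;
    $bestDistance = INF;

    foreach ($points as $point) {
        // Sum of squared differences over every dimension of the centroid.
        $sum = 0.0;
        foreach ($centroid as $n => $value) {
            $sum += ($point[$n] - $value) ** 2;
        }
        $distance = sqrt($sum);

        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $nearest = $point;
        }
    }

    return $nearest;
}

// Values taken from the example above.
$nearest = nearestPoint([50, 64], [[81, 95], [43, 95], [17, 65], [49, 93], [30, 62]]);
printf("[%d,%d]\n", $nearest[0], $nearest[1]); // prints "[30,62]"
```

With the centroid as printed in the example, [30,62] is the nearest point (distance of about 20.1), not [81,95].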

Get total variation (Elbow method)

In order to find the best value for K (the number of clusters), it would be nice to get the variance of the distance of clustered points to their cluster's centroid.

Inspired by https://www.youtube.com/watch?v=4b5d3muPQmA
Also see https://en.wikipedia.org/wiki/Elbow_method_(clustering)

I also believe the current v3 implementation of RandomInitialization is wrong 🤷‍♂️

Proposed change

$result = (new Kmeans\Algorithm($init))->clusterize($points, $K);
echo $result->getTotalVariance();
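For illustration, the quantity usually plotted for the elbow method is the within-cluster sum of squared distances, which is what the sketch below computes instead of a statistical variance. It assumes each cluster is a plain array with a 'centroid' key and a 'points' key holding coordinate arrays; that shape and the function name totalVariation are assumptions, not the v3 API.

```php
// Hypothetical helper (not part of php-kmeans): within-cluster sum of
// squared Euclidean distances, summed over all clusters.
function totalVariation(array $clusters): float
{
    $total = 0.0;

    foreach ($clusters as $cluster) {
        foreach ($cluster['points'] as $point) {
            foreach ($cluster['centroid'] as $n => $value) {
                $total += ($point[$n] - $value) ** 2;
            }
        }
    }

    return $total;
}

// Two tiny clusters with made-up values, for illustration only.
$clusters = [
    ['centroid' => [0, 0],   'points' => [[1, 0], [0, 2]]],
    ['centroid' => [10, 10], 'points' => [[10, 11]]],
];

echo totalVariation($clusters), "\n";
```

Running the clustering for K = 1, 2, 3, ... and plotting this value against K gives the curve whose "elbow" suggests a reasonable K.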

Performance improvement

I am clustering about 50K locations, and each cluster should contain about 20 locations or fewer. Unfortunately, the algorithm takes about 1 hour to finish. My initial guess is that repeated distance calculations make it slow; if I add the correct distance formula based on latitude and longitude, it will be even slower.
If you think so too, then adding a distance matrix would help to optimize it. Here is a similar example for DBSCAN:
https://github.com/bhavikm/DBSCAN-clustering/blob/master/index.php
The matrix could be computed when the user calls solve.
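The latitude/longitude distance formula mentioned above is commonly taken to be the haversine great-circle distance. The sketch below shows that formula on its own, assuming points are [latitude, longitude] pairs in degrees and a mean Earth radius of 6371 km; the function name haversineKm is hypothetical and this is not how the library computes distances.

```php
// Hypothetical helper (not part of php-kmeans): great-circle distance in
// kilometres between two [latitude, longitude] pairs given in degrees.
function haversineKm(array $a, array $b): float
{
    $earthRadiusKm = 6371.0; // mean Earth radius, an approximation

    $lat1 = deg2rad($a[0]);
    $lat2 = deg2rad($b[0]);
    $dLat = $lat2 - $lat1;
    $dLon = deg2rad($b[1] - $a[1]);

    $h = sin($dLat / 2) ** 2
       + cos($lat1) * cos($lat2) * sin($dLon / 2) ** 2;

    // min() guards against rounding pushing sqrt($h) slightly above 1.
    return 2 * $earthRadiusKm * asin(min(1.0, sqrt($h)));
}

// Paris to Lyon, roughly 390 km as the crow flies.
printf("%.1f km\n", haversineKm([48.85889, 2.32004], [45.75781, 4.83201]));
```

Each call costs several trigonometric operations, which is why it is noticeably slower than a plain Euclidean distance when evaluated many times per iteration.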

Add a maximum number of iteration threshold

Proposed change

$kmeans = new Kmeans(new RandomInitialization());

// fit = clusterize (we should rename this method)
$result = $kmeans->fit($points, maxIter: 300);

echo $result->getIterations(); // never more than 300
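To make the idea concrete outside the proposed API, here is a minimal sketch of Lloyd's iterations with a hard cap. It assumes points and starting centroids are plain arrays of numbers; the function name lloydWithCap and the shape of the returned array are hypothetical and not part of php-kmeans.

```php
// Hypothetical helper (not part of php-kmeans): run Lloyd's iterations
// until no point changes cluster or $maxIter iterations have been done.
function lloydWithCap(array $points, array $centroids, int $maxIter = 300): array
{
    $iterations = 0;
    $assignment = [];

    while ($iterations < $maxIter) {
        $iterations++;

        // Assignment step: attach each point to its nearest centroid.
        $newAssignment = [];
        foreach ($points as $i => $point) {
            $best = null;
            $bestDist = INF;
            foreach ($centroids as $c => $centroid) {
                $dist = 0.0;
                foreach ($point as $n => $value) {
                    $dist += ($value - $centroid[$n]) ** 2;
                }
                if ($dist < $bestDist) {
                    $bestDist = $dist;
                    $best = $c;
                }
            }
            $newAssignment[$i] = $best;
        }

        // Converged: the assignment is the same as in the previous iteration.
        if ($newAssignment === $assignment) {
            break;
        }
        $assignment = $newAssignment;

        // Update step: move each centroid to the mean of its points.
        foreach ($centroids as $c => $centroid) {
            $members = array_keys($assignment, $c, true);
            if ($members === []) {
                continue; // an empty cluster keeps its previous centroid
            }
            foreach ($centroid as $n => $unused) {
                $sum = 0.0;
                foreach ($members as $i) {
                    $sum += $points[$i][$n];
                }
                $centroids[$c][$n] = $sum / count($members);
            }
        }
    }

    return ['centroids' => $centroids, 'assignment' => $assignment, 'iterations' => $iterations];
}

$result = lloydWithCap([[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]], 300);
echo $result['iterations'], "\n"; // never more than 300
```

The cap bounds the worst-case running time; when it is reached, the result is simply the state after the last completed iteration.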

Is there a way to adapt the algorithm to use weighted centroid formula?

One big problem with k-means is the number of loops it does. But when points are "close" to each other, why not group them together as a local cluster represented by a single weighted point? Would it make sense?

We could calculate for each point the shortest distance to another point. Then use the mean of these distances to determine which points are really "close" to each other by a user-specified factor. Then create a new point at the centroid of the selected points, giving it a weight equal to the number of points it represents.

That could help lower the number of point-to-cluster-centroid distance computations, wouldn't it?
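The centroid of such weighted points would be a weighted mean rather than a plain mean. As a rough sketch of that single step, assuming each entry is a [coordinates, weight] pair with a positive weight (the function name weightedCentroid is hypothetical and not part of php-kmeans):

```php
// Hypothetical helper (not part of php-kmeans): weighted mean of a list of
// [coordinates, weight] pairs. Each weight is the number of original points
// that the entry stands for.
function weightedCentroid(array $weightedPoints): array
{
    if ($weightedPoints === []) {
        throw new InvalidArgumentException('At least one point is required');
    }

    $dimension = count($weightedPoints[0][0]);
    $sums = array_fill(0, $dimension, 0.0);
    $totalWeight = 0.0;

    foreach ($weightedPoints as [$coordinates, $weight]) {
        foreach ($coordinates as $n => $value) {
            $sums[$n] += $value * $weight;
        }
        $totalWeight += $weight;
    }

    return array_map(fn ($sum) => $sum / $totalWeight, $sums);
}

// One plain point and one point standing for three merged points.
print_r(weightedCentroid([[[0, 0], 1], [[4, 8], 3]]));
```

With every weight equal to 1 this reduces to the ordinary centroid, so the rest of the algorithm would not need to change.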

Deterministic testing

AlgorithmTest::clusterize fails randomly because it uses randomized data.

PHPUnit 9.5.19 #StandWithUkraine

Runtime:       PHP 8.1.1 with Xdebug 3.1.2
Configuration: /home/benjamin/Workspace/bdelespierre/php-kmeans/phpunit.xml.dist

..................................F

Time: 00:00.167, Memory: 10.00 MB

There was 1 failure:

1) Tests\Unit\Euclidean\AlgorithmTest::testClusterize with data set "3D" (Kmeans\Euclidean\Space Object (...), 2.0, Kmeans\PointCollection Object (...), Kmeans\PointCollection Object (...), Kmeans\PointCollection Object (...))
Failed asserting that 2.096814878707058 is less than 2.0.

/home/benjamin/Workspace/bdelespierre/php-kmeans/tests/Unit/AlgorithmTest.php:87

FAILURES!
Tests: 35, Assertions: 69, Failures: 1.

Generating code coverage report in Clover XML format ... done [00:00.006]

Generating code coverage report in HTML format ... done [00:00.043]

This issue must be solved by effectively removing the randomization part and replacing every dataset by static ones.

Multidimensional arrays and diversity clustering

Hello, thank you for sharing this package. I'm hoping to use it to help group users into diverse groups based on socioeconomic factors like race, gender, age, etc. Our dataset contains 20 factors that need to be taken into consideration. Have you used this to solve such a problem?

I've started some preliminary testing, and seem to be getting results but I can't tell what is happening behind the scenes. Furthermore, I would like to be able to weight each factor. For example, race may be the most important factor in some cases, while gender may be in others.

Here is what the data looks like:

 user_id => [
    race,
    gender,
    age
 ]

The numerical representation for each possible value is what we store:

array:10 [
  1 => array:3 [
    0 => -10
    1 => 6
    2 => 1
  ]
  2 => array:3 [
    0 => 3
    1 => 2
    2 => 1
  ]
  3 => array:3 [
    0 => 2
    1 => 1
    2 => 5
  ]
  4 => array:3 [
    0 => 9
    1 => 3
    2 => 4
  ]
  5 => array:3 [
    0 => -12
    1 => 6
    2 => 0
  ]
  6 => array:3 [
    0 => -6
    1 => 7
    2 => 3
  ]
  7 => array:3 [
    0 => 7
    1 => 7
    2 => 5
  ]
  8 => array:3 [
    0 => 4
    1 => 4
    2 => 0
  ]
  9 => array:3 [
    0 => 5
    1 => 7
    2 => 1
  ]
  10 => array:3 [
    0 => -11
    1 => 3
    2 => 2
  ]
]

I'm curious as well: after the clustering is performed, is there any way to retrieve the original key for the data? This is needed because I need to know which users are in each cluster.

If this is not the appropriate channel for this type of question, or beyond the scope of the repo, please let me know. I certainly appreciate any feedback you may have. Thank you :)
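On the weighting question, one common approach with a Euclidean distance is to multiply each coordinate by the square root of its weight before clustering, so that squared differences on that axis are scaled by the weight. The sketch below only does that pre-scaling and keeps the original keys so the user_id stays attached to each row; the function name applyWeights is hypothetical and this is not a feature of the library.

```php
// Hypothetical helper (not part of php-kmeans): scale every factor by the
// square root of its weight. Array keys (user ids) are preserved.
function applyWeights(array $rows, array $weights): array
{
    $scaled = [];

    foreach ($rows as $key => $row) {
        foreach ($row as $n => $value) {
            // sqrt() so the squared difference on axis $n is multiplied by $weights[$n].
            $scaled[$key][$n] = $value * sqrt($weights[$n]);
        }
    }

    return $scaled;
}

// First two users from the data above; the first factor counts four times
// as much as the others in the squared distance (weights are made up).
$users = [
    1 => [-10, 6, 1],
    2 => [3, 2, 1],
];

print_r(applyWeights($users, [4, 1, 1]));
```

The scaled rows can then be added to the space, passing the key as the point's attached data as shown in the FAQ, so each user can be identified after clustering.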

Question: Predictable Results

First: I'm not very into math, so if I don't use the correct words, I'm sorry!

I tried kmeans to cluster coordinates (100 latitude and longitude coordinates) into groups of 4. It seems to work alright, but every time I get a different result. As far as I understand, this is due to some randomness in the initial coordinate that is responsible for the algorithm to cluster things later.

The thing is: as long as the initial coordinate set stays the same (means: no points added, removed or altered) there should be always the same result.

What I try to do is to cluster teams (I have a coordinate for every team) which are distributed over a country. The goal is that teams don't have to travel a long distance to meet in groups to play together, and every team is inside such a group.

Maybe Kmeans is not the right thing to use for this use case? Or there is a solution to get predictable results with this library of yours?
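One general way to get repeatable results is to make the initial centroids depend only on a fixed seed. The sketch below picks starting centroids with a seeded Mersenne Twister; it is a standalone illustration, the function name pickInitialCentroids is hypothetical, and whether the library's own initialization can be seeded this way is not confirmed here.

```php
// Hypothetical helper (not part of php-kmeans): choose $k distinct starting
// centroids from $points in a way that depends only on $seed.
function pickInitialCentroids(array $points, int $k, int $seed): array
{
    $points = array_values($points);
    $n = count($points);

    if ($k < 1 || $k > $n) {
        throw new InvalidArgumentException('k must be between 1 and the number of points');
    }

    mt_srand($seed); // same seed, same sequence of mt_rand() values

    // Partial Fisher-Yates shuffle: only the first $k slots are needed.
    for ($i = 0; $i < $k; $i++) {
        $j = mt_rand($i, $n - 1);
        [$points[$i], $points[$j]] = [$points[$j], $points[$i]];
    }

    return array_slice($points, 0, $k);
}

$teams = [[48.85, 2.35], [45.76, 4.83], [43.30, 5.37], [47.22, -1.55], [44.84, -0.58]];

// Both calls return the same centroids because the seed is the same.
var_dump(pickInitialCentroids($teams, 2, 42) === pickInitialCentroids($teams, 2, 42));
```

The sequence is reproducible for a given seed on the same PHP version; with identical starting centroids and identical points, the iterations that follow are deterministic.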

Should points be immutable?

To make Point immutable, we would just need to make the following method private:

public function setCoordinates(array $coordinates): void

Would it make sense for points to be immutable? Is there a way we benefit from this in terms of clarity or performance?

@battlecook what's your opinion on this?
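For the sake of discussion, here is a rough sketch of what an immutable point could look like: no setter, and a "with" method that returns a modified copy. The class name ImmutablePoint and the method withCoordinates are hypothetical and do not exist in the library.

```php
// Hypothetical class (not part of php-kmeans): coordinates are set once in
// the constructor and can only be read afterwards.
final class ImmutablePoint
{
    private array $coordinates;

    public function __construct(array $coordinates)
    {
        $this->coordinates = array_values($coordinates);
    }

    public function getCoordinates(): array
    {
        return $this->coordinates;
    }

    // Returns a new instance; the original is left untouched.
    public function withCoordinates(array $coordinates): self
    {
        return new self($coordinates);
    }
}

$a = new ImmutablePoint([1, 2]);
$b = $a->withCoordinates([3, 4]);

print_r($a->getCoordinates()); // still [1, 2]
print_r($b->getCoordinates()); // [3, 4]
```

The trade-off is clarity against allocation: a centroid that moves at every iteration would create a new object each time instead of being updated in place.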

Pause algorithm execution

Since resuming the algorithm is possible (see #28), why not give the user the ability to pause it as well?

Here's an implementation target:

$algo = new Kmeans\Algorithm(new Kmeans\RandomInitialization());

$algo->registerIterationCallback(function ($algo) {
    // pause once the run started more than one hour ago
    if ($algo->getStatus()->startedAt() < new \DateTime('1 hour ago')) {
        return $algo->pause();
    }
});

$result = $algo->clusterize($points, $nbClusters);

if ($result->getStatus()->isPaused()) {
    echo "Clusterization ran for more than 1h and had to be paused.";
}

What do you think?

Add facades and helpers to simplify usage

Proposed change

$data = [
    [80,55],[86,59],[19,85],[41,47],[57,58],
    [76,22],[94,60],[13,93],[90,48],[52,54],
    [62,46],[88,44],[85,24],[63,14],[51,40],
    [75,31],[86,62],[81,95],[47,22],[43,95],
    [71,19],[17,65],[69,21],[59,60],[59,12],
    [15,22],[49,93],[56,35],[18,20],[39,59],
    [50,15],[81,36],[67,62],[32,15],[75,65],
    [10,47],[75,18],[13,45],[30,62],[95,79],
    [64,11],[92,14],[94,49],[39,13],[60,68],
    [62,10],[74,44],[37,42],[97,60],[47,73],
];

// should auto-detect the arity of the euclidean space
$results = kmeans($data, clusters: 3);

For GPS coordinates

$cities = [
    [48.85889, 2.32004], // Paris
    [45.75781, 4.83201], // Lyon
    [43.29617, 5.36995], // Marseille
];

$results = kmeans_gps($cities, clusters: 1);

Allow clusterization of GPS locations

Proposed change

$algo = new Kmeans\Gps\Algorithm(new Kmeans\RandomInitialization());

$locations = new Kmeans\PointCollection(new Kmeans\Gps\Space(), [
    new Kmeans\Gps\Point(48.85341, 2.3488), // Paris
    // ...
]);

$clusters = $algo->clusterize($locations, 3);

Resume algorithm execution

I believe it would be nice to be able to resume algorithm execution after its completion. It could be useful as new points are being added, so previous iterations don't need to be re-run.

Example: I have clustered my 100 000 users into 5 clusters. Since the last clustering, 100 new users have been added. Most of them are probably already very close to the existing clusters' centroids. Hence, I should be able to resume clustering the same dataset PLUS the new users to save time.

Multithread for the win

Multithreading the algo would significantly improve performance. We may use the pthreads extension when available or proc_open (if none are available well 🤷‍♂️ )

And it would be so much fun to code 🤩

Packagist Release

The current packagist release seems to be out of date? It is based on tag 2.1.1, and seems to differ from the master branch. Would it be possible to update the release?

Missing "getPoints" method?

I believe the Cluster->getPoints() method is missing within this class. It is referenced within the readme. Am I missing something here?

Specify a distance function

Proposed change

$algo = new Kmeans\Algorithm($init);
$algo->setDistanceFunction(
    fn (Point $a, Point $b) => geocode_dist($a->getCoordinates(), $b->getCoordinates())
);
$result = $algo->clusterize($points, $K);

Call to undefined method "KMeans\Cluster::getPoints()"

Environment

  • PHP 7.3 (XAMPP)
  • Library (latest from composer)

Problem

Fatal error: Uncaught Error: Call to undefined method KMeans\Cluster::getPoints()

Reproduction

I just followed the code from the README, with one small change: the $points variable uses 8-dimensional arrays. Here is my code:

require "vendor/autoload.php";

// dummy data
$points = [
  [79,75,75,85,76,78,76,80],
  [84,76,79,77,76,77,75,81],
  [77,84,78,85,92,89,77,82],
  [78,86,84,77,78,77,75,75],
  [82,82,81,91,90,82,79,91]
];

// create an 8-dimensional space
$space = new KMeans\Space(8);

// add points to space
foreach ($points as $i => $coordinates) {
    $space->addPoint($coordinates);
}

// cluster these 5 points in 3 clusters
$clusters = $space->solve(3);

// display the cluster centers and attached points
foreach ($clusters as $num => $cluster) {
  $coordinates = $cluster->getCoordinates();
  printf(
    "Cluster %s [%d,%d,%d,%d,%d,%d,%d,%d]: %d points\n",
    $num,
    $coordinates[0],
    $coordinates[1],
    $coordinates[2],
    $coordinates[3],
    $coordinates[4],
    $coordinates[5],
    $coordinates[6],
    $coordinates[7],
    count($cluster->getPoints())
  );
}

The problem goes away if I remove count($cluster->getPoints()).

Regards,
ian

PHP 5.3 Compatibility

Unbelievably, Amazon Linux, CentOS and RHEL still use PHP 5.3. There are a few changes which are needed to support these platforms, which are still very prevalent.

Cluster.php, line 47. Change this:
$points = [];
to this:
$points = array();

Cluster.php, lines 51-54. Change this:
return [
'centroid' => parent::toArray(),
'points' => $points,
];
to this:
return array(
'centroid' => parent::toArray(),
'points' => $points,
);

Point.php, lines 46-49. Change this:
return [
'coordinates' => $this->coordinates,
'data' => isset($this->space[$this]) ? $this->space[$this] : null,
];
to this:
return array(
'coordinates' => $this->coordinates,
'data' => isset($this->space[$this]) ? $this->space[$this] : null,
);

Space.php, line 53. Change this:
$points = [];
to this:
$points = array();

Space.php, line 57. Change this:
return ['points' => $points];
to this:
return array('points' => $points);

Space.php, line 101. Change this:

Thanks!
Ron Cemer
