bdelespierre / php-kmeans


PHP K-Means

License: MIT License

PHP 100.00%

looking-for-contributors machine-learning-algorithms php

php-kmeans's Introduction

PHP K-Means

K-means clustering algorithm implementation in PHP.

Please also see the FAQ

Installation

You can install the package via composer:

composer require bdelespierre/php-kmeans

Usage

require "vendor/autoload.php";

// prepare 50 points of 2D space to be clustered
$points = [
    [80,55],[86,59],[19,85],[41,47],[57,58],
    [76,22],[94,60],[13,93],[90,48],[52,54],
    [62,46],[88,44],[85,24],[63,14],[51,40],
    [75,31],[86,62],[81,95],[47,22],[43,95],
    [71,19],[17,65],[69,21],[59,60],[59,12],
    [15,22],[49,93],[56,35],[18,20],[39,59],
    [50,15],[81,36],[67,62],[32,15],[75,65],
    [10,47],[75,18],[13,45],[30,62],[95,79],
    [64,11],[92,14],[94,49],[39,13],[60,68],
    [62,10],[74,44],[37,42],[97,60],[47,73],
];

// create a 2-dimensional space
$space = new KMeans\Space(2);

// add points to space
foreach ($points as $i => $coordinates) {
    $space->addPoint($coordinates);
}

// cluster these 50 points in 3 clusters
$clusters = $space->solve(3);

// display the cluster centers and attached points
foreach ($clusters as $num => $cluster) {
    $coordinates = $cluster->getCoordinates();
    printf(
        "Cluster %s [%d,%d]: %d points\n",
        $num,
        $coordinates[0],
        $coordinates[1],
        count($cluster)
    );
}

Note: the example uses points in a 2D space, but it works with any number of dimensions greater than 1.

Testing

composer test

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security related issues, please email [email protected] instead of using the issue tracker.

Credits

License

Lesser General Public License (LGPL). Please see License File for more information.

FAQ

How to get coordinates of a point/cluster:

$x = $point[0];
$y = $point[1];

// or

list($x,$y) = $point->getCoordinates();

List all points of a space/cluster:

foreach ($cluster as $point) {
    printf('[%d,%d]', $point[0], $point[1]);
}

Attach data to a point:

$point = $space->addPoint([$x, $y, $z], "user #123");

Retrieve point data:

$data = $space[$point]; // e.g. "user #123"

Watch the algorithm run

Each iteration step can be monitored using a callback function passed to KMeans\Space::solve:

$clusters = $space->solve(3, function($space, $clusters) {
    static $iterations = 0;

    printf("Iteration: %d\n", ++$iterations);

    foreach ($clusters as $i => $cluster) {
        printf("Cluster %d [%d,%d]: %d points\n", $i, $cluster[0], $cluster[1], count($cluster));
    }
});

php-kmeans's Issues

Can you add a version?

Could not find package bdelespierre/php-kmeans at any version for your minimum-stability (stable). Check the package spelling or your minimum-stability

Can you add a version? Thank you.

Update usage and instructions for v3

v3 implementation is vastly different from v2. More elegant, robust, forward-compatible, and better tested. Also comes with new features like the ability to clusterize GPS Coordinates.

An overhaul of the usage and instructions, most notably in README.md but also in demo.php, is therefore needed.

Attaching an ID to each point

Hi there,
is there any way to attach an ID to each point, to be able to tell which is which later? The ID shouldn't be used as a cluster variable, of course. Thanks!

Why do you need getData, setData($data) in Point Interface in v3?

It seems clear that in PointInterface there are functions that deal with coordinates and space.

I confirmed that there are functions getData and setData to handle data.

Can you explain what data means (or intention)?

Algorithm run visualization

It would be great to be able to visualize graphically each step of the algorithm to watch it run (and potentially find bottlenecks)

Write a test with the Iris dataset

Apparently, it's kind of popular among ML libraries.

See https://en.wikipedia.org/wiki/Iris_flower_data_set

Get the nearest coordinate from a cluster

How can I get the point of a cluster that is nearest to its centroid? Example:

Cluster 0 [50,64]: 5 points
[81,95] <----------------------- this ?
[43,95]
[17,65]
[49,93]
[30,62]

How do I get the point of cluster 0 that is nearest to the centroid?
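
One way to answer this outside the library is to compare squared Euclidean distances directly. The sketch below is plain PHP working on coordinate arrays; `squaredDistance()` and `nearestPoint()` are hypothetical helpers, not part of php-kmeans, and the data is copied from the example above.

```php
<?php
// Hypothetical helper (not part of php-kmeans): squared Euclidean distance
// between two coordinate arrays of equal length.
function squaredDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }
    return $sum;
}

// Returns the point closest to $centroid, or null when $points is empty.
function nearestPoint(array $centroid, array $points): ?array
{
    $best = null;
    $bestDistance = INF;
    foreach ($points as $point) {
        $distance = squaredDistance($centroid, $point);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $point;
        }
    }
    return $best;
}

$centroid = [50, 64];
$points = [[81, 95], [43, 95], [17, 65], [49, 93], [30, 62]];
$nearest = nearestPoint($centroid, $points);
printf("[%d,%d]\n", $nearest[0], $nearest[1]); // prints [30,62]
```

Flipping the comparison to `>` and starting from `-INF` would return the farthest point instead. Squared distances are enough here because only the ordering matters.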

Get total variation (Elbow method)

In order to find the best value for K (the number of clusters), it would be nice to get the variance of the distance of clustered points to their cluster's centroid.

Inspired by https://www.youtube.com/watch?v=4b5d3muPQmA
Also see https://en.wikipedia.org/wiki/Elbow_method_(clustering)

I also believe the current v3 implementation of RandomInitialization is wrong 🤷‍♂️

Proposed change

$result = (new Kmeans\Algorithm($init))->clusterize($points, $K);
echo $result->getTotalVariance();
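
Until such a method exists, the quantity can be computed by hand. The following is a minimal sketch in plain PHP, assuming each cluster is available as an array with a `centroid` entry and a `points` entry; that array shape and the `totalVariation()` name are assumptions made for illustration, not the library's API.

```php
<?php
// Hypothetical helper: total within-cluster variation, i.e. the sum of
// squared distances from every point to the centroid of its own cluster.
// Assumed input shape: [['centroid' => [...], 'points' => [[...], ...]], ...]
function totalVariation(array $clusters): float
{
    $total = 0.0;
    foreach ($clusters as $cluster) {
        foreach ($cluster['points'] as $point) {
            foreach ($point as $i => $value) {
                $total += ($value - $cluster['centroid'][$i]) ** 2;
            }
        }
    }
    return $total;
}

$clusters = [
    ['centroid' => [0, 0],   'points' => [[1, 0], [0, 2]]],
    ['centroid' => [10, 10], 'points' => [[10, 13]]],
];
echo totalVariation($clusters), "\n"; // 1 + 4 + 9
```

For the elbow method, this value would be computed for K = 1, 2, 3, ... and plotted; the K where the curve stops dropping sharply is the usual candidate.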

I am clustering about 50K locations. Each cluster should contain about 20 locations or fewer. Unfortunately, the algorithm takes about 1 hour to finish. My initial guess is that the repeated distance calculations make it slow, and it will be even slower if I add the correct distance formula based on latitude and longitude.
If you think so too, then adding a distance matrix would help optimize it. Here is a similar example in DBSCAN.
https://github.com/bhavikm/DBSCAN-clustering/blob/master/index.php
The matrix calculation can be done when user calls solve.

Add a maximum number of iteration threshold

Proposed change

$kmeans = new Kmeans(new RandomInitialization());

// fit = clusterize (we should rename this method)
$result = $kmeans->fit($points, maxIter: 300);

echo $result->getIterations(); // never more than 300
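
To illustrate where such a threshold would sit, here is a self-contained sketch of Lloyd's algorithm with an iteration cap. It is not the library's implementation: `kmeansCapped()` and its return shape are hypothetical, and the initial centroids are passed in explicitly to keep the example deterministic.

```php
<?php
// Minimal Lloyd's k-means with an iteration cap (hypothetical helper).
// Returns ['centroids' => ..., 'assignments' => ..., 'iterations' => int].
function kmeansCapped(array $points, array $centroids, int $maxIter = 300): array
{
    $assignments = [];
    $iterations = 0;

    while ($iterations < $maxIter) {
        $iterations++;

        // Assignment step: attach every point to its closest centroid.
        $newAssignments = [];
        foreach ($points as $p => $point) {
            $best = 0;
            $bestDistance = INF;
            foreach ($centroids as $c => $centroid) {
                $distance = 0.0;
                foreach ($point as $i => $value) {
                    $distance += ($value - $centroid[$i]) ** 2;
                }
                if ($distance < $bestDistance) {
                    $bestDistance = $distance;
                    $best = $c;
                }
            }
            $newAssignments[$p] = $best;
        }

        // Converged: no point changed cluster since the previous pass.
        if ($newAssignments === $assignments) {
            break;
        }
        $assignments = $newAssignments;

        // Update step: move each centroid to the mean of its points.
        foreach ($centroids as $c => $centroid) {
            $members = array_keys($assignments, $c, true);
            if ($members === []) {
                continue; // an empty cluster keeps its previous centroid
            }
            foreach ($centroid as $i => $unused) {
                $column = array_map(fn ($p) => $points[$p][$i], $members);
                $centroids[$c][$i] = array_sum($column) / count($members);
            }
        }
    }

    return ['centroids' => $centroids, 'assignments' => $assignments, 'iterations' => $iterations];
}

$result = kmeansCapped([[1, 1], [1, 2], [9, 9], [10, 9]], [[0, 0], [10, 10]], 300);
echo $result['iterations'], "\n"; // never more than 300
```

The count includes the final pass that only confirms convergence, so a run that settles after one update reports two iterations.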

Is there a way to adapt the algorithm to use weighted centroid formula?

One big problem with k-means is the number of loops it does. But when the points are "close" to each other, why not group them together as a local cluster represented by a single weighted point? Would it make sense?

We could calculate for each point the shortest distance to another point. Then use the mean of these distances to determine which points are really "close" to each other by a user-specified factor. Then create a new point at the centroid of the selected points, giving it a weight equal to the number of points it represents.

That could help lower the number of point-to-cluster-centroid distance computations, wouldn't it?
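
For reference, the weighted centroid formula itself is short. The sketch below is plain PHP and assumes each entry is a `[coordinates, weight]` pair, where the weight is the number of original points the entry stands for; `weightedCentroid()` is a hypothetical helper, not something the library provides.

```php
<?php
// Hypothetical helper: centroid of weighted points.
// $weighted is a list of [coordinates, weight] pairs.
function weightedCentroid(array $weighted): array
{
    $sums = [];
    $totalWeight = 0.0;
    foreach ($weighted as [$coordinates, $weight]) {
        $totalWeight += $weight;
        foreach ($coordinates as $i => $value) {
            $sums[$i] = ($sums[$i] ?? 0.0) + $value * $weight;
        }
    }
    if ($totalWeight <= 0) {
        throw new InvalidArgumentException('Total weight must be positive.');
    }
    // Each coordinate is the weight-averaged value along that axis.
    return array_map(fn ($sum) => $sum / $totalWeight, $sums);
}

// One isolated point plus a merged point standing for three originals.
$centroid = weightedCentroid([[[0, 0], 1], [[4, 8], 3]]);
printf("[%.1f,%.1f]\n", $centroid[0], $centroid[1]); // prints [3.0,6.0]
```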

Deterministic testing

AlgorithmTest::clusterize fails randomly because it uses randomized data.

PHPUnit 9.5.19 #StandWithUkraine

Runtime:       PHP 8.1.1 with Xdebug 3.1.2
Configuration: /home/benjamin/Workspace/bdelespierre/php-kmeans/phpunit.xml.dist

..................................F

Time: 00:00.167, Memory: 10.00 MB

There was 1 failure:

1) Tests\Unit\Euclidean\AlgorithmTest::testClusterize with data set "3D" (Kmeans\Euclidean\Space Object (...), 2.0, Kmeans\PointCollection Object (...), Kmeans\PointCollection Object (...), Kmeans\PointCollection Object (...))
Failed asserting that 2.096814878707058 is less than 2.0.

/home/benjamin/Workspace/bdelespierre/php-kmeans/tests/Unit/AlgorithmTest.php:87

FAILURES!
Tests: 35, Assertions: 69, Failures: 1.

Generating code coverage report in Clover XML format ... done [00:00.006]

Generating code coverage report in HTML format ... done [00:00.043]

This issue must be solved by effectively removing the randomization part and replacing every dataset by static ones.

Multidimensional arrays and diversity clustering

Hello, thank you for sharing this package. I'm hoping to use it to help group users into diverse groups based on socioeconomic factors like race, gender, age, etc. Our dataset contains 20 factors that need to be taken into consideration. Have you used this to solve such a problem?

I've started some preliminary testing, and seem to be getting results but I can't tell what is happening behind the scenes. Furthermore, I would like to be able to weight each factor. For example, race may be the most important factor in some cases, while gender may be in others.

Here is what the data looks like:

 user_id => [
    race,
    gender,
    age
 ]

The numerical representation for each possible value is what we store:

array:10 [
  1 => array:3 [
    0 => -10
    1 => 6
    2 => 1
  ]
  2 => array:3 [
    0 => 3
    1 => 2
    2 => 1
  ]
  3 => array:3 [
    0 => 2
    1 => 1
    2 => 5
  ]
  4 => array:3 [
    0 => 9
    1 => 3
    2 => 4
  ]
  5 => array:3 [
    0 => -12
    1 => 6
    2 => 0
  ]
  6 => array:3 [
    0 => -6
    1 => 7
    2 => 3
  ]
  7 => array:3 [
    0 => 7
    1 => 7
    2 => 5
  ]
  8 => array:3 [
    0 => 4
    1 => 4
    2 => 0
  ]
  9 => array:3 [
    0 => 5
    1 => 7
    2 => 1
  ]
  10 => array:3 [
    0 => -11
    1 => 3
    2 => 2
  ]
]

I'm curious as well: after the clustering is performed, is there any way to retrieve the original key for the data? This is needed because I need to know which users are in each cluster.

If this is not the appropriate channel for this type of question, or beyond the scope of the repo, please let me know. I certainly appreciate any feedback you may have. Thank you :)
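
One common way to weight factors without changing the clustering code is to rescale the data first: multiplying feature i by sqrt(w_i) makes the ordinary squared Euclidean distance equal to the weighted sum of w_i * (a_i - b_i)^2. The sketch below is plain PHP; `applyFeatureWeights()` and the weights are illustrative assumptions, and the rows are the first three from the dump above.

```php
<?php
// Hypothetical helper: scale every feature by sqrt(weight).
// Array keys (the user IDs) are preserved.
function applyFeatureWeights(array $rows, array $weights): array
{
    $scaled = [];
    foreach ($rows as $userId => $features) {
        foreach ($features as $i => $value) {
            $scaled[$userId][$i] = $value * sqrt($weights[$i]);
        }
    }
    return $scaled;
}

$users = [
    1 => [-10, 6, 1],
    2 => [3, 2, 1],
    3 => [2, 1, 5],
];

// Example weights only: the first factor counts four times as much as the others.
$scaled = applyFeatureWeights($users, [4, 1, 1]);
print_r($scaled[1]);
```

For recovering the original key, the README's FAQ shows that `addPoint()` accepts a second argument holding arbitrary data, which could carry the user ID alongside the scaled coordinates.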

Question: Predictable Results

First: I'm not very into math, so if I don't use the correct words, I'm sorry!

I tried kmeans to cluster coordinates (100 latitude and longitude coordinates) into groups of 4. It seems to work alright, but every time I get a different result. As far as I understand, this is due to some randomness in the initial coordinate that is responsible for the algorithm to cluster things later.

The thing is: as long as the initial coordinate set stays the same (means: no points added, removed or altered) there should be always the same result.

What I try to do is to cluster teams (I have a coordinate for every team) which are distributed over a country. The goal is that teams don't have to travel a long distance to meet in groups to play together, and every team is inside such a group.

Maybe Kmeans is not the right thing to use for this use case? Or there is a solution to get predictable results with this library of yours?
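
The randomness comes from how the starting centroids are chosen, so fixing the random seed is the usual remedy. The sketch below shows the idea in plain PHP with a hypothetical `pickInitialCentroids()` helper; whether php-kmeans itself draws from PHP's global generator (in which case calling `mt_srand()` before clustering might be enough) is an assumption that has not been verified here.

```php
<?php
// Hypothetical helper: pick $k distinct starting centroids from $points,
// reproducibly for a given seed.
function pickInitialCentroids(array $points, int $k, int $seed): array
{
    mt_srand($seed); // fix the state of PHP's global Mersenne Twister
    $indexes = array_keys($points);
    $chosen = [];
    for ($n = 0; $n < $k && $indexes !== []; $n++) {
        $pick = mt_rand(0, count($indexes) - 1);
        $chosen[] = $points[$indexes[$pick]];
        array_splice($indexes, $pick, 1); // never pick the same point twice
    }
    return $chosen;
}

$points = [[80, 55], [86, 59], [19, 85], [41, 47], [57, 58]];
$first = pickInitialCentroids($points, 3, 42);
$second = pickInitialCentroids($points, 3, 42);
var_dump($first === $second); // same seed, same starting centroids
```

With identical starting centroids and identical input, Lloyd's iterations are deterministic, so the final clusters repeat from run to run.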

Should points be immutable?

To make Point immutable, we would just need to make the following method private:

php-kmeans/src/Point.php

Line 27 in 46e2051

public function setCoordinates(array $coordinates): void

Would it make sense for points to be immutable? Is there a way we benefit from this in terms of clarity or performance?

@battlecook what's your opinion on this?

Pause algorithm execution

Since resuming the algorithm is possible (see #28), why not give the user the ability to pause it as well?

Here's an implementation target:

$algo = new Kmeans\Algorithm(new Kmeans\RandomInitialization());

$algo->registerIterationCallback(function ($algo) {
    if ($algo->getStatus()->startedAt() < new \DateTime('1 hour ago')) {
        return $algo->pause();
    }
});

$result = $algo->clusterize($points, $nbClusters);

if ($result->getStatus()->isPaused()) {
    echo "Clusterization ran for more than 1h and had to be paused.";
}

What do you think?

Add facades and helpers to simplify usage

Proposed change

$data = [
    [80,55],[86,59],[19,85],[41,47],[57,58],
    [76,22],[94,60],[13,93],[90,48],[52,54],
    [62,46],[88,44],[85,24],[63,14],[51,40],
    [75,31],[86,62],[81,95],[47,22],[43,95],
    [71,19],[17,65],[69,21],[59,60],[59,12],
    [15,22],[49,93],[56,35],[18,20],[39,59],
    [50,15],[81,36],[67,62],[32,15],[75,65],
    [10,47],[75,18],[13,45],[30,62],[95,79],
    [64,11],[92,14],[94,49],[39,13],[60,68],
    [62,10],[74,44],[37,42],[97,60],[47,73],
];

// should auto-detect the arity of the euclidean space
$results = kmeans($data, clusters: 3);

For GPS coordinates

$cities = [
    [48.85889, 2.32004], // Paris
    [45.75781, 4.83201], // Lyon
    [43.29617, 5.36995], // Marseille
];

$results = kmeans_gps($cities, clusters: 1);

Allow clusterization of GPS locations

Proposed change

$algo = new Kmeans\Gps\Algorithm(new Kmeans\RandomInitialization());

$locations = new Kmeans\PointCollection(new Kmeans\Gps\Space(), [
    new Kmeans\Gps\Point(48.85341, 2.3488), // Paris
    // ...
]);

$clusters = $algo->clusterize($locations, 3);

Providing callback function to handle different distance formula

Thanks for providing the library.
The distance formula in

php-kmeans/src/KMeans/Point.php

Line 50 in 8695727

public function getDistanceWith(self $point, bool $precise = true): float

is the Euclidean formula.
It does not work properly when GPS latitude and longitude are involved.

Would you provide a generic callback function for library users who want to use a different distance formula?

Regards.
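
For context, the distance usually substituted for latitude/longitude pairs is the haversine (great-circle) formula. The sketch below is plain PHP; `haversineKm()` is a hypothetical helper rather than a library function, it takes `[latitude, longitude]` pairs in degrees, and it treats the Earth as a sphere with a mean radius of 6371 km.

```php
<?php
// Hypothetical helper: great-circle distance in kilometres between two
// [latitude, longitude] pairs given in degrees (spherical Earth model).
function haversineKm(array $a, array $b, float $radiusKm = 6371.0): float
{
    [$lat1, $lon1] = array_map('deg2rad', $a);
    [$lat2, $lon2] = array_map('deg2rad', $b);

    $h = sin(($lat2 - $lat1) / 2) ** 2
        + cos($lat1) * cos($lat2) * sin(($lon2 - $lon1) / 2) ** 2;

    // min() guards against rounding pushing sqrt($h) slightly above 1.
    return 2 * $radiusKm * asin(min(1.0, sqrt($h)));
}

$paris = [48.85889, 2.32004];
$lyon = [45.75781, 4.83201];
printf("%.0f km\n", haversineKm($paris, $lyon));
```

Note that swapping the distance alone is not a complete answer for k-means on GPS data: the centroid update also assumes a flat space, so averaging raw latitudes and longitudes is only an approximation over small areas.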

Resume algorithm execution

I believe it would be nice to be able to resume algorithm execution after its completion. It could be useful as new points are being added so previous iterations don't need to be re-run again.

Example: I have clustered my 100 000 users into 5 clusters. Since the last clustering, 100 new users have been added. Most of them are probably already very close to the existing clusters' centroids. Hence, I should be able to resume clustering the same dataset PLUS the new users to save time.

Multithread for the win

Multithreading the algorithm would significantly improve performance. We may use the pthread library when available, or proc_open (if neither is available, well 🤷‍♂️ )

And it would be so much fun to code 🤩

Packagist Release

The current packagist release seems to be out of date? It is based on tag 2.1.1, and seems to differ from the master branch. Would it be possible to update the release?

Missing "getPoints" method?

I believe the Cluster->getPoints() method is missing within this class. It is referenced within the readme. Am I missing something here?

Farthest points

How can I find the farthest points in each cluster?

Specify a distance function

Proposed change

$algo = new Kmeans\Algorithm($init);
$algo->setDistanceFunction(
    fn (Point $a, Point $b) => geocode_dist($a->getCoordinates(), $b->getCoordinates())
);
$result = $algo->clusterize($points, $K);

I keep getting new clusters on each refresh

Is there any way to set a random seed or something similar? Whenever I run it, it gives me a new solution each time.

Call to undefined method "KMeans\Cluster::getPoints()"

Environment

PHP 7.3 (XAMPP)
Library (latest from composer)

Problem

Fatal error: Uncaught Error: Call to undefined method KMeans\Cluster::getPoints()

Reconstructed

I just followed the code from the README, with one small change: the $points variable uses 8-dimensional arrays. Here is my code:

require "vendor/autoload.php";

// dummy data
$points = [
  [79,75,75,85,76,78,76,80],
  [84,76,79,77,76,77,75,81],
  [77,84,78,85,92,89,77,82],
  [78,86,84,77,78,77,75,75],
  [82,82,81,91,90,82,79,91]
];

// create an 8-dimensional space
$space = new KMeans\Space(8);

// add points to space
foreach ($points as $i => $coordinates) {
    $space->addPoint($coordinates);
}

// cluster these 5 points into 3 clusters
$clusters = $space->solve(3);

// display the cluster centers and attached points
foreach ($clusters as $num => $cluster) {
  $coordinates = $cluster->getCoordinates();
  printf(
    "Cluster %s [%d,%d,%d,%d,%d,%d,%d,%d]: %d points\n",
    $num,
    $coordinates[0],
    $coordinates[1],
    $coordinates[2],
    $coordinates[3],
    $coordinates[4],
    $coordinates[5],
    $coordinates[6],
    $coordinates[7],
    count($cluster->getPoints())
  );
}

The problem goes away if I remove count($cluster->getPoints())

Regards,
ian

PHP 5.3 Compatibility

Unbelievably, Amazon Linux, CentOS and RHEL still use PHP 5.3. There are a few changes which are needed to support these platforms, which are still very prevalent.

Cluster.php, line 47. Change this:
$points = [];
to this:
$points = array();

Cluster.php, lines 51-54. Change this:
return [
'centroid' => parent::toArray(),
'points' => $points,
];
to this:
return array(
'centroid' => parent::toArray(),
'points' => $points,
);

Point.php, lines 46-49. Change this:
return [
'coordinates' => $this->coordinates,
'data' => isset($this->space[$this]) ? $this->space[$this] : null,
];
to this:
return array(
'coordinates' => $this->coordinates,
'data' => isset($this->space[$this]) ? $this->space[$this] : null,
);

Space.php, line 53. Change this:
$points = [];
to this:
$points = array();

Space.php, line 57. Change this:
return ['points' => $points];
to this:
return array('points' => $points);

Space.php, line 101. Change this:

Thanks!
Ron Cemer
