
jshiohaha / k-means-clustering


K-means clustering implementation. The accepted input format is .arff, so that results can be compared directly with Weka.

License: MIT License

Python 100.00%


K Means Clustering Implementation

K-means clustering implementation. The accepted input format is .arff, so that results can be compared directly with Weka. Most of the testing was done on the iris.arff dataset, which had a few advantages. Each instance of data (transaction) has only 4 attributes, so the algorithm only ever worked in 4 dimensions, which made its behavior much easier to follow. Further, the dataset is easily separable on any 3-axis combination of the 4 explanatory variables, so we could plot the results and visually confirm that the clusters were converging where we expected.
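For readers unfamiliar with the .arff layout, the following is a minimal sketch of how the numeric attributes of such a file could be read using only the standard library. It is not the repository's parser: the function name `load_numeric_arff` is hypothetical, and the sketch assumes unquoted attribute names, comma-separated dense rows, and no missing values (`?`). Non-numeric attributes, such as the class label in iris.arff, are dropped.

```python
def load_numeric_arff(lines):
    """Parse ARFF text lines, keeping only numeric attributes (hypothetical helper)."""
    names, numeric, rows = [], [], []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if not in_data:
            if lower.startswith("@attribute"):
                # Header line: "@attribute <name> <type>"
                _, name, kind = line.split(None, 2)
                names.append(name)
                numeric.append(kind.strip().lower() in ("numeric", "real", "integer"))
            elif lower.startswith("@data"):
                in_data = True
            continue  # other header lines (e.g. @relation) are ignored
        values = [v.strip() for v in line.split(",")]
        # Keep only the columns that were declared numeric
        rows.append([float(v) for v, keep in zip(values, numeric) if keep])
    kept_names = [n for n, keep in zip(names, numeric) if keep]
    return kept_names, rows


sample = """@relation tiny
% two numeric attributes and one nominal class
@attribute sepallength real
@attribute sepalwidth numeric
@attribute class {a,b}
@data
5.1,3.5,a
4.9,3.0,b
"""
names, rows = load_numeric_arff(sample.splitlines())
print(names)  # ['sepallength', 'sepalwidth']
print(rows)   # [[5.1, 3.5], [4.9, 3.0]]
```

For a real file, the same function could be called with an open file object, since iterating over a file yields its lines.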

Repository Structure

Documents Folder: contains a PDF of the report.

Data Folder: contains all .arff data sets used in the testing of this implementation.

src Folder: contains all code for this assignment, split into generateKClusters.py and kmeans.py. The kmeans.py file contains all logic related to the actual k-means algorithm, while generateKClusters.py contains the functions needed to parse command line arguments, run the algorithm, and plot the results.

Program

Dependencies

Note that a few Python packages used in this implementation prevent the program from running on the CSE server. However, we have included a requirements.txt file, which lets you install all of the required dependencies at once via pip by running the following command: pip3 install -r requirements.txt

Input

There are 4 main inputs for the program, entered as CLI arguments, plus 2 optional ones:

  • -f: the input data file to be read and parsed
  • -k: the number of clusters that the clustering algorithm should try to find
  • -e: a threshold (epsilon); the program terminates once the change in the sum of the distances from the cluster centers falls below this value
  • -i: the maximum number of iterations to run if the other terminating condition is not met first
  • [Optional] -s: the seed passed to random() so that pseudo-random functions behave deterministically, which helps ensure repeatability. If not specified, the k-means clustering algorithm uses a default value of 10.
  • [Optional] -n: a boolean flag that tells the program whether to normalize the data set before running the k-means algorithm.
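The arguments above could be declared with `argparse` roughly as follows. This is a sketch, not the code in generateKClusters.py: the `dest` names, the `build_parser` function, the integer type of the seed, and the treatment of -n as a value-less switch are all assumptions. Only the flag letters and the default seed of 10 come from this README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (names are hypothetical)."""
    parser = argparse.ArgumentParser(description="k-means clustering on an .arff file")
    parser.add_argument("-f", dest="input_file", required=True,
                        help="input .arff data file")
    parser.add_argument("-k", dest="num_clusters", type=int, required=True,
                        help="number of clusters to find")
    parser.add_argument("-e", dest="epsilon", type=float, required=True,
                        help="convergence threshold")
    parser.add_argument("-i", dest="max_iterations", type=int, required=True,
                        help="maximum number of iterations")
    # Optional: seed defaults to 10, as stated in the README
    parser.add_argument("-s", dest="seed", type=int, default=10,
                        help="seed for the random number generator")
    # Optional: assumed here to be a switch that takes no value
    parser.add_argument("-n", dest="normalize", action="store_true",
                        help="normalize the data before clustering")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(
    ["-f", "iris.arff", "-k", "3", "-e", "0.001", "-i", "100"]
)
print(args.num_clusters, args.epsilon, args.max_iterations, args.seed, args.normalize)
```

Passing an explicit list to `parse_args` keeps the example independent of the actual command line; a real entry point would call `parse_args()` with no arguments.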

Output

The program outputs information about the clustering results from k-means. This information includes initial centroid coordinates, runtime, resulting cluster attributes, and total cluster membership.

Program Commands

The program can be started by running the following command, whose arguments correspond to the input parameters listed above:

python generateKClusters.py -f <input_file> -k <num_clusters> -e <epsilon> -i <max_iterations> -s 1

If you omit the -s argument, which sets the seed value, use the following command instead:

python generateKClusters.py -f <input_file> -k <num_clusters> -e <epsilon> -i <max_iterations>

If your default version of Python is 2.x, you will need to invoke python3 explicitly on the command line. Otherwise, python already runs Python 3.x and the commands above work as written.

Extra Program Functions

A number of additional functions were written to assist in plotting and printing the algorithm's results. However, none of them are called directly by the current main() method in generateKClusters.py.

Implementation Assumptions

• All attributes are assumed to be continuous variables.

• The number of clusters (k) is specified as input.

• The epsilon (change in the sum of the distances from the cluster centers) is specified as input.

• The number of iterations is specified as input.
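Because every attribute is treated as continuous, distances are sensitive to the scale of each attribute, which is the concern the optional -n flag addresses. This README does not say which normalization the program applies; the sketch below assumes min-max scaling of each attribute to the range [0, 1], and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(rows):
    """Scale each column of `rows` to [0, 1] (assumed normalization, not confirmed)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0 to avoid dividing by zero
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
normalized = min_max_normalize(data)
print(normalized)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```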

Terminating Conditions

The program will stop as soon as either of the following conditions holds:

  1. The maximum number of iterations (-i) is reached
  2. The change in the total sum of the squares of the distances (SSD) between consecutive iterations falls below epsilon (-e)
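The two stopping conditions can be illustrated with a compact k-means loop. This is a sketch under stated assumptions rather than the contents of kmeans.py: the function names, the choice of initial centroids by sampling k data points with a seeded generator, and the use of squared Euclidean distance for the SSD are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, epsilon, max_iterations, seed=10):
    """Hypothetical k-means loop showing both terminating conditions."""
    rng = random.Random(seed)  # seeded so repeated runs pick the same start
    centroids = [list(p) for p in rng.sample(points, k)]
    clusters, iteration, previous_ssd = [], 0, None
    # Condition 1: the loop ends once max_iterations passes have been made
    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        ssd = 0.0
        for point in points:
            distances = [squared_distance(point, c) for c in centroids]
            best = distances.index(min(distances))
            clusters[best].append(point)
            ssd += distances[best]
        # Update step: move each centroid to the mean of its members
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Condition 2: stop when the SSD changed by less than epsilon
        if previous_ssd is not None and abs(previous_ssd - ssd) < epsilon:
            break
        previous_ssd = ssd
    return centroids, clusters, iteration


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, clusters, iterations = kmeans(points, k=2, epsilon=1e-9, max_iterations=100)
print(sorted(centroids))  # [[0.0, 0.5], [10.0, 10.5]]
print(iterations)
```

On this toy data the loop stops through the epsilon condition well before the iteration cap. Setting `max_iterations` to a small value such as 1 shows the other condition taking effect instead.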
