antonypapadakis / clustering_data_with_many_methods Goto Github PK

View Code? Open in Web Editor NEW

This project was developed in the scope of the undergraduate course algorithmic problem solving of the NKUA. It includes 12 different implementations of clustering techniques from simple K-means to LSH/Hypercube clustering.

Makefile 1.02% C++ 98.98%

clustering_data_with_many_methods's Introduction

Clustering Data With Many Methods Cplusplus

This project was developed in the scope of the undergraduate course algorithmic problem solving of the NKUA(2018-2019). It includes 12 different implementations of clustering techniques from simple K-means to LSH/Hypercube clustering.

Each of the implemented algorithms includes 3 different steps. Initialization, Assignment and Update.

We provide hereby all the different combinations for the algorithmic implementation of each step:

Initialization

Random selection of k points (simplest)
K-means++

Assignment

Lloyd’s assignment
Assignment by Range search with LSH
Assignment by Range search with Hypercube

Update

K-means
Partitioning Around Medoids (PAM) improved like Lloyd’s

In total we get 12 combinations.

Run details

A makefile is included for compilation purposes and the program can be run by executing

./cluster -i input.csv -d euclidean -c cluster.conf -o out ./cluster -i input.csv -d cosine -c cluster.conf -o out

Where input.csv are the input data files in csv format and out is the filepath to the output that the programm will produce (you just need to state a name and the rest will be taken care of by the program)

cluster.conf is the file path to the configuration file for the program (a sample is provided)

The output file will contain for each cluster the indexes of the data belonging to it

Also a silhouette implementation is included for evaluation purposes.

Implementation was concluded in early 2019

Details of methology

For Regular Data

For every point, compute distance to every centroid.
Return (exact) nearest centroid.

For Big Data

Index k centroids into data-structure, e.g. LSH hashtables or Hypercube.
For every non-centroid point, run ANN to find nearest centroid.
Return (approximate) nearest centroid.

Recommend Projects

antonypapadakis / clustering_data_with_many_methods Goto Github PK

clustering_data_with_many_methods's Introduction

Clustering Data With Many Methods Cplusplus

Initialization

Assignment

Update

Run details

Details of methology

For Regular Data

For Big Data

clustering_data_with_many_methods's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs