antonypapadakis / clustering_data_with_many_methods

Clustering Data With Many Methods Cplusplus

This project was developed for the undergraduate course Algorithmic Problem Solving at NKUA (2018-2019). It includes 12 different implementations of clustering techniques, ranging from simple K-means to LSH/Hypercube clustering.

Each implemented algorithm consists of three steps: Initialization, Assignment and Update.

The options implemented for each step are listed below. Choosing one option per step yields one clustering algorithm:

Initialization

  1. Random selection of k points (simplest)
  2. K-means++
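As a rough illustration of the second option, the sketch below shows one common way to write K-means++ seeding. It is not taken from the repository: the names `Point`, `squaredDistance` and `kmeansPlusPlusInit` are hypothetical, the squared Euclidean distance is assumed as the weighting, and the data set is assumed to contain at least k distinct points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Returns the indexes of k points to use as initial centroids.
// Assumes the data contains at least k distinct points.
std::vector<std::size_t> kmeansPlusPlusInit(const std::vector<Point>& points,
                                            std::size_t k, std::mt19937& rng) {
    assert(k >= 1 && k <= points.size());
    std::vector<std::size_t> chosen;

    // First centroid: uniformly at random.
    std::uniform_int_distribution<std::size_t> uniform(0, points.size() - 1);
    chosen.push_back(uniform(rng));

    // minDist[i] = squared distance from point i to its closest chosen centroid.
    std::vector<double> minDist(points.size(),
                                std::numeric_limits<double>::max());
    while (chosen.size() < k) {
        const Point& last = points[chosen.back()];
        for (std::size_t i = 0; i < points.size(); ++i) {
            minDist[i] = std::min(minDist[i], squaredDistance(points[i], last));
        }
        // Next centroid: sampled with probability proportional to minDist,
        // so points that are already centroids (distance 0) have zero weight.
        std::discrete_distribution<std::size_t> weighted(minDist.begin(),
                                                         minDist.end());
        chosen.push_back(weighted(rng));
    }
    return chosen;
}
```

Calling `kmeansPlusPlusInit(points, k, rng)` with a seeded `std::mt19937` gives reproducible seeds; the simpler first option corresponds to drawing k distinct indexes uniformly at random.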

Assignment

  1. Lloyd’s assignment
  2. Assignment by Range search with LSH
  3. Assignment by Range search with Hypercube
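One common way to organise assignment by range search is to run range queries around each centroid with a radius that doubles every round, lock the points claimed in a round, and fall back to an exact comparison for points that no query reached. The sketch below illustrates that scheme under explicit assumptions: the function name, the `initialRadius` and `maxRounds` parameters, and the doubling rule are hypothetical, and the range query is a brute-force scan standing in for an LSH or Hypercube lookup. It is not the repository's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type

double euclidean(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return std::sqrt(sum);
}

// Returns, for every point, the index of the centroid it is assigned to.
std::vector<int> assignByRangeSearch(const std::vector<Point>& points,
                                     const std::vector<Point>& centroids,
                                     double initialRadius, int maxRounds) {
    assert(!centroids.empty() && initialRadius > 0.0);
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<int> assignment(points.size(), -1);
    std::vector<bool> locked(points.size(), false);

    double radius = initialRadius;
    for (int round = 0; round < maxRounds; ++round) {
        std::vector<double> best(points.size(), inf);
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            // Brute-force range query; an LSH/Hypercube index would only
            // inspect the candidates found in the centroid's buckets.
            for (std::size_t i = 0; i < points.size(); ++i) {
                if (locked[i]) continue;
                const double d = euclidean(points[i], centroids[c]);
                // Points inside several balls go to the closest centroid.
                if (d <= radius && d < best[i]) {
                    best[i] = d;
                    assignment[i] = static_cast<int>(c);
                }
            }
        }
        // Points claimed in this round keep their cluster from now on.
        for (std::size_t i = 0; i < points.size(); ++i) {
            if (assignment[i] != -1) locked[i] = true;
        }
        radius *= 2.0;
    }

    // Points never reached by a range query: exact nearest centroid.
    for (std::size_t i = 0; i < points.size(); ++i) {
        if (assignment[i] != -1) continue;
        double bestDist = inf;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            const double d = euclidean(points[i], centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                assignment[i] = static_cast<int>(c);
            }
        }
    }
    return assignment;
}
```

With the brute-force query used here the result matches an exact assignment; the approximation only appears once the inner loop is replaced by a hash-based candidate lookup.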

Update

  1. K-means
  2. Partitioning Around Medoids (PAM) improved like Lloyd’s
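The two update rules can be sketched as follows. This is an illustration rather than the repository's code: `meanUpdate` and `medoidUpdate` are hypothetical names, Euclidean distance is assumed, and "PAM improved like Lloyd's" is read here as choosing, inside each cluster, the member with the smallest total distance to the other members.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type

double euclidean(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return std::sqrt(sum);
}

// K-means update: the new centroid is the coordinate-wise mean of the members.
Point meanUpdate(const std::vector<Point>& points,
                 const std::vector<std::size_t>& members) {
    assert(!members.empty());
    Point mean(points[members.front()].size(), 0.0);
    for (std::size_t idx : members) {
        for (std::size_t d = 0; d < mean.size(); ++d) mean[d] += points[idx][d];
    }
    for (double& v : mean) v /= static_cast<double>(members.size());
    return mean;
}

// Medoid update: the new centre is the member (returned as an index into
// `points`) that minimises the sum of distances to all other members.
std::size_t medoidUpdate(const std::vector<Point>& points,
                         const std::vector<std::size_t>& members) {
    assert(!members.empty());
    std::size_t best = members.front();
    double bestCost = std::numeric_limits<double>::infinity();
    for (std::size_t candidate : members) {
        double cost = 0.0;
        for (std::size_t other : members) {
            cost += euclidean(points[candidate], points[other]);
        }
        if (cost < bestCost) {
            bestCost = cost;
            best = candidate;
        }
    }
    return best;
}
```

The mean may be a point that does not exist in the data set, while the medoid is always an actual data point, which matters for metrics where an average is not meaningful.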

In total, 2 initialization × 3 assignment × 2 update options give 12 combinations.

Run details

A Makefile is included for compilation. After building, the program can be run with either of the following commands:

  ./cluster -i input.csv -d euclidean -c cluster.conf -o out
  ./cluster -i input.csv -d cosine -c cluster.conf -o out

Here input.csv is the input data file in CSV format, and out is the path of the output file that the program will produce (you only need to supply a name; the program takes care of the rest).

cluster.conf is the path to the program's configuration file (a sample is provided).
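The -d option in the commands above takes either euclidean or cosine. As a minimal sketch of what the two metrics could look like, assuming cosine distance is defined as 1 minus the cosine similarity (the repository may define it differently), with the hypothetical names `euclideanDistance`, `cosineDistance` and `metricFromName`:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type
using Metric = std::function<double(const Point&, const Point&)>;

double euclideanDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return std::sqrt(sum);
}

// Assumed definition: 1 - (a . b) / (|a| * |b|); both vectors must be non-zero.
double cosineDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    assert(normA > 0.0 && normB > 0.0);
    return 1.0 - dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Maps the value given to -d onto a distance function.
Metric metricFromName(const std::string& name) {
    if (name == "euclidean") return euclideanDistance;
    if (name == "cosine") return cosineDistance;
    throw std::invalid_argument("unknown metric: " + name);
}
```

Passing the metric around as a `std::function` keeps the initialization, assignment and update steps independent of the distance that was selected on the command line.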

For each cluster, the output file lists the indexes of the data points that belong to it.

A silhouette implementation is also included for evaluating the resulting clustering.
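For reference, the silhouette of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the smallest mean distance to the members of any other cluster. The sketch below computes the average over all points. It is a straightforward O(n²) illustration, not the repository's implementation; `meanSilhouette` is a hypothetical name, Euclidean distance is assumed, and points in singleton clusters are given s(i) = 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type

double euclidean(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return std::sqrt(sum);
}

// labels[i] is the cluster index (0 .. k-1) of points[i].
double meanSilhouette(const std::vector<Point>& points,
                      const std::vector<int>& labels, int k) {
    assert(points.size() == labels.size() && k >= 2);
    const double inf = std::numeric_limits<double>::infinity();
    const std::size_t n = points.size();
    if (n == 0) return 0.0;

    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Sum of distances from point i to the members of every cluster.
        std::vector<double> sum(k, 0.0);
        std::vector<std::size_t> count(k, 0);
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;
            sum[labels[j]] += euclidean(points[i], points[j]);
            ++count[labels[j]];
        }
        const int own = labels[i];
        if (count[own] == 0) continue;  // singleton cluster: s(i) = 0
        const double a = sum[own] / static_cast<double>(count[own]);

        double b = inf;  // mean distance to the nearest other cluster
        for (int c = 0; c < k; ++c) {
            if (c == own || count[c] == 0) continue;
            b = std::min(b, sum[c] / static_cast<double>(count[c]));
        }
        if (b == inf) continue;  // no other non-empty cluster

        const double denom = std::max(a, b);
        if (denom > 0.0) total += (b - a) / denom;
    }
    return total / static_cast<double>(n);
}
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points sit in the wrong cluster.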

The implementation was completed in early 2019.

Details of methodology

For Regular Data

  1. For every point, compute distance to every centroid.
  2. Return (exact) nearest centroid.
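The two steps above amount to an exhaustive comparison, which can be sketched as follows. The function name `lloydAssign` is hypothetical and Euclidean distance is assumed; squared distances are compared because they give the same ordering without the square root.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type

double squaredDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return sum;
}

// Returns, for every point, the index of its exact nearest centroid.
// Cost: one distance computation per (point, centroid) pair.
std::vector<std::size_t> lloydAssign(const std::vector<Point>& points,
                                     const std::vector<Point>& centroids) {
    assert(!centroids.empty());
    std::vector<std::size_t> assignment(points.size(), 0);
    for (std::size_t i = 0; i < points.size(); ++i) {
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            const double d = squaredDistance(points[i], centroids[c]);
            if (d < best) {  // ties keep the lower centroid index
                best = d;
                assignment[i] = c;
            }
        }
    }
    return assignment;
}
```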

For Big Data

  1. Index the k centroids in a data structure, e.g. LSH hash tables or a Hypercube.
  2. For every non-centroid point, run an approximate nearest neighbor (ANN) query to find its nearest centroid.
  3. Return (approximate) nearest centroid.
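A minimal sketch of the three steps above, using a single Euclidean LSH table, could look like this. Everything here is an assumption made for illustration: the class name `CentroidLshIndex`, the parameters `numHashes` and `w`, the hash family h(p) = floor((a·p + b) / w) with Gaussian a and uniform b, and the brute-force fallback when a query lands in an empty bucket. The repository's own structures (several hash tables, or a Hypercube) are likely to differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <random>
#include <vector>

using Point = std::vector<double>;  // hypothetical point type
using Key = std::vector<long>;      // concatenation of the hash values

class CentroidLshIndex {
public:
    CentroidLshIndex(const std::vector<Point>& centroids, std::size_t numHashes,
                     double w, std::mt19937& rng)
        : centroids_(centroids), w_(w) {
        assert(!centroids.empty() && numHashes > 0 && w > 0.0);
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::uniform_real_distribution<double> shift(0.0, w);
        const std::size_t dim = centroids.front().size();
        for (std::size_t h = 0; h < numHashes; ++h) {
            Point a(dim);
            for (double& x : a) x = gauss(rng);
            a_.push_back(a);
            b_.push_back(shift(rng));
        }
        // Step 1: index every centroid under its hash key.
        for (std::size_t c = 0; c < centroids_.size(); ++c) {
            buckets_[keyOf(centroids_[c])].push_back(c);
            all_.push_back(c);
        }
    }

    // Steps 2-3: approximate nearest centroid of p.
    std::size_t nearest(const Point& p) const {
        const auto it = buckets_.find(keyOf(p));
        // Empty bucket: fall back to an exact scan (an assumption of this sketch).
        return closestAmong(p, it != buckets_.end() ? it->second : all_);
    }

private:
    Key keyOf(const Point& p) const {
        Key key;
        for (std::size_t h = 0; h < a_.size(); ++h) {
            assert(p.size() == a_[h].size());
            double dot = 0.0;
            for (std::size_t i = 0; i < p.size(); ++i) dot += a_[h][i] * p[i];
            key.push_back(static_cast<long>(std::floor((dot + b_[h]) / w_)));
        }
        return key;
    }

    std::size_t closestAmong(const Point& p,
                             const std::vector<std::size_t>& candidates) const {
        std::size_t best = candidates.front();
        double bestDist = std::numeric_limits<double>::infinity();
        for (std::size_t c : candidates) {
            double d = 0.0;
            for (std::size_t i = 0; i < p.size(); ++i) {
                d += (p[i] - centroids_[c][i]) * (p[i] - centroids_[c][i]);
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    std::vector<Point> centroids_;
    double w_;
    std::vector<Point> a_;
    std::vector<double> b_;
    std::map<Key, std::vector<std::size_t>> buckets_;
    std::vector<std::size_t> all_;
};
```

The answer is approximate because only centroids sharing the query's bucket are compared; a larger `w` or fewer hash functions make buckets coarser, trading speed for accuracy.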

Contributors

antonypapadakis
