greatlse / bayon

This project forked from fujimizu/bayon

a simple and fast clustering tool

License: GNU Lesser General Public License v2.1

Makefile 5.02% C++ 94.11% C 0.87%

bayon's Introduction

Tutorial in Japanese

Tutorial in English

Overview

Bayon is a simple and fast hard-clustering tool.

Bayon supports Repeated Bisection clustering and K-means clustering.

Install

% ./configure
% make
% sudo make install

Usage

Clustering input data

% bayon -n num [options] file
% bayon -l limit [options] file
   -n, --number=num      the number of clusters
   -l, --limit=lim       limit value of cluster bisection
   -p, --point           output similarity points
   -c, --clvector=file   save the vectors of cluster centroids
   --clvector-size=num   max size of output vectors of
                         cluster centroids (default: 50)
   --method=method       clustering method (rb, kmeans) (default: rb)
   --seed=seed           set a seed for random number generator

Get similar clusters for each input document

% bayon -C file [options] file
   -C, --classify=file   target vectors
   --inv-keys=num        max size of the keys of each vector to be
                         looked up in inverted index (default: 20)
   --inv-size=num        max size of the inverted index of each key
                         (default: 100)
   --classify-size=num   max size of output similar groups
                         (default: 20)

Common options

   --vector-size=num     max size of each input vector
   --idf                 apply idf to input vectors
   -h, --help            show help messages
   -v, --version         show the version and exit
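
The `--idf` option is described only as applying idf to the input vectors. As a rough illustration of what that kind of weighting does, the sketch below multiplies each value by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the key. This is one common IDF definition and an assumption here; the exact formula bayon uses is not stated in this document. The function name `apply_idf` is hypothetical.

```python
import math

def apply_idf(docs):
    """Scale each value by log(N / df). Assumed formula, not confirmed as bayon's."""
    n_docs = len(docs)
    # Count the number of documents in which each key appears.
    df = {}
    for vec in docs.values():
        for key in vec:
            df[key] = df.get(key, 0) + 1
    return {
        doc_id: {key: value * math.log(n_docs / df[key]) for key, value in vec.items()}
        for doc_id, vec in docs.items()
    }

docs = {"doc1": {"apple": 2.0, "banana": 1.0}, "doc2": {"banana": 3.0}}
weighted = apply_idf(docs)
```

Under this definition a key that occurs in every document gets weight zero, so it no longer contributes to similarity between documents.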

Example

  • clustering (number_of_output_clusters = 100)
% bayon -n 100 input.tsv > cluster.tsv
  • clustering (save vectors of cluster centroids)
% bayon -n 100 -c centroid.tsv input.tsv > cluster.tsv
  • classification (get similar clusters for input documents)
% bayon -C centroid.tsv input.tsv > classify.tsv
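
When bayon is driven from a script, the clustering commands above can be assembled programmatically. The following is a minimal sketch that only builds the argument list from the options documented in the Usage section; `build_cluster_cmd` is a hypothetical helper, and the check that exactly one of `-n` or `-l` is given is an assumption based on the two usage lines.

```python
def build_cluster_cmd(input_path, number=None, limit=None,
                      clvector=None, point=False, method=None, seed=None):
    """Build an argv list for a bayon clustering run (hypothetical helper)."""
    # Assumption: a run takes either -n or -l, as in the two usage lines.
    if (number is None) == (limit is None):
        raise ValueError("give exactly one of number or limit")
    cmd = ["bayon"]
    if number is not None:
        cmd += ["-n", str(number)]
    else:
        cmd += ["-l", str(limit)]
    if point:
        cmd.append("-p")
    if clvector is not None:
        cmd += ["-c", clvector]
    if method is not None:
        cmd.append("--method=" + method)
    if seed is not None:
        cmd.append("--seed=" + str(seed))
    cmd.append(input_path)
    return cmd

cmd = build_cluster_cmd("input.tsv", number=100, clvector="centroid.tsv")
```

The resulting list can be passed to `subprocess.run(cmd, stdout=...)` on a machine where bayon is installed, with standard output redirected to the cluster file.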

Format of Input Data

List of the vectors of input documents for clustering and classification

document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
  • document_id : string
  • key : string
  • value : double
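
As an illustration of the input format, the sketch below writes a small set of document vectors as tab-separated lines. It assumes that fields are separated by a single tab character (the spaces around `\t` in the listing above being only visual) and that IDs and keys contain no tabs or newlines. The function name `write_vectors` and the sample data are made up for this example.

```python
import io

def write_vectors(docs, out):
    """Write {doc_id: {key: value}} as one tab-separated line per document."""
    for doc_id, vec in docs.items():
        fields = [str(doc_id)]
        for key, value in vec.items():
            fields.append(str(key))
            fields.append(repr(float(value)))
        out.write("\t".join(fields) + "\n")

# Hypothetical sample documents.
docs = {
    "doc1": {"apple": 2.0, "banana": 1.0},
    "doc2": {"banana": 3.0, "cherry": 0.5},
}
buf = io.StringIO()
write_vectors(docs, buf)
input_tsv = buf.getvalue()
```

Writing `input_tsv` to a file gives something that can be passed as the `file` argument of bayon.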

List of the vectors of cluster centroids

cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
  • cluster_id : string
  • key : string
  • value : double

Format of Output Data

List of clusters (output of clustering)

cluster_id1 \t document_id1 \t document_id2 \t document_id3 \t ...\n
cluster_id2 \t document_id4 \t document_id5 \t document_id6 \t ...\n
...
  • cluster_id : integer (>= 1)
  • document_id : string
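
A minimal sketch of reading this output back into a script, assuming tab-separated fields as described above. `parse_clusters` is a hypothetical helper and the sample text is invented for illustration, not real bayon output.

```python
def parse_clusters(text):
    """Parse clustering output into {cluster_id: [document_id, ...]}."""
    clusters = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        clusters[int(fields[0])] = fields[1:]
    return clusters

# Invented sample in the documented format.
sample = "1\tdoc1\tdoc4\n2\tdoc2\tdoc3\tdoc5\n"
clusters = parse_clusters(sample)
```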

List of clusters with similarity values between documents and clusters (when clustering with the --point option)

cluster_id1 \t document_id1 \t point1 \t document_id2 \t point2 \t ...\n
cluster_id2 \t document_id3 \t point3 \t document_id4 \t point4 \t ...\n
...
  • cluster_id : integer (>= 1)
  • document_id : string
  • point : double
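
With `--point`, each cluster ID is followed by alternating document IDs and similarity values. The sketch below pairs them up, again assuming tab-separated fields; `parse_clusters_with_points` is a hypothetical helper and the sample values are invented.

```python
def parse_clusters_with_points(text):
    """Parse --point output into {cluster_id: [(document_id, point), ...]}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        rest = fields[1:]
        if len(rest) % 2 != 0:
            raise ValueError("expected document_id/point pairs: " + line)
        result[int(fields[0])] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return result

# Invented sample in the documented format.
sample = "1\tdoc1\t0.95\tdoc4\t0.80\n2\tdoc2\t0.70\n"
with_points = parse_clusters_with_points(sample)
```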

List of the vectors of cluster centroids (when clustering with the --clvector option)

cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
  • cluster_id : integer (>= 1)
  • key : string
  • value : double

List of similar clusters for each input document

document_id1 \t cluster_id1 \t point1 \t cluster_id2 \t point2 \t ...\n
document_id2 \t cluster_id3 \t point3 \t cluster_id4 \t point4 \t ...\n
...
  • document_id : string
  • cluster_id : string
  • point : double
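
The classification output can be reduced to a single best cluster per document. The sketch below takes the pair with the highest point value rather than relying on the order of pairs within a line, since that order is not specified here. Both helper names and the sample values are hypothetical.

```python
def parse_classification(text):
    """Parse classification output into {document_id: [(cluster_id, point), ...]}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        rest = fields[1:]
        if len(rest) % 2 != 0:
            raise ValueError("expected cluster_id/point pairs: " + line)
        result[fields[0]] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return result

def best_cluster(pairs):
    """Return the cluster_id with the highest point, or None if there are no pairs."""
    if not pairs:
        return None
    return max(pairs, key=lambda pair: pair[1])[0]

# Invented sample in the documented format.
sample = "doc1\t3\t0.91\t7\t0.42\ndoc2\t7\t0.66\n"
parsed = parse_classification(sample)
best = {doc_id: best_cluster(pairs) for doc_id, pairs in parsed.items()}
```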

Requirements

  • C++ compiler with STL (Standard Template Library)

Recommended

  • google-sparsehash
    • If google-sparsehash is not installed, bayon falls back to "__gnu_cxx::hash_map" or "std::map"

License

GPL2 (GNU General Public License Version 2)

Author

Mizuki Fujisawa <[email protected]>
