tutteinstitute / evoc

Embedding Vector Oriented Clustering

License: BSD 2-Clause "Simplified" License


evoc's Introduction

EVōC

EVōC (pronounced "evoke") is Embedding Vector Oriented Clustering, a library for fast and flexible clustering of large datasets of high-dimensional embedding vectors. If you have CLIP vectors, outputs from sentence-transformers, or embeddings from OpenAI or Cohere Embed, and you want to quickly get good clusters out, this is the library for you. EVōC takes all the good parts of the combination of UMAP + HDBSCAN for embedding clustering, improves upon them, and removes all the time-consuming parts. By specializing directly to embedding vectors we can get good quality clustering with fewer hyper-parameters to tune and in a fraction of the time.

EVōC is the library to use if you want:

Fast clustering of embedding vectors on CPU

Multi-granularity clustering, and automatic selection of the number of clusters

Clustering of int8 or binary quantized embedding vectors that works out-of-the-box

This is currently very much an early beta version of the library, and things can and will break. We nonetheless welcome feedback, use cases, and feature suggestions.

Basic Usage

EVōC follows the scikit-learn API, so it should be familiar to most users. You can use EVōC wherever you might previously have used other sklearn clustering algorithms. Here is a simple example:

import evoc
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

clusterer = evoc.EVoC()
cluster_labels = clusterer.fit_predict(data)

Some more unique features include the generation of multiple layers of cluster granularity, the ability to extract a hierarchy of clusters across those layers, and automatic duplicate (or very near duplicate) detection.

import evoc
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

clusterer = evoc.EVoC()
cluster_labels = clusterer.fit_predict(data)
cluster_layers = clusterer.cluster_layers_
hierarchy = clusterer.cluster_tree_
potential_duplicates = clusterer.duplicates_

The cluster layers are a list of cluster label vectors with the first being the finest grained and later layers being coarser grained. This is ideal for layered topic modelling and use with DataMapPlot. See this data map for an example of using these layered clusters in topic modelling (zoom in to access finer grained topics).
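As a minimal sketch of working with these layers, the snippet below reports how many clusters and noise points each layer contains. It uses small hand-written label lists as a stand-in for `clusterer.cluster_layers_`, and it assumes that noise points carry the label -1 (the HDBSCAN convention); `summarize_layers` is a hypothetical helper, not part of EVōC.

```python
from collections import Counter

NOISE_LABEL = -1  # assumption: noise points are labelled -1, as in HDBSCAN


def summarize_layers(cluster_layers):
    """Return (n_clusters, n_noise) for each layer, finest layer first."""
    summary = []
    for labels in cluster_layers:
        counts = Counter(int(label) for label in labels)
        n_noise = counts.pop(NOISE_LABEL, 0)
        summary.append((len(counts), n_noise))
    return summary


# Stand-in for clusterer.cluster_layers_: three layers over eight points.
example_layers = [
    [0, 0, 1, 1, 2, 2, 3, -1],  # finest layer
    [0, 0, 0, 0, 1, 1, 1, -1],  # coarser layer
    [0, 0, 0, 0, 0, 0, 0, 0],   # coarsest layer
]

for depth, (n_clusters, n_noise) in enumerate(summarize_layers(example_layers)):
    print(f"layer {depth}: {n_clusters} clusters, {n_noise} noise points")
```

With a fitted clusterer, passing `clusterer.cluster_layers_` instead of `example_layers` should give the same kind of summary, provided the noise-label assumption holds.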

Installation

EVōC has a small set of dependencies:

numpy

scikit-learn

numba

tqdm

tbb

In the near future you will be able to install EVōC from PyPI using pip:

pip install evoc

For now, install the latest version of EVōC from source by cloning the repository and running:

git clone https://github.com/TutteInstitute/evoc
cd evoc
pip install .

License

EVōC is BSD (2-clause) licensed. See the LICENSE file for details.

Contributing

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.


evoc's Issues

No predict method?

When I wanted to cluster new data with EVoC, I saw that there is no predict method, so if I want to make predictions I'm going to have to retrain the model.

How come there's no predict method? What do you recommend doing in this situation?
Thank you.
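One generic workaround, sketched here under stated assumptions, is to assign each new vector to the nearest existing cluster centroid by cosine similarity. This is a swapped-in nearest-centroid technique, not EVoC's own method or a maintainer recommendation. The helper names, the `min_similarity` threshold, and the -1 noise label are all hypothetical choices for illustration. The sketch is written in plain Python so it runs without dependencies; for real embedding matrices a NumPy version would be far faster.

```python
import math

NOISE_LABEL = -1  # assumption: noise points are labelled -1


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fit_centroids(data, labels):
    """Compute a unit-length mean direction for every non-noise cluster."""
    sums = {}
    for vector, label in zip(data, labels):
        label = int(label)
        if label == NOISE_LABEL:
            continue
        unit = normalize(vector)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
        sums[label] = [s + u for s, u in zip(sums[label], unit)]
    return {label: normalize(total) for label, total in sums.items()}


def assign_to_nearest(new_data, centroids, min_similarity=0.0):
    """Label each new vector with its most similar centroid, or as noise."""
    assigned = []
    for vector in new_data:
        unit = normalize(vector)
        best_label, best_similarity = NOISE_LABEL, min_similarity
        for label, centroid in centroids.items():
            similarity = sum(u * c for u, c in zip(unit, centroid))
            if similarity > best_similarity:
                best_label, best_similarity = label, similarity
        assigned.append(best_label)
    return assigned


# Toy data standing in for embeddings and labels from a fitted clusterer.
train_data = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
train_labels = [0, 0, 1, 1, -1]
new_points = [[0.8, 0.2], [0.2, 0.8], [-1.0, -1.0]]

centroids = fit_centroids(train_data, train_labels)
print(assign_to_nearest(new_points, centroids))  # [0, 1, -1]
```

Centroid assignment ignores cluster shape and density, so points near cluster boundaries may be labelled differently from what a full refit would produce.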

Possible to set a random seed?

Just wondering if it would be possible to allow setting a random seed so that results could be reproducible between runs with the same data and input parameters?

package not on pypi

I can't find the evoc package on pypi (pip install evoc fails). I guess I'll install from source to try it out for now.

Numba warning on initialization

I get the following warning when initializing an EVoC class object.

import evoc

clusterer = evoc.EVoC()

c:\Users\Me\AppData\Local\miniconda3\envs\my-env\lib\site-packages\evoc\float_nndescent.py:287: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
  points = point_indices[i]

Is this anything to be concerned about?
Numba version 0.60.0
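To illustrate what such a cast can do in general, here is a small sketch that reinterprets an unsigned 64-bit value as a signed one. It assumes the cast behaves like the usual two's-complement conversion. It says nothing about whether EVoC ever produces values large enough to be affected; `reinterpret_uint64_as_int64` is a hypothetical helper for illustration only.

```python
import struct


def reinterpret_uint64_as_int64(value):
    """Reinterpret the 64 bits of an unsigned integer as a signed integer."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


# Values below 2**63 survive the cast unchanged.
print(reinterpret_uint64_as_int64(99_999))     # 99999
print(reinterpret_uint64_as_int64(2**63 - 1))  # 9223372036854775807

# Values of 2**63 and above wrap around to negative numbers.
print(reinterpret_uint64_as_int64(2**63))      # -9223372036854775808
print(reinterpret_uint64_as_int64(2**64 - 1))  # -1
```

Under this assumption, only values of 2**63 or larger change when cast, which is far beyond any realistic array index.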

Running README example throws error

Running the example in the README

import evoc
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

clusterer = evoc.EVoC()
cluster_labels = clusterer.fit_predict(data)
cluster_layers = clusterer.cluster_layers_
hierarchy = clusterer.hierarchy_
potential_duplicates = clusterer.duplicates_

throws the following error

AttributeError: EVoC object has no attribute 'hierarchy_'

Feedback on `best_layer` selection

Hi there,

I'm really enjoying looking through this library. I wanted to provide some thoughts on the selection of the 'best layer' returned by fit_predict. Firstly I just wanted to check my understanding. Is the approach that the layer that produces the fewest outliers is assumed to be the one that best fits the data and that is therefore the one returned by fit_predict?

Initially I thought it was a bug that fit_predict wasn't returning the most granular layer contained in cluster_layers (in my case it was returning cluster_layers[1]) until I went looking through the code and found the best_layer calculation.

It wasn't intuitive or clear to me that the most granular layer wasn't the one returned, so I think that needs to be outlined in the docstrings of both EVoC and fit_predict.

Secondly, I think it would be good to have some option for choosing which layer is returned by fit_predict. While it is easy enough to get the most granular layer from cluster_layers[0] explicitly, for some use cases (e.g. using EVoC as a drop-in clusterer in BERTopic), BERTopic is just going to call fit_predict and use whatever EVoC thinks is best. If the user sets base_min_cluster_size to try and control the level of granularity that they expect in the resulting clusters, but then EVoC chooses a different layer as the best layer, then the user won't be getting the level of granularity they expect.

It might be nice to introduce something like layer_selection = ['best', 'bottom', 'top'] so the user can force fit_predict to return the most granular layer if desired. best could be called fewest_outliers or something to be more explicit about how 'best' is being determined. It could also take an integer to just select a given layer (with the top layer returned if the integer is out of range).

Just some ideas.
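The layer_selection idea proposed in this issue could be sketched as a standalone helper applied to cluster_layers after fitting. This is a hypothetical illustration, not part of EVoC. It takes 'best' to mean "fewest outliers", which is this issue's reading of the code rather than a confirmed definition, and it assumes outliers are labelled -1.

```python
NOISE_LABEL = -1  # assumption: outliers are labelled -1


def count_noise(labels):
    """Count the points in one layer that are labelled as noise."""
    return sum(1 for label in labels if int(label) == NOISE_LABEL)


def select_layer(cluster_layers, layer_selection="best"):
    """Pick one layer from a finest-first list of label vectors.

    layer_selection may be 'best' (fewest outliers), 'bottom' (finest),
    'top' (coarsest), or an integer index; an out-of-range integer falls
    back to the top layer, as suggested in the issue.
    """
    if len(cluster_layers) == 0:
        raise ValueError("cluster_layers is empty")
    if isinstance(layer_selection, int) and not isinstance(layer_selection, bool):
        if 0 <= layer_selection < len(cluster_layers):
            return cluster_layers[layer_selection]
        return cluster_layers[-1]
    if layer_selection == "bottom":
        return cluster_layers[0]
    if layer_selection == "top":
        return cluster_layers[-1]
    if layer_selection == "best":
        # min() keeps the first (finest) layer when several layers tie.
        return min(cluster_layers, key=count_noise)
    raise ValueError(f"unknown layer_selection: {layer_selection!r}")


# Stand-in for clusterer.cluster_layers_ with 2, 1, and 2 noise points.
layers = [
    [0, 0, 1, -1, -1, 2],
    [0, 0, 1, 1, -1, 1],
    [0, 0, 0, 0, -1, -1],
]

print(select_layer(layers, "best"))    # [0, 0, 1, 1, -1, 1]
print(select_layer(layers, "bottom"))  # [0, 0, 1, -1, -1, 2]
print(select_layer(layers, 10))        # [0, 0, 0, 0, -1, -1]
```

Keeping the selection outside the estimator means it does not help callers such as BERTopic that only use the return value of fit_predict; covering that case would need the option inside EVoC itself.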

Concurrent access detected

Hello @lmcinnes, thanks for another library!

I am just pointing out what seems to be a problem for arm-based processors (I am on M2).

After executing the blobs example of the readme page, I get:

Numba workqueue threading layer is terminating: Concurrent access has been detected.

 - The workqueue threading layer is not threadsafe and may not be accessed concurrently by multiple threads. Concurrent access typically occurs through a nested parallel region launch or by calling Numba parallel=True functions from multiple Python threads.
 - Try using the TBB threading layer as an alternative, as it is, itself, threadsafe. Docs: https://numba.readthedocs.io/en/stable/user/threading-layer.html

when execution reaches float_nndescent.py#L305.
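Following the hint in the error message, one thing to try is requesting the TBB threading layer through Numba's NUMBA_THREADING_LAYER environment variable. This is a sketch of that configuration only. It assumes TBB is installed and usable on the platform, and it is not a confirmed fix for this issue.

```python
import os

# Ask Numba for the TBB threading layer. Set this before Numba (and
# therefore evoc) is imported, since Numba reads the variable at import.
os.environ["NUMBA_THREADING_LAYER"] = "tbb"

# Import evoc only after the variable is set, for example:
# import evoc
# clusterer = evoc.EVoC()
```

Running `numba -s` in a shell prints Numba's system report, which includes whether the TBB threading layer is available in the current environment.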

Question about integration with DataMapPlot

Great work with this package, I'm just starting to experiment with it. Very Exciting!

Just wondering about plugging the clustered data into DataMapPlot. Will UMAP (or other) still be required to reduce higher dim vectors down to 2D to supply to data_map_coords separately? Or can evoc supply that too? Just thinking if evoc is doing some of what UMAP does anyway, is there some efficiency by not recalculating the dimension reduction separately? Or is it better for the user to have discrete control over the coordinates for the visualization?

Thanks

Why are there so many noise points (outliers)?

Thank you for sharing this great tool with the community. I would say it is UMAP+HDBSCAN on steroids!

Quick question though: when I try to cluster 30k text embeddings, a lot of the texts are grouped as outliers. I have tried changing params like noise_level, base_min_cluster_size or min_number_clusters, but about 10-15% of the population is still labelled as outliers. If I run UMAP+HDBSCAN manually I get a significantly lower number of outliers.