
evoc's Introduction


EVōC

EVōC (pronounced "evoke") is Embedding Vector Oriented Clustering, a library for fast and flexible clustering of large datasets of high-dimensional embedding vectors. If you have CLIP vectors, or outputs from sentence-transformers, OpenAI, or Cohere embed, and you want to quickly get good clusters out, this is the library for you. EVōC takes all the good parts of the combination of UMAP + HDBSCAN for embedding clustering, improves upon them, and removes all the time-consuming parts. By specializing directly to embedding vectors we can get good quality clustering with fewer hyper-parameters to tune and in a fraction of the time.

EVōC is the library to use if you want:

  • Fast clustering of embedding vectors on CPU
  • Multi-granularity clustering, and automatic selection of the number of clusters
  • Clustering of int8 or binary quantized embedding vectors that works out-of-the-box
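For readers unfamiliar with quantized embeddings, here is a minimal NumPy sketch of one common way to produce int8 and binary versions of a float embedding matrix. The per-vector scaling and the sign-based bit packing are illustrative assumptions, not something this README specifies. The exact dtype and layout EVōC expects for quantized input is also not stated here, so check it before relying on this.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real float embeddings (e.g. from a sentence-transformers model).
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)

# int8 quantization (assumed scheme): scale each vector so that its largest
# magnitude maps to 127, then round to the nearest integer.
scale = 127.0 / np.abs(embeddings).max(axis=1, keepdims=True)
int8_embeddings = np.round(embeddings * scale).astype(np.int8)

# Binary quantization (assumed scheme): keep only the sign of each dimension,
# then pack 8 dimensions into each uint8 byte.
binary_embeddings = np.packbits(embeddings > 0, axis=1)

print(int8_embeddings.shape, int8_embeddings.dtype)
print(binary_embeddings.shape, binary_embeddings.dtype)
```

Embedding providers that offer quantized outputs usually apply their own calibration, so prefer their quantized vectors over re-quantizing by hand when they are available.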

This is currently very much an early beta version of the library. Things can and will break right now. We nonetheless welcome feedback, use cases, and feature suggestions.

Basic Usage

EVōC follows the scikit-learn API, so it should be familiar to most users. You can use EVōC wherever you might have previously been using other sklearn clustering algorithms. Here is a simple example:

import evoc
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

clusterer = evoc.EVoC()
cluster_labels = clusterer.fit_predict(data)

More distinctive features include the generation of multiple layers of cluster granularity, the ability to extract a hierarchy of clusters across those layers, and automatic detection of duplicates (or very near duplicates).

import evoc
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

clusterer = evoc.EVoC()
cluster_labels = clusterer.fit_predict(data)
cluster_layers = clusterer.cluster_layers_
hierarchy = clusterer.cluster_tree_
potential_duplicates = clusterer.duplicates_

The cluster layers are a list of cluster label vectors with the first being the finest grained and later layers being coarser grained. This is ideal for layered topic modelling and use with DataMapPlot. See this data map for an example of using these layered clusters in topic modelling (zoom in to access finer grained topics).
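To get a feel for what each layer contains, it can help to count clusters and unclustered points per layer. The sketch below uses a hypothetical helper, `summarize_layers`, and runs on a toy stand-in for `clusterer.cluster_layers_`. It assumes noise points are labelled `-1`, as in HDBSCAN and scikit-learn; that convention is an assumption here.

```python
import numpy as np

def summarize_layers(cluster_layers, noise_label=-1):
    """Return (n_clusters, noise_fraction) for each layer, finest first.

    Hypothetical helper; assumes noise points carry the label `noise_label`.
    """
    summary = []
    for labels in cluster_layers:
        labels = np.asarray(labels)
        is_noise = labels == noise_label
        n_clusters = np.unique(labels[~is_noise]).size
        summary.append((n_clusters, float(is_noise.mean())))
    return summary

# Toy stand-in for clusterer.cluster_layers_: two layers over eight points.
toy_layers = [
    np.array([0, 0, 1, 1, 2, 2, 3, -1]),  # finest layer
    np.array([0, 0, 0, 0, 1, 1, 1, -1]),  # coarser layer
]
layer_summary = summarize_layers(toy_layers)
for depth, (n_clusters, noise_fraction) in enumerate(layer_summary):
    print(f"layer {depth}: {n_clusters} clusters, {noise_fraction:.1%} noise")
```

With a fitted clusterer, pass `clusterer.cluster_layers_` in place of `toy_layers`.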

Installation

EVōC has a small set of dependencies:

  • numpy
  • scikit-learn
  • numba
  • tqdm
  • tbb

At some point in the near future you will be able to install EVōC from PyPI using pip:

pip install evoc

For now, install the latest version of EVōC from source by cloning the repository and running:

git clone https://github.com/TutteInstitute/evoc
cd evoc
pip install .

License

EVōC is BSD (2-clause) licensed. See the LICENSE file for details.

Contributing

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.

evoc's People

Contributors

lmcinnes, gclendenning


evoc's Issues

No predict method?

When I wanted to cluster new data with EVoC, I saw that there was no predict method, so if I want to make predictions I will have to retrain the model.

How come there's no predict method? What do you recommend to do in this situation?
Thank you.
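One generic workaround, which is not part of EVoC and not a maintainer recommendation, is to assign each new point to the nearest centroid of the clusters already found. The sketch below uses hypothetical helpers (`fit_centroids`, `assign_to_nearest_centroid`), cosine similarity, and the assumption that noise is labelled `-1`. It never assigns new points to noise and assumes no zero-length vectors.

```python
import numpy as np

def fit_centroids(data, labels, noise_label=-1):
    """Mean vector of each non-noise cluster (hypothetical helper)."""
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels[labels != noise_label])
    centroids = np.stack([data[labels == c].mean(axis=0) for c in cluster_ids])
    return cluster_ids, centroids

def assign_to_nearest_centroid(new_data, cluster_ids, centroids):
    """Label each new point with the cluster whose centroid is most cosine-similar."""
    new_unit = new_data / np.linalg.norm(new_data, axis=1, keepdims=True)
    centroid_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return cluster_ids[np.argmax(new_unit @ centroid_unit.T, axis=1)]

# Toy stand-ins for the training data and the labels from fit_predict.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])
train_labels = np.array([0, 0, 1, 1, -1])
new_points = np.array([[2.0, 0.1], [0.2, 3.0]])

cluster_ids, centroids = fit_centroids(train, train_labels)
predicted = assign_to_nearest_centroid(new_points, cluster_ids, centroids)
print(predicted)
```

This ignores cluster shape and density, so treat the assignments as approximate and refit when the new data may contain clusters that were not present originally.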

Possible to set a random seed?

Just wondering if it would be possible to allow setting a random seed so that results could be reproducible between runs with the same data and input parameters?

package not on pypi

I can't find the evoc package on PyPI (pip install evoc fails). I guess I'll install from source to try it out for now.

Numba warning on initialization

I get the following warning when initializing an EVoC class object.

import evoc

clusterer = evoc.EVoC()
c:\Users\Me\AppData\Local\miniconda3\envs\my-env\lib\site-packages\evoc\float_nndescent.py:287: NumbaTypeSafetyWarning: unsafe cast from uint64 to int64. Precision may be lost.
  points = point_indices[i]

Is this anything to be concerned about?
Numba version 0.60.0

Running README example throws error

Running the example in the README

import evoc
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

clusterer = evoc.EVoC()
cluster_labels = clusterer.fit_predict(data)
cluster_layers = clusterer.cluster_layers_
hierarchy = clusterer.hierarchy_
potential_duplicates = clusterer.duplicates_

throws the following error

AttributeError: EVoC object has no attribute 'hierarchy_'

Feedback on `best_layer` selection

Hi there,

I'm really enjoying looking through this library. I wanted to provide some thoughts on the selection of the 'best layer' returned by fit_predict. Firstly I just wanted to check my understanding. Is the approach that the layer that produces the fewest outliers is assumed to be the one that best fits the data and that is therefore the one returned by fit_predict?

Initially I thought it was a bug that fit_predict wasn't returning the most granular layer contained in cluster_layers (in my case it was returning cluster_layers[1]) until I went looking through the code and found the best_layer calculation.

It wasn't intuitive or clear to me that the most granular layer wasn't the one returned, so I think that needs to be outlined in the docstrings of both EVoC and fit_predict.

Secondly, I think it would be good to have some option for choosing which layer is returned by fit_predict. While it is easy enough to get the most granular layer from cluster_layers[0] explicitly, for some use cases (e.g. using EVoC as a drop-in clusterer in BERTopic), BERTopic is just going to call fit_predict and use whatever EVoC thinks is best. If the user sets base_min_cluster_size to try to control the level of granularity that they expect in the resulting clusters, but then EVoC chooses a different layer as the best layer, then the user won't be getting the level of granularity they expect.

It might be nice to introduce something like layer_selection = ['best', 'bottom', 'top'] so the user can force fit_predict to return the most granular layer if desired. best could be called fewest_outliers or something to be more explicit about how 'best' is being determined. It could also take an integer to just select a given layer (with the top layer returned if the integer is out of range).
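The proposal can be sketched as a standalone helper applied to the layers after fitting. `select_layer` and its `layer_selection` argument are hypothetical and not part of EVoC. The 'best' branch follows this issue's reading (fewest outliers), which is unconfirmed, and noise is assumed to be labelled `-1`.

```python
import numpy as np

def select_layer(cluster_layers, layer_selection="best", noise_label=-1):
    """Pick one label vector from a finest-first list of layers (hypothetical)."""
    top = len(cluster_layers) - 1
    if isinstance(layer_selection, int):
        if layer_selection < 0:
            raise ValueError("layer index must be non-negative")
        index = min(layer_selection, top)  # out of range falls back to the top layer
    elif layer_selection == "bottom":
        index = 0  # most granular layer
    elif layer_selection == "top":
        index = top  # coarsest layer
    elif layer_selection == "best":
        # Assumed meaning of 'best': the layer with the fewest noise points.
        noise_counts = [int(np.sum(np.asarray(layer) == noise_label))
                        for layer in cluster_layers]
        index = int(np.argmin(noise_counts))
    else:
        raise ValueError(f"unknown layer_selection: {layer_selection!r}")
    return cluster_layers[index]

# Toy stand-in for clusterer.cluster_layers_.
example_layers = [
    np.array([0, 1, -1, -1]),
    np.array([0, 0, 0, -1]),
    np.array([0, 0, -1, -1]),
]
print(select_layer(example_layers, "best"))
print(select_layer(example_layers, "bottom"))
```

A wrapper class that calls fit_predict and then returns `select_layer(clusterer.cluster_layers_, ...)` would let tools that only call fit_predict receive the chosen layer.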

Just some ideas.

Concurrent access detected

Hello @lmcinnes, thanks for another library!

I am just pointing out what seems to be a problem for ARM-based processors (I am on an M2).

After executing the blobs example of the readme page, I get:

Numba workqueue threading layer is terminating: Concurrent access has been detected.

 - The workqueue threading layer is not threadsafe and may not be accessed concurrently by multiple threads. Concurrent access typically occurs through a nested parallel region launch or by calling Numba parallel=True functions from multiple Python threads.
 - Try using the TBB threading layer as an alternative, as it is, itself, threadsafe. Docs: https://numba.readthedocs.io/en/stable/user/threading-layer.html

when execution reaches float_nndescent.py#L305.
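Following the suggestion in the Numba message itself, one thing to try is requesting the TBB threading layer through Numba's `NUMBA_THREADING_LAYER` environment variable before Numba is first imported. This is a sketch of that setting only. It requires the `tbb` package to be installed, and whether it resolves this particular report is not confirmed here.

```python
import os

# Request Numba's TBB threading layer. Set this before numba (and therefore
# evoc) is imported for the first time in the process.
os.environ["NUMBA_THREADING_LAYER"] = "tbb"

# Import evoc only after the variable is set, for example:
# import evoc
print(os.environ["NUMBA_THREADING_LAYER"])
```

The same variable can be exported in the shell before launching Python instead of being set in code.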

Question about integration with DataMapPlot

Great work with this package, I'm just starting to experiment with it. Very Exciting!

Just wondering about plugging the clustered data into DataMapPlot. Will UMAP (or other) still be required to reduce higher dim vectors down to 2D to supply to data_map_coords separately? Or can evoc supply that too? Just thinking if evoc is doing some of what UMAP does anyway, is there some efficiency by not recalculating the dimension reduction separately? Or is it better for the user to have discrete control over the coordinates for the visualization?

Thanks

Why are there so many noise points (outliers)?

Thank you for sharing this great tool with the community. I would say it is UMAP+HDBSCAN on steroids!

Quick question though: when I try to cluster 30k text embeddings, a lot of the texts are grouped as outliers. I have tried changing parameters such as noise_level, base_min_cluster_size, and min_number_clusters, but about 10-15% of the population is still labelled as outliers. If I run UMAP+HDBSCAN manually I get a significantly lower number of outliers.
