dborrelli / chat-intents

Clustering sentence embeddings to extract message intent

License: MIT License

Topics: nlp, sentence-embeddings, clustering, unsupervised-learning, document-embeddings

chat-intents's Introduction

chat-intents

ChatIntents provides a method for automatically clustering and applying descriptive group labels to short text documents containing dialogue intents. It uses UMAP for dimensionality reduction of user-supplied document embeddings and HDBSCAN for clustering. Hyperparameters are tuned automatically by running a Bayesian search (using hyperopt) over a constrained objective function with user-supplied bounds.

See the associated Medium post for additional description and motivation.
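
For intuition, the core reduce-then-cluster step inside the package looks roughly like the sketch below, which calls the umap-learn and hdbscan libraries directly (the parameter values are illustrative, not the package's defaults):

import umap
import hdbscan

# reduce the high-dimensional sentence embeddings to a few components
umap_embeddings = umap.UMAP(n_neighbors=15,
                            n_components=5,
                            metric='cosine',
                            random_state=42).fit_transform(embeddings)

# cluster the reduced embeddings; HDBSCAN labels outliers as -1
clusters = hdbscan.HDBSCAN(min_cluster_size=5,
                           metric='euclidean',
                           cluster_selection_method='eom').fit(umap_embeddings)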

Installation

Installation can be done using PyPI:

pip install chatintents

Note: Depending on your system setup and environment, you may encounter an error associated with the pip install of HDBSCAN (failure to build the hdbscan wheel). This is a known issue with HDBSCAN and has several possible solutions. If you are already using a conda virtual environment, an easy solution is to conda install HDBSCAN before installing the chatintents package:

conda install -c conda-forge hdbscan

Sentence embeddings

The chatintents package doesn't create sentence embeddings itself or prescribe how to create them; you supply document embeddings from a model of your choice. Two popular pre-trained embedding models, as shown in the tutorial notebook, are the Universal Sentence Encoder (USE) and Sentence Transformers.

Sentence Transformers can be installed with:

pip install -U sentence-transformers

The Universal Sentence Encoder requires installing TensorFlow and TensorFlow Hub:

pip install tensorflow
pip install --upgrade tensorflow-hub
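
The Quick Start below uses Sentence Transformers; for USE, a minimal sketch of producing embeddings (assuming docs is a list of message strings; the URL is the public TF Hub module for USE v4) looks like:

import tensorflow_hub as hub

# load the pre-trained Universal Sentence Encoder from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# encode a list of message strings into 512-dimensional vectors
embeddings = embed(docs).numpy()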

Quick Start

The example below uses a Sentence Transformer model to embed the messages and create a model instance:

import chatintents
from chatintents import ChatIntents

from sentence_transformers import SentenceTransformer

# docs is assumed to be a pandas DataFrame with one message per row in a 'text' column
all_intents = list(docs['text'])

st_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = st_model.encode(all_intents)

model = ChatIntents(embeddings, 'st1')

Creating a ChatIntents instance requires two inputs: an embedding representation of all documents and a short string name for the model (no spaces).
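
A quick sanity check before creating the instance: the embeddings should be a 2-D array of shape (number of documents, embedding dimension). For example (a sketch; all-mpnet-base-v2 produces 768-dimensional vectors):

print(embeddings.shape)  # e.g. (len(all_intents), 768) for all-mpnet-base-v2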

Generating clusters

Methods are provided for generating clusters from user-supplied hyperparameters, from a random search, and from a Bayesian search.

User-supplied hyperparameters and manual scoring

clusters = model.generate_clusters(n_neighbors = 15, 
                                   n_components = 5, 
                                   min_cluster_size = 5, 
                                   min_samples = None,
                                   random_state=42)

labels, cost = model.score_clusters(clusters)
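
The clusters object returned by generate_clusters appears to be the fitted HDBSCAN clusterer (as suggested by the best_clusters.labels_ usage in the issues below), so its labels_ attribute can be inspected directly. A quick sanity check (a sketch; -1 is HDBSCAN's noise label):

import numpy as np

# count distinct clusters, excluding HDBSCAN's -1 noise label if present
n_clusters = len(np.unique(clusters.labels_)) - (1 if -1 in clusters.labels_ else 0)
print(f"{n_clusters} clusters, cost = {cost:.3f}")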

Random search

To run 100 evaluations with randomly selected hyperparameter values drawn from user-supplied ranges:

space = {
    "n_neighbors": range(12, 16),
    "n_components": range(3, 7),
    "min_cluster_size": range(2, 15),
    "min_samples": range(2, 15)
}

df_random = model.random_search(space, 100)
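
Each row of the returned dataframe records one random trial and its score; a quick look (exact column names depend on the package version, so treat them as assumptions):

# inspect the first few random-search trials
print(df_random.shape)
print(df_random.head())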

Bayesian search

Perform a Bayesian search of the hyperparameter space using hyperopt and user-supplied upper and lower bounds for the number of expected clusters:

from hyperopt import hp

hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3, 16)),
    "n_components": hp.choice('n_components', range(3, 16)),
    "min_cluster_size": hp.choice('min_cluster_size', range(2, 16)),
    "min_samples": None,
    "random_state": 42
}

label_lower = 30
label_upper = 100
max_evals = 100

model.bayesian_search(space=hspace,
                      label_lower=label_lower, 
                      label_upper=label_upper, 
                      max_evals=max_evals)

Running the bayesian_search method on a model instance saves the best parameters and best clusters to that instance as attributes. For example:

>>> model.best_params

{'min_cluster_size': 5,
 'min_samples': None,
 'n_components': 11,
 'n_neighbors': 3,
 'random_state': 42}
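
The fitted clusterer itself is also stored on the instance; as a sketch (labels_ follows HDBSCAN's convention, with -1 marking noise):

# per-document cluster assignments from the best clustering found
best_labels = model.best_clusters.labels_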

Applying labels to best clusters from Bayesian search

After running the bayesian_search method to identify the best clusters for a given embedding model, descriptive labels can then be applied with:

df_summary, labeled_docs = model.apply_and_summarize_labels(docs[['text']])

This yields two results: the df_summary dataframe, summarizing the count and descriptive label of each group:

[image: example df_summary dataframe]

and the labeled_docs dataframe, listing each document in the dataset with its associated cluster number and descriptive label:

[image: example labeled_docs dataframe]

Evaluating performance if ground truth is known

Two methods are also supplied for evaluating and comparing the performance of different models if the ground truth labels happen to be known:

models = [model_use, model_st1, model_st2, model_st3]

df_comparison, labeled_docs_all_models = chatintents.evaluate_models(
    docs[['text', 'category']],
    models)

An example df_comparison dataframe comparing model performance is shown below:

[image: example df_comparison dataframe]

Tutorial

See this tutorial notebook for an example of using the chatintents package for comparing four different models on a dataset.


chat-intents's Issues

Label extraction only works for English

Hi,
I am using chat-intents and the clustering works very well.
However, I am working with French data and the label extraction gives poor results. I assume this is because the method uses a spaCy model specialized for English.
I was wondering if the name of the loaded spaCy model, or at least the language, could be passed as a parameter to apply_and_summarize_labels, for example?
That way, performance could be much better for all languages other than English.

How can I use all CPUs when tuning hyperparameters?

@dborrelli When I specify a value for the random_state parameter in bayesian_search, I receive the following warning: "UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism."

Hyperparameter tuning is taking a significant amount of time. I want to use the random_state parameter to ensure reproducibility, while also setting n_jobs to -1 to enable parallel processing. What's the best way to achieve this?

AttributeError: 'numpy.ndarray' object has no attribute 'unique'

Hi, while using apply_and_summarize_labels,
I'm hitting the error below. Please help.

df_summary, labeled_docs = model.apply_and_summarize_labels(data_sample.sentence)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_21756/2555802596.py in <module>
----> 1 df_summary, labeled_docs = model.apply_and_summarize_labels(data_sample.sentence)

/opt/conda/lib/python3.9/site-packages/chatintents/ChatIntents.py in apply_and_summarize_labels(self, df_data)
    418         df_clustered[category_col] = self.best_clusters.labels_
    419 
--> 420         numerical_labels = df_clustered[category_col].unique()
    421 
    422         # create dictionary mapping the numerical category to the generated

AttributeError: 'numpy.ndarray' object has no attribute 'unique'
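
A likely cause, judging from the README's own usage (model.apply_and_summarize_labels(docs[['text']])), is that the method expects a one-column pandas DataFrame rather than a Series; a sketch of the workaround:

# pass a one-column DataFrame (double brackets), not a Series
df_summary, labeled_docs = model.apply_and_summarize_labels(data_sample[['sentence']])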


Install not working

!pip install chatintents
leads to

ERROR: Could not find a version that satisfies the requirement chatintents (from versions: none)
ERROR: No matching distribution found for chatintents

I'm using Google Colab.
