
This project is a fork of maartengr/bertopic.

Leveraging BERT and a class-based TF-IDF to create easily interpretable topics.

License: MIT License


BERTopic


BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

The corresponding Medium post can be found here.

Table of Contents

  1. About the Project
  2. Getting Started
    2.1. Installation
    2.2. Basic Usage
    2.3. Overview
  3. Algorithm
    3.1. Sentence Transformer
    3.2. UMAP + HDBSCAN
    3.3. c-TF-IDF
  4. Google Colaboratory

1. About the Project

Back to ToC

The initial purpose of this project was to generalize Top2Vec such that it could be used with state-of-the-art pre-trained transformer models. However, this proved difficult due to the different natures of Doc2Vec and transformer models. Instead, I decided to come up with a different algorithm that could use BERT and 🤗 Transformers embeddings. The result is BERTopic, an algorithm for generating topics using state-of-the-art embeddings.

2. Getting Started

Back to ToC

2.1. Installation

PyTorch 1.2.0 or higher is recommended. If the installation below gives an error, please install PyTorch first (see here).

Installation can be done using PyPI:

pip install bertopic

PyTorch

If you want to use a GPU / CUDA, you must install PyTorch with a matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
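
If you are unsure whether your PyTorch install can see the GPU, a quick sanity check:

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA-enabled GPU is usable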

2.2. Basic Usage

Below is an example of how to use the model. The example uses the 20 newsgroups dataset.

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load all 20 newsgroups posts as a list of raw document strings
docs = fetch_20newsgroups(subset='all')['data']

# Create the model with a sentence-transformers embedding model and fit it
model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)
topics = model.fit_transform(docs)

The resulting topics can be accessed through model.get_topic(topic):

>>> model.get_topic(9)
[('game', 0.005251396890032802),
 ('team', 0.00482651185323754),
 ('hockey', 0.004335032060690186),
 ('players', 0.0034782716706978963),
 ('games', 0.0032873248432630227),
 ('season', 0.003218987432255393),
 ('play', 0.0031855141725669637),
 ('year', 0.002962343114817677),
 ('nhl', 0.0029577648449943144),
 ('baseball', 0.0029245163154193524)]

2.3. Overview

Method                | Code                        | Returns
----------------------|-----------------------------|---------------------------------------
Access single topic   | model.get_topic(12)         | List[Tuple[Word, Score]]
Access all topics     | model.get_topics()          | Dict[Topic, List[Tuple[Word, Score]]]
Get single topic freq | model.get_topic_freq(12)    | int
Get all topic freq    | model.get_topic_freq()      | DataFrame
Fit the model         | model.fit(docs)             | -
Predict new documents | model.transform([new_doc])  | List[int]
Save model            | model.save("my_model")      | -
Load model            | BERTopic.load("my_model")   | -

NOTE: The embeddings themselves are not preserved in the model, as they are only needed to create the clusters. Therefore, it is advised to use fit followed by transform if you want to generalize the model to new documents. For existing documents, it is best to use fit_transform directly, as it only needs to generate the document embeddings once.
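
As a minimal sketch of this fit-then-transform workflow, reusing the docs from the usage example above (the new document string is made up for illustration):

from bertopic import BERTopic

model = BERTopic("distilbert-base-nli-mean-tokens", verbose=True)

# Fit once on the existing documents and get their topics in the same pass
topics = model.fit_transform(docs)

# Generalize to unseen documents; only the new documents need to be embedded
new_topics = model.transform(["The game went to overtime after a late goal."])

# Persist the fitted model (the embeddings are not stored) and restore it
model.save("my_model")
model = BERTopic.load("my_model")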

3. Algorithm

Back to ToC
The algorithm consists, roughly, of three stages:

  • Extract document embeddings with Sentence Transformers
  • Cluster document embeddings to create groups of similar documents with UMAP and HDBSCAN
  • Extract and reduce topics with c-TF-IDF

3.1. Sentence Transformer

We start by creating document embeddings from a set of documents using sentence-transformers. These models are pre-trained for many languages and are great for creating either document or sentence embeddings.

If you have long documents, I would advise you to split them up into paragraphs or sentences, as a BERT-based model in sentence-transformers typically has a token limit.
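
To illustrate what this stage does, here is a minimal sketch using the sentence-transformers package directly, with the same embedding model as in the usage example above:

from sentence_transformers import SentenceTransformer

# Embed each document into a fixed-size vector; split long documents into
# sentences or paragraphs beforehand because of the token limit
embedder = SentenceTransformer("distilbert-base-nli-mean-tokens")
embeddings = embedder.encode(docs, show_progress_bar=True)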

3.2. UMAP + HDBSCAN

Next, in order to cluster the documents using a clustering algorithm such as HDBSCAN, we first need to reduce the dimensionality of the embeddings, as HDBSCAN is prone to the curse of dimensionality.

Thus, we first lower the dimensionality with UMAP, as it preserves local structure well, after which we use HDBSCAN to cluster similar documents.
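
A minimal sketch of this stage, taking the embeddings from the previous step; the parameter values are illustrative assumptions, not necessarily the settings BERTopic uses internally:

import umap
import hdbscan

# Reduce the high-dimensional embeddings while preserving local structure
umap_embeddings = umap.UMAP(n_neighbors=15,
                            n_components=5,
                            metric='cosine').fit_transform(embeddings)

# Cluster the reduced embeddings; documents labeled -1 are outliers
clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                            metric='euclidean',
                            cluster_selection_method='eom').fit(umap_embeddings)
labels = clusterer.labels_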

3.3. c-TF-IDF

What we want to know from the clusters we generated is: what makes one cluster different from another, based on its content? To answer this, we can modify TF-IDF such that it extracts interesting words per topic instead of per document.

When you apply TF-IDF as usual to a set of documents, you are basically comparing the importance of words between documents. Now, what if we instead treat all documents in a single category (e.g., a cluster) as a single document and then apply TF-IDF? The result would be importance scores for words within a cluster. The more important a word is within a cluster, the more representative it is of that topic. In other words, if we extract the most important words per cluster, we get descriptions of topics!

Each cluster is converted to a single document, rather than a set of documents. Then, the frequency of word t is extracted for each class i and divided by the total number of words w; this can be seen as a form of regularization of frequent words in the class. Next, the total, unjoined number of documents m is divided by the total frequency of word t across all n classes.
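
In formula form, the score of word t in class i becomes c-TF-IDF_i = (t_i / w_i) * log(m / sum_j t_j). Below is a minimal sketch of that computation, assuming the documents of each class have already been joined into a single string per class:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(documents_per_class, m):
    # documents_per_class: one joined string per class; m: total number of documents
    count = CountVectorizer().fit(documents_per_class)
    t = count.transform(documents_per_class).toarray()  # frequency of each word per class
    w = t.sum(axis=1)                                   # total number of words per class
    tf = np.divide(t.T, w)                              # regularizes frequent words per class
    sum_t = t.sum(axis=0)                               # total frequency of each word across classes
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)    # words rare across classes score higher
    return np.multiply(tf, idf)                         # c-TF-IDF matrix: words x classes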

4. Google Colaboratory

Back to ToC
Since we are using transformer-based embeddings, you might want to leverage GPU acceleration to speed up the model. For that, I have created a tutorial Google Colab Notebook that you can use to run the model as shown above.

If you want to tweak the inner workings or follow along with the Medium post, use this notebook instead.

References

Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. arXiv preprint arXiv:2008.09470.
