Comments (3)
Although the two approaches may look similar, their implementations are actually quite different. In practice, you will not be able to recreate KeyBERT with BERTopic and vice versa. To make this clear, I'll go through the models individually and then compare them.
BERTopic
The procedure of BERTopic consists of three distinct steps:
- Embedding documents
- Clustering documents
- Creating a topic representation
The main output of BERTopic is a set of words per topic. Thus, multiple documents have the same topic representation.
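To illustrate the third step, here is a minimal sketch in plain Python of a c-TF-IDF-style weighting over clustered documents. It is not BERTopic's actual code: the cluster labels are hand-assigned (standing in for the embedding and clustering steps), and `topic_words`, `STOPWORDS`, and the toy corpus are hypothetical names and data made up for this example. The weighting follows the c-TF-IDF formula as described in the BERTopic documentation: term frequency within a cluster times log(1 + average words per cluster / frequency of the word across all clusters).

```python
import math
from collections import Counter

# Hypothetical toy data. In BERTopic the labels would come from embedding
# the documents and clustering the embeddings; here they are hand-assigned.
docs = [
    "the cat chased the mouse",
    "the cat and the kitten slept",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]
labels = [0, 0, 1, 1]
STOPWORDS = {"the", "and", "as"}  # assumption: a tiny stopword list


def topic_words(docs, labels, top_n=3):
    """Return the top_n highest-weighted words per cluster."""
    # Merge all documents of a cluster into a single bag of words.
    per_topic = {}
    for doc, label in zip(docs, labels):
        words = [w for w in doc.split() if w not in STOPWORDS]
        per_topic.setdefault(label, Counter()).update(words)

    # Word frequency across all clusters and average words per cluster.
    total = Counter()
    for counts in per_topic.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_topic)

    result = {}
    for label, counts in per_topic.items():
        n_words = sum(counts.values())
        scores = {
            word: (count / n_words) * math.log(1 + avg_words / total[word])
            for word, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[label] = ranked[:top_n]
    return result


topics = topic_words(docs, labels)
```

Both documents labeled `0` end up with the same word list, as do both documents labeled `1`, which is the "one representation for many documents" behavior described above.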
KeyBERT
KeyBERT can roughly be divided into the following steps:
- Embedding documents
- Creating candidate keywords
- Calculating best keywords through either MMR, Max Sum Similarity, or Cosine Similarity
The main output of KeyBERT is a set of words per document. Thus, each document is expected to have different keywords.
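The three steps can be sketched in plain Python as follows. This is a toy illustration rather than KeyBERT's implementation: `embed` counts character trigrams as a stand-in for a real sentence embedding model, and `extract_keywords` here is a hypothetical helper written for this example that only implements the plain cosine similarity ranking (no MMR or Max Sum Similarity).

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def embed(text):
    """Toy embedding: character trigram counts (stand-in for a real model)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_keywords(doc, top_n=3):
    # 1. Embed the document.
    doc_vec = embed(doc)
    # 2. Create candidate keywords from the document's own words.
    candidates = sorted({w for w in doc.lower().split() if w not in STOPWORDS})
    # 3. Score each candidate against the document and keep the best ones.
    scored = [(w, cosine(embed(w), doc_vec)) for w in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


docs = [
    "supervised learning is the task of learning a function from labeled examples",
    "the central bank raised interest rates to slow inflation",
]
keywords_per_doc = [extract_keywords(d) for d in docs]
```

Because the candidates are drawn from each document itself, every document gets its own keyword list, in contrast to the shared topic representation of BERTopic.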
BERTopic vs. KeyBERT
The main similarities between the two methods are that they embed documents and leverage MMR (although both models may opt not to). To me, that is essentially where the similarities end. The main difference is everything that happens between embedding documents and, in some cases, leveraging MMR. For example, BERTopic aims to cluster documents and create a broad representation of multiple documents whereas KeyBERT does not. Moreover, when it comes down to algorithmic implementation, the UMAP/HDBSCAN/c-TF-IDF route is quite different from generating candidate keywords and comparing them to the individual documents.
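Since MMR (Maximal Marginal Relevance) is the shared ingredient, a small sketch of the idea may help. The function below is a generic greedy MMR written for illustration, not code taken from either library; the candidate phrases and similarity values are made-up numbers, and the `diversity` parameter name is an assumption chosen for this example.

```python
def mmr(doc_sim, cand_sim, top_n=2, diversity=0.5):
    """Greedily pick top_n candidate indices, trading relevance for novelty.

    doc_sim[i]     : similarity of candidate i to the document
    cand_sim[i][j] : similarity between candidates i and j
    """
    # Start with the candidate most similar to the document.
    selected = [max(range(len(doc_sim)), key=doc_sim.__getitem__)]
    remaining = [i for i in range(len(doc_sim)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        def score(i):
            # Penalize candidates that resemble something already selected.
            redundancy = max(cand_sim[i][j] for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up similarities for three hypothetical candidate phrases.
candidates = ["neural network", "neural networks", "training data"]
doc_sim = [0.90, 0.88, 0.70]
cand_sim = [
    [1.00, 0.95, 0.30],
    [0.95, 1.00, 0.32],
    [0.30, 0.32, 1.00],
]

plain = [candidates[i] for i in mmr(doc_sim, cand_sim, diversity=0.0)]
diverse = [candidates[i] for i in mmr(doc_sim, cand_sim, diversity=0.5)]
```

With `diversity=0.0` the selection reduces to a plain similarity ranking and returns the two near-duplicate phrases; with `diversity=0.5` the second pick becomes "training data" because it overlaps less with the first.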
When to use BERTopic vs. KeyBERT
As you might have already noticed from the descriptions above, both the purpose and output of the methods differ. BERTopic, and in that sense most topic modeling techniques, are meant to explore the data to create an understanding of the perhaps millions of documents that you have collected. KeyBERT, in contrast, is not able to do this as it creates a completely different set of words per document. An example of using KeyBERT, and in that sense most keyword extraction algorithms, is automatically creating relevant keywords for content (blogs, articles, etc.) that businesses post on their website.
P.S. I kinda went overboard with this explanation but seeing as there were several people that liked your question it seemed to be important to several others. If I wasn't clear or if you have any follow-up questions, don't hesitate to ask!
from keybert.
Hello @MaartenGr and thanks a lot for the clean clarification!
Great explanation, really appreciated being able to find this. Thanks!