Comments (3)

MaartenGr commented on May 8, 2024

Although the approaches may look similar, their implementations are actually quite different. In practice, you will not be able to recreate KeyBERT with BERTopic or vice versa. To make this clear, I'll go through the models individually and then compare them.

BERTopic

The procedure of BERTopic is demonstrated below:

[image: diagram of the BERTopic procedure]

Here, you can see that there are three distinct steps:

  1. Embedding documents
  2. Clustering documents
  3. Creating a topic representation.

The main output of BERTopic is a set of words per topic. Thus, multiple documents have the same topic representation.
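
To make steps 2 and 3 a bit more concrete, here is a minimal Python sketch of a class-based TF-IDF weighting in the spirit of c-TF-IDF. It assumes the documents have already been embedded and clustered, so the cluster labels are simply given. The labels, the tiny stop-word list, and the helper name `c_tf_idf_top_words` are hypothetical, and the weighting is a simplification for illustration, not BERTopic's actual implementation.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "on"}  # tiny illustrative list (assumption)

def c_tf_idf_top_words(docs, labels, top_n=2):
    """Return {cluster_label: [top words]} using a simplified class-based TF-IDF."""
    # Merge all documents of a cluster into one bag of words per class.
    class_counts = {}
    for doc, label in zip(docs, labels):
        words = (w for w in doc.lower().split() if w not in STOP_WORDS)
        class_counts.setdefault(label, Counter()).update(words)

    # Word frequency across all classes and average number of words per class.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    topics = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            word: (freq / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, freq in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        topics[label] = ranked[:top_n]
    return topics

docs = [
    "the cat chased the mouse",
    "the cat slept on the sofa",
    "the stock market fell sharply",
    "investors sold the stock quickly",
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments from step 2

topics = c_tf_idf_top_words(docs, labels)
# Every document in a cluster shares that cluster's topic representation.
doc_topics = [topics[label] for label in labels]
```

Note that `doc_topics` holds the same word list for the first two documents and a different one for the last two, which is the "set of words per topic" behavior described above.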

KeyBERT

KeyBERT can roughly be divided into the following steps:

  1. Embedding documents
  2. Creating candidate keywords
  3. Calculating best keywords through either MMR, Max Sum Similarity, or Cosine Similarity

The main output of KeyBERT is a set of words per document. Thus, each document is expected to have different keywords.
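
The three steps can be sketched in plain Python as follows. This is only an illustration under strong assumptions: the `embed` function is a toy bag-of-words stand-in for a real sentence embedding model, candidates are limited to single words, and ranking uses plain cosine similarity. The helper names and the stop-word list are hypothetical and do not reflect KeyBERT's actual code.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}  # illustrative only

def embed(text):
    """Toy stand-in for a sentence embedding: a sparse bag-of-words vector."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def extract_keywords(doc, top_n=1):
    doc_vec = embed(doc)                                   # 1. embed the document
    candidates = sorted(doc_vec)                           # 2. candidate keywords (unigrams)
    scored = [(c, cosine(doc_vec, embed(c))) for c in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))      # 3. rank by similarity
    return [word for word, _ in scored[:top_n]]

doc_a = "solar panels convert sunlight and solar energy to electricity"
doc_b = "the bakery sells fresh bread and sweet bread rolls"

keywords_a = extract_keywords(doc_a)
keywords_b = extract_keywords(doc_b)
```

Each document is scored only against its own candidates, so `keywords_a` and `keywords_b` come out different, in line with the "set of words per document" output.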

BERTopic vs. KeyBERT

The main similarities between the two methods are that they embed documents and leverage MMR (although both models may opt not to). To me, that is essentially where the similarities end. The main difference is everything that happens between embedding documents and, in some cases, leveraging MMR. For example, BERTopic aims to cluster documents and create a broad representation of multiple documents whereas KeyBERT does not. Moreover, when it comes down to algorithmic implementation, the UMAP/HDBSCAN/c-TF-IDF route is quite different from generating candidate keywords and comparing them to the individual documents.
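
Since MMR (Maximal Marginal Relevance) is the one component both models can share, a small sketch may help. The version below is a generic greedy MMR over precomputed similarities, not code taken from either library. The function name, the `diversity` weighting, and the similarity numbers are illustrative assumptions.

```python
def mmr(doc_sims, cand_sims, top_n=2, diversity=0.5):
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    doc_sims[i]     : similarity of candidate i to the document (or topic).
    cand_sims[i][j] : similarity between candidates i and j.
    Returns the indices of the selected candidates in selection order.
    """
    n = len(doc_sims)
    # Start with the candidate most similar to the document.
    selected = [max(range(n), key=lambda i: doc_sims[i])]
    while len(selected) < min(top_n, n):
        remaining = [i for i in range(n) if i not in selected]

        def score(i):
            # Penalize candidates that resemble something already chosen.
            redundancy = max(cand_sims[i][j] for j in selected)
            return (1 - diversity) * doc_sims[i] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return selected

# Made-up similarities for illustration only.
candidates = ["topic modeling", "topic modelling", "keyword extraction"]
doc_sims = [0.90, 0.85, 0.60]
cand_sims = [
    [1.00, 0.95, 0.20],
    [0.95, 1.00, 0.20],
    [0.20, 0.20, 1.00],
]

diverse = [candidates[i] for i in mmr(doc_sims, cand_sims, diversity=0.5)]
greedy = [candidates[i] for i in mmr(doc_sims, cand_sims, diversity=0.0)]
```

With `diversity=0.0` the near-duplicate spelling is picked second because only relevance counts; with `diversity=0.5` the less similar but non-redundant candidate wins instead.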

When to use BERTopic vs. KeyBERT

As you might have already noticed from the descriptions above, both the purpose and output of the methods differ. BERTopic, like most topic modeling techniques, is meant to explore the data and build an understanding of the perhaps millions of documents that you have collected. KeyBERT, in contrast, is not able to do this because it creates a completely different set of words for each document. A typical use of KeyBERT, and of most keyword extraction algorithms, is automatically creating relevant keywords for content (blogs, articles, etc.) that businesses post on their websites.

P.S. I kinda went overboard with this explanation, but seeing as several people liked your question, it seemed to be important to others as well. If I wasn't clear or if you have any follow-up questions, don't hesitate to ask!

from keybert.

shoegazerstella commented on May 8, 2024

Hello @MaartenGr and thanks a lot for the clean clarification!

voxmenthe commented on May 8, 2024

Great explanation, really appreciated being able to find this. Thanks!
