topic-model-diversity

A collection of Topic Diversity measures for topic modeling.

Here I collected and implemented most of the known topic diversity measures used for measuring how different topics are. The more diverse the resulting topics are, the higher the coverage of the various aspects of the analyzed corpus. It is therefore important to also obtain topics that are different from each other, rather than only considering how coherent the topics are.

List of the currently implemented metrics:

  • Proportion of Unique Words
  • Pairwise Jaccard Diversity
  • Word Embedding-based Pairwise Distance
  • Word Embedding-based Centroid Distance
  • Inverted Rank-Biased Overlap
  • Word Embedding-based Rank-Biased Overlap

How to use:

The necessary input for all the metrics is a list of topics, i.e., a list of lists of strings. For example:

topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]

You can also specify the parameter topk, which represents the number of words considered for each list. Note that topk must be less than or equal to the length of each topic list.
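
A guard of this kind illustrates the constraint (a minimal sketch; check_topk is a hypothetical helper, not a function of this repository):

def check_topk(topics, topk):
    # Hypothetical helper: every metric assumes topk does not exceed
    # the length of the shortest topic list.
    if topk > min(len(topic) for topic in topics):
        raise ValueError("topk exceeds the length of at least one topic list")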

Here you can find a notebook with some examples: https://github.com/silviatti/topic-model-diversity/blob/master/topic_diversity_experiments.ipynb

Proportion of Unique Words

topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
proportion_unique_words(topics, topk=3)

Out[1]: 1.0
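
Conceptually, this metric is the fraction of distinct words among all the top-k words pooled across topics (cf. Dieng et al., 2020, in the references). A minimal sketch of that computation; puw_sketch is an illustrative name, not the repository's function:

def puw_sketch(topics, topk):
    # Pool the top-k words of every topic and count the distinct ones,
    # then normalize by the total number of word slots (topk per topic).
    unique_words = set(word for topic in topics for word in topic[:topk])
    return len(unique_words) / (topk * len(topics))

In the example above all nine words are distinct, hence the score of 1.0.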

Pairwise Jaccard Diversity

topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
pairwise_jaccard_diversity(topics, topk=3)

Out[1]: 1.0
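
This score averages the Jaccard distance (one minus the Jaccard similarity) over all unordered pairs of topics. A minimal sketch under that reading; pjd_sketch is an illustrative name:

from itertools import combinations

def pjd_sketch(topics, topk):
    # Average 1 - |A ∩ B| / |A ∪ B| over every unordered pair of topics.
    distances = []
    for t1, t2 in combinations(topics, 2):
        a, b = set(t1[:topk]), set(t2[:topk])
        distances.append(1 - len(a & b) / len(a | b))
    return sum(distances) / len(distances)

No pair of example topics shares a word, so every pairwise distance is 1 and the average is 1.0.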

Word Embedding-based Pairwise Distance

This metric requires a word embedding space as input to compute distances (parameter word_embedding_model). Please use gensim to load the word embedding space.

import gensim
wv = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
pairwise_word_embedding_distance(topics, wv, topk=3)

Out[1]: 0.6379696850836505
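
Conceptually, this metric averages the embedding distance between the top-k words of every pair of topics. A minimal sketch, assuming the loaded fastText model exposes its vectors through gensim's standard .wv attribute; pwed_sketch is an illustrative name:

from itertools import combinations

def pwed_sketch(topics, model, topk):
    # Average cosine distance (1 - similarity) between the top-k words
    # of every unordered pair of topics.
    total, count = 0.0, 0
    for t1, t2 in combinations(topics, 2):
        for w1 in t1[:topk]:
            for w2 in t2[:topk]:
                total += 1 - model.wv.similarity(w1, w2)
                count += 1
    return total / count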

Word Embedding-based Centroid Distance

This metric requires a word embedding space as input to compute distances (parameter word_embedding_model). Please use gensim to load the word embedding space.

import gensim
wv = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
centroid_distance(topics, wv, topk=3)

Out[1]: 0.8380562411966147
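
Here each topic is summarized by the centroid (mean) of its top-k word vectors, and the score averages the distance between centroids over topic pairs. A minimal sketch under that reading, again assuming gensim's .wv interface; cd_sketch is an illustrative name:

import numpy as np
from itertools import combinations

def cd_sketch(topics, model, topk):
    # One centroid per topic: the mean of its top-k word vectors.
    centroids = [np.mean([model.wv[w] for w in topic[:topk]], axis=0)
                 for topic in topics]

    # Average cosine distance over all unordered centroid pairs.
    def cosine_distance(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pairs = list(combinations(centroids, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)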

Inverted Rank-Biased Overlap

Parameter weight controls how top-weighted the metric is. The smaller the weight, the more top-weighted the metric is. When weight = 0, only the top-ranked word is considered.

topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
print("irbo p=0.5:",irbo(topics, weight=0.5, topk=3))

Out[1]: 1.0
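
IRBO is one minus the average rank-biased overlap over all topic pairs, so fully disjoint ranked lists score 1 and identical lists score 0. A minimal sketch, assuming the bundled dlukes/rbo module exposes rbo(list1, list2, p) with an .ext (extrapolated) field; irbo_sketch is an illustrative name:

from itertools import combinations
import rbo  # the bundled https://github.com/dlukes/rbo module

def irbo_sketch(topics, weight, topk):
    # 1 - mean extrapolated RBO over all unordered topic pairs.
    pairs = list(combinations(topics, 2))
    overlap = sum(rbo.rbo(t1[:topk], t2[:topk], p=weight).ext
                  for t1, t2 in pairs)
    return 1 - overlap / len(pairs)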

Word Embedding-based Rank-Biased Overlap

This metric requires a word embedding space as input to compute distances (parameter word_embedding_model). Please use gensim to load the word embedding space. Parameter weight controls how top-weighted the metric is. The smaller the weight, the more top-weighted the metric is. When weight = 0, only the top-ranked word is considered.

import gensim
wv = gensim.models.fasttext.load_facebook_model('cc.en.300.bin.gz')
topics = [['cat', 'animal', 'dog'], ['building', 'bank', 'house'], ['nature', 'wilderness', 'lake']]
word_embedding_irbo(topics, wv, weight=0.9, topk=3)

Out[1]: 0.8225350005800525
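
This variant relaxes RBO's exact-match overlap: words from one ranked list are matched to similar words in the other through the embedding space, so near-synonyms reduce the diversity score even when the lists share no identical words. The repository's exact formulation may differ; the sketch below only illustrates the soft-overlap idea at a single depth (soft_overlap is a hypothetical helper):

import numpy as np

def soft_overlap(list1, list2, depth, model):
    # Soft agreement between two length-`depth` prefixes: each word in one
    # prefix is credited with the similarity of its best match in the other.
    sims = np.array([[model.wv.similarity(w1, w2) for w2 in list2[:depth]]
                     for w1 in list1[:depth]])
    return sims.max(axis=1).mean()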

Credits:

For the implementation of Inverted Rank-Biased Overlap, I included the https://github.com/dlukes/rbo package; all rights belong to the author of that package.

References:

Adji Bousso Dieng, Francisco J. R. Ruiz, and David M. Blei. 2020. Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguistics, 8:439–453.
Nam Khanh Tran, Sergej Zerr, Kerstin Bischoff, Claudia Niederée, Ralf Krestel: Topic Cropping: Leveraging Latent Topics for the Analysis of Small Corpora. TPDL 2013: 297-308
Federico Bianchi, Silvia Terragni, Dirk Hovy: Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. Preprint (2020a)
Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, Elisabetta Fersini: Cross-lingual Contextualized Topic Models with Zero-shot Learning. Preprint (2020b)
