This project is a fork of machine-intelligence-laboratory/optimalnumberoftopics.

A set of methods for finding an appropriate number of topics in a text collection

License: MIT License

OptimalNumberOfTopics

To begin with, searching for an optimal number of topics in a text collection seems a poorly stated problem, because this number heavily depends on the task at hand. One can take 10 topics and it might be enough, or 100 topics, or 1000. What's more, the whole notion of a topic is a bit vague: people think of topics simply as meaningful stories, concepts or ideas. There are also parent-child relationships between such topics, e.g. the topic "Coconut juice" is a child of the topic "Beverages". This means that for one dataset one can train a good topic model with, say, 10 big parent topics, or another good topic model with, for example, 100 more concrete, smaller topics.

So, what is this repository about then? It gives an opportunity to try different methods for finding an appropriate, approximate number of topics: a number which, in order of magnitude, is close to the number of not-so-small topics in the collection.

Optimize Scores

The first method is simply about optimizing some score with respect to the number of topics. That is, train several models with different numbers of topics, calculate a quality score for each of them, and pick the number of topics whose model is the best.

(Figure: the idea behind score optimization.)

Scores available for optimizing include perplexity, Rényi entropy, intratext coherence, top tokens coherence, and diversity (see the topnum/scores module for the full list).

Let's say one has a text collection as a Vowpal Wabbit file vw.txt:

doc_1 |@publisher mann_ivanov_ferber |@title atlas obscura |@text earth:8 travel:10 baobab:1 ...
doc_2 |@publisher chook_and_geek |@title black hammer |@text hero:10 whiskey:2 barbalien:4 ...
doc_3 |@publisher eksmo |@title dune |@text sand:7 arrakis:6 spice:12 destiny:2 ...
...

Then it is possible to find an optimal number of topics for this collection by looking at some topic model's characteristics (scores) and choosing the number of topics which corresponds to the best model.

The searching process can be started like this:

python run_search.py \
    vw.txt \                    # path to vowpal wabbit file
    @text:1 \                   # main modality and its weight
    result.json \               # output file path (the file may not exist)
    -m @publisher:5 \           # other modality and its weight
    --modality @title:2 \       # other modality and its weight
    optimize_scores \           # search method
    --min-num-topics 1 \        # minimum number of topics in the text collection
    --max-num-topics 10 \       # maximum number of topics in the text collection
    --num-topics-interval 2 \   # search step in number of topics
    --num-fit-iterations 100 \  # number of fit iterations for each model training
    --num-restarts 10 \         # number of training restarts that differ in seed
    perplexity \                # what score to optimize
    renyi_entropy \             # another score to optimize
    --threshold-factor 2.0 \    # previous score parameter
    intratext_coherence \       # one more score
    top_tokens_coherence        # and yet another one

And the result.json file will look like this: (TODO: try on real data to get meaningful values)

{
    "score_results":
    {
        "perplexity_score":
        {
            "optimum": 9.0,
            "optimum_std": 0.0,
            "num_topics_values": [1.0, 3.0, 5.0, 7.0, 9.0],
            "score_values": [1374.69, 685.37, 494.05, 377.24, 313.09],
            "num_topics_values_std": [0.0, 0.0, 0.0, 0.0, 0.0],
            "score_values_std": [0.0, 0.0, 0.0, 0.0, 0.0]
        },
        "renyi_entropy_score":
        {
            "optimum": 3.0,
            "optimum_std": 0.0,
            "num_topics_values": [1.0, 3.0, 5.0, 7.0, 9.0],
            "score_values": [1983797813.52, 1.37, 1.63, 1.84, 2.00],
            "num_topics_values_std": [0.0, 0.0, 0.0, 0.0, 0.0],
            "score_values_std": [9.87e-07, 2.30e-16, 2.30e-16, 4.60e-16, 6.90e-16]
        },
        "intratext_coherence_score":
        {
            "optimum": 1.0,
            "optimum_std": 0.0,
            "num_topics_values": [1.0, 3.0, 5.0, 7.0, 9.0],
            "score_values": [72.90, 21.92, 12.73, 9.21, 6.88],
            "num_topics_values_std": [0.0, 0.0, 0.0, 0.0, 0.0],
            "score_values_std": [1.47e-14, 0.0, 3.68e-15, 1.84e-15, 2.76e-15]
        },
        "top_tokens_coherence_score":
        {
            "optimum": 1.0,
            "optimum_std": 0.0,
            "num_topics_values": [1.0, 3.0, 5.0, 7.0, 9.0],
            "score_values": [0.834, 0.42, 0.76, 0.79, 0.53],
            "num_topics_values_std": [0.0, 0.0, 0.0, 0.0, 0.0],
            "score_values_std": [3.45e-16, 1.15e-16, 1.15e-16, 2.30e-16, 1.15e-16]
        }
    }
}

Here optimum means the optimal number of topics according to the score, and score_values are the values of the score: each value corresponds to the number of topics at the same index in num_topics_values. The *_std fields contain the corresponding standard deviations.
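
For example, the optimum suggested by each score can be read back from result.json with a small snippet like the one below (based on the file layout shown above):

import json

with open('result.json') as f:
    result = json.load(f)

# print the optimal number of topics suggested by each score
for score_name, score_result in result['score_results'].items():
    print(score_name, '->', score_result['optimum'], 'topics')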

Another way to run the process is via a bash script:

#!/bin/bash

general_args=(
    ./sample/vw.txt
    @text:1
    result.json
    -m @title:3
    --modality @publisher:2
)

search_method_args=(
    optimize_scores
    --max-num-topics 10
    --min-num-topics 1
    --num-topics-interval 2
    --num-fit-iterations 2
    --num-restarts 3
    perplexity
    renyi_entropy
    intratext_coherence
    top_tokens_coherence
    --cooc-file ./sample/cooc_values.json
)

python run_search.py "${general_args[@]}" "${search_method_args[@]}"

Or from within a .py file or a Jupyter notebook:

import json

from topnum.data import VowpalWabbitTextCollection
from topnum.scores import (
    DiversityScore,
    EntropyScore,
    IntratextCoherenceScore,
    PerplexityScore,
    SophisticatedTopTokensCoherenceScore,
)
from topnum.search_methods import OptimizeScoresMethod


modalities = {
    '@text': 1,
    '@title': 3,
    '@publisher': 2,
}
text_collection = VowpalWabbitTextCollection(
    'sample/vw.txt',
    main_modality='@text',
    modalities=modalities,
)
modality_names = list(modalities.keys())

scores = [
    PerplexityScore(
        'perplexity_score',
        class_ids=modality_names,
    ),
    EntropyScore(
        'renyi_entropy_score',
        class_ids=modality_names,
    ),
    DiversityScore(
        'diversity_score',
        class_ids=modality_names,
    ),
    IntratextCoherenceScore(
        'intratext_coherence_score',
        data=text_collection,
    ),
    SophisticatedTopTokensCoherenceScore(
        'top_tokens_coherence_score',
        data=text_collection,
    )
]

optimizer = OptimizeScoresMethod(
    scores=scores,
    min_num_topics=1,
    max_num_topics=10,
    num_topics_interval=2,
    num_fit_iterations=2,
    num_restarts=3,
)

optimizer.search_for_optimum(text_collection)

with open('result.json', 'w') as f:
    f.write(json.dumps(optimizer._result))

More about the available scores can be found in the topnum/scores module.

TopicBank

The idea is to keep searching for new interpretable topics for as long as possible, training many topic models in the process. Since finding an appropriate number of topics in a document collection is the task at hand, once all the interpretable topics found are collected in the bank, their number may serve as this appropriate number of topics.
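
As a rough sketch of the idea (with hypothetical helper functions passed in as arguments, not the library's actual TopicBank API), the loop might look like this:

def build_topic_bank(train_model, extract_topics, is_interpretable, is_duplicate,
                     num_models=100):
    # Schematic TopicBank loop: keep training models and collecting
    # interpretable topics that are not already in the bank; the final
    # bank size suggests an appropriate number of topics.
    # All helper callables here are hypothetical placeholders.
    bank = []

    for seed in range(num_models):
        model = train_model(seed=seed)

        for topic in extract_topics(model):
            if is_interpretable(topic) and not is_duplicate(topic, bank):
                bank.append(topic)

    return bank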

For some more details, one may look at the demos folder and the topnum/search_methods module.

Renormalization

The approach is described in the following paper:
Sergei Koltcov, Vera Ignatenko, and Sergei Pashakhin. "Fast tuning of topic models: an application of Rényi entropy and renormalization theory", 2019.

Briefly, one model with a large number of topics is trained. Then the number of topics is gradually reduced down to a single topic: at each iteration, two topics are selected by some criterion and merged into one. The minimum of the entropy along the way is supposed to indicate the best, optimal number of topics, at which the model is most stable.
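
As a rough illustration of this procedure (not the library's implementation: the merge criterion and the entropy function below are assumptions), the renormalization loop might look like this:

import numpy as np


def renormalize(phi, entropy_fn, min_num_topics=1):
    # Schematic renormalization loop.
    # phi: word-topic matrix of shape (num_words, num_topics), columns sum to 1
    # entropy_fn: callable computing, e.g., Renyi entropy for a phi matrix
    phi = phi.copy()
    history = []  # (num_topics, entropy) pairs

    while phi.shape[1] > min_num_topics:
        # choose the pair of topics to merge; a simple (assumed) criterion
        # is maximum cosine similarity between topic columns
        normed = phi / np.linalg.norm(phi, axis=0, keepdims=True)
        similarity = normed.T @ normed
        np.fill_diagonal(similarity, -np.inf)
        i, j = np.unravel_index(np.argmax(similarity), similarity.shape)

        # merge topic j into topic i and renormalize the merged column
        merged = phi[:, i] + phi[:, j]
        phi = np.delete(phi, j, axis=1)
        phi[:, i if i < j else i - 1] = merged / merged.sum()

        history.append((phi.shape[1], entropy_fn(phi)))

    # the number of topics at the entropy minimum is taken as optimal
    return min(history, key=lambda pair: pair[1])[0]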

The method can be invoked like this:

python run_search.py \
    vw.txt \                    # path to vowpal wabbit file
    @text:1 \                   # main modality and its weight
    result.json \               # output file path (the file may not exist)
    -m @publisher:5 \           # other modality and its weight
    --modality @title:2 \       # other modality and its weight
    renormalize \               # search method
    --max-num-topics 100 \      # maximum number of topics in the text collection
    --num-fit-iterations 100 \  # number of fit iterations for each model training
    --num-restarts 10 \         # number of training restarts that differ in seed
    --matrix phi                # matrix to use for renormalization

Stability

By assumption, an optimal number of topics is supposed to provide some stability in model training: models trained on different subsets of documents from the same corpus should be alike.

The idea is similar to the one described in the following paper:
Derek Greene, Derek O’Callaghan, and Pádraig Cunningham. "How many topics? Stability analysis for topic models", 2014. However, here we do not use the notion of a reference ranking set: we just train several topic models on different parts of the corpus and compare them all pairwise.
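
As a rough sketch of one such pairwise comparison (not the library's implementation; the Hungarian matching over cosine similarities is an assumption), two models trained on different document subsets could be compared like this:

import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_model_similarity(phi_a, phi_b):
    # Compare two word-topic matrices (columns are topics).
    # Topics are matched one-to-one so that the total cosine similarity
    # is maximal; the mean similarity of the matched pairs is returned
    # (values close to 1 mean the two models found very similar topics).
    normed_a = phi_a / np.linalg.norm(phi_a, axis=0, keepdims=True)
    normed_b = phi_b / np.linalg.norm(phi_b, axis=0, keepdims=True)
    similarity = normed_a.T @ normed_b  # shape: (num_topics_a, num_topics_b)

    # the Hungarian algorithm maximizes the total similarity of the matching
    rows, cols = linear_sum_assignment(-similarity)

    return similarity[rows, cols].mean()

Averaging such similarities over all pairs of models trained with the same number of topics (but on different document subsets) gives a stability estimate for that number of topics.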

One may also take a look at the demo notebook about the stability approach used in the library (in the demos folder).

Structure

.
├── run_search.py       # Main script which handles all the methods and their parameters and provides a way to run the process through the command line
├── demos               # Demo notebooks with experiments on real data
├── sample              # Toy data sample and scripts to try
└── topnum              # Core library functionality
    ├── data            # Training data handling (e.g. Vowpal Wabbit files)
    ├── scores          # Scores that are available for optimizing or tracking
    └── search_methods  # Some techniques and ideas that can be used for finding an appropriate number of topics
