topicwizard

Pretty and opinionated topic model visualization in Python.

New in version 0.4.0 🌟

Introduced topic pipelines that make it easier and safer to use topic models in downstream tasks and interpretation. 🔩

Features

Investigate complex relations between topics, words, documents and groups/genres/labels
Easy to use pipelines that can be utilized for downstream tasks
Sklearn, Gensim and BERTopic compatible 🔩
Highly interactive web app
Interactive and composable Plotly figures
Automatically infer topic names, oooor...
Name topics manually
Easy deployment 🌍

Installation

Install from PyPI:

pip install topic-wizard

Pipelines

The main abstraction of topicwizard is the topic pipeline. It consists of a vectorizer, which turns texts into bag-of-tokens representations, and a topic model, which decomposes these representations into vectors of topic importance. topicwizard lets you use either scikit-learn pipelines or its own TopicPipeline.
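To make "bag-of-tokens" concrete, here is a minimal standard-library sketch of what the vectorizer step produces conceptually. The toy texts and the `bag_of_tokens` helper are made up for illustration; real vectorizers such as CountVectorizer also handle tokenization, stop words and frequency cutoffs.

```python
from collections import Counter

# Toy corpus; the texts are made up for illustration.
texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Vocabulary: every distinct token, in a fixed (sorted) order.
vocabulary = sorted({token for text in texts for token in text.split()})


def bag_of_tokens(text):
    """Count how often each vocabulary entry occurs in one document."""
    counts = Counter(text.split())
    return [counts[token] for token in vocabulary]


# One row per document, one column per vocabulary entry.
doc_term_matrix = [bag_of_tokens(text) for text in texts]

print(vocabulary)
print(doc_term_matrix)
```

Word order is discarded: each document is reduced to a row of counts, and the topic model then works only on this matrix.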

Let's build a pipeline. We will use scikit-learn's CountVectorizer as our vectorizer component:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")

The topic model I will use for this example is Non-negative Matrix Factorization as it is fast and usually finds good topics.

from sklearn.decomposition import NMF

model = NMF(n_components=10)

Then let's put this all together in a pipeline. You can either use sklearn Pipelines...

from sklearn.pipeline import make_pipeline

topic_pipeline = make_pipeline(vectorizer, model)

Or TopicPipeline from topicwizard:

from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model, norm_rows=False)

Let's load a corpus that we would like to analyze. In this example I will use 20newsgroups from sklearn:

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# sklearn returns the labels as integers, so we map them back to
# the actual textual labels.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]

Then let's fit our pipeline to this data:

topic_pipeline.fit(corpus)

The advantages of using a TopicPipeline over a regular pipeline are numerous:

Output dimensions (topics) are named
You can set the output to be a pandas dataframe (topic_pipeline.set_output(transform="pandas")) with topics as columns.
You can treat topic importances as pseudoprobability-distributions (topic_pipeline.norm_row = True)
You can freeze components so that the pipeline will stay frozen when fitting downstream components (topic_pipeline.freeze = True)
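To illustrate what treating topic importances as pseudoprobability distributions means, here is a minimal pure-Python sketch of row normalization. The numbers and the `normalize_rows` helper are hypothetical, and this is not necessarily how topicwizard implements it internally.

```python
# Hypothetical topic importances for two documents and three topics.
topic_importances = [
    [2.0, 1.0, 1.0],
    [0.0, 3.0, 1.0],
]


def normalize_rows(matrix):
    """Scale each row so its entries sum to 1 (all-zero rows are left as-is)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total > 0:
            normalized.append([value / total for value in row])
        else:
            normalized.append(list(row))
    return normalized


pseudo_probabilities = normalize_rows(topic_importances)
print(pseudo_probabilities)  # [[0.5, 0.25, 0.25], [0.0, 0.75, 0.25]]
```

After normalization every row sums to 1, so importances are comparable across documents of different lengths.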

Here's an example of how you can easily display a heatmap over topics in a document using TopicPipelines.

import plotly.express as px

pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
texts = [
   "Coronavirus killed 50000 people today.",
   "Donald Trump's presidential campaign is going very well",
   "Protests against police brutality have been going on all around the US.",
]
topic_df = pipeline.transform(texts)
topic_df.index = texts
px.imshow(topic_df).show()

You didn't even have to use topicwizard's own visualizations for this!!

You can also use TopicPipelines for downstream tasks, such as unsupervised text labeling with the help of human-learn.

pip install human-learn

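import topicwizard
from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline
from topicwizard.pipeline import make_topic_pipeline

# Fit on the full corpus (the three example texts are too few for min_df=5)
# and return topic importances as a pandas dataframe, so that the rule
# below can select a topic by its column name.
topic_pipeline = make_topic_pipeline(vectorizer, model).set_output(transform="pandas")
topic_pipeline.fit(corpus)

# Investigate topics
topicwizard.visualize(corpus, pipeline=topic_pipeline)

# Creating rule for classifying something as a corona document.
# The column name is illustrative: use a topic name from your own fitted pipeline.
def corona_rule(df, threshold=0.5):
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)

# Freezing topic pipeline so that it is not refitted with the classifier
topic_pipeline.freeze = True
classifier = FunctionClassifier(corona_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier)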
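To see what the rule does in isolation, it can be applied to a hand-made dataframe. This is a sketch only: the column names and values below are made up and merely mimic the shape of a topic pipeline's pandas output.

```python
import pandas as pd

# Hypothetical topic importances, shaped like the dataframe that a topic
# pipeline with pandas output would pass to the rule. Column names are made up.
topic_df = pd.DataFrame(
    {
        "11_vaccine_pandemic_virus_coronavirus": [0.9, 0.1, 0.5],
        "3_election_vote_campaign_president": [0.1, 0.9, 0.5],
    }
)


def corona_rule(df, threshold=0.5):
    is_about_corona = df["11_vaccine_pandemic_virus_coronavirus"] > threshold
    return is_about_corona.astype(int)


labels = corona_rule(topic_df)
print(labels.tolist())  # [1, 0, 0]
```

Note that the comparison is strict: a document whose importance equals the threshold exactly is labeled 0.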

Web Application

You can launch the topicwizard web application to investigate your topic models interactively. The app is also easy to deploy in case you want to create a client-facing interface.

import topicwizard

topicwizard.visualize(corpus, pipeline=topic_pipeline)

From version 0.3.0 you can also disable pages you do not wish to display, which can save a lot of time:

# A large corpus takes a looong time to compute 2D projections for,
# so you can speed up preprocessing by disabling it altogether.
topicwizard.visualize(corpus, pipeline=topic_pipeline, exclude_pages=["documents"])

The app is organized into four pages: Topics, Words, Documents and Groups.

Figures

If you want customizable, faster, HTML-saveable interactive plots, you can use the figures API. Here are a couple of examples:

from topicwizard.figures import word_map, document_topic_timeline, topic_wordclouds, word_association_barchart

| Figure | Call |
| --- | --- |
| Word Map | `word_map(corpus, pipeline=topic_pipeline)` |
| Timeline of Topics in a Document | `document_topic_timeline("Joe Biden takes over presidential office from Donald Trump.", pipeline=topic_pipeline)` |
| Wordclouds of Topics | `topic_wordclouds(corpus, pipeline=topic_pipeline)` |
| Topic for Word Importance | `word_association_barchart(["supreme", "court"], corpus=corpus, pipeline=topic_pipeline)` |

For more information, consult our documentation.
