bmabey / pyldavis

Python library for interactive topic model visualization. Port of the R LDAvis package.

License: BSD 3-Clause "New" or "Revised" License


pyldavis's Introduction

pyLDAvis

Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley.

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing.

Note: LDA stands for latent Dirichlet allocation.

Installation

Stable version using pip:

pip install pyldavis

Development version on GitHub

Clone the repository and run python setup.py install

Usage

The best way to learn how to use pyLDAvis is to see it in action. Check out this notebook for an overview. Refer to the documentation for details.

For a concise explanation of the visualization see this vignette from the LDAvis R package.

Video demos

Ben Mabey walked through the visualization in this short talk using a Hacker News corpus:

Carson Sievert created a video demoing the R package. The visualization is the same and so it applies equally to pyLDAvis:

Visualizing & Exploring the Twenty Newsgroup Data


pyldavis's Issues

AssertionError on running pyLDAVis on Windows and Anaconda distribution

System Windows 10
Python Distribution Anaconda
Installed latest pyLDAVis
I am running the following code and getting an error:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import gensim
import pyLDAvis
pyLDAvis.enable_notebook()

dictionary = gensim.corpora.Dictionary.load('ID2Word.dict')
corpus = gensim.corpora.MmCorpus('BOWCORPUS.mm')
lda = gensim.models.ldamodel.LdaModel.load('LDAModel_BOW.model')

import pyLDAvis.gensim

pyLDAvis.gensim.prepare(lda, corpus, dictionary)

On running the last line of my code the execution never finishes and I keep getting continuous error messages like the following:

  File "E:\Anaconda\lib\multiprocessing\forking.py", line 488, in prepare
    assert main_name not in sys.modules, main_name
AssertionError: __main__
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "E:\Anaconda\lib\multiprocessing\forking.py", line 380, in main
    prepare(preparation_data)
  File "E:\Anaconda\lib\multiprocessing\forking.py", line 488, in prepare
    assert main_name not in sys.modules, main_name
AssertionError: __main__

Visualization does not show even though there are no errors

TypeError: (-0.0025023526479494543+0j) is not JSON serializable, with sklearn & tfidf dtm

First of all thanks to the creator and all the contributors of this amazing module.

Today I encountered this issue. I was following the example sklearn notebook and successfully got the visualization for an LDA model with a tf (CountVectorizer) dtm.

But when I tried to use the TfidfVectorizer, I got this error. Please find my code snippet and the stack trace below.

pyLDAvis.sklearn.prepare(lda_tfidf, tfidf, tfidf_vectorizer, R=10,sort_topics=False)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
    337                 pass
    338             else:
--> 339                 return printer(obj)
    340             # Finally look for special method names
    341             method = _safe_get_formatter_method(obj, self.print_method)

C:\Anaconda3\lib\site-packages\pyLDAvis\_display.py in <lambda>(data, kwds)
    311     formatter = ip.display_formatter.formatters['text/html']
    312     formatter.for_type(PreparedData,
--> 313                        lambda data, kwds=kwargs: prepared_data_to_html(data, **kwds))
    314 
    315 

C:\Anaconda3\lib\site-packages\pyLDAvis\_display.py in prepared_data_to_html(data, d3_url, ldavis_url, ldavis_css_url, template_type, visid, use_http)
    176                            d3_url=d3_url,
    177                            ldavis_url=ldavis_url,
--> 178                            vis_json=data.to_json(),
    179                            ldavis_css_url=ldavis_css_url)
    180 

C:\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in to_json(self)
    414 
    415     def to_json(self):
--> 416        return json.dumps(self.to_dict(), cls=NumPyEncoder)

C:\Anaconda3\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    235         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    236         separators=separators, default=default, sort_keys=sort_keys,
--> 237         **kw).encode(obj)
    238 
    239 

C:\Anaconda3\lib\json\encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

C:\Anaconda3\lib\json\encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

C:\Anaconda3\lib\site-packages\pyLDAvis\utils.py in default(self, obj)
    144         if isinstance(obj, np.float64) or isinstance(obj, np.float32):
    145             return float(obj)
--> 146         return json.JSONEncoder.default(self, obj)

C:\Anaconda3\lib\json\encoder.py in default(self, o)
    178 
    179         """
--> 180         raise TypeError(repr(o) + " is not JSON serializable")
    181 
    182     def encode(self, o):

TypeError: (-0.0025023526479494543+0j) is not JSON serializable

Any help to resolve this would be much appreciated.

I am also trying to find a resolution for this issue, and if I resolve it on my own I will let you know.
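The traceback above shows the encoder's `default` method falling through to `json.JSONEncoder.default` for a complex value. One possible workaround is sketched below, assuming the imaginary part is numerical noise that can be dropped. `ComplexTolerantEncoder` and its `imag_tol` parameter are hypothetical names, not part of pyLDAvis, and the sample uses Python's built-in `complex` type rather than a NumPy scalar.

```python
import json


class ComplexTolerantEncoder(json.JSONEncoder):
    """Hypothetical encoder: write complex numbers whose imaginary part is
    negligible as plain floats, and reject all other complex values."""

    def __init__(self, *args, imag_tol=1e-12, **kwargs):
        super().__init__(*args, **kwargs)
        self.imag_tol = imag_tol

    def default(self, obj):
        if isinstance(obj, complex):
            if abs(obj.imag) <= self.imag_tol:
                return obj.real
            raise TypeError("complex value with non-zero imaginary part: %r" % (obj,))
        return super().default(obj)


# Toy payload reusing the value from the error message.
coords = {"x": [complex(-0.0025023526479494543, 0.0), 0.5]}
encoded = json.dumps(coords, cls=ComplexTolerantEncoder)
```

Dropping the imaginary part only hides the symptom. If the complex values come from an eigendecomposition, it may be worth checking why the input produced them in the first place.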

port over new bar calculation logic from R LDAvis

cpsievert/LDAvis#41

TypeError: cannot sort an Index object in-place, use sort_values instead : For Gensim LDA Model

For Gensim LDA model with 150K vocab size, the following error is thrown when I do the following:

model_filename = "150k_LdaModel_topics_"+ topics +"_passes_"+passes +".model"

dictionary = gensim.corpora.Dictionary.load('LDADictSpecialRemoved150k.dict')
corpus = gensim.corpora.MmCorpus('LDACorpusSpecialRemoved150k.mm')
ldamodel = gensim.models.ldamodel.LdaModel.load(model_filename)

import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.save_html(vis, "topic_viz_"+topics+"_passes_"+passes+".html")

Traceback (most recent call last):
  File "create_vis.py", line 36, in <module>
    vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
  File "/local/lib/python2.7/site-packages/pyLDAvis/gensim.py", line 110, in prepare
    return vis_prepare(**opts)
  File "/local/lib/python2.7/site-packages/pyLDAvis/_prepare.py", line 398, in prepare
    token_table        = _token_table(topic_info, term_topic_freq, vocab, term_frequency)
  File "/local/lib/python2.7/site-packages/pyLDAvis/_prepare.py", line 267, in _token_table
    term_ix.sort()
  File "/local/lib/python2.7/site-packages/pandas/indexes/base.py", line 1703, in sort
    raise TypeError("cannot sort an Index object in-place, use "
TypeError: cannot sort an Index object in-place, use sort_values instead

Numerical stability and summing to 1

I'm not sure if this is much of an issue, but I had trouble with one or two models where rows of the topic distribution didn't sum to one. This causes pyldavis to fail in _input_check here,

https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/_prepare.py#L46-L50

I think the models were correct and gave appropriately accurate probability distributions, but maybe some numerical stability issues meant that __num_dist_rows__ didn't quite add up all the rows correctly. If all the distributions are normalized, is this input check required?
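As a rough illustration of the trade-off raised here, the sketch below checks row sums against an explicit tolerance and renormalizes rows that drift. The function names and the `1e-6` tolerance are assumptions chosen for the example; this is not the check that pyLDAvis performs.

```python
import math


def rows_sum_to_one(rows, abs_tol=1e-6):
    """Return True if every row sums to 1 within abs_tol."""
    return all(math.isclose(math.fsum(row), 1.0, abs_tol=abs_tol) for row in rows)


def renormalize(rows):
    """Divide each row by its own sum so it becomes a proper distribution."""
    normalized = []
    for row in rows:
        total = math.fsum(row)
        if total <= 0:
            raise ValueError("row has no probability mass")
        normalized.append([value / total for value in row])
    return normalized


# Toy document-topic rows; the first one is slightly short of 1.
doc_topic = [[0.2, 0.3, 0.4999], [0.1, 0.1, 0.8]]
fixed = renormalize(doc_topic)
```

Renormalizing before calling prepare sidesteps the check, while keeping the check catches inputs that are not distributions at all.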

Python Visualization Fails to scroll - the nagging continues

Mr. Mabey

Many thanks for your last answer - I was able (it took me less than a year) to open a notebook and run the code. The code ran and the graphic appeared, but I can only see the top part of it, and toggling scrolling of the output does not help. I realize the issue here is not so much the code as my ignorance, but any help or advice is greatly appreciated.

Thanks
Vadim

Error installing pyLDAvis

When I use pip to install this package, it shows an error with one of the dependencies (scikit-bio).

After a long search, it turns out that scikit-bio is not compatible with Windows yet. Are you aware of this problem? Is there another way to install pyLDAvis, since pip is not working?

deprecation warnings

Hi bmabey, if I import gensim and import pyLDAvis.gensim at the top of my file, and then later on I call gensim.models.ldamodel.LdaModel(), I get deprecation warnings. But if I import gensim, then call LdaModel(), then import pyLDAvis, I do not get any deprecation warnings. It's not a big deal but I just thought you'd like to know. I love pyLDAvis by the way, I was doing my preprocessing in Python and then using R for the lda and visualization before finding pyLDAvis.

Choosing number of Terms('R') in pyLDAvis.gensim.prepare()

We can't change the number of top terms from 30 to any other number.

allow for visualizing multiple models via a dropdown

Often you are comparing a number of models, and it is annoying to have to scroll through the different ones. So I propose a dropdown of user-provided names for the different models; selecting one will update the vis. The data structure sent to the client will be a map of model name -> vis data (as is currently being sent).

show intersection of topics

This idea comes from @dbickson:

"in case there are two clusters which have a joined intersection. Can you press on the intersection and then list all the keywords which appear in both clusters? "

This is probably best done on the client so we don't increase the data required to be sent to the browser. Finding the intersection of two topics and showing the top words would be easy, but you may want to use the raw word counts in each topic so they are ranked according to proportion. The trickier part may be deciding how to select multiple topics UI-wise.
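The ranking idea can be sketched as follows. This is only an illustration of the logic in Python (the issue suggests doing it in the browser), and the scoring rule, the smaller of a term's two within-topic proportions, is an assumption rather than anything the issue prescribes. All names and counts are made up.

```python
def shared_terms(counts_a, counts_b, top_n=10):
    """Rank terms present in both topics by their smaller share of either topic."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    common = set(counts_a) & set(counts_b)
    scored = [
        (term, min(counts_a[term] / total_a, counts_b[term] / total_b))
        for term in common
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Toy raw word counts per topic.
topic_1 = {"graph": 30, "trees": 20, "user": 5}
topic_2 = {"user": 40, "system": 35, "graph": 5}
ranked = shared_terms(topic_1, topic_2)
```

Using the minimum proportion favors terms that matter to both topics rather than terms that dominate only one of them.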

Add static language resources

Add static language resources to display static titles in different languages.
Add current language detection for session and choose appropriate language bundle to display

pyLDAvis.gensim DeprecationWarning: inspect.getargspec()

I'm getting a deprecation warning on importing pyLDAvis.gensim (but not pyLDAvis), in python 3.5.2, pyLDAvis 2.0.0, funcy 1.7.1. (Besides this, the package is working fine. Thanks!)

(...)/python3.5/site-packages/funcy/decorators.py:56: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
spec = inspect.getargspec(func)
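For reference, the replacement the warning points to looks roughly like this. The helper below is a hypothetical illustration of `inspect.signature`, not the code in funcy, and `example` is a made-up function.

```python
import inspect


def positional_parameter_names(func):
    """List positional parameter names using inspect.signature."""
    kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    return [
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind in kinds
    ]


def example(topic_model, corpus, dictionary, **kwargs):
    return None


names = positional_parameter_names(example)
```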

gensim.py in python 2.7 needs to have from __future__ import absolute_import

Hello,

Please check out this SO question

It appears that you need to have from __future__ import absolute_import in gensim.py to avoid the module from trying to load itself.

On a side note, I noticed that the _extract_data function in gensim.py is indented with 3 spaces, not 4.

sklearn support

Hi, first of all thanks a lot for the Python port of this R package :). Is there any chance you could add model support for sklearn? They just released LDA in their kit:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation

Very small visualization

pyLDAvis.display(movies_vis_data) generates the LDA visualization, but it shows only the top part of the visualization.

add svd-tsne as built in option for dim. reduction

I'm thinking about doing it like so:

conditionally define an svd-tsne option if the needed deps are present (e.g. scikit-learn)
allow the dim. reduction method to be passed in as a function, as it currently is, or as a string to look up the built-in options
allow for multiple reductions to be done and then allow the frontend to select which one to use. We can use object permanence as we readjust the distance map in the vis.
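The first two points could be sketched as a small registry. Everything here is hypothetical: `BUILTIN_REDUCTIONS`, `resolve_reduction`, and both reducer functions are names invented for the example, and `svd_tsne` is a stub rather than a working reduction.

```python
import importlib.util


def identity_projection(distances):
    """Placeholder reducer; a real one would return 2-D coordinates."""
    return distances


def svd_tsne(distances):
    """Stub only: a real version would chain an SVD step and t-SNE."""
    raise NotImplementedError("illustrative stub")


# Hypothetical registry of built-in options, keyed by name.
BUILTIN_REDUCTIONS = {"identity": identity_projection}

# Register the optional method only when its dependency can be imported.
if importlib.util.find_spec("sklearn") is not None:
    BUILTIN_REDUCTIONS["svd-tsne"] = svd_tsne


def resolve_reduction(method):
    """Accept either a callable or the name of a registered built-in."""
    if callable(method):
        return method
    try:
        return BUILTIN_REDUCTIONS[method]
    except KeyError:
        raise ValueError(
            "unknown reduction %r; choose from %s" % (method, sorted(BUILTIN_REDUCTIONS))
        )
```

Resolving the string at call time keeps the existing function-based interface working unchanged.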

Microsoft Visual C++ 10.0 is required

customize MSVCCompiler
customize MSVCCompiler using build_ext
No module named 'numpy.distutils.msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
Missing compiler_cxx fix for MSVCCompiler
customize MSVCCompiler using build_ext
building 'numexpr.interpreter' extension
compiling C sources
Warning: Assuming default configuration (numexpr\tests/{setup_tests,setup}.py was not found)
error: Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat).

autocomplete box for terms

When working with a model with many topics built on a corpus you are familiar with you often want to know what topics a particular term is contributing to. Having an autocomplete would allow people an alternative way of selecting a term by typing in the desired term.

allow for setting R in the browser

The backend sends R top terms. On the frontend it would sometimes be nice to reduce that number via a dropdown.

Gensim breaking change to show_topics in 0.12.3

FYI : Gensim 0.12.3 show_topics changes:

All models with the show_topics method should return a list of
(topic_number, topic) tuples, where topic is a list of
(word, probability) tuples.

via changelog (https://github.com/piskvorky/gensim/blob/develop/CHANGELOG.txt)
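A caller that has to cope with both layouts might normalize them first. The sketch below assumes the older layout was a list of topics, each a list of (probability, word) tuples, which is inferred from the `dict((y,x) for x, y in tuples)` line in tracebacks elsewhere on this page. The newer layout follows the changelog text quoted above. `topics_to_word_probs` is a hypothetical helper.

```python
def topics_to_word_probs(topics):
    """Convert show_topics(formatted=False) output to a list of {word: probability}.

    Handles two layouts (the first one is an assumption, see the lead-in):
      old: [[(probability, word), ...], ...]
      new: [(topic_number, [(word, probability), ...]), ...]
    """
    result = []
    for entry in topics:
        # In the new layout the first element is a topic number, not a tuple.
        is_new = len(entry) == 2 and not isinstance(entry[0], (tuple, list))
        if is_new:
            _, pairs = entry
            result.append({word: prob for word, prob in pairs})
        else:
            result.append({word: prob for prob, word in entry})
    return result


old_style = [[(0.6, "graph"), (0.4, "trees")]]
new_style = [(0, [("graph", 0.6), ("trees", 0.4)])]
```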

Unable to install pyLDAvis due to Numpy error

I am trying to install a development version of pyLDAvis. I cloned the github copy and I am trying to install it using the following command: "sudo python setup.py develop".

I get the following error and it fails to install: 'dict' object has no attribute 'NUMPY_SETUP'

Is this a bug or am I using the wrong steps?

Visualizing topics without corpus

I train a gensim LDA model with a very large corpus (almost 100M documents), and I train it in the streaming fashion. Therefore, I cannot reproduce the corpus that is needed for the prepare method of pyLDAvis.

But the corpus is needed in this method. I tried to fake a small corpus with a sample of the training data, but it does not work and gives me the following exception:

anaconda/lib/python2.7/site-packages/pyLDAvis/gensim.pyc in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
     26    beta = 0.01
     27    fnames_argsort = np.asarray(list(dictionary.token2id.values()), dtype=np.int_)
---> 28    term_freqs = corpus_csc.sum(axis=1).A.ravel()[fnames_argsort]
     29    term_freqs[term_freqs == 0] = beta
     30    doc_lengths = corpus_csc.sum(axis=0).A.ravel()

IndexError: index 66999 is out of bounds for axis 1 with size 66988

Are there any methods to work around this problem?
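The IndexError suggests that the sampled corpus contains fewer distinct term ids than the dictionary, so a matrix sized from the corpus alone is too short to index with every dictionary id. The plain-Python sketch below shows the idea of sizing the frequency vector from the dictionary instead. `term_frequencies` is a hypothetical helper, not pyLDAvis code. In gensim itself, `corpus2csc` accepts a `num_terms` argument that can pin the matrix height to `len(dictionary)`; whether that alone resolves this report is an assumption.

```python
def term_frequencies(corpus, num_terms, floor=0.01):
    """Sum counts per term id over a bag-of-words corpus.

    The result has one slot per dictionary id, so ids that never occur
    in the (sampled) corpus are still addressable.
    """
    freqs = [0.0] * num_terms
    for doc in corpus:
        for term_id, count in doc:
            freqs[term_id] += count
    # Mirrors the `term_freqs[term_freqs == 0] = beta` step in the traceback.
    return [f if f > 0 else floor for f in freqs]


# Toy sample: the dictionary knows 4 ids, the sample only uses ids 0 and 1.
sample_corpus = [[(0, 2), (1, 1)], [(1, 3)]]
freqs = term_frequencies(sample_corpus, num_terms=4)
```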

TypeError: object of type 'map' has no len()

I can't seem to get pyLDAvis to work on even a toy collection. I build a model with gensim as described in the gensim docs. Then, when I try to run

vis_data = gensimvis.prepare(lda, corpus, dictionary)

I get TypeError: object of type 'map' has no len(). I've tried this with other document collections, getting the same problem.

Minimal code to duplicate is below. I'm using Python 3.4.3 on a Mac (via Anaconda), gensim 0.10.3, and pyLDAvis 1.1.0. Hope I'm just doing something obviously dumb. Thanks for any help you can provide.

import gensim
from gensim import corpora, models, similarities
import pyLDAvis.gensim as gensimvis
import pyLDAvis
import sys

print(sys.version)
print(pyLDAvis.__version__)
print(gensim.__version__)

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

vis_data = gensimvis.prepare(lda, corpus, dictionary)
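The error message is typical of code written for Python 2, where `map` returned a list. On Python 3 it returns a lazy iterator with no length. The toy snippet below reproduces the behaviour and the usual fix, and it is not taken from pyLDAvis.

```python
doc_lengths = map(len, [["human", "interface"], ["graph", "minors", "survey"]])

try:
    len(doc_lengths)  # fails on Python 3: map objects are lazy iterators
    had_len = True
except TypeError:
    had_len = False

# Materialize once; after that len() and indexing both work.
doc_lengths = list(doc_lengths)
```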

fix numpy deprecation warning messages

Error in gensim prepare when corpus is passed as a matrix

A test is present in the _extract_data function in gensim.py, presumably so that the corpus can be passed as a sparse matrix:

   if not gensim.matutils.ismatrix(corpus):
      corpus_csc = gensim.matutils.corpus2csc(corpus)
   else:
      corpus_csc = corpus

Later, however, the length of the corpus is tested:

   assert doc_lengths.shape[0] == len(corpus), 'Document lengths and corpus have different sizes {} != {}'.format(doc_lengths.shape[0], len(corpus))

When corpus is a sparse matrix, len(corpus) will raise an exception because the length of a sparse matrix is ambiguous.
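One way to make the size check work for both input types is to read the document count from `shape` when it exists. The sketch assumes a terms x documents layout, which matches the `corpus_csc.sum(axis=0)` call used for document lengths in tracebacks on this page. `count_documents` and `FakeSparseMatrix` are hypothetical, and the latter only stands in for a real sparse matrix.

```python
def count_documents(corpus):
    """Number of documents for either a sparse matrix or an iterable corpus."""
    shape = getattr(corpus, "shape", None)
    if shape is not None:
        return shape[1]  # assumed layout: terms x documents
    return len(corpus)


class FakeSparseMatrix:
    """Stand-in for a sparse matrix: has .shape, and len() raises TypeError."""

    shape = (707, 12)

    def __len__(self):
        raise TypeError("sparse matrix length is ambiguous; use shape")
```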

literal error

I just noticed a small typo, "Marginal topic distribtion", in js/ldavis.js line 352.

name 'basestring' is not defined error in python3

pyLDAvis/_display.py", line 354, in save_html

if isinstance(fileobj, basestring):
    fileobj = open(fileobj, 'w')

basestring does not exist in Python 3.
This code works in both:

try:
    if isinstance(fileobj, basestring):
        fileobj = open(fileobj, 'w')
except NameError:
    if isinstance(fileobj, str):
        fileobj = open(fileobj, 'w')
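A common variant of the same idea resolves the string type once, so the `isinstance` check is not duplicated in two branches. `string_types` and `is_path` are names chosen for this sketch and are not taken from pyLDAvis.

```python
import io

try:
    string_types = (basestring,)  # Python 2
except NameError:
    string_types = (str,)  # Python 3


def is_path(fileobj):
    """True when the argument is a filename rather than an open file object."""
    return isinstance(fileobj, string_types)


checks = (is_path("topics.html"), is_path(io.StringIO()))
```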

KeyError in gensim.prepare

Hi there, I'm using gensim to do LDA on a collection of novels (using just 40 for testing, I have several hundreds). Building the corpus and dictionary seems to work fine, as does the modeling process itself. I can also inspect the resulting model (topics in documents and words in topics, for example). However, when attempting to use pyLDAvis, I run into a KeyError.

I'm on Linux (Ubuntu 14.04) and using Python 3.4 and the following versions of relevant modules:
pyLDAvis 1.2.0
numpy 1.9.2
gensim 0.11.1-1

This is my code (loading corpus, dictionary and model from previous step):

def gensim_output(modelfile, corpusfile, dictionaryfile): 
    """Displaying gensim topic models"""
    ## Load files from "gensim_modeling"
    corpus = corpora.MmCorpus(corpusfile)
    dictionary = corpora.Dictionary.load(dictionaryfile) # for pyLDAvis
    myldamodel = models.ldamodel.LdaModel.load(modelfile)    

    ## Interactive visualisation
    import pyLDAvis.gensim
    vis = pyLDAvis.gensim.prepare(myldamodel, corpus, dictionary)
    pyLDAvis.display(vis)

This is the output I get:

Traceback (most recent call last):

  File "<ipython-input-79-940daa51d8a9>", line 1, in <module>
    runfile('/home/[PATH]/an5/mygensim.py', wdir='/home/christof/Dropbox/0-Analysen/2015/rp_Sydney/an5')

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 586, in runfile
    execfile(filename, namespace)

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 48, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "/home/[PATH]/an5/mygensim.py", line 84, in <module>
    main("./5_lemmata/*.txt", "gensim_corpus.dict", "gensim_corpus.mm", "gensim_modelfile.gensim")

  File "/home/[PATH]/an5/mygensim.py", line 82, in main
    gensim_output(modelfile, corpusfile, dictionaryfile)

  File "/home/[PATH]/an5/mygensim.py", line 75, in gensim_output
    vis = pyLDAvis.gensim.prepare(myldamodel, corpus, dictionary)

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 61, in prepare
    return vis_prepare(**_extract_data(topic_model, corpus, dictionary))

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 24, in _extract_data
    term_freqs = [term_freqs_dict[id] for id in xrange(N)]

  File "/usr/local/lib/python3.4/dist-packages/pyLDAvis/gensim.py", line 24, in <listcomp>
    term_freqs = [term_freqs_dict[id] for id in xrange(N)]

KeyError: 6

Not sure whether this is a bug or bad usage of the module. Any help would be very much appreciated.
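The failing line indexes `term_freqs_dict` with every id from 0 to N-1, so any id missing from the dictionary raises KeyError. A tolerant lookup is sketched below with toy data, assuming a missing id simply means the term never occurs in the corpus.

```python
# Toy data: ids 2, 3, 4 and 6 never occur in the corpus.
term_freqs_dict = {0: 12, 1: 7, 5: 3}
N = 7

# dict.get supplies 0 for absent ids instead of raising KeyError.
term_freqs = [term_freqs_dict.get(term_id, 0) for term_id in range(N)]
```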

pyLDAvis index 1098 is out of bounds for axis 1 with size 707

I have the following exception:

IndexErrorTraceback (most recent call last)
<ipython-input-17-209fc1d6a743> in <module>()
----> 1 data1 =  pyLDAvis.gensim.prepare(lda_model_1, corpus1, dictionary)
      2 pyLDAvis.display(data1)

/usr/local/lib/python2.7/dist-packages/pyLDAvis/gensim.pyc in prepare(topic_model, corpus,   dictionary, doc_topic_dist, **kwargs)
 95     See `pyLDAvis.prepare` for **kwargs.
 96     """
---> 97     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
 98     return vis_prepare(**opts)

/usr/local/lib/python2.7/dist-packages/pyLDAvis/gensim.pyc in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
 26    beta = 0.01
 27    fnames_argsort = np.asarray(list(dictionary.token2id.values()), dtype=np.int_)
---> 28    term_freqs = corpus_csc.sum(axis=1).A.ravel()[fnames_argsort]
 29    term_freqs[term_freqs == 0] = beta
 30    doc_lengths = corpus_csc.sum(axis=0).A.ravel()

IndexError: index 1098 is out of bounds for axis 1 with size 707

I asked the gensim Google group and they told me that my code seemed OK, so I should ask here too.

I want to build two LDA models with gensim that share a dictionary. So I have two corpora and one dictionary for both. You can see my code here (and you can reproduce the error):
https://gist.github.com/HaritzPuerto/993c8d07ede22e1649265c7f55220cf9

That exception only occurs when I apply ldavis to the first model. ldavis works well with the second model.
Please, find attached the data I've used.
Data.zip

Data about my pc:
Vagrant virtual machine with ubuntu 14.04, python 2.7 and pyldavis 1.5.1

Do I have a problem with my code or is it a bug? Someone in the gensim Google group told me that a shared dictionary may conflict with pyLDAvis.

Thank you

IPython is not visualizing the gensim model.

First and foremost, I wanted to thank everyone for helping me get this far. I am able to generate a gensim model, run it in an IPython notebook, and see some results - but not the beautiful graphic we were all hoping for. I'm running WinPython 3.4 QT5 (the latest, I believe) and I installed both gensim and pyLDAvis today, so everything is fresh. Here is what my output looks like:

In[9]: pyLDAvis.enable_notebook()
In[10]: pyLDAvis.gensim.prepare(lda, corpus, dictionary)
C:\WinPython\python-3.4.3.amd64\lib\site-packages\skbio\stats\ordination\_principal_coordinate_analysis.py:109: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.009952346420900118 and the largest is 0.034359155356682575.
RuntimeWarning
Out[10]:
PreparedData(topic_coordinates= Freq cluster topics x y
topic
24 20.249823 1 1 0.055595 0.006318
3 18.859849 1 2 0.003016 0.038028
17 16.519686 1 3 -0.117297 0.009020
18 9.578099 1 4 -0.014738 0.006581
....
13 0.000586 1 24 0.003268 0.007722
9 0.000586 1 25 -0.008638 -0.009829, topic_info= Category Freq Term Total loglift logprob
2300 Default 1280.000000 gladia 1280 30.0000 30.0000
5920 Default 984.000000 giskard 984 29.0000 29.0000
1512 Default 676.000000 amadiro 676 28.0000 28.0000
...
2252 Topic25 0.000562 anacreon 117 -0.3992 -6.2565
9440 Topic25 0.000626 madam 372 -1.5745 -6.2751

[1929 rows x 6 columns], token_table= Topic Freq Term
term
3268 1 0.181818 abilities
3268 2 0.318182 abilities
...
1155 10 0.019608 york

[1984 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'ylab': 'PC2', 'xlab': 'PC1'}, topic_order=[25, 4, 18, 19, 21, 11, 23, 17, 22, 8, 12, 24, 15, 13, 2, 7, 20, 9, 16, 3, 6, 5, 1, 14, 10])

Remove dependencies on scikit-bio

I propose to remove the dependencies on scikit-bio.

scikit-bio has recently undergone incompatible changes to its API, especially with regard to the pcoa() function. In addition, related to issue 57, it is still incompatible with Windows machines.

After going through the code, I see that only the pcoa() and DistanceMatrix() functions are used from the scikit-bio package. These can be reimplemented with functions from scikit-learn only. Given the maturity of the sklearn package, that should be a good idea.

I can try to implement these portions if necessary.

add tests for gensim prepare!

A simple smoke test would be enough! Just test to see if it runs in both python 2 and 3.

Source from csv files

Hi,
is there an API to load from csv files, as opposed to json?

support for gensim hdp

Is it possible to also support hdp models from gensim? As far as I understand it, the underlying voodoo as well as the output are very similar, but you don't have to choose the number of topics for hdp:
https://radimrehurek.com/gensim/models/hdpmodel.html

line 48 of gensim module has incorrect logic

I just got a comical "25!=25" error because of this line:

assert doc_topic_dists.shape[1] == num_topics, 'Document topics and number of topics do not match {} != {}'.format(doc_topic_dists.shape[0], num_topics)
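The quoted assertion compares `shape[1]` but formats `shape[0]` into the message, which is how "25 != 25" can appear. A sketch of a consistent version follows. `check_topic_count` is a hypothetical wrapper that takes the shape tuple directly, so the example runs without NumPy.

```python
def check_topic_count(doc_topic_shape, num_topics):
    """Compare the topic axis and report that same axis in the message."""
    assert doc_topic_shape[1] == num_topics, (
        'Document topics and number of topics do not match {} != {}'
        .format(doc_topic_shape[1], num_topics))


# 25 documents x 30 topic columns against a 25-topic model: the original
# message would read "25 != 25", while this version reports "30 != 25".
try:
    check_topic_count((25, 30), 25)
    message = None
except AssertionError as error:
    message = str(error)
```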

Error preparing gensim model

Using python 3.4, trying to prepare a gensim-generated LDA model gives the following stack trace:

TypeError Traceback (most recent call last)
in ()
1 import pyLDAvis.gensim
2
----> 3 pyLDAvis.gensim.prepare(lda, corpus, dictionary)

/usr/local/lib/python3.4/site-packages/pyLDAvis/gensim.py in prepare(topic_model, corpus, dictionary, **kargs)
64 http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb
65 """
---> 66 opts = fp.merge(_extract_data(topic_model, corpus, dictionary), kargs)
67 return vis_prepare(**opts)

/usr/local/lib/python3.4/site-packages/pyLDAvis/gensim.py in _extract_data(topic_model, corpus, dictionary)
30
31 topics = topic_model.show_topics(formatted=False, num_words=len(vocab), num_topics=topic_model.num_topics)
---> 32 topics_df = pd.DataFrame([dict((y,x) for x, y in tuples) for tuples in topics])[vocab]
33 topic_term_dists = topics_df.values
34

/usr/local/lib/python3.4/site-packages/pandas/core/frame.py in getitem(self, key)
1912 return self._getitem_multilevel(key)
1913 else:
-> 1914 return self._getitem_column(key)
1915
1916 def _getitem_column(self, key):

/usr/local/lib/python3.4/site-packages/pandas/core/frame.py in _getitem_column(self, key)
1919 # get column
1920 if self.columns.is_unique:
-> 1921 return self._get_item_cache(key)
1922
1923 # duplicate columns & possible reduce dimensionaility

/usr/local/lib/python3.4/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
1086 """ return the cached item, item represents a label indexer """
1087 cache = self._item_cache
-> 1088 res = cache.get(item)
1089 if res is None:
1090 values = self._data.get(item)

TypeError: unhashable type: 'dict_keys'

Topic number in pyLDAvis different from that in gensim.ldamodel

For instance, the topic 19 in pyLDAvis refers to topic 29 in the original lda model.

Is there a way to keep the topic numbering order the same as lda model?
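One way to translate between the two numberings is the `topic_order` field of the prepared data. The sketch assumes `topic_order` lists the 1-based original topic ids in display order, which matches the sample PreparedData output in the WinPython report on this page (original topic 24 is displayed as topic 1, and `topic_order` starts with 25). `original_topic_index` is a hypothetical helper.

```python
# First entries of the topic_order printed in the WinPython report on this page.
topic_order = [25, 4, 18, 19, 21]


def original_topic_index(displayed_topic, topic_order):
    """Map a 1-based displayed topic number to a 0-based model topic id."""
    return topic_order[displayed_topic - 1] - 1


mapping = [original_topic_index(k, topic_order) for k in (1, 2, 3)]
```

The `sort_topics=False` argument that appears in the sklearn report on this page may also be relevant, assuming the installed version's prepare function accepts it.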

Python Visualization Fails to start

Mr. Mabey

My original problem has come back after I used the "gensim_speed" branch variation of the bug fix for 1.3.1. The original problem #33 that you closed last week appears to be deeper than originally thought. I am running Win 10, Python 3.4.3.6, and pyLDAvis 1.3.1 with the gensim_speed correction. I used ipython -i at the command prompt to enter interactive mode, loaded the model, ran prepare, and got the same printout as in #33. PFA the screenshots with the problem. I suspect I'm doing something wrong - but I have no idea what it can possibly be, since I'm only running the simplest commands.

Thanks
Vadim

Visualizing Dynamic Topic Models

Are there any ideas so far for visualizing Dynamic Topic Models? They are an interesting spin-off of LDA which is time-based and would be pretty cool to visualize. This report sort of plays around with visualizing them, but I'm pretty sure a much more comprehensive job can be done with the same.

Gensim Prepare

Preparing a gensim lda model does not work for me (Linux, Python 3.4) because of the following error:
```
pyLDAvis.gensim.prepare(lda, corpus, dictionary)

TypeError Traceback (most recent call last)
in ()
----> 1 pyLDAvis.gensim.prepare(lda, corpus, dictionary)

/home/methodds/anaconda3/lib/python3.4/site-packages/pyLDAvis/gensim.py in prepare(topic_model, corpus, dictionary, **kargs)
64 http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb
65 """
---> 66 opts = fp.merge(_extract_data(topic_model, corpus, dictionary), kargs)
67 return vis_prepare(**opts)

/home/methodds/anaconda3/lib/python3.4/site-packages/pyLDAvis/gensim.py in _extract_data(topic_model, corpus, dictionary)
30
31 topics = topic_model.show_topics(formatted=False, num_words=len(vocab), num_topics=topic_model.num_topics)
---> 32 topics_df = pd.DataFrame([dict((y,x) for x, y in tuples) for tuples in topics])[vocab]
33 topic_term_dists = topics_df.values
34

/home/methodds/anaconda3/lib/python3.4/site-packages/pyLDAvis/gensim.py in (.0)
30
31 topics = topic_model.show_topics(formatted=False, num_words=len(vocab), num_topics=topic_model.num_topics)
---> 32 topics_df = pd.DataFrame([dict((y,x) for x, y in tuples) for tuples in topics])[vocab]
33 topic_term_dists = topics_df.values
34

TypeError: 'int' object is not iterable
```
Any idea what is going on here?
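
One possible explanation, offered as an assumption rather than a confirmed diagnosis: line 32 of the traceback expects each entry of `show_topics(formatted=False)` to be a list of `(prob, term)` pairs. If the installed gensim instead returns `(topic_id, [(term, prob), ...])` pairs, the loop `for x, y in tuples` would try to unpack the integer topic id, which matches the error. The sketch below uses plain Python data with both assumed shapes; `topic_to_term_probs` is a hypothetical helper, not part of pyLDAvis or gensim.

```python
def topic_to_term_probs(topic):
    """Return {term: prob} for one topic entry.

    Handles two ASSUMED shapes:
      * a list of (prob, term) pairs, as line 32 of the traceback expects
      * a (topic_id, [(term, prob), ...]) tuple, the suspected newer shape
    """
    if isinstance(topic, tuple) and len(topic) == 2 and isinstance(topic[0], int):
        _, pairs = topic
        return {term: prob for term, prob in pairs}
    return {term: prob for prob, term in topic}


# Made-up sample data in each of the two shapes.
old_shape = [(0.5, "apple"), (0.25, "pear")]
new_shape = (0, [("apple", 0.5), ("pear", 0.25)])

print(topic_to_term_probs(old_shape))
print(topic_to_term_probs(new_shape))
```

Printing one entry of `show_topics(formatted=False)` from the installed gensim would show which shape, if either, applies.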

to_dict() got an unexpected keyword argument 'orient'

I'm getting the following error on Ubuntu:

to_dict() got an unexpected keyword argument 'orient'

Exception Type: TypeError
Exception Value:
to_dict() got an unexpected keyword argument 'orient'
Exception Location: /usr/local/lib/python2.7/dist-packages/pyLDAvis/_prepare.py in to_dict, line 303

Thanks for helping

Feature request: build for Conda

Ben — Could you consider releasing pyLDAvis on Anaconda.org?

Python 3.4 deprecation

When I use pyLDAvis with Python 3.4, I get the following warnings. The package works perfectly; this is just for future reference.

/usr/local/lib/python3.4/site-packages/pyLDAvis/_prepare.py:283: FutureWarning: order is deprecated, use sort_values(...)
topic_proportion = (topic_freq / topic_freq.sum()).order(ascending=False)
/usr/local/lib/python3.4/site-packages/pyLDAvis/_prepare.py:154: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
sort('saliency', ascending=False).
/usr/local/lib/python3.4/site-packages/pyLDAvis/_prepare.py:203: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
return token_table.sort(['Term', 'Topic'])
/usr/local/lib/python3.4/site-packages/pyLDAvis/_prepare.py:132: FutureWarning: order is deprecated, use sort_values(...)
return relevance.T.apply(lambda s: s.order(ascending=False).index).head(R)

Add colouring of topics bubbles by presence of terms from list provided

Color each topic bubble by the presence of terms from a provided list.
The intensity of the color would be proportional to the number of listed words found in the topic, multiplied by each term's weight in the topic.
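
A rough sketch of one way to read this request: sum the in-topic weights of the listed terms that occur in a topic, then scale by the largest score so the result can drive a color intensity between 0 and 1. The function names, the scaling step, and the sample weights are all assumptions for illustration; nothing here exists in pyLDAvis.

```python
def topic_intensity(term_weights, highlight_terms):
    """Sum the weights of the highlighted terms present in one topic."""
    return sum(w for term, w in term_weights.items() if term in highlight_terms)


def normalized_intensities(topics, highlight_terms):
    """Scale each topic's score to [0, 1] relative to the strongest topic."""
    wanted = set(highlight_terms)
    raw = [topic_intensity(t, wanted) for t in topics]
    peak = max(raw, default=0.0)
    return [r / peak if peak > 0 else 0.0 for r in raw]


# Made-up term weights for two topics.
topics = [
    {"cat": 0.5, "dog": 0.25},
    {"car": 0.5, "dog": 0.125},
]
intensities = normalized_intensities(topics, ["dog", "cat"])
print(intensities)
```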

data folder cannot be cloned without git-lfs

I was trying to clone the repo on my browser, but some of the data files require git-lfs to be installed. I would suggest moving the data files and other big files to a different location so that people who don't have access to git-lfs can also clone the repo.

I am getting the following error:

> git clone https://github.com/bmabey/pyLDAvis.git
Cloning into 'pyLDAvis'...
remote: Counting objects: 540, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 540 (delta 0), reused 0 (delta 0), pack-reused 537
Receiving objects:  89% (481/540), 628.01 KiB | 1.17 MiB/s
Receiving objects: 100% (540/540), 1.45 MiB | 1.17 MiB/s, done.
Resolving deltas: 100% (290/290), done.
Checking connectivity... done.
git-lfs smudge 'tests/data/movie_reviews_input.json': git-lfs: command not found

error: external filter git-lfs smudge %f failed -1
error: external filter git-lfs smudge %f failed
fatal: tests/data/movie_reviews_input.json: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

Can't run notebook example

The library looks great. Thank you!

Just trying to run the notebook example with IPython 3.1.0 and hitting this error (including only the top):

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ag/.virtualenvs/py3/lib/python3.4/site-packages/joblib/parallel.py", line 92, in __call__
    return self.func(*args, **kwargs)
  File "/home/ag/.virtualenvs/py3/lib/python3.4/site-packages/pyLDAvis/_prepare.py", line 136, in _find_relevance_chunks
    return pd.concat(map(lambda l: _find_relevance(log_ttd, log_lift, R, l), lambda_seq))
  File "/usr/lib/python3/dist-packages/pandas/tools/merge.py", line 929, in concat
    verify_integrity=verify_integrity)
  File "/usr/lib/python3/dist-packages/pandas/tools/merge.py", line 944, in __init__
    '"{0}"'.format(type(objs).__name__))
AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "map"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ag/.virtualenvs/py3/lib/python3.4/site-packages/joblib/parallel.py", line 102, in __call__
    raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException
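
The AssertionError above says that `pd.concat` received an object of type "map". On Python 3, `map` returns a lazy iterator rather than a list, so code written for Python 2 that passes `map(...)` straight to a function expecting a list can fail this way. A minimal sketch of the difference, using plain numbers in place of DataFrames; the doubling lambda is a placeholder, not the library's `_find_relevance`:

```python
lambda_seq = [0.0, 0.5, 1.0]

# On Python 3 this is a lazy iterator, not a list.
lazy = map(lambda l: l * 2, lambda_seq)
print(type(lazy).__name__)

# Materializing the results gives the list that list-expecting code needs.
materialized = [l * 2 for l in lambda_seq]
print(materialized)
```

Wrapping the `map(...)` call in `list(...)`, or using a list comprehension as here, is the usual way to make such code behave the same on both Python versions.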

Add "Documents" link to topic panel

On the topic panel, add a "Documents" link.
Clicking this link would bring up a list of documents sorted by relevance.
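
A small sketch of the ranking step, assuming "relevance" means the weight of the selected topic in each document's topic distribution. The request itself does not define relevance, and `documents_for_topic` is a hypothetical helper rather than an existing pyLDAvis function.

```python
def documents_for_topic(doc_topic_dists, topic, top_n=10):
    """Return document indices ordered by descending weight of `topic`."""
    ranked = sorted(range(len(doc_topic_dists)),
                    key=lambda d: doc_topic_dists[d][topic],
                    reverse=True)
    return ranked[:top_n]


# Made-up document-topic distributions: one row per document.
doc_topic_dists = [
    [0.75, 0.25],
    [0.125, 0.875],
    [0.5, 0.5],
]
top_docs = documents_for_topic(doc_topic_dists, topic=0)
print(top_docs)
```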

add topic cohesion metrics

For now, just optionally display the metrics for the currently selected topic. Later we can maybe sort the topics by them somehow.