shdp's Issues

Document-topic distribution?

Hi again Ardavan!

May I ask whether it is possible to obtain a document-topic distribution from sHDP, so that I can look at each document and see which topic is most appropriate?

Thanks.

Bo
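
For readers with the same question: a minimal sketch of the post-processing, assuming you can extract an (n_docs, K) matrix of per-document topic weights from the fitted model (sHDP does not document such an accessor, so doc_topic below is a hypothetical placeholder):

import numpy as np

# doc_topic: hypothetical (n_docs, K) matrix of per-document topic weights
doc_topic = np.random.rand(5, 10)                  # stand-in values only
doc_topic /= doc_topic.sum(axis=1, keepdims=True)  # normalize rows into distributions
best_topic = doc_topic.argmax(axis=1)              # most appropriate topic per document
print(best_topic)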

Parameter for number of maximum topics

Hi,

I was experimenting with the code you have here, particularly with the parameter 'K', which I take to be the maximum number of topics expected. I am running into inconsistencies between the number of topics discovered in the topic-word distribution and the doc-topic distribution outputs.

  1. With K=40 I get 40 unique topics in the doc_topic distribution, but with K=100 I get only 2 topics in the doc_topic distribution, each with very high probability (>98%).

  2. For both values of K, however, the topic_word distribution output reaches the specified limit, with 40 and 100 unique topics respectively.

So it looks like the number of topics in the topic_word distribution always reaches the upper bound K we specify, but the same is not true of the doc_topic distribution output.

Perhaps there are some parameters I need to set, or assumptions I am getting wrong? Please let me know.

Thank you for the great work!
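
For reference, a sketch of how the "unique topics" counts above can be reproduced from a doc-topic matrix (the 0.01 threshold is an arbitrary choice, and doc_topic is a stand-in for the model's output):

import numpy as np

doc_topic = np.random.dirichlet(np.ones(100), size=500)  # stand-in (n_docs, K) matrix
used = (doc_topic > 0.01).any(axis=0)  # a topic counts as used if any document gives it mass
print(used.sum(), "of", doc_topic.shape[1], "topics effectively used")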

num & denom = 0 in _compute_expected_direction

Hi,

First, thank you for this project! I really liked the paper and downloaded the code right away.

I tried the example with nips and it works fine. However, I wanted to use it on another dataset with other word embeddings, using the exact same file format.

However, I got an error in the method _compute_expected_direction:

Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 121, in HDPRunner
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "./runner.py", line 121, in <listcomp>
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "/root/sHDP/core/core_distributions.py", line 244, in __init__
    self.resample() # initialize from prior
  File "/root/sHDP/core/core_distributions.py", line 425, in resample
    self._Expected_mu = self._compute_expected_direction( D, self.mu_0, self.C_0 )
  File "/root/sHDP/core/core_distributions.py", line 303, in _compute_expected_direction
    val = num/denom*direct
FloatingPointError: invalid value encountered in double_scalars

As an observation, in this case num and denom are both 0.0. With the nips example (without any changes to the repo), they start out with finite values and quickly become "inf". I don't know if this behavior is expected.

So 2 questions:

  • How can I avoid the zero denominator?
  • Are the "inf" values for num and denom expected at some point?

Thank you very much for your help!

EDIT: using your word embeddings works. However, using embeddings I trained myself, with a dimension of 500 (also changed in runner.py), it doesn't. I did L2 normalization on each embedding, and the input text contains only words that have word embeddings. What could I be doing wrong?
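
A note for anyone hitting the same error: expectations under a vMF likelihood typically involve a ratio of modified Bessel functions I_{d/2}(kappa) / I_{d/2-1}(kappa). For d = 500 the individual Bessel values underflow to 0.0 (hence num = denom = 0), while for small d and large kappa they overflow to inf. Whether this is exactly what _compute_expected_direction evaluates is an assumption on my part, but if so, the ratio itself stays well-behaved and can be computed directly with the Gauss continued fraction instead of forming num and denom separately:

def bessel_ratio(v, z, n_terms=64):
    # I_v(z) / I_{v-1}(z) via the Gauss continued fraction, evaluated
    # backwards; never forms I_v(z) itself, so it cannot under- or overflow.
    r = 0.0
    for k in range(n_terms, 0, -1):
        r = 1.0 / (2.0 * (v + k) / z + r)
    return 1.0 / (2.0 * v / z + r)

print(bessel_ratio(250.0, 10.0))  # ~0.02 and finite, although scipy's iv(250, 10) underflows to 0.0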

Input data format

Thanks very much for your work Ardavan!

I have noticed sHDP requires a pickled data file and a pickled word-embedding dict. I have a bunch of tweets (one tweet per line) that I want to test on your model. May I ask how I would go about transforming them into the required format?
In texts, does e.g. ('time', 2) mean the token 'time' occurred twice in the whole corpus? And is each list of tuples a tokenised, unordered document?
Is vectors_dict merely the pre-trained word2vec embeddings subsampled to the corpus vocabulary, since it only has 4768 items?

Thanks again!
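
In case it helps, a hypothetical preprocessing sketch under the reading of the format asked about above (texts.pk as a pickled list of documents, each a list of (token, count) pairs, plus a pickled {word: unit-norm vector} dict restricted to the corpus vocabulary; the output paths and the dict file name are guesses based on the tracebacks elsewhere on this page):

import os
import pickle
from collections import Counter
import numpy as np

# toy stand-ins: one tweet per line, plus a pre-trained embedding table
tweets = ["the cat sat on the mat", "dogs and cats"]
rng = np.random.default_rng(0)
pretrained = {w: rng.normal(size=50) for w in ["cat", "sat", "mat", "dogs", "cats"]}

texts = []
for tweet in tweets:
    tokens = [t for t in tweet.lower().split() if t in pretrained]
    texts.append(list(Counter(tokens).items()))  # [(token, count), ...] per document

# keep only embeddings for words that actually occur, L2-normalized
used = {w for doc in texts for w, _ in doc}
vectors_dict = {w: pretrained[w] / np.linalg.norm(pretrained[w]) for w in used}

os.makedirs('data/tweets', exist_ok=True)  # mirrors the data/<dataset>/ layout
with open('data/tweets/texts.pk', 'wb') as f:
    pickle.dump(texts, f)
with open('data/tweets/vectors_dict.pk', 'wb') as f:  # dict file name is a guess
    pickle.dump(vectors_dict, f)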

Unable to run using a custom dataset

I am attempting to use this tool on a custom dataset. The dataset was created using #2 as a reference for the format; however, it is not running and I'm not sure why. The traceback is:

/Users/a.varela/Downloads/sHDP-master/core/util/stats.py:152: UserWarning: Not sure about sampling vMF, use with caution!!!! 
  warn('Not sure about sampling vMF, use with caution!!!! ')
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 132, in HDPRunner
    HDP.meanfield_sgdstep(data, np.array(data).shape[0] / np.float(training_size), rho_t)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/models.py", line 138, in meanfield_sgdstep
    s.meanfieldupdate()
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 213, in meanfieldupdate
    self.mf_trans_matrix[self.doc_num,:],self.mf_aBl)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 189, in mf_aBl
    aBl[:,idx] = o.expected_log_likelihood([i[0] for i in self.data]).ravel()
  File "/Users/a.varela/Downloads/sHDP-master/core/core_distributions.py", line 566, in expected_log_likelihood
    return self._Expected_log_partition + self._Expected_kappa*np.array(x).dot(self._Expected_mu)
ValueError: shapes (75,100) and (50,) not aligned: 100 (dim 1) != 50 (dim 0)
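
The mismatch (100 vs 50) suggests the custom embeddings are 100-dimensional while the model is configured for 50 dimensions. A quick sanity check before running (the pickle path below is a placeholder for wherever your embedding dict lives):

import pickle
import numpy as np

with open('data/mydata/vectors_dict.pk', 'rb') as f:  # placeholder path
    vectors_dict = pickle.load(f)

dims = {np.asarray(v).shape[0] for v in vectors_dict.values()}
print(dims)  # should be a single value, equal to the dimension runner.py expects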

How to Check Model Coherence Score

I have run the HDP model on my custom dataset. How can I check the coherence score of my model's topics?
Also, is there a way to output the topic that has been assigned to each document, so that a supervised dataset can be generated?

Kindly help.
Thanks
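
One option, since sHDP itself does not report coherence: export each topic as a list of its top words and score them with gensim's CoherenceModel (the corpus and topics below are toy placeholders; how you extract the top words from the fitted model is up to you):

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

tokenized_docs = [["cat", "sat", "mat"], ["dogs", "and", "cats"]]  # your corpus
topics = [["cat", "mat"], ["dogs", "cats"]]                        # top words per topic
dictionary = Dictionary(tokenized_docs)

cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                    dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())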

Runtime error

Hello, this is a great project!
I have a question: I downloaded your code and ran it, and I get the error below. What might the reason be?

{'Nmax': 40, 'tau': 0.8, 'mbsize': 10.0, 'infSeed': 1, 'dataset': 'nipsD', 'alpha': 1.0, 'kappa_sgd': 0.6, 'gamma': 2.0}
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 41, in HDPRunner
    temp_file = open(project_path+'data/'+datasetname+'/texts.pk', 'rb')
IOError: [Errno 2] No such file or directory: 'data/nipsD/texts.pk'
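
The lookup is relative to the working directory, so one quick check is whether data/nipsD/texts.pk exists from where runner.py is launched (note that the args above say 'nipsD'; if the repo only ships a 'nips' folder, the dataset name may simply be misspelled, though that is an assumption on my part):

import os

print(os.getcwd())
print(os.path.exists('data/nipsD/texts.pk'))  # the path the IOError complains about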

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Running the code raises a ValueError:

0.8 -mbsize 10 -dataset nips
{'infSeed': 1, 'alpha': 1.0, 'gamma': 2.0, 'Nmax': 40, 'kappa_sgd': 0.6, 'tau': 0.8, 'mbsize': 10.0, 'dataset': 'nips'}
Loading the glove dict file....
Main runner ...
num_docs: 1566
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 121, in HDPRunner
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "./runner.py", line 121, in <listcomp>
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "/home/noxius/Git/sHDP/core/core_distributions.py", line 242, in __init__
    if (mu,kappa) == (None,None) and None not in (mu_0,C_0,m_0,sigma_0):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I tried different methods, including changing the boolean check in core_distributions.py, line 243:

if ((mu,kappa) == (None,None)) and any(x != None for x in (mu_0, C_0, m_0, sigma_0)):
or
if ((mu,kappa) == (None,None)) and not any(x == None for x in (mu_0, C_0, m_0, sigma_0)):

However, it doesn't change anything and the error remains. Do you have any solution?
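
For anyone else hitting this: the ambiguity comes from the membership test None not in (mu_0, C_0, m_0, sigma_0), which compares each element with == and is therefore element-wise for numpy arrays. Both replacements above keep that element-wise comparison (x != None and x == None still return arrays). Identity tests sidestep it; a self-contained sketch of the check I would try (untested against the repo):

import numpy as np

mu, kappa = None, None
mu_0, C_0, m_0, sigma_0 = np.zeros(3), 1.0, 2.0, 0.1  # stand-in hyperparameters

# `is` comparisons never broadcast, unlike == / != on numpy arrays,
# so this condition is unambiguous even when mu_0 is an array:
if mu is None and kappa is None and all(
        x is not None for x in (mu_0, C_0, m_0, sigma_0)):
    print("initialize from prior")  # i.e. call self.resample() in the repo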

License

Thanks for releasing this code along with your research paper.

Do you have a license that you can add to the source code, so that users know what the restrictions may be? If you have no preference, I suggest the 3-clause BSD license.
