shdp's Issues

Document-topic distribution?

Hi again Ardavan!

May I ask whether it is possible to obtain a document-topic distribution from sHDP, so that I can look at each document and see which topic is most appropriate?

Thanks.

Bo
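
For readers with the same question: a minimal sketch of the post-processing, assuming you can extract an (n_docs, K) matrix of per-document topic weights from the fitted model (sHDP does not document such an accessor, so doc_topic below is a hypothetical placeholder):

import numpy as np

# doc_topic: hypothetical (n_docs, K) matrix of per-document topic weights
doc_topic = np.random.rand(5, 10)                  # stand-in values only
doc_topic /= doc_topic.sum(axis=1, keepdims=True)  # normalize rows into distributions
best_topic = doc_topic.argmax(axis=1)              # most appropriate topic per document
print(best_topic)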

Parameter for number of maximum topics

Hi,

I was experimenting with the code you have here, particularly with the parameter 'K', which I take to be the maximum number of topics expected. I am running into inconsistencies between the number of topics discovered in the topic-word distribution and the doc-topic distribution outputs.

  1. With K=40 I get 40 unique topics in the doc_topic distribution, but with K=100 I get only 2 topics in the doc_topic distribution, each with very high probability (>98%).

  2. For both values of K, however, the topic_word distribution output reaches the specified limit, with 40 and 100 unique topics respectively.

So it looks like the number of topics in the topic_word distribution always reaches the upper bound K we specify, but the same is not true of the doc_topic distribution output.

Perhaps there are some parameters I need to set, or assumptions I am getting wrong? Please let me know.

Thank you for the great work!
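
For reference, a sketch of how the "unique topics" counts above can be reproduced from a doc-topic matrix (the 0.01 threshold is an arbitrary choice, and doc_topic is a stand-in for the model's output):

import numpy as np

doc_topic = np.random.dirichlet(np.ones(100), size=500)  # stand-in (n_docs, K) matrix
used = (doc_topic > 0.01).any(axis=0)  # a topic counts as used if any document gives it mass
print(used.sum(), "of", doc_topic.shape[1], "topics effectively used")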

num & denom = 0 in _compute_expected_direction

Hi,

First, thank you for this project! I really liked the paper and downloaded the code right away.

I tried the example with nips and it works fine. However, I wanted to use it on another dataset with other word embeddings, using the exact same file format.

However, I got an error in the method _compute_expected_direction:

Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 121, in HDPRunner
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "./runner.py", line 121, in <listcomp>
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "/root/sHDP/core/core_distributions.py", line 244, in __init__
    self.resample() # initialize from prior
  File "/root/sHDP/core/core_distributions.py", line 425, in resample
    self._Expected_mu = self._compute_expected_direction( D, self.mu_0, self.C_0 )
  File "/root/sHDP/core/core_distributions.py", line 303, in _compute_expected_direction
    val = num/denom*direct
FloatingPointError: invalid value encountered in double_scalars

As an observation, in this case num and denom are both 0.0. With the nips example (without any changes to the repo), they start out with finite values and quickly become "inf". I don't know if this behavior is expected.

So 2 questions:

  • How can I avoid the zero denominator?
  • Are the "inf" values for num and denom expected at some point?

Thank you very much for your help!

EDIT: using your word embeddings works. However, using embeddings I trained myself, with a dimension of 500 (also changed in runner.py), it doesn't. I did L2 normalization on each embedding, and the input text contains only words that have word embeddings. What could I be doing wrong?
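
A note for anyone hitting the same error: expectations under a vMF likelihood typically involve a ratio of modified Bessel functions I_{d/2}(kappa) / I_{d/2-1}(kappa). For d = 500 the individual Bessel values underflow to 0.0 (hence num = denom = 0), while for small d and large kappa they overflow to inf. Whether this is exactly what _compute_expected_direction evaluates is an assumption on my part, but if so, the ratio itself stays well-behaved and can be computed directly with the Gauss continued fraction instead of forming num and denom separately:

def bessel_ratio(v, z, n_terms=64):
    # I_v(z) / I_{v-1}(z) via the Gauss continued fraction, evaluated
    # backwards; never forms I_v(z) itself, so it cannot under- or overflow.
    r = 0.0
    for k in range(n_terms, 0, -1):
        r = 1.0 / (2.0 * (v + k) / z + r)
    return 1.0 / (2.0 * v / z + r)

print(bessel_ratio(250.0, 10.0))  # ~0.02 and finite, although scipy's iv(250, 10) underflows to 0.0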

Input data format

Thanks very much for your work Ardavan!

I have noticed sHDP requires a pickled data file and a pickled word-embedding dict. I have a bunch of tweets (one tweet per line) that I want to test on your model. May I ask how I would go about transforming them into the required format?
In texts, does e.g. ('time', 2) mean the token 'time' occurred twice in the whole corpus? And is each list of tuples a tokenised, unordered document?
Is vectors_dict merely the pre-trained word2vec embeddings subsampled to the corpus vocabulary, since it only has 4768 items?

Thanks again!
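
In case it helps, a hypothetical preprocessing sketch under the reading of the format asked about above (texts.pk as a pickled list of documents, each a list of (token, count) pairs, plus a pickled {word: unit-norm vector} dict restricted to the corpus vocabulary; the output paths and the dict file name are guesses based on the tracebacks elsewhere on this page):

import os
import pickle
from collections import Counter
import numpy as np

# toy stand-ins: one tweet per line, plus a pre-trained embedding table
tweets = ["the cat sat on the mat", "dogs and cats"]
rng = np.random.default_rng(0)
pretrained = {w: rng.normal(size=50) for w in ["cat", "sat", "mat", "dogs", "cats"]}

texts = []
for tweet in tweets:
    tokens = [t for t in tweet.lower().split() if t in pretrained]
    texts.append(list(Counter(tokens).items()))  # [(token, count), ...] per document

# keep only embeddings for words that actually occur, L2-normalized
used = {w for doc in texts for w, _ in doc}
vectors_dict = {w: pretrained[w] / np.linalg.norm(pretrained[w]) for w in used}

os.makedirs('data/tweets', exist_ok=True)  # mirrors the data/<dataset>/ layout
with open('data/tweets/texts.pk', 'wb') as f:
    pickle.dump(texts, f)
with open('data/tweets/vectors_dict.pk', 'wb') as f:  # dict file name is a guess
    pickle.dump(vectors_dict, f)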

Unable to run using a custom dataset

I am attempting to use this tool on a custom dataset. The dataset was created using #2 as a reference for the format; however, it is not running and I'm not sure why. The traceback is:

/Users/a.varela/Downloads/sHDP-master/core/util/stats.py:152: UserWarning: Not sure about sampling vMF, use with caution!!!! 
  warn('Not sure about sampling vMF, use with caution!!!! ')
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 132, in HDPRunner
    HDP.meanfield_sgdstep(data, np.array(data).shape[0] / np.float(training_size), rho_t)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/models.py", line 138, in meanfield_sgdstep
    s.meanfieldupdate()
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 213, in meanfieldupdate
    self.mf_trans_matrix[self.doc_num,:],self.mf_aBl)
  File "/Users/a.varela/Downloads/sHDP-master/HDP/internals/hmm_states.py", line 189, in mf_aBl
    aBl[:,idx] = o.expected_log_likelihood([i[0] for i in self.data]).ravel()
  File "/Users/a.varela/Downloads/sHDP-master/core/core_distributions.py", line 566, in expected_log_likelihood
    return self._Expected_log_partition + self._Expected_kappa*np.array(x).dot(self._Expected_mu)
ValueError: shapes (75,100) and (50,) not aligned: 100 (dim 1) != 50 (dim 0)
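
The mismatch (100 vs 50) suggests the custom embeddings are 100-dimensional while the model is configured for 50 dimensions. A quick sanity check before running (the pickle path below is a placeholder for wherever your embedding dict lives):

import pickle
import numpy as np

with open('data/mydata/vectors_dict.pk', 'rb') as f:  # placeholder path
    vectors_dict = pickle.load(f)

dims = {np.asarray(v).shape[0] for v in vectors_dict.values()}
print(dims)  # should be a single value, equal to the dimension runner.py expects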

How to Check Model Coherence Score

I have run the HDP model on my custom dataset. How can I check the coherence score of my model's topics?
Also, is there a way to output the topic that has been assigned to each document, so that a supervised dataset can be generated?

Kindly help.
Thanks
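
One option, since sHDP itself does not report coherence: export each topic as a list of its top words and score them with gensim's CoherenceModel (the corpus and topics below are toy placeholders; how you extract the top words from the fitted model is up to you):

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

tokenized_docs = [["cat", "sat", "mat"], ["dogs", "and", "cats"]]  # your corpus
topics = [["cat", "mat"], ["dogs", "cats"]]                        # top words per topic
dictionary = Dictionary(tokenized_docs)

cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                    dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())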

Runtime error

Hello, this is a great project!
I have a question: I downloaded your code and ran it, and I get the error below. What might the reason be?

{'Nmax': 40, 'tau': 0.8, 'mbsize': 10.0, 'infSeed': 1, 'dataset': 'nipsD', 'alpha': 1.0, 'kappa_sgd': 0.6, 'gamma': 2.0}
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 41, in HDPRunner
    temp_file = open(project_path+'data/'+datasetname+'/texts.pk', 'rb')
IOError: [Errno 2] No such file or directory: 'data/nipsD/texts.pk'
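
The lookup is relative to the working directory, so one quick check is whether data/nipsD/texts.pk exists from where runner.py is launched (note that the args above say 'nipsD'; if the repo only ships a 'nips' folder, the dataset name may simply be misspelled, though that is an assumption on my part):

import os

print(os.getcwd())
print(os.path.exists('data/nipsD/texts.pk'))  # the path the IOError complains about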

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Running the code raises a ValueError:

0.8 -mbsize 10 -dataset nips
{'infSeed': 1, 'alpha': 1.0, 'gamma': 2.0, 'Nmax': 40, 'kappa_sgd': 0.6, 'tau': 0.8, 'mbsize': 10.0, 'dataset': 'nips'}
Loading the glove dict file....
Main runner ...
num_docs: 1566
Traceback (most recent call last):
  File "./runner.py", line 225, in <module>
    HDPRunner(args)
  File "./runner.py", line 121, in HDPRunner
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "./runner.py", line 121, in <listcomp>
    components=[vonMisesFisherLogNormal(**obs_hypparams) for itr in range(K)]
  File "/home/noxius/Git/sHDP/core/core_distributions.py", line 242, in __init__
    if (mu,kappa) == (None,None) and None not in (mu_0,C_0,m_0,sigma_0):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I tried different methods, including changing the boolean check in core_distributions.py, line 243:

if ((mu,kappa) == (None,None)) and any(x != None for x in (mu_0, C_0, m_0, sigma_0)):
or
if ((mu,kappa) == (None,None)) and not any(x == None for x in (mu_0, C_0, m_0, sigma_0)):

However, it doesn't change anything and the error remains. Do you have any solution?
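
For anyone else hitting this: the ambiguity comes from the membership test None not in (mu_0, C_0, m_0, sigma_0), which compares each element with == and is therefore element-wise for numpy arrays. Both replacements above keep that element-wise comparison (x != None and x == None still return arrays). Identity tests sidestep it; a self-contained sketch of the check I would try (untested against the repo):

import numpy as np

mu, kappa = None, None
mu_0, C_0, m_0, sigma_0 = np.zeros(3), 1.0, 2.0, 0.1  # stand-in hyperparameters

# `is` comparisons never broadcast, unlike == / != on numpy arrays,
# so this condition is unambiguous even when mu_0 is an array:
if mu is None and kappa is None and all(
        x is not None for x in (mu_0, C_0, m_0, sigma_0)):
    print("initialize from prior")  # i.e. call self.resample() in the repo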

License

Thanks for releasing this code along with your research paper.

Do you have a license that you can add to the source code, so that users know what the restrictions may be? If you have no preference, I suggest the 3-clause BSD license.
