bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool

Home Page: https://bab2min.github.io/tomotopy

License: MIT License

Python 9.46% C++ 84.95% C 2.01% HTML 3.25% Shell 0.05% CMake 0.29%
topic-modeling latent-dirichlet-allocation hierarchical-dirichlet-processes nlp pachinko-allocation dirichlet-multinomial-regression python-library correlated-topic-model supervised-lda topic-models

tomotopy's People

Contributors

bab2min, claudinoac, dpfens, jonaschn, jucendrero


tomotopy's Issues

How do Tomotopy models handle bigrams/trigrams?

Hello.

One of the key features of gensim's LDA models is that they can handle bigrams/trigrams via their dictionaries. I have searched throughout your documentation, but I couldn't find any notes or tutorials that cover using n-grams. I would be grateful if you could help me with this.

Thanks!
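tomotopy itself treats each token as an opaque string, so one workaround (a sketch, not an official tomotopy feature) is to collapse frequent bigrams into single joined tokens before calling add_doc, much like gensim's Phrases does. merge_bigrams and its joiner argument below are hypothetical names:

```python
from collections import Counter

def merge_bigrams(docs, min_count=5, joiner="_"):
    """Merge frequent adjacent token pairs into single 'word1_word2' tokens.

    A simple stand-in for gensim's Phrases: any bigram occurring at least
    `min_count` times across the corpus is collapsed into one token, so a
    plain bag-of-words model sees it as one vocabulary entry.
    """
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    phrases = {p for p, c in pair_counts.items() if c >= min_count}

    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            # greedy left-to-right merge of any phrase pair
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
                out.append(doc[i] + joiner + doc[i + 1])
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs
```

The merged token lists can then be passed to model.add_doc() unchanged; running the same pass a second time over the merged output picks up trigrams.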

`infer` method topic distribution of doc mostly zeros

Hi -

I fitted an HDP model and tried to obtain the topic distribution for an unseen document. I do get a list back, but most of the entries are zeros, so I'm wondering whether there is a rounding issue in the code.

Here's an example of what it looks like

token_list = ['strong', 'organization', 'rusnews', 'line',  'misery', 'write', 'faq', 'ever', 'get', 
'modify', 'define', 'strong', 'atheist', 'believe', 'word']

doc_inst = hdp_model.make_doc(token_list)
topic_dist, ll = hdp_model.infer(doc_inst)

topic_dist
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 1.0,  # <-- the only non-zero element, which is correct, but I'd like to get %'s
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Here's some other info on my OS

Darwin-18.5.0-x86_64-i386-64bit
Python 3.7.6 (default, Dec 30 2019, 19:38:28) 
[Clang 11.0.0 (clang-1100.0.33.16)]
NumPy 1.18.1
SciPy 1.4.1
tomotopy 0.7.1
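For what it's worth, in HDP most of the model's topic slots are usually dead tables, and a short document's mass can collapse onto a single live topic; raising the iteration count passed to infer() may smooth the estimate. The helper below is a sketch: it assumes you pass it `[hdp_model.is_live_topic(t) for t in range(hdp_model.k)]`, using the is_live_topic method the HDP docs describe, and renormalizes the distribution over live topics only:

```python
def live_topic_dist(topic_dist, is_live):
    """Drop dead-topic entries from a full HDP topic distribution and
    renormalize the remaining mass over the live topics only.

    topic_dist : the list returned by infer()
    is_live    : one boolean per topic slot (True = live topic)
    Returns {topic_id: probability} over live topics.
    """
    live = [(t, p) for t, (p, alive) in enumerate(zip(topic_dist, is_live)) if alive]
    total = sum(p for _, p in live)
    if total == 0:
        return {t: 0.0 for t, _ in live}
    return {t: p / total for t, p in live}
```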

Issue in HDPModel

When I train an HDP model with tomotopy, it crashes while computing the log likelihood.

Reproducibility issues even after setting model seed

Hi, thank you for all your work on this amazing library!

I'm running into a strange reproducibility issue: even after setting the model seed, I sometimes get different LDA results with the same documents (a processed subset of the BBC News dataset).

My code is very simple: it reads from a text file, where each line represents a single document with space-separated tokens, and trains an LDAModel on the data. I've turned off parallel processing to rule out any randomness from there as well.

import tomotopy as tp

with open("docs.txt", "r", encoding="utf8") as fp:
    model = tp.LDAModel(k=5, seed=123456789)
    for line in fp:
        model.add_doc(line.split())

for i in range(0, 1000, 100):
    model.train(100, workers=1, parallel=tp.ParallelScheme.NONE)
    print(f"Iteration: {i + 100} LL: {model.ll_per_word:.5f}")

When I run the code, I usually get the following output:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88406
Iteration: 400 LL: -7.86940
Iteration: 500 LL: -7.85939
Iteration: 600 LL: -7.84511
Iteration: 700 LL: -7.84116
Iteration: 800 LL: -7.83339
Iteration: 900 LL: -7.83029
Iteration: 1000 LL: -7.82927

But about 30% of the time I get the following output instead, where the stats seem to diverge at iteration 300:

Iteration: 100 LL: -7.94113
Iteration: 200 LL: -7.90128
Iteration: 300 LL: -7.88715
Iteration: 400 LL: -7.87158
Iteration: 500 LL: -7.86242
Iteration: 600 LL: -7.84669
Iteration: 700 LL: -7.84028
Iteration: 800 LL: -7.82794
Iteration: 900 LL: -7.82512
Iteration: 1000 LL: -7.82317

The results seem to switch randomly between these two outputs (I haven't seen any other variations turn up), but I just can't figure out where the indeterminacy is coming from. I would appreciate any advice or help you could provide!

Attached:
docs.txt

add num_timepoints & num_docs_by_timepoint for DTModel

The current tomotopy.DTModel is missing some properties related to the number of timepoints and documents. I suggest adding the following properties to DTModel.

  • num_timepoints (or t) : the value passed as t to __init__
  • num_docs_by_timepoint : the number of documents belonging to each timepoint
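Until such properties exist, both values can be derived in user code from the timepoint arguments passed to add_doc; docs_by_timepoint below is a hypothetical helper, not part of DTModel:

```python
from collections import Counter

def docs_by_timepoint(timepoints):
    """Number of documents at each timepoint 0..t-1, given the `timepoint`
    value used when each document was added. len() of the result plays the
    role of the proposed num_timepoints property."""
    counts = Counter(timepoints)
    num_t = max(counts) + 1 if counts else 0
    return [counts.get(t, 0) for t in range(num_t)]
```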

Ruby Library

Hey @bab2min, thanks for this awesome library! Just wanted to let you know there are now Ruby bindings for it. The code and docs were incredibly easy to follow.

If you have any feedback, feel free to let me know. Thanks!

Is term weighting implemented correctly?

Term weighting is implemented here by simply multiplying the counts by the weights before and after sampling.

updateCnt<DEC>(doc.numByTopic[tid], INC * weight);
updateCnt<DEC>(ld.numByTopic[tid], INC * weight);
updateCnt<DEC>(ld.numByTopicWord(tid, vid), INC * weight);

However, I don't think you can do that. It's fine for doc.numByTopic, because there the document dependency is still kept. But for both ld counts the results differ from the implementation in the paper "Term Weighting Schemes for Latent Dirichlet Allocation" (eq. 6).

In the paper, the original counts are multiplied by the weights during sampling. In code this would look like the following (using numpy syntax and your variable names), where termweights is a NumberOfDocuments x VocabSize array and ld.numByTopicWord is a NumberOfTopics x VocabSize array (holding counts, not weights):

np.sum(termweights[docid,:][None,:] * ld.numByTopicWord[tid, vid], axis=-1)

I did some testing with my pure Python implementation, and there this expression yields a different result than ld.numByTopic[tid] using the weight update in addWordTo.

Note that this should only matter for weighting schemes where the same tokens can have different weights for different documents (like PMI)
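To make the reported discrepancy concrete, here is a toy illustration (my reading of the issue, not tomotopy's actual code) of the two ways of weighting a single numByTopicWord cell:

```python
def compare_weighting(contrib, current_doc_weight):
    """For one (topic, word) cell:
    contrib is a list of (raw_count, weight_of_contributing_doc) pairs.

    baked_in  : counts multiplied by each contributing document's weight at
                update time (the scheme the issue questions).
    paper_eq6 : raw counts summed first, then multiplied by the weight of
                the document currently being sampled (as in eq. 6 of the
                term weighting paper).
    """
    baked_in = sum(c * w for c, w in contrib)
    paper_eq6 = current_doc_weight * sum(c for c, _ in contrib)
    return baked_in, paper_eq6

# With per-document weights (e.g. PMI) the two disagree:
# compare_weighting([(3, 0.5), (2, 2.0)], current_doc_weight=1.0) -> (5.5, 5.0)
```

When every document carries the same weight the two quantities agree, which is why the problem only shows up with document-dependent schemes like PMI.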

Error when installing tomotopy

I am installing the package:

$pip3 install tomotopy

The installation freezes; it takes hours and the progress spinner turns very slowly (/)

Building wheels for collected packages: tomotopy
Building wheel for tomotopy (setup.py) ... error
ERROR: Failed building wheel for tomotopy
Running setup.py clean for tomotopy
Failed to build tomotopy
Installing collected packages: tomotopy
Running setup.py install for tomotopy ... error

Any help is appreciated. Thanks.

Error when calling train(0) with LDAModel

Hi, I'm following the example to train and use an LDA model. After adding my documents with the add_doc method of LDAModel, calling mdl.train(0) returns an error ("Process finished with exit code -1073741819 (0xC0000005)"). I'm using tomotopy==0.7.1 in a Python 3.7 virtual environment on Windows. Thanks.

Issue in tomotopy

I have installed tomotopy, but when I try to use it for LDA modeling it shows an error:
AttributeError: module 'tomotopy' has no attribute 'LDAmodel'

[new feature] Convenient model description

It would be good for topic model instances to have a simple method that prints their status and configuration, like:

mdl = tp.LDAModel(k=10, alpha=0.1, eta=0.1)

# do some works

mdl.model_summary() # will print like:

# LDAModel
# - hyperparameters
# -- term weight scheme: one
# -- k, number of topics: 10
# -- initial alpha, concentration parameter for doc-topic dist: 0.1
# -- eta, concentration parameter for topic-word dist: 0.1
# -- number of docs: 1000
# -- size of vocabs: 23456
# -- number of total words: 100000
# - parameters
# -- alpha: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
# ...
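Until something like this is built in, a rough version can live in user code; model_summary below is a hypothetical helper that probes only attributes the LDA docs list (k, num_vocabs, num_words) and skips anything a given model type doesn't expose:

```python
def model_summary(mdl):
    """Print-ready one-model summary built only from public attributes;
    anything the model doesn't expose is skipped rather than guessed."""
    lines = [type(mdl).__name__]
    for name in ("k", "num_vocabs", "num_words"):
        if hasattr(mdl, name):
            lines.append("- {}: {}".format(name, getattr(mdl, name)))
    return "\n".join(lines)

# print(model_summary(mdl))
```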

Possible to expose the corpus class?

Thanks for tomotopy! I have one small issue, though: rebuilding the corpus every time I want to change the model. Is there a way to do something like this:

import tomotopy as tp

corpus = tp.Corpus()
for document in documents:
    corpus.add_doc(document)

model = tp.LDAModel(k=5, corpus=corpus)
model.train(100)

Thanks again for a great package!

Typo on the website

Your HDP model section has a typo: the descriptions of the convert_to_lda and is_live_topic methods are identical, although I'm assuming the methods themselves are not.

Segfaults with alpha

Hi @bab2min, just wanted to report some segfaults I came across when building the Ruby library.

import tomotopy as tp

model = tp.GDMRModel()
print(model.alpha)

model = tp.DTModel()
print(model.alpha)

tomotopy version: 0.9.1

Computing per-timepoint topic distributions with DTModel

Hello, I'm a graduate student who relies heavily on tomotopy and kiwipiepy. (bab2min, you're the best.) I asked a question about kiwipiepy before as well; I'm not sure if you remember.

Since DTModel was added recently, I moved my DTModel work, which ran painfully slowly in gensim, over to tomotopy, and training time dropped dramatically. Thank you so much.

My problem is visualization. For the change in each topic's word distribution over time, get_topic_word_dist seems sufficient. But I also want the change in the distribution across topics. For example, like this code (https://github.com/GSukr/dtmvisual), I'd like to plot how topic proportions change over time; in gensim I computed this from gamma_, the per-document topic distribution parameter.

In tomotopy, can I just use get_alpha for this? I'm asking because DTModel has no get_topic_dist function.

Thank you for making such a good package. I'm also currently reviewing the paper you wrote with Professor Min Song.

Document-topic matrix for hierarchical LDA

Is there a way to get the topic mixture of each document back out of a hierarchical model? I am training an HLDAModel:

h_mdl = tp.HLDAModel(depth=4, corpus=corpus, seed=1)

for i in range(0, 100, 10):  # train the model using Gibbs sampling
    h_mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, h_mdl.ll_per_word))

I am using the Document class to access instances of documents with the get_topic_dist() method. Is there a way to access a topic mixture of a document at a given depth of the model?
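One way to read a per-depth mixture out, assuming the Document.path attribute the HLDA docs describe (the topic id assigned at each level, root first; worth verifying against your version), is to index the full distribution by the path:

```python
def path_topic_mixture(path, topic_dist):
    """Per-depth topic mixture for a single HLDA document.

    path       : e.g. doc.path, the topic id at each depth, root to leaf
    topic_dist : e.g. doc.get_topic_dist(), which should be nonzero only
                 on the topics lying on the document's path
    Returns a list of (depth, topic_id, weight) triples.
    """
    return [(depth, t, topic_dist[t]) for depth, t in enumerate(path)]
```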

PLDA performance/memory issue

Hi,
thanks so much for implementing the Partially Labeled LDA model :)

But there seems to be an issue with the add_doc method. First, it's much slower than in any of the other models in this library, and there is also a memory issue: after adding only about 300 documents (none overly long), I run out of memory.

I am using PLDAModel(latent_topics=20, topics_per_label=1) and the documents added to this point only contain 3 different labels.

Segmentation fault (core dumped) when using labels in LLDA

Hello, I am not an expert and I may be wrong here, but when I try to build a labelled LDA model I get 'Segmentation fault (core dumped)', and it may be a bug.

After declaring:
model = tp.LLDAModel(tw=TermWeight.ONE, min_cf=0, rm_top=0, k=20, alpha=0.1, eta=0.01)

If I do not specify labels, as in:

model.add_doc(myDocument)

instead of

model.add_doc(myDocument,labels=myLabel)

I can add all the documents and build a working model just fine, but if I pass the labels, the program gives the segmentation fault while adding the very first document.

I have also been able to create sLDA and LDA models without any single problem.

define _mm256_set_m128i for gcc compilers < 8.0

During installation in Python 3.6.9 with gcc 7.5.0 on Ubuntu 18.04:

sudo pip3 install tomotopy

I received the following error:

src/python/../Labeling/../Utils/EigenAddonOps.hpp:79:8: error: ‘_mm256_set_m128i’ was not declared in this scope
        u = _mm256_set_m128i(
            ^~~~~~~~~~~~~~~~
    src/python/../Labeling/../Utils/EigenAddonOps.hpp:79:8: note: suggested alternative: ‘_mm256_set_epi8’
        u = _mm256_set_m128i(
            ^~~~~~~~~~~~~~~~
            _mm256_set_epi8

The computer that has the following CPU information:

doug@doug-desktop:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          21
Model:               1
Model name:          AMD FX(tm)-8120 Eight-Core Processor
Stepping:            2
CPU MHz:             1419.705
CPU max MHz:         3100.0000
CPU min MHz:         1400.0000
BogoMIPS:            6242.06
Virtualization:      AMD-V
L1d cache:           16K
L1i cache:           64K
L2 cache:            2048K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate ssbd ibpb vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

I believe this can be resolved by defining _mm256_set_m128i for gcc compilers older than 8.0.

'model.vocabs' gives full list of tokens even if it was filtered

model.vocabs returns the full token list even after filtering, while num_vocabs returns the token count after filtering. This is confusing, because the model returns the topic-word distribution as a list of num_vocabs numbers but provides no id2token dictionary. Something like a used_vocabs property is needed for any downstream analysis of the produced topics.
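As a stopgap, an id2token mapping for the filtered vocabulary can be reconstructed in user code, under the assumption (worth verifying against your tomotopy version, and ignoring rm_top removals, which would need the same treatment) that the used vocabulary is simply the subsequence of model.vocabs whose corpus frequency survives the min_cf cutoff, in original order:

```python
def build_id2token(vocabs, freqs, min_cf=1):
    """Hypothetical id2token for the filtered vocabulary.

    vocabs : full vocabulary list, e.g. list(model.vocabs)
    freqs  : per-token corpus frequencies aligned with vocabs
    Assumes the used vocabulary keeps the original order of the
    surviving entries -- an assumption, not a documented guarantee.
    """
    used = [w for w, f in zip(vocabs, freqs) if f >= min_cf]
    return dict(enumerate(used))
```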

Trying to use PAM model giving by input processed text

Hi. I'm trying to use the PAModel, feeding it a processed text (a list of lemmatized words, with punctuation and stop words removed).

def topicExtraction(corpus):
    model = tomotopy.PAModel(k=20)
    model.add_doc(corpus)
    model.train()

    print(model.get_topic_words(k, top_n = 20))

But it returns error.
Just one thing: This project looks great.

How can I get the K2 distribution of topics out of PAModel

Currently PAModel.infer(doc) returns only K1 numbers plus the likelihood. I am trying to build an index, and I'm having difficulty getting the K2 numbers out of the infer() method. I looked at the documentation, and the infer method is inherited directly from LDA. Is there a way to get the document x topic matrix for both the K1 and K2 topics?

different results even if seed is fixed

Depending on the environment (32-bit or 64-bit; SSE2, AVX, or AVX2) in which tomotopy is installed, different results are produced with the same seed.
It is possibly related to #60.

PLDA documentation, order of topics

I understand that after training the PLDAModel, k is the number of latent topics plus the topics derived from the labels (topics_per_label * number of unique labels).

What is the order of the topics? For example, would get_topic_word_dist(0) give me the first latent topic or the first topic of the first label?

Unrelated to this, it would be great if tomotopy.Dictionary had a nicer string representation.

>>> plda_model.topic_label_dict
<tomotopy.Dictionary object at 0x0000012034EADD90>

It could simply produce the same output as list(m.topic_label_dict), for example.

Thanks again for your great work. I don't think any other topic modeling library offers this many models with such great performance.

Issue with model loading (0.6.1)

Hi @bab2min

Something seems wrong with the newest version 0.6.1.

Models created with 0.6.1 cannot be loaded with 0.6.1; loading throws the exception: Exception: 'lda.model.1000.bin' is not valid model file

I saw that you changed the model file format: "Since version 0.6.0, the model file format has been changed. Thus model files saved in version 0.6.0 or later are not compatible with versions prior to 0.5.2."

training (this works)

mdl = tp.LDAModel(tw=tp.TermWeight.IDF, k=1000)
for description in descriptions:
    ch = description.split()
    mdl.add_doc(ch)
mdl.burn_in = 100
mdl.train(0)

print('Training...', file=sys.stderr, flush=True)
for i in range(0, 1000, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

print('Saving...', file=sys.stderr, flush=True)
mdl.save("lda.model.1000.bin", True)

load model and infer documents (this doesn't work)

mdl = tp.LDAModel.load("lda.model.1000.bin")

docs = []
for record in records:
    docs.append(mdl.make_doc(record["description_cleaned"].split()))

infered_docs = mdl.infer(docs, together=True, parallel=3)

Segmentation fault from the `extract` method with trained `HDPModel` and `CTModel`

I don't know if it is relevant, but I sometimes get a segmentation fault from the extract method with a trained HDPModel or CTModel:

    extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
    cands = extractor.extract(mdl)
Fatal Python error: Segmentation fault
Current thread 0x00007fbd1009b740 (most recent call first)
...
Segmentation fault (core dumped)

I tried to debug more with the faulthandler module of Python 3, but I cannot get a more detailed output.

EDIT:
Here is the stack trace from gdb. I hope it helps:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffc74cec6d in tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >* tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >::makeNext<tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&>(unsigned int const&, tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
(gdb) backtrace
#0  0x00007fffc74cec6d in tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >* tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >::makeNext<tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&>(unsigned int const&, tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#1  0x00007fffc74cec8b in tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >* tomoto::TrieEx<unsigned int, unsigned long, tomoto::ConstAccess<std::map<unsigned int, int, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, int> > > > >::makeNext<tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&>(unsigned int const&, tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const::{lambda()#1}&) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#2  0x00007fffc74cfbcf in tomoto::label::PMIExtractor::extract(tomoto::ITopicModel const*) const () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#3  0x00007fffc7079dd8 in ExtractorObject::extract(ExtractorObject*, _object*, _object*) () from /home/henry/anaconda3/envs/gdpr/lib/python3.6/site-packages/_tomotopy.cpython-36m-x86_64-linux-gnu.so
#4  0x00005555556654f4 in _PyCFunction_FastCallDict () at /tmp/build/80754af9/python_1578429706181/work/Objects/methodobject.c:231
#5  0x00005555556ecdac in call_function () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4851
#6  0x000055555570f66a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:3335
#7  0x00005555556e6ebb in _PyFunction_FastCall (globals=<optimized out>, nargs=2, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4933
#8  fast_function () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4968
#9  0x00005555556ece85 in call_function () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4872
#10 0x000055555570f66a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:3335
#11 0x00005555556e7c09 in _PyEval_EvalCodeWithName (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=<optimized out>, kwargs=0x0, kwnames=0x0, argcount=0, args=0x0,
    locals=0x7ffff7f55120, globals=0x7ffff7f55120, _co=0x7ffff6aaba50) at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4166
#12 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:4187
#13 0x00005555556e89ac in PyEval_EvalCode (co=co@entry=0x7ffff6aaba50, globals=globals@entry=0x7ffff7f55120, locals=locals@entry=0x7ffff7f55120) at /tmp/build/80754af9/python_1578429706181/work/Python/ceval.c:731
#14 0x0000555555768c64 in run_mod () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:1025
#15 0x0000555555769061 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:978
#16 0x0000555555769263 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:419
#17 0x000055555576936d in PyRun_AnyFileExFlags () at /tmp/build/80754af9/python_1578429706181/work/Python/pythonrun.c:81
#18 0x000055555576cd53 in run_file (p_cf=0x7fffffffdddc, filename=0x5555558a76c0 L"gdpr_topic_modelling.py", fp=0x5555558f5110) at /tmp/build/80754af9/python_1578429706181/work/Modules/main.c:340
#19 Py_Main () at /tmp/build/80754af9/python_1578429706181/work/Modules/main.c:811
#20 0x00005555556373be in main () at /tmp/build/80754af9/python_1578429706181/work/Programs/python.c:69
#21 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556372d0 <main>, argc=2, argv=0x7fffffffdfe8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdfd8) at ../csu/libc-start.c:310
#22 0x0000555555716084 in _start () at ../sysdeps/x86_64/elf/start.S:103

Originally posted by @g3rfx in #40 (comment)

Malloc error when training LLDA model

System: OSX (10.14.6)
Version: Tomotopy 0.5.2

Example code (mostly from the LDA example, using the same text file):

import tomotopy as tp
from random import seed
from random import randint

seed(1)

def llda_example(input_file, save_path):
    topics = ['technology', 'art', 'economics', 'politics', 'religion', 'sport']
    mdl = tp.LLDAModel(tw=tp.TermWeight.ONE, min_cf=3, rm_top=5, k=20)
    for n, line in enumerate(open(input_file, encoding='utf-8')):
        ch = line.strip().split()
        labels = []
        for i in range(randint(1, 6)):
            labels.append(topics[randint(0, 5)])
        mdl.add_doc(ch,labels)
    mdl.burn_in = 100
    mdl.train(0)
            
print('Running LLDA')
llda_example('enwiki-stemmed-1000.txt', 'test.lda.bin')

Result: kernel dies, with the following message in the console:
python(28393,0x111d1b5c0) malloc: *** error for object 0x7fc6f98cdfe0: pointer being freed was not allocated
python(28393,0x111d1b5c0) malloc: *** set a breakpoint in malloc_error_break to debug

[new feature] converting HDP to LDA

Gensim provides an hdp_to_lda method for training or inference with the topics fixed in a specific state. It would be good for tomotopy to have this feature too.
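tomotopy's HDP documentation does list a convert_to_lda method, which sounds close to this request. The topic cutoff such a conversion would apply can be prototyped as a plain function; the commented call is a sketch, so check your version's actual signature and return value before relying on it:

```python
def topics_above_threshold(topic_counts, threshold):
    """Pick topic ids whose share of all assigned tokens meets `threshold`,
    mirroring the kind of cutoff a topic_threshold argument would apply.

    topic_counts : tokens assigned to each HDP topic slot
    """
    total = sum(topic_counts)
    return [k for k, c in enumerate(topic_counts) if total and c / total >= threshold]

# Sketch of the conversion itself, assuming the documented HDPModel API:
# lda, topic_mapping = hdp.convert_to_lda(topic_threshold=0.01)
```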

Alpha and gamma values not what are being fed in as arguments

It's entirely possible that this issue stems from my lack of understanding of this model or of HDP/LDA in general:

I've written a method that cycles through different hyperparameter values and trains the model so that you can see how the output changes with different values.

I just ran the method in PyCharm, and I saw some strange behavior: I input alpha as 10 ** -4 (0.0001), eta as 10 ** -1 (0.1), and gamma as 10 ** 0 (1). However, once the model was trained, I got the following values:

hdp.alpha = 7.38756571081467e-05
hdp.eta = 0.10000000149011612
hdp.gamma = 3.130246162414551

Is this normal? Should those values be changing once the model has trained?

Partially Labelled LDA (PLDA): 'k' is an invalid keyword argument for this function

I received the following error when instantiating a tomotopy.PLDA instance:

python3.5 -i process_pllda.py 
Traceback (most recent call last):
  File "process_pllda.py", line 27, in <module>
    mdl = tp.PLDAModel(k=50)
TypeError: 'k' is an invalid keyword argument for this function

The documentation for tomotopy.PLDA indicates that k is a valid keyword argument for the constructor, but based on the source code it is not. k is also not listed as a valid parameter later in the documentation, but latent_topics is (source). I am going to assume that k should be latent_topics.

Return value of document.get_topic_dist does not sum to 1.0

I expected the return value of Document.get_topic_dist() to sum to 1.0 (within some epsilon). Yet that's not what I see after training an LDA model.

>>> sum(lda.docs[52904].get_topic_dist())
0.05711954347498249

There are many documents in my model where the sum is not close to 1.0.

Could you please explain what the values returned by Document.get_topic_dist() represent?

Apologies if this is my misunderstanding.
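As a workaround, the returned values can be rescaled before use; this only renormalizes the reported distribution, it doesn't explain why it is unnormalized:

```python
def normalize(dist, eps=1e-12):
    """Rescale a topic distribution so it sums to 1.0."""
    total = sum(dist)
    if total < eps:
        raise ValueError("distribution sums to ~0; nothing to normalize")
    return [p / total for p in dist]
```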

Can tomotopy be connected to pyLDAvis?

I'm using tomotopy's HDP model.

pyLDAvis requires a model, corpus, and dictionary as arguments, but when I pass in a model trained with tomotopy, I get:
'tomotopy.HDPModel' object has no attribute 'num_topics'

The tomotopy HDP model gives better results than anything else I've tried, so I would really like to use it. Is this integration possible?

Additionally, is there a way to get back the strings of a corpus that was preprocessed and saved as follows?

corpus = tp.utils.Corpus(
    tokenizer=tokenizer
)
# The input file contains one document per line.
corpus.process((line, kiwi.async_analyze(line)) for line in open('input_text_file.txt', encoding='utf-8'))
# Save the preprocessed corpus.
corpus.save('k.cps')
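For reference, pyLDAvis also has a model-agnostic entry point, pyLDAvis.prepare, which takes five plain arrays, so tomotopy can be bridged without the gensim wrapper. This is a sketch under assumptions: the attribute names below follow the tomotopy docs (get_topic_word_dist, docs, words, vocabs, vocab_freq) but should be verified against your version, and for HDP you would first want only live topics, e.g. via the documented convert_to_lda:

```python
def prepare_pyldavis_inputs(mdl):
    """Collect the five arrays pyLDAvis.prepare() expects from a trained
    tomotopy-style model (duck-typed, so any object with these attributes
    works): topic-term dists, doc-topic dists, doc lengths, vocab, and
    per-term corpus frequencies."""
    topic_term = [mdl.get_topic_word_dist(k) for k in range(mdl.k)]
    doc_topic = [doc.get_topic_dist() for doc in mdl.docs]
    doc_lengths = [len(doc.words) for doc in mdl.docs]
    vocab = list(mdl.vocabs)
    term_frequency = list(mdl.vocab_freq)
    return topic_term, doc_topic, doc_lengths, vocab, term_frequency

# data = pyLDAvis.prepare(*prepare_pyldavis_inputs(mdl))
# pyLDAvis.display(data)
```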
