
doc2topic's Introduction

doc2topic -- Neural topic modeling

This is a neural take on LDA-style topic modeling: given a set of documents, it produces a sparse topic distribution per document. A topic is described by a distribution over words. Documents and words are points in the same latent semantic space, whose dimensions are the topics.

The implementation is based on a lightweight neural architecture and aims to be a scalable alternative to LDA. It readily makes use of GPU computation and has been tested successfully on 1M documents with 200 topics (on a Titan Xp card with 12GB of memory).

Getting started: python -m tests.basic data/my_docs.txt
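
(Here data/my_docs.txt stands for your corpus, presumably a plain text file with one document per line; check the repository for the exact expected format.)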

Method

The doc2topic network structure is inspired by the word2vec skip-gram model, but instead of modeling co-occurrences between center and context words, it models co-occurrences between a word and its document ID. To avoid a heavy softmax calculation over an output layer the size of the vocabulary (or of the number of documents), the model is implemented as follows.

[Figure: Architecture of doc2topic]

The network takes as input a word ID and a document ID, which are fed through two separate embedding layers of the same dimensionality. Each embedding dimension represents a topic. The embedding layers are L1 activity regularized in order to obtain sparse representations, i.e., a sparse assignment of topics. The document embeddings are regularized more heavily than the word embeddings, since sparsity matters primarily for document-topic assignments, while document and word embeddings are supposed to remain comparable.
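
This description can be sketched in a few lines of Keras. The following is a minimal illustration rather than the project's actual code: the corpus sizes and regularization weights are assumptions chosen to mirror the text.

from keras.layers import Activation, Dot, Embedding, Input, Reshape
from keras.models import Model
from keras.regularizers import l1

n_docs, vocab_size, n_topics = 1000, 5000, 100  # illustrative sizes

doc_id = Input(shape=(1,), dtype="int32")
word_id = Input(shape=(1,), dtype="int32")

# Two embedding layers of the same dimensionality; each dimension is a topic.
# The document embeddings get a heavier L1 activity penalty to make the
# document-topic assignments sparse.
doc_vec = Embedding(n_docs, n_topics, input_length=1,
                    activity_regularizer=l1(1e-6), name="docvecs")(doc_id)
word_vec = Embedding(vocab_size, n_topics, input_length=1,
                     activity_regularizer=l1(1e-8), name="wordvecs")(word_id)

# Compare the two vectors by dot product and squash the score into (0, 1).
score = Dot(axes=2)([doc_vec, word_vec])
prob = Activation("sigmoid")(Reshape((1,))(score))

model = Model(inputs=[doc_id, word_id], outputs=prob)
model.compile(loss="binary_crossentropy", optimizer="adam")

After training, the rows of the docvecs weight matrix can be read (after normalization) as per-document topic distributions, and the wordvecs matrix gives each word's affinity to the topics.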

The network is trained by negative sampling, i.e., for any document, both actually co-occurring words and random (presumed non-co-occurring) words are fed to the network. The two embeddings are compared by dot product, and a sigmoid activation function is applied to obtain values between 0 and 1. The training output label is 1 for co-occurring words and 0 for negative samples. This pushes document vectors towards the vectors of the words in the document.
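
A rough sketch of this sampling scheme follows; the function and its uniform negative-sampling distribution are assumptions (word2vec-style implementations often draw negatives from a smoothed unigram distribution instead).

import numpy as np

def training_triples(docs, vocab_size, n_negative=2, rng=np.random):
    """Yield (doc_id, word_id, label) examples: label 1 for words that
    actually occur in the document, 0 for randomly drawn negatives."""
    for doc_id, words in enumerate(docs):
        for word_id in words:
            yield doc_id, word_id, 1                      # positive pair
            for _ in range(n_negative):                   # negative samples
                yield doc_id, rng.randint(vocab_size), 0

Triples like these, fed to the model with binary cross-entropy loss, provide the training signal described above.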

doc2topic's People

Contributors

sronnqvist


doc2topic's Issues

How to run

Hello there!

My team and I are a little new to Keras and we were wondering if we could get a bit of help.
Say we wanted to get the topics for some document that isn't in patents.txt. How would we be able to get those?

Thanks!

Generated topics for document only come from first document

Hi there,

I was able to run this on my own dataset, but I'm getting some strange and unexpected behavior. I'm able to generate topics and weights for each document, but the topics returned contain only words from the first document. I've attached some examples in case this isn't clear. A fork with my code can be found here.

Changing the number of topics seems to help get indices of unique topics but I'm unsure how that impacts the generated model.

Running print_topic_words gives me the below output, which shows there are some great topics within my dataset:

1: walk, may, away, beauti, look, voic, hi, memori, turn, everyon
2: fun, look, hi, great, buy, hey, big, doe, better, new
3: bar, answer, peopl, befor, day, help, look, good, moment, da
4: drink, dum, thousand, fli, well, rememb, away, la, look, feet
5: bye, lovin, honey, faith, may, bop, wheel, find, readi, gone
6: gonna, way, heart, look, feelin, sometim, need, goodby, happen, someon
7: ah, ya, need, oo, wanna, nobodi, togeth, miss, know, ever
8: doo, song, oh, sing, boy, ever, wa, would, chee, miss
9: ladi, dead, deep, littl, boy, guy, leav, look, insid, bit
10: summer, easi, lay, heart, look, tryin, fall, la, two, black
11: river, kill, aliv, hard, long, day, summer, look, busi, shout
12: christma, snow, happi, year, go, knock, like, wa, train, littl
13: la, give, sing, gotta, look, ah, two, hundr, ring, onli
14: run, sail, wind, befor, hard, must, look, without, fun, make

Here are the topics generated for each song:

  {
    "artist": "ABBA",
    "title": "Ahe's My Kind Of Girl",
    "topics": [
      "make",
      "pleas",
      "plan",
      "believ",
      "without",
      "go",
      "squeez",
      "park",
      "fine",
      "could",
      "someth",
      "face",
      "mean",
      "talk",
      "hand",
      "wonder",
      "gentli",
      "hold",
      "hour"
    ]
  },
  {
    "artist": "ABBA",
    "title": "Andante, Andante",
    "topics": [
      "park",
      "mine",
      "hour",
      "like",
      "squeez",
      "blue",
      "face",
      "gentli",
      "thing",
      "pleas",
      "walk",
      "make",
      "believ",
      "take",
      "feel",
      "look",
      "girl",
      "see",
      "easi",
      "smile",
      "lucki",
      "without",
      "plan",
      "wonder",
      "fellow",
      "go"
    ]
  },
  {
    "artist": "ABBA",
    "title": "As Good As New",
    "topics": [
      "lucki",
      "fine",
      "make",
      "ever",
      "feel",
      "kind"
    ]
  }...

Note that all the generated topics are words from the first song, "She's My Kind of Girl", but are not relevant to any of the other documents in my dataset. Please let me know if what I'm saying is unclear.

InvalidArgumentError indices [in docvecs]

Hi again. I translated your model to Keras (in R):

n_topics = 4
input_dim = 10000
n_doc = 11995

input_d <- keras::layer_input(shape = 1, dtype = "int32")
input_w <- keras::layer_input(shape = 1, dtype = "int32")

embed_d <- input_d %>%
  keras::layer_embedding(
    input_dim = n_doc,
    output_dim = n_topics,
    input_length = 1,
    activity_regularizer = keras::regularizer_l1(0.000002),
    name = "docvecs"
  ) %>%
  layer_activation("relu") %>%
  layer_reshape(c(n_topics, 1))

embed_w <- input_w %>%
  keras::layer_embedding(
    input_dim = input_dim,
    output_dim = n_topics,
    input_length = 1,
    activity_regularizer = keras::regularizer_l1(0.000000015),
    name = "wordvecs"
  ) %>%
  layer_activation("relu") %>%
  layer_reshape(c(n_topics, 1))

dot_prod <- keras::layer_dot(list(embed_d, embed_w), axes = 1, normalize = FALSE) %>%
  keras::layer_reshape(target_shape = 1)

output <- dot_prod %>%
  layer_activation("sigmoid")

model <- keras::keras_model(inputs = list(input_d, input_w), outputs = output) %>%
  keras::compile(
    loss = "binary_crossentropy",
    optimizer = "adam"
  )

With the following data input:

Observations: 2,706,666
Variables: 3
$ doc_id   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ token_id <int> 102, 2269, 113, 8360, 8746, 566, 496, 5930, 113, 119, 17, 2356, 803, …
$ outcome  <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, …

$doc_id
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    2966    5990    5994    9014   11995 
$token_id
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1     493    2231    3248    5640   10000 
$outcome
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     0.5     0.5     1.0     1.0 

But I get an error and don't know exactly why:

model %>%
  fit(
    x = list(sam$doc_id, sam$token_id),
    y = sam$outcome,
    batch_size = 100,
    epochs = 2,
    shuffle = TRUE
  )

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  InvalidArgumentError: indices[52,0] = 11995 is not in [0, 11995)
	 [[Node: docvecs_7/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@training_3/Adam/Assign_2"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](docvecs_7/embeddings/read, _arg_input_17_0_0, training_3/Adam/gradients/docvecs_7/embedding_lookup_grad/concat/axis)]]

Do you know how to fix this?

That would be great! Thanks in advance,
Simon
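
(A likely cause, judging from the error message alone: R indexes from 1, so doc_id runs from 1 to 11995, while an embedding layer with input_dim = n_doc only accepts indices in [0, 11995). Shifting the IDs to zero-based, e.g. sam$doc_id - 1L, or setting input_dim = n_doc + 1 should avoid the out-of-range lookup; the same applies to token_id, whose maximum equals input_dim.)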
