
doc2topic's Introduction

doc2topic -- Neural topic modeling

This is a neural take on LDA-style topic modeling: given a set of documents, it produces a sparse topic distribution per document. A topic is described by a distribution over words. Documents and words are points in the same latent semantic space, whose dimensions are the topics.

The implementation is based on a lightweight neural architecture and aims to be a scalable alternative to LDA. It readily makes use of GPU computation and has been tested successfully on 1M documents with 200 topics (on a Titan Xp card with 12GB of memory).

Getting started: python -m tests.basic data/my_docs.txt
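
(Here data/my_docs.txt stands for your corpus, presumably a plain text file with one document per line; check the repository for the exact expected format.)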

Method

The doc2topic network structure is inspired by the word2vec skip-gram model, but instead of modeling co-occurrences between center and context words, it models co-occurrences between a word and its document ID. To avoid a heavy softmax calculation over an output layer the size of the vocabulary (or of the number of documents), the model is implemented as follows.

[Figure: Architecture of doc2topic]

The network takes as input a word ID and a document ID, which are fed through two separate embedding layers of the same dimensionality. Each embedding dimension represents a topic. The embedding layers are L1 activity regularized in order to obtain sparse representations, i.e., a sparse assignment of topics. The document embeddings are regularized more heavily than the word embeddings, since sparsity matters primarily for document-topic assignments, while document and word embeddings are supposed to remain comparable.
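
This description can be sketched in a few lines of Keras. The following is a minimal illustration rather than the project's actual code: the corpus sizes and regularization weights are assumptions chosen to mirror the text.

from keras.layers import Activation, Dot, Embedding, Input, Reshape
from keras.models import Model
from keras.regularizers import l1

n_docs, vocab_size, n_topics = 1000, 5000, 100  # illustrative sizes

doc_id = Input(shape=(1,), dtype="int32")
word_id = Input(shape=(1,), dtype="int32")

# Two embedding layers of the same dimensionality; each dimension is a topic.
# The document embeddings get a heavier L1 activity penalty to make the
# document-topic assignments sparse.
doc_vec = Embedding(n_docs, n_topics, input_length=1,
                    activity_regularizer=l1(1e-6), name="docvecs")(doc_id)
word_vec = Embedding(vocab_size, n_topics, input_length=1,
                     activity_regularizer=l1(1e-8), name="wordvecs")(word_id)

# Compare the two vectors by dot product and squash the score into (0, 1).
score = Dot(axes=2)([doc_vec, word_vec])
prob = Activation("sigmoid")(Reshape((1,))(score))

model = Model(inputs=[doc_id, word_id], outputs=prob)
model.compile(loss="binary_crossentropy", optimizer="adam")

After training, the rows of the docvecs weight matrix can be read (after normalization) as per-document topic distributions, and the wordvecs matrix gives each word's affinity to the topics.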

The network is trained by negative sampling, i.e., for any document, both actually co-occurring words and random (presumed non-co-occurring) words are fed to the network. The two embeddings are compared by dot product, and a sigmoid activation function is applied to obtain values between 0 and 1. The training output label is 1 for co-occurring words and 0 for negative samples. This pushes document vectors towards the vectors of the words in the document.
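
A rough sketch of this sampling scheme follows; the function and its uniform negative-sampling distribution are assumptions (word2vec-style implementations often draw negatives from a smoothed unigram distribution instead).

import numpy as np

def training_triples(docs, vocab_size, n_negative=2, rng=np.random):
    """Yield (doc_id, word_id, label) examples: label 1 for words that
    actually occur in the document, 0 for randomly drawn negatives."""
    for doc_id, words in enumerate(docs):
        for word_id in words:
            yield doc_id, word_id, 1                      # positive pair
            for _ in range(n_negative):                   # negative samples
                yield doc_id, rng.randint(vocab_size), 0

Triples like these, fed to the model with binary cross-entropy loss, provide the training signal described above.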

doc2topic's People

Contributors

sronnqvist


doc2topic's Issues

How to run

Hello there!

My team and I are a little new to Keras and we were wondering if we could get a bit of help.
Say we wanted to get the topics for some document that isn't in patents.txt. How would we be able to get those?

Thanks!

Generated topics for document only come from first document

Hi there,

I was able to run this on my own dataset, but I'm getting some strange and unexpected behavior. I'm able to generate topics and weights for each document, but the topics returned contain only words from the first document. I've attached some examples in case this isn't clear. A fork with my code can be found here.

Changing the number of topics seems to help get indices of unique topics but I'm unsure how that impacts the generated model.

Running print_topic_words gives me the below output, which shows there are some great topics within my dataset:

1: walk, may, away, beauti, look, voic, hi, memori, turn, everyon
2: fun, look, hi, great, buy, hey, big, doe, better, new
3: bar, answer, peopl, befor, day, help, look, good, moment, da
4: drink, dum, thousand, fli, well, rememb, away, la, look, feet
5: bye, lovin, honey, faith, may, bop, wheel, find, readi, gone
6: gonna, way, heart, look, feelin, sometim, need, goodby, happen, someon
7: ah, ya, need, oo, wanna, nobodi, togeth, miss, know, ever
8: doo, song, oh, sing, boy, ever, wa, would, chee, miss
9: ladi, dead, deep, littl, boy, guy, leav, look, insid, bit
10: summer, easi, lay, heart, look, tryin, fall, la, two, black
11: river, kill, aliv, hard, long, day, summer, look, busi, shout
12: christma, snow, happi, year, go, knock, like, wa, train, littl
13: la, give, sing, gotta, look, ah, two, hundr, ring, onli
14: run, sail, wind, befor, hard, must, look, without, fun, make

Here are the topics generated for each song:

  {
    "artist": "ABBA",
    "title": "Ahe's My Kind Of Girl",
    "topics": [
      "make",
      "pleas",
      "plan",
      "believ",
      "without",
      "go",
      "squeez",
      "park",
      "fine",
      "could",
      "someth",
      "face",
      "mean",
      "talk",
      "hand",
      "wonder",
      "gentli",
      "hold",
      "hour"
    ]
  },
  {
    "artist": "ABBA",
    "title": "Andante, Andante",
    "topics": [
      "park",
      "mine",
      "hour",
      "like",
      "squeez",
      "blue",
      "face",
      "gentli",
      "thing",
      "pleas",
      "walk",
      "make",
      "believ",
      "take",
      "feel",
      "look",
      "girl",
      "see",
      "easi",
      "smile",
      "lucki",
      "without",
      "plan",
      "wonder",
      "fellow",
      "go"
    ]
  },
  {
    "artist": "ABBA",
    "title": "As Good As New",
    "topics": [
      "lucki",
      "fine",
      "make",
      "ever",
      "feel",
      "kind"
    ]
  }...

Note that all the generated topics are words from the first song, "She's My Kind of Girl", but are not relevant to any of the other documents in my dataset. Please let me know if what I'm saying is unclear.

InvalidArgumentError indices [in docvecs]

Hi again. I translated your model to Keras (in R):

n_topics = 4
input_dim = 10000
n_doc = 11995

input_d <- keras::layer_input(shape = 1, dtype = "int32")
input_w <- keras::layer_input(shape = 1, dtype = "int32")

embed_d <- input_d %>%
  keras::layer_embedding(
    input_dim = n_doc,
    output_dim = n_topics,
    input_length = 1,
    activity_regularizer = keras::regularizer_l1(0.000002),
    name = "docvecs"
  ) %>%
  layer_activation("relu") %>%
  layer_reshape(c(n_topics, 1))

embed_w <- input_w %>%
  keras::layer_embedding(
    input_dim = input_dim,
    output_dim = n_topics,
    input_length = 1,
    activity_regularizer = keras::regularizer_l1(0.000000015),
    name = "wordvecs"
  ) %>%
  layer_activation("relu") %>%
  layer_reshape(c(n_topics, 1))

dot_prod <- keras::layer_dot(list(embed_d, embed_w), axes = 1, normalize = FALSE) %>%
  keras::layer_reshape(target_shape = 1)

output <- dot_prod %>%
  layer_activation("sigmoid")

model <- keras::keras_model(inputs = list(input_d, input_w), outputs = output) %>%
  keras::compile(
    loss = "binary_crossentropy",
    optimizer = "adam"
  )

With the following data input:

Observations: 2,706,666
Variables: 3
$ doc_id   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ token_id <int> 102, 2269, 113, 8360, 8746, 566, 496, 5930, 113, 119, 17, 2356, 803, …
$ outcome  <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, …

$doc_id
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    2966    5990    5994    9014   11995 
$token_id
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1     493    2231    3248    5640   10000 
$outcome
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     0.5     0.5     1.0     1.0 

But I get an error and don't know exactly why:

model %>%
  fit(
    x = list(sam$doc_id, sam$token_id),
    y = sam$outcome,
    batch_size = 100,
    epochs = 2,
    shuffle = TRUE
  )

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  InvalidArgumentError: indices[52,0] = 11995 is not in [0, 11995)
	 [[Node: docvecs_7/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@training_3/Adam/Assign_2"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](docvecs_7/embeddings/read, _arg_input_17_0_0, training_3/Adam/gradients/docvecs_7/embedding_lookup_grad/concat/axis)]]

Do you know how to fix this?

That would be great! Thanks in advance,
Simon
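
(A likely cause, judging from the error message alone: R indexes from 1, so doc_id runs from 1 to 11995, while an embedding layer with input_dim = n_doc only accepts indices in [0, 11995). Shifting the IDs to zero-based, e.g. sam$doc_id - 1L, or setting input_dim = n_doc + 1 should avoid the out-of-range lookup; the same applies to token_id, whose maximum equals input_dim.)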
