GithubHelp home page GithubHelp logo

gerhobbelt / warplda Goto Github PK

View Code? Open in Web Editor NEW

This project forked from thu-ml/warplda

0.0 2.0 0.0 48 KB

Cache efficient implementation for Latent Dirichlet Allocation

License: MIT License

Shell 0.53% C++ 94.44% Python 2.87% CMake 2.16%

warplda's Introduction

WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation

Introduction

WarpLDA is a cache efficient implementation of Latent Dirichlet Allocation, which samples each token in O(1).

Installation

Prerequisites:

  • GCC (>=4.8.5)
  • CMake (>=2.8.12)
  • git
  • libnuma
    • CentOS: yum install libnuma-devel
    • Ubuntu: apt-get install libnuma-dev

Clone this project

git clone https://github.com/thu-ml/warplda

Install third-party dependency

./get_gflags.sh

Download some data, and split it as training and testing set

cd data
wget https://raw.githubusercontent.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt
head -n 900 ydir_1k.txt > ydir_train.txt
tail -n 100 ydir_1k.txt > ydir_test.txt
cd ..

Compile the project

./build.sh
cd release/src
make -j

Quick-start

Format the data

./format -input ../../data/ydir_train.txt -prefix train
./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test

Train the model

./warplda --prefix train --k 100 --niter 300

Check the result. Each line is a topic, its id, number of tokens assigned to it, and ten most frequent words with their probabilities.

vim train.info.full.txt

Infer latent topics of some testing data.

./warplda --prefix test --model train.model --inference -niter 40 --perplexity 10

Data format

The data format is identical to Yahoo! LDA. The input data is a text file with a number of lines, where each line is a document. The format of each line is

id1 id2 word1 word2 word3 ...

id1, id2 are two string document identifiers, and each word is a string, separated by white space.

Output format

WarpLDA generates a number of files:

.vocab (generated by .format)

Each line of it is a word in the vocabulary.

.info.full.txt (generated by warplda -estimate)

The most frequent words for each topic. Each line is a topic, with its topic it, number of tokens assigned to it, and a number of most frequent words in the format (probability, word). The number of most frequent words is controlled by -ntop. .info.words.txt is a simpler version which only contains words.

.model (generated by warplda -estimate)

The word-topic count matrix. The first line contains four integers

<size of vocabulary> <number of topics> <alpha> <beta>

Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format,

<number of elements> index:count index:count ...

For example, 0:2 on the first line means that the first word in the vocabulary is assigned to topic 0 for 2 times.

.z.estimate (generated by warplda -estimate)

The topic assignments of each token in the libsvm format. Each line is a document,

<number of tokens> <word id>:<topic id> <word id>:<topic id> ...

.z.inference (generated by warplda -inference)

The format is the same as .z.estimate.

Other features

  • Use custom prefix for output -prefix myprefix

  • Output perplexity every 10 iterations -perplexity 10

  • Tune Dirichlet hyperparameters -alpha 10 -beta 0.1

  • Use UCI machine learning repository data

      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt
      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz
      gunzip docword.nips.txt.gz
      ./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
      head -n 1400 nips.txt > nips_train.txt
      tail -n 100 nips.txt > nips_test.txt
    

License

MIT

Reference

Please cite WarpLDA if you find it is useful!

@inproceedings{chen2016warplda,
  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},
  booktitle={VLDB},
  year={2016}
}

warplda's People

Contributors

chnlkw avatar cjf00000 avatar gerhobbelt avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.