This project forked from chbrown/slda

Supervised Latent Dirichlet Allocation for Classification

License: GNU General Public License v2.0

This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.

Note that this code depends on the GNU Scientific Library.

Compiling

git clone https://github.com/chbrown/slda
cd slda
make

You may need to install GSL first. For example, on macOS with Homebrew:

brew install gsl

Estimation

Estimate the model by executing:

slda est <data path> <label path> <settings path> <alpha> <k> <initialization> <output directory>
  • <data path> should point to a single file containing your training data.
    • Each line describes one document and has the form:

        <M> <term_1>:<count> <term_2>:<count> ... <term_N>:<count>
      
    • where <M> is the number of unique terms in the document (so the line contains <M> term:count pairs), and the <count> associated with each term is the number of times that term appears in the document. (For an example, see test/images/train-data.dat.)

  • <label path> points to a file of labels
    • Each line should contain a single integer class label in the range 0 to C-1, where C is the number of classes.
    • This file should have the same number of lines as the file specified by <data path>.
  • <settings path> should point to a file with various settings, e.g., settings.txt
  • <alpha> is a floating-point hyperparameter (the parameter of the Dirichlet prior over topic proportions)
  • <k> is the number of topics
  • <initialization> specifies the initialization method. There are three options:
    • "seeded"
    • "random"
    • <model path> (a path to some pre-existing model)
  • <output directory> should point to a directory where the estimator's output will be stored. This directory will be created if it does not already exist.
    • The estimator outputs models in two types of files:

      • <iteration>.model is the model saved in binary format, which is fast to load for inference.
      • <iteration>.model.text is the model saved in text format, which is convenient for printing topics or for further analysis with a scripting language.
    • It also produces variational posterior Dirichlets in a file called:

      • <iteration>.gamma
    • Running the estimator on the 8-class image dataset produces the output:

        010.gamma
        010.model
        010.model.text
        020.gamma
        020.model
        020.model.text
        final.gamma
        final.model
        final.model.text
        likelihood.dat
        word-assignments.dat
      

Example usage:

./slda est test/images/train-data.dat test/images/train-label.dat \
    settings.txt 1.0 10 random tmp/
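
The data and label formats described above can be produced with a short script. The following Python sketch is illustrative only and is not part of slda: the function names `format_document` and `write_dataset` are hypothetical, and it assumes that terms are represented by integer ids (a vocabulary index), which this README does not state explicitly.

```python
from collections import Counter


def format_document(term_ids):
    """Return one data line, '<M> <term>:<count> ...', for a list of term ids.

    Assumption: each term is an integer id taken from some vocabulary index.
    """
    counts = Counter(term_ids)
    fields = [str(len(counts))]  # <M>: number of unique terms
    fields += [f"{term}:{count}" for term, count in sorted(counts.items())]
    return " ".join(fields)


def write_dataset(documents, labels, data_path, label_path):
    """Write parallel data and label files, one line per document."""
    if len(documents) != len(labels):
        raise ValueError("the label file needs exactly one line per document")
    if any(not isinstance(label, int) or label < 0 for label in labels):
        raise ValueError("labels must be integers in the range 0..C-1")
    with open(data_path, "w") as data_file:
        for term_ids in documents:
            data_file.write(format_document(term_ids) + "\n")
    with open(label_path, "w") as label_file:
        for label in labels:
            label_file.write(f"{label}\n")
```

For instance, `format_document([3, 1, 3, 7])` yields the line `3 1:1 3:2 7:1`: three unique terms, with term 3 occurring twice.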

Inference

To perform inference on a different set of data (in the same format as for estimation), execute:

slda inf <data path> <label path> <settings path> <model path> <output directory>
  • <data path>, <label path>, and <settings path> are all the same as in the estimation step.
  • <model path> is the binary final.model file from the estimation step.
  • <output directory> is the directory where the inference output, including the predicted labels, will be stored.
    • Each output file has one line per input document.
      • inf-gamma.dat contains the variational posterior Dirichlets
      • inf-labels.dat contains the predicted labels
      • inf-likelihood.dat contains each document's likelihood
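
As a rough illustration of how the variational posterior Dirichlets might be used, the Python sketch below normalizes one document's Dirichlet parameters into expected topic proportions (the mean of a Dirichlet is each parameter divided by their sum). It assumes that each line of inf-gamma.dat holds one whitespace-separated positive number per topic; the README does not document the exact layout, so check a real output file first. The function names are hypothetical.

```python
def parse_gamma_line(line):
    """Parse one line of Dirichlet parameters.

    Assumption: the line holds whitespace-separated floats, one per topic.
    """
    return [float(value) for value in line.split()]


def expected_topic_proportions(gamma):
    """Return the mean of a Dirichlet with parameters `gamma`."""
    total = sum(gamma)
    if total <= 0:
        raise ValueError("Dirichlet parameters must have a positive sum")
    return [value / total for value in gamma]


def dominant_topic(gamma):
    """Return the index of the topic with the largest parameter."""
    return max(range(len(gamma)), key=gamma.__getitem__)


# Hypothetical usage, once inference has written its output to tmp/:
#
#     with open("tmp/inf-gamma.dat") as handle:
#         for line in handle:
#             gamma = parse_gamma_line(line)
#             print(dominant_topic(gamma), expected_topic_proportions(gamma))
```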

Example usage:

./slda inf test/images/test-data.dat test/images/test-label.dat \
    settings.txt tmp/final.model tmp/

This also prints a final line of output that evaluates the predictions against the labels specified in the <label path> argument:

average accuracy: 0.679
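
To look beyond the single summary number, the predictions can be compared with the true labels in a script. The Python sketch below is not part of slda and its function names are hypothetical. It assumes that inf-labels.dat holds one integer label per line, in the same order as the input documents. The overall fraction it computes is presumably what the tool reports as average accuracy, but this README does not confirm that.

```python
from collections import Counter


def read_labels(path):
    """Read one integer label per non-empty line (an assumed file layout)."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of documents whose predicted label is correct."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one document")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_class_accuracy(true_labels, predicted_labels):
    """Return {class label: fraction of that class predicted correctly}."""
    totals, hits = Counter(), Counter()
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        hits[t] += int(t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Hypothetical usage with the paths from the example above:
#
#     truth = read_labels("test/images/test-label.dat")
#     predicted = read_labels("tmp/inf-labels.dat")
#     print(accuracy(truth, predicted))
#     print(per_class_accuracy(truth, predicted))
```

A per-class breakdown can show whether a moderate overall accuracy comes from a few classes that are consistently confused.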

Sample data

The sample data in test/images was downloaded from http://www.cs.cmu.edu/~chongw/data/images.tgz on July 12, 2013.

Description of data from original site:

A preprocessed 8-class image dataset from Labelme.

UIUC Sports annotation files: annotations and meta information. The source image files can be found here. (Note: there might be some discrepancies and I don't seem to know why...)

License

Copyright © 2009, Chong Wang, David Blei and Li Fei-Fei

Licensed under both the GPL v2 and GPL v3, as well as any future version of the GNU General Public License.
