eponvert / maul

This project forked from jasonbaldridge/maul

A simple command-line access to computing topic models with Mallet.

License: Apache License 2.0

Scala 2.76% Shell 0.26% Java 96.98%

maul

A simple codebase that provides command-line access for computing topic models with Mallet and language models with BerkeleyLM. It also includes, in maul/data/sxsw, the data I used for my blog post on SXSW proposal titles, so that others can try the same experiments.

Note: Mallet is used as a standard dependency. The code for BerkeleyLM is included directly in this project as there is no up-to-date published version of it.

Installation

To use it, you need to obtain and compile the code. Here's the recipe, with an attempt to be helpful if you are not an experienced Unix user.

First, clone this repository and compile the code.

$ git clone https://github.com/jasonbaldridge/maul.git
$ cd maul
$ ./build compile

If this went smoothly, you should be ready to go!

Topic modeling on the 20 Newsgroups data

First, you need some data, so go to the maul top-level and do the following:

$ wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
$ tar xzf 20news-bydate.tar.gz 

This will give you the 20 Newsgroups data. We'll use the 20news-bydate-train directory that you should have as a result.
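As a quick sanity check before training, the sketch below counts the files under each subdirectory of the extracted data. It assumes the archive unpacks to one subdirectory per newsgroup with one file per post; `count_docs` is an illustrative helper name and is not part of maul.

```shell
# count_docs DIR: print "<file count> <name>" for each immediate subdirectory of DIR.
# Assumption: DIR holds one subdirectory per newsgroup, one file per post.
count_docs() (
  dir=$1
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; exit 1; }
  for sub in "$dir"/*/; do
    # If the glob matched nothing, the literal pattern is skipped here.
    [ -d "$sub" ] || continue
    n=$(find "$sub" -type f | wc -l | tr -d ' ')
    printf '%s %s\n' "$n" "$(basename "$sub")"
  done
)
```

Calling `count_docs 20news-bydate-train` from the maul top-level should list one line per newsgroup; an error message suggests the archive was extracted somewhere else.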

To compute topics from it, boot up SBT:

$ ./build

You will now be in SBT, which can compile the (minimal) code in this repository and ensure that all the relevant classes (e.g. from Mallet) are available. Now, you can learn topics for the 20 Newsgroups data above by doing the following:

> run-main maul.topics.MalletLda 20news-bydate-train

You will see Mallet's progress as it estimates the topics, followed by the final output.

There are a number of options, which you can see by using the --help option with either MalletLda or FactorieLda:

> run-main maul.topics.MalletLda --help
> run-main maul.topics.FactorieLda --help

Here are example commands with options:

> run-main maul.topics.MalletLda --num-topics 200 --num-iterations 1000 --num-threads 1 20news-bydate-train
> run-main maul.topics.FactorieLda --num-topics 200 --num-iterations 1000 --num-threads 1 --read-dirs 20news-bydate-train

There is also a script to try it out without SBT. Run either of the following commands (assuming that you have compiled the project):

$ ./maul.sh mallet-lda --num-topics 200 --num-iterations 1000 --num-threads 1 20news-bydate-train
$ ./maul.sh factorie-lda --num-topics 200 --num-iterations 1000 --num-threads 1 --read-dirs 20news-bydate-train
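If you want to compare several topic counts, a small loop can generate the commands for you. The sketch below only prints the commands (a dry run). The helper name `lda_cmd`, the topic counts 50, 100 and 200, and the `topics-K.txt` file names are illustrative choices; the `--output` option is the one used in the longer mallet-lda example in this README.

```shell
# lda_cmd K: print the maul.sh command that trains K topics and writes
# them to topics-K.txt (an illustrative file name).
lda_cmd() {
  printf './maul.sh mallet-lda --num-topics %s --num-iterations 1000 --num-threads 1 --output topics-%s.txt 20news-bydate-train\n' "$1" "$1"
}

# Dry run: print one command per topic count without executing anything.
for k in 50 100 200; do
  lda_cmd "$k"
done
```

Once the printed commands look right, piping the loop's output to `sh` from the maul top-level would run them one after another.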

Here is an example of some further options that (a) turn off standard tokenization (useful if you have non-language data or just want to be sure to split on whitespace), (b) write the topics to a named file, (c) write the topics one word per line (each topic is headed by its topic index, with its words on the following lines), and (d) set the number of words to display per topic.

$ ./maul.sh mallet-lda --num-topics 200 --num-iterations 1000 --whitespace-tokenization --output topics-20news-output.txt --output-new-lines --words-to-display 100 20news-bydate-train
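The one-word-per-line output is easy to post-process. As a rough sketch, the function below folds such a file back into one row per topic. It assumes that a line holding only digits is a topic index and that every other non-blank line is a word of the current topic; the exact file layout is an assumption here, so check a real output file first. `topics_to_rows` is an illustrative name, not part of maul.

```shell
# topics_to_rows FILE: print each topic as "index<TAB>word word ...".
# Assumption: a digits-only line starts a new topic; all other non-blank
# lines are words of that topic. A purely numeric word would be misread
# as a topic index.
topics_to_rows() {
  awk '
    /^[0-9]+$/ { if (seen) print idx "\t" words; idx = $0; words = ""; seen = 1; next }
    NF         { words = (words == "" ? $0 : words " " $0) }
    END        { if (seen) print idx "\t" words }
  ' "$1"
}
```

For example, `topics_to_rows topics-20news-output.txt | head -5` would show the first five topics, one per line.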

Contributors

hardik-vala, jasonbaldridge
