
This project forked from foroughp/alto-acl-2016

ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling

Introduction

This file walks through using ALTO (http://aclweb.org/anthology/P/P16/P16-1110.pdf) for label induction and document labeling. The code also lets you run the other three conditions described in the paper.

Check out

git clone https://github.com/Foroughp/ALTO-ACL-2016.git

Requirements

You need to have Tomcat installed on your computer. Locate your Tomcat installation and let $TOMCAT be its directory.

Compiling and Running

Let $BASEDIR be the directory containing the ALTO code.

Please try the synthetic data first:

Set constants:

Open $BASEDIR/src/util/Constants.java. Set ABS_BASE_DIR to the absolute path of $BASEDIR/WebContent/results, and TEXT_DATA_DIR to the absolute path of $BASEDIR/text_data/synthetic/
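
One way to obtain the two absolute paths is to let the shell resolve them. The sketch below assumes you pass your checkout directory as the first argument; `abs_path` and `print_alto_constants` are hypothetical helper names, not part of ALTO, and the printed values still have to be pasted into Constants.java by hand.

```shell
# Resolve a directory to its absolute, symlink-free path.
# Fails (non-zero exit) if the directory does not exist.
abs_path() {
  (cd "$1" 2>/dev/null && pwd -P)
}

# Print the two values to paste into src/util/Constants.java.
# The trailing slash on TEXT_DATA_DIR mirrors the path given above.
print_alto_constants() {
  local basedir="$1" results textdata
  results="$(abs_path "$basedir/WebContent/results")" || return 1
  textdata="$(abs_path "$basedir/text_data/synthetic")" || return 1
  printf 'ABS_BASE_DIR  = %s\n' "$results"
  printf 'TEXT_DATA_DIR = %s/\n' "$textdata"
}
```

Usage: `print_alto_constants "$BASEDIR"`.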

Compile:

  • cd $BASEDIR

  • mkdir WebContent/WEB-INF/classes

  • javac -cp WebContent/WEB-INF/lib/*:$TOMCAT/* src/*/*.java -d WebContent/WEB-INF/classes
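
The compile steps above can be wrapped in a small script so they are repeatable. This is a sketch under assumptions: the default `/opt/tomcat` location and the function names `alto_classpath` and `compile_alto` are placeholders, not part of ALTO. It uses `mkdir -p` so a second run does not fail on the existing directory, and it quotes the classpath so that `javac`, not the shell, expands the `*` wildcards.

```shell
# Placeholders: point these at your own checkout and Tomcat directory.
BASEDIR="${BASEDIR:-$PWD}"
TOMCAT="${TOMCAT:-/opt/tomcat}"   # assumed location, adjust as needed

# Build the classpath string used in the javac command above.
alto_classpath() {
  printf '%s' "WebContent/WEB-INF/lib/*:${TOMCAT}/*"
}

# Compile all sources under src/ into WebContent/WEB-INF/classes.
compile_alto() {
  cd "$BASEDIR" || return 1
  mkdir -p WebContent/WEB-INF/classes   # -p: no error if it already exists
  javac -cp "$(alto_classpath)" src/*/*.java -d WebContent/WEB-INF/classes
}
```

Usage: set `BASEDIR` and `TOMCAT`, load the functions, then run `compile_alto`.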

Copy to the tomcat webapps:

  • cp -r WebContent $TOMCAT/webapps/

  • mv $TOMCAT/webapps/WebContent $TOMCAT/webapps/alto-release
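
The copy and rename above can also be done in one step with a guard against overwriting an earlier deployment. A minimal sketch, assuming the directory layout described in this file; `deploy_alto` is a hypothetical helper name. When the target does not yet exist, `cp -r WebContent .../alto-release` produces the same result as the `cp` followed by `mv`.

```shell
# Copy $1/WebContent to $2/webapps/alto-release.
# $1 is the ALTO checkout (BASEDIR), $2 is the Tomcat directory (TOMCAT).
deploy_alto() {
  local src="$1/WebContent" webapps="$2/webapps"
  local dest="$webapps/alto-release"
  [ -d "$src" ] || { echo "missing $src" >&2; return 1; }
  [ -d "$webapps" ] || { echo "missing $webapps" >&2; return 1; }
  # Refuse to copy into an existing deployment: cp -r would nest
  # WebContent inside it instead of replacing it.
  [ ! -e "$dest" ] || { echo "$dest already exists, remove it first" >&2; return 1; }
  cp -r "$src" "$dest"
}
```

Usage: `deploy_alto "$BASEDIR" "$TOMCAT"`.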

Run server:

  • Start the server: $TOMCAT/bin/catalina.sh start

  • Open your browser: http://localhost:8080/alto-release/

  • Start using ALTO (see next step)

  • Stop the server: $TOMCAT/bin/catalina.sh stop

Try ALTO:

See the ALTO paper (ACL 2016) for details about this interface.

  • Type user name, choose a condition, and click "Start".
  • You can click the documents to see the actual content.
  • In the TA or TR conditions, check the topics: the top words and the top related documents are displayed for each topic. You can click the documents to see the actual content.
  • You can assign a label to a document either by choosing an existing label or by creating a new one.
  • Once you have at least two labels, the system will suggest a document to label by scrolling to that document and drawing a red box around it.
  • The logs are available in the results directory.

Trying a new dataset:

To try a new dataset with ALTO, assume your dataset is named $CORPUS. Put the data in the $BASEDIR/text_data/$CORPUS folder, with each document in a separate file. Then create the results directories:

  • mkdir $BASEDIR/WebContent/results/$CORPUS
  • mkdir $BASEDIR/WebContent/results/$CORPUS/input
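
The directory setup can be scripted together with a quick check that the documents are in place. A sketch, not part of ALTO: `prepare_corpus_dirs` is a hypothetical helper name. It also creates the output directory, on the assumption that the MALLET training step later in this file needs it to exist because it writes model.topics and model.docs there. The function prints the number of files found directly inside the corpus folder, which should equal your document count since each document is a separate file.

```shell
# Create the results directories for a corpus and report how many
# document files were found. $1 is BASEDIR, $2 is the corpus name.
prepare_corpus_dirs() {
  local basedir="$1" corpus="$2"
  local data="$basedir/text_data/$corpus"
  local results="$basedir/WebContent/results/$corpus"
  [ -d "$data" ] || { echo "no documents found at $data" >&2; return 1; }
  mkdir -p "$results/input" "$results/output" || return 1
  # Count regular files directly inside the corpus folder.
  find "$data" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

Usage: `prepare_corpus_dirs "$BASEDIR" "$CORPUS"`.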

Generate the files in the data folder ($BASEDIR/WebContent/data/):

  • $CORPUS.html: Follow the synthetic.html format. Keep the headers the same, and format your documents the same way. You don't have to put all your documents in one HTML file; you can split them into several smaller files, as long as they are consistent with the URLs defined in $CORPUS.url.
  • $CORPUS.titles: Follow the synthetic.titles format. The content is used for display in the interface. For the synthetic data, the titles are the same as the actual content. For real-world data, however, titles will be shorter than the content.

Generate the files in the input folder:

  • The mallet input data $CORPUS-topic-input.mallet: ./bin/mallet import-dir --input $BASEDIR/text_data/$CORPUS --output $BASEDIR/WebContent/results/$CORPUS/input/$CORPUS-topic-input.mallet --keep-sequence (Please refer to mallet website for more details.)

  • $CORPUS.url: Follow the synthetic.url format. Note that this file corresponds to the *.html files in $BASEDIR/WebContent/data/. Each URL defines the path used to look up a document in the HTML file (synthetic.html for the synthetic data).

  • $CORPUS.gold: The gold labels associated with the documents. Follow the format in synthetic.gold. This file is optional. If you don't have gold-standard labels for your dataset, purity will be reported as -1.

Generate the files in the output folder:

  • The model.topics and model.docs files: ./bin/mallet train-topics --input $BASEDIR/WebContent/results/$CORPUS/input/$CORPUS-topic-input.mallet --num-topics $NUMTOPICS --topic-word-weights-file $BASEDIR/WebContent/results/$CORPUS/output/model.topics --output-doc-topics $BASEDIR/WebContent/results/$CORPUS/output/model.docs (Please refer to mallet website for more details.)
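
The two MALLET commands (import, then topic training) can be chained in one function. This is a sketch under stated assumptions: `MALLET_HOME`, `DRY_RUN`, `run`, and `mallet_pipeline` are hypothetical names introduced here, while the MALLET subcommands and flags are the ones quoted in this file. With `DRY_RUN=1` the commands are only printed, which is a cheap way to check the paths before running anything.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1 is BASEDIR, $2 is the corpus name, $3 is the number of topics.
mallet_pipeline() {
  local basedir="$1" corpus="$2" num_topics="$3"
  local results="$basedir/WebContent/results/$corpus"
  local input="$results/input/$corpus-topic-input.mallet"
  # MALLET_HOME is an assumed variable; it defaults to the current
  # directory, matching the ./bin/mallet form used above.
  local mallet="${MALLET_HOME:-.}/bin/mallet"

  run mkdir -p "$results/input" "$results/output" || return 1
  run "$mallet" import-dir \
    --input "$basedir/text_data/$corpus" \
    --output "$input" \
    --keep-sequence || return 1
  run "$mallet" train-topics \
    --input "$input" \
    --num-topics "$num_topics" \
    --topic-word-weights-file "$results/output/model.topics" \
    --output-doc-topics "$results/output/model.docs"
}
```

Usage: `DRY_RUN=1 mallet_pipeline "$BASEDIR" "$CORPUS" "$NUMTOPICS"` to preview, then the same call without `DRY_RUN=1` to run it.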

Setup new data in the interface:

  • Change the CORPUS_NAME and TEXT_DATA_DIR variables in $BASEDIR/src/util/Constants.java.
  • Change the topic model settings in $BASEDIR/src/util/Constants.java.
  • If need be, change the countdown time: look for "var start_itm = 40;" in WebContent/ui.html and change the number to the time you want.
