Incorporating Content Structure into Text Analysis Applications

Christina Sauper Aria Haghighi Regina Barzilay

Abstract

In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.

Full Text: http://groups.csail.mit.edu/rbg/code/content_structure/sauper-emnlp-10.pdf

Code

This code is available for research use only.

These instructions mainly pertain to the Amazon and Yelp data sets for the multi-aspect phrase extraction task.

Running

Source code is included. To get started quickly, I recommend installing maven2, then compiling with mvn compile in the base directory. After that, run the system as follows, substituting your desired config file and memory limit:

java -ea -server -mx5G -Djava.ext.dirs=lib -cp target/classes phrase.jointtopic.Main amazon-conf.yaml

Config files

amazon-conf.yaml yelp-conf.yaml

These config files specify the parameters for the model. See GlobalOptions.java for additional options.

conf/amazon.list conf/yelp.list

Lists of token files (see below for format). Those marked as TRAIN_LABELED will be used as labeled input documents; those marked as TEST_LABELED will be used at test time.

Data

We performed experiments on three separate corpora: a set of Amazon HDTV reviews (59 labeled, 12.8k unlabeled), a set of Yelp restaurant reviews (96 labeled, 31k unlabeled), and a set of IGN DVD reviews (665 labeled).

Formats

*.tok

Tokenized file, one word per line. The columns of this file are as follows:

  • word
  • sentence #
  • word # (sent)
  • start-char
  • end-char
  • start-char (sent)
  • end-char (sent)
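As a rough illustration of the seven-column layout above, here is a minimal Python sketch of a reader for *.tok content. It is not part of the released code. It assumes the columns are separated by whitespace and that every column after the word is an integer; the names Token, parse_tok_line and parse_tok are hypothetical, and the sample values in the usage note are made up. Whether the counters start at 0 or 1 is not stated here, so the sketch does not depend on it.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One line of a *.tok file (field names are illustrative)."""
    word: str
    sentence: int                # sentence #
    word_in_sentence: int        # word # (sent)
    start_char: int              # start-char
    end_char: int                # end-char
    start_char_in_sentence: int  # start-char (sent)
    end_char_in_sentence: int    # end-char (sent)


def parse_tok_line(line: str) -> Token:
    # Assumption: columns are whitespace-separated and the word itself
    # contains no whitespace.
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    # Assumption: every column after the word is an integer.
    numbers = [int(field) for field in fields[1:]]
    return Token(fields[0], *numbers)


def parse_tok(text: str) -> List[Token]:
    # Blank lines are skipped; every other line must be a full token row.
    return [parse_tok_line(line) for line in text.splitlines() if line.strip()]
```

For example, parse_tok("The 0 0 0 3 0 3\nscreen 0 1 4 10 4 10\n") yields two Token records, the second with word "screen" and start_char 4. If the real files use a different delimiter, only parse_tok_line needs to change.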

*.ann

Annotations corresponding to the tokenized file, one word's label per line. Some annotation files are tagged with begin / inside / end tokens; whether these are stripped automatically is controlled by an option in the config file.
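To make the stripping of begin / inside / end tags concrete, here is a small Python sketch. It is not the released implementation, and the exact tag spelling is an assumption: it supposes the position tags appear as the prefixes "B-", "I-" and "E-" in front of the label (for example "B-food"). The names POSITION_PREFIXES, strip_position_tag and read_labels are hypothetical. Check the actual *.ann files and GlobalOptions.java for the real convention.

```python
from typing import List

# Assumption: position tags are written as these prefixes on each label.
POSITION_PREFIXES = ("B-", "I-", "E-")


def strip_position_tag(label: str) -> str:
    """Return the label without a leading begin/inside/end prefix, if any."""
    for prefix in POSITION_PREFIXES:
        if label.startswith(prefix):
            return label[len(prefix):]
    return label


def read_labels(ann_text: str, strip: bool = True) -> List[str]:
    """Read one label per line, optionally stripping the position tags."""
    # Blank lines are skipped here; if the real files use blank lines to
    # stay aligned with the *.tok file, keep them instead.
    labels = [line.strip() for line in ann_text.splitlines() if line.strip()]
    if strip:
        return [strip_position_tag(label) for label in labels]
    return labels
```

With these assumptions, read_labels("B-food\nI-food\nE-food\nO\n") returns ["food", "food", "food", "O"], and passing strip=False returns the labels unchanged.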

Demo

A demo system is online at http://condensr.com. Backend code is online at https://github.com/csauper/condensr.

The demo system combines the results of multi-aspect phrase extraction on Yelp reviews with Google Maps and other restaurant metadata to display concise summaries of restaurants matching a search in a particular area.

Each restaurant summary presents the following:

  • Basic restaurant information (name, address, location)
  • Automatically extracted and categorized highlights from all Yelp reviews for the restaurant in several categories (food, ambiance, service, value, overall)
  • Automatically determined sentiment classification (positive, negative, neutral)
  • Supporting highlights to expand on the main points
