rayxu14 / swda

This project forked from glicerico/swda

Switchboard Dialog Act Corpus with Penn Treebank links

License: GNU General Public License v2.0

Overview

The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2, with turn/utterance-level dialog-act tags. The tags summarize syntactic, semantic, and pragmatic information about the associated turn. The SwDA project was undertaken at the University of Colorado at Boulder in the late 1990s.

The SwDA is not inherently linked to the Penn Treebank 3 parses of Switchboard, and it is far from straightforward to align the two resources. In addition, the SwDA is not distributed with the Switchboard's tables of metadata about the conversations and their participants.

This project includes a version of the corpus (swda.zip) that pools all of this information to the best of my ability. In addition, it includes Python classes that should make it easy to work with this merged resource.

This project was originally part of my LSA Linguistic Institute 2011 course Computational Pragmatics.

The code in this repository is compatible with Python 2 and Python 3. Its only other external dependency is NLTK, with the data installed so that WordNet is available.
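
A quick way to confirm that the NLTK requirement is met is sketched below. It uses nltk.data.find, which raises LookupError when a resource is missing; the helper name wordnet_is_installed is made up for this illustration and is not part of swda.py.

```python
def wordnet_is_installed():
    """Return True if NLTK is importable and its WordNet data can be located."""
    try:
        from nltk.data import find
        find("corpora/wordnet")
    except (ImportError, LookupError):
        return False
    return True


if __name__ == "__main__":
    if wordnet_is_installed():
        print("NLTK and WordNet are ready.")
    else:
        # With NLTK installed, the data can be fetched once with:
        #   import nltk; nltk.download("wordnet")
        print("WordNet data not found; run the download call noted above.")
```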

Citation

If you use this resource, please cite

@techreport{Jurafsky-etal:1997,
	Address = {Boulder, CO},
	Author = {Jurafsky, Daniel and Shriberg, Elizabeth and Biasca, Debra},
	Institution = {University of Colorado, Boulder Institute of Cognitive Science},
	Number = {97-02},
	Title = {Switchboard {SWBD}-{DAMSL} Shallow-Discourse-Function Annotation Coders Manual, Draft 13},
	Year = {1997}}

@article{Shriberg-etal:1998,
	Author = {Shriberg, Elizabeth and Bates, Rebecca and Taylor, Paul and Stolcke, Andreas and Jurafsky, Daniel and Ries, Klaus and Coccaro, Noah and Martin, Rachel and Meteer, Marie and Van Ess-Dykema, Carol},
	Journal = {Language and Speech},
	Number = {3--4},
	Pages = {439--487},
	Title = {Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?},
	Volume = {41},
	Year = {1998}}

@article{Stolcke-etal:2000,
	Author = {Stolcke, Andreas and Ries, Klaus and Coccaro, Noah and Shriberg, Elizabeth and Bates, Rebecca and Jurafsky, Daniel and Taylor, Paul and Martin, Rachel and Meteer, Marie and Van Ess-Dykema, Carol},
	Journal = {Computational Linguistics},
	Number = {3},
	Pages = {339--371},
	Title = {Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech},
	Volume = {26},
	Year = {2000}}

Files

  • swda.py: the module for processing this corpus distribution
  • swda.zip: the corpus; needs to be unzipped
  • swda_functions.py: some simple examples aggregating information with CorpusReaders
  • metadata_processor.py: auxiliary processing file used to create swda/swda-metadata.csv

Transcript objects

The code's Transcript objects model the individual files in the corpus. A Transcript object is built from a transcript filename and the corpus metadata file:

from swda import Transcript

trans = Transcript('swda/sw00utt/sw_0001_4325.utt.csv', 'swda/swda-metadata.csv')

trans.topic_description
'CHILD CARE'

trans.prompt
'FIND OUT WHAT CRITERIA THE OTHER CALLER WOULD USE IN SELECTING CHILD \
CARE SERVICES FOR A PRESCHOOLER.  IS IT EASY OR DIFFICULT TO FIND SUCH CARE?'

trans.talk_day
datetime.datetime(1992, 3, 23, 0, 0)

trans.talk_day.year
1992

trans.talk_day.month
3

trans.from_caller
1632

trans.from_caller_sex
'FEMALE'

Transcript instances have many attributes:

for a in sorted([a for a in dir(trans) if not a.startswith('_')]):
	print(a)

conversation_no
from_caller
from_caller_birth_year
from_caller_dialect_area
from_caller_education
from_caller_sex
header
length
metadata
prompt
ptd_basename
swda_filename
talk_day
to_caller
to_caller_birth_year
to_caller_dialect_area
to_caller_education
to_caller_sex
topic_description
utterances
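
These attributes combine naturally. As a minimal sketch, the function below estimates each caller's age in the year of the call from talk_day and the two birth-year attributes. The name caller_ages is hypothetical, the code assumes the birth years can be converted with int(), and the demo object is a stand-in with invented birth years rather than a real Transcript.

```python
import datetime
from types import SimpleNamespace


def caller_ages(trans):
    """Approximate each caller's age in the year the conversation took place."""
    year = trans.talk_day.year
    return {
        "from_caller": year - int(trans.from_caller_birth_year),
        "to_caller": year - int(trans.to_caller_birth_year),
    }


# Stand-in for a Transcript; the birth years here are invented for the demo.
demo = SimpleNamespace(
    talk_day=datetime.datetime(1992, 3, 23),
    from_caller_birth_year=1962,
    to_caller_birth_year=1950,
)
print(caller_ages(demo))  # {'from_caller': 30, 'to_caller': 42}
```

With the corpus unzipped, the same function should accept a real Transcript such as trans, provided the assumption about the birth-year values holds.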

Utterance objects

Each element of a Transcript's utterances list is an Utterance object, with many attributes and methods of its own. Some examples:

utt = trans.utterances[19]

utt.caller
'B'

utt.act_tag
'sv'

utt.text
'[ I guess + --'

utt.pos
'[ I/PRP ] guess/VBP --/:'

utt.pos_words()
['I', 'guess', '--']

utt.pos_lemmas(wn_lemmatize=True)
[('I', 'prp'), ('guess', 'v'), ('--', ':')]

len(utt.trees)
1

utt.trees[0].pprint()
'(S
  (EDITED
    (RM (-DFL- \\[))
    (S (NP-SBJ (PRP I)) (VP-UNF (VBP guess)))
    (IP (-DFL- \\+)))
  (NP-SBJ (PRP I))
  (VP
    (VBP guess)
    (RS (-DFL- \\]))
    (SBAR
      (-NONE- 0)
      (S (NP-SBJ (PRP we)) (VP (MD can) (VP (VB start))))))
  (. .))'

Because the trees often properly contain the utterance, they cannot be used to gather word- or phrase-level statistics unless care is taken to restrict attention to the subtrees, or fragments thereof, that represent the utterance itself.
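
One rough way to detect this situation is to compare how many spoken words a tree contains with how many the utterance contains. The sketch below is only a heuristic, not the alignment method used to build the corpus. It assumes the tree's leaves are available as (word, tag) pairs, as NLTK's Tree.pos() would return if the trees are NLTK Tree objects; the function names are hypothetical.

```python
# Tags that mark disfluency symbols and empty elements rather than spoken words.
MARKUP_TAGS = {"-DFL-", "-NONE-"}


def spoken_words(tagged_leaves):
    """Keep leaves that look like real words: no markup tags, no bare punctuation."""
    return [
        word
        for word, tag in tagged_leaves
        if tag not in MARKUP_TAGS and any(ch.isalnum() for ch in word)
    ]


def tree_overshoots(tagged_leaves, utt_words):
    """Heuristic: True if the tree holds more spoken words than the utterance."""
    utt_spoken = [w for w in utt_words if any(ch.isalnum() for ch in w)]
    return len(spoken_words(tagged_leaves)) > len(utt_spoken)


# (word, tag) leaves of the tree shown above, written out by hand.
leaves = [
    ("\\[", "-DFL-"), ("I", "PRP"), ("guess", "VBP"), ("\\+", "-DFL-"),
    ("I", "PRP"), ("guess", "VBP"), ("\\]", "-DFL-"), ("0", "-NONE-"),
    ("we", "PRP"), ("can", "MD"), ("start", "VB"), (".", "."),
]
print(tree_overshoots(leaves, ["I", "guess", "--"]))  # True
```

Here the tree carries seven spoken words against the utterance's two, so statistics taken from the whole tree would overcount.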

Not all utterances have trees; only a subset of the Switchboard is fully parsed. Thus, of the 221,616 utterances in the SwDA, 118,218 (53%) have at least one tree.
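
A count of this kind can be sketched as follows. The function name tree_coverage is hypothetical; it needs only objects with a trees attribute, so the demo uses stand-ins instead of the real corpus.

```python
from types import SimpleNamespace


def tree_coverage(utterances):
    """Return (utterances with at least one tree, total utterances)."""
    with_trees = 0
    total = 0
    for utt in utterances:
        total += 1
        if len(utt.trees) > 0:
            with_trees += 1
    return with_trees, total


# Stand-in utterances: two with a (dummy) tree, one without.
demo = [
    SimpleNamespace(trees=["tree"]),
    SimpleNamespace(trees=[]),
    SimpleNamespace(trees=["tree"]),
]
print(tree_coverage(demo))  # (2, 3)
```

On the unzipped corpus, the argument would be the utterance iterator of a CorpusReader, for example CorpusReader('swda').iter_utterances().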

CorpusReader objects

The main interface provided by swda.py is the CorpusReader, which allows you to iterate through the entire corpus, gathering information as you go. A CorpusReader object is built from just the root of the directory containing your CSV files. (It assumes that swda-metadata.csv sits directly inside that root directory, as in swda/swda-metadata.csv.)

from swda import CorpusReader
corpus = CorpusReader('swda')

The two central methods for CorpusReader objects are iter_transcripts and iter_utterances. The method iter_utterances is basically an abbreviation of the following nested loop:

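def iter_utterances(corpus):
    # Visit every transcript, then every utterance within it, in corpus order.
    for trans in corpus.iter_transcripts():
        for utt in trans.utterances:
            yield utt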

For some illustrations, see swda_functions.py.
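
In the same spirit, here is a minimal sketch of one such aggregation: counting dialog-act tags over a stream of utterances. The function name act_tag_counts is hypothetical, and it is not taken from swda_functions.py. It relies only on the act_tag attribute shown earlier, so the demo uses stand-in objects with illustrative tags.

```python
from collections import Counter
from types import SimpleNamespace


def act_tag_counts(utterances):
    """Count how often each dialog-act tag occurs in an iterable of utterances."""
    return Counter(utt.act_tag for utt in utterances)


# Stand-in utterances; the tag sequence is invented for the demo.
demo = [SimpleNamespace(act_tag=tag) for tag in ["sv", "sd", "sv", "b"]]
counts = act_tag_counts(demo)
print(counts.most_common(1))  # [('sv', 2)]
```

With the corpus in place, act_tag_counts(corpus.iter_utterances()) would tally the tags across all transcripts in a single pass.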

For more

There's a much fuller overview here: http://compprag.christopherpotts.net/swda.html
