
splitta's Introduction

Improved Sentence Boundary Detection

Dan Gillick
January 21, 2009

-----

Consider the following text:

"On Jan. 20, former Sen. Barack Obama became the 44th President of 
the U.S. Millions attended the Inauguration."

The periods are potentially ambiguous, signifying either the end of a 
sentence, an abbreviation, or both. The sentence boundary detection (SBD) task
involves disambiguating the periods, and in particular, classifying each
period as end-of-sentence (<S>) or not. In the example, only the period
at the end of U.S. should be classified as <S>:

"On Jan. 20, former Sen. Barack Obama became the 44th President of 
the U.S.<S> Millions attended the Inauguration."
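As a toy illustration of the task (this is NOT splitta's model -- it is a crude heuristic using exactly the kind of hand-made abbreviation list this document argues against; the list here is hypothetical):

```python
import re

TEXT = ("On Jan. 20, former Sen. Barack Obama became the 44th President of "
        "the U.S. Millions attended the Inauguration.")

# Candidate boundaries: periods followed by whitespace or end-of-text.
candidates = [m.start() for m in re.finditer(r"\.(?=\s|$)", TEXT)]

def is_boundary(text, pos, nonfinal=("Jan", "Sen", "Mr", "Dr")):
    """Crude rule: a period ends a sentence unless the preceding token is a
    known non-sentence-final abbreviation. Real systems classify instead."""
    match = re.search(r"(\w[\w.]*)\.$", text[:pos + 1])
    token = match.group(1) if match else ""
    return token not in nonfinal

print([is_boundary(TEXT, p) for p in candidates])  # [False, False, True, True]
```

Note how "U.S." only happens to be handled correctly here: a list can say that "Jan." rarely ends a sentence, but it cannot decide the genuinely ambiguous cases.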

Chances are, if you are using some SBD system, it has an error rate of
1%-3% on English newswire text. The system described here achieves the
best known error rate on the Wall Street Journal corpus, 0.25%, and
comparable error rates on the Brown corpus (mixed genre) and other test
corpora.

-----

SBD is fundamental to many natural language processing problems, but only
a few papers describe solutions. A variety of rule-based systems are
floating around, and a few semi-statistical systems are available if you
know where to look. The most widely cited are:

- Alembic (Aberdeen, et al. 1995): Abbreviation list and ~100 hand-crafted 
  regular expressions.
- Satz (Palmer & Hearst at Berkeley, 1997): Part of speech features and 
  abbreviation lists as input to a classifier (neural nets and decision 
  trees have similar performance).
- mxTerminator (Reynar & Ratnaparkhi, 1997): Maximum entropy classification 
  with simple lexical features.
- Mikheev (Mikheev, 2002): Observes that perfect labels for abbreviations 
  and names give almost perfect SBD results. Creates heuristics for 
  marking these, unsupervised, from held-out data.
- Punkt (Strunk and Kiss, 2006): Unsupervised method uses heuristics to 
  identify abbreviations and sentence starters.

I have not been able to find publicly available copies of any of these 
systems, with the exception of Punkt, which ships with NLTK. Nonetheless,
here are some error rates reported on what I believe to be the same 
subset of the WSJ corpus (sections 03-16).

- Alembic: 0.9%
- Satz: 1.5%; 1.0% with extra hand-written lists of abbreviations and 
  non-names.
- mxTerminator: 2.0%; 1.2% with extra abbreviation list.
- Mikheev: 1.4%; 0.45% with abbreviation list (assembled automatically but 
  carefully tuned; test-set-dependent parameters are a concern)
- Punkt: 1.65% (Though if you use the model that ships with NLTK, you'll get
  over 3%)

All of these systems use lists of abbreviations in some capacity, which 
I think is a mistake. Some abbreviations almost never end a sentence (Mr.), 
which makes list-building appealing. But many abbreviations are more 
ambiguous (U.S., U.N.), which complicates the decision. 

-----

While 1%-3% is a low error rate, it is often not good enough. In
automatic document summarization, for example, including a sentence fragment
usually renders the resulting summary unintelligible. With 10-sentence
summaries, 1 in 10 is ruined by an SBD system with 99% accuracy; improving
the accuracy to 99.75%, only 1 in 40 is ruined. Improved sentence boundary
detection is also likely to help with language modeling and text alignment.
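The arithmetic behind those ratios, assuming one boundary decision per sentence and independent errors:

```python
def frac_ruined(error_rate, n_sentences=10):
    # Probability that at least one of n_sentences boundaries is mis-split,
    # assuming errors are independent.
    return 1 - (1 - error_rate) ** n_sentences

print(round(1 / frac_ruined(0.01)))    # ~1 in 10 summaries ruined at 99%
print(round(1 / frac_ruined(0.0025)))  # ~1 in 40 ruined at 99.75%
```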

-----

I built a supervised system that classifies sentence boundaries without 
any heuristics or hand-generated lists. It uses the same training data
as mxTerminator, and allows for Naive Bayes or SVM models (SVM Light).

----------------------------------------------------------
Corpus                              SVM        Naive Bayes
----------------------------------------------------------
WSJ                                 0.25%      0.35%
Brown                               0.36%      0.45%
Complete Works of Edgar Allan Poe:  0.52%      0.44%
----------------------------------------------------------

I've packaged this code, written in Python, for general use. Word-level
tokenization, which is particularly important for good sentence boundary
detection, is included. 

Note that the included models use all of the labeled data listed here, 
meaning that the expected results are somewhat better than the numbers 
reported above. Including the Brown data as training improves the WSJ 
result to 0.22% and the Poe result to 0.37% (using the SVM).

-----

A few other notes on performance. The standard WSJ test corpus includes
26977 possible sentence boundaries. About 70% are in fact sentence
boundaries. Classification with the included SVM model will give 59 
errors. Of these, 24 (41%) involve the word "U.S.", a particularly
interesting case. In training, "U.S." appears 2029 times, and 90 of these
are sentence boundaries. Further complicating the situation, "U.S."
often appears in a context like "U.S. Security Council" or "U.S.
Government", where either "Security" or "Government" is a viable sentence
starter.
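From the training counts quoted above, the prior probability that a period after "U.S." ends a sentence is small but far from negligible:

```python
# Training counts quoted above: 2029 occurrences of "U.S.", 90 at boundaries.
p_boundary = 90.0 / 2029
print(round(p_boundary, 3))  # 0.044 -- rare enough that context must decide
```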

Other confusing cases include "U.N.", "U.K.", and state abbreviations
like "N.Y." which have characteristics similar to "U.S." but appear
somewhat less frequently.

-----

Setup:

(1) You need Python 2.5 or later. Python 3 does not seem to work.
(2) To use SVM models, you'll need SVM Light (http://svmlight.joachims.org/)
      - once installed, you'll need to modify sbd.py slightly:
        at the top, change the paths to SVM_LEARN and SVM_CLASSIFY to point
        to the files you've installed.
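For example, the top of sbd.py might end up looking like this (the paths below are placeholders; use wherever you installed SVM Light):

```python
## At the top of sbd.py -- point these at your SVM Light binaries
## (example paths; yours will differ):
SVM_LEARN = '/usr/local/bin/svm_learn'
SVM_CLASSIFY = '/usr/local/bin/svm_classify'
```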

-----

Example calls:

(show command line options)
python sbd.py -h

(split sentences in sample.txt using the provided Naive Bayes model)
python sbd.py -m model_nb sample.txt

(now using the provided SVM model)
python sbd.py -m model_svm sample.txt

(now keeping tokenized output)
python sbd.py -m model_nb -t sample.txt

(now writing output to sample.sent)
python sbd.py -m model_nb -t sample.txt -o sample.sent

-----

Note about SVM_LIGHT:

The provided SVM model was built with SVM_LIGHT version 6.02. It seems that
SVM_CLASSIFY requires a matching version or it will crash. So, you can either
try to use version 6.02, or you can make the following quick fix:

open model_svm/svm_model
change the first line from:
SVM-light Version V6.02
to whatever your version is.
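The same edit, scripted (a hypothetical helper; 'V6.10' below stands in for whatever version string your svm_classify reports):

```python
def fix_model_version(path, new_version):
    # Overwrite the version line so SVM_CLASSIFY accepts the model file.
    lines = open(path).readlines()
    lines[0] = 'SVM-light Version %s\n' % new_version
    open(path, 'w').writelines(lines)

# e.g.: fix_model_version('model_svm/svm_model', 'V6.10')
```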
-----

Dan Gillick
January 21, 2009
Berkeley, California

[email protected]

splitta's Issues

can you give me one copy of training data?

For the SVM model, svmlight is used for training and predicting. However, as
we know, the format of the training and testing data is fixed. I want to add
the svmlight DLL to my code, but I keep running into mistakes. So I hope you
can give me a copy of the training data; that way, I can try other SVM tools.
My e-mail is [email protected]. Best wishes!
Thanks very much!

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 3:55

IOError: [Errno 24] Too many open files: '/tmp/tmpiy5qcd'

OS-level file descriptors from tempfile.mkstemp are not closed. Hence, you hit
the ulimit at around 1019 files.

The following patch will close the FDs:

418c412
<         testfd, test_file = tempfile.mkstemp()

---
>         unused, test_file = tempfile.mkstemp()
424c418
<         predfd, pred_file = tempfile.mkstemp()

---
>         unused, pred_file = tempfile.mkstemp()
439,441d432
<         ## need to close file handles 
(http://ubuntuforums.org/showthread.php?t=919340)
<         os.fdopen(testfd,'w').close()
<         os.fdopen(predfd,'w').close()

Original issue reported on code.google.com by [email protected] on 6 Aug 2010 at 7:20
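The essence of the fix: mkstemp returns an already-open OS-level descriptor along with the path, and that descriptor must be closed explicitly (os.close is equivalent to the os.fdopen(...).close() idiom in the patch above):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)                 # release the descriptor immediately...
with open(path, 'w') as fh:  # ...and reopen by path when writing
    fh.write('features\n')
os.remove(path)              # clean up the temp file afterwards
```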

Question marks break sentence detection?

Using the following text as input, the sentences ending in a question mark are 
not detected as sentences.

input = "This is the tale of Mr. Morton. Who is Mr. Morton? He is the subject 
of our tale, and the predicate tells what Mr. Morton must do. Here's a short 
sentence. Mister Morton is who?\nHere's another short sentence."

The resulting split lines are the following:

This is the tale of Mr. Morton.
Who is Mr. Morton? He is the subject of our tale, and the predicate tells what 
Mr. Morton must do.
Here's a short sentence.
Mister Morton is who? Here's another short sentence.

I would expect the sentences to split after both of the question marks.

This problem occurs with Splitta versions 1.03 and svn r21, under Linux and OS 
X 10.8.4, with Python 2.7.2.

Any help with this problem would be enormously appreciated, as we are 
attempting to use Splitta as a crucial component in an NLP pipeline for a 
summer camp at JHU that is underway: 
http://hltcoe.jhu.edu/research/scale-workshops/

Thank you!

Original issue reported on code.google.com by [email protected] on 11 Jun 2013 at 2:16

insert spaces around ':' in a time such as "3:43"

What steps will reproduce the problem?
1. run splitta with tokenization on text with "3:43" for example


What is the expected output? What do you see instead?
I would expect 3:43 as a single token, since it is a time

What version of the product are you using? On what operating system?
N/A

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 8 Feb 2012 at 12:12

Training file format

Hi!

I'm trying to use your system for Portuguese and other latin languages.

I've read other issues asking about the training file format but I could not 
find an answer :(

Can you post/send me a snippet of the training file?

Thank you,
Mario

Original issue reported on code.google.com by [email protected] on 22 Jun 2011 at 5:21

Format of training data?


I want to train up the system on a new batch of data.

You allude to the format (noting MXTERMINATOR, which is not well documented).

Could you actually provide a script to convert PTB2 and Brown into the correct 
format?

Or, at the very least, provide a small snippet showing example training data?

Original issue reported on code.google.com by [email protected] on 12 Aug 2010 at 6:58

__str__() method of Doc goes into infinite loop

What steps will reproduce the problem?
1. try doing "print doc" on any Doc object in the code
2.
3.

What is the expected output? What do you see instead?

Expect it to print the doc.  Instead it goes into an infinite loop.


What version of the product are you using? On what operating system?

Splitta 1.03 running under linux.

Please provide any additional information below.

A possible fix:

    def __str__(self):
        s = []
        curr = self.frag
        while curr:
            s.append(str(curr))
            curr = curr.next
        return '\n'.join(s)


Original issue reported on code.google.com by [email protected] on 25 Jul 2011 at 4:47

Untar'ing the package should create a subdirectory.

When you untar the package, it should put all files in subdir:
  splitta-1.03/

I keep forgetting that splitta doesn't obey this convention, and that it will
untar a bunch of individual files into the current dir.

Original issue reported on code.google.com by [email protected] on 10 Aug 2010 at 5:39

get_data shouldn't assume filenames

sbd.get_data shouldn't assume that you are passing in filenames.
Sometimes you already have text, turn it into an open file stream using
StringIO.StringIO, and pass that in.
See my code here:
   http://github.com/turian/common/blob/master/tokenizer.py

A suggested workaround is to add an optional parameter files_already_opened.

The following code implements this parameter:

93c93
< def get_data(files, expect_labels=True, tokenize=False, verbose=False,
files_already_opened=False):

---
> def get_data(files, expect_labels=True, tokenize=False, verbose=False):
108,111c108
<         if files_already_opened:
<             fh = file
<         else:
<             fh = open(file)

---
>         fh = open(file)
151,154c148
<         if files_already_opened:
<             pass
<         else:
<             fh.close()

---
>         fh.close()



I urge you to include the above patch, because then my tokenize.py code
(github link above) will work on splitta without asking you to patch the code.

A slightly cleaner implementation is to make get_data assume it is passed
file handles, and make a wrapper function get_data_from_filenames that will
open each file before calling get_data.

Original issue reported on code.google.com by [email protected] on 12 Apr 2010 at 7:19
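The "slightly cleaner implementation" suggested above could be sketched like this (the function bodies are stand-ins, not splitta's actual processing):

```python
def get_data_from_handles(handles):
    # Stand-in for the real per-file processing in sbd.get_data.
    return [fh.read() for fh in handles]

def get_data_from_filenames(filenames):
    # Thin wrapper: open by name, delegate, and always close what we opened.
    handles = [open(name) for name in filenames]
    try:
        return get_data_from_handles(handles)
    finally:
        for fh in handles:
            fh.close()
```

With this split, callers holding StringIO objects or already-open streams use get_data_from_handles directly, and no flag parameter is needed.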

Fix zcat on Mac

Hi, on Mac "zcat" does the wrong thing (it looks for a file "feats.Z").  
Instead it needs to be "gzcat".  The following patch fixes it.


Index: sbd_util.py
===================================================================
--- sbd_util.py (revision 20)
+++ sbd_util.py (working copy)
@@ -5,9 +5,11 @@
     cPickle.dump(data, o)
     o.close()

+ZCAT = 'gzcat' if 'Darwin' in os.popen("uname -a").read().split() else 'zcat'
+
 def load_pickle(path):
     #i = gzip.open(path, 'rb')
-    i = os.popen('zcat ' + path)
+    i = os.popen(ZCAT + ' ' + path)
     data = cPickle.load(i)
     i.close()
     return data

Original issue reported on code.google.com by [email protected] on 3 Mar 2011 at 7:59

Loading model data can be vastly sped up

Instead of using the very slow python .gz implementation try

def load_pickle(path):
    #i = gzip.open(path, 'rb')
    i = os.popen("zcat " + path)
    data = cPickle.load(i)
    i.close()
    return data

Thanks for coding this, though; the results are way better than previously
existing systems. I'm using Splitta in a couple of projects aimed at
writers, and the accuracy has made a lot of features feasible.

Original issue reported on code.google.com by [email protected] on 26 Feb 2010 at 6:46
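If portability matters more than speed (the two notes above both shell out to zcat/gzcat), the subprocess can be avoided entirely by reading the gzipped pickle directly -- note that cPickle became pickle in Python 3:

```python
import gzip
import pickle  # cPickle on Python 2

def load_pickle(path):
    # Read the gzipped pickle directly -- no zcat/gzcat dependency.
    with gzip.open(path, 'rb') as fh:
        return pickle.load(fh)
```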

Splitta 1.03 for Python 3

Attached is splitta 1.03, run through Python 2to3, and with next changed to 
__next__ everywhere necessary. It seems to work at least for model_nb so I 
thought I'd share.

Many thanks for the original.

Original issue reported on code.google.com by [email protected] on 29 Apr 2013 at 1:48

Attachments:
