
grantjenks / python-wordsegment


English word segmentation, written in pure-Python, and based on a trillion-word corpus.

Home Page: http://www.grantjenks.com/docs/wordsegment/

License: Apache 2.0

Python 100.00%

python-wordsegment's Introduction

Python Word Segmentation

WordSegment is an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.

Based on code from the chapter "Natural Language Corpus Data" by Peter Norvig from the book "Beautiful Data" (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.

Features

  • Pure-Python
  • Fully documented
  • 100% Test Coverage
  • Includes unigram and bigram data
  • Command line interface for batch processing
  • Easy to hack (e.g. different scoring, new data, different language)
  • Developed on Python 2.7
  • Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4, 3.5, 3.6 and PyPy, PyPy3
  • Tested on Windows, Mac OS X, and Linux
  • Tested using Travis CI and AppVeyor CI

Quickstart

Installing WordSegment is simple with pip:

$ pip install wordsegment

You can access documentation in the interpreter with Python's built-in help function:

>>> import wordsegment
>>> help(wordsegment)

Tutorial

In your own Python programs, you'll mostly want to use segment to divide a phrase into a list of its parts:

>>> from wordsegment import load, segment
>>> load()
>>> segment('thisisatest')
['this', 'is', 'a', 'test']

The load function reads and parses the unigrams and bigrams data from disk. Loading the data only needs to be done once.

WordSegment also provides a command-line interface for batch processing. This interface accepts two arguments: in-file and out-file. Lines from in-file are iteratively segmented, joined by a space, and written to out-file. Input and output default to stdin and stdout respectively.

$ echo thisisatest | python -m wordsegment
this is a test
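To process a whole file in batch, pass the in-file and out-file arguments described above (the file names here are only placeholders):

$ python -m wordsegment infile.txt outfile.txt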

If you want to run WordSegment as a kind of server process then use Python's -u option for unbuffered output. You can also set PYTHONUNBUFFERED=1 in the environment.

>>> import subprocess as sp
>>> wordsegment = sp.Popen(
        ['python', '-um', 'wordsegment'],
        stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.STDOUT)
>>> wordsegment.stdin.write('thisisatest\n')
>>> wordsegment.stdout.readline()
'this is a test\n'
>>> wordsegment.stdin.write('workswithotherlanguages\n')
>>> wordsegment.stdout.readline()
'works with other languages\n'
>>> wordsegment.stdin.close()
>>> wordsegment.wait()  # Process exit code.
0
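The session above reflects Python 2, where the pipe accepts plain strings. Under Python 3 the pipes carry bytes and the parent side of the pipe is buffered, so a rough equivalent (a sketch, not from the original docs) encodes, flushes, and decodes explicitly:

>>> import subprocess as sp
>>> wordsegment = sp.Popen(
        ['python', '-um', 'wordsegment'],
        stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.STDOUT)
>>> wordsegment.stdin.write(b'thisisatest\n')
12
>>> wordsegment.stdin.flush()
>>> wordsegment.stdout.readline()
b'this is a test\n'
>>> wordsegment.stdin.close()
>>> wordsegment.wait()  # Process exit code.
0

Whether the child answers line by line also depends on its own input buffering; see the buffering issue reported further below.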

The maximum segmented word length is 24 characters. Neither the unigram nor bigram data contain words exceeding that length. The corpus also excludes punctuation and all letters have been lowercased. Before segmenting text, clean is called to transform the input to a canonical form:

>>> from wordsegment import clean
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']

Sometimes it's interesting to explore the unigram and bigram counts themselves. These are stored in Python dictionaries mapping word to count.

>>> import wordsegment as ws
>>> ws.load()
>>> ws.UNIGRAMS['the']
23135851162.0
>>> ws.UNIGRAMS['gray']
21424658.0
>>> ws.UNIGRAMS['grey']
18276942.0

Above we see that the spelling gray is more common than the spelling grey.

Bigrams are joined by a space:

>>> import heapq
>>> from pprint import pprint
>>> from operator import itemgetter
>>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[('of the', 2766332391.0),
 ('in the', 1628795324.0),
 ('to the', 1139248999.0),
 ('on the', 800328815.0),
 ('for the', 692874802.0),
 ('and the', 629726893.0),
 ('to be', 505148997.0),
 ('is a', 476718990.0),
 ('with the', 461331348.0),
 ('from the', 428303219.0)]

Some bigrams begin with <s>. This marks the start of a sentence:

>>> ws.BIGRAMS['<s> where']
15419048.0
>>> ws.BIGRAMS['<s> what']
11779290.0

The unigram and bigram data are stored in the wordsegment directory in the unigrams.txt and bigrams.txt files respectively.
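Because the counts are plain tab-separated text, you can also read the files directly. A minimal sketch, assuming the word<TAB>count format that the module's own parser uses and UTF-8 encoding, with the path derived from the installed package:

>>> import io, os
>>> import wordsegment
>>> path = os.path.join(os.path.dirname(wordsegment.__file__), 'unigrams.txt')
>>> with io.open(path, encoding='utf-8') as fptr:
...     unigrams = dict(
...         (word, float(count))
...         for word, count in (line.split('\t') for line in fptr)
...     )
>>> unigrams['the']
23135851162.0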

User Guide

References

WordSegment License

Copyright 2018 Grant Jenks

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

python-wordsegment's People

Contributors

grantjenks, javierhonduco


python-wordsegment's Issues

Text with numbers doesn't segment as expected

Raising an issue that I faced while using this package.

Code for Reproducing the issue:

import wordsegment as ws    
ws.load()     
text = "increased $55 million or 23.8% for"   
ws.segment(text)

Actual Output:

['increased', '55millionor238', 'for']

Expected Output:

// If special symbols are permitted in the final output
['increased', '$55', 'million', 'or', '23.8%', 'for']

// If special symbols such as $ and % are not permitted in the final output
['increased', 'dollar', '55', 'million', 'or', '23.8', 'percent', 'for']

Tested on Python versions:

  • 3.6
  • 3.5
  • 2.7

wordsegment version:

  • 1.3.1

StackOverflow Question Link:

feature_request(mode): preserve all punctuation marks

1. Summary

It would be nice if WordSegment, at least in CLI mode, had an option to preserve all punctuation marks: ., ,, and so on.

2. Problem

  1. Scientific article example
  2. Scientific book example

Try copying and pasting text from this article and book.

  1. The article:

    Sharing-economyfirmsdifferfromold-powerfirmsbecausetheformertypicallyareexponentialnew-powerorganisationscharac-terisedbyPorter’scompetitiveforces.Althoughsomenew-powerfirmsmaychoosenottoembraceastakeholderfocus,stakehold-ersandothernew-powerfirmswillpunishsuchchoices.Inotherwords,counterargumentstothesharingeconomy’sstakeholderpo-tentialbasedonthequestionableactionsofsomenew-powerfirmsareovershadowedbyothernew-powerfirmsandtheirstakehold-ers’actions.

  2. The book:

    Accordingto DavidAllen,authorof the bestsellerGettingThingsDone(2001),informationprofessionalshavea hardtimeaccomplishingtasksbecauseour workis inherentlyambiguous,we takeon too manycommit-ments,andwe cannotprioritizethe bestthingto do fromthe manychoicesbeforeus. J. WesleyCochran(1992),JudithSiess(2002),SamanthaHines(2010),andotherauthorsof timemanagementtreatisesfor librar-iansconcurthatlibrarieshavebeendifficultplacesto workfor years,especiallygivenour complexworkprocessesandoftenintangibleprod-ucts.Nevertheless,we havethe abilityas individualsto adoptbetterstrategiesto managethe everydaychaos.

Yes, ideally, of course, it would be nice to add a proper text layer to the PDF, but I’m not the one making these articles and books. From my experience, I can say that a text layer without spaces like this is a common problem. The routine work of separating words can be time-consuming.

3. Behavior

3.1. Current

CLI usage:

sharing economy firms differ from old power firms because the former typically are exponential new power organisations characterised by porters competitive forces although some new power firms may choose not to embrace a stakeholder focus stakeholders and other new power firms will punish such choices in other words counterarguments to the sharing economy s stakeholder potential based on the questionable actions of some new power firms are overshadowed by other new power firms and their stakeholders actions

according to david allen author of the best seller getting things done 2001 information professionals have a hard time accomplishing tasks because our work is inherently ambiguous we take on too many commitments and we can not prioritize the best thing to do from the many choices before us j wesley cochran 1992judithsiess2002 samantha hines2010 and other authors of time management treatises for librarians concur that libraries have been difficult places to work for years especially given our complex work processes and often intangible products nevertheless we have the ability as individuals to adopt better strategies to manage the everyday chaos

Punctuation marks are stripped. Users have to do a lot of routine work to get them back.

3.2. Expected behavior

Ordinary English texts:

Sharing economy firms differ from old power firms because the former typically are exponential new power organisations characterised by Porter’s competitive forces. Although some new power firms may choose not to embrace a stakeholder focus, stakeholders and other new power firms will punish such choices. In other words, counterarguments to the sharing economy’s stakeholder potential based on the questionable actions of some new power firms are overshadowed by other new power firms and their stakeholders’ actions.

According to David Allen, author of the bestseller Getting Things Done(2001), information professionals have a hard time accomplishing tasks because our work is inherently ambiguous, we take on too many commitments, and we can not prioritize the best thing to do from the many choices before us. J. Wesley Cochran(1992), Judith Siess(2002), Samantha Hines(2010) and other authors of time management treatises for librarians concur that libraries have been difficult places to work for years, especially given our complex work processes and often intangible products. Nevertheless, we have the ability as individuals to adopt better strategies to manage the everyday chaos.

Thanks.

Buffering issue in main()

I wanted to use wordsegment from PHP as a by-line processor, so python -m wordsegment is all I need. But it is completely unusable for that because of this piece in main():

    for line in streams.infile:
        streams.outfile.write(' '.join(segment(line)))
        streams.outfile.write(os.linesep)

streams.infile's buffering comes into play and I basically don't get any response after feeding a line into it:

$ sudo strace -p $(pidof python)
Process 12343 attached
read(0, "thisisatest\n", 8192)          = 12
read(0,

See, it gets the line and waits for more input because the input buffer has not been saturated yet. By the way, the buffering can't be disabled with Python's -u option.

Please consider remaking the main() code so by-line processing works out of the box for python -m wordsegment 👍
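For reference, a minimal by-line loop that flushes after each response (just a sketch of the behaviour this issue asks for, not the module's actual main()):

import sys

from wordsegment import load, segment

load()
# iter(readline, '') sidesteps the read-ahead buffering that plain
# "for line in sys.stdin" uses on Python 2, and flush() answers each
# line immediately instead of waiting for the output buffer to fill.
for line in iter(sys.stdin.readline, ''):
    sys.stdout.write(' '.join(segment(line)) + '\n')
    sys.stdout.flush()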

How to add custom values?

Hello GrantJenks,
this project is very interesting and you have done a fantastic job. I have gone through your "Using a Different Corpus" references and added a value according to the steps described, but word segmentation is not returning the value as expected; for example, I would like to see "lineno" output as "line no". Could you please guide me on this challenge?
Once again, it is really fantastic work. To my knowledge, no other similar work comes close to your accomplishments.
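For what it's worth, the tutorial above shows that the counts live in ordinary Python dictionaries, so a custom entry can be added directly. A rough sketch only, assuming the module-level UNIGRAMS is the same dictionary the default segmenter consults (as the ws.UNIGRAMS examples in the tutorial suggest); the count chosen here is arbitrary:

import wordsegment as ws

ws.load()
# Give the new token a large count. Whether this is enough to beat the
# competing 'line' + 'no' split depends on the other unigram and bigram scores.
ws.UNIGRAMS['lineno'] = ws.UNIGRAMS['the'] / 100
print(ws.segment('lineno'))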

Words segmenting in one direction, but not another.

Consider:

segment('treeparticle')
segment('particletree')

The first one segments correctly; it outputs ['tree', 'particle']. But the second one never segments! It returns ['particletree']. Even when I put a space between them, it always outputs ['particletree'].

I don't get it. Why? And how do I fix this?

ZeroDivisionError

using wordsegment-1.1.6-py2.py3-none-any.whl

import wordsegment as ws
seg = ws.Segmenter()
seg.score('industry', 'oil')
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    seg.score('industry', 'oil')
  File "C:\Python3\lib\site-packages\wordsegment\__init__.py", line 108, in score
    return self.score(word)
  File "C:\Python3\lib\site-packages\wordsegment\__init__.py", line 93, in score
    return 10.0 / (total * 10 ** len(word))
ZeroDivisionError: float division by zero
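For context, the failing line divides by the total unigram count, which is zero until the data is loaded. A minimal sketch that loads first, assuming Segmenter exposes a load() method mirroring the module-level load shown in the tutorial:

import wordsegment as ws

seg = ws.Segmenter()
seg.load()  # without this the unigram total is 0 and score() divides by zero
print(seg.score('industry', 'oil'))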

Using with Additional corpus of spelling mistakes.

I’m pondering using this as a service for an app for disabled people we support, who would use it to communicate. We see a lot of users who type by tapping on letters but often never use a space. But we have a snag: they also make errors. (See https://youtu.be/SDkE-aO3tOQ?si=0GAUyTKDh-q_sAxm and a quick app for iOS we made, https://github.com/AceCentre/DragToSpeak, and we are now contemplating a REST API largely using wordsegment.)

So I was wondering about adding to the standard corpus with something like https://www.dcs.bbk.ac.uk/~ROGER/corpora.html

I read this https://stackoverflow.com/a/32364566/1123094

It looks like I can create a file of bigrams or unigrams and weights and add it to the standard corpus. Right? Or is there a better way?

Training on new data

Hi,
It would be great if you could add a feature to train your algorithm on new data. The Google trillion-word data is a bit old, and if we need anything domain-specific, it may not be the best dataset to train on.

unigrams

Hello, and thank you for your great effort. I want to ask: what is the solution when I get an error calling unigrams[unknown word]? I want it to return 0 instead of raising an error. How can I fix that?
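Since the unigram data is a plain Python dict, the standard dict.get answer applies (not a library feature, just ordinary Python):

>>> import wordsegment as ws
>>> ws.load()
>>> ws.UNIGRAMS.get('someunknownword', 0)
0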

Training on new, modern data.

Hi,
I've seen #2 asking to train on new data, but that was back in 2015. Is it possible to train the algorithm on newer, more modern datasets?

I don't have any particular corpus/dataset in mind.

But for now I do encounter things like "bitcoin" segmented as bit, coin instead of bitcoin (obviously both are true) and "instagram" as insta, gram.
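There is no built-in training command, but since the counts are plain dictionaries, one rough way to fold in a newer corpus is to count tokens yourself and merge them in. A sketch only: my_corpus.txt is a hypothetical file, the same caveat as above applies about UNIGRAMS being the live dictionary, and raw counts from a small corpus will be dwarfed by the Google counts, so some rescaling is usually needed.

from collections import Counter

import wordsegment as ws

ws.load()
# Count cleaned tokens from a custom corpus and merge them into the
# unigram dictionary the segmenter consults.
with open('my_corpus.txt', encoding='utf-8') as fptr:
    tokens = (ws.clean(token) for token in fptr.read().split())
    counts = Counter(token for token in tokens if token)
for word, count in counts.items():
    ws.UNIGRAMS[word] = ws.UNIGRAMS.get(word, 0.0) + count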

import error

When I use the 'from wordsegment import clean' command, I get "ImportError: cannot import name clean".

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1286: ordinal not in range(128)

import wordsegment as WS
File "/usr/local/lib/python3.4/site-packages/wordsegment.py", line 49, in <module>
    bigram_counts = parse_file(join(basepath, 'bigrams.txt'))
  File "/usr/local/lib/python3.4/site-packages/wordsegment.py", line 45, in parse_file
    return dict((word, float(number)) for word, number in lines)
  File "/usr/local/lib/python3.4/site-packages/wordsegment.py", line 45, in <genexpr>
    return dict((word, float(number)) for word, number in lines)
  File "/usr/local/lib/python3.4/site-packages/wordsegment.py", line 44, in <genexpr>
    lines = (line.split('\t') for line in fptr)
  File "/usr/local/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1286: ordinal not in range(128)

RecursionError on segment call

Hi,

I'm having trouble with the following code:

import wordsegment

wordsegment.load()
text = "The article went on to say, “For in the pizza shops rich and poor harmoniously congregate; they are the only places where the members of Neapolitan aristocracy—far haughtier than those of any other part of Italy—may be seen (eating) their favorite delicacy side by side with their own coachmen and valets and barbers.”"
wordsegment.segment(text)

It fails with a RecursionError.
RecursionError: maximum recursion depth exceeded while calling a Python object

I'm using python 3.8 on ubuntu 20.04.

Bigram doesn't work.

I have the following string.

I'm from New York.

I used the following wordsegment python package.

import wordsegment
wordsegment.segment("I'm from New York.")

However, I got the following response where New and York aren't together.

['im', 'from', 'new', 'york']

I can see that New York is in wordsegment bigrams corpus. But I'm not sure why it is not giving me New York together.

Thanks.

Return a list of the most probable segmentations.

It would be great if wordsegment returned that.
For instance:

> ws.rank("nobodyelse")
[ ["nobody", "else"], ["no", "body", "else"], ...]

or

> ws.probabilities("nobodyelse")
[
[ ["nobody", "else"], ["no", "body", "else"], ...],
[ 0.727362, 0.0012372, ...]
]

Please allow separation of numbers from text

"a frail 88-year old man" is being outputed as ["a","frail88","year","old"]

This doesn't help at all. Having numbers in a block of text is so common in any domain. It's sad that one can't use this wonderful program for this issue!

max() arg is an empty sequence

I am trying to run this:
print (segment('thisisatest'))

But I'm getting this exception:

\wordsegment\__init__.py in segment(self, text)
157 def segment(self, text):
158 "Return list of words that is the best segmenation of text."
--> 159 return list(self.isegment(text))
160
161

\wordsegment\__init__.py in isegment(self, text)
143 for offset in range(0, len(clean_text), size):
144 chunk = clean_text[offset:(offset + size)]
--> 145 _, chunk_words = search(prefix + chunk)
146 prefix = ''.join(chunk_words[-5:])
147 del chunk_words[-5:]

\wordsegment\__init__.py in search(text, previous)
130 yield (prefix_score + suffix_score, [prefix] + suffix_words)
131
--> 132 return max(candidates())
133
134 # Avoid recursion limit issues by dividing text into chunks, segmenting

ValueError: max() arg is an empty sequence

This is for Python 3.6

License question

Even though it is licensed with Apache 2, I believe it cannot be used in any commercial application as the license of the data only allows it to be used for educational/research purposes? Is that correct?

Maybe, in this case, it is possible to decouple the data and allow open-source unigrams/bigrams to be gathered from somewhere? I noticed the "Using a Different Corpus" link.
