nltk / nltk_book
NLTK Book
Home Page: http://www.nltk.org/book
There are paragraphs in the Left-Corner Parser section of Chapter 8 that are in our file but missing from the book. I'd like to review whether they could go back in.
Table 8-3 has a caption in the book, but lacks one (.. table:: tab-subcat) in our rst file.
The testing framework (nltk/test/runtests.py) should support pylistings and doctest-ignore.
puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
[w for w in wordlist if len(w) >= 6
    and obligatory in w  # hi, obviously this doesn't work,
    and nltk.FreqDist(w) <= puzzle_letters]
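The intended containment test can be reproduced with collections.Counter, which is a sketch only: Counter stands in for nltk.FreqDist, the word list is a small hypothetical sample in place of nltk.corpus.words.words(), and the inclusion check is written out explicitly for portability:

```python
from collections import Counter

puzzle_letters = Counter('egivrvonl')   # letters available in the puzzle
obligatory = 'r'
# hypothetical mini word list standing in for nltk.corpus.words.words()
wordlist = ['glover', 'violet', 'livre', 'grovel', 'vigor']

def within(word, letters):
    """True if word can be spelled from letters, respecting multiplicity."""
    return all(n <= letters[c] for c, n in Counter(word).items())

solutions = [w for w in wordlist if len(w) >= 6
             and obligatory in w
             and within(w, puzzle_letters)]
print(solutions)  # ['glover', 'grovel']
```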
Hello,
Thank you very much for NLTK and the NLTK book. I’m a literary translator, and you make the Python basics really easy to understand.
In chapter 3, section 3.1, you say we should access Gutenberg with this url:
from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
This is (no longer?) possible. Cf. their Terms of use (http://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use): "This website is intended for human users only. Any perceived use of automated tools to access this website will result in a temporary or permanent block of your IP address. / If you need to download a great number of books, download them from one of our mirrors, not from the main site."
Using their main address indeed failed repeatedly for me (only when using Python or the NLTK scripts), whereas this (European) address:
ftp://sunsite.informatik.rwth-aachen.de/pub/mirror/ibiblio/gutenberg/2/5/5/2554/2554-8.txt
worked for me.
I hope this may help other users.
J. Bégaud
.. pylisting:: code-pcfg1
:caption: Defining a Probabilistic Context Free Grammar (PCFG)
grammar = nltk.parse_pcfg("""
S -> NP VP [1.0]
VP -> TV NP [0.4]
VP -> IV [0.3]
VP -> DatV NP NP [0.3]
TV -> 'saw' [1.0]
IV -> 'ate' [1.0]
DatV -> 'gave' [1.0]
NP -> 'telescopes' [0.8]
NP -> 'Jack' [0.2]
""")
>>> print(grammar)
Grammar with 9 productions (start state = S)
S -> NP VP [1.0]
VP -> TV NP [0.4]
VP -> IV [0.3]
VP -> DatV NP NP [0.3]
TV -> 'saw' [1.0]
IV -> 'ate' [1.0]
DatV -> 'gave' [1.0]
NP -> 'telescopes' [0.8]
NP -> 'Jack' [0.2]
Failed example:
print(grammar)
Differences (unified diff with -expected +actual):
@@ -1,10 +1,10 @@
Grammar with 9 productions (start state = S)
- S -> NP VP [1.0]
+ S -> NP VP [1]
VP -> TV NP [0.4]
VP -> IV [0.3]
VP -> DatV NP NP [0.3]
- TV -> 'saw' [1.0]
- IV -> 'ate' [1.0]
- DatV -> 'gave' [1.0]
+ TV -> 'saw' [1]
+ IV -> 'ate' [1]
+ DatV -> 'gave' [1]
NP -> 'telescopes' [0.8]
NP -> 'Jack' [0.2]
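One way to make such doctests robust to float-formatting differences (here, "[1.0]" vs "[1]") is doctest's ELLIPSIS directive. A minimal sketch, using a hypothetical one-line example rather than the actual grammar output:

```python
import doctest

# A doctest whose expected output tolerates "[1]" as well as "[1.0]"
sample = '''
>>> prob = 1.0
>>> print("S -> NP VP [%s]" % prob)  # doctest: +ELLIPSIS
S -> NP VP [1...]
'''

parser = doctest.DocTestParser()
test = parser.get_doctest(sample, {}, "pcfg_sample", None, 0)
runner = doctest.DocTestRunner()
runner.run(test)
print(runner.failures)  # 0
```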
Page number of error:
Cover, 3, 6, 18, 31, 46, 49, 50, 52, 60, 61, 62, 69, 86, 91, 94, 98, 99,
Location on the page:
Various locations
Detailed description of error:
In the PDF version there are major graphical artifacts on the cover pages (page 1 and 2 of the PDF). This also occurs in Figures 1-1, 1-2, 1-4, 1-5, 2-1, 2-2, 2-3, 2-4, 2-5, 2-6, 2-7, 2-8, 3-1, 3-2, 3-3, 3-4, 3-5 and many more. Note that these don't appear in the ePub file, so it seems to be something in how the PDF was processed/created. In some cases, it makes the figures hard to read properly.
In (27)a (p. 342 of the book): 'Rue Pascal' -> 'rue Pascal'
The diagrams in the HTML version fail to honour the typographical convention of using upper case for feature labels.
Comment for: June 2009 first edition, section 5.5
First off, I liked the approach in your discussion of the limits of what you call n-gram tagging. It's interesting to see how many times tag[n-2], tag[n-1], word[n] lead to different tag[n]. This still won't give you a lower bound on held-out error, as there's no reason to believe there won't be sampling error in the original corpus -- thus held-out performance will likely be worse than estimated by your method.
At first I thought you were doing something like an HMM, where you actually use the following tags as context implicitly when you condition tag[n+1] on tag[n] and so on. So this isn't an upper bound on models that use n-gram statistics -- you can do better with an HMM because of the overlaps in the prediction contexts and outcomes.
Second, the storage isn't as bad as you make out for the key reason that it's limited by the size of training corpora. If you use Brown, there's 1M tokens, so you need to store at most 1M n-grams. So I don't think we'll be faced with the "hundreds of millions of entries" you suggest on page 208. (Of course, you could get there if you use some kind of semi-supervised learning.)
Thus I don't understand why you say "it is simply impractical for n-gram models to be conditioned on the identities of the words in context". This will again be limited by the training set size. You actually see all kinds of conditioning on the words these days in conditional random fields (CRFs). You also see conditioning on words in an HMM-like model in the old BBN approach to named entities in Nymble (which I adopted for the first tagger I implemented in LingPipe 1). You see it in Collins' parser for PCFGs.
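The storage point can be illustrated with a toy count: the number of distinct n-grams observed is bounded by the number of token positions, regardless of vocabulary size (a sketch with a hypothetical token sequence):

```python
from collections import Counter

def distinct_ngrams(tokens, n):
    """Count the distinct n-grams occurring in a token sequence."""
    return len(Counter(zip(*(tokens[i:] for i in range(n)))))

tokens = "the cat sat on the mat the cat ran".split()
# At most len(tokens) - n + 1 distinct n-grams can occur
print(distinct_ngrams(tokens, 3))   # 7
print(len(tokens) - 3 + 1)          # 7
```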
I don't know if you've seen the Ratinov and Roth paper on features for tagging, but it's very interesting. They actually found a MEMM-type approach with very little backtracking could perform very well with unbounded context-length predictive features.
Ratinov, Lev and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In CoNLL.
http://www.aclweb.org/anthology/W/W09/W09-1119.pdf
Migrated from http://code.google.com/p/nltk/issues/detail?id=642 (and then from nltk/nltk#142)
Add an example showing generation of cloze tests.
Cf http://tjtest.tiddlyspace.com
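A minimal sketch of cloze-test generation (blanking every nth word; the function name and sample sentence are illustrative, not from NLTK):

```python
def make_cloze(sentence, every=4, blank="____"):
    """Replace every nth word with a blank; return gapped text and answers."""
    words = sentence.split()
    answers = []
    for i in range(every - 1, len(words), every):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers

text, answers = make_cloze("The quick brown fox jumps over the lazy dog today")
print(text)     # The quick brown ____ jumps over the ____ dog today
print(answers)  # ['fox', 'lazy']
```

A fuller example for the book could blank only content words, using POS tags from chapter 5 to skip determiners and prepositions.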
Location:
At the end of the subsection "Plotting and Tabulating Distributions", in the Note "Your turn"
Old text:
conditions = ['Monday', ...]
Proposed new text:
samples = ['Monday', ...]
Thank you for putting the full text online!!!
The following has disappeared from the book version:
X-bar Syntax: [Chomsky1970RN]_, [Jackendoff1977XS]_
(The primes we use replace Chomsky's typographically more demanding horizontal bars.)
Rework this into the text?
In HTML output, add chapter number to section number, and remove subsection numbers. E.g. this:
C. Chapter
Should be:
C. Chapter
C.1. Section
Subsection
Atlas Documentation:
http://chimera.labs.oreilly.com/books/1230000000065/index.html
My guess is that if we take the O'Reilly Atlas route, we would have to author in asciidoc (http://www.methods.co.nz/asciidoc/), which is yet another wiki syntax. I don’t see anything about ‘importing’ files in other formats (i.e., ReST) in the documentation. However, http://johnmacfarlane.net/pandoc/ looks like it might convert from rst to asciidoc.
We need to check how far asciidoc covers the functionality we already use in the book. Asciidoc converts to multiple formats, including DocBook, PDF, ePub, and HTML.
In the online copy of the NLTK book, Chapter 5.1, a note section says
Note
NLTK provides documentation for each tag, which can be queried using the tag, e.g. nltk.help.upenn_tagset('RB'), or a regular expression, e.g. nltk.help.upenn_brown_tagset('NN.*'). Some corpora have README files with tagset documentation, see nltk.corpus.???.readme(), substituting in the name of the corpus.
I think nltk.help.upenn_brown_tagset('NN.*') should be just nltk.help.brown_tagset('NN.*'), because the latter works and the former doesn't.
Please consult |NLTK-URL| for further materials on this chapter, including HOWTOs,
feature structures, feature grammars, Earley parsing and grammar test suites.
Hi all,
I cannot build your book.
$ make pdf
../rst.py --ref ch00.rst
Traceback (most recent call last):
File "../rst.py", line 2735, in
main()
File "../rst.py", line 2651, in main
if docutils.writers.html4css1.Image is None:
AttributeError: 'module' object has no attribute 'Image'
make: *** [ch00.ref] Error 1
Thank you!
python 2.7.3
python-nltk 2.0.1rc4
python-docutils 0.9
epydoc 3.0.1
tk 8.5.11
(probably not worth opening 3 separate tickets...)
mat
, not branch
corpora/unicode_samples instead of 'samples' in both code examples
from xml.etree instead of from nltk.etree
Should the code snippet on page 88 be

    for line in b:
        print b

or

    for line in b:
        print line
Incorporate the proposed new functionality of nltk.Text, cf nltk/nltk#546
word_tokenize() is intended to work with single sentences. However, there are examples in the book that apply it to multiple sentences. We need to redefine this function, or create a new one, or modify the examples. I favour the first option.
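The first option could be sketched as a wrapper that sentence-splits before tokenizing. This is a rough sketch, not NLTK's actual implementation: the regex splitter and tokenizer below are naive stand-ins for real sentence and word tokenizers.

```python
import re

def naive_sent_split(text):
    """Very rough sentence splitter on terminal punctuation."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def word_tokenize(text):
    """Tokenize running text by applying a word tokenizer per sentence."""
    tokens = []
    for sent in naive_sent_split(text):
        tokens.extend(re.findall(r"\w+|[^\w\s]", sent))
    return tokens

print(word_tokenize("Hello there. How are you?"))
# ['Hello', 'there', '.', 'How', 'are', 'you', '?']
```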
In nltk_book/definitions.rst
, line 194 is
.. |NLTK-HOWTO-URL| replace:: ``http://www.nltk.org/howto``
This URL gives a 404 error at the moment. This may be an issue with nltk.org rather than the book; depends which you prefer to fix.
Discuss artistic text generation, cf
http://charlesmartinreid.com/wiki/Apollo11Junk
https://github.com/leonardr/olipy
Both Figure 4.5 and the downloaded script at http://www.nltk.org/book/pylisting/code_epytext.py are "corrupted."
Only the top two lines of the blue box,
>>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
0.5
belong there. Everything below that belongs in both the green box and in the downloaded script, as the second half of accuracy(). Also, it should be zip instead of izip in the for loop.
This would be a good resource to cite in chapter 8.
http://www.morganclaypool.com/doi/abs/10.2200/S00493ED1V01Y201303HLT020
Update code examples in the book, and correct any egregious errors (including link-rot).
Tag the repository before and after these changes, then communicate the diffs to O'Reilly.
Send them diffs for one chapter, to make sure they are able to deal with this.
Be careful to avoid touching linebreaks where possible, to avoid spurious diffs.
Introduce Counter, and consider using it instead of FreqDist and defaultdict(int) in various places. Perhaps the presentation of defaultdict can be postponed.
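For reference, the two standard-library idioms side by side (Counter is in the standard library; FreqDist offers a similar counting interface on top of it):

```python
from collections import Counter, defaultdict

words = "the cat sat on the mat".split()

# defaultdict(int) idiom currently used in the book
counts = defaultdict(int)
for w in words:
    counts[w] += 1

# Counter does the same in one call, plus helpers like most_common()
freq = Counter(words)

print(counts["the"])          # 2
print(freq.most_common(1))    # [('the', 2)]
```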
Commenting on
>>> for c in line:
... if ord(c) > 127:
... print('%r U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c)))
Trux writes:
54. "If you replace the %r (which yields the repr() value) by %s in the format string of the code sample above, and if your system supports UTF-8, you should see an output like the following:"
On my system, I get the desired output after replacing the %r by %s AND c.encode('utf8') by c.
Actually, I don't think %r vs %s makes any difference. Not encoding c seems to work in IDLE, but not in my terminal.
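In Python 3, where strings are Unicode throughout, Trux's point holds and no manual encode is needed. A sketch with a hypothetical sample line:

```python
import unicodedata

line = "Pierre a déjà vu ça"  # illustrative sample, not from the book's data
for c in line:
    if ord(c) > 127:
        # %r shows the character's repr; no .encode('utf8') required
        print('%r U+%04x %s' % (c, ord(c), unicodedata.name(c)))
```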
E.g., the reference ex-subcat1_ in the text below:

    The grammar in code-cfg2_ correctly generates examples like ex-subcat1_.
.. _ex-subcat1:
.. ex::
.. ex:: The squirrel was frightened.
.. ex:: Chatterer saw the bear.
.. ex:: Chatterer thought Buster was angry.
.. ex:: Joe put the fish on the log.
gets realised as (15d) rather than (15).
The grammar in 3.3 correctly generates examples like (15d).
Migrated from nltk/nltk#85:
I am reading the online version of the book,
http://nltk.googlecode.com/svn/trunk/doc/book/ch04.html
In chapter 4, example 4.5. it looks as if the colours got messed up.
Half of the docstring is green and the rest displays in blue.
Kind regards
Migrated from http://code.google.com/p/nltk/issues/detail?id=450
StevenBird1 said, at 2010-03-12T10:42:05.000Z:
(Thanks for the report, and sorry for the delay.) The problem must be caused by the inclusion of a doctest block inside the function docstring. We need to fix this, but I'm not sure how soon it's going to happen, sorry.
Trux makes a couple of correct observations about references to Earley parsers:
240. "Feature based grammars are parsed in NLTK using an Earley
chart parser (see Section 9.5 for more information about this)."
"Section 9.5 does not provide information on Earley chart parsers."
254. (exercise 7)
"Develop a wrapper for the earley_parser so that a trace is only printed if the input sequence fails to parse."
"Note that I could not find a single instance of 'earley_parser' in the whole NLTK source code."
I realize the book was published in 2009, but I would still like to use it to learn Python/NLP with my linguistics but non-computational background. I am fairly computer-literate, but all my attempts to install Python/NLTK have failed one way or another. The farthest I got was to successfully install Python 2.7 and the matching setup tools (the earlier versions the book refers to didn't work at all), but the installation test import nltk never works, apparently because nltk is not in the registry. My computer's OS is Windows 7, it is 64-bit, and the processor is an Intel Core i5. Any help/suggestions will be greatly appreciated. Thanks.
http://www.nltk.org/book/ and http://www.nltk.org/book3/
Please report an [sic] errors on the issue tracker.
Simple typo or clever easter egg?
Location: Section 2.5, subsection "Sense and Synonyms"
Long story short, len(wn.synsets('car')) == 5 and len(wn.synsets('automobile')) == 2; so why does the text claim that 'car' has five synsets but 'automobile' has only one?
I kind of regret typing up all this "explanatory" text, but it'd be a waste to delete it...
Unlike the words automobile and motorcar, which are unambiguous and have one synset, the word car is ambiguous, having five synsets:
>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'),
Synset('cable_car.n.01')]
Alternative 1:
Unlike the word **motorcar**, which is unambiguous and has one synset, the word car is ambiguous, having five synsets:
Alternative 2:
Unlike the words automobile and motorcar, which are unambiguous and have one synset, the word car is ambiguous, having **four** synsets:
Using nltk_data downloaded on 2014-03-04, the corresponding calls for 'automobile' and 'motorcar' yield
>>> ss = wn.synsets('automobile') # The definition for the verb means to "travel in an automobile" !
[Synset('car.n.01'), Synset('automobile.v.01')]
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
I'm not totally sure what it means for a word to "have N synsets". If it means len(wn.synsets(word)) == N, then Alternative 1 seems appropriate. If it means wn.synsets(word) only has N "codes" starting with the word, then Alternative 2 seems appropriate. (By "code" I mean how only one of the two constructor arguments in [Synset('car.n.01'), Synset('automobile.v.01')] starts with automobile.) Bleh, I am getting confused reading this myself
... because Babelfish is gone :-\
The link at the bottom of the page to nltk-users list on the http://nltk.org/book3/ page is incorrect.
Right now it points to http://nltk.org/book3/groups.google.com/group/nltk-users
I guess it should be http://groups.google.com/group/nltk-users
Add a section on the Stanford Tagger interface to chapter 5.
"Translator's Guide" (http://code.google.com/p/nltk/wiki/TranslatorsGuide) is not accessible; it is a dead link.
The same issue applies to the "nltk-users mailing list" (www.nltk.org/book/groups.google.com/group/nltk-users) link.
Running make ch08.html breaks with the following error:
File "/Users/sb/git/nltk_book/tree2image.py", line 87, in tree_to_widget
raise ValueError('unbalanced parens?')
http://nltk.org/book/ch01.html#tab-inequalities
There are no examples related to Table 1.3; they are commented out in the HTML.
sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
[w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
[w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
[w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
[w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board',
'as', 'a', 'nonexecutive', 'director', '29', '.']
There are some characters wrong in the preface of online book at URL http://nltk.org/book/ch00.html
The errata are related to apostrophes. I'll break the lines at the words containing the errata.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless
youÕre <==========
reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from
OÕReilly <==========
books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your
productÕs <==========
documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example:
ÒNatural <==========
Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 978-0-596-51649-9.Ó If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].
http://nltk.org/book/ch00.html
For example:
"I.3" is http://nltk.org/book/ch00-pt.html#tab-course-plans instead of http://nltk.org/book/ch00.html#tab-course-plans
Check this:
Failed example:
nltk.data.show_cfg('grammars/book_grammars/feat0.fcfg')
Expected:
% start S
# ###################
# Grammar Productions
# ###################
# S expansion productions
S -> NP[NUM=?n] VP[NUM=?n]
# NP expansion productions
NP[NUM=?n] -> PropN[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
NP[NUM=pl] -> N[NUM=pl]
# VP expansion productions
VP[TENSE=?t, NUM=?n] -> IV[TENSE=?t, NUM=?n]
VP[TENSE=?t, NUM=?n] -> TV[TENSE=?t, NUM=?n] NP
# ###################
# Lexical Productions
# ###################
Det[NUM=sg] -> 'this' | 'every'
Det[NUM=pl] -> 'these' | 'all'
Det -> 'the' | 'some' | 'several'
PropN[NUM=sg]-> 'Kim' | 'Jody'
N[NUM=sg] -> 'dog' | 'girl' | 'car' | 'child'
N[NUM=pl] -> 'dogs' | 'girls' | 'cars' | 'children'
IV[TENSE=pres, NUM=sg] -> 'disappears' | 'walks'
TV[TENSE=pres, NUM=sg] -> 'sees' | 'likes'
IV[TENSE=pres, NUM=pl] -> 'disappear' | 'walk'
TV[TENSE=pres, NUM=pl] -> 'see' | 'like'
IV[TENSE=past] -> 'disappeared' | 'walked'
TV[TENSE=past] -> 'saw' | 'liked'
Got:
% start S
# ###################
# Grammar Productions
# ###################
# S expansion productions
S -> NP[NUM=?n] VP[NUM=?n]
# NP expansion productions
NP[NUM=?n] -> N[NUM=?n]
NP[NUM=?n] -> PropN[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
NP[NUM=pl] -> N[NUM=pl]
# VP expansion productions
VP[TENSE=?t, NUM=?n] -> IV[TENSE=?t, NUM=?n]
VP[TENSE=?t, NUM=?n] -> TV[TENSE=?t, NUM=?n] NP
# ###################
# Lexical Productions
# ###################
Det[NUM=sg] -> 'this' | 'every'
Det[NUM=pl] -> 'these' | 'all'
Det -> 'the' | 'some' | 'several'
PropN[NUM=sg]-> 'Kim' | 'Jody'
N[NUM=sg] -> 'dog' | 'girl' | 'car' | 'child'
N[NUM=pl] -> 'dogs' | 'girls' | 'cars' | 'children'
IV[TENSE=pres, NUM=sg] -> 'disappears' | 'walks'
TV[TENSE=pres, NUM=sg] -> 'sees' | 'likes'
IV[TENSE=pres, NUM=pl] -> 'disappear' | 'walk'
TV[TENSE=pres, NUM=pl] -> 'see' | 'like'
IV[TENSE=past] -> 'disappeared' | 'walked'
TV[TENSE=past] -> 'saw' | 'liked'
The following comment is well-taken
XXX The second half of the next paragraph is heavy going, for
a relatively simple idea; it would be easier to follow if
there was a diagram to demonstrate the contrast, giving
a pair of structures that are minimally different, e.g.
"put the chair on the stage" vs "saw the chair on the stage".
After this, prose could formalize the concepts.
However, the book has just dropped the 'next paragraph', and this makes much of the section incoherent; i.e., the distinction between the trees (36) and (37) has no motivation.
I have used the Kindle version of the ebook, and an irritating bug in the book is that there are a lot of references like "see 1)", but the actual list points are "numbered" using letters like "a)", "b)", ...
Sorry if this has been reported - found no good way of searching the previous bug reports.
Introduce these earlier in the book, and update the discussion for any methods that now return iterators.