html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python

License: MIT License

Python 68.63% Shell 0.02% HTML 31.35%

html5lib-python's Introduction

html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Usage

Simple usage follows this pattern:

import html5lib
with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with urllib2 (Python 2), pass the charset from the HTTP response to html5lib as follows:

from contextlib import closing
from urllib2 import urlopen
import html5lib

with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), pass the charset from the HTTP response to html5lib as follows:

from urllib.request import urlopen
import html5lib

with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

import html5lib
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)

When you're instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.

Installation

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

$ pip install html5lib

The goal is to support a (non-strict) superset of the versions that pip supports.

Optional Dependencies

The following third-party libraries may be used for additional functionality:

lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);
genshi has a treewalker (but not builder); and
chardet can be used as a fallback when character encoding cannot be determined.

Bugs

Please report any bugs on the issue tracker.

Tests

Unit tests require the pytest and mock libraries and can be run using the pytest command in the root directory.

Test data are contained in a separate html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized:

$ git submodule init
$ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

Check out the docs. Still need help? Go to our GitHub Discussions.

You can also browse the archives of the html5lib-discuss mailing list.

html5lib-python's Issues

Typo in simpletree Document: utf=8 instead of utf-8

Reported by maks.tsoy, Dec 24, 2012

It seems the default value for encoding should be utf-8, not utf=8:

def toxml(self, encoding="utf=8"):

Mismatched end-tag results in an adoption-agency-1.1 error

http://code.google.com/p/html5lib/issues/detail?id=174

Reported by eric.promislow, Feb 23, 2011

Given this HTML:
<!doctype html>
<title>Adoption agency bug</title>
<p>A form:</a>
I was expecting an unexpected-end-tag error, but got this value
for parser.errors:
[((3, 14), 'adoption-agency-1.1', {'name': u'a'})]
Version: 0.90
Implementation: python2

Update readme

This still refers to all the Google Code stuff.

html5lib.treebuilders.dom.dom2sax crashes on 'xml:lang' attribute

http://code.google.com/p/html5lib/issues/detail?id=200

Reported by vovanec, Mar 6, 2012

A simple test case (my program has a more complex handler implementation, but the problem is reproducible with the default handler):

import xml.sax.handler
import html5lib

def test(html):
    handler = xml.sax.handler.ContentHandler()
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom'))
    dom = parser.parse(html)
    html5lib.treebuilders.dom.dom2sax(dom, handler)

html = '<html xml:lang="en">'
test(html)

With html5lib 0.95 it produces the following traceback:
python test.py 
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    test(html)
  File "test.py", line 10, in test
    html5lib.treebuilders.dom.dom2sax(dom, handler)
  File "/home/vkuznets/packages/html5lib-0.95/html5lib-0.95/html5lib/treebuilders/dom.py", line 271, in dom2sax
    for child in node.childNodes: dom2sax(child, handler, nsmap)
  File "/home/vkuznets/packages/html5lib-0.95/html5lib-0.95/html5lib/treebuilders/dom.py", line 256, in dom2sax
    del attributes[(attr.namespaceURI, attr.nodeName)]
KeyError: (None, u'xml:lang')
With previous versions (at least 0.11) there is no error. I assume this attribute may be invalid in the xml namespace, but I don't think it is OK for the parser to just crash. I've seen A LOT of real-world HTML documents that have such an attribute.

Tested it with Python 2.6.5, Linux

Please advise.

Thanks,
--Vladimir

Support Jython

From Google Code Issue 220:

Reported by [email protected], Mar 1, 2013

What steps will reproduce the problem?
Reproducible in Jython 2.5.2 and Jython 2.7b1
>> import html5lib
import html5lib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lib/html5lib/__init__.py", line 14, in <module>
    from html5parser import HTMLParser, parse, parseFragment
  File "lib/html5lib/html5parser.py", line 33, in <module>
    import inputstream
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 48-54: illegal Unicode character
What is the expected output? What do you see instead?
jython cannot read inputstream.py.

Please provide any additional information below.

inputstream.py contains some seriously broken Unicode characters in the range 0xD800-0xDFFF, which are known as "unpaired surrogates".

This has been closed as wont-fix: http://bugs.jython.org/issue1836

It may be necessary to modify inputstream.py to not use these unicode character literals when running in Jython.

n.b. a test for Jython:
import platform
JYTHON = (platform.system() == 'Java')

Apr 7 (2 days ago) geoffers

As I just commented on the Jython bug, I believe this is a bug in Jython: it does not implement Python as documented. I don't particularly want to add in hacks for a bug in Jython.

Apr 7 (2 days ago) geoffers

Furthermore, if I do add a hack for it, we go into an infinite loop in the testsuite. I'm changing this bug to a more generic support Jython bug — no timeframe or even decision whether this is likely to happen.

U+000D must be encoded as entity

U+000D is converted to U+000A in the pre-tokenizer, so it must be serialized as a character reference.
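A minimal sketch of what the serializer-side fix could look like. The helper name escape_carriage_returns is hypothetical and not part of html5lib; it only shows the idea of emitting a numeric character reference so that the carriage return survives a parse/serialize round trip.

```python
def escape_carriage_returns(text):
    """Replace each literal U+000D with a numeric character reference.

    Hypothetical helper for illustration; html5lib's serializer is not
    structured this way.
    """
    return text.replace("\r", "&#13;")


escaped = escape_carriage_returns("line one\rline two")
print(escaped)
```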

Add option to strip tags from sanitizer

To quote @rubys from http://code.google.com/p/html5lib/issues/detail?id=62:

My inclination is to flip this entirely. It seems inconsistent that evil CSS is
stripped, and unknown attributes are stripped, but unknown elements are escaped, and
escaped poorly (what happens if an attribute for this element has a double quote in it?).

I mean, who wants to see <object> tags. It is bad enough that YouTube videos are
stripped, but rubbing salt in the wound by showing a bunch of gibberish seems
entirely unnecessary.

I'd suggest an expose_disallowed_elements=False class variable which can be set to
True if somebody really wants the current behavior.

With #26 this has become possible, but a nicer API would be better.

getElementById doesn't work with minidom

http://code.google.com/p/html5lib/issues/detail?id=79

Reported by nils.winter, Sep 1, 2008

What steps will reproduce the problem?

Take any html file and use the minidom treebuilder. See the attached example.

What is the expected output? What do you see instead?

doc.getElementById doesn't work. It should.

Please provide any additional information below.

minidom seems to expect some document definitions like <!DOCTYPE x [<!ATTLIST x a ID #IMPLIED>]> (or the complete doctype). Since the html5lib treebuilder is used for building the document and html5lib is provided with a valid HTML document, it should be html5lib's business to provide the ID definitions to minidom, based on the current HTML doctype.
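The behaviour can be reproduced with the standard library alone. The sketch below parses a small document with xml.dom.minidom directly (not through html5lib), shows that getElementById finds nothing when no attribute has been declared as an ID, and uses a hypothetical find_by_id helper as a workaround that walks the tree and compares the id attribute by hand.

```python
from xml.dom import minidom


def find_by_id(node, wanted):
    """Depth-first search for the element whose id attribute equals wanted.

    Hypothetical workaround helper; it does not rely on DTD ID declarations.
    """
    stack = [node]
    while stack:
        current = stack.pop()
        if (current.nodeType == current.ELEMENT_NODE
                and current.getAttribute("id") == wanted):
            return current
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(current.childNodes))
    return None


doc = minidom.parseString('<html><body><p id="intro">Hello</p></body></html>')

# No attribute is declared as type ID, so minidom has nothing to look up.
builtin_result = doc.getElementById("intro")

# The manual walk finds the element regardless.
manual_result = find_by_id(doc, "intro")
```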

XXX-undefined-error generated while recovering from a mismatched tag error

http://code.google.com/p/html5lib/issues/detail?id=176

Reported by eric.promislow, Feb 23, 2011

For this standalone code (Python 2.6, html5lib 0.90):

import cStringIO, html5lib, pprint
text = """
<!DOCTYPE html>
Here's a table
<table>
<caption>Stuff goes here</bogus>
<tr><td>col</td><td>another</td></tr>
</table>
"""
inputStream = cStringIO.StringIO(text)
parser = html5lib.HTMLParser()
doc = parser.parse(inputStream)
errors = parser.errors
pprint.pprint(errors)
I get this output:
[((5, 32), 'unexpected-end-tag', {'name': u'bogus'}),
 ((6, 4), 'XXX-undefined-error', {})]
Quick first look at the code, I would say that end-tags that aren't
in self.tree.openElements should be popped as part of error-recovery,
but then, I just started working with this project a few hours ago...

Drop RecursiveTreeWalker

Nothing has used this for years, and it was inevitably the source of issues historically (exhausting stack, etc.).

Move tokens to being objects

Using hashtables for tokens makes no sense; we should move to objects and then have the type implicit.
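One way to picture the proposal, as a sketch only: the class names StartTag and Characters and the describe function below are hypothetical and do not reflect any design chosen by the project. The point is that the token's class carries the type, so no "type" key has to be stored or compared.

```python
from dataclasses import dataclass, field


@dataclass
class StartTag:
    # The class itself identifies the token type.
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Characters:
    data: str


def describe(token):
    """Dispatch on the token's class instead of a token["type"] string."""
    if isinstance(token, StartTag):
        return "start:" + token.name
    if isinstance(token, Characters):
        return "text:" + token.data
    raise TypeError("unknown token: %r" % (token,))


tokens = [StartTag("p"), Characters("Hello")]
described = [describe(token) for token in tokens]
```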

Localizable error messages

Currently almost all our error messages are passed through gettext. Do we actually want localizable error messages? To me, at least, this seems an anti-feature.

cc/ @ambv and @jgraham

thin and thick not in CSS whitelist

http://code.google.com/p/html5lib/issues/detail?id=52

Reported by [email protected], Aug 9, 2007

CSS keywords "thin" and "thick" are not in the whitelist.

Aug 9, 2007 [email protected]

This issue was originally noticed at:

http://wiki.whatwg.org/wiki/Sanitization_rules

I assume feedparser and friends are all affected since they share the code.

(Patch dropped because it'll be as much effort to deal with changed paths and everything as to just rewrite it.)

Foreign content attributes do not serialize properly

In [6]: html5lib.serialize(html5lib.parse("<math><mi xml:lang=en xlink:href=foo /></math>"))
Out[6]: u'<math><mi href=foo lang=en></mi></math>'

We should get u'<math><mi xlink:href=foo xml:lang=en></mi></math>' or similar back.

Over-eager omission of body tag

<body><title>X</title> gets serialized as <title>X</title> with omit_optional_tags=True despite it being semantically important. Similar bug with <body><meta>.

Drop pxdom

We haven't supported or tested it in forever; it should go.

Sanitizer and lxml tree walker: TypeError: unhashable type

http://code.google.com/p/html5lib/issues/detail?id=210

Reported by r.kintzi, Aug 14, 2012

What steps will reproduce the problem?

from html5lib import HTMLParser
from html5lib.treebuilders import getTreeBuilder
from html5lib.treewalkers import getTreeWalker
from html5lib.filters.sanitizer import Filter as Sanitizer
html = "<html><body><h1>Header"

parser = HTMLParser(tree = getTreeBuilder("lxml"),
        namespaceHTMLElements = False)
doc = parser.parse(html)
root = doc.getroot()
body = doc.xpath('/html/body')
walker = getTreeWalker('lxml')
stream = walker(body)
stream = Sanitizer(stream)
for token in stream:
    print token

What is the expected output? What do you see instead?

I do not know exactly what should be printed. Instead, an exception is raised:

$ python t.py
{'namespace': u'None', 'type': 'Characters', 'data': u'<body>'}
Traceback (most recent call last):
  File "t.py", line 17, in <module>
    for token in stream:
  File "/home/radek/.virtualenvs/blog/local/lib/python2.7/site-packages/html5lib-0.95-py2.7.egg/html5lib/filters/sanitizer.py", line 7, in __iter__
    token = self.sanitize_token(token)
  File "/home/radek/.virtualenvs/blog/local/lib/python2.7/site-packages/html5lib-0.95-py2.7.egg/html5lib/sanitizer.py", line 171, in sanitize_token
    token["data"][::-1] 
TypeError: unhashable type

Please provide any additional information below.

the faulty token is:

{'namespace': u'None', 'type': 'StartTag', 'name': u'h1', 'data': {}}

Use of six.text_type for comparison in treewalker._base

Throughout html5lib/treewalkers/_base.py there are places where isinstance is used and the variable is compared to six.text_type instead of six.string_types.

On Python 2, six.text_type = unicode, which means that if an attribute or value passed in is a str, there will be an assertion error.

on Python 2:

six.string_types = (basestring,)
six.text_type = unicode

on Python 3:

six.string_types = (builtins.str, )
six.text_type = builtins.str

So I think the values should be compared against six.string_types and then six.text_type be used to coerce them for output.
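A rough sketch of that suggestion follows. The function name coerce_text is hypothetical, and the fallback definitions are stand-ins that mirror six's Python 3 values so the snippet also runs where six is not installed.

```python
try:
    import six
    string_types, text_type = six.string_types, six.text_type
except ImportError:
    # Stand-ins equal to six's definitions on Python 3.
    string_types, text_type = (str,), str


def coerce_text(value):
    """Accept any string type, then coerce it to the text type for output."""
    if not isinstance(value, string_types):
        raise TypeError("expected a string, got %r" % type(value))
    return text_type(value)


coerced = coerce_text("h1")
```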

Implement main

This is nowadays in HTML and a two-line fix.

Unicode literal in BufferedStream._readFromBuffer causes failure in HTMLBinaryInputStream

In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically the _readFromBuffer method, which ends with return "".join(rv). Because this is a unicode literal, any read after the first returns a chunk of unicode instead of a chunk of bytes.

An example of the problem caused:

from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream

req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)

Causing:

Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError

(That is, when HTMLBinaryInputStream is used with a file-like object (such as the result of urllib2.urlopen), it wraps it in a BufferedStream, which then fails (at line 535) with the assert isinstance(buffer, bytes).)

This can be fixed by using a byte literal in _readFromBuffer, instead, i.e. return b"".join(rv). (There are at least three places in inputstream.py where string literals are used like this: at lines 117, 318 and 348.)
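The idea behind the fix can be shown in isolation. The function below is a simplified, hypothetical stand-in for _readFromBuffer (the real method tracks buffer positions and is more involved); what matters is that the chunks are joined with a bytes literal, so the result stays bytes whether or not unicode_literals is in effect.

```python
def read_from_buffer(chunks, count):
    """Collect up to count bytes from a list of bytes chunks."""
    rv = []
    remaining = count
    for chunk in chunks:
        if remaining <= 0:
            break
        piece = chunk[:remaining]
        rv.append(piece)
        remaining -= len(piece)
    # b"" is a bytes literal even under `from __future__ import unicode_literals`.
    return b"".join(rv)


data = read_from_buffer([b"<!doc", b"type html>"], 9)
```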

Cleanup InputStream

InputStream is a mess. We should clean up so the legacy entry-point does all the magic, and all the internal code is clean.

We should have an abstract base class that defines the methods an InputStream must support, including concrete implementations when possible (charsUntil should be implementable based on lower-level APIs, for example). We should have two concrete classes inheriting from this: TextInputStream and BytesInputStream which take as an argument a file-like object (in Python 3, one that inherits from io.IOBase).

We should investigate whether we can get rid of our BufferedStream, relying upon io.BufferedReader (which exists from 2.6, which we now require).

The entry-point will need to:

Determine whether it has been passed a file-like object or a string.
If it is a file-like object, determine whether it contains Unicode data or not. (In Python 3, this can ordinarily be done by checking whether it inherits from TextIOBase, provided it inherits from IOBase.)
If it is a string, check whether it is bytes/unicode, and then wrap it in StringIO/BytesIO to pass on.
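The three entry-point steps above might be sketched as follows for Python 3. The function name open_input and its (kind, stream) return value are assumptions made for illustration; the proposed TextInputStream and BytesInputStream classes are not modelled here.

```python
import io


def open_input(source):
    """Return (kind, stream), where kind is "text" or "bytes"."""
    # Strings: wrap in the matching in-memory stream.
    if isinstance(source, str):
        return "text", io.StringIO(source)
    if isinstance(source, (bytes, bytearray)):
        return "bytes", io.BytesIO(bytes(source))
    # File-like objects that take part in the io hierarchy.
    if isinstance(source, io.TextIOBase):
        return "text", source
    if isinstance(source, io.IOBase):
        return "bytes", source
    # Duck-typed file-like object: probe with a zero-length read.
    probe = source.read(0)
    return ("text" if isinstance(probe, str) else "bytes"), source


kinds = [
    open_input("<p>Hello")[0],
    open_input(b"<p>Hello")[0],
    open_input(io.StringIO("<p>Hello"))[0],
    open_input(io.BytesIO(b"<p>Hello"))[0],
]
```

The zero-length read is a fallback for objects that only look like files; it is one possible choice, not something the issue prescribes.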

AlphabeticizeAttributes filter should have a native API like inject_meta_charset

<gsnedders> (html5lib.serialize(my_document, alphabeticize_attributes=True) should work as an API basically)

AttributeError when using html5lib

Hi, I have a unit test to ensure my end pages are valid HTML5.
The test has a function which fails the test if the page is not valid.

def assert_valid_html5(self, html5):
    ''' ensures that the html5 variable consists of valid html5 code '''
    from html5lib import html5parser
    from html5lib import treebuilders
    from html5lib import constants

    treebuilder = treebuilders.getTreeBuilder("etree")
    parser = html5parser.HTMLParser(tree=treebuilder)
    document = parser.parse(html5)

    errors=[]
    for pos, errorcode, datavars in parser.errors:
        errors.append("Line %i Col %i"%pos + " " + constants.E.get(errorcode, 'Unknown error "%s"' % errorcode) % datavars)
    self.assertEquals(len(errors), 0, "expected valid html5, but found errors %s, for html %s" % ("\n".join(errors), html5))

The test runs fine on my mac and also fine on the machine of another developer. However on our jenkins node I'm getting an AttributeError.

I'm trying to find what it is, the strange thing is, when i run "import html5lib" from the python interpreter I don't get any errors. I'm using nosetests to run the tests.

Traceback (most recent call last):
  File "/usr/lib/python2.7/unittest/case.py", line 327, in run
    testMethod()
  File "/var/lib/jenkins/jobs/bliep-master/workspace/test/test_index.py", line 20, in test_index_page
    self.assert_valid_html5(response.body)
  File "/var/lib/jenkins/jobs/bliep-master/workspace/test/test_index.py", line 46, in assert_valid_html5
    from html5lib import html5parser
  File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
    result = apply(self.realImport, (name, globals, locals, fromlist))
  File "/usr/local/lib/python2.7/dist-packages/html5lib/__init__.py", line 16, in <module>
    from .html5parser import HTMLParser, parse, parseFragment
  File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
    result = apply(self.realImport, (name, globals, locals, fromlist))
  File "/usr/local/lib/python2.7/dist-packages/html5lib/html5parser.py", line 6, in <module>
    from . import inputstream
  File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
    result = apply(self.realImport, (name, globals, locals, fromlist))
  File "/usr/local/lib/python2.7/dist-packages/html5lib/inputstream.py", line 9, in <module>
    from . import utils
  File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
    result = apply(self.realImport, (name, globals, locals, fromlist))
  File "/usr/local/lib/python2.7/dist-packages/html5lib/utils.py", line 6, in <module>
    import xml.etree.cElementTree as default_etree
AttributeError: 'module' object has no attribute 'cElementTree'

Is this a known issue? I'm trying to figure out what's wrong in the setup, but maybe it is also related to html5lib.

Cleanup AAA comments

3facc99 left the comments in an inconsistent state, some updated, others not. We should make these consistent.

Remove trailing whitespace

find . -name '*.py' -print0 | xargs -0 sed -i -e 's/\s*$//g' will do. Should wait until there are no outstanding branches for the sake of them merging cleanly.

xml:lang not handled

http://code.google.com/p/html5lib/issues/detail?id=154

Reported by @fantasai, Jun 1, 2010

What steps will reproduce the problem?
Try to reserialize a document containing xml:lang

What is the expected output? What do you see instead?
For HTML output, I'd expect to get lang= in place of xml:lang=.
Instead I get {http://www.w3.org/XML/1998/namespace}lang which
is completely malformed.

Quote attributes containing weird whitespace or '<'

http://code.google.com/p/html5lib/issues/detail?id=93

Reported by zcorpan, Feb 27, 2009

This is similar to issue 92 except there's an old Opera bug where certain
characters are treated as whitespace.

http://www.opera.com/support/kb/view/900/

The characters are

U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+002F, U+00A0, U+1680,
U+180E, U+180F, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006,
U+2007, U+2008, U+2009, U+200A, U+2028, U+2029, U+202F, U+205F and U+3000

html5lib should probably quote attribute values that contain any of these.

Also, given that Gecko and WebKit start a new tag for <foo bar=baz<quux>
you should probably also quote attribute values that contain "<".

Apr 27, 2009 excors

Also see http://software.hixie.ch/utilities/js/live-dom-viewer/saved/95

In addition to the values mentioned in the spec, the following seem to require
quoting:

Safari 3.0: U+0000 to U+0020 inclusive
Konqueror 4.1: U+0000 to U+0020 inclusive
Safari 3.1: U+000B
Opera 9.6: U+000B
IE6, IE8: U+000B, U+0060
Firefox 2/3: (Not U+0008 despite what that test script says; those characters just
get stripped, it seems)

Apr 27, 2009 zcorpan

(U+000B is not a valid character in HTML5, though I don't know if the serializer
tries to keep the character data valid.)

Sep 4, 2009 Simetrical

The spec should be updated to ban these too, then, right? They're not interoperably
supported. I doubt anyone will cry about not being able to use sub-0x20 characters in
unquoted attribute values, anyway. :) U+60 is `, doesn't seem like a big issue
either. Should this be brought up on the mailing list?

Sep 5, 2009 geoffers

IMO yes, just someone needs to get around to it. :)

Sep 6, 2009 zcorpan

I did, and Hixie rejected it saying that it's an issue that will go away over time.
Feel free to bring it up again (citing that sites who implement the spec using a
serializer will expose themselves to security problems with legacy browsers).

Sep 7, 2009 Simetrical

I posted this a couple of days ago:

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-September/022711.html

Oct 28, 2009 geoffers

Accepted, though we still need to decide how much to quote.

Oct 30, 2009 geoffers

I don't think we need to try and get the spec to quote anything else.

This should presumably be a legacy_quote option or some such.
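A sketch of how such a check could look. The function needs_quotes and its legacy_quote flag are hypothetical and are not html5lib's serializer API; the character sets are taken from this thread (the Opera whitespace list, "<", and U+0060 for IE) plus the characters the spec already forbids in unquoted values.

```python
# Characters reported in this thread as whitespace-like in old Opera.
LEGACY_SPACE = frozenset(
    "\u0009\u000a\u000b\u000c\u000d\u0020\u002f\u00a0\u1680\u180e\u180f"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000"
)
# Space characters and delimiters that are unsafe in unquoted values anyway.
SPEC_UNSAFE = frozenset(" \t\n\f\r\"'=<>")
# U+0060 is treated as a quote character by the IE versions named above.
LEGACY_UNSAFE = LEGACY_SPACE | frozenset("`")


def needs_quotes(value, legacy_quote=True):
    """Return True if value should not be serialized without quotes."""
    if value == "":
        return True
    unsafe = SPEC_UNSAFE | LEGACY_UNSAFE if legacy_quote else SPEC_UNSAFE
    return any(ch in unsafe for ch in value)


checks = {
    "plain": needs_quotes("baz"),
    "less_than": needs_quotes("baz<quux"),
    "nbsp_legacy": needs_quotes("a\u00a0b"),
    "nbsp_strict": needs_quotes("a\u00a0b", legacy_quote=False),
}
```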

Support HTML Templates

As specified in: http://www.w3.org/TR/html-templates/

Implementation already present in the Firefox nightlies, probably in Chrome too. This means the spec will be ported to HTML proper shortly.

skip first BOM test is invalid

Sent upstream to html5lib/html5lib-tests#2

AssertionError on some random case

http://code.google.com/p/html5lib/issues/detail?id=190

Reported by marko.koivusalo, Sep 12, 2011

What steps will reproduce the problem?

Unable to reproduce

Please provide any additional information below.

http://flexget.com/ticket/1099

relevant bit:
  File "build/bdist.linux-x86_64/egg/html5lib/inputstream.py", line 285, in detectBOM
    self.rawStream.seek(encoding and seek or 0)
  File "build/bdist.linux-x86_64/egg/html5lib/inputstream.py", line 50, in seek
    assert pos < self._bufferedBytes()
AssertionError

Document all dependencies, both required and optional

This is the basic outcome of issue 201 on GCode.

Preserve order of attributes on serialization

From Google Code #153:

Reported by @fantasai, Jun 1, 2010

What steps will reproduce the problem?
Parse an XHTML file containing attributes in unsorted order with lxml and reserialize.

What is the expected output? What do you see instead?
Expect no change.
Got attributes in alphabetical order, which makes the source harder to read (since the order was chosen to optimize readability, e.g. listing the fixed-length rel="stylesheet" before variable-length href="..."). This also makes it harder to understand diffs, since there's a lot of unnecessary changes to the source output.

Ideally, html5lib would remember the order of attributes and reserialize in that order. lxml does remember the order, so removing the attrs.sort() line in htmlserializer.py is adequate to fix the problem for serializing an lxml tree.

Jul 20, 2010 geoffers
AFAIK the reason for the sort being there is so that there is a guaranteed order even when a tree-builder with no guaranteed order is being used.

May 22, 2011 geoffers
There's no real way to fix this without relying upon defined-to-be-undefined behaviour in CPython/lxml, and as such I'm reluctant to do so. lxml says attributes are given in an arbitrary order, and they are stored in a dict which CPython makes no guarantee of the order of. (lxml does always insert attributes in document order into the dict, and dicts are ordered by insertion order, so it does actually work… for now, at least).

Yes, we could go against both the lxml/CPython documentation and rely upon the ordering, but if either ever changes their behaviour, it could mean html5lib could potentially start serializing the same lxml parse-tree in random ways, and I'd much rather go for the definitely-consistent route we have now.

gettext warnings in python/filters/lint.py

http://code.google.com/p/html5lib/issues/detail?id=166

Reported by [email protected], Dec 21, 2010

Running gettext over the Python HTML5Lib gives a few warnings in the form:

errors happened while running xgettext on lint.py
./html5lib/filters/lint.py:56: warning: 'msgid' format string with unnamed arguments cannot be properly localized:
                                        The translator cannot reorder the arguments.
                                        Please consider using a format string with named arguments,
                                        and a mapping instead of a tuple for the arguments.
This can be relatively trivially resolved by changing the bare %s, %r, etc. into named substitutions. The attached patch is my attempt to take care of things.
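To illustrate the difference, here is a small sketch using made-up message strings (they are not taken from lint.py). With named substitutions each placeholder is bound by key, so a translated template can reorder them without any change at the call site.

```python
# Positional form: the values are bound by position, so a translator
# cannot reorder them.
positional = "Unexpected %s in %s" % ("attribute", "start tag")

# Named form: placeholders are bound by key through a mapping.
template = "Unexpected %(kind)s in %(context)s"
values = {"kind": "attribute", "context": "start tag"}
named = template % values

# A translated template may put the placeholders in a different order.
reordered = "In %(context)s: unexpected %(kind)s" % values
```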

Grant gsnedders permission to upload html5lib to PyPi

For the sake of having something to prod James with.

CSS Sanitizer "gauntlet" filters style tags like font-family:"sans-serif"

http://code.google.com/p/html5lib/issues/detail?id=180

Reported by bjellema20, Mar 8, 2011

What steps will reproduce the problem?

Pass any html into the sanitizer with an inline style that includes a font-family with a dash (-) such as "sans-serif" and the entire style is stripped. Example html:
<span style='font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D'>Enjoy your day</span>
What is the expected output? What do you see instead?

The style tag should stay, but instead we see:
<span style="">Enjoy your day</span>
Please provide any additional information below.

I've solved this by changing line 197 in sanitizer.py from:
        if not re.match("""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$""", style): return ''
To:
        if not re.match("""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w-]+'|"[\s\w-]+"|\([\d,\s]+\))*$""", style): return ''
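The effect of the change can be checked in isolation. This sketch assumes the final alternative of the pattern matches a parenthesised list of digits, written \([\d,\s]+\); the two patterns differ only in whether a hyphen is allowed inside quoted strings.

```python
import re

# Before: quoted strings may contain only whitespace and word characters.
OLD_PATTERN = re.compile(
    r"""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$"""
)
# After: quoted strings may also contain a hyphen.
NEW_PATTERN = re.compile(
    r"""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w-]+'|"[\s\w-]+"|\([\d,\s]+\))*$"""
)

# The inline style from the report.
style = 'font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D'

old_ok = OLD_PATTERN.match(style) is not None
new_ok = NEW_PATTERN.match(style) is not None
```

With the old pattern the quoted "sans-serif" cannot match any alternative, so the whole style is rejected; with the new one it is accepted.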

HTMLParser is not threadsafe

http://code.google.com/p/html5lib/issues/detail?id=189

Reported by devin.bayer, Jul 25, 2011

Hi. I realize this is by design, but it's not intuitive, since similar standard classes like YamlDecoder and JSONDecoder are.

It would be more clear if the input stream was supplied to the constructor, like with ElementTree.

But at least, please document this in the class.

Sep 16, 2011 geoffers

Is there any reason to document it? This is the case with all Python code in CPython (other implementations may differ), so the cases where things are threadsafe are the notable exceptions.

Sep 16, 2011 devin.bayer

(Most?) Everything in the python standard library is threadsafe and most extensions are. I think you are referring to the GIL, which is different. That prevents parallel execution, but if one thread is blocking, the others can run safely.

The problem with the design of HTMLParser is that two threads can interfere with each other, even if they are not running at the same time.

Mar 11, 2012 [email protected]

This is clearly a defect. This is an object-oriented library in an object oriented language. Two parsers should be completely independent of each other, with no shared global variables, and thus thread-safe. If that's not the case, this is a defect.

Do I have to scrap my plans to convert a parallel web crawler from BeautifulSoup to html5lib?

This looks fixable. The trouble spots include at least these global variables:

dom.py: moduleCache

That could be easily fixed with a lock in getDomModule. That's a once per parse event, so there's no performance issue. All that's needed is
import threading
...
Lok = threading.Lock()
with Lok:
  ... critical section...
etree.py: moduleCache

Same issue.

etree.lxml: fullTree

This seems to be set only once, at load time. Is it changed elsewhere?

what have I missed? Some lower level library? Is Python's SAX parser unsafe?

This can and should be fixed.

Flake compliance not forced on master

In dede1f2 @ambv changed how flake was run such that the script that Travis now runs can return zero when one or more of the flake commands returns non-zero.

Possible to make IE run script after roundtripping in html5lib

http://code.google.com/p/html5lib/issues/detail?id=92

Reported by zcorpan, Feb 27, 2009

What steps will reproduce the problem?
Input: <br title=><xmp>><script>alert(1)</script></xmp>
Serialization options: omit quotes.

What is the expected output?
Attribute values with ` in them should be quoted even with the omit quotes
setting.

What do you see instead?
Quotes are omitted and hence, the script is run in IE.

Feb 27, 2009 t.broyer

IIRC, the spec says a ` is allowed in an unquoted attribute value:
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#attributes

Should the spec be changed? should we rather add a new option to the serializer?

Mar 10, 2009 sad.neko

I'm sorry, but I couldn't find ` to be allowed in unquoted attribute values in either HTML5
or HTML4. Am I missing something?

Sep 4, 2009 Simetrical

The requirements that comment 2 links to say unquoted attributes "must not contain any literal space characters, any U+0022 QUOTATION MARK (") characters, U+0027 APOSTROPHE (') characters, U+003D EQUALS SIGN (=) characters, U+003C LESS-THAN SIGN (<) characters, or U+003E GREATER-THAN SIGN (>) characters, and must not be the empty string." There are no other constraints that don't apply to quoted attributes as well.

What's the bug here? As far as I can tell from reading the spec, the given text
should parse as

<br title=""><xmp>><script>alert(1)</script></xmp>

and conformant browsers should run the script.

Sep 6, 2009 zcorpan

No, because xmp is a RAWTEXT element. So it's equivalent to the following XML

<br title=""/><xmp>><script>alert(1)</script></xmp>

but in IE it's equivalent to the following XML

<br title="><xmp>"/><script>alert(1)</script><xmp/>

(I think a stray </xmp> tag will result in an empty element in IE, but I could
be remembering incorrectly; anyway, that's beside the point.)

Oct 18, 2009 geoffers

` is now non-conforming at the start of an unquoted attribute.

Related to #11.
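
The rule quoted in the thread can be sketched as a small predicate that a serializer's omit-quotes option might consult. This is a hedged illustration, not html5lib's serializer code: `needs_quotes` and the constants are hypothetical names, and treating a backtick anywhere in the value (not only at the start) as requiring quotes is a deliberately conservative assumption.

```python
# HTML "space characters": space, tab, line feed, form feed, carriage return.
SPACE_CHARACTERS = " \t\n\x0c\r"

# Characters named in the quoted spec text, plus the backtick from this issue.
UNSAFE_UNQUOTED = set(SPACE_CHARACTERS + "\"'=<>`")


def needs_quotes(value):
    """Return True if the attribute value should not be written unquoted."""
    if value == "":
        return True  # the empty string can never be unquoted
    return any(ch in UNSAFE_UNQUOTED for ch in value)
```

With such a check, the value from the report (a single backtick) would be serialized with quotes even when quote omission is enabled.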

Fix parse.py under Py3

pyflakes finds a lot of issues in html5lib (some critical like the "undefined" ones at the bottom)

From Google Code, issue #221, reported by @mo. Below is a more recent run, made after the Python 3 merge.

$ pyflakes html5lib/
html5lib/__init__.py:16: 'parse' imported but unused
html5lib/__init__.py:16: 'HTMLParser' imported but unused
html5lib/__init__.py:16: 'parseFragment' imported but unused
html5lib/__init__.py:17: 'getTreeBuilder' imported but unused
html5lib/__init__.py:18: 'getTreeWalker' imported but unused
html5lib/__init__.py:19: 'serialize' imported but unused
html5lib/inputstream.py:6: 'types' imported but unused
html5lib/inputstream.py:110: local variable 'data' is assigned to but never used
html5lib/inputstream.py:293: 'sys' imported but unused
html5lib/tokenizer.py:11: 'entitiesWindows1252' imported but unused
html5lib/tokenizer.py:12: 'asciiLowercase' imported but unused
html5lib/tokenizer.py:801: redefinition of unused 'scriptDataDoubleEscapedDashState' from line 778
html5lib/utils.py:3: 'version_info' imported but unused
html5lib/html5parser.py:4: 'sys' imported but unused
html5lib/html5parser.py:17: 'formattingElements' imported but unused
html5lib/html5parser.py:18: 'tableInsertModeElements' imported but unused
html5lib/html5parser.py:19: 'voidElements' imported but unused
html5lib/html5parser.py:20: redefinition of unused 'spaceCharacters' from line 16
html5lib/html5parser.py:91: local variable 'e' is assigned to but never used
html5lib/html5parser.py:408: local variable 'element' is assigned to but never used
html5lib/html5parser.py:1405: local variable 'name' is assigned to but never used
html5lib/html5parser.py:1591: local variable 'node' is assigned to but never used
html5lib/tests/__init__.py:12: 'support' imported but unused
html5lib/tests/test_serializer.py:18: 'html5parser' imported but unused
html5lib/tests/test_serializer.py:175: local variable 'test_name' is assigned to but never used
html5lib/tests/test_parser2.py:5: 'support' imported but unused
html5lib/tests/test_parser2.py:37: local variable 'doc' is assigned to but never used
html5lib/tests/support.py:15: 'html5lib' imported but unused
html5lib/tests/support.py:16: 'html5parser' imported but unused
html5lib/tests/support.py:46: 'lxml' imported but unused
html5lib/tests/test_sanitizer.py:3: 'os' imported but unused
html5lib/tests/test_sanitizer.py:4: 'sys' imported but unused
html5lib/tests/test_sanitizer.py:5: 'unittest' imported but unused
html5lib/tests/test_stream.py:3: 'support' imported but unused
html5lib/tests/tokenizertotree.py:10: 'test_parser' imported but unused
html5lib/tests/test_tokenizer.py:5: 'sys' imported but unused
html5lib/tests/test_tokenizer.py:7: 'io' imported but unused
html5lib/tests/test_tokenizer.py:183: local variable 'testName' is assigned to but never used
html5lib/tests/test_tokenizer.py:187: local variable 'skip' is assigned to but never used
html5lib/tests/test_parser.py:6: 'io' imported but unused
html5lib/tests/test_parser.py:14: 'html5lib' imported but unused
html5lib/tests/test_parser.py:15: 'treebuilders' imported but unused
html5lib/tests/test_treewalkers.py:16: 'LintError' imported but unused
html5lib/tests/test_treewalkers.py:16: 'LintFilter' imported but unused
html5lib/tests/test_treewalkers.py:20: redefinition of unused 'COMMENT' from line 109
html5lib/tests/test_treewalkers.py:87: 'ElementTree' imported but unused
html5lib/tests/test_encoding.py:3: 're' imported but unused
html5lib/tests/test_encoding.py:30: local variable 't' is assigned to but never used
html5lib/tests/test_encoding.py:47: local variable 'test_name' is assigned to but never used
html5lib/tests/test_encoding.py:55: 'chardet' imported but unused
html5lib/treebuilders/__init__.py:37: 'sys' imported but unused
html5lib/treebuilders/dom.py:5: 're' imported but unused
html5lib/treebuilders/dom.py:9: 'ihatexml' imported but unused
html5lib/treebuilders/etree.py:210: local variable 'finalText' is assigned to but never used
html5lib/treebuilders/etree.py:274: local variable 'finalText' is assigned to but never used
html5lib/trie/datrie.py:3: 'chain' imported but unused
html5lib/serializer/htmlserializer.py:27: redefinition of unused 'entities' from line 14
html5lib/serializer/htmlserializer.py:231: local variable 'attributes' is assigned to but never used
html5lib/treewalkers/dom.py:9: 'voidElements' imported but unused
html5lib/treewalkers/simpletree.py:50: undefined name '_node'
html5lib/treewalkers/lxmletree.py:8: 'sys' imported but unused
html5lib/treewalkers/lxmletree.py:13: 'voidElements' imported but unused
html5lib/treewalkers/_base.py:98: undefined name 'NodeImplementedError'
html5lib/treewalkers/_base.py:157: local variable 'endTag' is assigned to but never used

$ pyflakes html5lib/ | grep undefined
html5lib/treewalkers/simpletree.py:50: undefined name '_node'
html5lib/treewalkers/_base.py:98: undefined name 'NodeImplementedError'
html5lib/treewalkers/etree.py:129: undefined name 'moduleFactoryFactory'
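
The "undefined name" reports matter more than the unused imports because Python only resolves a name when the line runs, so such code imports cleanly and fails later. A small self-contained sketch of the effect follows; `TreeWalkerSketch` and `get_first_child` are made up for illustration, and only the misspelled exception name is borrowed from the report above.

```python
class TreeWalkerSketch:
    def get_first_child(self, node):
        # Typo for NotImplementedError: the class still defines without error.
        raise NodeImplementedError


walker = TreeWalkerSketch()  # definition and instantiation both succeed

try:
    walker.get_first_child(None)
except NameError as exc:
    # The mistake only surfaces when the faulty line is executed.
    message = str(exc)
```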

Infinite loop with nested button

http://code.google.com/p/html5lib/issues/detail?id=211

Reported by [email protected], Aug 24, 2012

So I know this is not well-formed HTML, but it occurred in the wild as the output from Markdown.

I have the latest pypi Python library (version = 0.95-dev).

If I try to parse the following HTML, my program goes into an infinite loop and memory usage increases without stop:

u"<p>So theres no shortage of info out there on rounded corners and I've been through much of it and I'm posting to get the communities opinons at this piont.</p>\n<p>My scenario is that we're developing a rounded corner dependant design, mainly used for interactions (<button> and <a>). We are going to use border radius for the good browsers on the block that play nice with it and then use the server to send down javscript to browsers that don't</p>\n<p>What I'm wondering is what to use to up scale the browsers that ignore border radius CSS? I need something that works on button aswell as a, div etc. I've been looking at the following and have found that some don't play nice with <button>. Also the site already uses jQuery.</p>\n<p>http://www.curvycorners.net/ - http://code.google.com/p/jquerycurvycorners/</p>\n<p>http://www.html.it/articoli/niftycube/index.html</p>\n<p>http://www.malsup.com/jquery/corner/</p>"

Aug 24, 2012 waylan

I can't comment on the infinite loop, but as the maintainer of the Markdown library, I was concerned regarding the original reporter's implication that Markdown may be producing invalid HTML. While only the output is provided, not the input, it appears to me that the invalid output is a result of invalid input. You should be wrapping those random angle-bracket tags in code tags. So "(`<button>` and `<a>`)" (note the backticks surrounding each tag) would be output by Markdown as "(<code>&lt;button&gt;</code> and <code>&lt;a&gt;</code>)", which is valid HTML and will not result in an infinite loop in html5lib.

In the event that the Markdown input is coming from an untrusted third party, you absolutely should be sanitizing it before passing it on to anything else.

That said, one way to sanitize (my recommendation) is to use the Bleach library, which uses html5lib internally. So I guess we're back to that infinite loop.

Aug 24, 2012 [email protected]

The Markdown comes from the wild and is probably invalid.

My idea was to pass the HTML through tidy before running an HTML parser, thus avoiding an infinite loop. There are several tidy wrappers in Python. I used pytidylib.

I didn't play with the options to make tidy more strict, and even after tidy, html5lib still goes into an infinite loop. So my current workaround is to use tidy followed by lxml :\

Implement scripting disabled case.

Since people have requested this.

Treebuilder test serializers are recursive

They are therefore liable to run out of stack. They shouldn't.
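
One way to avoid the stack limit is to replace the recursion with an explicit stack. The following is a minimal sketch, assuming an xml.etree.ElementTree tree and a made-up output format (one indented tag per line). `serialize_iterative` is a hypothetical name, not the function used in html5lib's tests.

```python
import xml.etree.ElementTree as ET


def serialize_iterative(root):
    """Return one indented '<tag>' line per element, without recursion."""
    lines = []
    stack = [(root, 0)]  # explicit stack of (element, depth) pairs
    while stack:
        element, depth = stack.pop()
        lines.append(" " * (2 * depth) + "<" + element.tag + ">")
        # Push children in reverse so the first child is popped first.
        for child in reversed(list(element)):
            stack.append((child, depth + 1))
    return "\n".join(lines)


doc = ET.fromstring("<html><head/><body><p/></body></html>")
outline = serialize_iterative(doc)
```

Because the pending work lives in a list rather than on the call stack, the nesting depth is bounded by memory instead of the interpreter's recursion limit.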

Sanitizing filter broken in 0.90

http://code.google.com/p/html5lib/issues/detail?id=162

Reported by [email protected], Oct 10, 2010

DESCRIPTION

Consider the following interaction with html5lib 0.90:
    >>> from html5lib import html5parser, serializer, treebuilders, treewalkers
    >>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
    >>> dom = p.parse("""<body onload="sucker()">""") 
    >>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
    >>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
    u'<body onload=sucker()>'
This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.

ANALYSIS

The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplication of code, these two sanitizers inherit from the class HTMLSanitizerMixin and both call that class's method sanitize_token.

Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:
    >>> from html5lib import tokenizer
    >>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
    {'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}
But during filtering, tokens look like this:
    >>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
    {'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}
When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens, because sanitize_token is expecting 'type' to be an integer.
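
The mismatch can be reproduced without html5lib. The sketch below uses a deliberately simplified, hypothetical `sanitize_token` (it only strips `on*` attributes) to show how a check against the integer type code lets the tree walker's string-typed token through untouched. Only the value 3 and the two token shapes are taken from the output quoted above.

```python
START_TAG = 3  # integer code seen in the tokenizer output quoted above


def sanitize_token(token):
    """Drop on* attributes from start tags identified by the integer code."""
    if token["type"] == START_TAG:
        kept = [(name, value) for name, value in token["data"]
                if not name.startswith("on")]
        return dict(token, data=kept)
    return token  # anything else passes through unchanged


tokenizer_token = {"type": 3, "name": "body",
                   "data": [["onload", "sucker()"]]}
treewalker_token = {"type": "StartTag", "name": "body",
                    "data": [("onload", "sucker()")]}

sanitized = sanitize_token(tokenizer_token)     # attribute removed
unsanitized = sanitize_token(treewalker_token)  # attribute survives
```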

OBSERVATION

Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?

WORKAROUND

I am working around this problem as follows: when I need to apply a sanitizing filter to a DOM tree, instead I do the following:

1. Serialize the DOM to HTML without sanitization.

2. Re-parse the HTML from step 1, using the sanitizing tokenizer.

lxml fails with null byte in attribute value

From Google Code 186:

>>> html5lib.parse("<p val=\"hu\x00\">", treebuilder="lxml")
Traceback (most recent call last):
...
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
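
Until the tree builder copes with this itself, a caller could normalise the input before parsing. The sketch below is an assumption about one possible workaround, not html5lib's behaviour: the hypothetical helper `replace_nul` swaps each U+0000 for U+FFFD REPLACEMENT CHARACTER, mirroring how the WHATWG tokenizer rules treat a NUL inside an attribute value. Other control characters that lxml rejects are not covered.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def replace_nul(markup):
    """Return markup with every U+0000 replaced by U+FFFD."""
    return markup.replace("\x00", REPLACEMENT_CHARACTER)


# Same input as in the report above.
cleaned = replace_nul('<p val="hu\x00">')
```

The cleaned string could then be handed to html5lib.parse with treebuilder="lxml" as in the report.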

HTMLSanitizer can't be used as a tokenizer

Reported by devin.bayer, Jun 7, 2011

version html5lib-0.95_dev

/html5lib/html5parser.py line 242 in parseFragment
      self._parse(stream, True, container=container, encoding=encoding)
/html5lib/html5parser.py line 110 in _parse
      parser=self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parser'

Jun 7, 2011 devin.bayer

This is a workaround and slightly safer design. There is no need for the mixin or to hardcode the __init__ arguments:

from html5lib import HTMLParser
from html5lib.tokenizer import HTMLTokenizer
from html5lib.sanitizer import HTMLSanitizerMixin

class Sanitizer(HTMLTokenizer):
    """Tokenizer that passes every token through the sanitizer mixin."""

    def __init__(self, *a, **kw):
        # Forward whatever arguments the parser supplies (including 'parser').
        HTMLTokenizer.__init__(self, *a, **kw)
        self._saner = HTMLSanitizerMixin()

    def __iter__(self):
        for token in HTMLTokenizer.__iter__(self):
            saner = self._saner.sanitize_token(token)
            # Skip tokens for which sanitize_token returns a false value.
            if saner:
                yield saner

PARSER = HTMLParser(tokenizer=Sanitizer)

def sanitize(html):
    return PARSER.parseFragment(html).toxml()

ResourceWarning on tests

Not seen this locally, oddly, but Travis CI always hits it:

Exception ResourceWarning: ResourceWarning("unclosed file <_io.BufferedReader name='/home/travis/build/html5lib/html5lib-python/html5lib/tests/testdata/encoding/chardet/test_big5.txt'>",) in <_io.FileIO name='/home/travis/build/html5lib/html5lib-python/html5lib/tests/testdata/encoding/chardet/test_big5.txt' mode='rb'> ignored
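
Python's default warning filters ignore ResourceWarning, which may be why it does not show up locally. The following is a minimal sketch, with hypothetical helper names and a throwaway temporary file standing in for the test data, of how to surface the warning and how a `with` block avoids it.

```python
import gc
import os
import tempfile
import warnings


def read_closed(path):
    """Read a file and close it deterministically."""
    with open(path, "rb") as f:
        return f.read()


def resource_warnings(func, path):
    """Call func(path) and return any ResourceWarning records it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ResourceWarning)
        func(path)
        gc.collect()  # give any unclosed file a chance to be finalised
    return [w for w in caught if issubclass(w.category, ResourceWarning)]


# Throwaway file standing in for a test-data file (assumption).
fd, sample_path = tempfile.mkstemp()
os.write(fd, b"sample")
os.close(fd)

data = read_closed(sample_path)
leaks = resource_warnings(read_closed, sample_path)
os.remove(sample_path)
```

Passing a reader that calls `open(path, "rb").read()` without closing the file to `resource_warnings` is a quick way to check whether a given helper leaks.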

Support charade

This is effectively a six-based Py2/3 version of chardet. Supporting it would let us stop being evil under Python 3 and using the Debian package for testing purposes.

Upload 0.90 to pypi

You released this previous version and hosted it externally. With recent changes (made for very good reasons), externally hosted packages like yours (without a hash and such) don't work anymore. Since there isn't really a change log between 0.90 and 0.95, could you please upload 0.90 so that we can keep using what works for us until we have time to upgrade?