html5lib / html5lib-python
Standards-compliant library for parsing and serializing HTML documents and fragments in Python
License: MIT License
http://code.google.com/p/html5lib/issues/detail?id=210
Reported by r.kintzi, Aug 14, 2012
What steps will reproduce the problem?
from html5lib import HTMLParser
from html5lib.treebuilders import getTreeBuilder
from html5lib.treewalkers import getTreeWalker
from html5lib.filters.sanitizer import Filter as Sanitizer
html = "<html><body><h1>Header"
parser = HTMLParser(tree = getTreeBuilder("lxml"), namespaceHTMLElements = False)
doc = parser.parse(html)
root = doc.getroot()
body = doc.xpath('/html/body')
walker = getTreeWalker('lxml')
stream = walker(body)
stream = Sanitizer(stream)
for token in stream:
    print token
What is the expected output? What do you see instead?
I do not know exactly what should be printed. Instead, an exception is raised:
$ python t.py
{'namespace': u'None', 'type': 'Characters', 'data': u'<body>'}
Traceback (most recent call last):
  File "t.py", line 17, in <module>
    for token in stream:
  File "/home/radek/.virtualenvs/blog/local/lib/python2.7/site-packages/html5lib-0.95-py2.7.egg/html5lib/filters/sanitizer.py", line 7, in __iter__
    token = self.sanitize_token(token)
  File "/home/radek/.virtualenvs/blog/local/lib/python2.7/site-packages/html5lib-0.95-py2.7.egg/html5lib/sanitizer.py", line 171, in sanitize_token
    token["data"][::-1]
TypeError: unhashable type
Please provide any additional information below.
the faulty token is:
{'namespace': u'None', 'type': 'StartTag', 'name': u'h1', 'data': {}}
From Google Code #153:
Reported by @fantasai, Jun 1, 2010
What steps will reproduce the problem?
Parse an XHTML file containing attributes in unsorted order with lxml and reserialize.
What is the expected output? What do you see instead?
Expect no change.
Got attributes in alphabetical order, which makes the source harder to read (since the order was chosen to optimize readability, e.g. listing the fixed-length rel="stylesheet" before variable-length href="..."). This also makes it harder to understand diffs, since there's a lot of unnecessary changes to the source output.
Ideally, html5lib would remember the order of attributes and reserialize in that order. lxml does remember the order, so removing the attrs.sort() line in htmlserializer.py is adequate to fix the problem for serializing an lxml tree.
Jul 20, 2010 geoffers
AFAIK the reason for the sort being there is so that there is a guaranteed order even when a tree-builder with no guaranteed order is being used.
May 22, 2011 geoffers
There's no real way to fix this without relying upon defined-to-be-undefined behaviour in CPython/lxml, and as such I'm reluctant to do so. lxml says attributes are given in an arbitrary order, and they are stored in a dict which CPython makes no guarantee of the order of. (lxml does always insert attributes in document order into the dict, and dicts are ordered by insertion order, so it does actually work… for now, at least).
Yes, we could go against both the lxml/CPython documentation and rely upon the ordering, but if either ever changes their behaviour, it could mean html5lib could potentially start serializing the same lxml parse-tree in random ways, and I'd much rather go for the definitely-consistent route we have now.
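The trade-off in this thread (sorted output is deterministic, source order is more readable) can be sketched as a single switch. This is only an illustration, not html5lib's serializer: the function name `serialize_attributes` and its `alphabetical` parameter are hypothetical, and the input is assumed to be a list of (name, value) pairs already in source order.

```python
def serialize_attributes(attrs, alphabetical=True):
    """Render (name, value) pairs as an HTML attribute string.

    Hypothetical helper for illustration. With alphabetical=True the
    output does not depend on the order in which `attrs` was supplied.
    """
    items = list(attrs)
    if alphabetical:
        items = sorted(items)
    rendered = []
    for name, value in items:
        # Escape the two characters that matter inside double quotes.
        value = value.replace("&", "&amp;").replace('"', "&quot;")
        rendered.append('%s="%s"' % (name, value))
    return " ".join(rendered)


link = [("rel", "stylesheet"), ("href", "style.css")]
sorted_output = serialize_attributes(link)
source_order_output = serialize_attributes(link, alphabetical=False)
```

Sorting stays the default in this sketch because it gives the same output for any tree-builder; the source-order path is only as stable as the order the tree-builder hands over.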
This still refers to all the Google Code stuff.
Sent upstream to html5lib/html5lib-tests#2
Hi, I have a unit test to ensure my end pages are valid HTML5.
The test has a function which fails the test if the page is not valid.
def assert_valid_html5(self, html5):
    ''' ensures that the html5 variable consists of valid html5 code '''
    from html5lib import html5parser
    from html5lib import treebuilders
    from html5lib import constants
    treebuilder = treebuilders.getTreeBuilder("etree")
    parser = html5parser.HTMLParser(tree=treebuilder)
    document = parser.parse(html5)
    errors=[]
    for pos, errorcode, datavars in parser.errors:
        errors.append("Line %i Col %i"%pos + " " + constants.E.get(errorcode, 'Unknown error "%s"' % errorcode) % datavars)
    self.assertEquals(len(errors), 0, "expected valid html5, but found errors %s, for html %s" % ("\n".join(errors), html5))
The test runs fine on my mac and also fine on the machine of another developer. However on our jenkins node I'm getting an AttributeError.
I'm trying to find what it is. The strange thing is, when I run "import html5lib" from the Python interpreter I don't get any errors. I'm using nosetests to run the tests.
Traceback (most recent call last):
File "/usr/lib/python2.7/unittest/case.py", line 327, in run
testMethod()
File "/var/lib/jenkins/jobs/bliep-master/workspace/test/test_index.py", line 20, in test_index_page
self.assert_valid_html5(response.body)
File "/var/lib/jenkins/jobs/bliep-master/workspace/test/test_index.py", line 46, in assert_valid_html5
from html5lib import html5parser
File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
result = apply(self.realImport, (name, globals, locals, fromlist))
File "/usr/local/lib/python2.7/dist-packages/html5lib/__init__.py", line 16, in <module>
from .html5parser import HTMLParser, parse, parseFragment
File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
result = apply(self.realImport, (name, globals, locals, fromlist))
File "/usr/local/lib/python2.7/dist-packages/html5lib/html5parser.py", line 6, in <module>
from . import inputstream
File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
result = apply(self.realImport, (name, globals, locals, fromlist))
File "/usr/local/lib/python2.7/dist-packages/html5lib/inputstream.py", line 9, in <module>
from . import utils
File "/var/lib/jenkins/jobs/bliep-master/workspace/test/rollbackimporter.py", line 13, in _import
result = apply(self.realImport, (name, globals, locals, fromlist))
File "/usr/local/lib/python2.7/dist-packages/html5lib/utils.py", line 6, in <module>
import xml.etree.cElementTree as default_etree
AttributeError: 'module' object has no attribute 'cElementTree'
Is this a known issue? I'm trying to figure out what's wrong in the setup, but maybe it is also related to html5lib.
http://code.google.com/p/html5lib/issues/detail?id=174
Reported by eric.promislow, Feb 23, 2011
Given this HTML:
<!doctype html>
<title>Adoption agency bug</title>
<p>A form:</a>
I was expecting an unexpected-end-tag error, but got this value for parser.errors:
[((3, 14), 'adoption-agency-1.1', {'name': u'a'})]
Version: 0.90
Implementation: python2
This is the basic outcome of issue 201 on GCode.
In [6]: html5lib.serialize(html5lib.parse("<math><mi xml:lang=en xlink:href=foo /></math>"))
Out[6]: u'<math><mi href=foo lang=en></mi></math>'
We should get u'<math><mi xlink:href=foo xml:lang=en></mi></math>' or similar back.
<gsnedders> (html5lib.serialize(my_document, alphabeticize_attributes=True) should work as an API basically)
<body><title>X</title> gets serialized as <title>X</title> with omit_optional_tags=True despite it being semantically important. Similar bug with <body><meta>.
http://code.google.com/p/html5lib/issues/detail?id=166
Reported by [email protected], Dec 21, 2010
Running gettext over the Python HTML5Lib gives a few warnings in the form:
errors happened while running xgettext on lint.py
./html5lib/filters/lint.py:56: warning: 'msgid' format string with unnamed arguments cannot be properly localized: The translator cannot reorder the arguments. Please consider using a format string with named arguments, and a mapping instead of a tuple for the arguments.
This can be relatively trivially resolved by changing the bare %s, %r, etc. into named substitutions. The attached patch is my attempt to take care of things.
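For readers unfamiliar with the xgettext warning, a minimal sketch of the difference between unnamed and named substitutions follows. The message strings are made up for illustration and are not the ones in lint.py.

```python
# Unnamed arguments: the values are bound by position, so a translated
# string cannot put them in a different order.
positional = "Tag %s has unexpected attribute %r" % ("a", "href")

# Named arguments with a mapping: same result for the original wording...
values = {"tag": "a", "attr": "href"}
named = "Tag %(tag)s has unexpected attribute %(attr)r" % values

# ...but a translation (simulated here in English) is free to reorder.
reordered = "Unexpected attribute %(attr)r on tag %(tag)s" % values
```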
3facc99 left the comments in an inconsistent state, some updated, others not. We should make these consistent.
Using hashtables for tokens makes no sense; we should move to objects and then have the type implicit.
From Google Code Issue 220:
Reported by [email protected], Mar 1, 2013
What steps will reproduce the problem?
Reproducible in Jython 2.5.2 and Jython 2.7b1
>>> import html5lib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lib/html5lib/__init__.py", line 14, in <module>
    from html5parser import HTMLParser, parse, parseFragment
  File "lib/html5lib/html5parser.py", line 33, in <module>
    import inputstream
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 48-54: illegal Unicode character
What is the expected output? What do you see instead?
Jython cannot read inputstream.py.
Please provide any additional information below.
inputstream.py contains some seriously broken Unicode characters in the range 0xD800-0xDFFF, which are known as "unpaired surrogates".
This has been closed as wont-fix: http://bugs.jython.org/issue1836
It may be necessary to modify inputstream.py to not use these unicode character literals when running in Jython.
n.b. a test for Jython:
import platform
JYTHON = (platform.system() == 'Java')
Apr 7 (2 days ago) geoffers
As I just commented on the Jython bug, I believe this is a bug in Jython not implementing Python as it is documented. I don't particularly want to add in hacks for a bug in Jython.
Apr 7 (2 days ago) geoffers
Furthermore, if I do add a hack for it, we go into an infinite loop in the testsuite. I'm changing this bug to a more generic support Jython bug — no timeframe or even decision whether this is likely to happen.
Not seen this locally, oddly, but Travis CI always hits it:
Exception ResourceWarning: ResourceWarning("unclosed file <_io.BufferedReader name='/home/travis/build/html5lib/html5lib-python/html5lib/tests/testdata/encoding/chardet/test_big5.txt'>",) in <_io.FileIO name='/home/travis/build/html5lib/html5lib-python/html5lib/tests/testdata/encoding/chardet/test_big5.txt' mode='rb'> ignored
http://code.google.com/p/html5lib/issues/detail?id=211
Reported by [email protected], Aug 24, 2012
So I know this is not well-formed HTML, but it occurred in the wild as the output from Markdown.
I have the latest pypi Python library (version = 0.95-dev).
If I try to parse the following HTML, my program goes into an infinite loop and memory usage increases without stop:
u"<p>So theres no shortage of info out there on rounded corners and I've been through much of it and I'm posting to get the communities opinons at this piont.</p>\n<p>My scenario is that we're developing a rounded corner dependant design, mainly used for interactions (<button> and <a>). We are going to use border radius for the good browsers on the block that play nice with it and then use the server to send down javscript to browsers that don't</p>\n<p>What I'm wondering is what to use to up scale the browsers that ignore border radius CSS? I need something that works on button aswell as a, div etc. I've been looking at the following and have found that some don't play nice with <button>. Also the site already uses jQuery.</p>\n<p>http://www.curvycorners.net/ - http://code.google.com/p/jquerycurvycorners/</p>\n<p>http://www.html.it/articoli/niftycube/index.html</p>\n<p>http://www.malsup.com/jquery/corner/</p>"
Aug 24, 2012 waylan
I can't comment on the infinite loop, but as the maintainer of the Markdown library, I was concerned regarding the original reporter's implication that Markdown may be producing invalid HTML. While only the output is provided, not the input, it appears to me that the invalid output is a result of invalid input. You should be wrapping those random angle-bracket tags in code tags. So "(`<button>` and `<a>`)" (note the backticks surrounding each tag) would be output by Markdown as "(<code>&lt;button&gt;</code> and <code>&lt;a&gt;</code>)", which is valid HTML and will not result in an infinite loop in html5lib.
If the Markdown input is coming from an untrusted third party, then you absolutely should be sanitizing it before passing it on to anything else.
That said, one such way to sanitize (my recommendation) is to use the Bleach library, which uses html5lib internally. So I guess we're back to that infinite loop.
Aug 24, 2012 [email protected]
The Markdown comes from the wild and is probably invalid.
My idea was to pass the HTML through tidy before running an HTML parser, thus avoiding an infinite loop. There are several tidy wrappers in Python. I used pytidylib.
I didn't play with the options to make tidy more strict, and even after tidy, html5lib still goes into an infinite loop. So my current workaround is to use tidy followed by lxml :\
Since people have requested this.
find . -name '*.py' -print0 | xargs -0 sed -i -e 's/\s*$//g'
will do. Should wait until there are no outstanding branches for the sake of them merging cleanly.
U+000D is converted to U+000A in the pre-tokenizer, so it must be serialized as a character reference.
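A minimal sketch of what that means for character data, assuming a standalone helper rather than html5lib's serializer (the name `escape_text` is hypothetical): a literal carriage return is written as a numeric character reference so that it is not normalized to U+000A when the output is parsed again.

```python
def escape_text(text):
    """Escape character data so a literal CR survives a reparse.

    Hypothetical helper for illustration. The ampersand is replaced
    first so the references added afterwards are not escaped again.
    """
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\r", "&#13;"))


escaped = escape_text("line one\rline two")
```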
Reported by devin.bayer, Jun 7, 2011
version html5lib-0.95_dev
/html5lib/html5parser.py line 242 in parseFragment
  self._parse(stream, True, container=container, encoding=encoding)
/html5lib/html5parser.py line 110 in _parse
  parser=self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parser'
Jun 7, 2011 devin.bayer
This is a workaround and slightly safer design. There is no need for the mixin or to hardcode the __init__ arguments:
from html5lib import HTMLParser
from html5lib.tokenizer import HTMLTokenizer
from html5lib.sanitizer import HTMLSanitizerMixin
from cgi import escape
class Sanitizer(HTMLTokenizer):
    def __init__(self, *a, **kw):
        HTMLTokenizer.__init__(self, *a, **kw)
        self._saner = HTMLSanitizerMixin()
    def __iter__(self):
        for token in HTMLTokenizer.__iter__(self):
            saner = self._saner.sanitize_token(token)
            if saner: yield saner
PARSER = HTMLParser(tokenizer=Sanitizer)
def sanitize(html):
    return PARSER.parseFragment(html).toxml()
For the sake of having something to prod James with.
Reported by maks.tsoy, Dec 24, 2012
Seems like the default value for encoding must be utf-8, not utf=8:
def toxml(self, encoding="utf=8"):
http://code.google.com/p/html5lib/issues/detail?id=200
Reported by vovanec, Mar 6, 2012
A simple test case (my program has a more complex handler implementation, but the problem is reproducible with the default handler):
import xml.sax.handler
import html5lib
def test(html):
    handler = xml.sax.handler.ContentHandler()
    parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom'))
    dom = parser.parse(html)
    html5lib.treebuilders.dom.dom2sax(dom, handler)
html = '<html xml:lang="en">'
test(html)
With html5lib 0.95 it produces the following traceback:
python test.py
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    test(html)
  File "test.py", line 10, in test
    html5lib.treebuilders.dom.dom2sax(dom, handler)
  File "/home/vkuznets/packages/html5lib-0.95/html5lib-0.95/html5lib/treebuilders/dom.py", line 271, in dom2sax
    for child in node.childNodes: dom2sax(child, handler, nsmap)
  File "/home/vkuznets/packages/html5lib-0.95/html5lib-0.95/html5lib/treebuilders/dom.py", line 256, in dom2sax
    del attributes[(attr.namespaceURI, attr.nodeName)]
KeyError: (None, u'xml:lang')
With previous versions (at least 0.11) there is no error. I assume this attribute may be invalid in the xml namespace, but anyway I don't think it is OK for the parser to just crash. I've seen A LOT of HTML documents that have such an attribute in the real world.
Tested it with Python 2.6.5, Linux
Please advise.
Thanks,
--Vladimir
As specified in: http://www.w3.org/TR/html-templates/
Implementation already present in the Firefox nightlies, probably in Chrome too. This means the spec will be ported to HTML proper shortly.
http://code.google.com/p/html5lib/issues/detail?id=52
Reported by [email protected], Aug 9, 2007
CSS keywords "thin" and "thick" are not in the whitelist.
Aug 9, 2007 [email protected]
This issue was originally noticed at:
http://wiki.whatwg.org/wiki/Sanitization_rules
I assume feedparser and friends are all affected since they share the code.
(Patch dropped because it'll be as much effort to deal with changed paths and everything as to just rewrite it.)
http://code.google.com/p/html5lib/issues/detail?id=92
Reported by zcorpan, Feb 27, 2009
What steps will reproduce the problem?
Input:
<br title=`><xmp>`><script>alert(1)</script></xmp>
Serialization options: omit quotes.
What is the expected output?
Attribute values with ` in them should be quoted even with the omit quotes setting.
What do you see instead?
Quotes are omitted and hence, the script is run in IE.
Feb 27, 2009 t.broyer
IIRC, the spec says a ` is allowed in an unquoted attribute value:
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#attributes
Should the spec be changed? Should we rather add a new option to the serializer?
Mar 10, 2009 sad.neko
I'm sorry, but I couldn't find ` to be allowed in unquoted attribute values in HTML5, nor in HTML4. Am I missing something?
Sep 4, 2009 Simetrical
The requirements that comment 2 links to say unquoted attributes "must not contain any literal space characters, any U+0022 QUOTATION MARK (") characters, U+0027 APOSTROPHE (') characters, U+003D EQUALS SIGN (=) characters, U+003C LESS-THAN SIGN (<) characters, or U+003E GREATER-THAN SIGN (>) characters, and must not be the empty string." There are no other constraints that don't apply to quoted attributes as well.
What's the bug here? As far as I can tell from reading the spec, the given text should parse as
<br title="`"><xmp>`><script>alert(1)</script></xmp>
and conformant browsers should run the script.
Sep 6, 2009 zcorpan
No, because xmp is a RAWTEXT element. So it's equivalent to the following XML
<br title="`"/><xmp>`><script>alert(1)</script></xmp>
but in IE it's equivalent to the following XML
<br title="><xmp>"/><script>alert(1)</script><xmp/>
(I think a stray </xmp> tag will result in an empty element in IE, but I could remember incorrectly; anyway that's beside the point.)
Oct 18, 2009 geoffers
` is now non-conforming at the start of an unquoted attribute.
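The rule being discussed in this thread can be sketched as a small check. This is not html5lib's serializer; `needs_quotes` and `serialize_attribute` are hypothetical names, the base character set is the one quoted from the spec in the comments above, and treating U+0060 as unsafe anywhere in the value (not only at the start) is a deliberately conservative assumption.

```python
# Characters the quoted spec text forbids in unquoted attribute values:
# the space characters (space, tab, LF, FF, CR), ", ', =, < and >.
SPEC_UNSAFE = set(" \t\n\x0c\r\"'=<>")
# Assumption for this sketch: also treat ` as unsafe because of IE.
EXTRA_UNSAFE = set("`")


def needs_quotes(value, extra_unsafe=EXTRA_UNSAFE):
    """Return True if `value` must not be written without quotes."""
    if value == "":
        return True
    return any(ch in SPEC_UNSAFE or ch in extra_unsafe for ch in value)


def serialize_attribute(name, value):
    """Write name=value, omitting quotes only when that is safe."""
    value = value.replace("&", "&amp;")
    if not needs_quotes(value):
        return "%s=%s" % (name, value)
    return '%s="%s"' % (name, value.replace('"', "&quot;"))


safe = serialize_attribute("class", "header")
risky = serialize_attribute("title", "`")
```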
Related to #11.
http://code.google.com/p/html5lib/issues/detail?id=162
Reported by [email protected], Oct 10, 2010
DESCRIPTION
Consider the following interaction with html5lib 0.90:
>>> from html5lib import html5parser, serializer, treebuilders, treewalkers
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse("""<body onload="sucker()">""")
>>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
>>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
u'<body onload=sucker()>'
This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.
ANALYSIS
The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplication of code, these two sanitizers inherit from the class HTMLSanitizerMixin and both call that class's function sanitize_token.
Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:
>>> from html5lib import tokenizer
>>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
{'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}
But during filtering, tokens look like this:
>>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
{'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}
When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens, because sanitize_token is expecting 'type' to be an integer.
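The mismatch can be reproduced in miniature without html5lib. The sketch below is a toy model, not the library's code: `TOKEN_TYPES`, `ALLOWED_ATTRIBUTES` and `normalize_type` are hypothetical, and only the value 3 for a start tag is taken from the tokenizer output above; the other integers are assumed.

```python
# Only StartTag == 3 comes from the session above; the rest is assumed.
TOKEN_TYPES = {"StartTag": 3, "EndTag": 4, "EmptyTag": 5}
TAG_TYPES = frozenset(TOKEN_TYPES.values())
ALLOWED_ATTRIBUTES = frozenset(["href", "title"])  # toy whitelist


def sanitize_token(token):
    """Toy sanitizer that, like the one described, expects an integer type."""
    if token["type"] in TAG_TYPES:
        token = dict(token)
        token["data"] = [(k, v) for k, v in token["data"]
                         if k in ALLOWED_ATTRIBUTES]
    return token


def normalize_type(token):
    """Map a tree-walker style string type onto the integer form."""
    token = dict(token)
    token["type"] = TOKEN_TYPES.get(token["type"], token["type"])
    return token


walker_token = {"type": "StartTag", "name": "body",
                "data": [("onload", "sucker()")]}

# String type: the integer check never matches, so nothing is removed.
unsanitized = sanitize_token(walker_token)
# Integer type: the same token now loses its onload attribute.
sanitized = sanitize_token(normalize_type(walker_token))
```

A shim like `normalize_type` only papers over the problem; the observation that follows, that two formats for one data type is the real hazard, still applies.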
OBSERVATION
Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?
WORKAROUND
I am working around this problem as follows: when I need to apply a sanitizing filter to a DOM tree, instead I do the following:
1. Serialize the DOM to HTML without sanitization.
2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
http://code.google.com/p/html5lib/issues/detail?id=189
Reported by devin.bayer, Jul 25, 2011
Hi. I realize this is by design, but it's not intuitive, since similar standard classes like YamlDecoder and JSONDecoder are.
It would be more clear if the input stream was supplied to the constructor, like with ElementTree.
But at least, please document this in the class.
Sep 16, 2011 geoffers
Is there any reason to document it? This is the case with all Python code in CPython (other implementations may differ), so the cases where things are threadsafe are the notable exceptions.
Sep 16, 2011 devin.bayer
(Most?) Everything in the python standard library is threadsafe and most extensions are. I think you are referring to the GIL, which is different. That prevents parallel execution, but if one thread is blocking, the others can run safely.
The problem with the design of HTMLParser is that two threads can interfere with each other, even if they are not running at the same time.
Mar 11, 2012 [email protected]
This is clearly a defect. This is an object-oriented library in an object oriented language. Two parsers should be completely independent of each other, with no shared global variables, and thus thread-safe. If that's not the case, this is a defect.
Do I have to scrap my plans to convert a parallel web crawler from BeautifulSoup to html5lib?
This looks fixable. The trouble spots include at least these global variables:
dom.py: moduleCache
That could be easily fixed with a lock in getDomModule. That's a once per parse event, so there's no performance issue. All that's needed is
import threading
...
Lok = threading.Lock()
with Lok:
    ... critical section ...
etree.py: moduleCache
Same issue.
etree.lxml: fullTree
This seems to be set only once, at load time. Is it changed elsewhere?
What have I missed? Some lower level library? Is Python's SAX parser unsafe?
This can and should be fixed.
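A self-contained sketch of the lock-around-the-cache idea from this comment. It does not touch html5lib's actual moduleCache; `get_cached_module`, `_module_cache` and the factory argument are hypothetical stand-ins.

```python
import threading

_module_cache = {}
_module_cache_lock = threading.Lock()


def get_cached_module(key, factory):
    """Return the cached value for `key`, building it at most once.

    The check and the insertion happen under one lock, so two threads
    asking for the same key cannot both run `factory`.
    """
    with _module_cache_lock:
        if key not in _module_cache:
            _module_cache[key] = factory(key)
        return _module_cache[key]


# Usage: several threads request the same entry concurrently.
build_calls = []


def build(key):
    build_calls.append(key)
    return object()


threads = [threading.Thread(target=get_cached_module, args=("dom", build))
           for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Note that the lock is used as `with _module_cache_lock:` and not called; calling a Lock object raises a TypeError.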
To quote @rubys from http://code.google.com/p/html5lib/issues/detail?id=62:
My inclination is to flip this entirely. It seems inconsistent that evil CSS is stripped, and unknown attributes are stripped, but unknown elements are escaped, and escaped poorly (what happens if an attribute for this element has a double quote in it?).
I mean, who wants to see <object> tags. It is bad enough that YouTube videos are stripped, but rubbing salt in the wound by showing a bunch of gibberish seems entirely unnecessary.
I'd suggest an expose_disallowed_elements=False class variable which can be set to True if somebody really wants the current behavior.
With #26 this has become possible, but a nicer API would be better.
They are therefore liable to run out of stack. They shouldn't.
http://code.google.com/p/html5lib/issues/detail?id=180
Reported by bjellema20, Mar 8, 2011
What steps will reproduce the problem?
Pass any html into the sanitizer with an inline style that includes a font-family with a dash (-) such as "sans-serif" and the entire style is stripped. Example html:
<span style='font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D'>Enjoy your day</span>
What is the expected output? What do you see instead?
The style tag should stay, but instead we see:
<span style="">Enjoy your day</span>
Please provide any additional information below.
I've solved this by changing line 197 in sanitizer.py from:
if not re.match("""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$""", style): return ''
To:
if not re.match("""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w-]+'|"[\s\w-]+"|\([\d,\s]+\))*$""", style): return ''
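The effect of the proposed change can be checked in isolation. The two patterns below are copied from the report (written as raw strings so the backslashes are passed through unchanged), and the style value is the one from the example markup; nothing else from sanitizer.py is reproduced.

```python
import re

# Pattern before the change: quoted strings may contain only \s and \w.
OLD = re.compile(
    r"""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'|"[\s\w]+"|\([\d,\s]+\))*$""")
# Pattern after the change: a dash is also allowed inside quoted strings.
NEW = re.compile(
    r"""^([:,;#%.\sa-zA-Z0-9!]|\w-\w|'[\s\w-]+'|"[\s\w-]+"|\([\d,\s]+\))*$""")

style = 'font-size:10.0pt;font-family:"Verdana","sans-serif";color:#1F497D'

old_accepts = OLD.match(style) is not None
new_accepts = NEW.match(style) is not None
```

The unquoted dashes in font-size and font-family are already covered by the \w-\w alternative, so the quoted "sans-serif" is the only part the old pattern rejects.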
http://code.google.com/p/html5lib/issues/detail?id=176
Reported by eric.promislow, Feb 23, 2011
For this standalone code (Python 2.6, html5lib 0.90):
import cStringIO, html5lib, pprint
text = """
<!DOCTYPE html>
Here's a table
<table>
<caption>Stuff goes here</bogus>
<tr><td>col</td><td>another</td></tr>
</table>
"""
inputStream = cStringIO.StringIO(text)
parser = html5lib.HTMLParser()
doc = parser.parse(inputStream)
errors = parser.errors
pprint.pprint(errors)
I get this output:
[((5, 32), 'unexpected-end-tag', {'name': u'bogus'}), ((6, 4), 'XXX-undefined-error', {})]
Quick first look at the code, I would say that end-tags that aren't
in self.tree.openElements should be popped as part of error-recovery,
but then, I just started working with this project a few hours ago...
>>> html5lib.parse("<p val=\"hu\x00\">", treebuilder="lxml")
Traceback (most recent call last):
...
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
http://code.google.com/p/html5lib/issues/detail?id=79
Reported by nils.winter, Sep 1, 2008
What steps will reproduce the problem?
Take any html file and use the minidom treebuilder. See the attached example.
What is the expected output? What do you see instead?
doc.getElementById doesn't work. It should.
Please provide any additional information below.
minidom seems to expect some document definitions like
<!DOCTYPE x [<!ATTLIST x a ID #IMPLIED>]>
(or the complete doctype). Since the html5lib treebuilder is used for building the document and html5lib is provided with a valid HTML document, it should be html5lib's business to provide the ID definitions based on the current HTML doctype to minidom.
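Until the treebuilder does this itself, one possible caller-side workaround is to mark the attributes as IDs after parsing. The sketch below uses plain xml.dom.minidom with a hand-written document rather than an html5lib tree, and assuming that every attribute named "id" should be an ID is this example's own choice.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<html><body><p id="intro">Hello</p></body></html>')

# No attribute has been declared as an ID yet, so the lookup is
# expected to come back empty at this point.
before = doc.getElementById("intro")

# Mark every "id" attribute as an ID attribute; no DTD is needed.
for element in doc.getElementsByTagName("*"):
    if element.hasAttribute("id"):
        element.setIdAttribute("id")

after = doc.getElementById("intro")
```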
All over the file html5lib/treewalkers/_base.py there are places where isinstance is used and the variable is compared to six.text_type instead of six.string_types.
On Python 2, six.text_type is unicode, which means that if an attribute or value passed in is a str, there will be an assertion error.
on Python 2:
on Python 3:
So I think the values should be compared against six.string_types and then six.text_type be used to coerce them for output.
From Google Code, issue #221, reported by @mo. This is a more recent run post python3 merge below.
$ pyflakes html5lib/
html5lib/__init__.py:16: 'parse' imported but unused
html5lib/__init__.py:16: 'HTMLParser' imported but unused
html5lib/__init__.py:16: 'parseFragment' imported but unused
html5lib/__init__.py:17: 'getTreeBuilder' imported but unused
html5lib/__init__.py:18: 'getTreeWalker' imported but unused
html5lib/__init__.py:19: 'serialize' imported but unused
html5lib/inputstream.py:6: 'types' imported but unused
html5lib/inputstream.py:110: local variable 'data' is assigned to but never used
html5lib/inputstream.py:293: 'sys' imported but unused
html5lib/tokenizer.py:11: 'entitiesWindows1252' imported but unused
html5lib/tokenizer.py:12: 'asciiLowercase' imported but unused
html5lib/tokenizer.py:801: redefinition of unused 'scriptDataDoubleEscapedDashState' from line 778
html5lib/utils.py:3: 'version_info' imported but unused
html5lib/html5parser.py:4: 'sys' imported but unused
html5lib/html5parser.py:17: 'formattingElements' imported but unused
html5lib/html5parser.py:18: 'tableInsertModeElements' imported but unused
html5lib/html5parser.py:19: 'voidElements' imported but unused
html5lib/html5parser.py:20: redefinition of unused 'spaceCharacters' from line 16
html5lib/html5parser.py:91: local variable 'e' is assigned to but never used
html5lib/html5parser.py:408: local variable 'element' is assigned to but never used
html5lib/html5parser.py:1405: local variable 'name' is assigned to but never used
html5lib/html5parser.py:1591: local variable 'node' is assigned to but never used
html5lib/tests/__init__.py:12: 'support' imported but unused
html5lib/tests/test_serializer.py:18: 'html5parser' imported but unused
html5lib/tests/test_serializer.py:175: local variable 'test_name' is assigned to but never used
html5lib/tests/test_parser2.py:5: 'support' imported but unused
html5lib/tests/test_parser2.py:37: local variable 'doc' is assigned to but never used
html5lib/tests/support.py:15: 'html5lib' imported but unused
html5lib/tests/support.py:16: 'html5parser' imported but unused
html5lib/tests/support.py:46: 'lxml' imported but unused
html5lib/tests/test_sanitizer.py:3: 'os' imported but unused
html5lib/tests/test_sanitizer.py:4: 'sys' imported but unused
html5lib/tests/test_sanitizer.py:5: 'unittest' imported but unused
html5lib/tests/test_stream.py:3: 'support' imported but unused
html5lib/tests/tokenizertotree.py:10: 'test_parser' imported but unused
html5lib/tests/test_tokenizer.py:5: 'sys' imported but unused
html5lib/tests/test_tokenizer.py:7: 'io' imported but unused
html5lib/tests/test_tokenizer.py:183: local variable 'testName' is assigned to but never used
html5lib/tests/test_tokenizer.py:187: local variable 'skip' is assigned to but never used
html5lib/tests/test_parser.py:6: 'io' imported but unused
html5lib/tests/test_parser.py:14: 'html5lib' imported but unused
html5lib/tests/test_parser.py:15: 'treebuilders' imported but unused
html5lib/tests/test_treewalkers.py:16: 'LintError' imported but unused
html5lib/tests/test_treewalkers.py:16: 'LintFilter' imported but unused
html5lib/tests/test_treewalkers.py:20: redefinition of unused 'COMMENT' from line 109
html5lib/tests/test_treewalkers.py:87: 'ElementTree' imported but unused
html5lib/tests/test_encoding.py:3: 're' imported but unused
html5lib/tests/test_encoding.py:30: local variable 't' is assigned to but never used
html5lib/tests/test_encoding.py:47: local variable 'test_name' is assigned to but never used
html5lib/tests/test_encoding.py:55: 'chardet' imported but unused
html5lib/treebuilders/__init__.py:37: 'sys' imported but unused
html5lib/treebuilders/dom.py:5: 're' imported but unused
html5lib/treebuilders/dom.py:9: 'ihatexml' imported but unused
html5lib/treebuilders/etree.py:210: local variable 'finalText' is assigned to but never used
html5lib/treebuilders/etree.py:274: local variable 'finalText' is assigned to but never used
html5lib/trie/datrie.py:3: 'chain' imported but unused
html5lib/serializer/htmlserializer.py:27: redefinition of unused 'entities' from line 14
html5lib/serializer/htmlserializer.py:231: local variable 'attributes' is assigned to but never used
html5lib/treewalkers/dom.py:9: 'voidElements' imported but unused
html5lib/treewalkers/simpletree.py:50: undefined name '_node'
html5lib/treewalkers/lxmletree.py:8: 'sys' imported but unused
html5lib/treewalkers/lxmletree.py:13: 'voidElements' imported but unused
html5lib/treewalkers/_base.py:98: undefined name 'NodeImplementedError'
html5lib/treewalkers/_base.py:157: local variable 'endTag' is assigned to but never used
$ pyflakes html5lib/ | grep undefined
html5lib/treewalkers/simpletree.py:50: undefined name '_node'
html5lib/treewalkers/_base.py:98: undefined name 'NodeImplementedError'
html5lib/treewalkers/etree.py:129: undefined name 'moduleFactoryFactory'
http://code.google.com/p/html5lib/issues/detail?id=190
Reported by marko.koivusalo, Sep 12, 2011
What steps will reproduce the problem?
Unable to reproduce
Please provide any additional information below.
http://flexget.com/ticket/1099
relevant bit:
File "build/bdist.linux-x86_64/egg/html5lib/inputstream.py", line 285, in detectBOM
  self.rawStream.seek(encoding and seek or 0)
File "build/bdist.linux-x86_64/egg/html5lib/inputstream.py", line 50, in seek
  assert pos < self._bufferedBytes()
AssertionError
This is nowadays in HTML and a two-line fix.
InputStream is a mess. We should clean up so the legacy entry-point does all the magic, and all the internal code is clean.
We should have an abstract base class that defines the methods an InputStream must support, including concrete implementations when possible (charsUntil should be implementable based on lower-level APIs, for example). We should have two concrete classes inheriting from this: TextInputStream and BytesInputStream which take as an argument a file-like object (in Python 3, one that inherits from io.IOBase).
We should investigate whether we can get rid of our BufferedStream, relying upon io.BufferedReader (which exists from 2.6, which we now require).
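A rough sketch of the proposed shape, written for Python 3 only. The class names TextInputStream and BytesInputStream come from the proposal above; the base class name, the `char`/`unget` primitives, the pushback list and the fixed `encoding` argument are assumptions made for the example, and encoding detection is left out entirely.

```python
import io
from abc import ABC, abstractmethod


class InputStream(ABC):
    """Hypothetical interface; the method set is illustrative only."""

    @abstractmethod
    def char(self):
        """Return the next character, or None at end of input."""

    @abstractmethod
    def unget(self, char):
        """Push one character back so the next char() returns it."""

    def charsUntil(self, characters, opposite=False):
        """Concrete helper built only on char() and unget().

        Reads up to, but not including, the first character found in
        `characters`; with opposite=True, reads while characters match.
        """
        result = []
        while True:
            c = self.char()
            if c is None:
                break
            if (c in characters) != opposite:
                self.unget(c)
                break
            result.append(c)
        return "".join(result)


class TextInputStream(InputStream):
    """Wraps a text file-like object (for example io.StringIO)."""

    def __init__(self, source):
        self._source = source
        self._pushback = []

    def char(self):
        if self._pushback:
            return self._pushback.pop()
        return self._source.read(1) or None

    def unget(self, char):
        if char is not None:
            self._pushback.append(char)


class BytesInputStream(InputStream):
    """Wraps a binary file-like object; encoding sniffing is omitted."""

    def __init__(self, source, encoding="utf-8"):
        wrapped = io.TextIOWrapper(source, encoding=encoding)
        self._text = TextInputStream(wrapped)

    def char(self):
        return self._text.char()

    def unget(self, char):
        self._text.unget(char)
```

Reading one character at a time is slow; the point here is only that charsUntil needs nothing beyond the two abstract primitives, so each concrete class stays small.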
The entry-point will need to:
http://code.google.com/p/html5lib/issues/detail?id=154
Reported by @fantasai, Jun 1, 2010
What steps will reproduce the problem?
Try to reserialize a document containing xml:lang.
What is the expected output? What do you see instead?
For HTML output, I'd expect to get lang= in place of xml:lang=.
Instead I get {http://www.w3.org/XML/1998/namespace}lang, which is completely malformed.
This is effectively a six-based Py2/3 version of chardet. For testing purposes, it would let us stop being evil under Python 3 and stop using the Debian package.
Nothing has used this for years, and it was inevitably the source of issues historically (exhausting stack, etc.).
http://code.google.com/p/html5lib/issues/detail?id=93
Reported by zcorpan, Feb 27, 2009
This is similar to issue 92 except there's an old Opera bug where certain characters are treated as whitespace.
http://www.opera.com/support/kb/view/900/
The characters are U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+002F, U+00A0, U+1680, U+180E, U+180F, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+2028, U+2029, U+202F, U+205F and U+3000.
html5lib should probably quote attribute values that contain any of these.
Also, given that Gecko and WebKit start a new tag for
<foo bar=baz<quux>
you should probably also quote attribute values that contain "<".
Apr 27, 2009 excors
Also see http://software.hixie.ch/utilities/js/live-dom-viewer/saved/95
In addition to the values mentioned in the spec, the following seem to require quoting:
Safari 3.0: U+0000 to U+0020 inclusive
Konqueror 4.1: U+0000 to U+0020 inclusive
Safari 3.1: U+000B
Opera 9.6: U+000B
IE6, IE8: U+000B, U+0060
Firefox 2/3: (Not U+0008 despite what that test script says; those characters just
get stripped, it seems)
Apr 27, 2009 zcorpan
(U+000B is not a valid character in HTML5, though I don't know if the serializer
tries to keep the character data valid.)
Sep 4, 2009 Simetrical
The spec should be updated to ban these too, then, right? They're not interoperably
supported. I doubt anyone will cry about not being able to use sub-0x20 characters in
unquoted attribute values, anyway. :) U+60 is `, doesn't seem like a big issue
either. Should this be brought up on the mailing list?
Sep 5, 2009 geoffers
IMO yes, just someone needs to get around to it. :)
Sep 6, 2009 zcorpan
I did, and Hixie rejected it saying that it's an issue that will go away over time.
Feel free to bring it up again (citing that sites who implement the spec using a
serializer will expose themselves to security problems with legacy browsers).
Sep 7, 2009 Simetrical
I posted this a couple of days ago:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-September/022711.html
Oct 28, 2009 geoffers
Accepted, though we still need to decide how much to quote.
Oct 30, 2009 geoffers
I don't think we need to try and get the spec to quote anything else.
This should presumably be a legacy_quote option or some such.
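The quoting rule discussed in this thread could be sketched roughly as follows. This is only an illustration, not the serializer's code: needs_quoting and its legacy_quote parameter are hypothetical names (the latter borrowed from the suggestion above), and the baseline set of characters that always force quoting (whitespace, quotes, = and >) is an assumption about what the spec already covers.

```python
# Characters from the original report (old Opera whitespace bug), plus "<",
# which Gecko and WebKit treat as the start of a new tag.
REPORTED_CHARS = frozenset(
    "\u0009\u000a\u000b\u000c\u000d\u0020\u002f\u00a0\u1680\u180e\u180f"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000<"
)

# Extra characters from the Apr 27, 2009 comment:
# U+0000 to U+0020 inclusive, and U+0060.
COMMENT_CHARS = frozenset(chr(c) for c in range(0x21)) | {"\u0060"}

# ASSUMPTION: the characters that force quoting regardless of the option.
BASELINE_CHARS = frozenset(" \t\n\x0c\r\"'=>")


def needs_quoting(value, legacy_quote=True):
    """Return True if an attribute value should be serialized in quotes.

    Hypothetical helper; legacy_quote toggles the extra legacy-browser set.
    """
    if not value:
        return True  # an empty unquoted value cannot be written at all
    unsafe = BASELINE_CHARS
    if legacy_quote:
        unsafe = unsafe | REPORTED_CHARS | COMMENT_CHARS
    return any(c in unsafe for c in value)
```

Keeping the legacy characters in a separate set means the option only widens the check, so turning it off falls back to the baseline behaviour.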
We haven't supported or tested it in forever; it should go.
You released this previous version and externally hosted it. With recent changes (for very good reasons) externals like yours (without a hash and such) don't work anymore. Since there isn't really a change log between 0.90 and 0.95, could you please upload 0.90 so until we have time to upgrade we can keep using what works for us?
In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically the _readFromBuffer method, which ends with return "".join(rv). Because this is a unicode literal, any read after the first returns a chunk of unicode instead of a chunk of bytes.
An example of the problem caused:
from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream
req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)
Causing:
Traceback (most recent call last):
File "<stdin>", line 6, in <module>
File ".../html5lib/inputstream.py", line 411, in __init__
self.charEncoding = self.detectEncoding(parseMeta, chardet)
File ".../html5lib/inputstream.py", line 448, in detectEncoding
encoding = self.detectEncodingMeta()
File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
assert isinstance(buffer, bytes)
AssertionError
(That is, when HTMLBinaryInputStream is used with a file-like object (such as the result of urllib2.urlopen), it wraps it in a BufferedStream, which then fails (at line 535) with the assert isinstance(buffer, bytes).)
This can be fixed by using a byte literal in _readFromBuffer instead, i.e. return b"".join(rv). (There are at least three places in inputstream.py where string literals are used like this: at lines 117, 318 and 348.)
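The difference between the two literals can be shown in isolation. The sketch below does not touch html5lib; join_as_text and join_as_bytes are hypothetical stand-ins for the last line of _readFromBuffer before and after the proposed fix.

```python
from __future__ import unicode_literals


def join_as_text(chunks):
    """Mimic the reported bug: join byte chunks with a text literal."""
    try:
        # Under unicode_literals, "" is a text (unicode) string.
        return "".join(chunks)
    except TypeError:
        # Python 3 refuses to join bytes with a text separator at all.
        return None


def join_as_bytes(chunks):
    """Mimic the proposed fix: a byte literal keeps the result as bytes."""
    return b"".join(chunks)


chunks = [b"<html>", b"<body>"]

# The text literal never yields bytes: Python 2 silently returns unicode,
# Python 3 raises TypeError (reported here as None).
assert not isinstance(join_as_text(chunks), bytes)

# The byte literal yields bytes, which satisfies isinstance(buffer, bytes).
assert isinstance(join_as_bytes(chunks), bytes)
```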