
Original comment by jwall@google.com on 2 Jul 2012 at

Imbalanced closing tag causes Parse failure about go-html-transform HOT 6 CLOSED

lambdax commented on September 25, 2024

Imbalanced closing tag causes Parse failure

from go-html-transform.

Comments (6)

GoogleCodeExporter commented on September 25, 2024

Original comment by [email protected] on 2 Jul 2012 at 9:48

Changed state: Accepted

GoogleCodeExporter commented on September 25, 2024

I'm not sure the correct behavior here is to ignore the end tag. My reading of 
the html5 spec is that this is invalid html5. However I can see how using this 
library for scraping pages could be useful.

Perhaps I should use a flag to govern whether this should be strict html5 
parsing or best-effort forgiving parsing. I'm still thinking about it.
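A rough sketch of what such a flag could look like, assuming a naive stack-based tag walker rather than this library's actual parser. The names `checkTags` and `strict` are hypothetical, and the regexp tokenizer ignores comments, attribute quoting, and void elements such as `<br>`:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches simple start and end tags such as <div> or </td>.
// It is a deliberately naive tokenizer, for illustration only.
var tagRe = regexp.MustCompile(`<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>`)

// checkTags walks the tags in src while keeping a stack of open elements.
// With strict set, the first end tag that does not match the innermost
// open element is reported as an error. Otherwise the stray end tag is
// skipped and counted, and the walk continues.
func checkTags(src string, strict bool) (skipped int, err error) {
	var open []string
	for _, m := range tagRe.FindAllStringSubmatch(src, -1) {
		name := strings.ToLower(m[2])
		if m[1] == "" { // start tag: push it
			open = append(open, name)
			continue
		}
		top := ""
		if len(open) > 0 {
			top = open[len(open)-1]
		}
		if top == name { // balanced end tag: pop it
			open = open[:len(open)-1]
			continue
		}
		if strict {
			return skipped, fmt.Errorf("end tag does not match start tag: start:[%s] end:[%s]", top, name)
		}
		skipped++ // forgiving mode: ignore the stray end tag
	}
	return skipped, nil
}

func main() {
	const page = "<table><tr><td>cell</div></td></tr></table>"
	_, err := checkTags(page, true)
	fmt.Println("strict:", err)
	n, err := checkTags(page, false)
	fmt.Println("forgiving:", n, err)
}
```

Ignoring the stray tag is only one possible forgiving policy; it is not what the HTML5 tree-construction rules prescribe in every case.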

Original comment by JeremyMZHS on 18 Jul 2012 at 8:20


GoogleCodeExporter commented on September 25, 2024

Yes, it's invalid HTML.  However, the HTML5 spec defines in painful detail what 
is supposed to happen in that situation.  See

http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html#adoptionAgency

The HTML5 spec standardizes the handling of bad HTML, so that user agents all 
do it the same way.  It is a huge pain to implement properly, but parsers that 
don't follow the spec should not be called HTML5 parsers.
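To give a flavor of that recovery logic, here is a simplified sketch loosely modeled on the spec's "any other end tag" steps for the in-body insertion mode. It is not the adoption agency algorithm, which applies to formatting elements and is far more involved. The `closeTag` helper and the tiny `special` set are assumptions for illustration; the spec's "special" category is much longer, and the "generate implied end tags" step is omitted:

```go
package main

import "fmt"

// special is a small, incomplete stand-in for the spec's "special" category.
var special = map[string]bool{
	"html": true, "body": true, "div": true, "p": true,
	"table": true, "tr": true, "td": true,
}

// closeTag applies an end tag to the stack of open elements, walking from
// the innermost element outward. It returns the resulting stack and whether
// the token counted as a parse error.
func closeTag(open []string, name string) (rest []string, parseError bool) {
	for i := len(open) - 1; i >= 0; i-- {
		if open[i] == name {
			// Pop everything up to and including the match. It is a parse
			// error if the match was not the innermost open element.
			return open[:i], i != len(open)-1
		}
		if special[open[i]] {
			// A special element blocks the search: ignore the end tag.
			return open, true
		}
	}
	return open, true // nothing matched: ignore the end tag
}

func main() {
	stack := []string{"html", "body", "div", "span", "abbr"}
	rest, bad := closeTag(stack, "span")
	fmt.Println(rest, bad) // closes abbr and span together

	stack = []string{"html", "body", "div", "abbr"}
	rest, bad = closeTag(stack, "span")
	fmt.Println(rest, bad) // stray </span> is ignored at the div boundary
}
```

Either way the parser keeps going and produces a tree, which is the behavior the spec requires of a conforming HTML5 parser.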

Original comment by [email protected] on 7 Jan 2013 at 7:24


GoogleCodeExporter commented on September 25, 2024

As a test, I wrote a simple Go program to parse web pages and report errors. 
Here are a few major sites. The quality of HTML out there is amazingly low, as 
anyone involved in web crawling knows.  An HTML5 parser has to handle all the 
error cases, or it's just a toy. Some examples:

C:\projects\go>go run src\examples\htmlreaderdemo\htmlreaderdemo.go
HTML reader demo.

Enter URL: http://code.google.com
Error parsing 'http://code.google.com': Parse error: Strange characters in end 
tag: [:] switching to BogusCommentState
(Note: This fails because "code.google.com" has XML namespace tags in an HTML 
document. The document DOCTYPE probably should be XHTML, but it isn't. 
<g:plusone> is not valid HTML.  There's also trouble with the character set 
header disagreeing with the actual document content. Fail.)

Enter URL: http://www.go.com
Error parsing 'http://www.go.com': Parse error: NotSameTag: End Tag does not 
match Start Tag start:[td] end:[div]
(Note: Disney's home page.  W3C validator finds 79 errors.)

Enter URL: http://www.gm.com
Error parsing 'http://www.gm.com': Parse error: NotSameTag: End Tag does not 
match Start Tag start:[head] end:[div]

Enter URL: http://www.ford.com
Error parsing 'http://www.ford.com': Parse error: NotSameTag: End Tag does not 
match Start Tag start:[head] end:[body]

Enter URL: http://www.intel.com
OK.

Enter URL: http://www.iana.org
OK.

Enter URL: http://www.nyt.com
OK.

Original comment by [email protected] on 7 Jan 2013 at 8:14


GoogleCodeExporter commented on September 25, 2024

I'll take a stab at these later this week.

Original comment by JeremyMZHS on 16 Jan 2013 at 9:43


GoogleCodeExporter commented on September 25, 2024

I finally got around to porting go-html-transform over to the exp/html parser 
in go tip. h5 was always a stop-gap for me while they got that parser 
shipshape. This bug should no longer apply.

Original comment by [email protected] on 28 Jan 2013 at 5:10

Changed state: Fixed
