
Daisydiff fails to process certain invalid HTML files about daisydiff HOT 9 OPEN

GoogleCodeExporter commented on August 19, 2024

Daisydiff fails to process certain invalid HTML files

from daisydiff.

Comments (9)

GoogleCodeExporter commented on August 19, 2024

As can be seen from the source code and the stack trace you posted, daisydiff 
uses Neko to convert HTML to valid XML, and then parses the result using SAX.

So if an exception is present it means that Neko failed and could NOT convert 
the result. Why do you say that NekoHTML can process it?

It seems to me that this might be a bug in Neko and not DaisyDiff.

Original comment by [email protected] on 24 Aug 2012 at 10:31


GoogleCodeExporter commented on August 19, 2024

That was my thought as well, which is why I took nekohtml.jar and 
xercesImpl-2.8.0.jar from my daisydiff version and ran the above HTML through 
nekohtml (using the sample Java program from 
http://nekohtml.sourceforge.net/usage.html). Neko came through the example 
without problems and printed the correct DOM tree.

I've zipped up the nekohtml test and uploaded it: 
https://dl.dropbox.com/u/1898992/nekoTest.zip

I can do more tests but I'd need some guidance as XML processing on Java isn't 
my forte.

Original comment by [email protected] on 24 Aug 2012 at 10:39


GoogleCodeExporter commented on August 19, 2024

I modified the neko test to also include a call to get attributes (uploaded to 
same URL) and it still doesn't crash.

Original comment by [email protected] on 24 Aug 2012 at 11:08


GoogleCodeExporter commented on August 19, 2024

So here's my theory (bear in mind I am no expert on NekoHTML or daisydiff):

NekoHTML parses the HTML into a DOM which is stored as Java objects. There is 
no checking if the attribute name is a valid XML name as the document currently 
isn't in XML form anyway (and I guess everything is valid in tagsoup HTML :)). 
Then daisydiff copies this attribute into XML along with all the others and the 
XML implementation rejects the final document as invalid.
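
The rejection step in this theory can be reproduced without NekoHTML or daisydiff. This is only a minimal sketch: it uses the JDK's built-in DOM implementation rather than daisydiff's actual SAX pipeline, and the attribute name `1col` is a made-up example, not the one from the failing file. The class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class InvalidAttributeDemo {

    /** Returns true if the JDK DOM implementation accepts the attribute name. */
    static boolean domAccepts(String attributeName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser configuration", e);
        }
        Element element = doc.createElement("p");
        try {
            element.setAttribute(attributeName, "value");
            return true;
        } catch (DOMException e) {
            // The DOM API reports names that are not legal XML names
            // with a DOMException (INVALID_CHARACTER_ERR).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("class -> " + domAccepts("class"));
        // Hypothetical tag-soup attribute name: fine for an HTML parser,
        // but it starts with a digit, so it is not a legal XML name.
        System.out.println("1col  -> " + domAccepts("1col"));
    }
}
```

If the theory holds, a tolerant HTML parser can hand over a name like this without complaint, and the failure only shows up once something insists on XML naming rules.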

If this is indeed the case, daisydiff should check whether each attribute name 
is valid in XML and either drop the invalid ones or perhaps prefix them with an 
underscore or something similar to make them valid XML.
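
The check-and-prefix idea could look roughly like the sketch below. It is not daisydiff code: `AttributeNames`, `isValid`, and `sanitize` are hypothetical names, and the validity test is a deliberately conservative ASCII-only subset of the XML `Name` rule, so some non-ASCII names that XML would allow are rewritten as well.

```java
final class AttributeNames {

    // Conservative ASCII subset of the XML NameStartChar rule.
    private static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    // Conservative ASCII subset of the XML NameChar rule.
    // A colon is accepted only after the first character (e.g. "xml:lang").
    private static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == ':';
    }

    /** Returns true if the name passes the conservative check above. */
    static boolean isValid(String name) {
        if (name == null || name.isEmpty() || !isNameStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!isNameChar(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Leaves valid names alone; otherwise prefixes '_' and replaces bad characters with '_'. */
    static String sanitize(String name) {
        if (isValid(name)) {
            return name;
        }
        StringBuilder result = new StringBuilder("_");
        if (name != null) {
            for (int i = 0; i < name.length(); i++) {
                char c = name.charAt(i);
                result.append(isNameChar(c) ? c : '_');
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("class"));
        System.out.println(sanitize("1col"));
    }
}
```

Renaming keeps the attribute visible in the diff, whereas dropping it hides the difference entirely. One caveat of this sketch is that two different invalid names can map to the same sanitized name, so a real fix would have to handle collisions.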

Original comment by [email protected] on 25 Aug 2012 at 9:45


GoogleCodeExporter commented on August 19, 2024

After more investigation, I found the solution in NekoHTML: it offers filters 
that can alter the processing of HTML. One in particular, called Purifier, 
ensures XML well-formedness. Using it solves the issue, so I'm including a 
patch.

What it does is along the same lines as my proposed solution: it renames the 
invalid attribute name to start with valid characters.

Original comment by [email protected] on 28 Aug 2012 at 10:11

Attachments:

purify_html_before_parsing.diff


GoogleCodeExporter commented on August 19, 2024

Great find!

However, what are the side effects of this? Does this filter break anything else?
Have you seen the unit tests contained in DaisyDiff? Do they still pass?

Original comment by [email protected] on 28 Aug 2012 at 11:57


GoogleCodeExporter commented on August 19, 2024

This fix should not be applied :(

I've found out that this Purifier has some problems 
(http://sourceforge.net/tracker/?func=detail&atid=952178&aid=3497694&group_id=195122) 
and this is what has been causing the other issue I have reported.

Sorry if I have wasted anybody's time.

Original comment by [email protected] on 31 Aug 2012 at 2:02


GoogleCodeExporter commented on August 19, 2024

It is OK. That is the main problem with DaisyDiff. You fix something in one 
place, and something else breaks :-0

Original comment by [email protected] on 31 Aug 2012 at 3:02


GoogleCodeExporter commented on August 19, 2024

Well, it isn't daisydiff's problem this time. Hmm, seeing that the fix posted 
for the NekoHTML bug simply subclasses the Purifier class, maybe we could do 
that in daisydiff? I'm just wondering why that person's fix wasn't accepted 
into NekoHTML itself, seeing as it was proposed several months ago.

Original comment by [email protected] on 31 Aug 2012 at 4:08
