Comments (9)

GoogleCodeExporter commented on August 19, 2024
As can be seen from the source code and the stack trace you posted, daisydiff
uses Neko to convert HTML to valid XML, and then parses the result using SAX.

So if an exception is thrown, it means that Neko failed and could NOT convert
the input. Why do you say that NekoHTML can process it?

It seems to me that this might be a bug in Neko and not DaisyDiff.

Original comment by [email protected] on 24 Aug 2012 at 10:31

from daisydiff.

GoogleCodeExporter commented on August 19, 2024
That was my thought as well, which is why I took nekohtml.jar and
xercesImpl-2.8.0.jar from my daisydiff version and ran the above HTML through
nekohtml (using the sample Java program from
http://nekohtml.sourceforge.net/usage.html). Neko got through the example
without problems and printed the correct DOM tree.

I've zipped up the nekohtml test and uploaded it: 
https://dl.dropbox.com/u/1898992/nekoTest.zip

I can do more tests but I'd need some guidance as XML processing on Java isn't 
my forte.

Original comment by [email protected] on 24 Aug 2012 at 10:39

GoogleCodeExporter commented on August 19, 2024
I modified the neko test to also include a call to get attributes (uploaded to 
same URL) and it still doesn't crash.

Original comment by [email protected] on 24 Aug 2012 at 11:08

GoogleCodeExporter commented on August 19, 2024
So here's my theory (bear in mind I am no expert on NekoHTML or daisydiff):

NekoHTML parses the HTML into a DOM which is stored as Java objects. Nothing
checks whether an attribute name is a valid XML name, as the document isn't in
XML form at that point anyway (and I guess everything is valid in tag-soup
HTML :)). Then daisydiff copies this attribute into XML along with all the
others, and the XML implementation rejects the final document as invalid.
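
To illustrate the last step of that theory, here is a minimal sketch using only
the JDK's built-in SAX parser. It is not daisydiff's actual code path, and the
attribute name `1abc` is a made-up example (the real offending name is whatever
appears in the reported HTML). It only shows that an XML parser refuses an
attribute whose name does not start with a valid XML name character.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of daisydiff or NekoHTML.
class InvalidAttributeDemo {

    /** Returns true if the JDK's SAX parser accepts the markup as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser reports a fatal error for markup that is not well-formed.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p class=\"x\"/>")); // true
        System.out.println(isWellFormed("<p 1abc=\"x\"/>"));  // false: name starts with a digit
    }
}
```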

If this is indeed the case, daisydiff should check whether each attribute name
is valid in XML and either drop the invalid ones or perhaps prefix them with an
underscore or something similar to make them valid XML.
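
A rough sketch of what such a check could look like, assuming a standalone
helper (the class and method names are hypothetical, not existing daisydiff
code). It only accepts the ASCII subset of the XML Name production, so legal
non-ASCII names are treated as invalid and get rewritten too. Invalid
characters are replaced with an underscore, and an underscore is prefixed when
the first character cannot start a name.

```java
// Hypothetical helper; not part of daisydiff or NekoHTML.
class AttributeNameSanitizer {

    /** ASCII subset of XML NameStartChar: letters, '_' and ':'. */
    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':';
    }

    /** ASCII subset of XML NameChar: NameStartChar plus digits, '-' and '.'. */
    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    /** Returns true if the name is non-empty and fits the ASCII subset above. */
    static boolean isValidXmlName(String name) {
        if (name.isEmpty() || !isNameStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!isNameChar(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Rewrites a name so that isValidXmlName(result) holds. */
    static String sanitize(String name) {
        if (name.isEmpty()) {
            return "_";
        }
        StringBuilder sb = new StringBuilder(name.length() + 1);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            sb.append(isNameChar(c) ? c : '_');
        }
        if (!isNameStart(sb.charAt(0))) {
            sb.insert(0, '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("class")); // class
        System.out.println(sanitize("1abc"));  // _1abc
        System.out.println(sanitize("a=b"));   // a_b
    }
}
```

Note that renaming can make two different attributes collide (for example `a=b`
and `a_b`), so dropping the invalid attribute may be the safer choice in a
diffing tool.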

Original comment by [email protected] on 25 Aug 2012 at 9:45

GoogleCodeExporter commented on August 19, 2024
After more investigation, I found the solution in NekoHTML: it offers filters
that can alter the processing of HTML. One in particular, called Purifier,
ensures XML well-formedness. Using this solves the issue; I'm including the
patch.

What it does is along the same lines as my proposed solution: it renames the 
invalid attribute name to start with valid characters.

Original comment by [email protected] on 28 Aug 2012 at 10:11

GoogleCodeExporter commented on August 19, 2024
Great find!

However, what are the side effects of this? Does this filter break anything else?
Have you seen the unit tests contained in DaisyDiff? Do they still pass?

Original comment by [email protected] on 28 Aug 2012 at 11:57

GoogleCodeExporter commented on August 19, 2024
This fix should not be applied :(

I've found out that this Purifier has some problems
(http://sourceforge.net/tracker/?func=detail&atid=952178&aid=3497694&group_id=195122)
and this is what has been causing the other issue I have reported.

Sorry if I have wasted anybody's time.

Original comment by [email protected] on 31 Aug 2012 at 2:02

GoogleCodeExporter commented on August 19, 2024
It is OK. That is the main problem with DaisyDiff. You fix something in one 
place, and something else breaks :-0

Original comment by [email protected] on 31 Aug 2012 at 3:02

GoogleCodeExporter commented on August 19, 2024
Well, it isn't daisydiff's problem this time. Hmm, seeing that the fix posted
for the NekoHTML bug simply subclasses the Purifier class, maybe we could do
that in daisydiff? I'm just wondering why that person's fix wasn't accepted
into NekoHTML itself, seeing as it was proposed several months ago.

Original comment by [email protected] on 31 Aug 2012 at 4:08
