Comments (9)
As the source code and the stack trace you posted show, daisydiff uses Neko
to convert HTML to valid XML, and then parses the result using SAX.
So if an exception is thrown, it means that Neko failed and could NOT convert
the input. Why do you say that NekoHTML can process it?
It seems to me that this might be a bug in Neko and not in DaisyDiff.
Original comment by [email protected]
on 24 Aug 2012 at 10:31
from daisydiff.
That was my thought as well, which is why I took nekohtml.jar and
xercesImpl-2.8.0.jar from my daisydiff version and ran the above HTML through
nekohtml (using the sample Java program from
http://nekohtml.sourceforge.net/usage.html). Neko got through the example
without problems and printed the correct DOM tree.
I've zipped up the nekohtml test and uploaded it:
https://dl.dropbox.com/u/1898992/nekoTest.zip
I can do more tests, but I'd need some guidance, as XML processing in Java
isn't my forte.
Original comment by [email protected]
on 24 Aug 2012 at 10:39
I modified the neko test to also include a call to get the attributes (uploaded
to the same URL) and it still doesn't crash.
Original comment by [email protected]
on 24 Aug 2012 at 11:08
So here's my theory (bear in mind I am no expert on NekoHTML or daisydiff):
NekoHTML parses the HTML into a DOM, which is stored as Java objects. There is
no check that the attribute name is a valid XML name, since the document isn't
in XML form at that point anyway (and I guess everything is valid in tag-soup
HTML :)). Then daisydiff copies this attribute into XML along with all the
others, and the XML implementation rejects the final document as invalid.
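To illustrate the last step of this theory, here is a minimal sketch that uses only the JDK's built-in SAX parser, not daisydiff's actual pipeline. The class name `AttrNameRejection` and the attribute name `1col` are made up for the example; the thread does not say which attribute name triggered the original exception.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class AttrNameRejection {

    /** Returns true if the JDK's SAX parser accepts the markup as well-formed XML. */
    public static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // The parser reported a well-formedness error.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what this sketch is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An ordinary attribute name parses fine.
        System.out.println(isWellFormed("<p class=\"a\">text</p>"));
        // Hypothetical tag-soup attribute: an XML name must not start with a digit.
        System.out.println(isWellFormed("<p 1col=\"a\">text</p>"));
    }
}
```

An HTML parser that tolerates tag soup can hold such a name in memory without complaint; the failure only shows up once the name has to pass through an XML parser.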
If this is indeed the case, daisydiff should check whether each attribute name
is valid in XML, and either drop the attributes whose names aren't or perhaps
prefix them with an underscore or something to make them valid XML.
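A rough sketch of the underscore idea, assuming a simplified ASCII-only notion of a valid XML name (the real XML 1.0 Name production also allows many non-ASCII characters and the colon). The class `AttrNameSanitizer` and its methods are hypothetical and not part of daisydiff or NekoHTML.

```java
import java.util.regex.Pattern;

public class AttrNameSanitizer {

    // Assumption: ASCII-only approximation of an XML name.
    // Starts with a letter or underscore, then letters, digits, '_', '.', '-'.
    private static final Pattern XML_NAME =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_.\\-]*");

    public static boolean isValidName(String name) {
        return XML_NAME.matcher(name).matches();
    }

    /**
     * Returns the name unchanged if it is already valid. Otherwise replaces
     * every disallowed character with '_' and, if the result still does not
     * start with a valid character, prefixes it with '_'.
     */
    public static String sanitize(String name) {
        if (isValidName(name)) {
            return name;
        }
        String cleaned = name.replaceAll("[^A-Za-z0-9_.\\-]", "_");
        return isValidName(cleaned) ? cleaned : "_" + cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("class")); // already valid, unchanged
        System.out.println(sanitize("1col"));  // leading digit gets a prefix
        System.out.println(sanitize("a\"b"));  // stray quote is replaced
    }
}
```

One caveat with renaming rather than dropping: two different invalid names can map to the same sanitized name, so a real implementation would have to decide what to do on a collision.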
Original comment by [email protected]
on 25 Aug 2012 at 9:45
After more investigation, I found a solution in NekoHTML: it offers filters
that can alter the processing of HTML. One in particular, called Purifier,
ensures XML well-formedness. Using it solves the issue; I'm including the
patch.
What it does is along the same lines as my proposed solution: it renames the
invalid attribute so that its name starts with valid characters.
Original comment by [email protected]
on 28 Aug 2012 at 10:11
Great find!
However, what are the side effects of this? Does this filter break anything else?
Have you seen the unit tests contained in DaisyDiff? Do they still pass?
Original comment by [email protected]
on 28 Aug 2012 at 11:57
This fix should not be applied :(
I've found out that this Purifier has some problems
(http://sourceforge.net/tracker/?func=detail&atid=952178&aid=3497694&group_id=195122)
and this is what has been causing the other issue I have reported.
Sorry if I have wasted anybody's time.
Original comment by [email protected]
on 31 Aug 2012 at 2:02
It is OK. That is the main problem with DaisyDiff. You fix something in one
place, and something else breaks :-0
Original comment by [email protected]
on 31 Aug 2012 at 3:02
Well, it isn't daisydiff's problem this time. Hmm, seeing that the fix posted
for the NekoHTML bug simply subclasses Purifier, maybe we could do that in
daisydiff? I'm just wondering why that person's fix wasn't accepted into
NekoHTML itself, seeing as it was proposed several months ago.
Original comment by [email protected]
on 31 Aug 2012 at 4:08