GithubHelp home page GithubHelp logo

Comments (23)

spassarop avatar spassarop commented on July 24, 2024 1

You won't believe it. When you exceed the three significant figures, the value in CSS as a lexical unit is interpreted with an exponential value. What I mean is that 0.001 is parsed to 0.001, but 0.0001 is parsed to 1.0e-5. That value is later used as string for the regex so it fails.

This lexical unit stuff is how Batik CSS handles CSS parsing and its way beyond AntiSamy limits. What AntiSamy does is to concatenate the lexical unit properties from Batik CSS API, this is the case for margin, a float value:

lu.getFloatValue() + lu.getDimensionUnitText()

This is "1.0E-5" + "pt" and after that comes a lowercase which is the final value compared against the regex. Kinda tricky...
My proposal is to include exponential format in the regex, which is totally valid for browsers already. However, the policy regex is not entirely loyal to what is compared later because if I want to tailor the margin regex to be [+-]?\d*.?\d* and not accept any other formats, then I will have a mismatch as it's happening now. On the other hand, we cannot know deep inside the CssValidator which was the string format before transforming itself into a lexical unit with a float value, simple decimal or exponential?

The other problem is that even if the regex is changed anticipating that shady/hidden/implicit conversion and the margin keeps its value, it will have that new verified string, so instead of the original 0.00001pt it will be 1.0E-5pt. That's more inconsistent behavior. Probably double or BigDecimal would be better to handle this, but they would probably cause a conflict between exponential vs non-exponential anyway and that is up to Batik CSS.

So, there lies the reason and extended background of the issue. @davewichers what do you think about all that?

As a side note: this also happens on .NET, just can't fight with the underlying parser :'(

from antisamy.

spassarop avatar spassarop commented on July 24, 2024 1

I agree with Kevin on the wontfix and on the W3C compliance argument. Just to avoid misunderstandings, @LiuXing-R was just helping investigate, @bijarniy was the one who created the issue.

I found the JIRA site where they add issues and signed up but there are so many fields to fill that I'll probably be kicked in the head by one of the team if I do it wrong. It would be better to notify somewhere to discuss and make them follow the right process. Apparently it must be done through the mailing list, I subscribed [email protected] to later see how you're supposed to report a bug. Of course anyone else can do it too, in my case if I end doing it I'll comment here. It will take time to elaborate a convincing and complete bug report.

from antisamy.

davewichers avatar davewichers commented on July 24, 2024

@spassarop - Can you look into this one too?

from antisamy.

davewichers avatar davewichers commented on July 24, 2024

@spassarop - Whatever you want to do here is fine with me. I trust you :-)

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

In that case, the only thing I can offer is to label this issue as bug and update the example policies so margin can handle exponentials, but still having the format change inconsistency on the output that's out of our hands. @bijarniy, of course you can do that policy change in your project yourself, I just offer having it by default for future releases.

from antisamy.

davewichers avatar davewichers commented on July 24, 2024

Sounds fine with me. Do you want to implement/push those changes?

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

I'm on it. Doing the test cases I found out that there are also problems on Batik CSS with exponential dimensions when they are the input. When using the margin 1.0E-4pt, the value is 1.0 and the dimension is E-4pt instead of just pt. Maybe I can add the test cases explaining the current behavior so the test code is self-explainatory about the issue, expecting them to fail or something like that.

from antisamy.

bijarniy avatar bijarniy commented on July 24, 2024

In that case, the only thing I can offer is to label this issue as bug and update the example policies so margin can handle exponentials, but still having the format change inconsistency on the output that's out of our hands. @bijarniy, of course you can do that policy change in your project yourself, I just offer having it by default for future releases.

Thanks @davewichers for sharing RCA of this issue. As per your latest comment, are you going to provide this by default in the next release? If so, could you please share probable timeline for the same and based on that we would be asking our customer to change the policy to allow exponentials
cc : @spassarop

from antisamy.

bijarniy avatar bijarniy commented on July 24, 2024

@davewichers / @spassarop I did close this issue by mistake so reopening it.

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

@bijarniy yes this will be on the next release if it gets merged before that.

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

@davewichers even though a partial solution was provided including exponentials on regex there is no actual solution as we depend on Batick CSS. Should we leave this open potentially forever? bug is the only certain label I decided to add.

from antisamy.

kwwall avatar kwwall commented on July 24, 2024

@davewichers and @spassarop - This really is sounding more like a bug in Batik CSS or possibly a misalignment in the understanding with your users than it is a bug per se in AntiSamy. I would recommend closing it with a 'wontfix' label. (You do have one of those, don't you? I haven't checked.)

from antisamy.

LiuXing-R avatar LiuXing-R commented on July 24, 2024

I verified some CSS attributes (width and length) and found that there was a problem of 10000000 -> 1.0e7, so I looked at the source code because Batik CSS used Float.parseFloat(String val).
https://github.com/apache/xmlgraphics-batik/blob/f79ce602fa6244d2a7ff45b57b6a0347e883bc37/batik-css/src/main/java/org/apache/batik/css/parser/Parser.java#L981-L983
->
https://github.com/apache/xmlgraphics-batik/blob/f79ce602fa6244d2a7ff45b57b6a0347e883bc37/batik-css/src/main/java/org/apache/batik/css/parser/Parser.java#L1551-L1560

Do we need to push the Batik CSS community to address this issue?

from antisamy.

davewichers avatar davewichers commented on July 24, 2024

@spassarop - Would you be willing to open up a discussion with Batik to see what they say?

from antisamy.

kwwall avatar kwwall commented on July 24, 2024

@star-R I think the only logical approach at this point is to ask that Batik CSS community to fix it. Write up an example with a test case using Batik CSS directly and show them expected outcome to illustrate exactly what you are asking for / expecting. Because I think the AntiSamy team has already said they are not going to replace Batik CSS or try to write their own CSS parser. Just my $.02, but that was my interpretation based on earlier comments from the AntiSamy contributors. Batik CSS is still being actively supported so if you can show them a place that it not in compliance with W3C, I think you'll have a fair shot of convincing them. But since you are the one who wants this (I doubt anyone else uses such extreme margins), I think you should be the one to file an issue with Batik CSS.

from antisamy.

davewichers avatar davewichers commented on July 24, 2024

@spassarop - Any dialog going on with the Batik team? If not, can you push/nudge again to see if you can get them to work on this?

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

Damn, I forgot to do this. I started to investigate how to report but somehow I didn't do it. Guess I'll do it in the following days, I'm setting a reminder.

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

Created an issue just now: https://issues.apache.org/jira/browse/BATIK-1320
Hope it is the right path, could not find other way of telling them. Let's wait now.

from antisamy.

carlosame avatar carlosame commented on July 24, 2024

I found this issue while browsing Batik's JIRA, although I'm not related to the Batik project (nor the ASF) in any way. I do not think that there is anything that the Batik people (nor any other SAC implementor) could do. The SAC API provides the getFloatValue() that, as the name implies, provides a float. And you are serializing the result with the following:

lu.getFloatValue() + lu.getDimensionUnitText()

That is, the 1.0e-5 appears because you are serializing the value that way.

The SAC API does not provide any way to determine the exact input text that was used when parsing. Perhaps Antisamy could provide a different kind of check so numbers are checked as numbers and not as strings. Checking numbers with string methods is always going to be error prone.

from antisamy.

carlosame avatar carlosame commented on July 24, 2024

Incidentally, I see that you are checking against old CSS 2.1 units.

<regexp name="length" value="((-|\+)?0|(-|\+)?([0-9]+(\.[0-9]*)?)(em|ex|px|in|cm|mm|pt|pc))"/>

But today's websites use a wider range of values and units, see CSS Values and Units Module Level 4. So if you keep with the regexp approach (for example serializing the lu.getFloatValue() with a DecimalFormat that avoids exponents), I'd suggest adding the newer rem, rex, cap, etc.

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

@carlosame: About the units, the regex was expanded in #104 based on this page. I don't know why some of the units were left out really. We could expand the list even more with absolute and relative units.

Regarding the float stuff the matter is complex. AntiSamy does "generic" validations by comparing to values defined in a policy XML. It can be a strict comparison if it's by value or a regexp-based comparison. I agree it's best to use abstract representations whenever it's possible instead of just using strings in general. That's why we use parsers and such utilities and work with objects. However, relying on parsers for this to be done eventually ends up in having a final granular value that we cannot go "deeper" with.

In general, for HTML and CSS we use strings as normally the values are just text or number+text. We also stop going deeper when evaluating attributes values or tags' text contents. With CSS we stop on those lexical units for CSS property values. A property value can be really complex as you know, with numbers, units, function calls, etc. That's why we also need the parser to be specific with what it provides so we can validate the values somehow. I'd love we had something more versatile than just getFloatValue() which covers the cases we need, but float is the most complete type we can use now.

I don't really know if an improvement from Batik's side can be made, that's what I want them to tell us. I'm not a Java guy so that's why I seek specific knowledge for this. On AntiSamy .NET I had the same issue and reached the parser creator who fixed it, so now it has the behavior I needed on most cases (there was a details with exponents which I cannot remember now). Don't have much to say.

from antisamy.

carlosame avatar carlosame commented on July 24, 2024

I don't know why some of the units were left out really. We could expand the list even more with absolute and relative units.

I haven't seen that updated regex, I'd just add vmin and vmax as the other units are only supported by only one or two user agents AFAIR.

A property value can be really complex as you know, with numbers, units, function calls, etc. That's why we also need the parser to be specific with what it provides so we can validate the values somehow. I'd love we had something more versatile than just getFloatValue() which covers the cases we need, but float is the most complete type we can use now.

The problem are the following lines:

return lu.getFloatValue() + lu.getDimensionUnitText();

and

return String.valueOf(lu.getFloatValue());

If you serialize those float values with a properly configured DecimalFormat instead of the current valueOf(), it does not matter whether the input was 0.0001pt or 1e-5pt (which is also a valid CSS value). Your regular expression would work.

You do not need to change anything at Batik, although I'd migrate to a more modern SAC parser like my Css4j 1.3.1 (in post-EOL state) or cssparser (less complete but still maintained, and the maintainer is generally very helpful).

I don't really know if an improvement from Batik's side can be made, that's what I want them to tell us.

The Batik project is not very active currently so you could have to wait for a long time, and the fix that I explained is easy to apply.

from antisamy.

spassarop avatar spassarop commented on July 24, 2024

Related to #293

from antisamy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.