Comments (9)
This use case is not possible with the current architecture. I see the value. I think it’s a candidate for 1.5.9.
from antisamy.
I'm investigating this. Using a DOM parser, with these settings, I get only "firstname,lastname" in the output of .getCleanHTML(). Using a SAX parser, I get "firstname,lastname<name></name>". I'm not sure if this inconsistency in output is a different bug. I'll research that too.
from antisamy.
I have a similar issue. When I scan the text "hello <hii world", I get the output "hello", and the scan does not report any error in CleanResults.
But when I scan the text "hello <hi>", the output is "hello" with the error "[The hi tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.]"
Is there any workaround to get the output "hello <hii world" when scanning "hello <hii world"?
from antisamy.
I have a similar issue using the AntiSamy library, version 1.5.8. I tried writing the following unit test case:
Input:
<html><head><style>.uegzbq{font-size:22px;}@media not all and (pointer:coarse){.8bsfb:hover{background-color:#056b27;}}.scem3j{font-size:25px;}</style></head><body><div class="uegzbq">First Line</div><br><div class="scem3j">Second Line</div></body></html>
Expected from org.owasp.validator.html.CleanResults.getCleanHTML:
<html><head><style>/*<![CDATA[*/*.uegzbq {
font-size: 22.0px;
}
@media not all and (pointer:coarse) {
.8bsfb: hover {
background-color: #056b27;
}
}
.scem3j{
font-size:25px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>
Actual:
<html><head><style>/*<![CDATA[*/*.uegzbq {
font-size: 22.0px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>
I can see that in the org.owasp.validator.html.scan.AntiSamyDOMScanner class I had the expected string prior to serialization. After org.apache.xml.serialize.HTMLSerializer serialized the DocumentFragment, everything after the @ symbol in the style tag had been stripped.
I tried upgrading to 1.6.3 but still have the same issue.
from antisamy.
Also stumbled across this one - are there any plans to work on this in the near future?
from antisamy.
@spassarop - Is this even possible/reasonable? Or way too hard? I suspect 'too hard'.
from antisamy.
Some points to cover and clarify to understand the current behavior:
- AntiSamy parses the input with cyberneko as HTML, so it expects valid HTML to build the internal representation that gets validated. Raw inputs like hello <hii world and firstname,lastname<[email protected]> (provided by @goshantmeher and @nikowitt respectively) are not valid HTML, so the library returns a best-effort parse result, auto-closing tags and so on. Raw text that goes through an HTML parser but is not intended to be HTML should be HTML-encoded first, or at least those specific fragments should be. That is just the nature of the parser, as with any other kind of parser.
- Using <directive name="onUnknownTag" value="encode"/> does not work in this particular case (the one with [email protected]) because, in the validation logic, a prior step checks whether the tag is empty AND whether it is in the allowed empty tags list. As the tag is not defined in the policy, it is not allowed as empty and is therefore removed before even getting to the encoding step. Just to test, I added "name" as an allowed empty tag and re-tested, resulting in firstname,lastname<name/>. This takes us to the following point.
- The cyberneko parser naturally does not take the whole [email protected] as a valid tag name; it internally truncates it when encountering @. I can't do anything about it as far as I know. If the whole string should be interpreted as text, then it should also be HTML-encoded, as I said before.
- The CSS style tag case is a whole different scenario, where the problem is not HTML parsing but CSS parsing. In the example provided by @log2akshat, the @media rule is not valid for the Batik-CSS library (it expects the opening { right after the not word), so it fails and stops parsing the whole CSS block, that is, the HTML-embedded style sheet. Why is that? You may say Batik-CSS is not smart enough (I already pointed this out in #108), and it is a valid point.
In conclusion, there is not much to do because of the underlying libraries that parse HTML and CSS. They expect text in their respective formats and parse it according to their target spec, so parsing invalid HTML will produce a weird result for sure. If you fear someone may put HTML in a place where it shouldn't be, the solution is to HTML-encode, not to filter. Maybe the whole string, maybe only the fragments that must be HTML-free but get inserted in an HTML template; that depends entirely on the usage context.
The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition. I hope all this explanation makes the issues clear.
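To make the "HTML-encode, not filter" suggestion concrete, here is a minimal sketch in plain Java. The class and method names (HtmlEncodeSketch, encode) are made up for illustration and are not part of AntiSamy. It only escapes the five characters that are significant in HTML text and attribute values, so that raw text such as hello <hii world reaches an HTML context as text rather than as markup.

```java
// Hypothetical helper for illustration only; not an AntiSamy API.
final class HtmlEncodeSketch {
    private HtmlEncodeSketch() {}

    /** Escapes &, <, >, double quote and single quote so the input is treated as text. */
    static String encode(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw text from this thread that is not meant to be HTML.
        System.out.println(encode("hello <hii world")); // prints: hello &lt;hii world
    }
}
```

In a real project a maintained encoder (for example Encode.forHtml from the OWASP Java Encoder) is probably a better choice than a hand-rolled loop; the sketch is only meant to show where the encoding step sits.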
from antisamy.
@spassarop - You did a lot of analysis on this one. Are there any changes you are comfortable with making that would improve anything?
from antisamy.
The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition.
Maybe this, so that unknown tags like <name> get a chance to be detected as such and honor the “onUnknownTag” behavior, regardless of whether they are empty or not. An unknown and empty tag does not get to that point today.
I’m not sure it improves anything, but at least it will be consistent with the expected behavior of “onUnknownTag” in the policy definition. It can be done for the next version.
The other problems cannot be solved through AntiSamy code. In my opinion they are due to wrong usage and limitations of the underlying libraries, as I stated in the previous analysis.
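As a rough model of the ordering issue described above, the following sketch contrasts the check order as it is described in this thread with the proposed one. It is not AntiSamy's actual code: the class, enum and parameter names are all hypothetical, and the policy is reduced to two sets of tag names plus a boolean standing in for onUnknownTag="encode".

```java
import java.util.Set;

// Simplified, hypothetical model of the decision order; not taken from AntiSamy's source.
final class EmptyTagOrderSketch {
    // UNKNOWN_DEFAULT stands for whatever the policy does with unknown tags when "encode" is not set.
    enum Action { KEEP, REMOVE, ENCODE, UNKNOWN_DEFAULT }

    private EmptyTagOrderSketch() {}

    /** Order described in the thread: the empty-tag check runs before unknown-tag handling. */
    static Action currentOrder(String tag, boolean isEmpty, Set<String> policyTags,
                               Set<String> allowedEmptyTags, boolean encodeUnknown) {
        if (isEmpty && !allowedEmptyTags.contains(tag)) {
            return Action.REMOVE; // an unknown empty tag never reaches the encode step
        }
        if (!policyTags.contains(tag)) {
            return encodeUnknown ? Action.ENCODE : Action.UNKNOWN_DEFAULT;
        }
        return Action.KEEP;
    }

    /** Proposed order: only known tags are removed for being empty and not allowed as empty. */
    static Action proposedOrder(String tag, boolean isEmpty, Set<String> policyTags,
                                Set<String> allowedEmptyTags, boolean encodeUnknown) {
        if (!policyTags.contains(tag)) {
            return encodeUnknown ? Action.ENCODE : Action.UNKNOWN_DEFAULT;
        }
        if (isEmpty && !allowedEmptyTags.contains(tag)) {
            return Action.REMOVE;
        }
        return Action.KEEP;
    }

    public static void main(String[] args) {
        Set<String> policyTags = Set.of("div", "br");
        Set<String> allowedEmpty = Set.of("br");
        // The empty, unknown <name> tag from this issue, with onUnknownTag set to encode.
        System.out.println(currentOrder("name", true, policyTags, allowedEmpty, true));  // REMOVE
        System.out.println(proposedOrder("name", true, policyTags, allowedEmpty, true)); // ENCODE
    }
}
```

Under these assumptions, known tags behave the same in both orders; only empty tags that are missing from the policy change from being removed to following the onUnknownTag setting.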
from antisamy.