I am using antisamy 1.5.7. I saw issue when input was <p dir="au

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Antisamy stripping @ and not encoding if it falls within <> about antisamy HOT 9 OPEN

nahsra commented on July 4, 2024 2

Antisamy stripping @ and not encoding if it falls within <>

from antisamy.

Comments (9)

nahsra commented on July 4, 2024 1

This use case is not possible with the current architecture. I see the value. I think it’s a candidate for 1.5.9.

from antisamy.

davewichers commented on July 4, 2024

I'm investigating this. Using a DOM parser, with these settings, I get only: "firstname,lastname" in the output of .getCleanHTML(). Using a SAX parser, I get: "firstname,lastname<name></name>". I'm not sure if this inconsistency in output is a different bug. I'll research that too.

from antisamy.

goshantmeher commented on July 4, 2024

I have similar kind of issue when i try to scan with text: "hello <hii world", i get the output : "hello".
and scan is not giving any error in CleanResults for this. but with
when i scan text "hello <hi>" it will give output: "hello" with error "[The hi tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.]"
is there any workaround to get output "hello <hii world" on scan "hello <hii world".

from antisamy.

log2akshat commented on July 4, 2024

I have a similar issue, using the Antisamy library with version 1.5.8 and I tried writing the following unit test case:

Input:

<html><head><style>.uegzbq{font-size:22px;}@media not all and (pointer:coarse){.8bsfb:hover{background-color:#056b27;}}.scem3j{font-size:25px;}</style></head><body><div class="uegzbq">First Line</div><br><div class="scem3j">Second Line</div></body></html>

Expected on org.owasp.validator.html.CleanResults.getCleanHTML:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
@media not all and (pointer:coarse) {
       .8bsfb: hover {
                background-color: #056b27;
        }
}
.scem3j{
       font-size:25px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

Actual:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

I can see that in the org.owasp.validator.html.scan.AntiSamyDOMScanner class, I was having the expected string prior to serialization and after the org.apache.xml.serialize.HTMLSerializer has done the serialization to the DocumentFragment Whatever it was after @ symbol got stripped off in the style tag.
I tried upgrading to 1.6.3 but still having the same issue.

from antisamy.

nikowitt commented on July 4, 2024

Also stumbled across this one - are there any plans to work on this in the near future?

from antisamy.

davewichers commented on July 4, 2024

@spassarop - Is this even possible/reasonable? Or way too hard? I suspect 'too hard'.

from antisamy.

spassarop commented on July 4, 2024

Some points to cover and clarify to understand the current behavior:

AntiSamy parses the input with cyberneko as HTML, so it expects valid HTML to build the internal representation which gets validated. Raw inputs like hello <hii world and firstname,lastname<[email protected]> (provided by @goshantmeher and @nikowitt respectively) are not valid HTML, so the library returns a best effort parse result auto-closing tags and so. Raw text that goes through an HTML parser but is not intended to be HTML should be HTML-encoded first, at least that specific fragments. It is just the nature of the parser, as with any other kind of parser.
Using <directive name="onUnknownTag" value="encode"/> does not work in this particular case (the one with [email protected]) because in the validation logic, a prior step is to check if the tag is empty AND if is in the allowed empty tags list. As the tag is not defined, it won't be in the policy, it won't be allowed as empty and therefore removed before even getting to the encoding step. Just to test I added "name" as an allowed empty tag and re-tested, resulting in firstname,lastname<name/>. This takes us to the following point.
The cyberneko parser naturally does not take the whole [email protected] as a valid tag name, it internally truncates it when encountering @. Can't do anything about it as far as I know, if the whole string should be interpreted as text, then it should also be HTML-encoded as I said before.
The CSS style tag case is a whole different scenario, where the problem is not HTML parsing, but CSS parsing. In the example provided by @log2akshat, the @media rule is not valid for Batik-CSS library (it expects the opening { right after the not word), so it decides to fail and stop parsing the whole CSS block, the HTML-embedded style sheet. Why is that? You may say Batik-CSS is not smart enough (already pointed this out in #108) and it a valid point.

In conclusion, there is not much to do because of the underlying libraries that parse HTML and CSS. They just expect text in their respective formats to evaluate and parse them in their target spec, parsing invalid HTML will result in a weird result for sure. If you fear someone may put HTML in a place where it shouldn't the solution is to HTML-encode, not to filter. Maybe the whole string, maybe the fragments that must be HTML-free but get inserted in an HTML template, that depends entirely on the usage context.

The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition. I hope all this explanation make the issues clear.

from antisamy.

davewichers commented on July 4, 2024

@spassarop - You did a lot of analysis on this one. Are there any changes you are comfortable with making that would improve anything?

from antisamy.

spassarop commented on July 4, 2024

The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition.

Maybe this, so tags like <name> which are unknown get a chance to be detected as such and honor the “onUnknownTag” behavior, independently whether they’re empty tags or not. An unknown and empty tag does not get to that point today.

I’m not sure if it does improve something but at least will be consistent with the expected behavior of “onUnknownTag” on policy definition. It can be done for the next version.

The other problems cannot be solved through AntiSamy code. In my opinion they’re due to wrong usage and underlying libraries limitations, as I stated in the previous analysis.

from antisamy.

Antisamy stripping @ and not encoding if it falls within <> about antisamy HOT 9 OPEN

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs