It looks like somewhere around version 1.5 that the method o

stripNonValidXMLCharacters doesn't work with HTML where html.length() > 1 (antisamy issue, open)

nahsra commented on July 24, 2024

stripNonValidXMLCharacters doesn't work with HTML where html.length() > 1

from antisamy.

Comments (10)

davewichers commented on July 24, 2024

@izian I can replicate the behavior that <div>Hello\uD83D\uDC95</div> doesn't change when it is sanitized by AntiSamy. However, when I try just \uD888, that doesn't get sanitized either. Can you show me the exact usage of AntiSamy that results in a single character getting sanitized properly, as you describe?

@nahsra - Are these special unicode characters supposed to get stripped by AntiSamy's default policy?

nahsra commented on July 24, 2024

The reason we were searching for these unusual (at least, they were unusual 10 years ago) Unicode ranges was that the underlying XML parsing API blew up when it encountered them. Therefore we had to add some defensive code to filter them out before processing.

I agree that this code could be cleaned up a lot, and the XML dependencies could have become more sturdy since then, and that there is nominal risk from this behavior. This issue should stay open and be acted upon but isn’t a blocker for release.

I appreciate the thoughtful write up.

davewichers commented on July 24, 2024

To be clear, this issue is not a vulnerability (see @nahsra's 'nominal risk' comment). It is intended functionality that doesn't work, but the fact that it doesn't work isn't a security issue.

izian commented on July 24, 2024

Thanks for checking it out.
I cannot remember why we thought that this was part of the security of the utility. For some reason, when I first took a look, our code was documented as using this to make sure we didn’t get malicious control characters through.
Perhaps these are a thing of the past or were never really in existence.
In either case, given that this functionality hasn’t worked in a while (unless you pass through a single character) it would likely be worse for people who have become accustomed to no character filtering if it were “fixed” rather than simply removed if no longer needed.

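I just wanted to point out the matches() / find() difference: matches() only succeeds when the entire input matches the pattern, whereas find() succeeds when any part of it does. It’s a common trip-up.
It is like forEach vs forEachOrdered, findFirst vs findAny, and list removeAll not being the same as removing via iteration. I see matches / find issues in a lot of code.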

Thanks for taking the time to look into it.
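For readers following along, the matches() / find() difference can be shown with a small sketch. The class and method names below are hypothetical and only for illustration; the pattern is the one quoted later in this thread, written with doubled backslashes so that the regex engine, not the Java compiler, interprets the Unicode escapes.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // Pattern from the thread: control chars, surrogates and 0xFFFE-0xFFFF,
    // minus tab, line feed and carriage return.
    static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[\\u0000-\\u001F\\uD800-\\uDFFF\\uFFFE-\\uFFFF&&[^\\u0009\\u000A\\u000D]]");

    // True only if the WHOLE input is exactly one flagged character.
    static boolean wholeInputMatches(String s) {
        return INVALID_XML_CHARS.matcher(s).matches();
    }

    // True if ANY character in the input is flagged.
    static boolean anyCharMatches(String s) {
        return INVALID_XML_CHARS.matcher(s).find();
    }

    public static void main(String[] args) {
        String ctrl = String.valueOf((char) 0x1);
        String html = "<div>Hello" + ctrl + "</div>";
        System.out.println(wholeInputMatches(ctrl)); // true
        System.out.println(wholeInputMatches(html)); // false
        System.out.println(anyCharMatches(html));    // true
    }
}
```

This is consistent with the title of the issue: a check based on matches() can only ever fire when the input is a single character.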

davewichers commented on July 24, 2024

@nahsra - Given the comment: "given that this functionality hasn’t worked in a while (unless you pass through a single character) it would likely be worse for people who have become accustomed to no character filtering if it were “fixed” rather than simply removed if no longer needed." do you think we should still try to fix this, or just leave it alone?

davewichers commented on July 24, 2024

@spassarop - Hey Sebastian - any clue how to address this old issue? There are two test cases for this issue already in the test class, but they are commented out: testGithubIssue34a() and testGithubIssue34b(). The same goes for issue #24 and its test case, testGithubIssue24(). If you could address either or both, that would be fantastic.

spassarop commented on July 24, 2024

@davewichers, @izian - I looked into the function that checks the regex of "invalid characters". I've tested with find() instead of matches(), and for the DOM parser it gets better results. However, when it comes to replacing, the surrogate pair \uD83D\uDC95, which forms 💕, does not get entirely replaced. Instead, one of the two UTF-16 code units remains.

I've tried different inputs and changes to the pattern, but the result does not improve. I know it probably isn't a performant solution, but implementing the replacement by hand does in fact replace all characters. As a reminder, this is the pattern:

"[\u0000-\u001F\uD800-\uDFFF\uFFFE-\uFFFF&&[^\u0009\u000A\u000D]]"

I've built a string with all the supposedly invalid characters with the following code:

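StringBuilder s = new StringBuilder();
// Iterate with an int so the loop can also reach 0xFFFF without char overflow.
for (int c = 0x0; c <= 0xffff; c++) {
    boolean inStrippedRange = c <= 0x1f || (c >= 0xd800 && c <= 0xdfff) || c >= 0xfffe;
    boolean allowedWhitespace = c == 0x9 || c == 0xa || c == 0xd;
    if (inStrippedRange && !allowedWhitespace) {
        s.append((char) c);
    }
}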

Then I've run this assertion which leaves a single unicode character (can't remember which one but it's in the middle of a range from the pattern):

assertEquals("<div>Hello</div>", as.scan("<div>Hello"+s.toString()+"</div>", policy, AntiSamy.DOM).getCleanHTML());

Then I've copied and pasted the current implementation on the .NET project for StripNonValidXmlCharacters and made the changes to compile in Java:

if (in == null || ("".equals(in))) {
    return ""; // vacancy test.
}

StringBuilder cleanText = new StringBuilder(); // Used to hold the output.
char current; // Used to reference the current character.

for (int i = 0; i < in.length(); i++)
{
    current = in.charAt(i);
    if ((current == 0x9) || (current == 0xA) || (current == 0xD)
            || ((current >= 0x20) && (current <= 0xD7FF))
            || ((current >= 0xE000) && (current <= 0xFFFD))
            || ((current >= 0x10000) && (current <= 0x10FFFF)))
    {
        cleanText.append(current);
    }
}

return cleanText.toString();

And well, that works because it's "manual" replacing without depending on Java's builtin functions. The previous find() call can prevent this replacing from happening if it's too expensive to always do it.
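The guard described above (run find() first, and only do the manual replacement on a hit) might look like the following sketch. GuardedStrip and strip are hypothetical names, not AntiSamy's API; the pattern is the one quoted earlier, and the loop mirrors the char-based port, leaving out the 0x10000-0x10FFFF check because a 16-bit char can never reach that range.

```java
import java.util.regex.Pattern;

class GuardedStrip {
    static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[\\u0000-\\u001F\\uD800-\\uDFFF\\uFFFE-\\uFFFF&&[^\\u0009\\u000A\\u000D]]");

    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        // Fast path: the pattern flags nothing, so return the input untouched.
        if (!INVALID_XML_CHARS.matcher(in).find()) {
            return in;
        }
        // Slow path: copy only the characters allowed by the char-based check.
        StringBuilder clean = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD)) {
                clean.append(c);
            }
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        String dirty = "<div>Hello" + (char) 0x1 + "</div>";
        System.out.println(strip(dirty)); // <div>Hello</div>
    }
}
```

With this shape, clean inputs pay only for one regex scan and no copy is made.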

About the SAX parser:

- It does not have the character check; the check is implemented only in the DOM parser.
- If the check gets implemented for SAX, it behaves the same as for DOM. However, when the input is a Reader object, AntiSamy can't read the plain HTML without some transformations, and I hope those would not alter any character in it if this is implemented.

That's all I can say after some hours of analyzing this.

davewichers commented on July 24, 2024

Thanks Sebastian for researching! Do you think we should change/fix AntiSamy in some way or are we already doing the best we can with the parsers we are using?

-Dave

In reply to the comment Sebastián Passaro wrote on Sun, Jan 17, 2021 at 5:38 PM.

izian commented on July 24, 2024

I can take another look on the morrow to refresh my two-year-old memory of this one.
I remember thinking about:
Are most strings clean? It would be good not to add overhead to the 99% use case.
If you are about to strip a character, is it worth checking whether it and its following character represent a surrogate pair? Do you even want to allow 16 bit but valid characters through?
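On the surrogate-pair question: a Java char is 16 bits, so a loop over charAt(i) sees 💕 as two separate surrogates and never sees a value of 0x10000 or above. One way to keep valid pairs while still dropping lone surrogates is to iterate by code point. The sketch below is only an illustration of that idea; CodePointStrip and its method names are hypothetical, not part of AntiSamy, and whether supplementary characters should be let through at all remains the open question.

```java
class CodePointStrip {
    // Allowed ranges, expressed over full code points rather than 16-bit chars.
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder clean = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            // A valid surrogate pair is read as one supplementary code point;
            // a lone surrogate is returned as its own value in 0xD800-0xDFFF.
            int cp = in.codePointAt(i);
            if (isAllowed(cp)) {
                clean.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return clean.toString();
    }

    public static void main(String[] args) {
        String withPair = "<div>Hello\uD83D\uDC95</div>";
        System.out.println(strip(withPair).equals(withPair)); // true
    }
}
```

The extra cost over the char loop is one codePointAt call per character, so the 99% case of clean strings could still be guarded by a cheap pre-check.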

I recognise that the use of this method by our developer is probably far removed from its intended use. The same developer also escaped JSON as if it were HTML.

I only happened upon this library by chance chasing down my missing code point and thought it was here it was lost; it was actually another library pretending to do good for XML and it only removed one of the code points; leaving back part of a surrogate pair which destroyed the XML when it was being serialised again.

Maybe I could afford some time to take another look here, see if I can follow what you’ve written, and come to some understanding of a fix in line with your intention for the method, not our use :-)

spassarop commented on July 24, 2024

I've extended the test @izian mentioned before, the one that tests performance, just to check locally how bad the solution I proposed above is. Executions vary: sometimes the "manual" replacement is faster than replaceAll(), sometimes the other way round. However, the differences are at most 5 ms. So if the replacement implementation changes, all characters get replaced and performance should stay about the same.

The only thing that worries me is the SAX parser with the Reader object: because of the default encoding used to read it, I'm not sure we can read it, do the cleansing, and build the object again without losing something. Maybe I'm worrying when I shouldn't, but I don't know.
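As a rough illustration of the Reader round trip: a java.io.Reader already yields decoded chars, so a filter applied at this layer does not itself pick an encoding (that choice was made wherever the Reader was created). The sketch below assumes it is acceptable to buffer the whole input in memory; ReaderStrip, stripped and drain are hypothetical names, and the check is the char-based one from the port above, so surrogates are dropped as well.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderStrip {
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Reads the source fully, drops disallowed chars, and returns a new Reader.
    static Reader stripped(Reader source) {
        StringBuilder clean = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = source.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (isAllowed(buf[i])) {
                        clean.append(buf[i]);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new StringReader(clean.toString());
    }

    // Helper for demonstration: collects everything a Reader produces.
    static String drain(Reader r) {
        StringBuilder out = new StringBuilder();
        try {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Reader in = new StringReader("<div>Hello" + (char) 0x1 + "</div>");
        System.out.println(drain(stripped(in))); // <div>Hello</div>
    }
}
```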

I've also used the test string containing all the Unicode characters that should be stripped for another test: not stripping at all. I commented out the line where the cleansing occurs, and the library still produced output, so it didn't blow up. Maybe after the NekoHTML updates this is no longer a problem and we can remove the invalid XML character validation. I'm just providing that information; I prefer not to make that decision.
