Comments (10)
@izian I can replicate the behavior that <div>Hello\uD83D\uDC95</div> doesn't change when getting sanitized by AntiSamy. However, when I try just \uD888, that doesn't get sanitized either. Can you show me the exact usage of AntiSamy that is resulting in a single character getting sanitized properly as you describe?
@nahsra - Are these special unicode characters supposed to get stripped by AntiSamy's default policy?
from antisamy.
The reason we were searching for these unusual (at least, they were unusual 10 years ago) Unicode ranges was that the underlying XML parsing API blew up when encountering them. Therefore we had to add some defensive code to filter them out before processing.
I agree that this code could be cleaned up a lot, and the XML dependencies could have become more sturdy since then, and that there is nominal risk from this behavior. This issue should stay open and be acted upon but isn’t a blocker for release.
I appreciate the thoughtful write up.
from antisamy.
To be clear, this issue is not a vulnerability (see @nahsra's 'nominal risk' comment). It is intended functionality that doesn't work, but the fact that it doesn't isn't a security issue.
from antisamy.
Thanks for checking it out.
I cannot remember why we thought that this was part of the security of the utility. For some reason, when I first took a look, our code was documented as using this to make sure we didn’t get malicious control characters through.
Perhaps these are a thing of the past or were never really in existence.
In either case, given that this functionality hasn’t worked in a while (unless you pass through a single character) it would likely be worse for people who have become accustomed to no character filtering if it were “fixed” rather than simply removed if no longer needed.
I just wanted to point out the matches / find. It’s a common trip up.
Like forEach vs forEachOrdered and findFirst vs findAny and list removeAll not being the same as removing via iteration. I see matches / find issues in a lot of code.
Thanks for taking the time to look into it.
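For readers unfamiliar with the matches / find distinction mentioned above, here is a minimal sketch. It assumes a reduced pattern covering only the control-character part of the check (the class and method names are hypothetical, not AntiSamy's). It shows why a matches() based check only fires when the whole input is a single offending character.

```java
import java.util.regex.Pattern;

final class MatchesVsFindDemo {
    // Assumption: control characters except tab, LF and CR only;
    // a reduced form of the pattern discussed in this thread.
    static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\u0000-\\u001F&&[^\\u0009\\u000A\\u000D]]");

    // matches() succeeds only when the ENTIRE input matches the pattern.
    static boolean wholeInputMatches(String s) {
        return CONTROL_CHARS.matcher(s).matches();
    }

    // find() succeeds when ANY part of the input matches the pattern.
    static boolean containsMatch(String s) {
        return CONTROL_CHARS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("\u0001"));            // true
        System.out.println(wholeInputMatches("<div>\u0001</div>")); // false
        System.out.println(containsMatch("<div>\u0001</div>"));     // true
    }
}
```

Because the character class matches exactly one character, matches() can only return true for a one-character input, which is consistent with the "unless you pass through a single character" observation.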
from antisamy.
@nahsra - Given the comment: "given that this functionality hasn’t worked in a while (unless you pass through a single character) it would likely be worse for people who have become accustomed to no character filtering if it were “fixed” rather than simply removed if no longer needed." do you think we should still try to fix this, or just leave it alone?
from antisamy.
@spassarop - Hey Sebastian - any clue how to address this old issue? There are two test cases for this issue already in the test class, but they are commented out: testGithubIssue34a() and testGithubIssue34b(). Same for issue #24 and its test case, testGithubIssue24(). If you could address either or both, that would be fantastic.
from antisamy.
@davewichers, @izian - I looked into the function that checks the regex of "invalid characters". I've tested with find() instead of matches(), and for the DOM parser it gets better results. However, when it comes to replacing, the surrogate pair \uD83D\uDC95, which forms 💕, does not want to be entirely replaced. Instead, one of the unicode characters remains.
I've tried with different inputs and changing the pattern but the result does not improve. I know it probably isn't a performant solution, but implementing the replacement by hand does in fact replace all characters. To remind, this is the pattern:
"[\u0000-\u001F\uD800-\uDFFF\uFFFE-\uFFFF&&[^\u0009\u000A\u000D]]"
I've built a string with all the supposedly invalid characters with the following code:
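StringBuilder s = new StringBuilder();
char c = (char) 0x0;
do {
    // Append every char the pattern treats as invalid (tab, LF and CR are excluded).
    if ((c >= 0x0 && c <= 0x1f || c >= 0xd800 && c <= 0xdfff || c >= 0xfffe && c <= 0xffff) && c != 0x9 && c != 0xa && c != 0xd)
        s.append(c);
    c = (char) (c + 1);
} while (c != 0xffff); // Note: the loop exits before 0xffff itself is appended.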
Then I've run this assertion which leaves a single unicode character (can't remember which one but it's in the middle of a range from the pattern):
assertEquals("<div>Hello</div>", as.scan("<div>Hello"+s.toString()+"</div>", policy, AntiSamy.DOM).getCleanHTML());
Then I've copied and pasted the current implementation on the .NET project for StripNonValidXmlCharacters and made the changes to compile in Java:
if (in == null || ("".equals(in))) {
    return ""; // vacancy test.
}
StringBuilder cleanText = new StringBuilder(); // Used to hold the output.
char current; // Used to reference the current character.
for (int i = 0; i < in.length(); i++)
{
    current = in.charAt(i);
    // Keep tab, LF, CR and the ranges allowed in XML; everything else is dropped.
    // A Java char is 16 bits, so the last range (0x10000 and above) can never match here;
    // each half of a surrogate pair falls outside the other ranges and is dropped on its own.
    if ((current == 0x9) || (current == 0xA) || (current == 0xD)
        || ((current >= 0x20) && (current <= 0xD7FF))
        || ((current >= 0xE000) && (current <= 0xFFFD))
        || ((current >= 0x10000) && (current <= 0x10FFFF)))
    {
        cleanText.append(current);
    }
}
return cleanText.toString();
And well, that works because it's "manual" replacing without depending on Java's built-in functions. The previous find() call can be kept as a guard so the replacement is skipped when nothing matches, if it's too expensive to always do it.
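A minimal sketch of that guard idea, under stated assumptions: the pattern is reduced to the control characters plus \uFFFE-\uFFFF (surrogates are deliberately left out of this illustration), and the class and method names are hypothetical rather than AntiSamy's. Clean input returns early, and only input containing a match pays for the manual loop.

```java
import java.util.regex.Pattern;

final class GuardedStripSketch {
    // Assumption: control chars (minus tab, LF, CR) and U+FFFE/U+FFFF only.
    private static final Pattern SUSPECT =
            Pattern.compile("[\\u0000-\\u001F\\uFFFE-\\uFFFF&&[^\\u0009\\u000A\\u000D]]");

    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        // Fast path: nothing to remove, so return the input untouched.
        if (!SUSPECT.matcher(in).find()) {
            return in;
        }
        // Slow path: copy everything except the characters the pattern targets.
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            boolean control = c <= 0x1F && c != 0x9 && c != 0xA && c != 0xD;
            boolean nonChar = c == 0xFFFE || c == 0xFFFF;
            if (!control && !nonChar) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The guard and the loop must agree on which characters count as invalid; if they drift apart, the fast path silently skips input the loop would have changed.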
About SAX parser:
- It does not have the character check, it's implemented only at DOM parser.
- If it were implemented for SAX, it would behave the same as for DOM. But when the input is a Reader object, AntiSamy can't read the plain HTML without some transformations, and I would hope those don't alter any character in it.
That's all I can say after some hours of analyzing this.
from antisamy.
I can take another look to refresh my 2 year memory of this one in the morrow.
I remember thinking about:
Are most strings clean? It would be good to not have overhead on the 99% use case.
If you were about to strip a character, is it worth checking whether it and its following character represent a surrogate pair? Do you even want to allow 16 bit but valid characters through?
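One way to explore the surrogate-pair question is to iterate by code point instead of by char. The sketch below is an illustration under assumptions, not the library's behavior: it keeps valid surrogate pairs (which differs from the pattern in this thread, where every surrogate is listed as invalid), drops unpaired surrogates, and uses the character ranges from the XML 1.0 Char production. The class and method names are hypothetical.

```java
final class CodePointStripSketch {
    // Ranges allowed by the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String strip(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt combines a valid high/low surrogate pair into one
            // code point; an unpaired surrogate is returned as-is (0xD800-0xDFFF).
            int cp = in.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }
}
```

With this approach a pair is either kept whole or removed whole, so the output never contains half of a surrogate pair.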
I recognise that the use of this method by our developer is probably far removed from its intended use. The same developer also escaped JSON as if it were HTML.
I only happened upon this library by chance chasing down my missing code point and thought it was here it was lost; it was actually another library pretending to do good for XML and it only removed one of the code points; leaving back part of a surrogate pair which destroyed the XML when it was being serialised again.
Maybe I could afford some time to take a look again here and see if I can follow what you’ve written and come to any understanding of a fix in line with your intention for the method; not our use :-)
from antisamy.
I've extended the test @izian mentioned before, the one to test performance, just to check locally how bad the solution I proposed above was. Executions vary: sometimes the "manual" replacement is faster than replaceAll(), sometimes the other way round. However, the differences are at most 5 ms. So if the implementation of the replacement changes, all characters are replaced and performance should stay the same.
The only thing that worries me is the SAX parser with the Reader object: because of the default encoding used for reading it, we would have to read it, do the cleansing and build the object again without losing anything. Maybe I'm worrying when I shouldn't, but I don't know that.
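As a rough illustration of that round trip, here is a sketch that drains a Reader into a String, cleans it, and wraps the result in a new StringReader. A java.io.Reader already delivers decoded chars, so any byte-to-char decoding was done by whoever created it, and this step does not re-encode anything. The names are hypothetical and clean() is only a stand-in for the real cleansing step.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReaderCleanSketch {
    // Drain the reader into memory; chars are copied as-is.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical stand-in for the cleansing step: drops control chars
    // other than tab, LF and CR, and leaves everything else untouched.
    static String clean(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x1F || c == 0x9 || c == 0xA || c == 0xD) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Read, clean, and hand back a fresh Reader over the cleaned text.
    static Reader cleanedReader(Reader in) {
        return new StringReader(clean(readAll(in)));
    }
}
```

The trade-off is that the whole input is buffered in memory, which gives up the streaming benefit of passing a Reader in the first place.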
However, I've also used the test string with all unicode characters that should be stripped to make another test. The test is to not strip at all. I've commented out the line where cleansing occurs and the library gave an output, so it didn't blow up. Maybe after the NekoHTML updates that is no longer a problem and we can remove the invalid XML characters validation. I'm just sharing that information; I prefer not to make that decision.
from antisamy.