
Comments (9)

nahsra commented on July 4, 2024

This use case is not possible with the current architecture. I see the value. I think it’s a candidate for 1.5.9.

from antisamy.

davewichers commented on July 4, 2024

I'm investigating this. Using a DOM parser, with these settings, I get only: "firstname,lastname" in the output of .getCleanHTML(). Using a SAX parser, I get: "firstname,lastname<name></name>". I'm not sure if this inconsistency in output is a different bug. I'll research that too.


goshantmeher commented on July 4, 2024

I have a similar issue. When I scan the text "hello <hii world", I get the output "hello", and the scan does not report any error in CleanResults.
But when I scan the text "hello <hi>", the output is "hello" with the error "[The hi tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.]"
Is there any workaround to get the output "hello <hii world" when scanning "hello <hii world"?


log2akshat commented on July 4, 2024

I have a similar issue using version 1.5.8 of the AntiSamy library. I tried writing the following unit test case:

Input:

<html><head><style>.uegzbq{font-size:22px;}@media not all and (pointer:coarse){.8bsfb:hover{background-color:#056b27;}}.scem3j{font-size:25px;}</style></head><body><div class="uegzbq">First Line</div><br><div class="scem3j">Second Line</div></body></html>

Expected on org.owasp.validator.html.CleanResults.getCleanHTML:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
@media not all and (pointer:coarse) {
       .8bsfb: hover {
                background-color: #056b27;
        }
}
.scem3j{
       font-size:25px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

Actual:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

I can see that in the org.owasp.validator.html.scan.AntiSamyDOMScanner class, the string was as expected prior to serialization. After org.apache.xml.serialize.HTMLSerializer serialized the DocumentFragment, everything from the @ symbol onward in the style tag had been stripped.
I tried upgrading to 1.6.3 but still have the same issue.


nikowitt commented on July 4, 2024

Also stumbled across this one - are there any plans to work on this in the near future?


davewichers commented on July 4, 2024

@spassarop - Is this even possible/reasonable? Or way too hard? I suspect 'too hard'.


spassarop commented on July 4, 2024

Some points to cover and clarify to understand the current behavior:

  • AntiSamy parses the input as HTML with cyberneko, so it expects valid HTML from which to build the internal representation that gets validated. Raw inputs like hello <hii world and firstname,lastname<[email protected]> (provided by @goshantmeher and @nikowitt respectively) are not valid HTML, so the library returns a best-effort parse result, auto-closing tags and so on. Raw text that goes through an HTML parser but is not intended to be HTML should be HTML-encoded first, or at least the specific fragments in question should be. That is simply the nature of the parser, as with any other kind of parser.
  • Using <directive name="onUnknownTag" value="encode"/> does not work in this particular case (the one with [email protected]) because, in the validation logic, an earlier step checks whether the tag is empty AND whether it is in the allowed empty tags list. As the tag is not defined, it is not in the policy, so it is not allowed as an empty tag and is therefore removed before the encoding step is ever reached. Just to test, I added "name" as an allowed empty tag and re-tested, which resulted in firstname,lastname&lt;name/&gt;. This takes us to the following point.
  • The cyberneko parser naturally does not take the whole [email protected] as a valid tag name; it internally truncates it on encountering the @. As far as I know, nothing can be done about that. If the whole string should be interpreted as text, then it should also be HTML-encoded, as I said before.
  • The CSS style tag case is an entirely different scenario, where the problem is not HTML parsing but CSS parsing. In the example provided by @log2akshat, the @media rule is not valid for the Batik-CSS library (it expects the opening { right after the word not), so it fails and stops parsing the whole CSS block, that is, the HTML-embedded style sheet. Why is that? You may say Batik-CSS is not smart enough (I already pointed this out in #108), and that is a valid point.

In conclusion, there is not much to do, because of the underlying libraries that parse HTML and CSS. They expect text in their respective formats and parse it against their target spec, so parsing invalid HTML will certainly produce odd results. If you fear someone may put HTML in a place where it shouldn't be, the solution is to HTML-encode, not to filter. Whether to encode the whole string or only the fragments that must be HTML-free but get inserted into an HTML template depends entirely on the usage context.
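To illustrate the HTML-encode suggestion, here is a minimal sketch in plain Java with no dependencies. The class name HtmlEncodeSketch and the method encode are made up for this example and are not part of AntiSamy; in a real project a maintained encoder library would be a better choice than a hand-rolled one.

```java
// Hypothetical helper for illustration only; not part of the AntiSamy API.
final class HtmlEncodeSketch {

    /** Replaces the five characters that are significant in HTML with entities. */
    static String encode(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw text from the thread that is not meant to be parsed as HTML.
        System.out.println(encode("hello <hii world")); // prints: hello &lt;hii world
    }
}
```

Text encoded this way reaches the HTML parser as character data, so there is no half-open tag for it to auto-close or drop.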

The most I can offer here is the allowed empty tag stuff: a logic change to remove them only if they are known tags but not present in that policy fragment definition. I hope all this explanation makes the issues clear.


davewichers commented on July 4, 2024

@spassarop - You did a lot of analysis on this one. Are there any changes you are comfortable with making that would improve anything?


spassarop commented on July 4, 2024

The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition.

Maybe this, so that unknown tags like <name> get a chance to be detected as such and honor the “onUnknownTag” behavior, regardless of whether they are empty tags or not. An unknown and empty tag does not get to that point today.

I’m not sure whether it improves anything, but at least it will be consistent with the expected behavior of “onUnknownTag” in the policy definition. It can be done for the next version.
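As a rough model of the ordering described here (not the actual AntiSamy source), the difference can be sketched as two small decision functions. All names below (EmptyTagOrderSketch, Action, current, proposed) are hypothetical, and the sketch assumes onUnknownTag is set to "encode".

```java
// Toy model of the check order discussed in this thread; hypothetical names,
// not taken from the AntiSamy code base. Assumes onUnknownTag="encode".
final class EmptyTagOrderSketch {

    enum Action { REMOVE, ENCODE, VALIDATE }

    /** Order as described for today: the empty-tag check runs before the unknown-tag check. */
    static Action current(boolean knownTag, boolean empty, boolean allowedEmpty) {
        if (empty && !allowedEmpty) return Action.REMOVE;
        if (!knownTag) return Action.ENCODE;
        return Action.VALIDATE;
    }

    /** Suggested order: unknown tags are handled first, so only known empty tags can be removed. */
    static Action proposed(boolean knownTag, boolean empty, boolean allowedEmpty) {
        if (!knownTag) return Action.ENCODE;
        if (empty && !allowedEmpty) return Action.REMOVE;
        return Action.VALIDATE;
    }

    public static void main(String[] args) {
        // An unknown, empty tag such as <name> that is not in the allowed empty tags list.
        System.out.println(current(false, true, false));  // prints: REMOVE
        System.out.println(proposed(false, true, false)); // prints: ENCODE
    }
}
```

In this model, adding the tag to the allowed empty tags list makes current(false, true, true) return ENCODE, which matches the encoded result seen in the earlier test with "name".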

The other problems cannot be solved through AntiSamy code. In my opinion they are due to incorrect usage and limitations of the underlying libraries, as I stated in the previous analysis.
