messense / nh3
Python binding to Ammonia HTML sanitizer Rust crate
Home Page: https://nh3.readthedocs.io
License: MIT License
In the current implementation (v0.2.9), there is no way to allow all data-* attributes (or other generic attribute prefixes).
In the underlying ammonia, the builder allows generic_attribute_prefixes to be specified, and the docs use the data- prefix as an example:
https://docs.rs/ammonia/latest/ammonia/struct.Builder.html#method.generic_attribute_prefixes
I am currently using bleach with an implementation that allows all data-* attributes, and I would like to switch to this library.
Having this ability would allow me to make the switch easily.
Please consider adding this feature.
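To make the request concrete, the check being asked for amounts to a prefix match on the attribute name. The sketch below is plain Python and only models that rule; `is_generic_attribute_allowed` and `ALLOWED_PREFIXES` are hypothetical names for illustration and are not part of nh3 or ammonia.

```python
def is_generic_attribute_allowed(name, prefixes):
    """Return True if the attribute name starts with any allowed prefix."""
    # str.startswith accepts a tuple of candidate prefixes.
    return name.lower().startswith(tuple(prefixes))


# Hypothetical configuration: allow every data-* attribute.
ALLOWED_PREFIXES = frozenset({"data-"})

print(is_generic_attribute_allowed("data-user-id", ALLOWED_PREFIXES))
print(is_generic_attribute_allowed("onclick", ALLOWED_PREFIXES))
```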
The attributes argument is currently typed as dict[str, set[str]], and an attempt to pass a frozenset raises
TypeError: argument 'attributes': 'frozenset' object cannot be converted to 'PySet'
IMO accepting frozenset would be good practice: if the data is immutable, it is better to use an immutable type. For example, allowed attributes can be defined in configuration, where an immutable type is safer.
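Until frozenset is accepted, one possible workaround is to keep the configuration immutable and copy it into plain sets just before the call. This is a minimal sketch; `ALLOWED_ATTRIBUTES` and `to_mutable_attributes` are hypothetical names, and the tag and attribute values are made up for illustration.

```python
from types import MappingProxyType

# Hypothetical immutable configuration: read-only mapping of frozensets.
ALLOWED_ATTRIBUTES = MappingProxyType({
    "a": frozenset({"href", "title"}),
    "img": frozenset({"src", "alt"}),
})


def to_mutable_attributes(config):
    """Copy an immutable mapping of frozensets into dict[str, set[str]]."""
    return {tag: set(attrs) for tag, attrs in config.items()}


attributes = to_mutable_attributes(ALLOWED_ATTRIBUTES)
print(attributes)
```

The resulting dict has the dict[str, set[str]] shape named in the report, so it could be passed as the attributes argument while the configuration itself stays read-only.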
Why does this error appear? I'm using (ubuntu-20.04, 3.6, 3.2) for the build.
error: Couldn't find a setup script in /tmp/easy_install-1n2e9hb9/nh3-0.2.14.tar.gz
Error: Process completed with exit code 1.
https://github.com/agusmakmun/django-markdown-editor/actions/runs/5692732388/job/15430335970?pr=209
Either I am misunderstanding what clean_content_tags does or it is not working correctly. I cannot get the clean_content_tags argument to work on anything other than the two tags <script> and <style>. Using nh3 version 0.2.14, Python 3.11.0.
import nh3
item = "<script>alert('hello')</script><p>hello</p>"
print(nh3.clean(html=item, tags=None, clean_content_tags={'p'}))
I receive this error:
thread '<unnamed>' panicked at 'assertion failed: !self.tags.contains(tag_name)', /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ammonia-3.3.0/src/lib.rs:1792:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/home/Desktop/import nh3.py", line 4, in <module>
    print(nh3.clean(html=item, tags=None, clean_content_tags={'p'}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: assertion failed: !self.tags.contains(tag_name)
I have been able to reproduce this error with b, br, div, and img tags. I haven't tried any others. script and style tags work as expected.
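The assertion in the panic message, `!self.tags.contains(tag_name)`, suggests that ammonia rejects a tag that appears in both the allowed tags and clean_content_tags. A minimal workaround sketch, assuming that reading is correct, removes the overlap before the call. `split_tags` and the small allowlist are hypothetical stand-ins; they are not part of nh3.

```python
def split_tags(allowed, clean_content):
    """Return (tags, clean_content_tags) with no tag present in both sets."""
    clean_content = set(clean_content)
    # A tag whose content should be dropped must not also be allowed.
    return set(allowed) - clean_content, clean_content


# Hypothetical stand-in for the default allowlist.
allowed = {"p", "b", "br", "div", "img"}
tags, clean_content_tags = split_tags(allowed, {"p", "script"})
print(sorted(tags), sorted(clean_content_tags))
```

The two sets could then be passed as the tags and clean_content_tags arguments of nh3.clean, so that p is no longer in both.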
Thanks for this package!
Does this package expose ammonia's features for allowing tags and attributes?
The specific use case is nbconvert, which has a rather specific set of allowances, and while bleach has served us well, having a higher-performance option would be lovely!
By default the output has such tags removed. How can I make it keep something like <asfjiasfj></asfjiasfj>?
It would be nice if the Python package could also expose ammonia’s default whitelisted URL schemes. If I understand the code correctly, this could easily be done by adding m.add("ALLOWED_URL_SCHEMES", a.clone_url_schemes())?; to the nh3 function.
Adding rel="noopener noreferrer" to all <a> tags is not desired in our use case. Would it be possible to add a parameter to disable that?
If I read the ammonia source correctly, it should be achievable with Builder.link_rel(None), so it just needs a way to expose this in the Python interface.
While using the nh3 library, we came across a use case where HTML content is expected for a field, but we need to remove content that can cause an XSS attack. Using nh3.clean() directly on the input text doesn't give the expected result: a lot of useful data is trimmed, which ends up modifying the HTML template input.
import nh3
text = '''
<!DOCTYPE html>
<html>
<head>
<title>HTML Tutorial</title>
</head>
<body>
<h1>This is a heading</h1>
<p>This is a paragraph.</p>
</body>
</html>
'''
# Extend a copy of the default allowlist instead of mutating the module-level set.
tags = nh3.ALLOWED_TAGS | {'title', 'head', 'html', 'div', 'body'}
print(nh3.clean(text, tags=tags, strip_comments=False))
Output:
<title>HTML Tutorial</title>
<h1>This is a heading</h1>
<p>This is a paragraph.</p>
We don't want the html, head, or body tags to be trimmed. Is there a limitation in the nh3 library that disallows these tags?
import nh3
html = "<a href='http://www.google.com'>google.com</a>"
nh3.clean(html, tags={'a'}, attributes={'a': {'href', 'rel'}})
pyo3_runtime.PanicException: assertion failed: self.tag_attributes.get("a").and_then(|a| a.get("rel")).is_none()
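The assertion indicates that ammonia refuses an allowlist containing rel for the a tag, presumably because it manages that attribute itself when adding rel="noopener noreferrer". One way to stay clear of the conflict, sketched under that assumption, is to drop rel from the allowed attributes before the call. `without_managed_rel` is a hypothetical helper, not part of nh3.

```python
def without_managed_rel(attributes):
    """Copy the attribute allowlist, removing 'rel' from the 'a' entry."""
    cleaned = {tag: set(attrs) for tag, attrs in attributes.items()}
    # If there is no 'a' entry, discard() acts on a throwaway empty set.
    cleaned.get("a", set()).discard("rel")
    return cleaned


attributes = without_managed_rel({"a": {"href", "rel"}})
print(attributes)  # {'a': {'href'}}
```

The returned dict could be passed as the attributes argument in place of the original one. The input mapping is left unmodified because the helper copies each set first.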
Discussions are not enabled so opening it here, sorry 'bout it.
With the recent deprecation of bleach (mostly on grounds of html5lib being unmaintained), unless someone has the time to e.g. rebuild the html5lib API on top of an existing html5 parser and the maintainer of bleach decides to use that, ammonia/nh3 seems well positioned as a migration target (there's already one package which has done that visible from the linked Bleach PR).
One issue there is that nh3 currently provides rather limited tuning knobs compared to Ammonia and Bleach (not sure how the two relate as I have not looked yet), but the readme doesn't really say what your eventual goals would be on that front as maintainer. If you do aim to favor such support & migration, maybe an issue or even project (kanban) about full Ammonia support and / or Bleach features parity (if not API compatibility) could be a consideration?
Another possible issue (though more internal) concerns exposing customisations that take arbitrary callables (attribute_filter seems to be the only one currently): nh3 currently releases the GIL during cleanup, which wouldn't allow calling Python functions, and thus exposing a generic attribute_filter. I don't know whether Ammonia has parallelism built in or how much you care about parallel cleaning (though I figure having two paths and only keeping the GIL if callbacks were actually provided would always be an option, if a somewhat more annoying one).
Reading your docs, it seems like nh3.clean(html=somestring) should remove all HTML tags present, because technically tags defaults to None. Instead, it removes <script> and <style> tags but doesn't remove any others. Using nh3 version 0.2.14 and Python 3.11.0.
import nh3
item = "<script>alert('hello')</script><style>hola</style><p>hello</p><b>hi</b>"
print("Output: ", nh3.clean(html=item))
#Output: <p>hello</p><b>hi</b>
I would guess if everything is left at the default, it should be removing all tags since you didn't specify any to keep. I also find it peculiar it automatically removes script and style tags when that's not described as a default behavior in the docs.
Edit to add that I just tested this, and it will remove gibberish tags. I don't understand why...
item = "<asdf>hi</ashgasf>"
print("Output: ", nh3.clean(html=item))
#Output: hi
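The outputs reported above are consistent with tags=None meaning "use a default allowlist" rather than "allow nothing": allowed tags are kept, unknown tags are dropped while their text is kept, and script and style are dropped together with their content. The toy model below reproduces those three rules with the standard library only. It is an illustration under that assumption, not nh3's or ammonia's algorithm, and it is not a safe sanitizer; `ALLOWED`, `DROP_CONTENT`, and `toy_clean` are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"p", "b"}                 # hypothetical stand-in for a default allowlist
DROP_CONTENT = {"script", "style"}   # tags removed together with their content


class ToySanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            self.out.append(f"<{tag}>")  # attributes are dropped in this toy

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text is kept even when the surrounding tag was dropped.
            self.out.append(escape(data, quote=False))


def toy_clean(html):
    parser = ToySanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(toy_clean("<script>alert('hello')</script><style>hola</style><p>hello</p><b>hi</b>"))
print(toy_clean("<asdf>hi</ashgasf>"))
```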
import nh3
print(nh3.clean('<a href="tg://user?id=2135121152">Link</a> and go < here or gor > here </ hei bro ahah >>>', link_rel=None))
How can I allow this href in an <a> tag? Currently it is removed and the output contains only Link/a>
Sorry if I misunderstand something, but to me it looks like the clean_text function doesn't clean HTML or text, but instead escapes it for HTML.
I understand that it's just mirroring the API of ammonia, but perhaps it would be good to have a better-named alias.
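The distinction drawn here, escaping versus sanitizing, can be seen with the standard library's html.escape, used below only as a stand-in. Escaping keeps every character of the input but makes the markup inert; the exact set of characters that nh3.clean_text escapes may differ from html.escape.

```python
from html import escape

snippet = "<b>hi</b> & bye"

# Escaping does not remove anything: the tags survive as visible text.
print(escape(snippet))
# &lt;b&gt;hi&lt;/b&gt; &amp; bye
```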
Would it be possible, when releasing the next version, to create a GitHub release so that it has an entry on https://github.com/messense/nh3/releases? 🤔
When a GitHub release entry is created, an email notification is sent to everyone who has set Watch->Releases for the repo in the web UI.
A GitHub release can contain additional comments (like a changelog) or additional assets such as release tarballs (by default it contains only the assets from the git tag), but none of those parts are obligatory.
In the simplest variant, the GitHub release can be empty, because the subject of the email that is sent contains the git tag name.
I'm asking because my automation uses those email notifications to attempt preliminary automated upgrades of package builds, which saves some time on maintaining packaging procedures.
Other people would probably also be interested in being informed immediately when a new version is released.
Description:
https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository
https://github.com/marketplace/actions/github-release
https://pgjones.dev/blog/trusted-plublishing-2023/
jbms/sphinx-immaterial#281 (comment)