
messense / nh3


Python binding to Ammonia HTML sanitizer Rust crate

Home Page: https://nh3.readthedocs.io

License: MIT License

Languages: Rust 75.56%, Python 24.44%

Topics: bleach, sanitize-html

nh3's Introduction

nh3


Python bindings to the ammonia HTML sanitization library.

Installation

pip install nh3

Usage

See the documentation.

Performance

A quick benchmark shows that nh3 is about 20 times faster than the deprecated bleach package when cleaning the same document. Measured on a MacBook Air (M2, 2022).

Python 3.11.0 (main, Oct 25 2022, 16:25:24) [Clang 14.0.0 (clang-1400.0.29.102)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.9.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import requests

In [2]: import bleach

In [3]: import nh3

In [4]: html = requests.get("https://www.google.com").text

In [5]: %timeit bleach.clean(html)
2.85 ms ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit nh3.clean(html)
138 µs ± 860 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
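
The "about 20 times" figure follows from the two mean timings reported above. As a quick arithmetic check (the variable names are illustrative only):

```python
# Mean per-call timings copied from the %timeit output above, in seconds.
bleach_seconds = 2.85e-3  # 2.85 ms
nh3_seconds = 138e-6      # 138 µs

speedup = bleach_seconds / nh3_seconds
print(f"nh3 is roughly {speedup:.1f}x faster on this input")
```

The ratio applies to this one page on this one machine; other inputs may behave differently.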

License

This work is released under the MIT license. A copy of the license is provided in the LICENSE file.

nh3's People

Contributors

adamchainz, damianzaremba, dependabot[bot], lepture, messense, monosans, seanbudd


nh3's Issues

Alias for clean_text function: escape?

Sorry if I misunderstand something, but to me it looks like the clean_text function doesn't clean HTML or text; it escapes text for use in HTML.
I understand that it just mirrors the ammonia API, but it might be good to have a better-named alias.
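
A minimal sketch of such an alias, defined on the caller's side. The name `escape` is hypothetical and not part of nh3; only `nh3.clean_text` is an existing function.

```python
def escape(text: str) -> str:
    """Hypothetical alias: escape text for safe inclusion in HTML."""
    # Imported here so that nh3 is only required when the helper is called.
    import nh3

    # clean_text does not strip markup; it encodes HTML-significant characters.
    return nh3.clean_text(text)
```

With this in place, `escape("<b>hi</b>")` returns text in which the angle brackets are encoded as entities rather than removed.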

Feature goals compared to Bleach (/ full ammonia API)?

Discussions are not enabled, so I'm opening this here; sorry about that.

With the recent deprecation of bleach (mostly on the grounds that html5lib is unmaintained), ammonia/nh3 seems well positioned as a migration target, unless someone has the time to, e.g., rebuild the html5lib API on top of an existing HTML5 parser and the maintainer of bleach decides to use that. There is already one package that has migrated, visible from the linked Bleach PR.

One issue is that nh3 currently provides rather limited tuning knobs compared to Ammonia and Bleach (I'm not sure how the two relate, as I have not looked yet), and the README doesn't really say what your eventual goals are on that front as maintainer. If you do aim to support such migrations, maybe an issue or even a project (kanban) about full Ammonia support and/or Bleach feature parity (if not API compatibility) could be worth considering?

Another possible issue (though more internal) concerns exposing customisations that accept arbitrary callables (attribute_filter seems to be the only one currently). nh3 currently releases the GIL during cleanup, which wouldn't allow calling Python functions and therefore prevents exposing a generic attribute_filter. I don't know whether Ammonia has parallelism built in or how much you care about parallel cleaning, though having two paths and only keeping the GIL when callbacks are actually provided would always be an option, if a somewhat more annoying one.

PanicException if allowed attribute is missing

import nh3

html = "<a href='http://www.google.com'>google.com</a>"
nh3.clean(html, tags={'a'}, attributes={'a': {'href', 'rel'}})
pyo3_runtime.PanicException: assertion failed: self.tag_attributes.get("a").and_then(|a| a.get("rel")).is_none()
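
The assertion message suggests that ammonia refuses to have `rel` allowed on `<a>` while it is also managing that attribute itself (another issue on this page mentions that rel="noopener noreferrer" is added automatically). Assuming that reading is correct, one workaround sketch is to drop `rel` from the allowed attributes before calling `nh3.clean`. The helper name `drop_managed_rel` is hypothetical and not part of nh3.

```python
def drop_managed_rel(attributes: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return a copy of `attributes` with 'rel' removed from the <a> entry."""
    # Copy every set so the caller's configuration is left untouched.
    cleaned = {tag: set(names) for tag, names in attributes.items()}
    if "a" in cleaned:
        # Assumption: 'rel' on <a> conflicts with the automatically added rel value.
        cleaned["a"].discard("rel")
    return cleaned


attributes = drop_managed_rel({"a": {"href", "rel"}})
print(attributes)
```

The result can then be passed as the `attributes` argument in the call shown in the report.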

clean_content_tags doesn't seem to work on tags other than <script> or <style>

Either I am misunderstanding what clean_content_tags does or it is not working correctly. I cannot get the clean_content_tags argument to work on anything other than the two tags <script> and <style>. Using nh3 version 0.2.14, Python 3.11.0.

import nh3

testItem = "<script>alert('hello')</script><p>hello</p>"
print(nh3.clean(html=testItem, clean_content_tags={'p'}))

I receive this error:

thread '<unnamed>' panicked at 'assertion failed: !self.tags.contains(tag_name)', /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ammonia-3.3.0/src/lib.rs:1792:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/home/Desktop/import nh3.py", line 4, in <module>
    print(nh3.clean(html=item, tags=None, clean_content_tags={'p'}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: assertion failed: !self.tags.contains(tag_name)

I have been able to reproduce this error with b, br, div, and img tags. I haven't tried any others. script and style tags work as expected.
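
The assertion `!self.tags.contains(tag_name)` suggests that ammonia requires `clean_content_tags` to be disjoint from the allowed `tags`. If that is the case, the panic would be explained by the failing tags being on the allow-list in effect when `tags` is not overridden. A sketch of keeping the two sets disjoint follows; `disjoint_tag_sets` is a hypothetical helper, not part of nh3.

```python
def disjoint_tag_sets(allowed_tags, content_tags):
    """Return (tags, clean_content_tags) such that no tag appears in both."""
    content = set(content_tags)
    # Assumption: a tag whose content is removed must not also be on the allow-list.
    tags = set(allowed_tags) - content
    return tags, content


tags, content = disjoint_tag_sets({"a", "b", "p"}, {"p", "script"})
print(sorted(tags), sorted(content))
```

Passing `nh3.ALLOWED_TAGS` as the first argument and forwarding the two results as `tags=` and `clean_content_tags=` would then avoid the overlap.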

RFE: is it possible to start making GitHub releases? 🤔

Would it be possible, the next time a new version is released, to also create a GitHub release so that an entry appears at https://github.com/messense/nh3/releases? 🤔
When a GitHub release is created, an email notification is sent to everyone who has set Watch->Releases for the repository in the web UI.
A GitHub release can contain additional comments (such as a changelog) or additional assets such as release tarballs (by default it contains only the assets from the git tag), but none of these are required.
In the simplest variant the GitHub release can be empty, because the subject of the notification email contains the git tag name.

I'm asking because my automation uses those email notifications to attempt preliminary automated upgrades of package builds, which saves some time on maintaining packaging procedures.
Other people may also be interested in being informed immediately about new releases.

Description:
https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository
https://github.com/marketplace/actions/github-release
https://pgjones.dev/blog/trusted-plublishing-2023/
jbms/sphinx-immaterial#281 (comment)

Access to the default URL schemes

It would be nice if the Python package could also expose ammonia’s default whitelisted URL schemes. If I understand the code correctly, this could easily be done by adding m.add("ALLOWED_URL_SCHEMES", a.clone_url_schemes())?; to the nh3 function.

Pylint false positive: no name in module

Is there a linter plugin to install, or is the module improperly configured for pylint? Functionally this code works perfectly, and intellisense can see that it's fine.

Allow frozenset in attributes parameter of clean function

Currently the expected type is dict[str, set[str]], and attempting to use a frozenset raises:

TypeError: argument 'attributes': 'frozenset' object cannot be converted to 'PySet'

IMO using frozenset is good practice: if the data is immutable, it is better to use an immutable type. For example, allowed attributes can be defined in configuration, where an immutable type is safer.
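
Until frozenset is accepted, one workaround sketch is to keep the configuration immutable and copy it into plain sets at the call site. `ALLOWED_ATTRIBUTES` and `to_mutable_attributes` are hypothetical names chosen for illustration.

```python
from types import MappingProxyType

# Immutable configuration: neither the mapping nor the attribute sets can be mutated.
ALLOWED_ATTRIBUTES = MappingProxyType({
    "a": frozenset({"href", "title"}),
    "img": frozenset({"src", "alt"}),
})


def to_mutable_attributes(attributes):
    """Copy an immutable attribute mapping into the dict[str, set[str]] shape."""
    return {tag: set(names) for tag, names in attributes.items()}


attributes = to_mutable_attributes(ALLOWED_ATTRIBUTES)
```

The copy is made per call, so the shared configuration stays read-only while `attributes` has the shape the `attributes` parameter currently expects.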

Feature: allow generic attribute prefixes, e.g. data-*

In the current implementation (v0.2.9), there isn't a way to allow all data-* attributes (or other generic attribute prefixes).

In the underlying ammonia, the builder allows for generic_attribute_prefixes to be specified, and uses the data- prefix as an example in the docs:
https://docs.rs/ammonia/latest/ammonia/struct.Builder.html#method.generic_attribute_prefixes

I am currently using bleach with an implementation that allows all data-* attributes, and I would like to switch to this library.
Having this ability would allow me to make the switch easily.

Please consider adding this feature.
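
To make the request concrete, the matching rule being asked for can be sketched in plain Python. This only illustrates the intended semantics (an attribute is kept if it is explicitly allowed or starts with an allowed prefix); it is not how nh3 or ammonia implement it, and `is_attribute_allowed` is a hypothetical name.

```python
def is_attribute_allowed(name, allowed, prefixes):
    """Return True if `name` is explicitly allowed or starts with an allowed prefix."""
    return name in allowed or any(name.startswith(prefix) for prefix in prefixes)


allowed = {"href", "title"}
prefixes = {"data-"}

for name in ("href", "data-user-id", "onclick"):
    print(name, is_attribute_allowed(name, allowed, prefixes))
```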

Using default values in .clean() method produces unexpected output

Reading your docs, it seems like nh3.clean(html=somestring) should remove all HTML tags present, because tags defaults to None. Instead, it removes <script> and <style> tags but doesn't remove any others. Using nh3 version 0.2.14 and Python 3.11.0.

import nh3

item = "<script>alert('hello')</script><style>hola</style><p>hello</p><b>hi</b>"
print("Output: ", nh3.clean(html=item))
#Output: <p>hello</p><b>hi</b>

I would guess that if everything is left at the default, it should remove all tags, since none were specified to keep. I also find it peculiar that it automatically removes script and style tags when that's not described as default behavior in the docs.

Edit to add that I just tested this, and it will remove gibberish tags. I don't understand why...

item = "<asdf>hi</ashgasf>"
print("Output: ", nh3.clean(html=item))
#Output: hi
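
The outputs above suggest that `tags=None` means "use the built-in default allow-list" rather than "allow nothing", which would also explain why unknown tags such as `<asdf>` are dropped. Assuming that is the case, stripping every tag would require passing an explicitly empty set. A minimal sketch (the helper name `strip_all_tags` is hypothetical):

```python
def strip_all_tags(html: str) -> str:
    """Remove all markup by passing an explicitly empty allow-list."""
    # Imported here so that nh3 is only required when the helper is called.
    import nh3

    # tags=set() allows no tags; tags=None falls back to the library defaults.
    return nh3.clean(html, tags=set())
```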

Would it be possible to disable adding of rel="noopener noreferrer"?

Adding rel="noopener noreferrer" to all <a> tags is not desired in our use case. Would it be possible to add a parameter to disable that?

If I read the ammonia source correctly, this should be achievable with Builder.link_rel(None), so it just needs a way to be exposed in the Python interface.

nh3 clean doesn't include html, head or body tags even when included in ALLOWED_TAGS

While using the nh3 library, we came across a use case where HTML content is expected for a field, but we need to remove any content that could enable an XSS attack. Using nh3.clean() directly on the input text doesn't give the expected result: a lot of useful data gets trimmed, which ends up modifying the HTML template input.

import nh3
text = '''
<!DOCTYPE html>
<html>
<head>
  <title>HTML Tutorial</title>
</head>
<body>
  <h1>This is a heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>
'''

nh3.ALLOWED_TAGS.add('title')
nh3.ALLOWED_TAGS.add('head')
nh3.ALLOWED_TAGS.add('html')
nh3.ALLOWED_TAGS.add('div')
nh3.ALLOWED_TAGS.add('body')

print(nh3.clean(text,tags=nh3.ALLOWED_TAGS,strip_comments=False))

Output: 
<title>HTML Tutorial</title>
 <h1>This is a heading</h1>
 <p>This is a paragraph.</p> 

We don't want the html, head, or body tags to be trimmed. Is there a limitation in the nh3 library that prevents these tags from being kept?
