Skip to content

Replace bleach with nh3 for HTML sanitization#163

Open
hoheinzollern wants to merge 15 commits intotorchbox:mainfrom
hoheinzollern:main
Open

Replace bleach with nh3 for HTML sanitization#163
hoheinzollern wants to merge 15 commits intotorchbox:mainfrom
hoheinzollern:main

Conversation

@hoheinzollern
Copy link
Copy Markdown

@hoheinzollern hoheinzollern commented Mar 14, 2026

This PR replaces bleach with nh3 for HTML sanitization, providing significant performance improvements while maintaining the same security guarantees.

Key Changes

  • API Compatibility: All existing configuration options continue to work exactly as before
  • Performance: nh3 is significantly faster than bleach (benchmarks show ~20x improvement)
  • Security: Maintains the same security guarantees with automatic attribute handling

Technical Details

  • Updated constants to use sets instead of lists (nh3 requirement)
  • Modified configuration functions to work with nh3's API
  • Removed rel from allowed attributes (nh3 handles this automatically for security)
  • Updated all tests to work with the new data structures

Testing

  • All existing tests pass (note the necessary upgrade, nh3 adds the rel attribute to links with no option to remove it)
  • Basic functionality verified (markdown rendering, HTML sanitization)

Migration Guide

This change is mostly transparent to users. No code changes are required in projects using wagtail-markdown. Visible changes will be improved performance and the loss of rel attribute.

References

@hoheinzollern
Copy link
Copy Markdown
Author

Currently the CI is failing with weird messages (one wants me to do the opposite of the other, in two instances where dict() and dict comprehensions are used) so I'm unsure how to fix it.

@zerolab
Copy link
Copy Markdown
Member

zerolab commented Mar 16, 2026

Currently the CI is failing with weird messages (one wants me to do the opposite of the other, in two instances where dict() and dict comprehensions are used) so I'm unsure how to fix it.

Ruff fails with https://docs.astral.sh/ruff/rules/unnecessary-generator-dict/ and https://docs.astral.sh/ruff/rules/unnecessary-comprehension/

@hoheinzollern
Copy link
Copy Markdown
Author

Ruff fails with https://docs.astral.sh/ruff/rules/unnecessary-generator-dict/ and https://docs.astral.sh/ruff/rules/unnecessary-comprehension/

Ok, I'm not sure precisely how to address this.

I tried a different approach where all the logic that converts lists to sets (seems necessary with the new library) is localized in the function _sanitise_markdown_html.

@hoheinzollern
Copy link
Copy Markdown
Author

OK, looks like the CI passes now, seems like it's addressed. One critical bit is the requirement to use sets. I've made the choice to do the conversion, but it's equally adequate to pass on the change of interface to the users of wagtail-markdown.
My reason for sending this patch is that I am relying on puput which depends on this project. My own pipeline was complaining about the outdated bleach package, so I figured it's an easy enough fix to switch to nh3 instead of just upgrading a deprecated package to a newer version.

Comment on lines +36 to +40
nh3_kwargs["tags"] = set(nh3_kwargs["tags"])
nh3_kwargs["attributes"] = {
key: set(value) for key, value in nh3_kwargs["attributes"].items()
}
nh3_kwargs["filter_style_properties"] = set(nh3_kwargs["filter_style_properties"])
Copy link
Copy Markdown
Member

@zerolab zerolab Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, these belongs in _get_nh3_kwargs

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could but it breaks the tests that rely on _get_nhs3_kwargs, shall I update them?
Another option is to remove these conversions altogether and present the breaking changes directly to the users, so those who customize these default settings will have to update their app.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to remove these conversions altogether and present the breaking changes directly to the users, so those who customize these default settings will have to update their app.

The proper way would be to add a deprecation warning in the next release, then remove in the following.
At the same time, I do like to be a bit more guarded and prevent users from shooting themselves in the foot. So that is either what this code currentlyu does, or add a system check or similar.

I could but it breaks the tests that rely on _get_nhs3_kwargs, shall I update them?

if it is not too much to ask, that would be nice, thank you

Copy link
Copy Markdown
Author

@hoheinzollern hoheinzollern Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the coercion outside as discussed, but I will leave the deprecation warnings etc. to you, I hope that's OK.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth mentioning that styles has been renamed to filter_style_properties in nh3, and this is also a user-facing breaking change.

"<a>anchor tag</a> &lt;script&gt;alert('boom!')&lt;/script&gt;</p>"
'text with a <a href="https://example.com" rel="noopener noreferrer">link</a>\n'
"and some disallowed tag and attributes: italic, "
'<a rel="noopener noreferrer">anchor tag</a> </p>'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree to noopener noreferrer, not everyone will. I think we should leave this to the user, but default to the nh3 defaults.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just the default nh3 behaviour, I've just updated the tests to match the output. Should we change that? A quick search revealed this related issue: messense/nh3#8

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good search-fu! Let's do that as it preserves the current behaviour.

We could follow it up to make it configurable, and that gives that choice to end-users

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and documented, should be ready for review.

This commit replaces the bleach library with nh3 for HTML sanitization,
providing significant performance improvements while maintaining the same
security guarantees.

Key changes:
- Updated dependency from bleach to nh3
- Modified constants to use sets instead of lists (nh3 requirement)
- Updated configuration functions to work with nh3's API
- Adjusted tests to work with the new data structures
- Removed 'rel' from allowed attributes (nh3 handles this automatically)
- Updated documentation to reflect the change

The migration is transparent to users - all existing configuration
options continue to work exactly as before, but with improved performance.
- from _sanitise_markdown_html to _coerce_nh3_kwargs_types, called by _get_nh3_kwargs
- updated tests to match
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants