Replace bleach with nh3 for HTML sanitization #163
hoheinzollern wants to merge 15 commits into torchbox:main from
Currently the CI is failing with contradictory lint messages: in two places where dict() and dict comprehensions are used, one rule asks me to do the opposite of the other, so I'm unsure how to fix it.
Ruff fails with C402 (https://docs.astral.sh/ruff/rules/unnecessary-generator-dict/) and C416 (https://docs.astral.sh/ruff/rules/unnecessary-comprehension/)
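For context, a minimal sketch of how the two rules differ, assuming they fire as their documentation describes: unnecessary-generator-dict asks for a dict comprehension in place of dict() over a generator, while unnecessary-comprehension asks for dict() in place of a comprehension that only copies its input. The data below is made up for illustration and is not taken from the PR.

```python
# Illustrative data only; not taken from wagtail-markdown.
pairs = [("a", ["href", "title"]), ("img", ["src", "alt"])]

# unnecessary-generator-dict: dict() around a generator is flagged,
# and a dict comprehension is preferred.
flagged_generator = dict((tag, set(attrs)) for tag, attrs in pairs)
preferred_comprehension = {tag: set(attrs) for tag, attrs in pairs}

# unnecessary-comprehension: a comprehension that only copies its
# input is flagged, and a plain dict() call is preferred.
flagged_copy = {tag: attrs for tag, attrs in pairs}
preferred_dict_call = dict(pairs)
```

Seen this way the two rules need not conflict: a comprehension that transforms its values (here, list to set) stays a comprehension, while one that merely copies key/value pairs becomes a dict() call.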
Ok, I'm not sure precisely how to address this. I tried a different approach where all the logic that converts lists to sets (seems necessary with the new library) is localized in the function |
OK, the CI passes now, so that seems to be addressed. One critical point is the requirement to use sets. I chose to do the conversion internally, but it would be equally reasonable to pass the interface change on to the users of
src/wagtailmarkdown/utils.py
Outdated
nh3_kwargs["tags"] = set(nh3_kwargs["tags"])
nh3_kwargs["attributes"] = {
    key: set(value) for key, value in nh3_kwargs["attributes"].items()
}
nh3_kwargs["filter_style_properties"] = set(nh3_kwargs["filter_style_properties"])
IMHO, these belong in _get_nh3_kwargs
I could, but it breaks the tests that rely on _get_nh3_kwargs. Shall I update them?
Another option is to remove these conversions altogether and present the breaking changes directly to the users, so those who customize these default settings will have to update their app.
> Another option is to remove these conversions altogether and present the breaking changes directly to the users, so those who customize these default settings will have to update their app.
The proper way would be to add a deprecation warning in the next release, then remove the conversions in the following one.
At the same time, I do like to be a bit more guarded and prevent users from shooting themselves in the foot. So the options are either what this code currently does, or adding a system check or similar.
> I could, but it breaks the tests that rely on _get_nh3_kwargs. Shall I update them?
If it is not too much to ask, that would be nice, thank you.
I moved the coercion outside as discussed, but I will leave the deprecation warnings etc. to you, I hope that's OK.
Maybe worth mentioning that styles has been renamed to filter_style_properties in nh3, and this is also a user-facing breaking change.
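To make the options in this thread concrete, here is a rough sketch of what a coercion helper with a deprecation path could look like. It is not the code from this PR: the function name, the handling of a legacy styles key, and the warning text are all assumptions for illustration.

```python
import warnings


def coerce_nh3_kwargs_types(kwargs):
    """Return a copy of kwargs with list values converted to sets.

    Hypothetical helper: the name and the legacy "styles" handling
    are assumptions, not the PR's actual implementation.
    """
    coerced = dict(kwargs)

    # Assumed deprecation path: accept the old bleach-style "styles"
    # key, warn, and map it to "filter_style_properties".
    if "styles" in coerced:
        warnings.warn(
            "'styles' is deprecated; use 'filter_style_properties' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        legacy_styles = coerced.pop("styles")
        coerced.setdefault("filter_style_properties", legacy_styles)

    # Flat collections become sets.
    for key in ("tags", "filter_style_properties"):
        if key in coerced:
            coerced[key] = set(coerced[key])

    # Per-tag attribute lists become a dict of sets.
    if "attributes" in coerced:
        coerced["attributes"] = {
            tag: set(attrs) for tag, attrs in coerced["attributes"].items()
        }
    return coerced
```

Users who still pass lists, or the old styles name, would keep working for a release while seeing a warning, which is one way to stage the breaking change discussed above.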
"<a>anchor tag</a> <script>alert('boom!')</script></p>"
'text with a <a href="https://example.com" rel="noopener noreferrer">link</a>\n'
"and some disallowed tag and attributes: italic, "
'<a rel="noopener noreferrer">anchor tag</a> </p>'
While I agree with noopener noreferrer, not everyone will. I think we should leave this to the user, but default to the nh3 defaults.
This is just the default nh3 behaviour; I only updated the tests to match the output. Should we change that? A quick search revealed this related issue: messense/nh3#8
Good search-fu! Let's do that as it preserves the current behaviour.
We could follow it up by making it configurable, which would give that choice to end users.
Done and documented, should be ready for review.
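A minimal sketch of the outcome discussed here, assuming the fix was to pass link_rel=None to nh3.clean() by default while still letting users opt in. The defaults dictionary, the helper name, and the override mechanism are hypothetical and only illustrate the idea.

```python
# Hypothetical defaults; names and values are for illustration only.
DEFAULT_NH3_KWARGS = {
    "tags": {"p", "a", "em", "strong"},
    "attributes": {"a": {"href", "title"}},
    # nh3.clean() defaults link_rel to "noopener noreferrer"; None turns
    # the automatic rel attribute off, preserving the earlier output.
    "link_rel": None,
}


def get_nh3_kwargs(overrides=None):
    """Merge user overrides onto the defaults without mutating either."""
    kwargs = dict(DEFAULT_NH3_KWARGS)
    kwargs.update(overrides or {})
    return kwargs


# Intended use (requires nh3): nh3.clean(html, **get_nh3_kwargs())
default_kwargs = get_nh3_kwargs()
opt_in_kwargs = get_nh3_kwargs({"link_rel": "noopener noreferrer"})
```

Keeping None as the default preserves existing rendered output, and a user who wants the rel attribute added can restore nh3's own default through an override.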
This commit replaces the bleach library with nh3 for HTML sanitization, providing significant performance improvements while maintaining the same security guarantees.
Key changes:
- Updated dependency from bleach to nh3
- Modified constants to use sets instead of lists (nh3 requirement)
- Updated configuration functions to work with nh3's API
- Adjusted tests to work with the new data structures
- Removed 'rel' from allowed attributes (nh3 handles this automatically)
- Updated documentation to reflect the change
The migration is transparent to users: all existing configuration options continue to work exactly as before, but with improved performance.
- from _sanitise_markdown_html to _coerce_nh3_kwargs_types, called by _get_nh3_kwargs
- updated tests to match
This PR replaces bleach with nh3 for HTML sanitization, providing significant performance improvements while maintaining the same security guarantees.
Key Changes
Technical Details
Removed rel from allowed attributes (nh3 handles this automatically for security)
Testing
rel attribute to links with no option to remove it)
Migration Guide
This change is mostly transparent to users. No code changes are required in projects using wagtail-markdown. Visible changes will be improved performance and the loss of the rel attribute.
References