Fix: Normalize plain text input across libxml versions #93

Open
GertjanRoke wants to merge 1 commit into ueberdosis:main from GertjanRoke:fix/plain-text-libxml-consistency

Conversation

@GertjanRoke

Summary

  • Fixes inconsistent DOMParser output when plain text (no HTML tags) is passed to setContent(), caused by differing DOMDocument::loadHTML() behavior between libxml 2.9.x and 2.10+
  • Detects plain text input via strip_tags() and wraps it in <p> before parsing, ensuring consistent paragraph-wrapped output regardless of libxml version
  • Adds tests for plain text input, special characters, and verifying HTML input is not double-wrapped

Fixes #90

Context

DOMDocument::loadHTML() delegates to libxml2's HTML parser, whose behavior changed in libxml 2.10.0 (HTML5-conformant tokenizer). On older libxml, bare text inside <body> is automatically wrapped in <p>, while newer libxml leaves it as a raw text node. This caused Editor->setContent('Hello world') to produce different Tiptap JSON depending on the server's libxml version.
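
To see which behavior a given server exhibits, a small diagnostic along these lines may help. It is only a sketch: `describeBareTextParsing()` is a hypothetical helper name, not part of this PR or of Tiptap, and it relies solely on PHP's built-in DOM extension and the `LIBXML_DOTTED_VERSION` constant.

```php
<?php
// Diagnostic sketch (hypothetical helper, not part of the PR): report the
// libxml version PHP is linked against and how loadHTML() represents bare
// text inside <body>.
function describeBareTextParsing(string $input): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($input);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);

    $childNames = [];
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            // "p" when the parser wrapped the text, "#text" when it did not.
            $childNames[] = $child->nodeName;
        }
    }

    return [
        'libxml'       => LIBXML_DOTTED_VERSION,
        'bodyChildren' => $childNames,
        'text'         => $body !== null ? $body->textContent : '',
    ];
}

print_r(describeBareTextParsing('Hello world'));
```

Comparing the `bodyChildren` entry between two environments should show whether they disagree on paragraph wrapping.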

The fix is minimal: before calling loadHTML(), use strip_tags() to check whether the input contains any HTML tags. If it does not, wrap it in <p> tags so the output matches what Tiptap's browser editor produces.
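
The check described above could look roughly like the following. This is a minimal sketch, not the actual diff: the function name `normalizePlainText()` is hypothetical, and comparing the input with its `strip_tags()` result is one assumed way of detecting "no tags".

```php
<?php
// Sketch of the normalization step (hypothetical name, assumed approach):
// wrap input in <p> only when it contains no HTML tags at all.
function normalizePlainText(string $value): string
{
    // If strip_tags() leaves the string unchanged, there was no markup in it.
    // Empty input is returned untouched rather than becoming "<p></p>".
    if ($value !== '' && strip_tags($value) === $value) {
        return '<p>' . $value . '</p>';
    }

    // Anything that already contains tags is passed through as-is,
    // so existing paragraphs are not double-wrapped.
    return $value;
}
```

With this shape, `normalizePlainText('Hello world')` yields `<p>Hello world</p>`, while `normalizePlainText('<p>Hello world</p>')` is returned unchanged.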

Test plan

  • New test: plain text 'Hello world' produces paragraph-wrapped JSON
  • New test: plain text with special characters (&, <) is handled correctly
  • New test: HTML input '<p>Hello world</p>' is not double-wrapped
  • Full test suite passes (189/189, excluding pre-existing unrelated Shiki failure)
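
The first and third cases in the plan can be approximated outside the project's test suite with plain DOMDocument calls. The sketch below is an assumption-laden stand-in: `wrapIfPlainText()` and `paragraphTexts()` are hypothetical helpers, and the real tests exercise the Tiptap editor rather than DOMDocument directly.

```php
<?php
// Hypothetical helpers for illustration; not the project's actual tests.
function wrapIfPlainText(string $input): string
{
    // Assumed detection: no change after strip_tags() means no tags present.
    return ($input !== '' && strip_tags($input) === $input)
        ? '<p>' . $input . '</p>'
        : $input;
}

// Parse the (possibly wrapped) input and return the text of every <p> found.
function paragraphTexts(string $input): array
{
    $doc = new DOMDocument();

    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(wrapIfPlainText($input));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        $texts[] = $paragraph->textContent;
    }

    return $texts;
}

// Plain text and pre-wrapped HTML should both end up as exactly one paragraph.
var_dump(paragraphTexts('Hello world') === paragraphTexts('<p>Hello world</p>'));
```

Because the wrapping happens before parsing, the result should not depend on whether the installed libxml adds a paragraph on its own.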

🤖 Generated with Claude Code

…ml versions

DOMDocument::loadHTML() behaves differently across libxml versions when
parsing plain text (no HTML tags): libxml 2.9.x wraps bare text in <p>
tags while libxml 2.10+ does not. This caused setContent() to produce
inconsistent JSON output depending on the server environment.

Detect plain text input (no HTML tags) via strip_tags() and wrap it in
<p> before passing to loadHTML(), ensuring consistent paragraph-wrapped
output regardless of the underlying libxml version.

Fixes ueberdosis#90

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

[Bug]: DOMParser produces different output for plain text input depending on libxml version
