Description
When a large HTML file (>3MB) is passed to MarkItDown, the conversion silently fails: the result is the input HTML essentially unchanged instead of markdown, and no error is raised.
Reproduction
import io
import httpx
from markitdown import MarkItDown
# Fetch Tesla's DEF 14A proxy statement from SEC EDGAR (~4.1MB HTML)
url = "https://www.sec.gov/Archives/edgar/data/1318605/000110465925090866/tm2420438-5_def14a.htm"
resp = httpx.get(url, headers={"User-Agent": "test@example.com"})
html = resp.text  # ~4.1MB
md_converter = MarkItDown()
result = md_converter.convert_stream(
    io.BytesIO(html.encode("utf-8")), file_extension=".html"
)
output = result.text_content
print(f"Input length: {len(html)}")
print(f"Output length: {len(output)}")
print(f"Contains HTML: {'<div' in output[:5000]}")
Expected:
- Output is markdown text, significantly smaller than input
- HTML tags are stripped
Actual:
- Output is nearly identical to input (4.1MB → 4.1MB)
- Output still contains raw HTML tags (<div>, <span>, inline CSS)
- No error or warning raised
Comparison with smaller file
A 2MB HTML file from the same source (Paychex DEF 14A) converts correctly:
- Input: 2,056,755 bytes HTML
- Output: 319,767 bytes markdown
- No HTML tags in output
Environment
- markitdown version: latest (pip install)
- Python 3.13
- macOS
Impact
This makes it impossible to reliably convert large SEC filings (proxy statements, annual reports) to markdown. The silent failure is particularly problematic since there's no indication the conversion didn't work — the caller has to check the output for HTML tags to detect the failure.
A warning or exception when conversion fails would be preferable to silently returning the input.
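Until something like that exists, a caller-side guard is one possible workaround. The sketch below is only a heuristic, not part of markitdown: the names `looks_like_unconverted_html`, `convert_or_raise`, and `ConversionLooksFailedError`, the tag list, and both thresholds are assumptions chosen for illustration. It flags output that is nearly as large as the input and still dense with HTML tags, which matches the failure described above.

```python
import re
from typing import Callable

# Hypothetical helper, not a markitdown API. The tag list and thresholds are guesses.
_TAG_RE = re.compile(r"</?(?:div|span|table|tr|td|p|font)\b[^>]*>", re.IGNORECASE)


class ConversionLooksFailedError(RuntimeError):
    """Raised when converter output still looks like raw HTML."""


def looks_like_unconverted_html(
    source_html: str,
    output: str,
    min_size_ratio: float = 0.9,
    min_tags_per_kb: float = 1.0,
) -> bool:
    """Return True if output is about as large as the input AND dense with tags."""
    if not source_html or not output:
        return False
    size_ratio = len(output) / len(source_html)
    sample = output[:100_000]  # the start of the output is enough for a density estimate
    tags_per_kb = len(_TAG_RE.findall(sample)) / (len(sample) / 1024)
    return size_ratio >= min_size_ratio and tags_per_kb >= min_tags_per_kb


def convert_or_raise(source_html: str, convert: Callable[[str], str]) -> str:
    """Run any HTML-to-markdown callable and raise if the result looks unconverted."""
    output = convert(source_html)
    if looks_like_unconverted_html(source_html, output):
        raise ConversionLooksFailedError(
            f"output is {len(output)} chars for {len(source_html)} chars of input "
            "and still contains dense HTML markup"
        )
    return output


# Possible wiring to the call used in the reproduction above:
#   import io
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   convert = lambda h: md.convert_stream(
#       io.BytesIO(h.encode("utf-8")), file_extension=".html"
#   ).text_content
#   markdown = convert_or_raise(html, convert)
```

Requiring both conditions keeps the check from firing on small, text-heavy pages where markdown output is legitimately close to the input size. For the Paychex file above the size ratio is roughly 0.16, well under the assumed 0.9 cutoff.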