Large HTML files (>3MB) silently return unconverted HTML #1636

@gulleroglu

Description

When passing large HTML files (>3MB) to MarkItDown, the conversion silently fails and returns the input HTML unchanged instead of markdown. No error is raised.

Reproduction

import io

import httpx
from markitdown import MarkItDown

# Fetch Tesla's DEF 14A proxy statement from SEC EDGAR (~4.1MB HTML)
url = "https://www.sec.gov/Archives/edgar/data/1318605/000110465925090866/tm2420438-5_def14a.htm"
resp = httpx.get(url, headers={"User-Agent": "test@example.com"})
html = resp.text  # ~4.7MB

md_converter = MarkItDown()
result = md_converter.convert_stream(
    io.BytesIO(html.encode("utf-8")), file_extension=".html"
)
output = result.text_content

print(f"Input length:  {len(html)}")
print(f"Output length: {len(output)}")
print(f"Contains HTML: {'<div' in output[:5000]}")

Expected:

  • Output is markdown text, significantly smaller than input
  • HTML tags are stripped

Actual:

  • Output is nearly identical to input (4.1MB → 4.1MB)
  • Output still contains raw HTML tags (<div>, <span>, inline CSS)
  • No error or warning raised

Comparison with smaller file

A 2MB HTML file from the same source (Paychex DEF 14A) converts correctly:

  • Input: 2,056,755 bytes HTML
  • Output: 319,767 bytes markdown
  • No HTML tags in output

Environment

  • markitdown version: latest (pip install)
  • Python 3.13
  • macOS

Impact

This makes it impossible to reliably convert large SEC filings (proxy statements, annual reports) to markdown. The silent failure is particularly problematic because nothing indicates that the conversion did not work: the caller has to inspect the output for HTML tags to detect the failure.

A warning or exception when conversion fails would be preferable to silently returning the input.
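
Until the library reports this itself, a caller-side guard along the following lines may help. This is only a sketch of the manual check described above, not part of MarkItDown: the names `looks_like_unconverted_html`, `check_conversion`, and `ConversionLooksFailedError`, as well as the thresholds, are hypothetical and chosen for illustration.

```python
import re

# Heuristic pattern for common HTML opening tags (an assumption, not exhaustive).
_HTML_TAG_RE = re.compile(
    r"<(?:div|span|p|table|tr|td|html|body)\b[^>]*>", re.IGNORECASE
)


class ConversionLooksFailedError(RuntimeError):
    """Hypothetical error raised when the output still looks like raw HTML."""


def looks_like_unconverted_html(text, sample_size=5000, max_tags=5):
    """Return True if the first `sample_size` characters contain many HTML tags."""
    sample = text[:sample_size]
    return len(_HTML_TAG_RE.findall(sample)) > max_tags


def check_conversion(html, output, min_ratio=0.9):
    """Raise if `output` is tag-heavy and nearly as long as the input `html`."""
    ratio = len(output) / max(len(html), 1)
    if looks_like_unconverted_html(output) and ratio > min_ratio:
        raise ConversionLooksFailedError(
            f"output is {ratio:.0%} of the input size and still contains HTML tags"
        )
    return output
```

With the reproduction above, `check_conversion(html, result.text_content)` would be expected to raise for the 4.1MB file and pass the Paychex output through, since 319,767 / 2,056,755 is roughly 16% of the input size. Both conditions are required so that markdown containing a few legitimate inline HTML tags is not flagged.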
