fix: preserve HTML tables in Outlook .msg conversion #1673

Open
octo-patch wants to merge 1 commit into microsoft:main from octo-patch:fix/issue-1567-outlook-msg-html-tables
Conversation

@octo-patch
Fixes #1567

Problem

The OutlookMsgConverter reads only the plain text body and discards the HTML body. When a .msg file contains HTML tables, the plain text body carries no table markup, so the converted output loses the table structure.

Solution

Prefer the HTML body (PR_BODY_HTML) when it exists:

  1. Try Unicode HTML stream __substg1.0_1013001F first
  2. Fall back to binary HTML stream __substg1.0_10130102
  3. Convert HTML to markdown via BeautifulSoup + _CustomMarkdownify (same as HtmlConverter)
  4. Fall back to plain text body if no HTML body is present
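The fallback order above can be sketched as a small selection function. This is a minimal illustration, not the code in this PR: `pick_body` is a hypothetical helper, the streams are modeled as a plain dict of name to bytes rather than an OLE file, the plain text stream name `__substg1.0_1000001F` is assumed, and decoding the binary HTML stream as UTF-8 is an assumption (the real encoding may depend on the message's code page).

```python
from typing import Mapping, Optional, Tuple

# HTML stream names are taken from the list above.
HTML_UNICODE = "__substg1.0_1013001F"
HTML_BINARY = "__substg1.0_10130102"
# Assumed name of the Unicode plain text body stream (not stated in this PR).
PLAIN_UNICODE = "__substg1.0_1000001F"


def pick_body(streams: Mapping[str, bytes]) -> Tuple[str, Optional[str]]:
    """Return (kind, text), where kind is "html", "plain", or "none".

    Hypothetical helper: `streams` stands in for the .msg stream lookup.
    """
    # 1. Prefer the Unicode HTML stream (UTF-16LE).
    raw = streams.get(HTML_UNICODE)
    if raw:
        return "html", raw.decode("utf-16-le", errors="replace")

    # 2. Fall back to the binary HTML stream (UTF-8 assumed here).
    raw = streams.get(HTML_BINARY)
    if raw:
        return "html", raw.decode("utf-8", errors="replace")

    # 3. No HTML body present: use the plain text body.
    raw = streams.get(PLAIN_UNICODE)
    if raw:
        return "plain", raw.decode("utf-16-le", errors="replace")

    return "none", None
```

When the result kind is `"html"`, the text would then go through the HTML-to-markdown step (step 3 in the list); a `"plain"` result is used as-is.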

No new dependencies introduced.

Testing

Tested with a .msg file containing HTML tables. Before: unformatted plain text. After: proper markdown tables.
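A check like the one described could be automated along these lines. This is only a sketch: `has_markdown_table` is a hypothetical helper, not part of the test suite, and the sample strings are illustrative stand-ins for converter output, not results from the actual test file.

```python
import re

# Matches a markdown table divider row such as "| --- | :---: |".
_DIVIDER = re.compile(r"^\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?$")


def has_markdown_table(text: str) -> bool:
    """True if a pipe-delimited header row is directly followed by a divider row."""
    lines = [ln.strip() for ln in text.splitlines()]
    for header, divider in zip(lines, lines[1:]):
        if "|" in header and "|" in divider and _DIVIDER.match(divider):
            return True
    return False


# Illustrative stand-ins for converter output before and after the change.
before = "Name Qty\nApple 3"
after = "| Name | Qty |\n| --- | --- |\n| Apple | 3 |"
```

With these samples, `has_markdown_table(after)` holds and `has_markdown_table(before)` does not.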

…#1567)

When a .msg file contains an HTML body (PR_BODY_HTML), prefer it over
the plain text body so that tables and other HTML formatting are
converted to proper markdown instead of being stripped.

- Try Unicode HTML stream (__substg1.0_1013001F) first
- Fall back to binary HTML stream (__substg1.0_10130102)
- Convert HTML to markdown via BeautifulSoup + _CustomMarkdownify
- Fall back to plain text body if no HTML body is present

Development

Successfully merging this pull request may close these issues.

[outlook]: HTML Tables in outlook files

1 participant