Skip to content

Fix XXE vulnerability in DOCX preprocessing (#1565)#1582

Open
KevinChen1994 wants to merge 2 commits intomicrosoft:mainfrom
KevinChen1994:fix/xxe-vulnerability-docx-preprocessing
Open

Fix XXE vulnerability in DOCX preprocessing (#1565)#1582
KevinChen1994 wants to merge 2 commits intomicrosoft:mainfrom
KevinChen1994:fix/xxe-vulnerability-docx-preprocessing

Conversation

@KevinChen1994
Copy link
Copy Markdown

Summary

  • Replace unsafe xml.etree.ElementTree with defusedxml.ElementTree in
    converter_utils/docx/pre_process.py to prevent XXE (XML External Entity)
    injection attacks
  • The ET.fromstring() call parses XML content extracted from user-supplied
    DOCX files, making it vulnerable to XXE injection and Billion Laughs
    (exponential entity expansion) attacks
  • defusedxml is already a core dependency of the project and is used
    consistently in other converters (omml.py, _rss_converter.py,
    _epub_converter.py)

Fixes #1565

Test plan

  • Verified import works correctly
  • All 109 module vector tests pass
  • All 84 remaining tests pass (CLI, PDF tables, misc)
  • DOCX conversion with math equations (OMML → LaTeX) works correctly
  • Maintainers may want to add a test with a crafted XXE payload to verify the fix blocks the attack

Replace unsafe `xml.etree.ElementTree` with `defusedxml.ElementTree` in
`converter_utils/docx/pre_process.py`. The `ET.fromstring()` call parses
XML content extracted from user-supplied DOCX files, which is vulnerable
to XML External Entity (XXE) injection and Billion Laughs (exponential
entity expansion) attacks.

`defusedxml` is already a core dependency of the project and is used in
other converters (omml.py, _rss_converter.py, _epub_converter.py). This
change makes the DOCX preprocessor consistent with the rest of the
codebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@plowsec
Copy link
Copy Markdown

plowsec commented Mar 25, 2026

Likely a false positive, I wasn't able to reproduce.

if you put a breakpoint here: markitdown/src/markitdown/converter_utils/docx/pre_process.py:110 you never get a oMath tag you just get <!DOCTYPE w:document>.

besides, in markitdown/src/markitdown/converter_utils/docx/pre_process.py:44 the alleged attacker-controlled payload is wrapped by a root element and any payload within would trigger a ParseError.

So.....PoC or GTFO haha

@KevinChen1994
Copy link
Copy Markdown
Author

▎ You're correct that a DOCTYPE cannot be injected at line 44 via MATH_ROOT_TEMPLATE.format(str(tag)) — the template wraps content inside the root
element, making entity definitions structurally impossible there. I concede that point.

▎ However, the real attack surface is line 110 (BeautifulSoup(content.decode(), features='xml')). The attacker controls the full word/document.xml
including its DOCTYPE. With lxml ≤ 4.6 (still widely installed, and unpinned in our pyproject.toml), lxml resolves file:// SYSTEM entities by
default. I'll expand the PR to also harden this parsing step with XMLParser(resolve_entities=False, no_network=True).

▎ As for line 44: while not directly exploitable today, xml.etree does expand Billion Laughs internal entities (demonstrably; defusedxml blocks
them). Keeping defusedxml there is valid defense-in-depth per CWE-776, and protects against any future refactor that changes how str(tag) is
constructed.

▎ I'll update the PR to address both layers.

- Sanitize XML via lxml XMLParser(resolve_entities=False, no_network=True)
  before passing to BeautifulSoup, blocking external entity resolution
  and network access at the parsing layer (line 110)
- Keep defusedxml at the ET.fromstring call (line 44) as defense-in-depth
- Add regression tests for XXE file exfiltration and Billion Laughs DoS

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@KevinChen1994
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Security: XXE vulnerability in DOCX pre-processor (ET.fromstring on untrusted input)

2 participants