Fix XXE vulnerability in DOCX preprocessing (#1565)#1582
Fix XXE vulnerability in DOCX preprocessing (#1565)#1582KevinChen1994 wants to merge 2 commits intomicrosoft:mainfrom
Conversation
Replace unsafe `xml.etree.ElementTree` with `defusedxml.ElementTree` in `converter_utils/docx/pre_process.py`. The `ET.fromstring()` call parses XML content extracted from user-supplied DOCX files, which is vulnerable to XML External Entity (XXE) injection and Billion Laughs (exponential entity expansion) attacks. `defusedxml` is already a core dependency of the project and is used in other converters (omml.py, _rss_converter.py, _epub_converter.py). This change makes the DOCX preprocessor consistent with the rest of the codebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Likely a false positive, I wasn't able to reproduce. if you put a breakpoint here: besides, in So.....PoC or GTFO haha |
|
▎ You're correct that a DOCTYPE cannot be injected at line 44 via MATH_ROOT_TEMPLATE.format(str(tag)) — the template wraps content inside the root ▎ However, the real attack surface is line 110 (BeautifulSoup(content.decode(), features='xml')). The attacker controls the full word/document.xml ▎ As for line 44: while not directly exploitable today, xml.etree does expand Billion Laughs internal entities (demonstrably; defusedxml blocks ▎ I'll update the PR to address both layers. |
- Sanitize XML via lxml XMLParser(resolve_entities=False, no_network=True) before passing to BeautifulSoup, blocking external entity resolution and network access at the parsing layer (line 110) - Keep defusedxml at the ET.fromstring call (line 44) as defense-in-depth - Add regression tests for XXE file exfiltration and Billion Laughs DoS Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@microsoft-github-policy-service agree |
Summary
xml.etree.ElementTreewithdefusedxml.ElementTreeinconverter_utils/docx/pre_process.pyto prevent XXE (XML External Entity)injection attacks
ET.fromstring()call parses XML content extracted from user-suppliedDOCX files, making it vulnerable to XXE injection and Billion Laughs
(exponential entity expansion) attacks
defusedxmlis already a core dependency of the project and is usedconsistently in other converters (
omml.py,_rss_converter.py,_epub_converter.py)Fixes #1565
Test plan