Fix CJK emphasis delimiter detection in scanDelims #387
awoni wants to merge 1 commit into executablebooks:master from
Treat CJK ideographs (codepoints > 0x2E7F) as punctuation in flanking delimiter checks so that emphasis markers work correctly in mixed CJK/ASCII contexts (e.g. 湾岸の**46%**を now renders bold correctly). https://claude.ai/code/session_01RzF12DUuSS8R6Zw1NKwv9o
PR #387 Review: Fix CJK emphasis delimiter detection in scanDelims
What it does
The PR modifies
Concerns
1. Deviation from CommonMark spec (not yet adopted upstream)
The CommonMark spec defines emphasis flanking using Unicode categories
Merging this makes markdown-it-py diverge from both the spec and its upstream JS reference implementation.
2. Overly broad heuristic
The check
By contrast, the JS markdown-it PR #1145 uses a targeted helper checking only CJK Unified Ideographs (U+4E00–9FFF) and Hangul Syllables (U+AC00–D7A3), and was still asked to add Hiragana/Katakana.
3. No tests committed
4. Semantic mismatch in the fix
The comment says "Treat CJK ideographs as punctuation" but the effect is more subtle: by marking CJK as punctuation, the fix changes the flanking formula's boundary condition. The upstream JS approach is cleaner: it adds
Comparison with other parsers
Recommendation
This PR addresses a real user-facing problem but the approach has issues:
Summary
Fix emphasis delimiter detection for mixed CJK/ASCII text.
In CommonMark's `scanDelims`, when an emphasis closing marker `**` is preceded by ASCII punctuation (e.g. `%`) and followed by a CJK character, `right_flanking` evaluates to `False`, causing bold to silently fail.
Example:
`湾岸の**46%**を` renders as plain text instead of `<strong>46%</strong>`.
Root Cause
CJK characters (e.g. `を`, `の`, `は`) are not classified as whitespace or punctuation by the spec. When `lastChar='%'` (punctuation) and `nextChar='を'` (neither whitespace nor punctuation):
`right_flanking = not (isLastPunctChar and not (isNextWhiteSpace or isNextPunctChar))`
evaluates to `False`, so the closing `**` is rejected.
Fix
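For reference, the failing check from Root Cause can be reproduced outside the library. The following is a minimal standalone sketch, not the code in `state_inline.py`: the helper names `is_whitespace`, `is_punct`, and `right_flanking` are hypothetical, and the punctuation test (Unicode general categories starting with `P` or `S`) is an assumed approximation of the library's own classification. The formula follows the CommonMark definition of a right-flanking delimiter run, which also includes the "not preceded by whitespace" term that the shortened expression above leaves out.

```python
import unicodedata


def is_whitespace(ch: str) -> bool:
    # Assumption: str.isspace() is close enough to "Unicode whitespace" here.
    return ch.isspace()


def is_punct(ch: str) -> bool:
    # Assumption: general categories P* and S* stand in for the
    # library's punctuation test.
    return unicodedata.category(ch)[0] in ("P", "S")


def right_flanking(last_char: str, next_char: str) -> bool:
    """Can a delimiter run between last_char and next_char close emphasis?"""
    last_ws = is_whitespace(last_char)
    last_punct = is_punct(last_char)
    next_ws = is_whitespace(next_char)
    next_punct = is_punct(next_char)
    return not (last_ws or (last_punct and not (next_ws or next_punct)))


# Closing ** in 湾岸の**46%**を: preceded by '%', followed by 'を'.
print(right_flanking("%", "を"))  # False: the closing marker is rejected
print(right_flanking("%", " "))   # True: whitespace after the marker is fine
```

Because `を` is a letter (category `Lo`), neither escape hatch in the second term applies, which is why the closing `**` is dropped.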
Treat characters above U+2E7F as punctuation for flanking delimiter checks (6 lines added to `scanDelims` in `state_inline.py`). This covers CJK Unified Ideographs, Hiragana, Katakana, Hangul, and other East Asian scripts.
Test Cases
All pass. Existing English-only behavior is unchanged.
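The kind of before/after check this implies can be sketched with a small standalone model. This is an illustration under stated assumptions, not the PR's test suite: `can_close` and `CJK_THRESHOLD` are hypothetical names, the threshold is applied to both the preceding and the following character (the description does not say which), and punctuation is approximated by Unicode categories `P*` and `S*`.

```python
import unicodedata

CJK_THRESHOLD = 0x2E7F  # value taken from the PR description


def is_punct(ch: str, cjk_as_punct: bool) -> bool:
    # With the flag on, any code point above the threshold counts as punctuation.
    if cjk_as_punct and ord(ch) > CJK_THRESHOLD:
        return True
    return unicodedata.category(ch)[0] in ("P", "S")


def can_close(last_char: str, next_char: str, cjk_as_punct: bool) -> bool:
    """Right-flanking test for a `**` run between last_char and next_char."""
    last_ws, next_ws = last_char.isspace(), next_char.isspace()
    last_punct = is_punct(last_char, cjk_as_punct)
    next_punct = is_punct(next_char, cjk_as_punct)
    return not (last_ws or (last_punct and not (next_ws or next_punct)))


# (last_char, next_char) pairs around a closing `**`.
cases = [("%", "を"), ("%", "a"), ("%", " "), ("d", ".")]
for last_char, next_char in cases:
    before = can_close(last_char, next_char, cjk_as_punct=False)
    after = can_close(last_char, next_char, cjk_as_punct=True)
    print(repr(last_char), repr(next_char), before, after)
```

In this model only the `('%', 'を')` pair changes, from rejected to accepted; the ASCII-only pairs give the same answer with and without the threshold.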
Note
The same bug likely exists in the JavaScript version of markdown-it.