Fix CJK emphasis delimiter detection in scanDelims #387

Open
awoni wants to merge 1 commit into executablebooks:master from aiseed-dev:fix/cjk-emphasis

@awoni awoni commented Apr 6, 2026

Summary

Fix emphasis delimiter detection for mixed CJK/ASCII text.

In scanDelims, which implements CommonMark's flanking rules, when a closing emphasis marker ** is preceded by ASCII punctuation (e.g. %) and followed by a CJK character, right_flanking evaluates to False, so bold silently fails.

Example: 湾岸の**46%**を renders as plain text instead of <strong>46%</strong>.

Root Cause

CJK characters (e.g. を) are not classified as whitespace or punctuation by the spec. When lastChar='%' (punctuation) and nextChar='を' (neither whitespace nor punctuation):

right_flanking = not (isLastPunctChar and not (isNextWhiteSpace or isNextPunctChar))

evaluates to False, so the closing ** is rejected.
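To make the evaluation concrete, here is a minimal standalone sketch of the right-flanking test. It is not the markdown-it-py source: `is_punct` and `right_flanking` are illustrative names, punctuation is approximated with `unicodedata` general categories P and S, whitespace with `str.isspace()`, and the formula includes the leading "preceded by whitespace" term from the spec definition, which is False in this example anyway.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    # Approximation of "Unicode punctuation": general category P* or S*.
    return unicodedata.category(ch)[0] in "PS"


def right_flanking(last_char: str, next_char: str) -> bool:
    # Not preceded by whitespace, and either not preceded by punctuation
    # or followed by whitespace/punctuation.
    last_ws, next_ws = last_char.isspace(), next_char.isspace()
    last_punct, next_punct = is_punct(last_char), is_punct(next_char)
    return not (last_ws or (last_punct and not (next_ws or next_punct)))


# Closing ** in 湾岸の**46%**を: preceded by '%', followed by 'を'.
print(right_flanking("%", "を"))   # False: the closer is rejected
# Closing ** in 日本語の**太字**テスト: preceded by '字', followed by 'テ'.
print(right_flanking("字", "テ"))  # True
# Closing ** followed by a space, as in "**46%** of".
print(right_flanking("%", " "))   # True
```

Only the combination "punctuation before, CJK letter after" fails, which is why 日本語の**太字**テスト already renders correctly without any change.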

Fix

Treat characters above U+2E7F as punctuation for flanking delimiter checks (6 lines added to scanDelims in state_inline.py). This covers CJK Unified Ideographs, Hiragana, Katakana, Hangul, and other East Asian scripts.
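The six added lines are not reproduced in this description, so the following is only a sketch of the behavior described, not the actual patch. The `cjk_as_punct` flag, the `CJK_THRESHOLD` constant, and both function names are hypothetical; the punctuation test is a `unicodedata` approximation.

```python
import unicodedata

CJK_THRESHOLD = 0x2E7F  # threshold named in the PR description


def is_punct(ch: str, cjk_as_punct: bool) -> bool:
    # Hypothetical helper: with the flag set, every character above
    # U+2E7F counts as punctuation for the flanking checks.
    if cjk_as_punct and ord(ch) > CJK_THRESHOLD:
        return True
    return unicodedata.category(ch)[0] in "PS"


def right_flanking(last_char: str, next_char: str, cjk_as_punct: bool = False) -> bool:
    last_ws, next_ws = last_char.isspace(), next_char.isspace()
    last_punct = is_punct(last_char, cjk_as_punct)
    next_punct = is_punct(next_char, cjk_as_punct)
    return not (last_ws or (last_punct and not (next_ws or next_punct)))


print(right_flanking("%", "を"))                     # False: current behavior
print(right_flanking("%", "を", cjk_as_punct=True))  # True: closer accepted
print(right_flanking("d", " ", cjk_as_punct=True))   # True: ASCII case unaffected
```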

Test Cases

from markdown_it import MarkdownIt
md = MarkdownIt("commonmark")

assert "<strong>" in md.render("湾岸の**46%**を供給")
assert "<strong>" in md.render("の**20〜25%**が通過する")
assert "<strong>" in md.render("日本語の**太字**テスト")
assert "<strong>" in md.render("**全部太字**")
assert "<strong>" in md.render("English **bold** test")

All pass. Existing English-only behavior is unchanged.

Note

The same bug likely exists in the JavaScript version of markdown-it.

Treat CJK ideographs (codepoints > 0x2E7F) as punctuation in flanking
delimiter checks so that emphasis markers work correctly in mixed
CJK/ASCII contexts (e.g. 湾岸の**46%**を now renders bold correctly).

https://claude.ai/code/session_01RzF12DUuSS8R6Zw1NKwv9o
@chrisjsewell
Member

PR #387 Review: Fix CJK emphasis delimiter detection in scanDelims

What it does

The PR modifies scanDelims in state_inline.py to treat any character with codepoint > U+2E7F as "punctuation" for flanking delimiter checks. This fixes cases like 湾岸の**46%**を where the closing ** couldn't be recognized as right-flanking because を (a CJK character) is neither whitespace nor punctuation under the spec rules.


Concerns

1. Deviation from CommonMark spec (not yet adopted upstream)

The CommonMark spec defines emphasis flanking using Unicode categories P (punctuation) and S (symbols). CJK letters are category Lo — regular characters. The spec issue commonmark/commonmark-spec#650 has been open since 2020 with no resolution. The upstream JS markdown-it has a corresponding PR (markdown-it/markdown-it#1145) that is not merged, and a commenter (@tats-u, author of markdown-it-cjk-friendly) noted "such a change must be merged into CommonMark first."

Merging this makes markdown-it-py diverge from both the spec and its upstream JS reference implementation.

2. Overly broad heuristic

The check ord(char) > 0x2E7F treats all characters above codepoint 11903 as punctuation-like. While most blocks immediately above U+2E7F are CJK-related (CJK Radicals, Kangxi, Kana, Hangul, CJK Unified Ideographs), this also captures:

  • Yi Syllables (U+A000–A48F)
  • Vai (U+A500–A63F)
  • Various African and South/Southeast Asian scripts at higher codepoints
  • Emoji and miscellaneous symbols

By contrast, the JS markdown-it PR #1145 uses a targeted helper checking only CJK Unified Ideographs (U+4E00–9FFF) and Hangul Syllables (U+AC00–D7A3) — and was still asked to add Hiragana/Katakana.
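A quick way to see the breadth of the threshold is to run it over a few sample characters. This is a standalone sketch using only the standard library; `above_threshold` is an illustrative name, and the sample characters are chosen here for demonstration (they are not taken from the PR).

```python
import unicodedata


def above_threshold(ch: str) -> bool:
    # The check described in the PR: any codepoint above U+2E7F.
    return ord(ch) > 0x2E7F


samples = {
    "を": "Hiragana",
    "漢": "CJK Unified Ideograph",
    "\uA000": "Yi syllable",
    "\uA500": "Vai syllable",
    "\U0001F600": "emoji",
}

for ch, label in samples.items():
    # Print codepoint, Unicode general category, and the threshold result.
    print(f"U+{ord(ch):04X} {label}: "
          f"category={unicodedata.category(ch)}, "
          f"above_threshold={above_threshold(ch)}")
```

Every sample passes the threshold, including the non-CJK ones, so all of them would be treated as punctuation by the flanking checks.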

3. No tests committed

4. Semantic mismatch in the fix

The comment says "Treat CJK ideographs as punctuation" but the effect is more subtle — by marking CJK as punctuation, the fix changes the flanking formula's boundary condition. The upstream JS approach is cleaner: it adds || isNextCJK directly to the right_flanking formula rather than pretending CJK characters are punctuation (which also affects can_open/can_close via isLastPunctChar/isNextPunctChar in ways that may have unintended side effects).
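The side effect can be illustrated with an intraword underscore between two CJK characters, as in 日本語_太字. The sketch below is hypothetical: it is neither this PR's code nor the JS PR's code. It assumes the CommonMark rule for `_` (a left-flanking run opens emphasis only if it is not right-flanking or is preceded by punctuation), guesses where `isNextCJK` enters the formula, and uses a deliberately narrow `is_cjk` range for illustration.

```python
import unicodedata


def is_cjk(ch: str) -> bool:
    # Hypothetical narrow check for illustration only (Kana + CJK Unified).
    return 0x3040 <= ord(ch) <= 0x30FF or 0x4E00 <= ord(ch) <= 0x9FFF


def is_punct(ch: str) -> bool:
    return unicodedata.category(ch)[0] in "PS"


def scan(last_char: str, next_char: str, mode: str):
    """Return (can_open, can_close) for a '_' run, which cannot split words."""
    last_ws, next_ws = last_char.isspace(), next_char.isspace()
    last_punct, next_punct = is_punct(last_char), is_punct(next_char)
    next_cjk = False
    if mode == "overload":   # pretend CJK characters are punctuation
        last_punct = last_punct or is_cjk(last_char)
        next_punct = next_punct or is_cjk(next_char)
    elif mode == "formula":  # only relax the right-flanking test
        next_cjk = is_cjk(next_char)
    left = not (next_ws or (next_punct and not (last_ws or last_punct)))
    right = not (last_ws or (last_punct and not (next_ws or next_punct or next_cjk)))
    can_open = left and (not right or last_punct)
    can_close = right and (not left or next_punct)
    return can_open, can_close


# '_' between 語 and 太, i.e. an intraword underscore in CJK text.
for mode in ("spec", "overload", "formula"):
    print(mode, scan("語", "太", mode))
# spec (False, False)
# overload (True, True)
# formula (False, False)
```

Under these assumptions, overloading the punctuation flags lets an intraword `_` between CJK characters open and close emphasis, while relaxing only the right-flanking formula leaves that case as the spec has it.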


Comparison with other parsers

| Parser | Approach | Status |
| --- | --- | --- |
| CommonMark spec | CJK letters are regular characters, not punctuation | Spec issue #650 open, no change |
| JS markdown-it | Follows spec strictly | PR #1145 (relaxed right-flanking for CJK) open, not merged |
| markdown-it-cjk-friendly | Plugin with precise Unicode range checks | Active, recommended as opt-in |
| Python-Markdown | Regex-based; proposed cjk_friendly_emphasis extension | Rejected for core, suggested as third-party |
| mistune | Regex-based, asterisks more permissive | No explicit CJK handling |
| This PR | Broad > 0x2E7F heuristic in core | Deviates from all above |

Recommendation

This PR addresses a real user-facing problem but the approach has issues:

  1. Should be a plugin, not a core change — consistent with the ecosystem consensus (CommonMark, JS markdown-it, Python-Markdown all treat this as opt-in behavior)
  2. If merged into core, the heuristic should be replaced with explicit Unicode range checks (Hiragana U+3040–309F, Katakana U+30A0–30FF, CJK Unified U+4E00–9FFF, Hangul U+AC00–D7A3, CJK Ext-A U+3400–4DBF at minimum)
  3. The fix logic should modify the flanking formula directly (like the JS PR does with || isNextCJK) rather than overloading the punctuation flag, which has broader side effects on can_open/can_close
  4. Tests are required regardless of approach
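For point 2, an explicit range check could look roughly like the sketch below. The ranges are the ones listed above; `CJK_RANGES` and `is_cjk` are illustrative names, not existing markdown-it-py APIs, and a real implementation would probably need more blocks than this minimum.

```python
# Ranges taken from the recommendation above; the names are illustrative.
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
)


def is_cjk(ch: str) -> bool:
    """Return True if the single character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


print(is_cjk("を"), is_cjk("漢"), is_cjk("한"))               # True True True
print(is_cjk("\uA000"), is_cjk("\U0001F600"), is_cjk("%"))  # False False False
```

Unlike the `> 0x2E7F` threshold, this rejects Yi syllables and emoji while still covering the characters in the PR's test cases.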
