Fix CJK emphasis delimiter detection in scanDelims by awoni · Pull Request #387 · executablebooks/markdown-it-py

awoni · 2026-04-06T06:55:55Z

Summary

Fix emphasis delimiter detection for mixed CJK/ASCII text.

In markdown-it-py's scanDelims, which implements CommonMark's flanking rules, when a closing emphasis marker ** is preceded by ASCII punctuation (e.g. %) and followed by a CJK character, right_flanking evaluates to False, so the bold markup silently fails to render.

Example: 湾岸の**46%**を renders as plain text instead of <strong>46%</strong>.

Root Cause

CJK characters (e.g. を, の, は) are not classified as whitespace or punctuation by the spec. When lastChar='%' (punctuation) and nextChar='を' (neither whitespace nor punctuation):

right_flanking = not (isLastWhiteSpace or (isLastPunctChar and not (isNextWhiteSpace or isNextPunctChar)))

evaluates to False, so the closing ** is rejected.
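The evaluation above can be reproduced with a small standalone model of the right-flanking rule. This is a simplified sketch, not the code in state_inline.py: the helper names (is_space, is_punct, right_flanking) are hypothetical, and punctuation is approximated as Unicode general categories P and S.

```python
import unicodedata


def is_space(ch: str) -> bool:
    # Unicode whitespace (start/end of line can be modelled by passing " ").
    return ch.isspace()


def is_punct(ch: str) -> bool:
    # Approximation: general categories P* (punctuation) and S* (symbols).
    return unicodedata.category(ch)[0] in ("P", "S")


def right_flanking(last_char: str, next_char: str) -> bool:
    # A delimiter run is right-flanking if it is not preceded by whitespace and,
    # when preceded by punctuation, is followed by whitespace or punctuation.
    return not (
        is_space(last_char)
        or (is_punct(last_char) and not (is_space(next_char) or is_punct(next_char)))
    )


# Closing ** in 湾岸の**46%**を: lastChar is "%", nextChar is "を".
print(unicodedata.category("を"))   # a letter category, not P or S
print(right_flanking("%", "を"))    # False: the closer is rejected
print(right_flanking("%", " "))     # True: the same closer before a space is fine
print(right_flanking("字", "テ"))   # True: a closer after a CJK letter is fine
```

The failure only appears when the character before the closing marker is punctuation and the character after it is a letter, which is routine in CJK text written without spaces.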

Fix

Treat characters above U+2E7F as punctuation for flanking delimiter checks (6 lines added to scanDelims in state_inline.py). This covers CJK Unified Ideographs, Hiragana, Katakana, Hangul, and other East Asian scripts.
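The six added lines are not reproduced in this thread, so the following is only a sketch of how such a threshold check could work. It assumes the override is applied to both the preceding and the following character; is_punct_patched and the other names are hypothetical.

```python
import unicodedata

CJK_THRESHOLD = 0x2E7F  # cutoff quoted in the PR description


def is_punct(ch: str) -> bool:
    # Approximation of the spec: general categories P* and S*.
    return unicodedata.category(ch)[0] in ("P", "S")


def is_punct_patched(ch: str) -> bool:
    # Assumed shape of the patch: anything above the cutoff counts as punctuation.
    return is_punct(ch) or ord(ch) > CJK_THRESHOLD


def right_flanking(last_char: str, next_char: str, punct=is_punct) -> bool:
    last_ws, next_ws = last_char.isspace(), next_char.isspace()
    last_p, next_p = punct(last_char), punct(next_char)
    return not (last_ws or (last_p and not (next_ws or next_p)))


# Closing ** in 湾岸の**46%**を, before and after the assumed patch.
print(right_flanking("%", "を"))                          # False
print(right_flanking("%", "を", punct=is_punct_patched))  # True
# ASCII-only neighbours are unaffected by the threshold.
print(right_flanking("d", " ", punct=is_punct_patched))   # True
```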

Test Cases

from markdown_it import MarkdownIt
md = MarkdownIt("commonmark")

assert "<strong>" in md.render("湾岸の**46%**を供給")
assert "<strong>" in md.render("の**20〜25%**が通過する")
assert "<strong>" in md.render("日本語の**太字**テスト")
assert "<strong>" in md.render("**全部太字**")
assert "<strong>" in md.render("English **bold** test")

All five assertions pass with the patch applied. Existing English-only behavior is unchanged.

Note

The same bug likely exists in the JavaScript version of markdown-it.

Treat CJK ideographs (codepoints > 0x2E7F) as punctuation in flanking delimiter checks so that emphasis markers work correctly in mixed CJK/ASCII contexts (e.g. 湾岸の**46%**を now renders bold correctly). https://claude.ai/code/session_01RzF12DUuSS8R6Zw1NKwv9o

chrisjsewell · 2026-05-06T16:19:35Z

PR #387 Review: Fix CJK emphasis delimiter detection in scanDelims

What it does

The PR modifies scanDelims in state_inline.py to treat any character with codepoint > U+2E7F as "punctuation" for flanking delimiter checks. This fixes cases like 湾岸の**46%**を where the closing ** couldn't be recognized as right-flanking because を (a CJK character) is neither whitespace nor punctuation under the spec rules.

Concerns

1. Deviation from CommonMark spec (not yet adopted upstream)

The CommonMark spec defines emphasis flanking using Unicode categories P (punctuation) and S (symbols). CJK letters are category Lo — regular characters. The spec issue commonmark/commonmark-spec#650 has been open since 2020 with no resolution. The upstream JS markdown-it has a corresponding PR (markdown-it/markdown-it#1145) that is not merged, and a commenter (@tats-u, author of markdown-it-cjk-friendly) noted "such a change must be merged into CommonMark first."

Merging this makes markdown-it-py diverge from both the spec and its upstream JS reference implementation.

2. Overly broad heuristic

The check ord(char) > 0x2E7F treats all characters above codepoint 11903 as punctuation-like. While most blocks immediately above U+2E7F are CJK-related (CJK Radicals, Kangxi, Kana, Hangul, CJK Unified Ideographs), this also captures:

Yi Syllables (U+A000–A48F)
Vai (U+A500–A63F)
Various African and South/Southeast Asian scripts at higher codepoints
Emoji and miscellaneous symbols
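The breadth of the check can be inspected directly. The sketch below uses a hypothetical stand-in for the PR's comparison and picks one sample code point from several of the blocks named above; the specific sample characters are my choice, not taken from the PR.

```python
import unicodedata

THRESHOLD = 0x2E7F  # cutoff quoted in the PR description


def heuristic_is_punct_like(ch: str) -> bool:
    # Hypothetical stand-in for the PR's check: any code point above the cutoff.
    return ord(ch) > THRESHOLD


samples = [
    "\u3092",      # Hiragana (the intended target)
    "\ua000",      # Yi Syllables block
    "\ua500",      # Vai block
    "\ua984",      # Javanese block
    "\U0001F600",  # Emoticons block (emoji)
]

for ch in samples:
    # Print code point, Unicode name, general category, and the heuristic's verdict.
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        heuristic_is_punct_like(ch),
    )
```

Every sample is treated as punctuation-like by the threshold, although only the first belongs to a script the PR description sets out to cover.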

By contrast, the JS markdown-it PR #1145 uses a targeted helper checking only CJK Unified Ideographs (U+4E00–9FFF) and Hangul Syllables (U+AC00–D7A3) — and was still asked to add Hiragana/Katakana.

3. No tests committed

4. Semantic mismatch in the fix

The comment says "Treat CJK ideographs as punctuation" but the effect is more subtle — by marking CJK as punctuation, the fix changes the flanking formula's boundary condition. The upstream JS approach is cleaner: it adds || isNextCJK directly to the right_flanking formula rather than pretending CJK characters are punctuation (which also affects can_open/can_close via isLastPunctChar/isNextPunctChar in ways that may have unintended side effects).
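One way to see the difference between the two strategies is to compute both flanking flags under each. This is a simplified model with hypothetical names. It assumes the PR's override applies to both neighbouring characters, and it models the JS approach as an extra `or next_cjk` term inside the right-flanking formula, which is my reading of the description above rather than the JS PR's actual code.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    # Approximation of the spec: general categories P* and S*.
    return unicodedata.category(ch)[0] in ("P", "S")


def is_cjk_like(ch: str) -> bool:
    # Same threshold for both strategies so only the strategy differs.
    return ord(ch) > 0x2E7F


def flanking(last_char, next_char, *, overload_punct=False, relax_right=False):
    """Return (left_flanking, right_flanking) for a delimiter run."""
    last_ws, next_ws = last_char.isspace(), next_char.isspace()
    last_p, next_p = is_punct(last_char), is_punct(next_char)
    if overload_punct:
        # Strategy A (assumed shape of this PR): CJK counts as punctuation everywhere.
        last_p = last_p or is_cjk_like(last_char)
        next_p = next_p or is_cjk_like(next_char)
    # Strategy B (assumed shape of the JS PR): one extra term, right-flanking only.
    next_cjk = relax_right and is_cjk_like(next_char)
    left = not (next_ws or (next_p and not (last_ws or last_p)))
    right = not (last_ws or (last_p and not (next_ws or next_p or next_cjk)))
    return left, right


# Closing ** in 湾岸の**46%**を: both strategies make it right-flanking.
print(flanking("%", "を"))
print(flanking("%", "を", overload_punct=True))
print(flanking("%", "を", relax_right=True))

# Opening ** in abc**日本語**: only strategy A changes left-flanking here.
print(flanking("c", "日"))
print(flanking("c", "日", overload_punct=True))
print(flanking("c", "日", relax_right=True))
```

Under these assumptions, overloading the punctuation flag turns the opening ** after a Latin letter into a non-left-flanking run, so a case that worked before the change could stop rendering as bold. The single-term variant leaves left-flanking untouched.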

Comparison with other parsers

| Parser | Approach | Status |
| --- | --- | --- |
| CommonMark spec | CJK letters are regular characters, not punctuation | Spec issue #650 open, no change |
| JS markdown-it | Follows spec strictly | PR #1145 (relaxed right-flanking for CJK) open, not merged |
| markdown-it-cjk-friendly | Plugin with precise Unicode range checks | Active, recommended as opt-in |
| Python-Markdown | Regex-based; proposed `cjk_friendly_emphasis` extension | Rejected for core, suggested as third-party |
| mistune | Regex-based, asterisks more permissive | No explicit CJK handling |
| This PR | Broad `> 0x2E7F` heuristic in core | Deviates from all above |

Recommendation

This PR addresses a real user-facing problem but the approach has issues:

Should be a plugin, not a core change — consistent with the ecosystem consensus (CommonMark, JS markdown-it, Python-Markdown all treat this as opt-in behavior)
If merged into core, the heuristic should be replaced with explicit Unicode range checks (Hiragana U+3040–309F, Katakana U+30A0–30FF, CJK Unified U+4E00–9FFF, Hangul U+AC00–D7A3, CJK Ext-A U+3400–4DBF at minimum)
The fix logic should modify the flanking formula directly (like the JS PR does with || isNextCJK) rather than overloading the punctuation flag, which has broader side effects on can_open/can_close
Tests are required regardless of approach
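A range-based predicate along the lines recommended above could look like the following. It is a minimal sketch covering only the five blocks listed in the recommendation; is_cjk and CJK_RANGES are hypothetical names, and a production plugin would likely need more ranges.

```python
# Inclusive code point ranges taken from the recommendation above.
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
)


def is_cjk(ch: str) -> bool:
    # True only for code points inside one of the explicit ranges.
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


for ch in ("を", "テ", "字", "한", "\ua000", "\U0001F600", "%"):
    print(f"U+{ord(ch):04X}", is_cjk(ch))
```

Unlike a single threshold, this excludes Yi, Vai, and emoji by construction, at the cost of maintaining the range table.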

tats-u · 2026-05-07T23:56:37Z

This sloPR is far too naive for something created by Opus or Sonnet. You should rewrite it from scratch, providing appropriate and sufficient background information/context. In the first place, the original markdown-it is in a sort of maintenance mode, and this repository probably doesn't plan to incorporate any extensions beyond those used in GitHub-Flavored Markdown into core. (CJK-Friendly Emphasis is available in GitLab)

tats-u · 2026-05-08T11:29:45Z

You can take test cases from https://github.com/tats-u/markdown-cjk-friendly and reference implementations from there, kivikakk/comrak#582, xoofx/markdig#921, and (rejected) yuin/goldmark#529 if you can afford to burn input tokens.

awoni wants to merge 1 commit into executablebooks:master from aiseed-dev:fix/cjk-emphasis
