feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization #2942
Apophis3158 wants to merge 2 commits into QwenLM:main
| return b.end; | ||
| } | ||
| if (col < b.start) { | ||
| return b.end; |
[Critical] findNextCjkWordEnd returns b.end when col < b.start, causing Ctrl+Right to skip over non-CJK text and jump directly to the end of a CJK word.
For example, in "hello 你好 world", if the cursor is inside "hello", pressing Ctrl+Right would jump to position 8, skipping "llo " and "你好" entirely. This is asymmetric with findPrevCjkWordStart which correctly returns null in the analogous case.
| return b.end; | |
| } | |
| if (col < b.start) { | |
| return b.end; | |
| if (col < b.start) { | |
| return null; | |
| } |
— glm-5.1 via Qwen Code /review
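The asymmetry described in this comment can be illustrated with a minimal sketch. The `WordBoundary` shape and the function body below are assumptions made for illustration (the review shows only a fragment of the real function); boundaries are assumed to be sorted by `start`, with `end` exclusive.

```typescript
// Hypothetical boundary shape: [start, end) columns of one CJK word.
interface WordBoundary {
  start: number;
  end: number;
}

// Sketch of the suggested behavior: only return a CJK word end when the
// cursor is inside a CJK word. When the cursor sits before the next CJK
// word, return null so the caller can fall back to its non-CJK logic.
function findNextCjkWordEnd(
  boundaries: WordBoundary[],
  col: number,
): number | null {
  for (const b of boundaries) {
    if (col >= b.start && col < b.end) {
      return b.end; // cursor is inside this CJK word
    }
    if (col < b.start) {
      return null; // cursor is in non-CJK text before this word
    }
  }
  return null; // cursor is past every CJK word
}
```

With `"hello 你好 world"` and a single boundary covering columns 6 to 8, a cursor inside "hello" yields `null` under this sketch, so the surrounding Latin word logic decides where Ctrl+Right stops.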
| ); | ||
| segmentitInstance = null; | ||
| return; | ||
| } | ||
| segmentitInstance = initSegment(new Segment()); | ||
| debugLogger.info('segmentit: loaded successfully'); | ||
| } catch (err) { | ||
| debugLogger.warn('segmentit: failed to load', err); | ||
| segmentitInstance = null; | ||
| } |
[Suggestion] ensureSegmentitLoaded sets segmentitInstance = null on failure, causing it to retry createRequire on every keypress. Use a sentinel value to distinguish "not yet attempted" from "attempted and failed".
Two changes are needed; the guard can stay as it is:
- Declaration (line ~114):
  let segmentitInstance: { doSegment: (text: string) => Array<{w: string}> } | null | false = null;
- Catch block (line ~122): change segmentitInstance = null to:
  segmentitInstance = false;
- Guard (line ~116): no change required.
  if (segmentitInstance !== null) return;
  This already works since false !== null is true, so it will skip retrying.
— glm-5.1 via Qwen Code /review
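A minimal sketch of the sentinel idea, written as a generic helper rather than the PR's actual `ensureSegmentitLoaded`. The name `createLazyLoader` is hypothetical, and the sketch assumes the loaded value is never itself `null`.

```typescript
// Hypothetical helper: wraps a loader so it runs at most once.
// State meanings: null = not attempted yet, false = attempted and failed.
function createLazyLoader<T>(load: () => T): () => T | null {
  let instance: T | null | false = null;
  return () => {
    if (instance === null) {
      try {
        instance = load();
      } catch {
        // Remember the failure so later calls do not retry the load.
        instance = false;
      }
    }
    return instance === false ? null : instance;
  };
}
```

Callers still see `null` when the library is unavailable, but the expensive load (for example a `createRequire` call) is not repeated on each keypress.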
| debugLogger.warn('getCjkWordBoundaries: error, using char fallback', err); | ||
| // On error, fall back to char-by-char boundaries (cached) | ||
| const fallback = charByCharCjkFallback(line); | ||
| cjkBoundariesCache.set(line, fallback); |
[Suggestion] The catch block inserts into the cache without calling evictCacheIfNeeded() first. All other insertion paths call it. If doSegment errors on many distinct lines, the cache can grow beyond the 500-entry CJK_BOUNDARIES_CACHE_MAX limit.
| cjkBoundariesCache.set(line, fallback); | |
| evictCacheIfNeeded(); | |
| cjkBoundariesCache.set(line, fallback); |
— glm-5.1 via Qwen Code /review
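One way to rule out this class of bug is to move eviction into the insertion call itself, so no code path can skip it. The sketch below is an assumption about how that could look, not the PR's code: `BoundedCache` is a hypothetical name, and it evicts the oldest entry, which may differ from what `evictCacheIfNeeded()` actually does.

```typescript
// Hypothetical size-bounded cache: eviction happens inside set(), so the
// error path and the normal path share the same limit.
class BoundedCache<K, V> {
  private readonly map = new Map<K, V>();
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  get(key: K): V | undefined {
    return this.map.get(key);
  }

  set(key: K, value: V): void {
    if (!this.map.has(key) && this.map.size >= this.max) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.map.keys().next();
      if (!oldest.done) {
        this.map.delete(oldest.value);
      }
    }
    this.map.set(key, value);
  }

  get size(): number {
    return this.map.size;
  }
}
```

With a limit of 500, the catch block would call only `set(line, fallback)` and the size bound would hold regardless of how many distinct lines fail.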
| "prompts": "^2.4.2", | ||
| "react": "^19.1.0", | ||
| "read-package-up": "^11.0.0", | ||
| "segmentit": "^2.0.3", |
[Suggestion] segmentit adds ~15MB to disk footprint (embedded dictionary data) as a mandatory dependency for all CLI users. Since the project requires Node.js 20+, the built-in Intl.Segmenter supports CJK word segmentation with zero extra weight:
  const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
  const segments = [...segmenter.segment(line)];
Note: Intl.Segmenter uses ICU data, which may produce different word boundaries than segmentit's dictionary-based approach. Recommend testing with representative CJK text samples before switching.
— glm-5.1 via Qwen Code /review
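A sketch of how `Intl.Segmenter` output could be turned into word boundaries for cursor navigation. The `WordBoundary` shape and the `wordBoundaries` name are assumptions for illustration, and type-checking this requires a TypeScript `lib` setting that includes `es2022.intl`. Exact segment splits depend on the ICU data shipped with the runtime, so no specific output is claimed here.

```typescript
// Hypothetical boundary shape: [start, end) in UTF-16 code units.
interface WordBoundary {
  start: number;
  end: number;
}

// Collects boundaries of word-like segments, skipping whitespace and
// punctuation segments (those have isWordLike === false).
function wordBoundaries(line: string, locale = 'zh'): WordBoundary[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const result: WordBoundary[] = [];
  for (const s of segmenter.segment(line)) {
    if (s.isWordLike) {
      result.push({ start: s.index, end: s.index + s.segment.length });
    }
  }
  return result;
}
```

Because `index` and `segment.length` are both in UTF-16 code units, the boundaries line up with JavaScript string columns, though not necessarily with display columns for wide characters.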
These findings could not be posted as inline comments (lines not in diff):
- AppContainer.tsx: midTurnDrainRef reads from React state mirror instead of synchronous ref. Fix: use drainQueue() from useMessageQueue directly.
- prompts.ts: getActionsSection() says "ask for confirmation" but existing rule says "do not ask permission to use the tool". Contradictory instructions may cause inconsistent model behavior.
- text-buffer.ts: delete_word_left/delete_word_right still use Latin-only word boundary logic while move_word now uses CJK segmentation. Inconsistent UX for CJK users.
— glm-5.1 via Qwen Code /review
TLDR
This PR adds intelligent CJK (Chinese/Japanese/Korean) word segmentation to the CLI text input, enabling proper Ctrl+Left/Right word-by-word navigation for CJK text.
Problem: Without this change, pressing Ctrl+Left/Right on CJK text jumps over the entire contiguous block of CJK characters until the next whitespace, treating phrases like "你好世界" as a single word. This makes precise cursor positioning in mixed Latin-CJK text nearly impossible.
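The behavior described in the problem statement can be reproduced with a deliberately simplified, whitespace-only word jump. This is a stand-in written for illustration, not the CLI's previous implementation; `nextWordEndByWhitespace` is a hypothetical name.

```typescript
// Simplified stand-in: a "word" is any run of non-whitespace characters,
// so an unbroken CJK phrase counts as a single word.
function nextWordEndByWhitespace(line: string, col: number): number {
  let i = col;
  // Skip any whitespace directly under or after the cursor.
  while (i < line.length && /\s/.test(line.charAt(i))) {
    i++;
  }
  // Consume the whole non-whitespace run.
  while (i < line.length && !/\s/.test(line.charAt(i))) {
    i++;
  }
  return i;
}
```

For `"你好世界 abc"` with the cursor at column 0, this returns 4: the cursor lands after "世界" with no stop between "你好" and "世界".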
Solution: Integrates the segmentit library for Chinese word segmentation, with character-by-character fallback for long lines and caching for performance. The implementation:
- segmentit for CJK word boundary detection
- isDifferentScript fallback
Screenshots / Video Demo
Dive Deeper
Implementation Details
Word Navigation (wordLeft/wordRight):
- getCjkWordBoundaries() for lines containing CJK characters
- findPrevCjkWordStart() / findNextCjkWordEnd() for precise cursor positioning
- isDifferentScript for mixed text (e.g., Latin + CJK)
Performance Optimizations:
- segmentit is loaded on-demand via createRequire() for ESM/CJS interop
Dependencies:
- segmentit@^2.0.3 for Chinese word segmentation
Reviewer Test Plan
- hello 你好 world 世界 (segmentit)
- 你好,世界!
- 你好hello世界
- arabicالعربية
- npm run test -- packages/cli/src/ui/components/shared/text-buffer.test.ts
Testing Matrix
Linked issues / bugs
#2941
🤖 Generated with Qwen Code