Skip to content

feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization#2942

Open
Apophis3158 wants to merge 2 commits intoQwenLM:mainfrom
Apophis3158:feat/cjk-navigation
Open

feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization#2942
Apophis3158 wants to merge 2 commits intoQwenLM:mainfrom
Apophis3158:feat/cjk-navigation

Conversation

@Apophis3158
Copy link
Copy Markdown

@Apophis3158 Apophis3158 commented Apr 7, 2026

TLDR

This PR adds intelligent CJK (Chinese/Japanese/Korean) word segmentation to the CLI text input, enabling proper Ctrl+Left/Right word-by-word navigation for CJK text.

Problem: Without this change, pressing Ctrl+Left/Right on CJK text jumps over the entire contiguous block of CJK characters until the next whitespace, treating phrases like "你好世界" as a single word. This makes precise cursor positioning in mixed Latin-CJK text nearly impossible.

Solution: Integrates the segmentit library for Chinese word segmentation, with character-by-character fallback for long lines and caching for performance. The implementation:

  • Adds lazy-loaded segmentit for CJK word boundary detection
  • Implements caching (up to 500 entries) to avoid repeated segmentation overhead
  • Falls back to character-by-character navigation for lines exceeding 1500 characters to prevent UI freezing
  • Extends CJK regex coverage to include CJK Extension A and Compatibility Ideographs
  • Preserves existing Latin/multi-script word boundary logic via isDifferentScript fallback

Screenshots / Video Demo

Dive Deeper

Implementation Details

Word Navigation (wordLeft / wordRight):

  • First attempts CJK segmentation via getCjkWordBoundaries() for lines containing CJK characters
  • Uses findPrevCjkWordStart() / findNextCjkWordEnd() for precise cursor positioning
  • Falls back to script-boundary detection (isDifferentScript) for mixed text (e.g., Latin + CJK)
  • Handles edge cases: whitespace-only prefixes, punctuation skipping, and cross-line navigation

Performance Optimizations:

  • Lazy loading: segmentit is loaded on-demand via createRequire() for ESM/CJS interop
  • Pre-warming: Background initialization on first call to minimize latency
  • Caching: Line content → boundary mapping cache with LRU-style eviction
  • Length limit: Lines >1500 chars skip segmentation to avoid UI freeze

Dependencies:

  • Added segmentit@^2.0.3 for Chinese word segmentation

Reviewer Test Plan

  1. Open the CLI and type mixed Latin-CJK text, e.g., hello 你好 world 世界
  2. Use Ctrl+Left/Right to navigate word-by-word:
    • Verify cursor stops at CJK word boundaries (as determined by segmentit)
    • Verify Latin words are still treated as single units
    • Verify cross-line navigation works correctly
  3. Test edge cases:
    • Long lines (>1500 chars) with CJK text
    • CJK text with punctuation: 你好,世界!
    • Mixed scripts: 你好hello世界arabicالعربية
  4. Run unit tests: npm run test -- packages/cli/src/ui/components/shared/text-buffer.test.ts

Testing Matrix

🍏 🪟 🐧
npm run
npx
Docker
Podman - -
Seatbelt - -

Linked issues / bugs

#2941


🤖 Generated with Qwen Code

Copy link
Copy Markdown
Collaborator

@wenshao wenshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(duplicate review removed)

Comment on lines +299 to +302
return b.end;
}
if (col < b.start) {
return b.end;
Copy link
Copy Markdown
Collaborator

@wenshao wenshao Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] findNextCjkWordEnd returns b.end when col < b.start, causing Ctrl+Right to skip over non-CJK text and jump directly to the end of a CJK word.

For example, in "hello 你好 world", if the cursor is inside "hello", pressing Ctrl+Right would jump to position 8, skipping "llo " and "你好" entirely. This is asymmetric with findPrevCjkWordStart which correctly returns null in the analogous case.

Suggested change
return b.end;
}
if (col < b.start) {
return b.end;
if (col < b.start) {
return null;
}

— glm-5.1 via Qwen Code /review

Comment on lines +114 to +123
);
segmentitInstance = null;
return;
}
segmentitInstance = initSegment(new Segment());
debugLogger.info('segmentit: loaded successfully');
} catch (err) {
debugLogger.warn('segmentit: failed to load', err);
segmentitInstance = null;
}
Copy link
Copy Markdown
Collaborator

@wenshao wenshao Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] ensureSegmentitLoaded sets segmentitInstance = null on failure, causing it to retry createRequire on every keypress. Use a sentinel value to distinguish "not yet attempted" from "attempted and failed".

Three changes needed:

  1. Declaration (line ~114):
let segmentitInstance: { doSegment: (text: string) => Array<{w: string}> } | null | false = null;
  1. Catch block (line ~122): change segmentitInstance = null to:
    segmentitInstance = false;
  1. Guard (line ~116): change if (segmentitInstance !== null) return; — this already works since false !== null is true, so it will skip retrying.

— glm-5.1 via Qwen Code /review

debugLogger.warn('getCjkWordBoundaries: error, using char fallback', err);
// On error, fall back to char-by-char boundaries (cached)
const fallback = charByCharCjkFallback(line);
cjkBoundariesCache.set(line, fallback);
Copy link
Copy Markdown
Collaborator

@wenshao wenshao Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The catch block inserts into the cache without calling evictCacheIfNeeded() first. All other insertion paths call it. If doSegment errors on many distinct lines, the cache can grow beyond the 500-entry CJK_BOUNDARIES_CACHE_MAX limit.

Suggested change
cjkBoundariesCache.set(line, fallback);
evictCacheIfNeeded();
cjkBoundariesCache.set(line, fallback);

— glm-5.1 via Qwen Code /review

"prompts": "^2.4.2",
"react": "^19.1.0",
"read-package-up": "^11.0.0",
"segmentit": "^2.0.3",
Copy link
Copy Markdown
Collaborator

@wenshao wenshao Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] segmentit adds ~15MB to disk footprint (embedded dictionary data) as a mandatory dependency for all CLI users. Since the project requires Node.js 20+, the built-in Intl.Segmenter supports CJK word segmentation with zero extra weight:

const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
const segments = [...segmenter.segment(line)];

Note: Intl.Segmenter uses ICU data which may produce different word boundaries than segmentit's dictionary-based approach. Recommend testing with representative CJK text samples before switching.

— glm-5.1 via Qwen Code /review

Copy link
Copy Markdown
Collaborator

@wenshao wenshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These findings could not be posted as inline comments (lines not in diff):

  • AppContainer.tsxmidTurnDrainRef reads from React state mirror instead of synchronous ref. Fix: use drainQueue() from useMessageQueue directly.
  • prompts.tsgetActionsSection() says "ask for confirmation" but existing rule says "do not ask permission to use the tool". Contradictory instructions may cause inconsistent model behavior.
  • text-buffer.tsdelete_word_left/delete_word_right still use Latin-only word boundary logic while move_word now uses CJK segmentation. Inconsistent UX for CJK users.

— glm-5.1 via Qwen Code /review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants