fix(scan): index files under non-ASCII (CJK) directory names #543

Open
Arvin-Hugh wants to merge 1 commit into colbymchenry:main from Arvin-Hugh:fix/non-ascii-path-indexing

@Arvin-Hugh

Problem

Source files under a directory whose name contains non-ASCII characters (e.g. CJK / Chinese folder names) are silently skipped during indexing — no error, no warning. They are not gitignored and parse fine; they simply never enter the index.

On a real .NET solution that groups code into Chinese business-domain folders, only 114 of 902 .cs files were indexed — every skipped file lived under a directory with a non-ASCII name; every indexed file lived under an all-ASCII path.

Minimal reproduction

REPRO=/tmp/cg-cjk-repro
mkdir -p "$REPRO/src/english" "$REPRO/src/中文目录"
printf 'public class Foo {}\n' > "$REPRO/src/english/Foo.cs"
printf 'public class Bar {}\n' > "$REPRO/src/中文目录/Bar.cs"
cd "$REPRO" && git init -q && git add -A && git commit -qm init
codegraph init "$REPRO" && codegraph index "$REPRO" --force
codegraph files --format flat   # only src/english/Foo.cs shows up

Root cause

collectGitFiles and getGitChangedFiles read git ls-files / git status output verbatim. git defaults to core.quotePath=true, which octal-escapes and double-quotes any path containing a byte outside ASCII. So a tracked file at src/中文目录/Bar.cs arrives as:

"src/\344\270\255\346\226\207\347\233\256\345\275\225/Bar.cs"

The surrounding quotes and escapes mean isSourceFile() sees an extension of .cs" (trailing quote), which isn't in EXTENSION_MAP, so the file is dropped. All-ASCII paths are emitted unquoted and are unaffected — which is exactly why only English-named folders were indexed.
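
A minimal TypeScript sketch of the failure described above. `extensionOf`, `looksLikeSourceFile`, and the one-entry `EXTENSION_MAP` are hypothetical stand-ins, since the real `isSourceFile()` is not reproduced in this description. `decodeGitQuotedPath` is also illustrative and is not part of this change; it only shows that the octal escapes are the UTF-8 bytes of the directory name.

```typescript
// Hypothetical stand-ins for illustration; not the project's real code.
const EXTENSION_MAP: Record<string, string> = { ".cs": "csharp" };

// Everything from the last "." in the final path segment to the end of the string.
function extensionOf(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot <= 0 ? "" : base.slice(dot);
}

function looksLikeSourceFile(filePath: string): boolean {
  return EXTENSION_MAP[extensionOf(filePath)] !== undefined;
}

// Undo git's quoting for one output line. Assumes the quoted form is pure ASCII,
// which holds when core.quotePath=true. Only \NNN octal bytes are translated;
// for any other escape (\\, \", \n, ...) the character after the backslash is kept as-is.
function decodeGitQuotedPath(line: string): string {
  const isQuoted =
    line.length >= 2 && line.charAt(0) === '"' && line.charAt(line.length - 1) === '"';
  if (!isQuoted) return line; // unquoted lines are already literal paths

  const inner = line.slice(1, -1);
  let percentEncoded = "";
  for (let i = 0; i < inner.length; i++) {
    let byte = inner.charCodeAt(i);
    if (inner.charAt(i) === "\\") {
      const octal = /^[0-7]{3}/.exec(inner.slice(i + 1));
      if (octal) {
        byte = parseInt(octal[0], 8);
        i += 3; // skip the three octal digits
      } else {
        i += 1; // skip the backslash, keep the next character
        byte = inner.charCodeAt(i);
      }
    }
    percentEncoded += "%" + ("0" + byte.toString(16)).slice(-2);
  }
  // decodeURIComponent reassembles the percent-encoded bytes as UTF-8.
  return decodeURIComponent(percentEncoded);
}

const quoted = '"src/\\344\\270\\255\\346\\226\\207\\347\\233\\256\\345\\275\\225/Bar.cs"';
console.log(extensionOf(quoted));          // .cs"
console.log(looksLikeSourceFile(quoted));  // false

const decoded = decodeGitQuotedPath(quoted);
console.log(decoded);                      // src/中文目录/Bar.cs
console.log(looksLikeSourceFile(decoded)); // true
```

The fix in this PR does not decode anything; it asks git not to quote in the first place, which avoids carrying a decoder like this in the scanner.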

Fix

Run the three path-reading git commands with -c core.quotePath=false so non-ASCII paths come through as real UTF-8:

  • git ls-files -c --recurse-submodules (tracked)
  • git ls-files -o --exclude-standard (untracked)
  • git status --porcelain --no-renames (incremental sync)
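
As a rough sketch of the shape of the change, the override can be prepended to each argument list. `withQuotePathDisabled` and `gitCommands` are hypothetical names; how the scanner actually spawns git is not shown in this description.

```typescript
// Hypothetical helper: git's `-c <name>=<value>` override applies to a single
// invocation and must come before the subcommand (ls-files, status).
function withQuotePathDisabled(args: string[]): string[] {
  return ["-c", "core.quotePath=false", ...args];
}

// The three path-reading commands listed above.
const gitCommands: string[][] = [
  ["ls-files", "-c", "--recurse-submodules"],
  ["ls-files", "-o", "--exclude-standard"],
  ["status", "--porcelain", "--no-renames"],
].map((args) => withQuotePathDisabled(args));

for (const argv of gitCommands) {
  console.log(["git", ...argv].join(" "));
}
// git -c core.quotePath=false ls-files -c --recurse-submodules
// git -c core.quotePath=false ls-files -o --exclude-standard
// git -c core.quotePath=false status --porcelain --no-renames
```

Each resulting array could be passed to something like Node's `execFileSync("git", argv, { cwd, encoding: "utf8" })`. Note that `core.quotePath=false` only stops the escaping of bytes above 0x80; git still quotes paths containing double quotes, backslashes, or control characters regardless of this setting.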

Testing

  • Added a regression test (Non-ASCII (CJK) paths) that commits one file and leaves one untracked under a CJK-named directory, plus an ASCII control, and asserts all three .cs files are scanned with no quoting artifacts in the returned paths. It fails before the change (1 of 3 found) and passes after.
  • Full vitest run shows no new failures attributable to this change.
  • Verified end-to-end on the real repo: the indexed file count went from 114 to 892, and the previously missing source files under Chinese-named directories are now present in the index.
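
The "no quoting artifacts" assertion could be expressed along these lines. This is a hypothetical check, not the PR's actual vitest code; `hasQuotingArtifact`, `quotingArtifacts`, and the sample path lists (including `Baz.cs`) are made up for illustration.

```typescript
// Hypothetical check: a path carries a quoting artifact if it still has git's
// surrounding double quotes or a \NNN octal escape in it.
function hasQuotingArtifact(p: string): boolean {
  return (
    p.charAt(0) === '"' ||
    p.charAt(p.length - 1) === '"' ||
    /\\[0-7]{3}/.test(p)
  );
}

function quotingArtifacts(paths: string[]): string[] {
  return paths.filter((p) => hasQuotingArtifact(p));
}

// Illustrative inputs: what git emits with quoting on versus off.
const pathsBefore = [
  "src/english/Foo.cs",
  '"src/\\344\\270\\255\\346\\226\\207\\347\\233\\256\\345\\275\\225/Bar.cs"',
];
const pathsAfter = ["src/english/Foo.cs", "src/中文目录/Bar.cs", "src/中文目录/Baz.cs"];

console.log(quotingArtifacts(pathsBefore).length); // 1
console.log(quotingArtifacts(pathsAfter).length);  // 0
```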

No existing issue filed for this; happy to open one to track if preferred.

git defaults to core.quotePath=true, which octal-escapes and double-quotes
any path containing non-ASCII bytes. collectGitFiles / getGitChangedFiles read
git ls-files and git status output verbatim, so a path like
src/中文/Foo.cs arrived as "src/\344\270\255/Foo.cs" — the trailing quote
broke isSourceFile's extension check and the file was silently dropped.

Run those git commands with -c core.quotePath=false so non-ASCII paths come
through as real UTF-8. Projects organized by non-English folder names now index
in full instead of only their all-ASCII paths.

Adds a regression test covering tracked and untracked source under a CJK dir.