fix(scan): index files under non-ASCII (CJK) directory names#543
Open
Arvin-Hugh wants to merge 1 commit into
git defaults to core.quotePath=true, which octal-escapes and double-quotes any path containing non-ASCII bytes. collectGitFiles / getGitChangedFiles read git ls-files and git status output verbatim, so a path like src/中文/Foo.cs arrived as "src/\344\270\255/Foo.cs" — the trailing quote broke isSourceFile's extension check and the file was silently dropped. Run those git commands with -c core.quotePath=false so non-ASCII paths come through as real UTF-8. Projects organized by non-English folder names now index in full instead of only their all-ASCII paths. Adds a regression test covering tracked and untracked source under a CJK dir.
Problem
Source files under a directory whose name contains non-ASCII characters (e.g. CJK / Chinese folder names) are silently skipped during indexing — no error, no warning. They are not gitignored and parse fine; they simply never enter the index.
On a real .NET solution that groups code into Chinese business-domain folders, only 114 of 902 .cs files were indexed: every skipped file lived under a directory with a non-ASCII name, and every indexed file lived under an all-ASCII path.
Minimal reproduction
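A reproduction along these lines can be sketched in TypeScript, assuming git is available on PATH. The runGit and reproduce helpers, the temporary repository, and the file contents are illustrative assumptions, not part of the project.

```typescript
import { execFileSync } from "node:child_process";
import { mkdtempSync, mkdirSync, writeFileSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: run git in `cwd` and return its non-empty output lines.
function runGit(cwd: string, args: string[]): string[] {
  return execFileSync("git", args, { cwd, encoding: "utf8" })
    .split("\n")
    .filter((line) => line.length > 0);
}

// Stage one file under a CJK-named directory in a throwaway repository and
// list it with quoting forced on (git's default) and then off.
function reproduce(): { quoted: string[]; unquoted: string[] } {
  const repo = mkdtempSync(join(tmpdir(), "quotepath-repro-"));
  try {
    runGit(repo, ["init", "-q"]);
    mkdirSync(join(repo, "src", "中文"), { recursive: true });
    writeFileSync(join(repo, "src", "中文", "Foo.cs"), "class Foo {}\n");
    runGit(repo, ["add", "."]);
    // Set the value explicitly so a user-level git config cannot mask the problem.
    const quoted = runGit(repo, ["-c", "core.quotePath=true", "ls-files"]);
    const unquoted = runGit(repo, ["-c", "core.quotePath=false", "ls-files"]);
    return { quoted, unquoted };
  } finally {
    rmSync(repo, { recursive: true, force: true });
  }
}
```

Calling reproduce() should return the octal-escaped, double-quoted form in quoted and the plain UTF-8 path in unquoted.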
Root cause
collectGitFiles and getGitChangedFiles read git ls-files / git status output verbatim. git defaults to core.quotePath=true, which octal-escapes and double-quotes any path containing a byte outside ASCII. So a tracked file at src/中文目录/Bar.cs arrives as:
"src/\344\270\255\346\226\207\347\233\256\345\275\225/Bar.cs"
The surrounding quotes and escapes mean isSourceFile() sees an extension of .cs" (trailing quote), which isn't in EXTENSION_MAP, so the file is dropped. All-ASCII paths are emitted unquoted and are unaffected, which is exactly why only English-named folders were indexed.
Fix
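To make the failure concrete, the sketch below feeds the same path, as git emits it with core.quotePath=true and with core.quotePath=false, through a simplified extension check. The EXTENSION_MAP and isSourceFile defined here are hypothetical stand-ins; the project's real implementations are not shown in this PR description and may differ.

```typescript
import { extname } from "node:path";

// Hypothetical stand-in for the project's EXTENSION_MAP.
const EXTENSION_MAP = new Map<string, string>([
  [".cs", "csharp"],
  [".ts", "typescript"],
]);

// Hypothetical stand-in for isSourceFile(): accept a path only if its
// extension is a known key.
function isSourceFile(filePath: string): boolean {
  return EXTENSION_MAP.has(extname(filePath));
}

// src/中文/Foo.cs as printed with core.quotePath=true (git's default).
const quotedPath = '"src/\\344\\270\\255/Foo.cs"';
// The same file as printed with core.quotePath=false.
const unquotedPath = "src/中文/Foo.cs";

console.log(extname(quotedPath), isSourceFile(quotedPath)); // prints: .cs" false
console.log(extname(unquotedPath), isSourceFile(unquotedPath)); // prints: .cs true
```

The trailing double quote becomes part of the extension, so the lookup misses even though the file itself is a valid source file.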
Run the three path-reading git commands with -c core.quotePath=false so non-ASCII paths come through as real UTF-8:
- git ls-files -c --recurse-submodules (tracked)
- git ls-files -o --exclude-standard (untracked)
- git status --porcelain --no-renames (incremental sync)
Testing
Adds a regression test (Non-ASCII (CJK) paths) that commits one file and leaves one untracked under a CJK-named directory, plus an ASCII control, and asserts all three .cs files are scanned with no quoting artifacts in the returned paths. It fails before the change (1 of 3 found) and passes after.
vitest run shows no new failures attributable to this change.
No existing issue filed for this; happy to open one to track if preferred.
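For reference, the shape of the fix described above can be sketched as follows. This is a minimal illustration rather than the actual diff: withQuotePathOff, gitLines, and the three list helpers are hypothetical names, and only the git arguments themselves come from this PR.

```typescript
import { execFileSync } from "node:child_process";

// Global option, placed before the subcommand, so git emits paths as raw UTF-8.
const QUOTE_PATH_OFF: readonly string[] = ["-c", "core.quotePath=false"];

// Hypothetical helper: prepend the override to any git argument list.
function withQuotePathOff(args: readonly string[]): string[] {
  return [...QUOTE_PATH_OFF, ...args];
}

// Hypothetical runner: non-empty output lines of a git command run in `cwd`.
function gitLines(cwd: string, args: readonly string[]): string[] {
  return execFileSync("git", withQuotePathOff(args), { cwd, encoding: "utf8" })
    .split("\n")
    .filter((line) => line.length > 0);
}

// The three path-reading commands named in the Fix section.
const listTracked = (cwd: string) =>
  gitLines(cwd, ["ls-files", "-c", "--recurse-submodules"]);
const listUntracked = (cwd: string) =>
  gitLines(cwd, ["ls-files", "-o", "--exclude-standard"]);
// Each returned line still carries the two-character porcelain status prefix.
const listChanged = (cwd: string) =>
  gitLines(cwd, ["status", "--porcelain", "--no-renames"]);
```

Routing every call through one helper keeps the override from being forgotten on a future command. Note that git still quotes paths containing double quotes, backslashes, or control characters regardless of this setting, so a parser may want to handle that case separately.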