fix: [SNOW-1528909] decode SQL files by BOM before assuming UTF-8 #2999
Draft
sfc-gh-olorek wants to merge 1 commit into
`snow sql -f <file>` and the `!source <file>` include directive called
`path.read_text()` / `path.open("r")` without an explicit encoding, so the
decoder defaulted to whatever the platform locale happened to be (UTF-8 on
modern Linux/macOS, cp125x on Windows). Two real-world inputs broke on
Windows as a result:
- PowerShell's `>` redirect writes UTF-16 LE with a BOM. Reading it as UTF-8
raises `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in
position 0` before the first statement ever reaches Snowflake (#1303).
- A UTF-8 file with a BOM (common from Notepad / Visual Studio) decodes
successfully but leaves U+FEFF as the first character of the first
statement, which Snowflake rejects as a syntax error.
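Both failure modes are easy to reproduce in a few lines of plain Python (an illustrative repro, not code from the PR):

```python
import codecs

text = "select 1;"

# PowerShell's `>` redirect writes UTF-16 LE with a BOM (typically a
# b"\xff\xfe" prefix). Neither BOM byte is a valid UTF-8 start byte.
utf16 = text.encode("utf-16")
try:
    utf16.decode("utf-8")
except UnicodeDecodeError as exc:
    # Fails on the very first byte, before any SQL reaches Snowflake.
    assert exc.start == 0

# Notepad / Visual Studio style: UTF-8 with a BOM.
utf8_bom = codecs.BOM_UTF8 + text.encode("utf-8")
decoded = utf8_bom.decode("utf-8")  # no error raised...
assert decoded[0] == "\ufeff"       # ...but U+FEFF leads the first statement
```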
This change introduces `_read_sql_file(path)` in `src/snowflake/cli/_plugins/sql/statement_reader.py`. It opens the file in binary mode, checks for the UTF-32 / UTF-16 / UTF-8 BOMs in descending length order, and decodes with the matching codec (`utf-32`, `utf-16`, `utf-8-sig`). Python's `utf-16`/`utf-32` codecs consume the BOM themselves and `utf-8-sig` strips the UTF-8 BOM, so the returned text never contains a leading U+FEFF. Files without a BOM fall back to UTF-8, matching how the CLI writes files elsewhere (`SecurePath.write_text`) and keeping the common case on the fast path. Both entry points that previously read SQL files — `files_reader` for
`snow sql -f` and `ParsedStatement.from_file` for `!source` — now go
through the new helper, so behaviour is consistent between the two.
No charset-sniffing heuristics or new dependencies (e.g. charset_normalizer)
were introduced; the contributor review on the issue explicitly preferred a
BOM-only approach, and everything else continues to default to UTF-8.
### Testing
Added 13 regression tests to `tests/sql/test_statement_reader.py`:
- `_read_sql_file` decodes UTF-8-sig, UTF-16, UTF-16-LE/BE, UTF-32,
UTF-32-LE/BE with and without an auto-emitted BOM.
- `files_reader` and the `!source` path both round-trip BOM-prefixed files.
- Regression guards that plain UTF-8 files still work and that a UTF-8 BOM
never leaks into the first statement.
Also manually verified the reproducer from the issue — a file written by
`"/* copyright */\nselect round(ln(100), 4);".encode("utf-16")`, which is what
PowerShell `>` produces — now decodes and parses cleanly instead of crashing
with `UnicodeDecodeError`.
Full `tests/sql/` suite: 225/225 passing locally.
Fixes #1303 (SNOW-1528909).