
fix: [SNOW-1528909] decode SQL files by BOM before assuming UTF-8 #2999

Draft

sfc-gh-olorek wants to merge 1 commit into main from proactive/sql-utf16-bom-detection

Conversation

@sfc-gh-olorek
Contributor

Pre-review checklist

  • I've confirmed that instructions included in README.md are still correct after my changes in the codebase.
  • I've added or updated automated unit tests to verify correctness of my new code.
  • I've added or updated integration tests to verify correctness of my new code.
  • I've confirmed that my changes are working by executing CLI's commands manually on MacOS.
  • I've confirmed that my changes are working by executing CLI's commands manually on Windows.
  • I've confirmed that my changes are up-to-date with the target branch.
  • I've described my changes in the release notes.
  • I've described my changes in the section below.
  • I've described my changes in the documentation.

Changes description

Fixes #1303 (SNOW-1528909).

snow sql -f <file> and the !source <file> include directive called path.read_text() / path.open("r") without an explicit encoding, so the decoder defaulted to whatever the platform locale happened to be (UTF-8 on modern Linux/macOS, cp125x on Windows). Two real-world inputs then broke on Windows (a minimal reproduction follows the list):

  • PowerShell's > redirect writes UTF-16 LE with a BOM. Reading it as UTF-8 raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0 before the first statement ever reaches Snowflake — the exact stack trace in the issue.
  • A UTF-8 file with a BOM (common from Notepad, Visual Studio, many CI setups) decodes successfully but leaves a stray U+FEFF as the first character of the first statement, which Snowflake rejects as a syntax error.
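
To make both failure modes concrete, here is a minimal, self-contained reproduction in plain Python; the literal strings are illustrative, and the byte values shown assume a little-endian platform:

```python
text = "select round(ln(100), 4);"

# Failure 1: PowerShell's `>` writes UTF-16 LE with a BOM. Decoding those
# bytes as UTF-8 fails on the very first byte (0xff on little-endian).
utf16_bytes = text.encode("utf-16")  # b"\xff\xfe" + UTF-16 LE payload
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xff in position 0: ...

# Failure 2: a UTF-8 BOM decodes without error, but U+FEFF survives as the
# first character of the first statement, which Snowflake then rejects.
utf8_bom_bytes = text.encode("utf-8-sig")  # b"\xef\xbb\xbf" + UTF-8 payload
decoded = utf8_bom_bytes.decode("utf-8")
assert decoded[0] == "\ufeff"
```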

This PR introduces _read_sql_file(path) in src/snowflake/cli/_plugins/sql/statement_reader.py (a sketch follows the list below):

  • Opens the file in binary mode.
  • Checks for a UTF-32 / UTF-16 / UTF-8 BOM in descending length order (the UTF-32 LE BOM starts with the same two bytes, ff fe, as the UTF-16 LE BOM, so longer BOMs must be tested first).
  • Decodes with the matching codec: utf-32, utf-16, or utf-8-sig. Python's utf-16 / utf-32 codecs consume the BOM themselves, and utf-8-sig strips the UTF-8 BOM — so the returned text never contains a leading U+FEFF.
  • Files without a BOM fall back to utf-8, matching how the CLI writes files elsewhere (SecurePath.write_text) and keeping the common case on the fast path.
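
A minimal sketch of that dispatch logic, assuming the helper shape described above; the actual code in statement_reader.py may differ in details such as the table name:

```python
import codecs
from pathlib import Path

# BOM -> codec table, longest BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM (ff fe), so UTF-32 must be tested first.
_BOM_TO_ENCODING = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF8, "utf-8-sig"),
)


def _read_sql_file(path: Path) -> str:
    raw = path.read_bytes()  # binary read: no locale-dependent decoding
    for bom, encoding in _BOM_TO_ENCODING:
        if raw.startswith(bom):
            # utf-16/utf-32 consume the BOM while decoding, and utf-8-sig
            # strips it, so the returned text never starts with U+FEFF.
            return raw.decode(encoding)
    return raw.decode("utf-8")  # no BOM: default to UTF-8
```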

Both entry points that previously read SQL files — files_reader for snow sql -f and ParsedStatement.from_file for !source — now go through this helper, so the behaviour is consistent between the two.

Note on approach: the issue thread discussed pulling in charset_normalizer for broader sniffing, but a follow-up comment (from a member of the SnowCLI team) preferred staying with BOM-only detection plus UTF-8 fallback. This change implements exactly that — no new dependencies, no heuristics that could guess the wrong encoding for a plain UTF-8 file that happens to contain ambiguous byte sequences.

Testing

Added 13 regression tests to tests/sql/test_statement_reader.py (one is sketched after the list):

  • _read_sql_file correctly decodes UTF-8-sig, UTF-16 (with BOM from the codec), UTF-16-LE/BE (manually prepended BOM — what PowerShell and Notepad actually write), UTF-32, and UTF-32-LE/BE.
  • files_reader (the snow sql -f path) and the !source directive (ParsedStatement.from_file) each round-trip BOM-prefixed files.
  • Regression guards that plain UTF-8 files still decode on the fast path and that a UTF-8 BOM never leaks into the first statement.
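
For illustration, a parametrised test in roughly the shape of those regression tests; the real test names and fixtures in tests/sql/test_statement_reader.py may differ:

```python
import pytest

from snowflake.cli._plugins.sql.statement_reader import _read_sql_file


@pytest.mark.parametrize("encoding", ["utf-8-sig", "utf-16", "utf-32"])
def test_read_sql_file_strips_bom(tmp_path, encoding):
    sql = "select round(ln(100), 4);"
    sql_file = tmp_path / "query.sql"
    sql_file.write_bytes(sql.encode(encoding))  # codec emits the BOM itself
    assert _read_sql_file(sql_file) == sql  # decoded text has no U+FEFF
```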

Also manually verified the reproducer from the issue — a file written by "/* copyright */\nselect round(ln(100), 4);".encode("utf-16") (what PowerShell > produces) — now decodes and parses cleanly instead of crashing with UnicodeDecodeError.

Full tests/sql/ suite: 225/225 passing locally.



Development

Successfully merging this pull request may close these issues.

SNOW-1528909: Snowflake CLI cannot handle UTF-16LE encoded text files

1 participant