Describe the bug
Snowflake's incremental table-list query (SNOWFLAKE_INCREMENTAL_GET_TABLE_NAMES in ingestion/src/metadata/ingestion/source/database/snowflake/queries.py) deduplicates rows in SNOWFLAKE.ACCOUNT_USAGE.tables using:
ROW_NUMBER() OVER (partition by TABLE_NAME order by LAST_DDL desc)
...
WHERE ROW_NUMBER = 1
For any table that was recreated via CREATE OR REPLACE TABLE, ACCOUNT_USAGE retains two rows under the same TABLE_NAME:
- the new alive row (
new TABLE_ID, DELETED IS NULL)
- the old dropped row (
old TABLE_ID, DELETED = <drop_ts>)
Snowflake updates the old row's LAST_DDL to the drop timestamp, which is after the new row's create LAST_DDL (the engine creates the new table first, then drops the old one). The dedup therefore picks the deleted row as "the latest state of TABLE_NAME", and the alive table is:
- placed into
SnowflakeTableList.get_deleted()
- queued for deletion in OpenMetadata via
mark_tables_as_deleted
- never yielded to
_get_table_names_and_types's return value, so its new TABLE_ID and schema changes are never processed
To Reproduce
On a Snowflake account with SNOWFLAKE.ACCOUNT_USAGE populated:
- Create a table and let
ACCOUNT_USAGE.tables catch up (1–3h):
CREATE TABLE TEST_DB.PUBLIC.T (ID NUMBER);
- Replace it:
CREATE OR REPLACE TABLE TEST_DB.PUBLIC.T (ID NUMBER, NAME VARCHAR);
- Wait for ACCOUNT_USAGE to reflect both rows, then run the Snowflake ingestion in incremental mode with a window that covers both events.
Inspect ACCOUNT_USAGE:
SELECT TABLE_ID, TABLE_NAME, LAST_DDL, DELETED
FROM SNOWFLAKE.ACCOUNT_USAGE.tables
WHERE TABLE_SCHEMA = 'PUBLIC' AND TABLE_NAME = 'T';
You'll see two rows for T, and the deleted row has the later LAST_DDL.
Expected behavior
The alive (DELETED IS NULL) row wins the dedup. The recreated table is ingested with its new columns; the old TABLE_ID does not cause the entity to be marked deleted.
Suggested fix
Either:
- filter
DELETED IS NULL before the ROW_NUMBER() window (if you only want alive tables in incremental output), or
- change the ORDER BY to favor alive rows on ties, e.g.
ORDER BY (CASE WHEN DELETED IS NULL THEN 1 ELSE 0 END) DESC, LAST_DDL DESC.
The first option is cleaner if delete handling is taken care of elsewhere (e.g., mark_tables_as_deleted post-process). The second preserves the "include deletions" semantics if you want them in the same pass.
Version
- Reproduced on
main (queries.py:33-56).
- Affects any Snowflake ingestion configured with
incremental.enabled: true.
Additional context
Independent of the bad-name / quote_name issue tracked separately — this affects perfectly valid table names whenever CREATE OR REPLACE TABLE (or DROP + CREATE) happens inside the incremental window.
Describe the bug
Snowflake's incremental table-list query (
SNOWFLAKE_INCREMENTAL_GET_TABLE_NAMESiningestion/src/metadata/ingestion/source/database/snowflake/queries.py) deduplicates rows inSNOWFLAKE.ACCOUNT_USAGE.tablesusing:For any table that was recreated via
CREATE OR REPLACE TABLE, ACCOUNT_USAGE retains two rows under the sameTABLE_NAME:new TABLE_ID,DELETED IS NULL)old TABLE_ID,DELETED = <drop_ts>)Snowflake updates the old row's
LAST_DDLto the drop timestamp, which is after the new row's createLAST_DDL(the engine creates the new table first, then drops the old one). The dedup therefore picks the deleted row as "the latest state of TABLE_NAME", and the alive table is:SnowflakeTableList.get_deleted()mark_tables_as_deleted_get_table_names_and_types's return value, so its newTABLE_IDand schema changes are never processedTo Reproduce
On a Snowflake account with
SNOWFLAKE.ACCOUNT_USAGEpopulated:ACCOUNT_USAGE.tablescatch up (1–3h):Inspect ACCOUNT_USAGE:
You'll see two rows for
T, and the deleted row has the laterLAST_DDL.Expected behavior
The alive (
DELETED IS NULL) row wins the dedup. The recreated table is ingested with its new columns; the oldTABLE_IDdoes not cause the entity to be marked deleted.Suggested fix
Either:
DELETED IS NULLbefore theROW_NUMBER()window (if you only want alive tables in incremental output), orORDER BY (CASE WHEN DELETED IS NULL THEN 1 ELSE 0 END) DESC, LAST_DDL DESC.The first option is cleaner if delete handling is taken care of elsewhere (e.g.,
mark_tables_as_deletedpost-process). The second preserves the "include deletions" semantics if you want them in the same pass.Version
main(queries.py:33-56).incremental.enabled: true.Additional context
Independent of the bad-name /
quote_nameissue tracked separately — this affects perfectly valid table names wheneverCREATE OR REPLACE TABLE(or DROP + CREATE) happens inside the incremental window.