Skip to content

Snowflake incremental ingestion mis-marks CREATE OR REPLACE'd tables as deleted #28053

@ulixius9

Description

@ulixius9

Describe the bug

Snowflake's incremental table-list query (SNOWFLAKE_INCREMENTAL_GET_TABLE_NAMES in ingestion/src/metadata/ingestion/source/database/snowflake/queries.py) deduplicates rows in SNOWFLAKE.ACCOUNT_USAGE.tables using:

ROW_NUMBER() OVER (partition by TABLE_NAME order by LAST_DDL desc)
...
WHERE ROW_NUMBER = 1

For any table that was recreated via CREATE OR REPLACE TABLE, ACCOUNT_USAGE retains two rows under the same TABLE_NAME:

  • the new alive row (new TABLE_ID, DELETED IS NULL)
  • the old dropped row (old TABLE_ID, DELETED = <drop_ts>)

Snowflake updates the old row's LAST_DDL to the drop timestamp, which is after the new row's create LAST_DDL (the engine creates the new table first, then drops the old one). The dedup therefore picks the deleted row as "the latest state of TABLE_NAME", and the alive table is:

  • placed into SnowflakeTableList.get_deleted()
  • queued for deletion in OpenMetadata via mark_tables_as_deleted
  • never yielded to _get_table_names_and_types's return value, so its new TABLE_ID and schema changes are never processed

To Reproduce

On a Snowflake account with SNOWFLAKE.ACCOUNT_USAGE populated:

  1. Create a table and let ACCOUNT_USAGE.tables catch up (1–3h):
    CREATE TABLE TEST_DB.PUBLIC.T (ID NUMBER);
  2. Replace it:
    CREATE OR REPLACE TABLE TEST_DB.PUBLIC.T (ID NUMBER, NAME VARCHAR);
  3. Wait for ACCOUNT_USAGE to reflect both rows, then run the Snowflake ingestion in incremental mode with a window that covers both events.

Inspect ACCOUNT_USAGE:

SELECT TABLE_ID, TABLE_NAME, LAST_DDL, DELETED
FROM SNOWFLAKE.ACCOUNT_USAGE.tables
WHERE TABLE_SCHEMA = 'PUBLIC' AND TABLE_NAME = 'T';

You'll see two rows for T, and the deleted row has the later LAST_DDL.

Expected behavior

The alive (DELETED IS NULL) row wins the dedup. The recreated table is ingested with its new columns; the old TABLE_ID does not cause the entity to be marked deleted.

Suggested fix

Either:

  • filter DELETED IS NULL before the ROW_NUMBER() window (if you only want alive tables in incremental output), or
  • change the ORDER BY to favor alive rows on ties, e.g. ORDER BY (CASE WHEN DELETED IS NULL THEN 1 ELSE 0 END) DESC, LAST_DDL DESC.

The first option is cleaner if delete handling is taken care of elsewhere (e.g., mark_tables_as_deleted post-process). The second preserves the "include deletions" semantics if you want them in the same pass.

Version

  • Reproduced on main (queries.py:33-56).
  • Affects any Snowflake ingestion configured with incremental.enabled: true.

Additional context

Independent of the bad-name / quote_name issue tracked separately — this affects perfectly valid table names whenever CREATE OR REPLACE TABLE (or DROP + CREATE) happens inside the incremental window.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions