Split long-running Databricks product tests to reduce suite timeout risk #29377
chenjian2664 wants to merge 2 commits into
Conversation
|
Can we simply replace the DELTA_LAKE_DATABRICKS group with DELTA_LAKE_OSS? Or are those tests dependent on the Databricks runtime? |
|
@ebyhr the test called |
|
@chenjian2664 The latter. OS Delta supports The context is that when we added the tests in #13331, all the tests in TestDeltaLakeCheckpointsCompatibility were running with Databricks. However, we later observed slow CI with Databricks runtime, so we started changing to OS Delta. |
|
@ebyhr I still appended a commit to split |
|
@ebyhr The CI failure occurs because OSS Delta doesn't support the statistics. I think we should still use Databricks to test it. |
|
/test-with-secrets sha=c6d0d50187fdc0e28bbacd0b4312c0d1f976da90 |
|
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/25764666754 |
Description
We've encountered timeout issues:
The run time is mostly dominated by testReadFromSchemaChangedDeepCloneTable in TestDeltaLakeCloneTableCompatibility:
and testDatabricksWriteStatsAsJsonEnabled in TestDeltaLakeCheckpointsCompatibility, which take more than 40 minutes.
Summary
- Split testReadFromSchemaChangedDeepCloneTable into two independent @Test methods (Partitioned / NonPartitioned) so a hang in one variant no longer blocks the other.
- Switch the testDatabricksWriteStatsAsJsonEnabled test group from DELTA_LAKE_DATABRICKS to OSS Delta (DELTA_LAKE_OSS).

No logic, assertions, or test coverage changed; all 16 type variants still run.
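The split of testReadFromSchemaChangedDeepCloneTable might look roughly like the sketch below. It is not the PR's actual diff: it assumes the original method ran the partitioned and non-partitioned variants back to back through a shared helper, the `Test` annotation is a local stand-in for TestNG's `org.testng.annotations.Test` so the sketch compiles without dependencies, and the `executed` list exists only to make the behaviour observable.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

public class SplitTestSketch
{
    // Stand-in for org.testng.annotations.Test (assumption: keeps the sketch dependency-free).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Test {}

    // Hypothetical recorder of which variants ran; not part of the real test class.
    static final List<String> executed = new ArrayList<>();

    // Before the split (assumed shape): a single @Test method called the helper
    // twice, so a hang in the first call also kept the second from running.

    @Test
    public void testReadFromSchemaChangedDeepCloneTablePartitioned()
    {
        testReadFromSchemaChangedDeepCloneTable(true);
    }

    @Test
    public void testReadFromSchemaChangedDeepCloneTableNonPartitioned()
    {
        testReadFromSchemaChangedDeepCloneTable(false);
    }

    // Shared body: the assertions stay in one place, only the entry points change.
    private void testReadFromSchemaChangedDeepCloneTable(boolean partitioned)
    {
        executed.add(partitioned ? "partitioned" : "non-partitioned");
    }

    public static void main(String[] args)
    {
        SplitTestSketch sketch = new SplitTestSketch();
        sketch.testReadFromSchemaChangedDeepCloneTablePartitioned();
        sketch.testReadFromSchemaChangedDeepCloneTableNonPartitioned();
        System.out.println(executed);
    }
}
```

Because each variant is its own test method, the runner reports and times them separately, which is what lets one variant finish even if the other stalls.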
Motivation
suite-delta-lake-databricks173 consistently exceeded its 2-hour CI timeout. Root cause analysis showed:
- TestDeltaLakeCheckpointsCompatibility consumed ~1h10m, dominated by the 16-variant parameterised test running serially
- testReadFromSchemaChangedDeepCloneTable hung for 8+ minutes on a Databricks Thrift pollTillOperationFinished SSL read (unaffected by SocketTimeout=120), then took 24 min total, leaving no budget for subsequent tests
Test plan
- suite-delta-lake-databricks173 completes within the 2-hour budget