Split long-running Databricks product tests to reduce suite timeout risk #29377

Open
chenjian2664 wants to merge 2 commits into trinodb:master from chenjian2664:split-flaky-databricks-tests
Conversation

@chenjian2664
Contributor

@chenjian2664 chenjian2664 commented May 8, 2026

Description

We have encountered suite timeouts.

The timeouts are mostly dominated by testReadFromSchemaChangedDeepCloneTable in TestDeltaLakeCloneTableCompatibility:

Test execution exceeded timeout of 2.00h
java.lang.RuntimeException: java.lang.InterruptedException
  at io.trino.tests.product.launcher.env.EnvironmentListener.lambda$tryInvokeListener$0(EnvironmentListener.java:61)
Caused by: java.lang.InterruptedException
  at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:430)

Thread dump at 8-min hang shows:
io.trino.tests.product.deltalake.TestDeltaLakeCloneTableCompatibility.testReadFromSchemaChangedDeepCloneTable running for 8.05m
Stuck on: com.databricks.jdbc.dbclient.impl.thrift.DatabricksThriftAccessor.pollTillOperationFinished(DatabricksThriftAccessor.java:313)
  -> GetOperationStatus -> DatabricksHttpClient.execute -> SSLSocketImpl.readApplicationRecord (blocking on HTTP response from Thrift server)

Suite result: suite-delta-lake-databricks173 | singlenode-delta-lake-databricks173 | config-default | FAILED | 2.00h | Tests exited with code 1

Slowest tests:
- TestDeltaLakeCheckpointsCompatibility: 1.16h total
- TestDeltaLakeCloneTableCompatibility.testReadFromSchemaChangedDeepCloneTable: 24.06m
- TestDeltaLakeColumnMappingMode: 12.08m total

and by testDatabricksWriteStatsAsJsonEnabled in TestDeltaLakeCheckpointsCompatibility, which takes more than 40 minutes:

tests               | 2026-04-23 19:02:54 INFO: [20 of 65] io.trino.tests.product.deltalake.TestDeltaLakeCheckpointsCompatibility.testDatabricksWriteStatsAsJsonEnabled [boolean, true, 0.0, null] (Groups: profile_specific_tests, delta-lake-databricks)

....
....
....

tests               | 2026-04-23 19:48:49 INFO: Test io.trino.tests.product.deltalake.TestDeltaLakeCheckpointsCompatibility.testDatabricksWriteStatsAsJsonEnabled [struct<x bigint>, named_struct('x', 1), null, null] took 2.42m
tests               | 2026-04-23 19:48:50 INFO: SUCCESS     /    io.trino.tests.product.deltalake.TestDeltaLakeCheckpointsCompatibility.testDatabricksWriteStatsAsJsonEnabled [struct<x bigint>, named_struct('x', 1), null, null] (Groups: profile_specific_tests, delta-lake-databricks) took 2 minutes and 25 seconds
tests               | 2026-04-23 19:48:50 INFO: [36 of 65] 
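The elapsed time for the 16 variants can be read off the first and last timestamps in the excerpt above. A minimal sketch, assuming the `yyyy-MM-dd HH:mm:ss` prefix shown in the log; the class and method names are hypothetical and not part of the Trino code base:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogElapsed {
    // Timestamp layout copied from the log excerpt, e.g. "2026-04-23 19:02:54".
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns the wall-clock time between two log timestamps.
    static Duration elapsed(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, FORMAT), LocalDateTime.parse(end, FORMAT));
    }

    public static void main(String[] args) {
        // Variant [20 of 65] started at 19:02:54; the last variant finished at 19:48:50.
        Duration d = elapsed("2026-04-23 19:02:54", "2026-04-23 19:48:50");
        System.out.printf("%d min %d s%n", d.toMinutes(), d.toSecondsPart());
    }
}
```

For these two timestamps the difference is 45 minutes 56 seconds, which matches the "more than 40 minutes" figure quoted for this test.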

Summary

  • Split testReadFromSchemaChangedDeepCloneTable into two independent @Test methods (Partitioned / NonPartitioned) so that a hang in one variant no longer blocks the other.
  • Switch the testDatabricksWriteStatsAsJsonEnabled test group from DELTA_LAKE_DATABRICKS to DELTA_LAKE_OSS so that it runs against OSS Delta.
    No logic, assertions, or test coverage changed; all 16 type variants still run.
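The shape of the split might look roughly like the sketch below. This is an illustration only, not the code in this PR: the method names, the group strings, and the `runScenario` helper are assumptions, and the nested `@Test` annotation is a stand-in for `org.testng.annotations.Test` so that the sketch compiles without TestNG on the classpath.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

public class SplitTestSketch {
    // Stand-in for org.testng.annotations.Test (assumption: only `groups` is needed here).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Test {
        String[] groups() default {};
    }

    // Records which variants ran; a real product test would run SQL and assert on the results.
    static final List<String> executed = new ArrayList<>();

    @Test(groups = {"delta-lake-databricks", "profile_specific_tests"})
    public void testReadFromSchemaChangedDeepCloneTablePartitioned() {
        runScenario(true);
    }

    @Test(groups = {"delta-lake-databricks", "profile_specific_tests"})
    public void testReadFromSchemaChangedDeepCloneTableNonPartitioned() {
        runScenario(false);
    }

    // Shared body: the split changes only how the runner schedules and reports the variants.
    private void runScenario(boolean partitioned) {
        executed.add(partitioned ? "partitioned" : "non-partitioned");
    }

    public static void main(String[] args) {
        SplitTestSketch sketch = new SplitTestSketch();
        sketch.testReadFromSchemaChangedDeepCloneTablePartitioned();
        sketch.testReadFromSchemaChangedDeepCloneTableNonPartitioned();
        System.out.println(executed);
    }
}
```

Because each variant is its own test method, the runner reports and times them separately, so a hang in one is attributed to that variant alone.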

Motivation

suite-delta-lake-databricks173 consistently exceeded its 2-hour CI timeout. Root cause analysis showed:

  1. TestDeltaLakeCheckpointsCompatibility consumed ~1h10m, dominated by the 16-variant parameterised test running serially
  2. testReadFromSchemaChangedDeepCloneTable hung for 8+ minutes on a Databricks Thrift pollTillOperationFinished SSL read (unaffected by SocketTimeout=120), then took 24 min total, leaving no budget for subsequent tests
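To make the budget pressure concrete, the three slowest entries from the suite report can be subtracted from the 2-hour timeout. A rough sketch, assuming those tests ran serially without overlap and ignoring environment startup time; the class and helper names are hypothetical:

```java
import java.time.Duration;

public class SuiteBudget {
    // Converts fractional minutes (as printed in the report) to a Duration, rounded to whole seconds.
    static Duration minutes(double m) {
        return Duration.ofSeconds(Math.round(m * 60));
    }

    // Time left in the budget after the given tests have run.
    static Duration remaining(Duration budget, Duration... tests) {
        Duration left = budget;
        for (Duration t : tests) {
            left = left.minus(t);
        }
        return left;
    }

    public static void main(String[] args) {
        Duration left = remaining(
                Duration.ofHours(2),   // suite timeout
                minutes(1.16 * 60),    // TestDeltaLakeCheckpointsCompatibility: 1.16h total
                minutes(24.06),        // testReadFromSchemaChangedDeepCloneTable: 24.06m
                minutes(12.08));       // TestDeltaLakeColumnMappingMode: 12.08m total
        System.out.println(left.toMinutes() + " min left for all remaining tests");
    }
}
```

Under these assumptions only about 14 minutes remain for every other test in the suite.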

Test plan

  • Verify suite-delta-lake-databricks173 completes within the 2-hour budget
  • Confirm all split test methods appear in the TestNG report and pass

@chenjian2664 chenjian2664 marked this pull request as ready for review May 8, 2026 09:18
Contributor

@krvikash krvikash left a comment

LGTM % comments.

@ebyhr
Member

ebyhr commented May 8, 2026

Can we simply replace the DELTA_LAKE_DATABRICKS group with DELTA_LAKE_OSS? Or are those tests dependent on the Databricks runtime?

@chenjian2664
Contributor Author

chenjian2664 commented May 8, 2026

@ebyhr The test is called testDatabricksWriteStatsAsJsonEnabled, so I think it requires Databricks. Isn't its purpose to test the Databricks behavior?
Or do you want me to rename the test?

@chenjian2664 chenjian2664 force-pushed the split-flaky-databricks-tests branch from 256481f to f7a2d37 Compare May 8, 2026 10:15
@ebyhr
Member

ebyhr commented May 8, 2026

@chenjian2664 The latter. As far as I can tell from https://github.com/delta-io/delta, OSS Delta supports delta.checkpoint.writeStatsAsJson and delta.checkpoint.writeStatsAsStruct, so we should be able to switch the group.

The context is that when we added the tests in #13331, all the tests in TestDeltaLakeCheckpointsCompatibility were running with Databricks. However, we later observed slow CI with the Databricks runtime, so we started moving them to OSS Delta.

@chenjian2664 chenjian2664 force-pushed the split-flaky-databricks-tests branch from 3020419 to 9d51a26 Compare May 11, 2026 03:06
ebyhr previously approved these changes May 11, 2026
@chenjian2664 chenjian2664 force-pushed the split-flaky-databricks-tests branch from d48b820 to faa4afc Compare May 11, 2026 03:13
@chenjian2664
Contributor Author

@ebyhr I still appended a commit that splits testReadFromSchemaChangedDeepCloneTable into partitioned and non-partitioned tests. Since OSS Delta does not support deep clone, we can't handle this test simply by changing the test group. Please have a look.

@chenjian2664
Contributor Author

@ebyhr The CI failure is because OSS Delta doesn't support the statistics, so I am thinking we should still use Databricks to test it.

@chenjian2664 chenjian2664 force-pushed the split-flaky-databricks-tests branch from faa4afc to bdeb4ca Compare May 11, 2026 09:07
@chenjian2664 chenjian2664 force-pushed the split-flaky-databricks-tests branch from bdeb4ca to c6d0d50 Compare May 12, 2026 10:47
@ebyhr ebyhr added this pull request to the merge queue May 12, 2026
@ebyhr ebyhr removed this pull request from the merge queue due to a manual request May 12, 2026
@ebyhr
Member

ebyhr commented May 12, 2026

/test-with-secrets sha=c6d0d50187fdc0e28bbacd0b4312c0d1f976da90

@github-actions

github-actions Bot commented May 12, 2026

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/25764666754
