Add test suite for IPC:ParallelFinish hang reproduction by NikolayS · Pull Request #1 · NikolayS/postgres

NikolayS · 2025-11-21T19:33:09Z

Adds comprehensive test suite to reproduce Theory 1 (shared memory queue saturation) for the recurring IPC:ParallelFinish hang issue.

Test files:

test_parallel_queue_saturation.sql: Main reproduction test with 250K dead tuples and flood_error_queue() function to saturate 16KB error queues
monitor_parallel_hang.sql: Monitoring script to observe wait events
test_parallel_hang_alternative.sql: 7 alternative test approaches
run_reproduction_test.sh: Automated setup and execution script
test_parallel_hang_README.md: Complete documentation

Theory being tested: Workers block indefinitely when error queues fill up, creating circular dependency where workers need leader to drain queue but leader only drains when ParallelMessagePending flag is set, which requires successful worker message send (impossible when queue full).

Note: Tests target PostgreSQL 16.3 specifically per production environment.

truncate_useless_pathkeys() seems to have neglected to account for PathKeys that might be useful for WindowClause evaluation. Modify it so that it properly accounts for that. Making this work required adjusting two things: 1. Change from checking query_pathkeys to check sort_pathkeys instead. 2. Add explicit check for window_pathkeys For #1, query_pathkeys gets set in standard_qp_callback() according to the sort order requirements for the first operation to be applied after the join planner is finished, so this changes depending on which upper planner operations a particular query needs. If the query has window functions and no GROUP BY, then query_pathkeys gets set to window_pathkeys. Before this change, this meant PathKeys useful for the ORDER BY were not accounted for in queries with window functions. Because of #1, #2 is now required so that we explicitly check to ensure we don't truncate away PathKeys useful for window functions. Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com

cost_tidrangescan() was setting the disabled_nodes value correctly, and then immediately resetting it to zero, due to poor code editing on my part. materialized_finished_plan correctly set matpath.parent to zero, but forgot to also set matpath.parallel_workers = 0, causing an access to uninitialized memory in cost_material. (This shouldn't result in any real problem, but it makes valgrind unhappy.) reparameterize_path was dereferencing a variable before verifying that it was not NULL. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> (issue #1) Reported-by: Michael Paquier <michael@paquier.xyz> (issue #1) Diagnosed-by: Lukas Fittl <lukas@fittl.com> (issue #1) Reported-by: Zsolt Parragi <zsolt.parragi@percona.com> (issue #2) Reported-by: Richard Guo <guofenglinux@gmail.com> (issue #3) Discussion: http://postgr.es/m/CAN4CZFPvwjNJEZ_JT9Y67yR7C=KMNa=LNefOB8ZY7TKDcmAXOA@mail.gmail.com Discussion: http://postgr.es/m/aXrnPgrq6Gggb5TG@paquier.xyz

The previous version checked PromoteIsTriggered() in the wait loop, which only reads the LocalPromoteIsTriggered cache. The cache is populated by CheckForStandbyTrigger(), which does the real work: detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered(). Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays false forever — so pg_promote()'s signal went unnoticed in our loop. Empirically verified the bug and fix: - Before (PromoteIsTriggered only): pg_promote(wait => true, wait_seconds => 30) returns FALSE after 30 seconds, standby still in_recovery. - After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second, standby is promoted. Raw test run from /tmp/verify_promote.sh: [18:56:24] wait for standby to enter pause [18:56:29] pause state: paused [18:56:29] *** calling pg_promote() now *** [18:56:30] pg_promote returned in 0.937961725s [18:56:30] pg_is_in_recovery post-promote: f [18:56:30] PASS: promote escaped the pause in under 10s CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes it extern (drop the static, add the extern in xlogrecovery.h). Mirrors the existing recoveryPausesHere() escape loop in the same file.

Part I: nine-section essay on 2016-2025 PostgreSQL contributor activity, built on Robert Haas's published dataset plus Andrey Borodin's 2025 Reviewed-by tag parse. Part II: five additional lenses (affiliation proxy, subdirectory churn, back-branch activity, commitfest status funnel, mailing-list talkers vs code authors), built directly from git and scraped commitfest.postgresql.org. See docs/README.md and issue #1 for details.

1. Make it so test_random_operations() can accept a NULL to have the function select a random seed. 2. Widen the seed parameter of test_random_operations() to bigint. Without that, it'll be impossible to run the function with a seed which was selected by GetCurrentTimestamp(), and if a randomly selected seed ever results in a failure, we'll likely want to run with the same seed to debug the issue. 3. Report the seed in the error messages in test_random_operations(). If the buildfarm were ever to fail there, we'd certainly want to know what this was. 4. Add CHECK_FOR_INTERRUPTS() to test_random_operations(). Someone might run with a large num_ops and they'd have no way to cancel the query. 5. Minor cosmetic fixes; header order and whitespace issue. To allow #1, the STRICT modifier had to be removed. The additional prechecks were added as I didn't see how else to handle someone passing those parameters as NULL. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Greg Burd <greg@burd.me> Discussion: https://postgr.es/m/CAApHDvrDW9W72vAr7h7XeCu7+Qz-_Vff02Q+RPPuVeM0Qf0MCw@mail.gmail.com

The previous version checked PromoteIsTriggered() in the wait loop, which only reads the LocalPromoteIsTriggered cache. The cache is populated by CheckForStandbyTrigger(), which does the real work: detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered(). Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays false forever — so pg_promote()'s signal went unnoticed in our loop. Empirically verified the bug and fix: - Before (PromoteIsTriggered only): pg_promote(wait => true, wait_seconds => 30) returns FALSE after 30 seconds, standby still in_recovery. - After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second, standby is promoted. Raw test run from /tmp/verify_promote.sh: [18:56:24] wait for standby to enter pause [18:56:29] pause state: paused [18:56:29] *** calling pg_promote() now *** [18:56:30] pg_promote returned in 0.937961725s [18:56:30] pg_is_in_recovery post-promote: f [18:56:30] PASS: promote escaped the pause in under 10s CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes it extern (drop the static, add the extern in xlogrecovery.h). Mirrors the existing recoveryPausesHere() escape loop in the same file.

Adds comprehensive test suite to reproduce Theory 1 (shared memory queue saturation) for the recurring IPC:ParallelFinish hang issue. Test files: - test_parallel_queue_saturation.sql: Main reproduction test with 250K dead tuples and flood_error_queue() function to saturate 16KB error queues - monitor_parallel_hang.sql: Monitoring script to observe wait events - test_parallel_hang_alternative.sql: 7 alternative test approaches - run_reproduction_test.sh: Automated setup and execution script - test_parallel_hang_README.md: Complete documentation Theory being tested: Workers block indefinitely when error queues fill up, creating circular dependency where workers need leader to drain queue but leader only drains when ParallelMessagePending flag is set, which requires successful worker message send (impossible when queue full). Note: Tests target PostgreSQL 16.3 specifically per production environment.

Analyzes two critical commits merged after PostgreSQL 16.3: - 6f6521d (16.4): Don't enter parallel mode when holding interrupts - 06424e9 (16.5): Improved fix for interrupt handling These commits directly address the IPC:ParallelFinish hang issue by preventing parallel worker launch when leader cannot process interrupts, which eliminates the deadlock scenario where workers block on full error queues while leader cannot drain them. Recommendation: Upgrade to PostgreSQL 16.5+ to resolve production issue.

The previous version checked PromoteIsTriggered() in the wait loop, which only reads the LocalPromoteIsTriggered cache. The cache is populated by CheckForStandbyTrigger(), which does the real work: detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered(). Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays false forever — so pg_promote()'s signal went unnoticed in our loop. Empirically verified the bug and fix: - Before (PromoteIsTriggered only): pg_promote(wait => true, wait_seconds => 30) returns FALSE after 30 seconds, standby still in_recovery. - After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second, standby is promoted. Raw test run from /tmp/verify_promote.sh: [18:56:24] wait for standby to enter pause [18:56:29] pause state: paused [18:56:29] *** calling pg_promote() now *** [18:56:30] pg_promote returned in 0.937961725s [18:56:30] pg_is_in_recovery post-promote: f [18:56:30] PASS: promote escaped the pause in under 10s CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes it extern (drop the static, add the extern in xlogrecovery.h). Mirrors the existing recoveryPausesHere() escape loop in the same file.

NikolayS mentioned this pull request Mar 19, 2026

PoC: USDT static tracepoints for wait event tracing #18

Open

NikolayS mentioned this pull request Apr 16, 2026

Sprint 0: Prove the Pipe — logical decoding from an archive-fed standby #25

Open

9 tasks

NikolayS force-pushed the master branch from bee8ce7 to 1f62dbf Compare April 19, 2026 03:19

NikolayS force-pushed the claude/fix-stuck-query-017nKNrbUg5YLp2qxjFZvQg4 branch from 68ca915 to b740b39 Compare April 22, 2026 17:34

NikolayS force-pushed the master branch from e471dc5 to 1f62dbf Compare April 22, 2026 17:50

claude added 2 commits April 22, 2026 11:17

NikolayS force-pushed the claude/fix-stuck-query-017nKNrbUg5YLp2qxjFZvQg4 branch from b740b39 to 1dc55c7 Compare April 22, 2026 18:17

NikolayS mentioned this pull request Apr 24, 2026

Crash-recovery testing: beyond kill -9 (brainstorm + experiments) #31

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add test suite for IPC:ParallelFinish hang reproduction#1

Add test suite for IPC:ParallelFinish hang reproduction#1
NikolayS wants to merge 2 commits intomasterfrom
claude/fix-stuck-query-017nKNrbUg5YLp2qxjFZvQg4

NikolayS commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

NikolayS commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants