Skip to content

Add test suite for IPC:ParallelFinish hang reproduction#1

Open
NikolayS wants to merge 2 commits intomasterfrom
claude/fix-stuck-query-017nKNrbUg5YLp2qxjFZvQg4
Open

Add test suite for IPC:ParallelFinish hang reproduction#1
NikolayS wants to merge 2 commits intomasterfrom
claude/fix-stuck-query-017nKNrbUg5YLp2qxjFZvQg4

Conversation

@NikolayS
Copy link
Copy Markdown
Owner

Adds comprehensive test suite to reproduce Theory 1 (shared memory queue saturation) for the recurring IPC:ParallelFinish hang issue.

Test files:

  • test_parallel_queue_saturation.sql: Main reproduction test with 250K dead tuples and flood_error_queue() function to saturate 16KB error queues
  • monitor_parallel_hang.sql: Monitoring script to observe wait events
  • test_parallel_hang_alternative.sql: 7 alternative test approaches
  • run_reproduction_test.sh: Automated setup and execution script
  • test_parallel_hang_README.md: Complete documentation

Theory being tested: Workers block indefinitely when error queues fill up, creating circular dependency where workers need leader to drain queue but leader only drains when ParallelMessagePending flag is set, which requires successful worker message send (impossible when queue full).

Note: Tests target PostgreSQL 16.3 specifically per production environment.

NikolayS pushed a commit that referenced this pull request Dec 23, 2025
truncate_useless_pathkeys() seems to have neglected to account for
PathKeys that might be useful for WindowClause evaluation.  Modify it so
that it properly accounts for that.

Making this work required adjusting two things:

1. Change from checking query_pathkeys to check sort_pathkeys instead.
2. Add explicit check for window_pathkeys

For #1, query_pathkeys gets set in standard_qp_callback() according to the
sort order requirements for the first operation to be applied after the
join planner is finished, so this changes depending on which upper
planner operations a particular query needs.  If the query has window
functions and no GROUP BY, then query_pathkeys gets set to
window_pathkeys.  Before this change, this meant PathKeys useful for the
ORDER BY were not accounted for in queries with window functions.

Because of #1, #2 is now required so that we explicitly check to ensure
we don't truncate away PathKeys useful for window functions.

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com
NikolayS pushed a commit that referenced this pull request Feb 6, 2026
cost_tidrangescan() was setting the disabled_nodes value correctly,
and then immediately resetting it to zero, due to poor code editing on
my part.

materialized_finished_plan correctly set matpath.parent to
zero, but forgot to also set matpath.parallel_workers = 0, causing
an access to uninitialized memory in cost_material. (This shouldn't
result in any real problem, but it makes valgrind unhappy.)

reparameterize_path was dereferencing a variable before verifying that
it was not NULL.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us> (issue #1)
Reported-by: Michael Paquier <michael@paquier.xyz> (issue #1)
Diagnosed-by: Lukas Fittl <lukas@fittl.com> (issue #1)
Reported-by: Zsolt Parragi <zsolt.parragi@percona.com> (issue #2)
Reported-by: Richard Guo <guofenglinux@gmail.com> (issue #3)
Discussion: http://postgr.es/m/CAN4CZFPvwjNJEZ_JT9Y67yR7C=KMNa=LNefOB8ZY7TKDcmAXOA@mail.gmail.com
Discussion: http://postgr.es/m/aXrnPgrq6Gggb5TG@paquier.xyz
NikolayS added a commit that referenced this pull request Apr 16, 2026
The previous version checked PromoteIsTriggered() in the wait loop,
which only reads the LocalPromoteIsTriggered cache. The cache is
populated by CheckForStandbyTrigger(), which does the real work:
detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered().
Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays
false forever — so pg_promote()'s signal went unnoticed in our loop.

Empirically verified the bug and fix:
- Before (PromoteIsTriggered only): pg_promote(wait => true,
  wait_seconds => 30) returns FALSE after 30 seconds, standby still
  in_recovery.
- After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second,
  standby is promoted.

Raw test run from /tmp/verify_promote.sh:
  [18:56:24] wait for standby to enter pause
  [18:56:29] pause state: paused
  [18:56:29] *** calling pg_promote() now ***
  [18:56:30] pg_promote returned in 0.937961725s
  [18:56:30] pg_is_in_recovery post-promote: f
  [18:56:30] PASS: promote escaped the pause in under 10s

CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes
it extern (drop the static, add the extern in xlogrecovery.h). Mirrors
the existing recoveryPausesHere() escape loop in the same file.
NikolayS added a commit that referenced this pull request Apr 19, 2026
Part I: nine-section essay on 2016-2025 PostgreSQL contributor activity,
built on Robert Haas's published dataset plus Andrey Borodin's 2025
Reviewed-by tag parse.

Part II: five additional lenses (affiliation proxy, subdirectory churn,
back-branch activity, commitfest status funnel, mailing-list talkers vs
code authors), built directly from git and scraped commitfest.postgresql.org.

See docs/README.md and issue #1 for details.
NikolayS pushed a commit that referenced this pull request Apr 22, 2026
1. Make it so test_random_operations() can accept a NULL to have the
   function select a random seed.
2. Widen the seed parameter of test_random_operations() to bigint.
   Without that, it'll be impossible to run the function with a seed
   which was selected by GetCurrentTimestamp(), and if a randomly
   selected seed ever results in a failure, we'll likely want to run
   with the same seed to debug the issue.
3. Report the seed in the error messages in test_random_operations().
   If the buildfarm were ever to fail there, we'd certainly want to know
   what this was.
4. Add CHECK_FOR_INTERRUPTS() to test_random_operations().  Someone might
   run with a large num_ops and they'd have no way to cancel the query.
5. Minor cosmetic fixes; header order and whitespace issue.

To allow #1, the STRICT modifier had to be removed.  The additional
prechecks were added as I didn't see how else to handle someone passing
those parameters as NULL.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Discussion: https://postgr.es/m/CAApHDvrDW9W72vAr7h7XeCu7+Qz-_Vff02Q+RPPuVeM0Qf0MCw@mail.gmail.com
NikolayS added a commit that referenced this pull request Apr 22, 2026
The previous version checked PromoteIsTriggered() in the wait loop,
which only reads the LocalPromoteIsTriggered cache. The cache is
populated by CheckForStandbyTrigger(), which does the real work:
detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered().
Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays
false forever — so pg_promote()'s signal went unnoticed in our loop.

Empirically verified the bug and fix:
- Before (PromoteIsTriggered only): pg_promote(wait => true,
  wait_seconds => 30) returns FALSE after 30 seconds, standby still
  in_recovery.
- After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second,
  standby is promoted.

Raw test run from /tmp/verify_promote.sh:
  [18:56:24] wait for standby to enter pause
  [18:56:29] pause state: paused
  [18:56:29] *** calling pg_promote() now ***
  [18:56:30] pg_promote returned in 0.937961725s
  [18:56:30] pg_is_in_recovery post-promote: f
  [18:56:30] PASS: promote escaped the pause in under 10s

CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes
it extern (drop the static, add the extern in xlogrecovery.h). Mirrors
the existing recoveryPausesHere() escape loop in the same file.
@NikolayS NikolayS force-pushed the claude/fix-stuck-query-017nKNrbUg5YLp2qxjFZvQg4 branch from 68ca915 to b740b39 Compare April 22, 2026 17:34
claude added 2 commits April 22, 2026 11:17
Adds comprehensive test suite to reproduce Theory 1 (shared memory queue
saturation) for the recurring IPC:ParallelFinish hang issue.

Test files:
- test_parallel_queue_saturation.sql: Main reproduction test with 250K dead
  tuples and flood_error_queue() function to saturate 16KB error queues
- monitor_parallel_hang.sql: Monitoring script to observe wait events
- test_parallel_hang_alternative.sql: 7 alternative test approaches
- run_reproduction_test.sh: Automated setup and execution script
- test_parallel_hang_README.md: Complete documentation

Theory being tested: Workers block indefinitely when error queues fill up,
creating circular dependency where workers need leader to drain queue but
leader only drains when ParallelMessagePending flag is set, which requires
successful worker message send (impossible when queue full).

Note: Tests target PostgreSQL 16.3 specifically per production environment.
Analyzes two critical commits merged after PostgreSQL 16.3:
- 6f6521d (16.4): Don't enter parallel mode when holding interrupts
- 06424e9 (16.5): Improved fix for interrupt handling

These commits directly address the IPC:ParallelFinish hang issue by
preventing parallel worker launch when leader cannot process interrupts,
which eliminates the deadlock scenario where workers block on full error
queues while leader cannot drain them.

Recommendation: Upgrade to PostgreSQL 16.5+ to resolve production issue.
@NikolayS NikolayS force-pushed the claude/fix-stuck-query-017nKNrbUg5YLp2qxjFZvQg4 branch from b740b39 to 1dc55c7 Compare April 22, 2026 18:17
NikolayS added a commit that referenced this pull request Apr 22, 2026
The previous version checked PromoteIsTriggered() in the wait loop,
which only reads the LocalPromoteIsTriggered cache. The cache is
populated by CheckForStandbyTrigger(), which does the real work:
detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered().
Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays
false forever — so pg_promote()'s signal went unnoticed in our loop.

Empirically verified the bug and fix:
- Before (PromoteIsTriggered only): pg_promote(wait => true,
  wait_seconds => 30) returns FALSE after 30 seconds, standby still
  in_recovery.
- After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second,
  standby is promoted.

Raw test run from /tmp/verify_promote.sh:
  [18:56:24] wait for standby to enter pause
  [18:56:29] pause state: paused
  [18:56:29] *** calling pg_promote() now ***
  [18:56:30] pg_promote returned in 0.937961725s
  [18:56:30] pg_is_in_recovery post-promote: f
  [18:56:30] PASS: promote escaped the pause in under 10s

CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes
it extern (drop the static, add the extern in xlogrecovery.h). Mirrors
the existing recoveryPausesHere() escape loop in the same file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants