Add test suite for IPC:ParallelFinish hang reproduction#1
Open
Add test suite for IPC:ParallelFinish hang reproduction#1
Conversation
NikolayS
pushed a commit
that referenced
this pull request
Dec 23, 2025
truncate_useless_pathkeys() seems to have neglected to account for PathKeys that might be useful for WindowClause evaluation. Modify it so that it properly accounts for that. Making this work required adjusting two things: 1. Change from checking query_pathkeys to check sort_pathkeys instead. 2. Add explicit check for window_pathkeys For #1, query_pathkeys gets set in standard_qp_callback() according to the sort order requirements for the first operation to be applied after the join planner is finished, so this changes depending on which upper planner operations a particular query needs. If the query has window functions and no GROUP BY, then query_pathkeys gets set to window_pathkeys. Before this change, this meant PathKeys useful for the ORDER BY were not accounted for in queries with window functions. Because of #1, #2 is now required so that we explicitly check to ensure we don't truncate away PathKeys useful for window functions. Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com
NikolayS
pushed a commit
that referenced
this pull request
Feb 6, 2026
cost_tidrangescan() was setting the disabled_nodes value correctly, and then immediately resetting it to zero, due to poor code editing on my part. materialized_finished_plan correctly set matpath.parent to zero, but forgot to also set matpath.parallel_workers = 0, causing an access to uninitialized memory in cost_material. (This shouldn't result in any real problem, but it makes valgrind unhappy.) reparameterize_path was dereferencing a variable before verifying that it was not NULL. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> (issue #1) Reported-by: Michael Paquier <michael@paquier.xyz> (issue #1) Diagnosed-by: Lukas Fittl <lukas@fittl.com> (issue #1) Reported-by: Zsolt Parragi <zsolt.parragi@percona.com> (issue #2) Reported-by: Richard Guo <guofenglinux@gmail.com> (issue #3) Discussion: http://postgr.es/m/CAN4CZFPvwjNJEZ_JT9Y67yR7C=KMNa=LNefOB8ZY7TKDcmAXOA@mail.gmail.com Discussion: http://postgr.es/m/aXrnPgrq6Gggb5TG@paquier.xyz
9 tasks
NikolayS
added a commit
that referenced
this pull request
Apr 16, 2026
The previous version checked PromoteIsTriggered() in the wait loop, which only reads the LocalPromoteIsTriggered cache. The cache is populated by CheckForStandbyTrigger(), which does the real work: detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered(). Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays false forever — so pg_promote()'s signal went unnoticed in our loop. Empirically verified the bug and fix: - Before (PromoteIsTriggered only): pg_promote(wait => true, wait_seconds => 30) returns FALSE after 30 seconds, standby still in_recovery. - After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second, standby is promoted. Raw test run from /tmp/verify_promote.sh: [18:56:24] wait for standby to enter pause [18:56:29] pause state: paused [18:56:29] *** calling pg_promote() now *** [18:56:30] pg_promote returned in 0.937961725s [18:56:30] pg_is_in_recovery post-promote: f [18:56:30] PASS: promote escaped the pause in under 10s CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes it extern (drop the static, add the extern in xlogrecovery.h). Mirrors the existing recoveryPausesHere() escape loop in the same file.
NikolayS
added a commit
that referenced
this pull request
Apr 19, 2026
Part I: nine-section essay on 2016-2025 PostgreSQL contributor activity, built on Robert Haas's published dataset plus Andrey Borodin's 2025 Reviewed-by tag parse. Part II: five additional lenses (affiliation proxy, subdirectory churn, back-branch activity, commitfest status funnel, mailing-list talkers vs code authors), built directly from git and scraped commitfest.postgresql.org. See docs/README.md and issue #1 for details.
NikolayS
pushed a commit
that referenced
this pull request
Apr 22, 2026
1. Make it so test_random_operations() can accept a NULL to have the function select a random seed. 2. Widen the seed parameter of test_random_operations() to bigint. Without that, it'll be impossible to run the function with a seed which was selected by GetCurrentTimestamp(), and if a randomly selected seed ever results in a failure, we'll likely want to run with the same seed to debug the issue. 3. Report the seed in the error messages in test_random_operations(). If the buildfarm were ever to fail there, we'd certainly want to know what this was. 4. Add CHECK_FOR_INTERRUPTS() to test_random_operations(). Someone might run with a large num_ops and they'd have no way to cancel the query. 5. Minor cosmetic fixes; header order and whitespace issue. To allow #1, the STRICT modifier had to be removed. The additional prechecks were added as I didn't see how else to handle someone passing those parameters as NULL. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Greg Burd <greg@burd.me> Discussion: https://postgr.es/m/CAApHDvrDW9W72vAr7h7XeCu7+Qz-_Vff02Q+RPPuVeM0Qf0MCw@mail.gmail.com
NikolayS
added a commit
that referenced
this pull request
Apr 22, 2026
The previous version checked PromoteIsTriggered() in the wait loop, which only reads the LocalPromoteIsTriggered cache. The cache is populated by CheckForStandbyTrigger(), which does the real work: detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered(). Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays false forever — so pg_promote()'s signal went unnoticed in our loop. Empirically verified the bug and fix: - Before (PromoteIsTriggered only): pg_promote(wait => true, wait_seconds => 30) returns FALSE after 30 seconds, standby still in_recovery. - After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second, standby is promoted. Raw test run from /tmp/verify_promote.sh: [18:56:24] wait for standby to enter pause [18:56:29] pause state: paused [18:56:29] *** calling pg_promote() now *** [18:56:30] pg_promote returned in 0.937961725s [18:56:30] pg_is_in_recovery post-promote: f [18:56:30] PASS: promote escaped the pause in under 10s CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes it extern (drop the static, add the extern in xlogrecovery.h). Mirrors the existing recoveryPausesHere() escape loop in the same file.
68ca915 to
b740b39
Compare
Adds comprehensive test suite to reproduce Theory 1 (shared memory queue saturation) for the recurring IPC:ParallelFinish hang issue. Test files: - test_parallel_queue_saturation.sql: Main reproduction test with 250K dead tuples and flood_error_queue() function to saturate 16KB error queues - monitor_parallel_hang.sql: Monitoring script to observe wait events - test_parallel_hang_alternative.sql: 7 alternative test approaches - run_reproduction_test.sh: Automated setup and execution script - test_parallel_hang_README.md: Complete documentation Theory being tested: Workers block indefinitely when error queues fill up, creating circular dependency where workers need leader to drain queue but leader only drains when ParallelMessagePending flag is set, which requires successful worker message send (impossible when queue full). Note: Tests target PostgreSQL 16.3 specifically per production environment.
Analyzes two critical commits merged after PostgreSQL 16.3: - 6f6521d (16.4): Don't enter parallel mode when holding interrupts - 06424e9 (16.5): Improved fix for interrupt handling These commits directly address the IPC:ParallelFinish hang issue by preventing parallel worker launch when leader cannot process interrupts, which eliminates the deadlock scenario where workers block on full error queues while leader cannot drain them. Recommendation: Upgrade to PostgreSQL 16.5+ to resolve production issue.
b740b39 to
1dc55c7
Compare
NikolayS
added a commit
that referenced
this pull request
Apr 22, 2026
The previous version checked PromoteIsTriggered() in the wait loop, which only reads the LocalPromoteIsTriggered cache. The cache is populated by CheckForStandbyTrigger(), which does the real work: detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered(). Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays false forever — so pg_promote()'s signal went unnoticed in our loop. Empirically verified the bug and fix: - Before (PromoteIsTriggered only): pg_promote(wait => true, wait_seconds => 30) returns FALSE after 30 seconds, standby still in_recovery. - After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second, standby is promoted. Raw test run from /tmp/verify_promote.sh: [18:56:24] wait for standby to enter pause [18:56:29] pause state: paused [18:56:29] *** calling pg_promote() now *** [18:56:30] pg_promote returned in 0.937961725s [18:56:30] pg_is_in_recovery post-promote: f [18:56:30] PASS: promote escaped the pause in under 10s CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes it extern (drop the static, add the extern in xlogrecovery.h). Mirrors the existing recoveryPausesHere() escape loop in the same file.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds comprehensive test suite to reproduce Theory 1 (shared memory queue saturation) for the recurring IPC:ParallelFinish hang issue.
Test files:
Theory being tested: Workers block indefinitely when error queues fill up, creating circular dependency where workers need leader to drain queue but leader only drains when ParallelMessagePending flag is set, which requires successful worker message send (impossible when queue full).
Note: Tests target PostgreSQL 16.3 specifically per production environment.