Move shutdown_worker RPC to initiate_shutdown #1197

Open
yuandrew wants to merge 12 commits into temporalio:master from yuandrew:shutdown-rpc-initiate-shutdown


@yuandrew (Contributor) commented Apr 3, 2026

What was changed

Moved the shutdown_worker RPC call into initiate_shutdown, which means initiate_shutdown must now be an async fn.
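The "start the RPC early, wait for it later" shape can be sketched with the standard library alone. In the PR the real methods are async and run on an async runtime. In this sketch std::thread::spawn and JoinHandle::join stand in for spawning a task and awaiting it. SketchWorker, RpcFn and send_rpc are hypothetical names, not sdk-core APIs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical RPC callback; a real worker would call the server client here.
pub type RpcFn = dyn Fn() -> Result<(), String> + Send + Sync;

/// Hypothetical stand-in for the core Worker (single-caller sketch, not hardened
/// against concurrent initiate/shutdown calls).
pub struct SketchWorker {
    /// Set once shutdown begins so poll loops can react to it.
    shutdown_requested: Arc<AtomicBool>,
    /// Handle to the in-flight ShutdownWorker "RPC", stored by initiate_shutdown.
    shutdown_rpc: Mutex<Option<JoinHandle<Result<(), String>>>>,
    send_rpc: Arc<RpcFn>,
}

impl SketchWorker {
    pub fn new(send_rpc: Arc<RpcFn>) -> Self {
        Self {
            shutdown_requested: Arc::new(AtomicBool::new(false)),
            shutdown_rpc: Mutex::new(None),
            send_rpc,
        }
    }

    /// Marks shutdown as requested and starts the RPC immediately without
    /// waiting for it. A second call is a no-op, so only one RPC is sent.
    pub fn initiate_shutdown(&self) {
        if self.shutdown_requested.swap(true, Ordering::SeqCst) {
            return;
        }
        let send = Arc::clone(&self.send_rpc);
        let handle = thread::spawn(move || {
            let rpc: &RpcFn = &*send;
            rpc()
        });
        *self.shutdown_rpc.lock().unwrap() = Some(handle);
    }

    pub fn is_shutdown_requested(&self) -> bool {
        self.shutdown_requested.load(Ordering::SeqCst)
    }

    /// Ensures shutdown was initiated, then waits for the stored RPC result.
    pub fn shutdown(&self) -> Result<(), String> {
        self.initiate_shutdown();
        // Draining in-flight polls and tasks would happen here.
        let handle = self.shutdown_rpc.lock().unwrap().take();
        match handle {
            Some(h) => h.join().map_err(|_| "shutdown RPC thread panicked".to_string())?,
            // The RPC result was already consumed by an earlier shutdown call.
            None => Ok(()),
        }
    }
}
```

Storing the handle instead of awaiting it inside initiate_shutdown lets the server learn about the shutdown right away. Any RPC error still surfaces in shutdown.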

NOTE: Also added a temporary 5s timeout to avoid the race described in temporalio/temporal#9545. It will be safe to remove once the fix is fully deployed to Temporal Cloud.

Why?

The shutdown_worker RPC must be sent at the beginning of shutdown so that the server knows to send empty responses to pending polls.
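Once the server starts answering pending polls with empty responses, the worker has to interpret an empty response differently depending on whether shutdown has begun. Below is a minimal sketch of that decision. PollResponse, PollOutcome, classify_poll and run_poll_loop are hypothetical simplifications, not the real poll stream types.

```rust
/// Hypothetical, simplified poll response: `None` models an empty response.
#[derive(Debug, Clone, PartialEq)]
pub struct PollResponse {
    pub task_token: Option<Vec<u8>>,
}

#[derive(Debug, PartialEq)]
pub enum PollOutcome {
    /// A real task; still handed to the worker in this sketch.
    Task(Vec<u8>),
    /// Empty response before shutdown: a normal long-poll timeout, so poll again.
    RetryPoll,
    /// Empty response after shutdown began: the server released the poll, so stop.
    Terminate,
}

pub fn classify_poll(resp: PollResponse, shutdown_requested: bool) -> PollOutcome {
    match resp.task_token {
        Some(token) => PollOutcome::Task(token),
        None if shutdown_requested => PollOutcome::Terminate,
        None => PollOutcome::RetryPoll,
    }
}

/// Drives a sequence of responses, treating shutdown as requested from index
/// `shutdown_at` onward. Returns the tasks received and whether the loop terminated.
pub fn run_poll_loop<I>(responses: I, shutdown_at: usize) -> (Vec<Vec<u8>>, bool)
where
    I: IntoIterator<Item = PollResponse>,
{
    let mut tasks = Vec::new();
    for (i, resp) in responses.into_iter().enumerate() {
        match classify_poll(resp, i >= shutdown_at) {
            PollOutcome::Task(t) => tasks.push(t),
            PollOutcome::RetryPoll => continue,
            PollOutcome::Terminate => return (tasks, true),
        }
    }
    (tasks, false)
}
```

Without the shutdown flag, an empty response would be indistinguishable from an ordinary long-poll timeout, and the loop would keep re-polling.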

Checklist

  1. Closes

  2. How was this tested:

Added the graceful_shutdown_sends_shutdown_worker_rpc_during_initiate test to verify that the RPC is sent at the beginning of the shutdown process.

  3. Any docs updates needed?
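The actual graceful_shutdown_sends_shutdown_worker_rpc_during_initiate test is not reproduced here. What follows is a std-only sketch of the general technique such a test can use. A fake client signals on a channel when ShutdownWorker is called, and the check waits for that signal after only the initiate step has run. FakeClient, initiate_shutdown and rpc_observed_after_initiate are hypothetical names.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical fake client: records a ShutdownWorker call by sending on a channel.
pub struct FakeClient {
    tx: Sender<&'static str>,
}

impl FakeClient {
    pub fn shutdown_worker(&self) {
        let _ = self.tx.send("shutdown_worker");
    }
}

/// Hypothetical initiate step: fires the RPC on a background thread and returns at once.
pub fn initiate_shutdown(client: FakeClient) -> thread::JoinHandle<()> {
    thread::spawn(move || client.shutdown_worker())
}

/// Returns true if the fake server saw the RPC within `wait`, while only the
/// initiate step has run (no draining or final shutdown yet).
pub fn rpc_observed_after_initiate(wait: Duration) -> bool {
    let (tx, rx) = mpsc::channel::<&'static str>();
    let handle = initiate_shutdown(FakeClient { tx });
    let seen = rx.recv_timeout(wait) == Ok("shutdown_worker");
    handle.join().unwrap();
    seen
}
```

Using recv_timeout instead of a fixed sleep keeps the check fast when the RPC arrives promptly and bounded when it never does.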

Note

Medium Risk
Touches core shutdown and polling behavior (including new timing-based timeouts), which can affect worker liveness and graceful termination under load. Changes are well-covered by new regression tests but still carry concurrency/race risk.

Overview
Fixes graceful worker shutdown so in-flight long polls can be drained without deadlocking.

ShutdownWorker RPC is now spawned during Worker::initiate_shutdown (stored and awaited in Worker::shutdown), and poll streams treat empty responses after shutdown as a termination signal rather than a normal poll timeout/retry. Adds a temporary 5s interrupt window for graceful polls to avoid a known server race (temporalio/temporal#9545), and increases the all-permits shutdown watchdog to 6s to match.
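The two timing values above can be illustrated with a small sketch. The 5s and 6s values come from this description. The constant and function names are hypothetical. The sketch also encodes one reading of "to match": the all-permits watchdog should fire only after the graceful-poll interrupt window has elapsed.

```rust
use std::time::{Duration, Instant};

/// Hypothetical names; values taken from the PR description.
pub const GRACEFUL_POLL_INTERRUPT: Duration = Duration::from_secs(5);
pub const ALL_PERMITS_WATCHDOG: Duration = Duration::from_secs(6);

/// After shutdown starts, an outstanding graceful poll may keep waiting for the
/// server's empty response, but is interrupted locally once the window elapses.
pub fn should_interrupt_poll(shutdown_started: Option<Instant>, now: Instant) -> bool {
    match shutdown_started {
        None => false,
        Some(start) => now.saturating_duration_since(start) >= GRACEFUL_POLL_INTERRUPT,
    }
}

/// The watchdog should not fire before polls have had their full interrupt window.
pub fn watchdog_covers_interrupt() -> bool {
    ALL_PERMITS_WATCHDOG > GRACEFUL_POLL_INTERRUPT
}
```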

Adds targeted unit/integration tests covering: RPC sent during initiate, poll stream termination on empty-after-shutdown, permits timeout alignment, and the race scenario; adjusts heartbeat integration assertions to wait for shutdown state.
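Waiting for a shutdown state, rather than asserting it immediately, is typically done with a bounded polling helper. Below is a minimal std-only sketch. wait_until is a hypothetical helper, not part of the test suite. In a heartbeat test, the condition would be a hypothetical check such as "the reported worker status is shutting down".

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical test helper: re-checks `cond` every `interval` until it holds
/// (returns true) or `timeout` elapses (returns false).
pub fn wait_until<F: FnMut() -> bool>(mut cond: F, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if cond() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

A bounded wait tolerates the slightly longer shutdown flow without hiding a state transition that never happens.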

Reviewed by Cursor Bugbot for commit 5863db7.

@yuandrew requested a review from a team as a code owner on April 3, 2026 18:28
Comment on lines -539 to -540
assert_eq!(workflow_task_slots.current_available_slots, 5);
assert_eq!(workflow_task_slots.current_used_slots, 1);
Member


Why'd these all go away?

Contributor Author


Because the new shutdown flow takes a little longer, these checks are no longer deterministic. They are all already covered in in_activity_checks, so I kept only the assert_eq!(workflow_task_slots.total_processed_tasks, 2); check, which is unique to the shutdown case.


@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 522a529.
