
Guard against late-arriving polls after worker shutdown#9330

Closed
rkannan82 wants to merge 5 commits intomainfrom
kannan/shutdown-worker-poll-guard

@rkannan82
Contributor

@rkannan82 rkannan82 commented Feb 13, 2026

What changed?

When CancelOutstandingWorkerPolls is called, the WorkerInstanceKey is cached in a TTL cache (70s default). Any subsequent poll arriving with this key returns empty immediately, preventing task dispatch to a shutting-down worker.

Why?

This handles the edge case where a poll request was in-flight (already sent by SDK) when ShutdownWorker was called, arriving at the server after the cancellation logic has completed. Without this guard, such polls could receive tasks that would never be processed.
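The guard described above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the code from this PR: `shutdownGuard`, `markShutdown`, `isShutdown`, and `poll` are hypothetical names, a plain map with expiry times stands in for the server's TTL cache, and the clock is injected so the expiry can be shown without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// shutdownGuard is a hypothetical stand-in for the shutdown-worker cache:
// it remembers worker instance keys until their TTL elapses.
type shutdownGuard struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time     // injectable clock (assumption, for the demo)
	expires map[string]time.Time // worker instance key -> expiry time
}

func newShutdownGuard(ttl time.Duration, now func() time.Time) *shutdownGuard {
	return &shutdownGuard{ttl: ttl, now: now, expires: make(map[string]time.Time)}
}

// markShutdown records the key, as the cancel path would.
func (g *shutdownGuard) markShutdown(key string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.expires[key] = g.now().Add(g.ttl)
}

// isShutdown reports whether key was marked and has not yet expired.
func (g *shutdownGuard) isShutdown(key string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	exp, ok := g.expires[key]
	if !ok {
		return false
	}
	if !g.now().Before(exp) {
		delete(g.expires, key) // lazily drop the expired entry
		return false
	}
	return true
}

// poll returns ("", false) for a late poll from a shut-down worker;
// otherwise it hands out the given task.
func (g *shutdownGuard) poll(workerInstanceKey, task string) (string, bool) {
	if workerInstanceKey != "" && g.isShutdown(workerInstanceKey) {
		return "", false
	}
	return task, true
}

// demo marks a worker as shut down, polls inside the TTL, then polls after it.
func demo() (blockedWithinTTL, servedAfterTTL bool) {
	clock := time.Unix(0, 0)
	g := newShutdownGuard(70*time.Second, func() time.Time { return clock })
	g.markShutdown("worker-a")
	_, ok := g.poll("worker-a", "task-1")
	blockedWithinTTL = !ok
	clock = clock.Add(71 * time.Second) // advance past the TTL
	_, servedAfterTTL = g.poll("worker-a", "task-1")
	return
}

func main() {
	blocked, served := demo()
	fmt.Println(blocked, served)
}
```

In this sketch a poll with an empty key is never blocked, mirroring the `workerInstanceKey != ""` check in the diff.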

How did you test it?

  • built
  • covered by existing tests
  • added new unit test(s)

Potential risks

  • Memory usage: The cache stores up to 50K entries (~10MB, based on ~200 bytes per entry). TTL is 70s (long poll timeout + buffer). When the cache is full, LRU eviction removes the least recently used entries first.
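To make the memory estimate and the eviction behavior concrete, here is a small sketch. It assumes the ~200 bytes per entry figure quoted above and uses a hypothetical `lruSet` type built on the standard `container/list` package; it is not the cache implementation used by the server.

```go
package main

import (
	"container/list"
	"fmt"
)

// estimatedCacheBytes multiplies the entry count by the assumed per-entry size.
func estimatedCacheBytes(entries, bytesPerEntry int) int {
	return entries * bytesPerEntry
}

// lruSet is a hypothetical bounded set with least-recently-used eviction.
// It assumes capacity >= 1.
type lruSet struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> node in order
}

func newLRUSet(capacity int) *lruSet {
	return &lruSet{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// touch inserts key, or marks it as most recently used if already present.
func (s *lruSet) touch(key string) {
	if el, ok := s.items[key]; ok {
		s.order.MoveToFront(el)
		return
	}
	if s.order.Len() >= s.capacity {
		oldest := s.order.Back() // least recently used entry
		s.order.Remove(oldest)
		delete(s.items, oldest.Value.(string))
	}
	s.items[key] = s.order.PushFront(key)
}

func (s *lruSet) contains(key string) bool {
	_, ok := s.items[key]
	return ok
}

// demo fills a two-entry set, re-touches "a", then adds "c" to force an eviction.
func demo() (hasA, hasB, hasC bool) {
	s := newLRUSet(2)
	s.touch("a")
	s.touch("b")
	s.touch("a") // "a" is now more recently used than "b"
	s.touch("c") // evicts "b"
	return s.contains("a"), s.contains("b"), s.contains("c")
}

func main() {
	fmt.Println(estimatedCacheBytes(50_000, 200)) // 10000000 bytes, i.e. ~10MB
	fmt.Println(demo())
}
```

Note that "b" is evicted even though "a" was inserted first: eviction follows recency of use, not insertion order.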

When CancelOutstandingWorkerPolls is called, the WorkerInstanceKey is
cached in a TTL cache (60s default). Any subsequent poll arriving with
this key returns empty immediately, preventing task dispatch to a
shutting-down worker.

This handles the edge case where a poll request was in-flight (already
sent by SDK) when ShutdownWorker was called, arriving at the server
after the cancellation logic has completed.

- Add ShutdownWorkerCacheTTL dynamic config (60s default)
- Add shutdownWorkers TTL cache to matchingEngineImpl
- Check cache early in PollWorkflowTaskQueue/PollActivityTaskQueue
- Add unit tests for cache behavior
@rkannan82 rkannan82 requested review from a team as code owners February 13, 2026 23:18
@rkannan82 rkannan82 marked this pull request as draft February 13, 2026 23:26
@rkannan82 rkannan82 force-pushed the kannan/shutdown-worker-poll-guard branch from 441d5b6 to 5c1df05 Compare February 13, 2026 23:27
`ShutdownWorkerCacheTTL is the time to live for entries in the shutdown worker cache. When a worker calls
ShutdownWorker, its WorkerInstanceKey is cached for this duration. Any poll arriving with a cached
WorkerInstanceKey returns empty immediately, preventing task dispatch to a shutting-down worker.
This should be longer than MatchingLongPollExpirationInterval (1 min default) to catch in-flight polls.`,

This isn't related to the length of a long poll at all; it's about the interval in which RPCs can get reordered on the network. That is unbounded in theory, but in practice something like 10s-30s should be fine. The SDK isn't waiting that long anyway between calling ShutdownWorker and actually exiting, right?

outstandingPollers: collection.NewSyncMap[string, context.CancelFunc](),
workerInstancePollers: workerPollerTracker{pollers: make(map[string]map[string]context.CancelFunc)},
// 50000 entries ≈ 10MB (each entry ~200 bytes: UUID key + cache overhead)
shutdownWorkers: cache.New(50000, &cache.Options{TTL: config.ShutdownWorkerCacheTTL()}),

If you're going to use dynamic config, why not the cache size too? (I'm fine with neither being dynamic, actually)

`PollerHistoryTTL is the time to live for poller histories in the pollerHistory cache of a physical task queue. Poller histories are fetched when
requiring a list of pollers that polled a given task queue.`,
)
ShutdownWorkerCacheTTL = NewGlobalDurationSetting(

Note that changing this setting requires a process restart to take effect.
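One plausible reading of this comment is that the setting is evaluated once, when the cache is constructed (as in `cache.Options{TTL: config.ShutdownWorkerCacheTTL()}`), so later changes are not seen. The sketch below contrasts that with re-reading the setting on each use. All names here (`durationSetting`, `snapshotCache`, `liveCache`) are hypothetical and are not part of the server's dynamic config or cache packages.

```go
package main

import (
	"fmt"
	"time"
)

// durationSetting is a hypothetical stand-in for a dynamic config getter.
type durationSetting func() time.Duration

// snapshotCache captures the TTL once, at construction time.
type snapshotCache struct{ ttl time.Duration }

func newSnapshotCache(s durationSetting) *snapshotCache {
	return &snapshotCache{ttl: s()}
}

func (c *snapshotCache) currentTTL() time.Duration { return c.ttl }

// liveCache keeps the getter and re-reads it on every use.
type liveCache struct{ ttl durationSetting }

func newLiveCache(s durationSetting) *liveCache {
	return &liveCache{ttl: s}
}

func (c *liveCache) currentTTL() time.Duration { return c.ttl() }

// demo changes the configured value after both caches have been built.
func demo() (snapshotSeconds, liveSeconds int) {
	configured := 70 * time.Second
	setting := func() time.Duration { return configured }

	snap := newSnapshotCache(setting)
	live := newLiveCache(setting)

	configured = 30 * time.Second // simulate a dynamic config update

	return int(snap.currentTTL().Seconds()), int(live.currentTTL().Seconds())
}

func main() {
	snap, live := demo()
	fmt.Println(snap, live)
}
```

The snapshot variant keeps reporting the value seen at construction, which is why a restart would be needed for a change to apply.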

// This guards against polls that arrive after CancelOutstandingWorkerPolls completed.
if workerInstanceKey := request.GetWorkerInstanceKey(); workerInstanceKey != "" {
	if e.shutdownWorkers.Get(workerInstanceKey) != nil {
		return emptyPollWorkflowTaskQueueResponse, nil

I would think it should return nil, serviceerror.NewCanceled("worker shutdown") or something like that
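The difference between the two options can be sketched as follows. To stay self-contained, this example uses a plain sentinel error from the standard `errors` package rather than the `serviceerror.NewCanceled` call suggested above; `pollTask`, `pollResponse`, and `errWorkerShutdown` are hypothetical names, and a map stands in for the shutdown-worker cache.

```go
package main

import (
	"errors"
	"fmt"
)

// errWorkerShutdown is a hypothetical sentinel standing in for a
// "canceled" service error.
var errWorkerShutdown = errors.New("worker shutdown")

// pollResponse is a hypothetical, simplified poll response.
type pollResponse struct{ TaskToken string }

// pollTask rejects polls from shut-down workers with an explicit error
// instead of returning an empty response.
func pollTask(shutdown map[string]bool, workerInstanceKey string) (*pollResponse, error) {
	if workerInstanceKey != "" && shutdown[workerInstanceKey] {
		return nil, errWorkerShutdown
	}
	return &pollResponse{TaskToken: "task-1"}, nil
}

// demo polls once as a shut-down worker and once as a healthy one.
func demo() (rejected, served bool) {
	shutdown := map[string]bool{"worker-a": true}

	_, err := pollTask(shutdown, "worker-a")
	rejected = errors.Is(err, errWorkerShutdown)

	resp, err := pollTask(shutdown, "worker-b")
	served = err == nil && resp != nil
	return
}

func main() {
	rejected, served := demo()
	fmt.Println(rejected, served)
}
```

With an explicit error, the caller can tell a shutdown rejection apart from an ordinary empty poll, which an empty response alone does not convey.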

@rkannan82
Contributor Author

Already implemented in #9545.

@rkannan82 rkannan82 closed this Apr 10, 2026