Guard against late-arriving polls after worker shutdown#9330
Conversation
When CancelOutstandingWorkerPolls is called, the WorkerInstanceKey is cached in a TTL cache (60s default). Any subsequent poll arriving with this key returns empty immediately, preventing task dispatch to a shutting-down worker. This handles the edge case where a poll request was in-flight (already sent by the SDK) when ShutdownWorker was called and arrives at the server after the cancellation logic has completed.

- Add ShutdownWorkerCacheTTL dynamic config (60s default)
- Add shutdownWorkers TTL cache to matchingEngineImpl
- Check cache early in PollWorkflowTaskQueue/PollActivityTaskQueue
- Add unit tests for cache behavior
```go
`ShutdownWorkerCacheTTL is the time to live for entries in the shutdown worker cache. When a worker calls
ShutdownWorker, its WorkerInstanceKey is cached for this duration. Any poll arriving with a cached
WorkerInstanceKey returns empty immediately, preventing task dispatch to a shutting-down worker.
This should be longer than MatchingLongPollExpirationInterval (1 min default) to catch in-flight polls.`,
```
This isn't related to the length of a long poll at all; it's about the interval during which RPCs can get reordered on the network. That is unbounded in theory, but practically something like 10s-30s should be fine. The SDK isn't waiting that long anyway between calling ShutdownWorker and actually exiting, right?
```go
outstandingPollers:    collection.NewSyncMap[string, context.CancelFunc](),
workerInstancePollers: workerPollerTracker{pollers: make(map[string]map[string]context.CancelFunc)},
// 50000 entries ≈ 10MB (each entry ~200 bytes: UUID key + cache overhead)
shutdownWorkers: cache.New(50000, &cache.Options{TTL: config.ShutdownWorkerCacheTTL()}),
```
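As a quick sanity check of the sizing comment above, the arithmetic works out as follows (this is a back-of-envelope estimate, not a measured figure; the ~200 bytes per entry is the comment's own assumption covering the UUID key plus cache overhead):

```go
package main

import "fmt"

func main() {
	// 50,000 entries at roughly 200 bytes each (a 36-byte UUID string key
	// plus map/cache bookkeeping) comes to about 10 MB of worst-case memory.
	const entries = 50000
	const bytesPerEntry = 200
	totalMB := entries * bytesPerEntry / 1_000_000
	fmt.Println(totalMB) // 10
}
```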
If you're going to use dynamic config, why not the cache size too? (I'm fine with neither being dynamic, actually)
```go
`PollerHistoryTTL is the time to live for poller histories in the pollerHistory cache of a physical task queue.
Poller histories are fetched when requiring a list of pollers that polled a given task queue.`,
)
ShutdownWorkerCacheTTL = NewGlobalDurationSetting(
```
Note that this requires a process restart to take effect
```go
// This guards against polls that arrive after CancelOutstandingWorkerPolls completed.
if workerInstanceKey := request.GetWorkerInstanceKey(); workerInstanceKey != "" {
	if e.shutdownWorkers.Get(workerInstanceKey) != nil {
		return emptyPollWorkflowTaskQueueResponse, nil
```
I would think it should return `nil, serviceerror.NewCanceled("worker shutdown")` or something like that.
Already implemented in #9545.
What changed?
When `CancelOutstandingWorkerPolls` is called, the `WorkerInstanceKey` is cached in a TTL cache (70s default). Any subsequent poll arriving with this key returns empty immediately, preventing task dispatch to a shutting-down worker.

Why?
This handles the edge case where a poll request was in-flight (already sent by the SDK) when `ShutdownWorker` was called, arriving at the server after the cancellation logic has completed. Without this guard, such polls could receive tasks that would never be processed.

How did you test it?

Potential risks