fix(kube-client): propagate abort signal through list requests #2954

Open
petersutter wants to merge 3 commits into master from fix/propagate-abort-signal-through-list-requests
Conversation

Member

@petersutter petersutter commented May 11, 2026

How to categorize this PR?
/area robustness
/kind bug

What this PR does / why we need it:
Symptom observed: one of two dashboard pods failed to populate the Shoot backend cache on startup. The list request for Shoots never received a response and the reflector stalled silently: no error, no retry, no timeout.

List requests in the backend cache could silently stall forever when an HTTP/2 stream hung without emitting an error event. Two bugs combined to cause this:

  1. ListWatcher.list() never forwarded the abort signal to the list function, so a hung stream could not be cancelled - unlike watch(), which already passed the signal correctly.
  2. Client.fetch() awaited stream.getHeaders() indefinitely; the existing responseTimeout option was defined but never enforced.

The fix propagates the abort signal through ListWatcher.list() and applies responseTimeout (default 60 s) to the getHeaders() await via a setTimeout / try-finally guard. The previous default of 15 s was never enforced, so it had no practical effect; with enforcement now in place, the default is raised to 60 s to accommodate large list requests against busy clusters.
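
The setTimeout / try-finally guard described above can be sketched roughly as follows. This is an illustration, not the code from packages/request/lib/Client.js: the helper name `getHeadersWithTimeout`, the local `TimeoutError` class, the error message, and the fake stream are all assumptions made for the example. It also assumes that destroying the stream with an error causes the pending `getHeaders()` promise to reject.

```javascript
// Hypothetical stand-in for the TimeoutError that the PR imports from the
// request package's errors module; its real shape is not shown in this PR text.
class TimeoutError extends Error {
  constructor (message) {
    super(message)
    this.name = 'TimeoutError'
  }
}

// Await stream.getHeaders(), but destroy the stream if no response headers
// arrive within responseTimeout milliseconds (60 s mirrors the new default).
async function getHeadersWithTimeout (stream, responseTimeout = 60000) {
  const timeoutId = setTimeout(() => {
    stream.destroy(new TimeoutError(`No response headers within ${responseTimeout} ms`))
  }, responseTimeout)
  try {
    return await stream.getHeaders()
  } finally {
    // Clear the timer on success and on failure so it never fires late.
    clearTimeout(timeoutId)
  }
}

// Fake stream for illustration only: getHeaders() never settles on its own
// and rejects once destroy(err) is called, modelling a hung HTTP/2 stream.
function createHungStream () {
  let rejectHeaders
  const headers = new Promise((resolve, reject) => { rejectHeaders = reject })
  return {
    getHeaders: () => headers,
    destroy: err => rejectHeaders(err),
  }
}
```

With this guard, `await getHeadersWithTimeout(createHungStream(), 20)` rejects with a `TimeoutError` after roughly 20 ms instead of waiting forever.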

Additionally, this PR introduces a package-level KUBE_CLIENT_RESPONSE_TIMEOUT environment variable that sets the default responseTimeout for all kube-client instances (dashboard client, per-user clients, and derived kubeconfig clients). The Helm chart renders .Values.global.dashboard.kubeClient.responseTimeout into the container environment so operators can tune the timeout without code changes. Per-call options can still override the default.
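
As a rough sketch of how such an environment default might be parsed and merged, assuming a positive-integer millisecond value and fail-fast behaviour on invalid input: the function bodies, the `buildClientOptions` helper, and the error message below are hypothetical and only illustrate the precedence rule (explicit options win over the environment default).

```javascript
// Hypothetical parser: accepts only a positive integer number of milliseconds.
function parseResponseTimeout (value) {
  if (value === undefined) {
    return undefined // not configured: the client's built-in default applies
  }
  if (!/^[1-9]\d*$/.test(value)) {
    // Fail fast on invalid configuration instead of silently ignoring it.
    throw new TypeError(`KUBE_CLIENT_RESPONSE_TIMEOUT must be a positive integer (ms), got "${value}"`)
  }
  return Number(value)
}

// Hypothetical helper: merge the env default with explicitly passed options.
function buildClientOptions (env, options = {}) {
  const responseTimeout = parseResponseTimeout(env.KUBE_CLIENT_RESPONSE_TIMEOUT)
  const defaults = responseTimeout === undefined ? {} : { responseTimeout }
  // Spread order matters: explicit options override the env default.
  return { ...defaults, ...options }
}
```

For example, `buildClientOptions({ KUBE_CLIENT_RESPONSE_TIMEOUT: '30000' }, { responseTimeout: 5000 })` yields a `responseTimeout` of 5000, while a value such as `'15s'` throws at startup.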

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
As a possible future follow-up, we could evaluate whether the package-level config-dependent singletons should be replaced by an explicit createClientSet() factory that receives responseTimeout and other transport options as constructor arguments. In that model, the backend would instantiate the client set at startup from its loaded config and inject it into route handlers, services, and hooks. This is intentionally not part of this PR: it would touch every backend module that imports from @gardener-dashboard/kube-client and should only be considered as a separate refactoring effort if we decide the reduced coupling is worth the larger change surface.
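
To make the follow-up idea more concrete, a factory along these lines is one possible shape. Everything here is hypothetical and not part of this PR: the signature of `createClientSet`, the injected `createClient` function, and the returned property names are assumptions for illustration only.

```javascript
// Hypothetical factory: transport options are passed in explicitly instead of
// being read from package-level singletons. `createClient` stands in for
// whatever builds a single client from a set of options.
function createClientSet ({ responseTimeout = 60000, ...transportOptions } = {}, createClient) {
  const defaults = { responseTimeout, ...transportOptions }
  return {
    // Client used by the backend itself.
    dashboardClient: createClient(defaults),
    // Per-user clients share the transport defaults; explicit options win.
    createUserClient: (options = {}) => createClient({ ...defaults, ...options }),
  }
}
```

In that model the backend would call the factory once at startup with values from its loaded config and pass the result to route handlers, services, and hooks.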

Release note:

Fix an issue where the dashboard backend cache could stop updating if Kubernetes API list requests became unresponsive.
Add `KUBE_CLIENT_RESPONSE_TIMEOUT` environment variable to configure the default Kubernetes API response-header timeout for dashboard backend requests. When using the Helm chart, this can be configured via `.Values.global.dashboard.kubeClient.responseTimeout`.

Summary by CodeRabbit

  • New Features

    • Add configurable response-header timeout via environment/configuration (can be set per-deployment).
  • Bug Fixes

    • List operations now receive abort signals like watch operations.
    • Added response-header timeout enforcement and increased default timeout to reduce hung requests.
    • Ensure request options default to an object to avoid undefined-parameter issues.
  • Tests

    • Added tests for abort-signal handling and response-header timeout behavior.

@gardener-prow gardener-prow Bot added the area/robustness Robustness, reliability, resilience related label May 11, 2026

gardener-prow Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign petersutter for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added kind/bug Bug cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 11, 2026

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c376e0d0-f2b1-4208-aff9-3ee2c226e378

📥 Commits

Reviewing files that changed from the base of the PR and between 373a442 and 88b3c9c.

📒 Files selected for processing (8)
  • charts/__tests__/gardener-dashboard/runtime/dashboard/deployment.spec.js
  • charts/gardener-dashboard/charts/runtime/templates/dashboard/deployment.yaml
  • charts/gardener-dashboard/values.yaml
  • packages/kube-client/__tests__/index.spec.js
  • packages/kube-client/lib/index.js
  • packages/request/__tests__/acceptance.spec.js
  • packages/request/__tests__/client.spec.js
  • packages/request/lib/Client.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/request/lib/Client.js

📝 Walkthrough

For list APIs, abort signals are validated and ListWatcher.list forwards the signal to listFunc when one has been set. For requests, Client.fetch enforces a response-header timeout (destroying the stream with a TimeoutError on timeout) and SessionPool.request defaults options to an empty object. kube-client reads KUBE_CLIENT_RESPONSE_TIMEOUT from the environment; the Helm chart exposes it as a value, and tests cover both.

Changes

Abort Signal Propagation in ListWatcher

| Layer / File(s) | Summary |
|---|---|
| Signal Validation in List Methods<br>packages/kube-client/lib/mixins.js | ClusterScoped.Readable.list, NamespaceScoped.Readable.list, and listAllNamespaces call assertSignal(signal) before options/search-params validation and request execution. |
| ListWatcher Signal Forwarding<br>packages/kube-client/lib/cache/ListWatcher.js | ListWatcher.list(query) conditionally attaches this.signal to options passed to listFunc when setAbortSignal() was called. |
| ListWatcher Signal Test<br>packages/kube-client/__tests__/cache.list-watcher.spec.js | New unit test verifies ListWatcher#list with an abort signal calls listFunc with the signal and merged searchParams. |
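
The forwarding behaviour summarized above could look roughly like this. The class below is a simplified stand-in, not the real ListWatcher: the constructor arguments and the merge order of `searchParams` are assumptions, and only the conditional attachment of the signal reflects what the summary describes.

```javascript
// Simplified, hypothetical stand-in for ListWatcher; only signal handling is shown.
class ListWatcherSketch {
  constructor (listFunc, searchParams = {}) {
    this.listFunc = listFunc
    this.searchParams = searchParams
    this.signal = undefined
  }

  setAbortSignal (signal) {
    this.signal = signal
  }

  list (query = {}) {
    const options = {
      // Assumed merge order: per-call query values override the stored ones.
      searchParams: { ...this.searchParams, ...query },
    }
    // Attach the signal only when one was set, so listFunc can cancel a hung request.
    if (this.signal) {
      options.signal = this.signal
    }
    return this.listFunc(options)
  }
}
```

A caller would create an `AbortController`, pass `controller.signal` to `setAbortSignal()`, and later call `controller.abort()` to cancel an in-flight list request, provided `listFunc` honours the signal.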

Response Timeout Protection

| Layer / File(s) | Summary |
|---|---|
| TimeoutError Import<br>packages/request/lib/Client.js | Client.js now imports TimeoutError from the errors module. |
| SessionPool Request Defaults<br>packages/request/lib/SessionPool.js | SessionPool.request(headers, options = {}) defaults options to {} so session.request never receives undefined. |
| Client Fetch Timeout Implementation<br>packages/request/lib/Client.js | Client.fetch() starts a responseTimeout timer before stream.getHeaders(); if headers aren't received within responseTimeout the request stream is destroyed with a TimeoutError. The timer is cleared in try/finally. The default responseTimeout getter was increased to 60000 ms. |
| Client Timeout & Acceptance Tests<br>packages/request/__tests__/client.spec.js, packages/request/__tests__/acceptance.spec.js | Unit and acceptance tests added/updated: unit tests assert default timeout and fetch rejects with TimeoutError when headers don't arrive; acceptance tests add /delay route and a real-connection timeout case. |

Kube-client responseTimeout env and Helm wiring

| Layer / File(s) | Summary |
|---|---|
| parseResponseTimeout and Client defaultOptions<br>packages/kube-client/lib/index.js | Parses and validates KUBE_CLIENT_RESPONSE_TIMEOUT from env, merges it into defaultOptions passed to Client/BaseClient; createClient and createDashboardClient now default options = {} and dashboardClient is created via createDashboardClient(). |
| kube-client env propagation tests<br>packages/kube-client/__tests__/index.spec.js | Tests assert KUBE_CLIENT_RESPONSE_TIMEOUT is propagated into request.extend for package-level/dashboard/derived clients, that per-call overrides take precedence, and invalid env values fail fast. |
| Helm values and deployment env injection<br>charts/gardener-dashboard/values.yaml, charts/gardener-dashboard/charts/runtime/templates/dashboard/deployment.yaml, charts/__tests__/gardener-dashboard/runtime/dashboard/deployment.spec.js | Adds global.dashboard.kubeClient.responseTimeout value, conditionally injects KUBE_CLIENT_RESPONSE_TIMEOUT into container env when set, and adds a chart test verifying the env variable is rendered. |

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers:

  • holgerkoser
  • klocke-io

"I set a signal and watch the clock,
If headers lag, I give a knock.
Streams fold neat, timeouts named with care,
Env and charts pass settings everywhere. 🐰"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately summarizes the main change: propagating the abort signal through list requests in ListWatcher. |
| Description check | ✅ Passed | The PR description is comprehensive and follows the template with all required sections completed, including area/kind categorization, clear problem explanation, solution details, and appropriate release notes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/request/__tests__/client.spec.js`:
- Around line 247-282: The test is using Jest APIs that don't exist under
Vitest; replace jest.fn() with vi.fn() for mocking (e.g., the mock for
stream.getHeaders and stream.destroy) and replace jest.advanceTimersByTime(...)
with vi.advanceTimersByTime(...) so the timer fast-forward works under Vitest;
update any other jest.* usages in this test (references around client.fetch,
stream.getHeaders, stream.destroy) to their vi equivalents so the test runs with
Vitest.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d7ca7124-6130-4b6d-a38e-dca1cc9a598f

📥 Commits

Reviewing files that changed from the base of the PR and between d2a7ab9 and 150fcc9.

📒 Files selected for processing (6)
  • packages/kube-client/__tests__/cache.list-watcher.spec.js
  • packages/kube-client/lib/cache/ListWatcher.js
  • packages/kube-client/lib/mixins.js
  • packages/request/__tests__/client.spec.js
  • packages/request/lib/Client.js
  • packages/request/lib/SessionPool.js

Comment thread packages/request/__tests__/client.spec.js
@petersutter petersutter force-pushed the fix/propagate-abort-signal-through-list-requests branch from 150fcc9 to f1f6bbb Compare May 12, 2026 07:21
List requests silently stalled forever when the HTTP/2 stream hung
because the abort signal was never forwarded by ListWatcher.list(),
and stream.getHeaders() had no timeout.

- ListWatcher.list() now passes this.signal, matching watch()
- Client.fetch() enforces responseTimeout on getHeaders() via try/finally
@petersutter petersutter force-pushed the fix/propagate-abort-signal-through-list-requests branch from f1f6bbb to 373a442 Compare May 12, 2026 08:30
Read a positive-integer millisecond timeout from the environment at
module load and apply it as the default responseTimeout for all
clients (dashboard, user, and derived kubeconfig clients). Per-call
options can still override. Helm chart renders the value from
.Values.global.dashboard.kubeClient.responseTimeout.
The previous 15 s default was never enforced, so it had no effect.
With the timeout now applied to getHeaders(), 15 s is too
aggressive for large list requests against busy clusters.
@gardener-prow gardener-prow Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 14, 2026
Labels

area/robustness Robustness, reliability, resilience related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/bug Bug size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

1 participant