tiproxy: add spec field `gracefulShutdownDeleteDelaySeconds` to gracefully mark unhealthy before deleting the pods by YangKeao · Pull Request #6829 · pingcap/tidb-operator

YangKeao · 2026-04-16T09:24:47Z

Background

I'm designing a graceful restart procedure for TiProxy in cloud environments.

The intended behavior is:

1. Use a large maxSurge.
2. Patch the existing TiProxyGroup to enable a long graceful shutdown delete delay.
3. Restart.
4. Old TiProxy instances are first marked unhealthy, then kept alive for a while before the old pods are actually deleted.

This is mainly for cloud load balancers: existing long-lived connections can continue to work on the old TiProxy instances, while new connections are sent to the new TiProxy instances. This requires a large enough target_health_state.unhealthy.draining_interval_seconds on AWS, and ConnectionDrainEnabled disabled on Aliyun.

We cannot rely on changing terminationGracePeriodSeconds for existing pods, because that would itself require restarting them. So this PR adds a controller-side graceful delete flow for TiProxy.

Design

Add a new spec field spec.template.spec.gracefulShutdownDeleteDelaySeconds to TiProxyGroup / TiProxy.
This field is treated as reloadable, so patching it does not trigger a rolling restart by itself.
When a TiProxy object is being deleted and this field is set to a positive value:
1. The operator first tries to call POST /api/debug/health/unhealthy.
2. If the API is not supported (404), the operator falls back to sending SIGTERM to the TiProxy process via pods/exec.
3. Only after TiProxy is confirmed unhealthy does the operator write the core.pingcap.com/tiproxy-graceful-shutdown-begin-time annotation on the pod and start the delete-delay timer.
4. After the timer expires, the operator deletes the pod.
If TiProxy cannot be marked unhealthy, the operator keeps retrying and does not start the delete-delay timer.
When the whole Cluster is being deleted, this graceful delay is skipped and the pod is deleted directly.

This design keeps the user-facing control in spec, avoids changing terminationGracePeriodSeconds, and supports both new TiProxy versions (with unhealthy API) and older ones (with SIGTERM fallback).
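
The order of checks described in this section can be read as a small decision function. The Go sketch below is illustrative only: `DeleteState`, the `Step` names, `NextStep`, and `NeedsSIGTERMFallback` are hypothetical and are not taken from the PR's code. It assumes the controller has already observed whether TiProxy is unhealthy and whether the begin-time annotation exists.

```go
package main

import (
	"net/http"
	"time"
)

// Step names the next action for a TiProxy pod that is being deleted.
type Step string

const (
	StepMarkUnhealthy Step = "mark-unhealthy" // call the unhealthy API, or fall back to SIGTERM
	StepStartTimer    Step = "start-timer"    // write the begin-time annotation
	StepWait          Step = "wait"           // requeue until the delay expires
	StepDeletePod     Step = "delete-pod"     // delete the pod now
)

// DeleteState is a hypothetical snapshot of what the controller observes.
type DeleteState struct {
	ClusterDeleting bool          // the whole Cluster is being deleted
	Delay           time.Duration // derived from gracefulShutdownDeleteDelaySeconds; <= 0 disables the flow
	Unhealthy       bool          // TiProxy is confirmed unhealthy
	TimerStarted    bool          // the begin-time annotation is present on the pod
	Elapsed         time.Duration // time since the recorded begin time
}

// NextStep returns the next action and, for StepWait, how long to wait.
func NextStep(s DeleteState) (Step, time.Duration) {
	// Cluster deletion or an unset delay skips the graceful flow entirely.
	if s.ClusterDeleting || s.Delay <= 0 {
		return StepDeletePod, 0
	}
	if !s.TimerStarted {
		// The timer must not start until TiProxy is confirmed unhealthy.
		if !s.Unhealthy {
			return StepMarkUnhealthy, 0
		}
		return StepStartTimer, 0
	}
	if remaining := s.Delay - s.Elapsed; remaining > 0 {
		return StepWait, remaining
	}
	return StepDeletePod, 0
}

// NeedsSIGTERMFallback reports whether the response status means the
// unhealthy API is not supported (404), so SIGTERM via pods/exec applies.
func NeedsSIGTERMFallback(statusCode int) bool {
	return statusCode == http.StatusNotFound
}
```

With a 20-second delay, an unhealthy TiProxy whose timer started 5 seconds ago yields `StepWait` with 15 seconds remaining; the same state with `ClusterDeleting` set yields `StepDeletePod` immediately.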

Usage

Patch an existing TiProxyGroup, so old TiProxy pods will be kept for a while after they are marked unhealthy:

kubectl --context "$CONTEXT" -n "$NS" patch tiproxygroup pg --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "gracefulShutdownDeleteDelaySeconds": 20
      }
    }
  }
}'

Then trigger a rolling restart, for example by changing config or image. With a large maxSurge, new TiProxy pods can come up first, and old TiProxy pods will only be deleted after entering graceful shutdown and waiting for the configured delay.
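
For callers that apply the same change from Go rather than kubectl, the merge patch body can be built with the standard library. This is a minimal sketch: `BuildDelayPatch` is a hypothetical helper, and the `int32` parameter type is an assumption about the field's type.

```go
package main

import "encoding/json"

// BuildDelayPatch returns a JSON merge patch body that sets
// spec.template.spec.gracefulShutdownDeleteDelaySeconds on a TiProxyGroup.
// The int32 type is an assumption, not confirmed by the PR description.
func BuildDelayPatch(seconds int32) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"template": map[string]any{
				"spec": map[string]any{
					"gracefulShutdownDeleteDelaySeconds": seconds,
				},
			},
		},
	}
	return json.Marshal(patch)
}
```

The resulting bytes could then be sent as a merge patch, for example through controller-runtime's `client.RawPatch` with `types.MergePatchType`.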

ti-chi-bot · 2026-04-16T09:24:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jlerche for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov-commenter · 2026-04-16T09:33:18Z

Codecov Report

❌ Patch coverage is 52.22222% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.70%. Comparing base (7b536e6) to head (340ab15).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6829      +/-   ##
==========================================
+ Coverage   37.44%   37.70%   +0.26%     
==========================================
  Files         392      393       +1     
  Lines       22434    22588     +154     
==========================================
+ Hits         8400     8517     +117     
- Misses      14034    14071      +37

Flag	Coverage Δ
unittest	`37.70% <52.22%> (+0.26%)`	⬆️

liubog2008 · 2026-04-17T03:06:36Z

+	})
+}
+
+func drainOrDeletePod(ctx context.Context, c client.Client, tiproxy *v1alpha1.TiProxy, pod *corev1.Pod) (time.Duration, error) {


How do we notify TiProxy that it is terminating?

YangKeao · 2026-04-28

Use the unhealthy API in TiProxy (pingcap/tiproxy#1141, "api: add /health/unhealthy endpoint to mark the instance as unhealthy", not merged yet).

If the current version of TiProxy doesn't have the API, the operator uses kill xxx to send SIGTERM. This requires TiProxy to have a large enough graceful-wait-before-shutdown so that it continues to work during the termination.

For example, we can set TiProxyGroup.spec.template.spec.gracefulShutdownDeleteDelaySeconds to 24h and the TiProxy config graceful-wait-before-shutdown to 25h. Although TiProxy does not actually perform its internal graceful shutdown in this case, the connections should still be closed gracefully (within 24h).
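
The relationship between the two settings in the SIGTERM fallback path can be checked up front. The sketch below is a hypothetical helper, not part of the PR. It assumes that graceful-wait-before-shutdown should be strictly longer than the delete delay, which matches the 24h / 25h example but is not stated as a hard rule in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// CheckFallbackTimings returns an error when TiProxy's
// graceful-wait-before-shutdown would not outlast the operator's delete
// delay. In the SIGTERM fallback path, TiProxy keeps serving only for the
// duration of its graceful wait, so a shorter wait would drop connections
// before the operator deletes the pod.
func CheckFallbackTimings(deleteDelay, gracefulWait time.Duration) error {
	if gracefulWait <= deleteDelay {
		return fmt.Errorf(
			"graceful-wait-before-shutdown (%s) should exceed gracefulShutdownDeleteDelaySeconds (%s)",
			gracefulWait, deleteDelay,
		)
	}
	return nil
}
```

Calling `CheckFallbackTimings(24*time.Hour, 25*time.Hour)` returns nil, while equal values return an error.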

Signed-off-by: Yang Keao <yangkeao@chunibyo.icu>

YangKeao · 2026-04-28T06:01:35Z

/hold

Merge after pingcap/tiproxy#1141

Copilot

Pull request overview

Adds a controller-side graceful delete flow for TiProxy pods, allowing old instances to be marked unhealthy and kept alive for a configurable delay before deletion (to better support cloud load balancer draining during rolling restarts).

Changes:

Introduces spec.template.spec.gracefulShutdownDeleteDelaySeconds for TiProxyGroup/TiProxy and treats it as reloadable (doesn’t trigger restarts by itself).
Implements a TiProxy finalizer workflow to mark pods unhealthy via TiProxy API (with SIGTERM exec fallback) and then delay pod deletion using a pod annotation timestamp.
Expands unit/e2e coverage, updates CRDs, and adds RBAC permission for pods/exec.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/e2e/tiproxy/tiproxy.go	Adds an e2e case validating service backend stability during graceful rolling updates with large maxSurge.
pkg/updater/actor_delete_test.go	Adds regression tests around delete options to avoid orphaning dependents.
pkg/updater/actor.go	Refactors delete options construction (preconditions) before calling Delete.
pkg/tiproxyapi/v1/client_test.go	Extends tests for health semantics and adds MarkUnhealthy test.
pkg/tiproxyapi/v1/client.go	Adds `MarkUnhealthy` API and adjusts health check behavior.
pkg/reloadable/tiproxy_test.go	Adds test asserting the new delete-delay field is reloadable.
pkg/reloadable/tiproxy.go	Ignores `GracefulShutdownDeleteDelaySeconds` in template equality to keep it reloadable.
pkg/controllers/tiproxy/tasks/finalizer_test.go	Adds tests for drain/delete delay behavior (API + fallback paths).
pkg/controllers/tiproxy/tasks/finalizer.go	Implements graceful drain + delayed delete logic and pod exec SIGTERM fallback.
pkg/controllers/tiproxy/controller.go	Wires controller `rest.Config` for exec fallback.
pkg/controllers/tiproxy/builder.go	Integrates drain task into deleting flow; ensures pod context is available earlier.
manifests/crd/core.pingcap.com_tiproxygroups.yaml	Adds CRD schema for `gracefulShutdownDeleteDelaySeconds` under template spec.
manifests/crd/core.pingcap.com_tiproxies.yaml	Adds CRD schema for `gracefulShutdownDeleteDelaySeconds` on TiProxy spec.
charts/tidb-operator/templates/rbac.yaml	Grants `pods/exec` create permission for SIGTERM fallback.
api/core/v1alpha1/zz_generated.deepcopy.go	Adds deepcopy support for the new field.
api/core/v1alpha1/tiproxy_types.go	Adds the new API field to `TiProxyTemplateSpec`.
api/core/v1alpha1/common_types.go	Adds the pod annotation key used to track graceful shutdown begin time.
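
The "reloadable" behaviour listed for pkg/reloadable/tiproxy.go amounts to comparing two templates with the new field masked out. The following is a simplified sketch of that idea, not the PR's implementation: `TemplateSpec` is a reduced, hypothetical stand-in for the real type, and the pointer type of the field is an assumption.

```go
package main

import "reflect"

// TemplateSpec is a hypothetical, reduced stand-in for the TiProxy template spec.
type TemplateSpec struct {
	Image                              string
	Config                             string
	GracefulShutdownDeleteDelaySeconds *int32 // assumed to be an optional pointer field
}

// EqualIgnoringDeleteDelay reports whether two templates are equal once the
// delete-delay field is ignored. Both arguments are passed by value, so
// clearing the field here does not modify the caller's structs.
func EqualIgnoringDeleteDelay(a, b TemplateSpec) bool {
	a.GracefulShutdownDeleteDelaySeconds = nil
	b.GracefulShutdownDeleteDelaySeconds = nil
	return reflect.DeepEqual(a, b)
}
```

Under this comparison, two templates that differ only in the delay are treated as equal, so patching the delay alone would not count as a change that requires a rolling restart.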

+func markGracefulShutdownBeginTime(ctx context.Context, c client.Client, pod *corev1.Pod, startAt time.Time) error {
+	newPod := pod.DeepCopy()
+	if newPod.Annotations == nil {
+		newPod.Annotations = map[string]string{}
+	}
+	newPod.Annotations[v1alpha1.AnnoKeyTiProxyGracefulShutdownBeginTime] = startAt.Format(time.RFC3339Nano)
+	return c.Update(ctx, newPod)
+}
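
The reading side of this annotation is not shown in the excerpt. As a hedged sketch of how the remaining delay could be derived from it, the Go code below parses the RFC3339Nano timestamp and subtracts the elapsed time. `ParseBeginTime` and `RemainingDelay` are hypothetical names, and the constant repeats the annotation key given in the PR description rather than importing the real `v1alpha1` package.

```go
package main

import (
	"fmt"
	"time"
)

// AnnoKeyBeginTime mirrors the annotation key named in the PR description.
const AnnoKeyBeginTime = "core.pingcap.com/tiproxy-graceful-shutdown-begin-time"

// ParseBeginTime reads the graceful shutdown begin time from pod annotations.
// It expects the RFC3339Nano format used when the annotation is written.
func ParseBeginTime(annotations map[string]string) (time.Time, error) {
	raw, ok := annotations[AnnoKeyBeginTime]
	if !ok {
		return time.Time{}, fmt.Errorf("annotation %q is not set", AnnoKeyBeginTime)
	}
	return time.Parse(time.RFC3339Nano, raw)
}

// RemainingDelay returns how much longer the pod should be kept before it is
// deleted. A zero result means the delay has already expired.
func RemainingDelay(begin time.Time, delay time.Duration, now time.Time) time.Duration {
	remaining := begin.Add(delay).Sub(now)
	if remaining < 0 {
		return 0
	}
	return remaining
}
```

A controller could requeue the object after the returned duration and delete the pod once it reaches zero.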


YangKeao requested a review from liubog2008 April 16, 2026 09:24

ti-chi-bot Bot requested a review from howardlau1999 April 16, 2026 09:24

github-actions Bot added the v2 for operator v2 label Apr 16, 2026

ti-chi-bot Bot added the size/XL label Apr 16, 2026

YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from 1b9bd2d to de28487 Compare April 16, 2026 09:26

liubog2008 reviewed Apr 17, 2026

Comment thread pkg/controllers/tiproxy/tasks/finalizer.go Outdated

liubog2008 reviewed Apr 17, 2026

Comment thread pkg/controllers/tiproxy/tasks/finalizer.go Outdated

liubog2008 reviewed Apr 17, 2026

YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from de28487 to d837373 Compare April 22, 2026 05:28

ti-chi-bot Bot added size/XXL and removed size/XL labels Apr 22, 2026

YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from d837373 to c2effbb Compare April 27, 2026 08:42

support delayed TiProxy graceful shutdown via unhealthy API

31caa56

YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from c2effbb to 31caa56 Compare April 27, 2026 08:47

YangKeao added 2 commits April 27, 2026 16:53

fix: add missing license header for updater test

1cc5e30

fix: address tiproxy graceful shutdown lint issues

e19a3d3

YangKeao changed the title ~~tiproxy: add annotation tiproxy-graceful-shutdown-delete-delay-seconds to remove label before deleting pods~~ tiproxy: add spec field gracefulShutdownDeleteDelaySeconds to gracefully mark unhealthy before deleting the pods Apr 27, 2026

simplify tiproxy graceful shutdown fallback

340ab15

Signed-off-by: Yang Keao <yangkeao@chunibyo.icu>

YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from dd13296 to 340ab15 Compare April 28, 2026 03:18

YangKeao requested a review from liubog2008 April 28, 2026 06:00

ti-chi-bot Bot added the do-not-merge/hold label Apr 28, 2026

fgksgf requested a review from Copilot April 30, 2026 09:59

Copilot started reviewing on behalf of fgksgf April 30, 2026 10:00

Copilot AI reviewed Apr 30, 2026

YangKeao wants to merge 4 commits into pingcap:main from YangKeao:feature/annotation-to-extend-graceful-shutdown

4 participants
