tiproxy: add spec field gracefulShutdownDeleteDelaySeconds to gracefully mark unhealthy before deleting the pods#6829

Open
YangKeao wants to merge 4 commits into pingcap:main from YangKeao:feature/annotation-to-extend-graceful-shutdown

@YangKeao YangKeao commented Apr 16, 2026

Background

I'm designing graceful restarts of TiProxy in cloud environments.

The intended behavior is:

  1. Use a large maxSurge.
  2. Patch the existing TiProxyGroup to enable a long graceful shutdown delete delay.
  3. Restart.
  4. Old TiProxy instances are first marked unhealthy, then kept alive for a while before the old pods are actually deleted.

This is mainly for cloud load balancers: existing long-lived connections can keep working on the old TiProxy, while new connections are sent to the new TiProxy instances. This requires a large enough target_health_state.unhealthy.draining_interval_seconds on AWS, and ConnectionDrainEnabled disabled on Aliyun.

We cannot rely on changing terminationGracePeriodSeconds for existing pods, because that would itself require restarting them. So this PR adds a controller-side graceful delete flow for TiProxy.

Design

  1. Add a new spec field spec.template.spec.gracefulShutdownDeleteDelaySeconds to TiProxyGroup / TiProxy.
  2. This field is treated as reloadable, so patching it does not trigger a rolling restart by itself.
  3. When a TiProxy object is being deleted and this field is set to a positive value:
    1. The operator first tries to call POST /api/debug/health/unhealthy.
    2. If the API is not supported (404), the operator falls back to sending SIGTERM to the TiProxy process via pods/exec.
    3. Only after TiProxy is confirmed unhealthy does the operator write the core.pingcap.com/tiproxy-graceful-shutdown-begin-time annotation on the pod and start the delete-delay timer.
    4. After the timer expires, the operator deletes the pod.
  4. If TiProxy cannot be marked unhealthy, the operator keeps retrying and does not start the delete-delay timer.
  5. When the whole Cluster is being deleted, this graceful delay is skipped and the pod is deleted directly.
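
The decision order in the list above can be sketched as a small pure function. This is only an illustration under stated assumptions: the `state` struct, the `action` names, and `nextAction` are hypothetical and are not taken from the PR's code.

```go
package main

import "fmt"

// action is the next step a controller would take for a deleting TiProxy.
// The names are hypothetical, chosen for this sketch only.
type action string

const (
	actionDeletePod     action = "delete-pod"
	actionMarkUnhealthy action = "mark-unhealthy"
	actionStartTimer    action = "start-timer"
	actionWait          action = "wait"
)

// state is a hypothetical snapshot of what the controller observes.
type state struct {
	clusterDeleting bool  // the whole Cluster is being deleted
	delaySeconds    int32 // gracefulShutdownDeleteDelaySeconds; <= 0 means disabled
	unhealthy       bool  // TiProxy has been confirmed unhealthy
	timerStarted    bool  // the begin-time annotation is present on the pod
	elapsedSeconds  int64 // seconds since the begin-time annotation
}

// nextAction follows the design steps: skip the delay when it does not apply,
// otherwise mark unhealthy, start the timer, wait, and finally delete.
func nextAction(s state) action {
	switch {
	case s.clusterDeleting || s.delaySeconds <= 0:
		return actionDeletePod
	case s.timerStarted && s.elapsedSeconds >= int64(s.delaySeconds):
		return actionDeletePod
	case s.timerStarted:
		return actionWait
	case !s.unhealthy:
		// Keep retrying; the timer must not start before this succeeds.
		return actionMarkUnhealthy
	default:
		return actionStartTimer
	}
}

func main() {
	s := state{delaySeconds: 20, unhealthy: true, timerStarted: true, elapsedSeconds: 5}
	fmt.Println(nextAction(s))
}
```

Keeping the decision separate from the side effects (HTTP call, exec, annotation update) makes each branch easy to unit test.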

This design keeps the user-facing control in spec, avoids changing terminationGracePeriodSeconds, and supports both new TiProxy versions (with unhealthy API) and older ones (with SIGTERM fallback).
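
A minimal sketch of the API-first path with the 404 fallback signal, assuming that any 2xx response means TiProxy accepted the request. Only the endpoint path comes from the description above; `markUnhealthy`, `errNeedSIGTERM`, and `stubServer` are hypothetical names, and the stub server stands in for a real TiProxy instance.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// errNeedSIGTERM tells the caller that the unhealthy API is missing (404)
// and that the SIGTERM fallback should be used instead.
var errNeedSIGTERM = errors.New("unhealthy API not supported")

// markUnhealthy posts to the endpoint named in the PR description.
// Treating any 2xx as success is an assumption of this sketch.
func markUnhealthy(baseURL string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(baseURL+"/api/debug/health/unhealthy", "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	switch {
	case resp.StatusCode == http.StatusNotFound:
		return errNeedSIGTERM
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return nil
	default:
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
}

// stubServer starts a local server that always answers with the given status.
func stubServer(status int) (string, func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	return srv.URL, srv.Close
}

func main() {
	url, stop := stubServer(http.StatusNotFound)
	defer stop()
	err := markUnhealthy(url)
	fmt.Println("fall back to SIGTERM:", errors.Is(err, errNeedSIGTERM))
}
```

Any other error leaves the instance in the "not yet unhealthy" state, so a caller would retry without starting the delete-delay timer.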

Usage

Patch an existing TiProxyGroup, so old TiProxy pods will be kept for a while after they are marked unhealthy:

kubectl --context "$CONTEXT" -n "$NS" patch tiproxygroup pg --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "gracefulShutdownDeleteDelaySeconds": 20
      }
    }
  }
}'

Then trigger a rolling restart, for example by changing config or image. With a large maxSurge, new TiProxy pods can come up first, and old TiProxy pods will only be deleted after entering graceful shutdown and waiting for the configured delay.
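
For callers that patch from Go rather than kubectl, the same merge-patch body can be built with encoding/json. This is a sketch only: `deleteDelayPatch` is a hypothetical helper, and how the patch is sent to the API server is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteDelayPatch builds the JSON merge patch shown in the kubectl example.
// The helper name is hypothetical; the field path comes from the PR description.
func deleteDelayPatch(seconds int32) (string, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"template": map[string]any{
				"spec": map[string]any{
					"gracefulShutdownDeleteDelaySeconds": seconds,
				},
			},
		},
	}
	b, err := json.Marshal(patch)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	p, err := deleteDelayPatch(20)
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```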

@YangKeao YangKeao requested a review from liubog2008 April 16, 2026 09:24
@ti-chi-bot ti-chi-bot Bot requested a review from howardlau1999 April 16, 2026 09:24

ti-chi-bot Bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jlerche for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions Bot added the v2 for operator v2 label Apr 16, 2026
@ti-chi-bot ti-chi-bot Bot added the size/XL label Apr 16, 2026
@YangKeao YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from 1b9bd2d to de28487 Compare April 16, 2026 09:26

codecov-commenter commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 52.22222% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.70%. Comparing base (7b536e6) to head (340ab15).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6829      +/-   ##
==========================================
+ Coverage   37.44%   37.70%   +0.26%     
==========================================
  Files         392      393       +1     
  Lines       22434    22588     +154     
==========================================
+ Hits         8400     8517     +117     
- Misses      14034    14071      +37     
| Flag | Coverage Δ |
| --- | --- |
| unittest | 37.70% <52.22%> (+0.26%) ⬆️ |


Comment thread pkg/controllers/tiproxy/tasks/finalizer.go Outdated
Comment thread pkg/controllers/tiproxy/tasks/finalizer.go Outdated
})
}

func drainOrDeletePod(ctx context.Context, c client.Client, tiproxy *v1alpha1.TiProxy, pod *corev1.Pod) (time.Duration, error) {

How is TiProxy notified that it is terminating?

Member Author


  1. Use the unhealthy API in TiProxy (tiproxy#1141, "api: add /health/unhealthy endpoint to mark the instance as unhealthy", not merged yet).
  2. If the current version of TiProxy doesn't have the API, use kill xxx to send SIGTERM. This requires TiProxy to have a large enough graceful-wait-before-shutdown so that it keeps working during termination.

We can set TiProxyGroup.spec.template.spec.gracefulShutdownDeleteDelaySeconds to 24h and set the TiProxy config graceful-wait-before-shutdown to 25h. Even though TiProxy does not actually complete a graceful shutdown internally, connections should still quit gracefully (within 24h).
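
The relationship between the two settings in the SIGTERM fallback can be checked with simple arithmetic. The sketch below assumes both values are expressed in seconds; `waitCoversDelay` and `marginSeconds` are hypothetical helpers, not part of the operator.

```go
package main

import "fmt"

// waitCoversDelay reports whether TiProxy's graceful-wait-before-shutdown
// outlasts the operator's delete delay, so that TiProxy keeps serving existing
// connections until the pod is deleted. Both values are in seconds.
func waitCoversDelay(gracefulWaitSeconds, deleteDelaySeconds int64) bool {
	return gracefulWaitSeconds > deleteDelaySeconds
}

// marginSeconds is how long TiProxy would still have waited when the pod is deleted.
func marginSeconds(gracefulWaitSeconds, deleteDelaySeconds int64) int64 {
	return gracefulWaitSeconds - deleteDelaySeconds
}

func main() {
	const hour = int64(3600)
	wait, delay := 25*hour, 24*hour // the 25h / 24h example from the reply above
	fmt.Println(waitCoversDelay(wait, delay), marginSeconds(wait, delay))
}
```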

@YangKeao YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from de28487 to d837373 Compare April 22, 2026 05:28
@ti-chi-bot ti-chi-bot Bot added size/XXL and removed size/XL labels Apr 22, 2026
@YangKeao YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from d837373 to c2effbb Compare April 27, 2026 08:42
@YangKeao YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from c2effbb to 31caa56 Compare April 27, 2026 08:47
@YangKeao YangKeao changed the title from "tiproxy: add annotation tiproxy-graceful-shutdown-delete-delay-seconds to remove label before deleting pods" to "tiproxy: add spec field gracefulShutdownDeleteDelaySeconds to gracefully mark unhealthy before deleting the pods" Apr 27, 2026
Signed-off-by: Yang Keao <yangkeao@chunibyo.icu>
@YangKeao YangKeao force-pushed the feature/annotation-to-extend-graceful-shutdown branch from dd13296 to 340ab15 Compare April 28, 2026 03:18
@YangKeao YangKeao requested a review from liubog2008 April 28, 2026 06:00
@YangKeao
Member Author

/hold

Merge after pingcap/tiproxy#1141

Copilot AI left a comment


Pull request overview

Adds a controller-side graceful delete flow for TiProxy pods, allowing old instances to be marked unhealthy and kept alive for a configurable delay before deletion (to better support cloud load balancer draining during rolling restarts).

Changes:

  • Introduces spec.template.spec.gracefulShutdownDeleteDelaySeconds for TiProxyGroup/TiProxy and treats it as reloadable (doesn’t trigger restarts by itself).
  • Implements a TiProxy finalizer workflow to mark pods unhealthy via TiProxy API (with SIGTERM exec fallback) and then delay pod deletion using a pod annotation timestamp.
  • Expands unit/e2e coverage, updates CRDs, and adds RBAC permission for pods/exec.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/e2e/tiproxy/tiproxy.go | Adds an e2e case validating service backend stability during graceful rolling updates with large maxSurge. |
| pkg/updater/actor_delete_test.go | Adds regression tests around delete options to avoid orphaning dependents. |
| pkg/updater/actor.go | Refactors delete options construction (preconditions) before calling Delete. |
| pkg/tiproxyapi/v1/client_test.go | Extends tests for health semantics and adds MarkUnhealthy test. |
| pkg/tiproxyapi/v1/client.go | Adds MarkUnhealthy API and adjusts health check behavior. |
| pkg/reloadable/tiproxy_test.go | Adds test asserting the new delete-delay field is reloadable. |
| pkg/reloadable/tiproxy.go | Ignores GracefulShutdownDeleteDelaySeconds in template equality to keep it reloadable. |
| pkg/controllers/tiproxy/tasks/finalizer_test.go | Adds tests for drain/delete delay behavior (API + fallback paths). |
| pkg/controllers/tiproxy/tasks/finalizer.go | Implements graceful drain + delayed delete logic and pod exec SIGTERM fallback. |
| pkg/controllers/tiproxy/controller.go | Wires controller rest.Config for exec fallback. |
| pkg/controllers/tiproxy/builder.go | Integrates drain task into deleting flow; ensures pod context is available earlier. |
| manifests/crd/core.pingcap.com_tiproxygroups.yaml | Adds CRD schema for gracefulShutdownDeleteDelaySeconds under template spec. |
| manifests/crd/core.pingcap.com_tiproxies.yaml | Adds CRD schema for gracefulShutdownDeleteDelaySeconds on TiProxy spec. |
| charts/tidb-operator/templates/rbac.yaml | Grants pods/exec create permission for SIGTERM fallback. |
| api/core/v1alpha1/zz_generated.deepcopy.go | Adds deepcopy support for the new field. |
| api/core/v1alpha1/tiproxy_types.go | Adds the new API field to TiProxyTemplateSpec. |
| api/core/v1alpha1/common_types.go | Adds the pod annotation key used to track graceful shutdown begin time. |

Comment thread pkg/tiproxyapi/v1/client.go
Comment on lines +107 to +114
func markGracefulShutdownBeginTime(ctx context.Context, c client.Client, pod *corev1.Pod, startAt time.Time) error {
	newPod := pod.DeepCopy()
	if newPod.Annotations == nil {
		newPod.Annotations = map[string]string{}
	}
	newPod.Annotations[v1alpha1.AnnoKeyTiProxyGracefulShutdownBeginTime] = startAt.Format(time.RFC3339Nano)
	return c.Update(ctx, newPod)
}
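
The read side of this annotation can be sketched as follows: parse the RFC3339Nano timestamp and compute how much of the delete delay is left. This is an assumption about how the timer might be evaluated, not the PR's implementation; `remainingDelay` and `mustParse` are hypothetical helpers, and the annotation key is copied from the PR description.

```go
package main

import (
	"fmt"
	"time"
)

// annoBeginTime is the annotation key named in the PR description.
const annoBeginTime = "core.pingcap.com/tiproxy-graceful-shutdown-begin-time"

// remainingDelay returns how long to wait before the pod may be deleted.
// It returns 0 once the delay has fully elapsed.
func remainingDelay(annotations map[string]string, delaySeconds int32, now time.Time) (time.Duration, error) {
	raw, ok := annotations[annoBeginTime]
	if !ok {
		return 0, fmt.Errorf("annotation %s is not set", annoBeginTime)
	}
	begin, err := time.Parse(time.RFC3339Nano, raw)
	if err != nil {
		return 0, fmt.Errorf("parse %s: %w", annoBeginTime, err)
	}
	remaining := begin.Add(time.Duration(delaySeconds) * time.Second).Sub(now)
	if remaining < 0 {
		remaining = 0
	}
	return remaining, nil
}

// mustParse is a small helper for examples and tests.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339Nano, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	annos := map[string]string{annoBeginTime: "2026-04-16T09:00:00Z"}
	d, err := remainingDelay(annos, 20, mustParse("2026-04-16T09:00:05Z"))
	fmt.Println(d, err)
}
```

Storing the begin time on the pod rather than in controller memory means the remaining delay can be recomputed after an operator restart.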

4 participants