tiproxy: add spec field gracefulShutdownDeleteDelaySeconds to gracefully mark unhealthy before deleting the pods #6829
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
force-pushed from 1b9bd2d to de28487
Codecov Report

@@ Coverage Diff @@
## main #6829 +/- ##
==========================================
+ Coverage 37.44% 37.70% +0.26%
==========================================
Files 392 393 +1
Lines 22434 22588 +154
==========================================
+ Hits 8400 8517 +117
- Misses 14034 14071 +37
```go
func drainOrDeletePod(ctx context.Context, c client.Client, tiproxy *v1alpha1.TiProxy, pod *corev1.Pod) (time.Duration, error) {
```
How do we notify the TiProxy that it is terminating?
- Use the `unhealthy` API in TiProxy ("api: add `/health/unhealthy` endpoint to mark the instance as unhealthy" tiproxy#1141, not merged yet).
- If the current version of TiProxy doesn't have the API, use `kill xxx` to send SIGTERM, which requires the TiProxy to have a large enough `graceful-wait-before-shutdown` to keep working during the termination.

We can set `TiProxyGroup.spec.template.spec.gracefulShutdownDeleteDelaySeconds` to 24h and set the TiProxy config `graceful-wait-before-shutdown` to 25h, so even though the TiProxy does not actually go through its internal graceful shutdown, the connections should still be closed gracefully (within 24h).
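A minimal sketch of that combination, assuming the TiProxy config is passed as TOML under `spec.template.spec.config` and that `graceful-wait-before-shutdown` takes seconds under the `[proxy]` section (both assumptions; only the delete-delay field itself comes from this PR):

```yaml
apiVersion: core.pingcap.com/v1alpha1
kind: TiProxyGroup
metadata:
  name: tiproxy
spec:
  template:
    spec:
      # The operator keeps a drained pod around for 24h (86400s) before deleting it.
      gracefulShutdownDeleteDelaySeconds: 86400
      config: |
        [proxy]
        # 25h in seconds, slightly longer than the operator-side delete delay,
        # so TiProxy keeps serving existing connections for the whole window.
        graceful-wait-before-shutdown = 90000
```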
force-pushed from de28487 to d837373
force-pushed from d837373 to c2effbb
force-pushed from c2effbb to 31caa56
changed the title from "tiproxy-graceful-shutdown-delete-delay-seconds to remove label before deleting pods" to "gracefulShutdownDeleteDelaySeconds to gracefully mark unhealthy before deleting the pods"
Signed-off-by: Yang Keao <yangkeao@chunibyo.icu>
force-pushed from dd13296 to 340ab15
/hold

Merge after pingcap/tiproxy#1141
Pull request overview
Adds a controller-side graceful delete flow for TiProxy pods, allowing old instances to be marked unhealthy and kept alive for a configurable delay before deletion (to better support cloud load balancer draining during rolling restarts).
Changes:
- Introduces `spec.template.spec.gracefulShutdownDeleteDelaySeconds` for TiProxyGroup/TiProxy and treats it as reloadable (doesn't trigger restarts by itself).
- Implements a TiProxy finalizer workflow to mark pods unhealthy via the TiProxy API (with a SIGTERM exec fallback) and then delay pod deletion using a pod annotation timestamp.
- Expands unit/e2e coverage, updates CRDs, and adds RBAC permission for `pods/exec`.
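For orientation, a rough sketch of the API-side addition; the field and annotation names appear in this PR, while the exact types, comments, and placement are assumptions:

```go
package v1alpha1

// Sketch of the new template field (the actual struct has many more fields).
type TiProxyTemplateSpec struct {
	// GracefulShutdownDeleteDelaySeconds, when positive, asks the operator to
	// mark a to-be-deleted TiProxy instance unhealthy first and keep its pod
	// around for this many seconds before deleting it. It is reloadable, so
	// changing it alone does not trigger a rolling restart.
	// +optional
	GracefulShutdownDeleteDelaySeconds *int64 `json:"gracefulShutdownDeleteDelaySeconds,omitempty"`
}

const (
	// AnnoKeyTiProxyGracefulShutdownBeginTime records when the graceful
	// shutdown of a TiProxy pod began, so the delete-delay timer survives
	// controller restarts.
	AnnoKeyTiProxyGracefulShutdownBeginTime = "core.pingcap.com/tiproxy-graceful-shutdown-begin-time"
)
```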
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| tests/e2e/tiproxy/tiproxy.go | Adds an e2e case validating service backend stability during graceful rolling updates with large maxSurge. |
| pkg/updater/actor_delete_test.go | Adds regression tests around delete options to avoid orphaning dependents. |
| pkg/updater/actor.go | Refactors delete options construction (preconditions) before calling Delete. |
| pkg/tiproxyapi/v1/client_test.go | Extends tests for health semantics and adds MarkUnhealthy test. |
| pkg/tiproxyapi/v1/client.go | Adds MarkUnhealthy API and adjusts health check behavior. |
| pkg/reloadable/tiproxy_test.go | Adds test asserting the new delete-delay field is reloadable. |
| pkg/reloadable/tiproxy.go | Ignores GracefulShutdownDeleteDelaySeconds in template equality to keep it reloadable. |
| pkg/controllers/tiproxy/tasks/finalizer_test.go | Adds tests for drain/delete delay behavior (API + fallback paths). |
| pkg/controllers/tiproxy/tasks/finalizer.go | Implements graceful drain + delayed delete logic and pod exec SIGTERM fallback. |
| pkg/controllers/tiproxy/controller.go | Wires controller rest.Config for exec fallback. |
| pkg/controllers/tiproxy/builder.go | Integrates drain task into deleting flow; ensures pod context is available earlier. |
| manifests/crd/core.pingcap.com_tiproxygroups.yaml | Adds CRD schema for gracefulShutdownDeleteDelaySeconds under template spec. |
| manifests/crd/core.pingcap.com_tiproxies.yaml | Adds CRD schema for gracefulShutdownDeleteDelaySeconds on TiProxy spec. |
| charts/tidb-operator/templates/rbac.yaml | Grants pods/exec create permission for SIGTERM fallback. |
| api/core/v1alpha1/zz_generated.deepcopy.go | Adds deepcopy support for the new field. |
| api/core/v1alpha1/tiproxy_types.go | Adds the new API field to TiProxyTemplateSpec. |
| api/core/v1alpha1/common_types.go | Adds the pod annotation key used to track graceful shutdown begin time. |
```go
func markGracefulShutdownBeginTime(ctx context.Context, c client.Client, pod *corev1.Pod, startAt time.Time) error {
	newPod := pod.DeepCopy()
	if newPod.Annotations == nil {
		newPod.Annotations = map[string]string{}
	}
	newPod.Annotations[v1alpha1.AnnoKeyTiProxyGracefulShutdownBeginTime] = startAt.Format(time.RFC3339Nano)
	return c.Update(ctx, newPod)
}
```
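For context, a hedged sketch of how the delete-delay timer could be derived from that annotation; the helper name and exact control flow are illustrative, not the literal code of this PR (it assumes the usual `time`, `fmt`, `corev1`, and `v1alpha1` imports of the surrounding file):

```go
// remainingGracefulShutdownDelay reads the begin-time annotation written by
// markGracefulShutdownBeginTime and returns how much longer the pod should be
// kept. A zero result means the delay has elapsed and the pod can be deleted.
func remainingGracefulShutdownDelay(pod *corev1.Pod, delay time.Duration, now time.Time) (time.Duration, error) {
	val, ok := pod.Annotations[v1alpha1.AnnoKeyTiProxyGracefulShutdownBeginTime]
	if !ok {
		// The drain has not started yet: the caller should mark the instance
		// unhealthy (or send SIGTERM) and stamp the begin time first.
		return delay, nil
	}
	beginAt, err := time.Parse(time.RFC3339Nano, val)
	if err != nil {
		return 0, fmt.Errorf("parse graceful shutdown begin time %q: %w", val, err)
	}
	if remaining := delay - now.Sub(beginAt); remaining > 0 {
		return remaining, nil
	}
	return 0, nil
}
```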
Background
I'm designing graceful restarting of TiProxy in a cloud environment.
The intended behavior is:
- Roll-restart the TiProxy instances with `maxSurge`, so new instances are brought up before the old ones are removed.
- Configure the `TiProxyGroup` to enable a long graceful shutdown delete delay.

This is mainly for cloud load balancers: existing long-lived connections can continue to work on the old TiProxy (with a big enough `target_health_state.unhealthy.draining_interval_seconds` for AWS, and with `ConnectionDrainEnabled` disabled for aliyun), while new connections should be sent to the new TiProxy instances.

We cannot rely on changing `terminationGracePeriodSeconds` for existing pods, because that would itself require restarting them. So this PR adds a controller-side graceful delete flow for TiProxy.

Design
- Add `spec.template.spec.gracefulShutdownDeleteDelaySeconds` to `TiProxyGroup`/`TiProxy`.
- When a `TiProxy` object is being deleted and this field is set to a positive value:
  - the operator marks the instance unhealthy via `POST /api/debug/health/unhealthy`;
  - if that API is not available (e.g. returns `404`), the operator falls back to sending `SIGTERM` to the TiProxy process by `pods/exec`;
  - the operator records `core.pingcap.com/tiproxy-graceful-shutdown-begin-time` on the pod and starts the delete-delay timer, deleting the pod only after the delay has elapsed.
- When the `Cluster` is deleting, this graceful delay is skipped and the pod is deleted directly.

This design keeps the user-facing control in `spec`, avoids changing `terminationGracePeriodSeconds`, and supports both new TiProxy versions (with the unhealthy API) and older ones (with the `SIGTERM` fallback).
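The `pods/exec` permission needed for that SIGTERM fallback would look roughly like the following rule in the operator's ClusterRole (a sketch; the exact placement inside charts/tidb-operator/templates/rbac.yaml may differ):

```yaml
# Lets the controller exec into TiProxy pods to send SIGTERM when the
# /api/debug/health/unhealthy endpoint is not available (older TiProxy versions).
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
```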
Usage

Patch an existing `TiProxyGroup`, so old TiProxy pods will be kept for a while after they are marked unhealthy:
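For example (illustrative only: the namespace, group name, and 24h value are placeholders):

```yaml
# patch-tiproxygroup.yaml
# Apply with:
#   kubectl -n <namespace> patch tiproxygroups.core.pingcap.com <name> --type merge --patch-file patch-tiproxygroup.yaml
spec:
  template:
    spec:
      # Keep each drained TiProxy pod for 24h after it is marked unhealthy.
      gracefulShutdownDeleteDelaySeconds: 86400
```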