fix: change etcd to use /livez and /readyz#7381
fix: change etcd to use /livez and /readyz#7381vgt-rangehrn wants to merge 1 commit intokarmada-io:masterfrom
Conversation
|
Welcome @vgt-rangehrn! It looks like this is your first PR to karmada-io/karmada 🎉 |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the etcd configuration across deployment manifests and operator templates. The primary goal is to resolve compatibility issues with newer etcd images that lack the shell environment required for the previous health check mechanism, transitioning instead to standard Kubernetes HTTP probes. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request updates the etcd configuration across deployment manifests, Helm charts, and the operator to use HTTP-based liveness and readiness probes on port 8080, replacing the previous exec-based liveness probes. It also configures etcd to listen for metrics on port 8080. The review feedback suggests declaring port 8080 in the container's ports list for better visibility and recommends making the metrics port configurable within the operator logic to avoid hardcoding the value.
There was a problem hiding this comment.
Pull request overview
This PR updates Karmada’s embedded etcd manifests to use etcd’s HTTP /livez and /readyz endpoints for liveness/readiness checks, addressing failures caused by the etcd 3.6.6-0 image no longer including /bin/sh.
Changes:
- Switch etcd liveness probing from
exec(/bin/sh+etcdctl) tohttpGeton/livez. - Add etcd readiness probing via
httpGeton/readyz. - Enable an HTTP metrics/health listener via
--listen-metrics-urlsto serve these endpoints.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| operator/pkg/controlplane/etcd/manifests.go | Updates operator-managed etcd StatefulSet template with /livez//readyz probes and metrics listener. |
| charts/karmada/templates/etcd.yaml | Updates Helm internal-etcd StatefulSet template probes and adds metrics listener args. |
| artifacts/deploy/karmada-etcd.yaml | Updates the static deploy manifest probes and adds metrics listener args. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| livenessProbe: | ||
| exec: | ||
| command: | ||
| - /bin/sh | ||
| - -ec | ||
| - etcdctl endpoint health --endpoints https://127.0.0.1:{{ .EtcdListenClientPort }} --cacert=/etc/karmada/pki/etcd/etcd-ca.crt --cert=/etc/karmada/pki/etcd/etcd-server.crt --key=/etc/karmada/pki/etcd/etcd-server.key | ||
| httpGet: | ||
| path: /livez | ||
| port: 8080 | ||
| scheme: HTTP | ||
| initialDelaySeconds: 15 | ||
| timeoutSeconds: 5 | ||
| periodSeconds: 10 | ||
| successThreshold: 1 | ||
| failureThreshold: 3 |
There was a problem hiding this comment.
This changes the etcd liveness probe from a very long initial delay/period (600s/60s) to a much more aggressive cadence (initialDelaySeconds=15, failure after ~30s). That can cause restart loops on slower starts (e.g., disk recovery or large data dirs). Consider keeping the previous timing values, or adding a startupProbe to cover slow initialization while keeping liveness strict for steady state.
There was a problem hiding this comment.
In my testing, etcd took something like 12s to be ready. I increased the initialDelay to 60s, so the pod would only be restarted if it wasn't ready after 90s. I think that should be fine, but I'm open to input.
| - --listen-client-urls=https://0.0.0.0:{{ .EtcdListenClientPort }} | ||
| - --listen-peer-urls=http://0.0.0.0:{{ .EtcdListenPeerPort }} | ||
| - --listen-metrics-urls=http://0.0.0.0:8080 | ||
| - --advertise-client-urls=https://{{ .EtcdClientService }}.{{ .Namespace }}.svc.cluster.local:{{ .EtcdListenClientPort }} |
There was a problem hiding this comment.
Binding etcd’s metrics/health listener to 0.0.0.0 over plain HTTP makes /metrics, /livez, and /readyz reachable from any pod that can reach the pod IP, which is a security posture change compared to the previous mTLS-only client port checks. If this endpoint only needs to be consumed by kubelet, consider mitigating exposure (e.g., a NetworkPolicy to restrict access in the namespace, or alternative probe mechanisms that don’t require an unauthenticated HTTP listener).
There was a problem hiding this comment.
I think this shouldn't be an issue, but I'm open to other opinions.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7381 +/- ##
==========================================
- Coverage 42.17% 41.92% -0.25%
==========================================
Files 875 879 +4
Lines 53603 54334 +731
==========================================
+ Hits 22605 22779 +174
- Misses 29301 29831 +530
- Partials 1697 1724 +27
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
cc @XiShanYongYe-Chang here, as I remember we just updated the endpoint at #7151. |
|
Thanks @vgt-rangehrn can you help sign off refer to https://github.com/karmada-io/karmada/pull/7381/checks?check_run_id=70778541996 |
|
Hi @vgt-rangehrn Are you having some kind of problem? |
No, I just don't have too much time to work on this at the moment. But I'll try to make some progress now and again. |
|
Thanks @vgt-rangehrn take it easy. |
|
If the pr is ready for review, cc me anytime. thx~ |
|
@XiShanYongYe-Chang I think it's ready. I basically did test-driven development, so I'm not sure if there are other places in the code which require changes which are not caught by the tests. |
|
Thanks, let me take a look~ |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request transitions etcd liveness and readiness probes from exec-based checks to HTTP-based probes on a new metrics port (2381). These updates are applied to deployment manifests, Helm charts, the operator, and karmadactl. Feedback was provided regarding a large anonymous struct in the operator's etcd implementation, suggesting it be refactored into a named struct to improve code maintainability and encapsulation.
| etcdStatefulSetBytes, err := util.ParseTemplate(KarmadaEtcdStatefulSet, struct { | ||
| KarmadaInstanceName, StatefulSetName, Namespace, Image string | ||
| ImagePullPolicy, EtcdClientService, CertsSecretName string | ||
| InitialCluster, EtcdDataVolumeName, EtcdCipherSuites string | ||
| Replicas, EtcdListenClientPort, EtcdListenPeerPort int32 | ||
| KarmadaInstanceName, StatefulSetName, Namespace, Image string | ||
| ImagePullPolicy, EtcdClientService, CertsSecretName string | ||
| InitialCluster, EtcdDataVolumeName, EtcdCipherSuites string | ||
| Replicas, EtcdListenClientPort, EtcdListenPeerPort, EtcdMetricsPort int32 | ||
| }{ |
There was a problem hiding this comment.
The anonymous struct used for template parsing is becoming quite large (13 fields). While the repository style guide suggests limiting function parameters to 5, the same principle of readability and maintainability applies to structs. Consider refactoring this into a named struct. Note that per repository rules, the struct name should be plural if it logically represents a collection of multiple items (e.g., WorkloadAffinityGroups).
References
- Function parameters should be limited to 5; refactor or encapsulate if exceeded. (link)
- A struct name should be plural if it logically represents a collection of multiple items, even if those items are distinct fields within the struct.
There was a problem hiding this comment.
It's okay if this PR isn't handled.
When I create the pr, the etcd image is
|
| path: /livez | ||
| port: 2381 | ||
| scheme: HTTP | ||
| initialDelaySeconds: 60 |
There was a problem hiding this comment.
I observed that the value of initialDelaySeconds has been adjusted significantly. What is the reason for this adjustment?
There was a problem hiding this comment.
I tested locally, everything is working fine, not sure why we set it to 600s before.
There was a problem hiding this comment.
I started with the configuration from here and increased the initialDelaySeconds based on the copilot comment (and my own testing). On the other hand, there would probably be no harm in leaving it at 600.
|
Hi @vgt-rangehrn can you help squash the commits into one? |
Signed-off-by: Roman Angehrn <roman.angehrn@vgt.energy>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
XiShanYongYe-Chang
left a comment
There was a problem hiding this comment.
Thanks
/lgtm
/cc @RainbowMango
@vgt-rangehrn, do you think we need to cherry-pick it to the previous 1.17 release?
I think that's up to you, but it might be a good idea, since the current health check is not functional and etcd is always reported as healthy. |
Thanks, after this PR is merged, we will continue to handle the cherry-pick matters. Can you help add a release note? |

What type of PR is this?
/kind bug
What this PR does / why we need it:
Changes etcd to use /livez and /readyz endpoints for liveness and readiness checks.
Needed because the current etcd image (3.6.6-0) doesn't support the current liveness check (which uses /bin/sh which is no longer available in the container) anymore.
Which issue(s) this PR fixes:
Fixes #7380
Special notes for your reviewer:
The configuration for the liveness and readiness checks is based on this: https://etcd.io/docs/v3.6/op-guide/kubernetes/#example-manifest.
Does this PR introduce a user-facing change?: