
fix: change etcd to use /livez and /readyz #7381

Open
vgt-rangehrn wants to merge 1 commit into karmada-io:master from vgt-rangehrn:master
@vgt-rangehrn vgt-rangehrn commented Apr 10, 2026

What type of PR is this?
/kind bug

What this PR does / why we need it:
Changes etcd to use /livez and /readyz endpoints for liveness and readiness checks.
This is needed because the current etcd image (3.6.6-0) no longer supports the existing liveness check: that check runs through /bin/sh, which is no longer available in the container.
Which issue(s) this PR fixes:

Fixes #7380

Special notes for your reviewer:

The configuration for the liveness and readiness checks is based on this: https://etcd.io/docs/v3.6/op-guide/kubernetes/#example-manifest.
Does this PR introduce a user-facing change?:

`etcd`: Change health probes to use the `/livez` and `/readyz` endpoints, since the previous shell-based probes no longer work in the 3.6.6-0 etcd container image.

Copilot AI review requested due to automatic review settings April 10, 2026 11:56
@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 10, 2026
@karmada-bot
Contributor

Welcome @vgt-rangehrn! It looks like this is your first PR to karmada-io/karmada 🎉

@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 10, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the etcd configuration across deployment manifests and operator templates. The primary goal is to resolve compatibility issues with newer etcd images that lack the shell environment required for the previous health check mechanism, transitioning instead to standard Kubernetes HTTP probes.

Highlights

  • Liveness and Readiness Probes: Replaced the shell-based liveness probe with HTTP-based /livez and /readyz endpoints to ensure compatibility with newer etcd container images.
  • Metrics Configuration: Added the --listen-metrics-urls flag to the etcd startup arguments to support the new HTTP health check endpoints.
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the etcd configuration across deployment manifests, Helm charts, and the operator to use HTTP-based liveness and readiness probes on port 8080, replacing the previous exec-based liveness probes. It also configures etcd to listen for metrics on port 8080. The review feedback suggests declaring port 8080 in the container's ports list for better visibility and recommends making the metrics port configurable within the operator logic to avoid hardcoding the value.

Comment thread artifacts/deploy/karmada-etcd.yaml Outdated
Comment thread charts/karmada/templates/etcd.yaml Outdated
Comment thread operator/pkg/controlplane/etcd/manifests.go Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR updates Karmada’s embedded etcd manifests to use etcd’s HTTP /livez and /readyz endpoints for liveness/readiness checks, addressing failures caused by the etcd 3.6.6-0 image no longer including /bin/sh.

Changes:

  • Switch etcd liveness probing from exec (/bin/sh + etcdctl) to httpGet on /livez.
  • Add etcd readiness probing via httpGet on /readyz.
  • Enable an HTTP metrics/health listener via --listen-metrics-urls to serve these endpoints.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| operator/pkg/controlplane/etcd/manifests.go | Updates operator-managed etcd StatefulSet template with /livez and /readyz probes and metrics listener. |
| charts/karmada/templates/etcd.yaml | Updates Helm internal-etcd StatefulSet template probes and adds metrics listener args. |
| artifacts/deploy/karmada-etcd.yaml | Updates the static deploy manifest probes and adds metrics listener args. |


Comment thread operator/pkg/controlplane/etcd/manifests.go
Comment on lines 73 to 82
  livenessProbe:
-   exec:
-     command:
-     - /bin/sh
-     - -ec
-     - etcdctl endpoint health --endpoints https://127.0.0.1:{{ .EtcdListenClientPort }} --cacert=/etc/karmada/pki/etcd/etcd-ca.crt --cert=/etc/karmada/pki/etcd/etcd-server.crt --key=/etc/karmada/pki/etcd/etcd-server.key
+   httpGet:
+     path: /livez
+     port: 8080
+     scheme: HTTP
+   initialDelaySeconds: 15
    timeoutSeconds: 5
+   periodSeconds: 10
    successThreshold: 1
    failureThreshold: 3

Copilot AI Apr 10, 2026

This changes the etcd liveness probe from a very long initial delay/period (600s/60s) to a much more aggressive cadence (initialDelaySeconds=15, failure after ~30s). That can cause restart loops on slower starts (e.g., disk recovery or large data dirs). Consider keeping the previous timing values, or adding a startupProbe to cover slow initialization while keeping liveness strict for steady state.

Author

In my testing, etcd took about 12s to become ready. I increased initialDelaySeconds to 60s, so the pod would only be restarted if it still wasn't ready after roughly 90s. I think that should be fine, but I'm open to input.

Comment thread operator/pkg/controlplane/etcd/manifests.go
Comment on lines 53 to 56
  - --listen-client-urls=https://0.0.0.0:{{ .EtcdListenClientPort }}
  - --listen-peer-urls=http://0.0.0.0:{{ .EtcdListenPeerPort }}
+ - --listen-metrics-urls=http://0.0.0.0:8080
  - --advertise-client-urls=https://{{ .EtcdClientService }}.{{ .Namespace }}.svc.cluster.local:{{ .EtcdListenClientPort }}

Copilot AI Apr 10, 2026

Binding etcd’s metrics/health listener to 0.0.0.0 over plain HTTP makes /metrics, /livez, and /readyz reachable from any pod that can reach the pod IP, which is a security posture change compared to the previous mTLS-only client port checks. If this endpoint only needs to be consumed by kubelet, consider mitigating exposure (e.g., a NetworkPolicy to restrict access in the namespace, or alternative probe mechanisms that don’t require an unauthenticated HTTP listener).

Author

I think this shouldn't be an issue, but I'm open to other opinions.

Comment thread charts/karmada/templates/etcd.yaml
Comment thread charts/karmada/templates/etcd.yaml
Comment thread charts/karmada/templates/etcd.yaml
Comment thread artifacts/deploy/karmada-etcd.yaml
Comment thread artifacts/deploy/karmada-etcd.yaml
Comment thread artifacts/deploy/karmada-etcd.yaml Outdated
@vgt-rangehrn vgt-rangehrn marked this pull request as draft April 10, 2026 12:10
@karmada-bot karmada-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2026
@codecov-commenter

codecov-commenter commented Apr 10, 2026


Codecov Report

❌ Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 41.92%. Comparing base (d43f934) to head (45e71e3).
⚠️ Report is 62 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/karmadactl/cmdinit/kubernetes/command.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7381      +/-   ##
==========================================
- Coverage   42.17%   41.92%   -0.25%     
==========================================
  Files         875      879       +4     
  Lines       53603    54334     +731     
==========================================
+ Hits        22605    22779     +174     
- Misses      29301    29831     +530     
- Partials     1697     1724      +27     
Flag Coverage Δ
unittests 41.92% <96.77%> (-0.25%) ⬇️


@RainbowMango
Member

cc @XiShanYongYe-Chang here, as I remember we just updated the endpoint at #7151.

@XiShanYongYe-Chang
Member

Hi @vgt-rangehrn, are you having some kind of problem?

@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 14, 2026
@vgt-rangehrn
Author

Are you having some kind of problem?

No, I just don't have too much time to work on this at the moment. But I'll try to make some progress now and again.

@XiShanYongYe-Chang
Member

Thanks @vgt-rangehrn, take it easy.

@XiShanYongYe-Chang
Member

If the PR is ready for review, cc me anytime. Thanks~

@vgt-rangehrn
Author

@XiShanYongYe-Chang I think it's ready. I basically did test-driven development, so I'm not sure whether there are other places in the code that need changes but are not caught by the tests.

@vgt-rangehrn vgt-rangehrn marked this pull request as ready for review April 29, 2026 15:43
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2026
@XiShanYongYe-Chang
Member

Thanks, let me take a look~
/assign

@XiShanYongYe-Chang
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request transitions etcd liveness and readiness probes from exec-based checks to HTTP-based probes on a new metrics port (2381). These updates are applied to deployment manifests, Helm charts, the operator, and karmadactl. Feedback was provided regarding a large anonymous struct in the operator's etcd implementation, suggesting it be refactored into a named struct to improve code maintainability and encapsulation.

Comment on lines 69 to 74
  etcdStatefulSetBytes, err := util.ParseTemplate(KarmadaEtcdStatefulSet, struct {
-     KarmadaInstanceName, StatefulSetName, Namespace, Image string
-     ImagePullPolicy, EtcdClientService, CertsSecretName string
-     InitialCluster, EtcdDataVolumeName, EtcdCipherSuites string
-     Replicas, EtcdListenClientPort, EtcdListenPeerPort int32
+     KarmadaInstanceName, StatefulSetName, Namespace, Image string
+     ImagePullPolicy, EtcdClientService, CertsSecretName string
+     InitialCluster, EtcdDataVolumeName, EtcdCipherSuites string
+     Replicas, EtcdListenClientPort, EtcdListenPeerPort, EtcdMetricsPort int32
  }{

medium

The anonymous struct used for template parsing is becoming quite large (13 fields). While the repository style guide suggests limiting function parameters to 5, the same principle of readability and maintainability applies to structs. Consider refactoring this into a named struct. Note that per repository rules, the struct name should be plural if it logically represents a collection of multiple items (e.g., WorkloadAffinityGroups).

References
  1. Function parameters should be limited to 5; refactor or encapsulate if exceeded. (link)
  2. A struct name should be plural if it logically represents a collection of multiple items, even if those items are distinct fields within the struct.

Member

@XiShanYongYe-Chang XiShanYongYe-Chang May 6, 2026

It's okay if this isn't handled in this PR.

@XiShanYongYe-Chang
Member

cc @XiShanYongYe-Chang here, as I remember we just updated the endpoint at #7151.

When I created that PR, the etcd image was registry.k8s.io/etcd:3.6.0-0, which still allowed running /bin/sh:

image

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks

    path: /livez
    port: 2381
    scheme: HTTP
  initialDelaySeconds: 60

Member

I observed that the value of initialDelaySeconds has been adjusted significantly. What is the reason for this adjustment?

Member

I tested locally and everything works fine. I'm not sure why we set it to 600s before.

Author

@vgt-rangehrn vgt-rangehrn May 6, 2026

I started with the configuration from here and increased initialDelaySeconds based on the Copilot comment (and my own testing). On the other hand, there would probably be no harm in leaving it at 600.

@XiShanYongYe-Chang
Member

Hi @vgt-rangehrn, can you help squash the commits into one?

Signed-off-by: Roman Angehrn <roman.angehrn@vgt.energy>
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks
/lgtm
/cc @RainbowMango

@vgt-rangehrn, do you think we need to cherry-pick it to the previous 1.17 release?

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2026
@vgt-rangehrn
Author

do you think we need to cherry-pick it to the previous 1.17 release?

I think that's up to you, but it might be a good idea, since the current health check is not functional and etcd is always reported as healthy.

@XiShanYongYe-Chang
Member

I think that's up to you, but it might be a good idea, since the current health check is not functional and etcd is always reported as healthy.

Thanks. After this PR is merged, we will take care of the cherry-pick. Could you help add a release note?


Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • lgtm: Indicates that a PR is ready to be merged.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

etcd liveness probe doesn't work

6 participants