multiple: collect host-level journal entries for better debugging #2297
Open
sespiros wants to merge 10 commits into main from sse/gather-host-logs
Commits
f775953 get-logs: collect host-level journal entries (sespiros)
b2df1ab justfile: integrate log collection into e2e (sespiros)
c51a9d1 justfile: add download-logs target (sespiros)
00d86b0 e2e: use download-logs in CI (sespiros)
ff7f57c dev-docs: add e2e debugging guide (sespiros)
1c30e77 get-logs: collect pod-sandbox metadata via crictl (sespiros)
59f364a get-logs: namespace host logs and metadata per node (sespiros)
5fa4a04 fixup! dev-docs: add e2e debugging guide (sespiros)
1cea044 fixup! get-logs: namespace host logs and metadata per node (sespiros)
51c5001 fixup! get-logs: collect host-level journal entries (sespiros)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| # Debugging e2e failures | ||
|
|
||
| ## Collecting logs | ||
|
|
||
| ### After a `just e2e` run | ||
|
|
||
| `just e2e` deploys a log-collector DaemonSet that streams pod logs from the | ||
| start. After the test finishes (pass or fail), download the logs: | ||
|
|
||
| ```bash | ||
| just download-logs | ||
| ``` | ||
|
|
||
| Logs are written to `workspace/logs/`. | ||
|
|
||
| ### After a manual deployment (`just`) | ||
|
|
||
| If you deployed with `just` (the default target) and want to collect logs: | ||
|
|
||
| ```bash | ||
| just download-logs | ||
| ``` | ||
|
|
||
| This deploys the log-collector DaemonSet (if not already running), collects | ||
| host-level journal entries, and downloads everything. | ||
|
|
||
| ### In CI | ||
|
|
||
| CI runs `just download-logs` automatically after every e2e test. Logs are | ||
| uploaded as GitHub Actions artifacts. To find them: go to the workflow run, | ||
| scroll to the bottom of the run summary page, and look for artifacts named | ||
| `e2e_pod_logs-<platform>-<test>` (for example, `e2e_pod_logs-Metal-QEMU-SNP-openssl`). | ||
| Alternatively, expand the "Upload logs" step of a particular test to get the | ||
| artifact download URL. | ||
|
|
||
| ## Log structure | ||
|
|
||
| ``` | ||
| workspace/logs/ | ||
| ├── <namespace>_<pod>_<uid>/ # pod container logs | ||
| │ └── <container>/0.log | ||
| ├── host/<node-name>/ # host-level journal logs (per node) | ||
| │ ├── kernel.log # journalctl -k (SEV-ES termination, VFIO/IOMMU) | ||
| │ ├── k3s.log # journalctl -u k3s (k3s-specific kubelet/containerd) | ||
| │ ├── kubelet.log # journalctl -u kubelet (non-k3s runners) | ||
| │ ├── containerd.log # journalctl -u containerd (non-k3s runners) | ||
| │ └── kata.log # journalctl -t kata (QEMU lifecycle, register dumps) | ||
| ├── metadata/<node-name>/ | ||
| │ └── sandbox-map.txt # CVM pod name -> kata sandbox ID | ||
| └── <namespace>-k8s-events.yaml # kubernetes events | ||
| ``` | ||
|
|
||
| Host logs start at the namespace creation time, so they exclude journal | ||
| entries from before the test run. | ||
|
|
||
| ## Debugging CVM failures | ||
|
|
||
| CVM boot failures (for example, SEV-ES termination, OVMF crashes) leave no trace in | ||
| pod logs because the guest never starts. Look at host-level logs instead: | ||
|
|
||
| 1. **kernel.log**: look for `SEV-ES guest requested termination`, VFIO/IOMMU | ||
| errors, or KVM failures. | ||
| 2. **kata.log**: look for `detected guest crash`, QEMU launch arguments, | ||
| register dumps, and console output (`vmconsole=` lines contain guest serial | ||
| output). | ||
| 3. **k3s.log**: look for `task is in unknown state` or containerd errors that | ||
| indicate the CVM process died. | ||
|
|
||
| ## Tracing a pod to its sandbox in kata.log | ||
|
|
||
| kata.log contains interleaved logs from all sandboxes. The collected metadata | ||
| file (`metadata/<node-name>/sandbox-map.txt`) maps CVM pod names to kata sandbox IDs. | ||
| The sandbox map only includes pods that are still running at log collection time. | ||
| Pods deleted earlier in the test (for example, in the regression test, which | ||
| creates and tears down multiple rounds of pods) have no entries. | ||
|
|
||
| 1. Find the sandbox ID for a pod: | ||
|
|
||
| ```bash | ||
| cat workspace/logs/metadata/*/sandbox-map.txt | ||
| # coordinator-0 f4bb878b2e58bd3bd5a89fe2bc99b7368fc6aa070a0b8490a5c69a7c9816be65 | ||
| # openssl-backend-757688b785-dvr4c 3658285f5581ad51... | ||
| # openssl-frontend-575dfdbb89-srwvr 828d8660496f6ac4... | ||
| ``` | ||
|
|
||
| 2. Filter kata.log for a specific pod's sandbox: | ||
|
|
||
| ```bash | ||
| sandbox=$(grep -h coordinator workspace/logs/metadata/*/sandbox-map.txt | awk '{print $2}') | ||
| grep "$sandbox" workspace/logs/host/*/kata.log | ||
| ``` | ||
|
|
||
| ### Fallback: Finding sandboxes by runtime class hash | ||
|
|
||
| If a pod is missing from the sandbox map (deleted before log collection), you | ||
| can find its sandbox ID using the runtime class hash from kata.log. The hash | ||
| is the last component of the runtime class name (for example, `d17bc85e` from | ||
| `contrast-cc-metal-qemu-snp-d17bc85e`): | ||
|
|
||
| ```bash | ||
| grep "d17bc85e" workspace/logs/host/*/kata.log | grep -oP 'sandbox=\K[a-f0-9]+' | sort -u | ||
| ``` | ||
|
|
||
| This lists all sandbox IDs for that runtime class. Cross-reference with the | ||
| sandbox map to identify which ones are unmapped. | ||
|
|
||
| Note that some kata log lines (config loading, factory init, device cold plug) | ||
| don't have a sandbox ID. These are shared across all CVMs and may be relevant | ||
| for debugging startup failures. | ||
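The lookup and filter steps from "Tracing a pod to its sandbox in kata.log" can be wrapped in small helpers that work across the per-node directories. This is a minimal sketch and not part of the PR: the function names `sandbox_for_pod` and `kata_lines_for_pod` are hypothetical, and it assumes the tab-separated `<pod-name>\t<sandbox-id>` format that the collection script writes to `sandbox-map.txt`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the PR).

# Print the sandbox ID(s) recorded for an exact pod name, across all nodes.
# Usage: sandbox_for_pod <logs-dir> <pod-name>
sandbox_for_pod() {
  local logs="$1" pod="$2"
  # Concatenate every node's map; awk compares the first field exactly,
  # so "coordinator-0" does not also match "coordinator-01".
  cat "$logs"/metadata/*/sandbox-map.txt 2>/dev/null |
    awk -F'\t' -v pod="$pod" '$1 == pod { print $2 }'
}

# Print every kata.log line (from any node) that mentions the pod's sandbox.
# Usage: kata_lines_for_pod <logs-dir> <pod-name>
kata_lines_for_pod() {
  local logs="$1" pod="$2" sandbox
  sandbox="$(sandbox_for_pod "$logs" "$pod")"
  if [[ -z $sandbox ]]; then
    echo "no sandbox recorded for $pod" >&2
    return 1
  fi
  # -h omits file names, -F treats the ID as a fixed string.
  grep -hF -- "$sandbox" "$logs"/host/*/kata.log
}
```

For example, `kata_lines_for_pod workspace/logs coordinator-0` would print the kata.log lines for that pod, provided the pod was still running when logs were collected.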
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| #!/usr/bin/env bash | ||
| # Copyright 2026 Edgeless Systems GmbH | ||
| # SPDX-License-Identifier: BUSL-1.1 | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| since="${1:?usage: collect-host-logs <since>}" | ||
| node="${NODE_NAME:?NODE_NAME must be set}" | ||
| mkdir -p "/export/logs/host/$node" | ||
| echo "Collecting kernel logs (since $since)..." >&2 | ||
| journalctl --directory=/journal -k --since="$since" --no-pager 2>/dev/null >/export/logs/host/"$node"/kernel.log || rm -f /export/logs/host/"$node"/kernel.log | ||
| echo "Collecting k3s logs (since $since)..." >&2 | ||
| journalctl --directory=/journal -u k3s --since="$since" --no-pager 2>/dev/null >/export/logs/host/"$node"/k3s.log || rm -f /export/logs/host/"$node"/k3s.log | ||
| echo "Collecting kubelet logs (since $since)..." >&2 | ||
| journalctl --directory=/journal -u kubelet --since="$since" --no-pager 2>/dev/null >/export/logs/host/"$node"/kubelet.log || rm -f /export/logs/host/"$node"/kubelet.log | ||
| echo "Collecting containerd logs (since $since)..." >&2 | ||
| journalctl --directory=/journal -u containerd --since="$since" --no-pager 2>/dev/null >/export/logs/host/"$node"/containerd.log || rm -f /export/logs/host/"$node"/containerd.log | ||
| echo "Collecting kata logs (since $since)..." >&2 | ||
| journalctl --directory=/journal -t kata --since="$since" --no-pager 2>/dev/null >/export/logs/host/"$node"/kata.log || rm -f /export/logs/host/"$node"/kata.log | ||
| echo "Collecting pod-sandbox metadata..." >&2 | ||
| mkdir -p "/export/logs/metadata/$node" | ||
| for sock in /run/k3s/containerd/containerd.sock /run/containerd/containerd.sock; do | ||
| if [[ -S $sock ]]; then | ||
| CONTAINER_RUNTIME_ENDPOINT="unix://$sock" crictl pods -o json 2>/dev/null | | ||
| jq -r --arg ns "${POD_NAMESPACE:-}" \ | ||
| '.items[] | select(.metadata.namespace == $ns and .runtimeHandler != "" and .runtimeHandler != null) | "\(.metadata.name)\t\(.id)"' \ | ||
| >"/export/logs/metadata/$node/sandbox-map.txt" | ||
| break | ||
| fi | ||
| done | ||
| echo "Host log collection complete." >&2 |
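The script expects a `<since>` argument, and the debugging guide says host logs are scoped to the namespace creation time. A caller might derive that argument roughly as follows. This is a sketch under assumptions, not the PR's actual invocation: the helper name `to_journal_since` is hypothetical, it relies on GNU `date`, and the commented-out `kubectl` call only illustrates where the timestamp could come from.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): convert an RFC 3339 timestamp,
# the format Kubernetes uses for creationTimestamp, into the
# "YYYY-MM-DD HH:MM:SS UTC" form that `journalctl --since` accepts.
# Assumes GNU date.
to_journal_since() {
  date -u -d "$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# Illustrative use only (assumes kubectl access and a namespace name in $ns;
# the script itself also requires NODE_NAME to be set):
# created="$(kubectl get namespace "$ns" -o jsonpath='{.metadata.creationTimestamp}')"
# collect-host-logs "$(to_journal_since "$created")"
```

Normalizing to UTC keeps the cutoff unambiguous when the collecting host and the cluster nodes use different local time zones.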
File renamed without changes.
May be worth noting that we only get sandbox IDs for pods that are present during log collection (this mostly makes a difference for the regression test, where we're starting and stopping a lot of things).
Hmm, good point. I added a note. I haven't looked into the behavior of each of our regression tests. I assume that if a regression test fails, execution stops, so that's the point where you could run `just download-logs` and get the failed pod's sandbox included in the map.
No, it continues. But let's move that into a separate PR.
I added back some documentation for figuring out the mapping from the runtime and created a ticket for a follow-up PR for this.
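For pods that were already torn down at collection time, as discussed in this thread, the cross-reference between kata.log and the sandbox map could be scripted along these lines. This is a sketch, not part of the PR: the function name `unmapped_sandboxes` is hypothetical, and it assumes the `sandbox=<id>` log field and the tab-separated map format described in the guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): list sandbox IDs that appear in
# kata.log for a runtime class hash but have no entry in any sandbox-map.txt.
# Usage: unmapped_sandboxes <logs-dir> <runtime-class-hash>
unmapped_sandboxes() {
  local logs="$1" hash="$2"
  # comm -23 prints lines that occur only in the first (sorted) input:
  # IDs seen in kata.log minus IDs present in the per-node sandbox maps.
  comm -23 \
    <(grep -hF -- "$hash" "$logs"/host/*/kata.log |
      grep -oP 'sandbox=\K[a-f0-9]+' | sort -u) \
    <(cat "$logs"/metadata/*/sandbox-map.txt 2>/dev/null | cut -f2 | sort -u)
}
```

Running `unmapped_sandboxes workspace/logs d17bc85e` would then print only the sandbox IDs that cannot be resolved to a pod name through the map.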