deploy: capture manager console log on ssh fail#2897
Open
ideaship wants to merge 1 commit into
Open
Conversation
When the manager never accepts SSH public-key authentication during
bring-up, the "Wait until ssh public key authentication to the manager
works" probe exhausts its 60 retries and the job fails on first contact
with the manager. The usual cause is a cloud-init timing race: sshd is
up and the host key is scannable, but cloud-init has not yet written the
image user's authorized_keys. Diagnosing it needs the manager's
cloud-init log, which cannot be fetched over SSH precisely because SSH
is what failed -- and post.yml then runs cleanup.py unconditionally,
destroying the manager (and its console) before anything reads it.
Wrap the port-22 wait, host-key scan, and the auth probe in a block, and
on failure capture the manager's serial console log out-of-band via
"openstack console log show" using the orchestrator's existing cloud
credentials (OS_CLOUD), no manager SSH required. The log is printed into
the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the
post-run cleanup tears the manager down.
The rescue covers all three readiness probes, so the final failure
message reports which task actually failed (ansible_failed_task.name /
ansible_failed_result.msg) rather than assuming the auth probe. The
manager instance name is parameterised via manager_server_name
("{{ testbed_prefix | default('testbed') }}-manager") instead of being
hard-coded. The capture step is failed_when: false so an instrumentation
hiccup cannot mask the real failure.
The block carries a "# noqa: osism-fqcn" to work around a rule bug on
ansible-lint images predating osism/container-images#902: the rule
mis-detects the block keyword as a non-FQCN action. The suppression is
harmless once that fix ships.
AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
61a7c5e to
6ecdda0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
testbed-upgrade-stable-ubuntu-24.04(and other jobs runningplaybooks/deploy.yml) intermittently fail during manager bring-up atWait until ssh public key authentication to the manager works: sshd is up
and the host key is scannable, but public-key auth is refused for the whole
retry window. The usual cause is a cloud-init timing race — the image user's
authorized_keyshas not been written yet.The hard part is diagnosis. The manager's cloud-init log cannot be pulled over
SSH (SSH is exactly what failed), and
post.ymlthen runscleanup.pyunconditionally, destroying the manager — and its console — before anything
reads it. Every occurrence so far has had to be reconstructed from the
orchestrator side only.
Most recent occurrence, which prompted this change (periodic-midnight,
2026-05-29 — 60/60 public-key retries denied, failed at ~8m):
https://zuul.services.osism.tech/t/osism/build/441363899a4446aeb1146da5efb5899e
What this PR does
Wraps the three readiness probes (port-22 wait, host-key scan, auth probe) in a
block, and on failure captures the manager's serial console logout-of-band via
openstack console log showusing the orchestrator's existingcloud credentials (
OS_CLOUD) — no manager SSH required. The log is printedinto the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the post-run
cleanup tears the manager down.
This is diagnostic instrumentation: the success path is unchanged. The next
time the race fires, the cloud-init console output should show whether
cloud-init stalled or errored, and where.
Notes:
manager_server_name(defaulttestbed-manager) is parameterised ratherthan hard-coded.
failed_when: falseso an instrumentation hiccup cannot maskthe real failure.
Earlier work on this problem
docker, merged 2026-05-08) added this very probe, replacing a fixed 60s
pause with a retrying
ssh-keyscan+ active public-key probe so bootstrapcontinues as soon as auth is ready. It absorbs the race when cloud-init
finishes within the window; this PR adds the diagnostics for when it does not.
ANSIBLE_SSH_RETRIES=3) addressed a related butdistinct SSH "Permission denied (publickey)" race on the production
osism applypath (a per-task ControlPath cold-burst) — a differentmanifestation, noted here for context.
Testing
YAML validated locally; full
ansible-lint/ syntax check runs in CI. Thecapture path only runs on a real readiness failure, so it is best validated by
the next occurrence (or a forced failure).
🤖 Generated with Claude Code