Skip to content

deploy: capture manager console log on ssh fail#2897

Open
ideaship wants to merge 1 commit into
mainfrom
deploy-capture-manager-console-log
Open

deploy: capture manager console log on ssh fail#2897
ideaship wants to merge 1 commit into
mainfrom
deploy-capture-manager-console-log

Conversation

@ideaship
Copy link
Copy Markdown
Contributor

Problem

testbed-upgrade-stable-ubuntu-24.04 (and other jobs running
playbooks/deploy.yml) intermittently fail during manager bring-up at
Wait until ssh public key authentication to the manager works: sshd is up
and the host key is scannable, but public-key auth is refused for the whole
retry window. The usual cause is a cloud-init timing race — the image user's
authorized_keys has not been written yet.

The hard part is diagnosis. The manager's cloud-init log cannot be pulled over
SSH (SSH is exactly what failed), and post.yml then runs cleanup.py
unconditionally, destroying the manager — and its console — before anything
reads it. Every occurrence so far has had to be reconstructed from the
orchestrator side only.

Most recent occurrence, which prompted this change (periodic-midnight,
2026-05-29 — 60/60 public-key retries denied, failed at ~8m):
https://zuul.services.osism.tech/t/osism/build/441363899a4446aeb1146da5efb5899e

What this PR does

Wraps the three readiness probes (port-22 wait, host-key scan, auth probe) in a
block, and on failure captures the manager's serial console log
out-of-band via openstack console log show using the orchestrator's existing
cloud credentials (OS_CLOUD) — no manager SSH required. The log is printed
into the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the post-run
cleanup tears the manager down.

This is diagnostic instrumentation: the success path is unchanged. The next
time the race fires, the cloud-init console output should show whether
cloud-init stalled or errored, and where.

Notes:

  • manager_server_name (default testbed-manager) is parameterised rather
    than hard-coded.
  • The rescue covers all three probes and reports which task actually failed.
  • The capture is failed_when: false so an instrumentation hiccup cannot mask
    the real failure.

Earlier work on this problem

Testing

YAML validated locally; full ansible-lint / syntax check runs in CI. The
capture path only runs on a real readiness failure, so it is best validated by
the next occurrence (or a forced failure).

🤖 Generated with Claude Code

When the manager never accepts SSH public-key authentication during
bring-up, the "Wait until ssh public key authentication to the manager
works" probe exhausts its 60 retries and the job fails on first contact
with the manager. The usual cause is a cloud-init timing race: sshd is
up and the host key is scannable, but cloud-init has not yet written the
image user's authorized_keys. Diagnosing it needs the manager's
cloud-init log, which cannot be fetched over SSH precisely because SSH
is what failed -- and post.yml then runs cleanup.py unconditionally,
destroying the manager (and its console) before anything reads it.

Wrap the port-22 wait, host-key scan, and the auth probe in a block, and
on failure capture the manager's serial console log out-of-band via
"openstack console log show" using the orchestrator's existing cloud
credentials (OS_CLOUD), no manager SSH required. The log is printed into
the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the
post-run cleanup tears the manager down.

The rescue covers all three readiness probes, so the final failure
message reports which task actually failed (ansible_failed_task.name /
ansible_failed_result.msg) rather than assuming the auth probe. The
manager instance name is parameterised via manager_server_name
("{{ testbed_prefix | default('testbed') }}-manager") instead of being
hard-coded. The capture step is failed_when: false so an instrumentation
hiccup cannot mask the real failure.

The block carries a "# noqa: osism-fqcn" to work around a rule bug on
ansible-lint images predating osism/container-images#902: the rule
mis-detects the block keyword as a non-FQCN action. The suppression is
harmless once that fix ships.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship ideaship force-pushed the deploy-capture-manager-console-log branch from 61a7c5e to 6ecdda0 Compare May 29, 2026 11:31
@ideaship ideaship marked this pull request as ready for review May 29, 2026 11:42
@ideaship ideaship moved this from Ready to In progress in Human Board May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

2 participants