deploy: capture manager console log on ssh fail by ideaship · Pull Request #2897 · osism/testbed

ideaship · 2026-05-29T11:21:35Z

Problem

testbed-upgrade-stable-ubuntu-24.04 (and other jobs running
playbooks/deploy.yml) intermittently fail during manager bring-up at
Wait until ssh public key authentication to the manager works: sshd is up
and the host key is scannable, but public-key auth is refused for the whole
retry window. The usual cause is a cloud-init timing race — the image user's
authorized_keys has not been written yet.

The hard part is diagnosis. The manager's cloud-init log cannot be pulled over
SSH (SSH is exactly what failed), and post.yml then runs cleanup.py
unconditionally, destroying the manager — and its console — before anything
reads it. Every occurrence so far has had to be reconstructed from the
orchestrator side only.

Most recent occurrence, which prompted this change (periodic-midnight,
2026-05-29 — 60/60 public-key retries denied, failed at ~8m):
https://zuul.services.osism.tech/t/osism/build/441363899a4446aeb1146da5efb5899e

What this PR does

Wraps the three readiness probes (port-22 wait, host-key scan, auth probe) in a
block, and on failure captures the manager's serial console log
out-of-band via openstack console log show using the orchestrator's existing
cloud credentials (OS_CLOUD) — no manager SSH required. The log is printed
into the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the post-run
cleanup tears the manager down.

This is diagnostic instrumentation: the success path is unchanged. The next
time the race fires, the cloud-init console output should show whether
cloud-init stalled or errored, and where.

Notes:

manager_server_name (default testbed-manager) is parameterised rather
than hard-coded.
The rescue covers all three probes and reports which task actually failed.
The capture is failed_when: false so an instrumentation hiccup cannot mask
the real failure.

Earlier work on this problem

deploy: replace fixed delays with event-driven waits for SSH auth and docker #2887 (deploy: replace fixed delays with event-driven waits for SSH auth and
docker, merged 2026-05-08) added this very probe, replacing a fixed 60s
pause with a retrying ssh-keyscan + active public-key probe so bootstrap
continues as soon as auth is ready. It absorbs the race when cloud-init
finishes within the window; this PR adds the diagnostics for when it does not.
Retry SSH connection setup to fix intermittent publickey errors python-osism#2282 (ANSIBLE_SSH_RETRIES=3) addressed a related but
distinct SSH "Permission denied (publickey)" race on the production
osism apply path (a per-task ControlPath cold-burst) — a different
manifestation, noted here for context.

Testing

YAML validated locally; full ansible-lint / syntax check runs in CI. The
capture path only runs on a real readiness failure, so it is best validated by
the next occurrence (or a forced failure).

🤖 Generated with Claude Code

When the manager never accepts SSH public-key authentication during bring-up, the "Wait until ssh public key authentication to the manager works" probe exhausts its 60 retries and the job fails on first contact with the manager. The usual cause is a cloud-init timing race: sshd is up and the host key is scannable, but cloud-init has not yet written the image user's authorized_keys. Diagnosing it needs the manager's cloud-init log, which cannot be fetched over SSH precisely because SSH is what failed -- and post.yml then runs cleanup.py unconditionally, destroying the manager (and its console) before anything reads it. Wrap the port-22 wait, host-key scan, and the auth probe in a block, and on failure capture the manager's serial console log out-of-band via "openstack console log show" using the orchestrator's existing cloud credentials (OS_CLOUD), no manager SSH required. The log is printed into the job output for triage, then the play re-fails so the build still reports the failure. The capture runs in the run phase, before the post-run cleanup tears the manager down. The rescue covers all three readiness probes, so the final failure message reports which task actually failed (ansible_failed_task.name / ansible_failed_result.msg) rather than assuming the auth probe. The manager instance name is parameterised via manager_server_name ("{{ testbed_prefix | default('testbed') }}-manager") instead of being hard-coded. The capture step is failed_when: false so an instrumentation hiccup cannot mask the real failure. The block carries a "# noqa: osism-fqcn" to work around a rule bug on ansible-lint images predating osism/container-images#902: the rule mis-detects the block keyword as a non-FQCN action. The suppression is harmless once that fix ships. AI-assisted: Claude Code Signed-off-by: Roger Luethi <luethi@osism.tech>

osism-agent added this to Human Board May 29, 2026

github-project-automation Bot moved this to Ready in Human Board May 29, 2026

ideaship force-pushed the deploy-capture-manager-console-log branch from 61a7c5e to 6ecdda0 Compare May 29, 2026 11:31

ideaship marked this pull request as ready for review May 29, 2026 11:42

ideaship moved this from Ready to In progress in Human Board May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deploy: capture manager console log on ssh fail#2897

deploy: capture manager console log on ssh fail#2897
ideaship wants to merge 1 commit into
mainfrom
deploy-capture-manager-console-log

ideaship commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ideaship commented May 29, 2026

Problem

What this PR does

Earlier work on this problem

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants