RFC: Ansible test coverage for the Hetzner playbook (Molecule + Lima) #12

Author: @abtreece
Status: Draft for maintainer sign-off
Target repo: fullstaq-ruby/infra (this issue lives in umbrella because it requests cross-cutting policy approval before the PR lands)
Spike branch: abtreece/infra@spike/ansible-molecule-mvp (fork; two commits — 103026a MVP scenario + seam edits, 0a5d120 recurse:true idempotence fix)
Plan + results: posted alongside this RFC on request. The full plan is ~660 lines and the Phase 0 results doc is ~150 lines.

Summary

I'm proposing we add a Molecule + Lima test scenario to infra/ansible/ so we can converge the Ansible playbook against a real VM locally and in CI. A working MVP has been built and proves the approach end-to-end. Before I open the PR, I'd like maintainer sign-off on the addition. The scenario makes no live Azure / GCS / GitHub calls, but its test seam requires four small prod-file edits.

Problem

infra/ansible/ is the sole configuration management for backend.fullstaqruby.org (~351 lines across 11 task files) and has zero automated test coverage. code-reviews.yml validates EditorConfig and Terraform but does not exercise Ansible. Today's loop is: edit → PR → reviewer reads diff → merge → hope it converges on the live VM. That loop misses undefined vars, deprecated module syntax, template render errors, idempotence drift, missing handler notifications, and broken systemd units.

Proposal

Add a single Molecule scenario (ansible/molecule/default/) using the delegated driver against a Lima-managed Debian 12 VM. The scenario:

  • Runs main.yml's task list with molecule_test: true, which causes four prod tasks to skip live external calls
  • Stages a minimal service-contract fixture (Sinatra+Puma) so the apiserver systemd unit can start
  • Asserts via verify.yml that all five services (caddy, prometheus, fail2ban, ssh, apiserver) report active under systemctl is-active, the apiserver Unix socket exists, caddy validate passes, UFW is active, SSH is hardened, and unattended-upgrade --dry-run succeeds
  • Passes molecule idempotence (changed=0 on second converge)

One scenario, one assertion file, and one intended source of truth (main.yml). Phase 2 drift prevention is required because converge.yml currently mirrors the task list (see "Known gaps" below). Lima is the VM backend both locally and in CI (GitHub-hosted ubuntu-24.04 runners have exposed /dev/kvm since Apr 2024).
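
To make the assertion list concrete, a verify.yml along the following lines would cover the service, socket, and config checks. This is a minimal sketch, not the spike branch's actual file: the socket path and Caddyfile path are hypothetical placeholders, and the SSH-hardening check is left out because this RFC does not say which sshd settings it asserts.

```yaml
# Sketch of molecule/default/verify.yml. Values marked "hypothetical" are
# assumptions made for illustration, not taken from the spike branch.
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  vars:
    expected_services: [caddy, prometheus, fail2ban, ssh, apiserver]
    apiserver_socket: /run/apiserver/apiserver.sock  # hypothetical path
    caddyfile_path: /etc/caddy/Caddyfile             # hypothetical path
  tasks:
    # systemctl is-active exits non-zero for any state other than "active",
    # so an inactive unit fails the task.
    - name: Each service reports active
      ansible.builtin.command: systemctl is-active {{ item }}
      loop: "{{ expected_services }}"
      changed_when: false

    - name: Stat the apiserver Unix socket
      ansible.builtin.stat:
        path: "{{ apiserver_socket }}"
      register: socket_stat

    - name: Socket exists and is a socket
      ansible.builtin.assert:
        that:
          - socket_stat.stat.exists
          - socket_stat.stat.issock

    - name: Caddy config validates
      ansible.builtin.command: caddy validate --config {{ caddyfile_path }}
      changed_when: false

    - name: UFW is active
      ansible.builtin.command: ufw status
      register: ufw_status
      changed_when: false
      failed_when: "'Status: active' not in ufw_status.stdout"

    - name: unattended-upgrade dry run succeeds
      ansible.builtin.command: unattended-upgrade --dry-run
      changed_when: false
```

Every command task sets changed_when: false so the verify play never reports changes of its own.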

Proof

Phase 0 (test-seam spike) is complete. molecule test runs the full lifecycle clean end-to-end: destroy → create → prepare → converge → idempotence → verify → destroy. Zero live Azure / GCS / GitHub-release calls. All five services active against the fixture.
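
For readers unfamiliar with Molecule, the lifecycle above maps onto scenario.test_sequence in molecule.yml. The following is a sketch under stated assumptions rather than the spike branch's file: the instance name is hypothetical, and setting molecule_test through the provisioner's group_vars is one possible way to pass the flag.

```yaml
# Sketch of molecule/default/molecule.yml; not copied from the spike branch.
driver:
  name: default          # the delegated driver; named "delegated" before Molecule 6
platforms:
  - name: lima-debian12  # hypothetical instance name
provisioner:
  name: ansible
  inventory:
    group_vars:
      all:
        molecule_test: true  # activates the test seam in the prod tasks
verifier:
  name: ansible
scenario:
  test_sequence:
    - destroy
    - create
    - prepare
    - converge
    - idempotence
    - verify
    - destroy
```

With the delegated driver, the scenario supplies its own create.yml and destroy.yml; in this design those would start and delete the Lima VM.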

Full results doc with exit-criteria checklist, architecture, tooling, and reproduction steps will be attached as a follow-up comment on this issue (collapsed <details> block) so it's one click away rather than a separate request.

What this required (the load-bearing question)

Four prod files needed minimal edits to introduce a molecule_test test seam. Total: ~10 lines added, all behind when: not (molecule_test | default(false)) or a Jinja {% if molecule_test %} guard. Prod behavior is functionally identical when molecule_test is unset — the seam edits alone do not change prod runtime. (Ansible's default trim_blocks: true should make the caddy-env.j2 render byte-identical too, but I'll diff the rendered file against main before opening the productionization PR rather than rely on the default.)

File                            Change
templates/caddy-env.j2          Jinja conditional substitutes stub values for the two lookup('pipe', az keyvault ...) calls
tasks/caddy.yml                 Gates the repo-version query script and Caddy config copy (Molecule installs stubs at the same paths)
tasks/apiserver-deployer.yml    Gates the deployer script copy (Molecule installs a no-op stub)
tasks/unattended-upgrades.yml   Gates the GitHub-release deb (unattended-upgrades-prometheus-collector)

These four edits are the load-bearing change set this RFC asks maintainers to accept. The MVP collapses to lint-only without them.
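
For reviewers who want to see the shape of a seam edit before opening the branch, a gated task looks roughly like this. The task body is a hypothetical stand-in in the style of tasks/unattended-upgrades.yml (the variable name is invented for illustration); only the when: expression is taken from this RFC.

```yaml
# Sketch of a gated prod task. The task name and collector_deb_url are
# hypothetical; the when: condition is the seam described above.
- name: Install unattended-upgrades-prometheus-collector from the GitHub release
  ansible.builtin.apt:
    deb: "{{ collector_deb_url }}"  # hypothetical variable
  # default(false) means the task runs exactly as before whenever
  # molecule_test is unset, which is the case in prod.
  when: not (molecule_test | default(false))
```

Because the condition defaults to false when the variable is undefined, prod inventories need no changes to keep their current behavior.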

CI plan (sketch)

Out of scope for the MVP PR — flagged here so maintainers can sign off on the direction without committing to specifics. Concrete .github/workflows/ansible.yml lands in the productionization PR.

  • Runner: ubuntu-24.04 (x64). GitHub-hosted x64 runners have exposed /dev/kvm since Apr 2024, which makes Lima viable in CI directly. Caveat: GitHub's docs label nested virtualization "experimental and unsupported" even though /dev/kvm is exposed; see the fallback below.
  • Trigger: path-filtered pull_request and push on ansible/** and the workflow file itself. No always-on cost.
  • Two jobs:
    1. lint: yamllint + ansible-lint against ansible/. Cheap, runs first.
    2. molecule: molecule test (full lifecycle) on ubuntu-24.04 against a Lima Debian 12 guest. needs: lint.
  • KVM access: the runner user lacks kvm group membership by default; the workflow installs a one-line udev rule (KERNEL=="kvm", GROUP="kvm", MODE="0666") before starting Lima.
  • Image cache: actions/cache keyed on lima.yaml hash for ~/.cache/lima (the Lima-actions setup helper does not cache by itself). Cuts steady-state runtime by avoiding a fresh image download per run.
  • No secrets: the four seam edits ensure no live Azure / GCS / GitHub-release calls and no secrets are needed, so this is safe on third-party PRs without secret exposure — consistent with the existing code-reviews.yml no-secrets-on-third-party-PRs constraint. (The scenario still needs general internet access for apt, Debian image download, Caddy binary download, and fixture bundle install — it isn't air-gapped.)
  • Fallback if Lima-in-CI proves unreliable: held in reserve, not built. Two options: (a) run the same scenario against an ephemeral Hetzner Cloud VM via a CI-only secret (loses the no-secrets property for third-party PRs), or (b) degrade CI to lint-only and keep molecule test as a contributor-only loop. Decision deferred until we have CI runtime data.
  • Estimated runtime (needs CI measurement): lint ~30s; molecule ~5–8 min cold (image download + apt installs). Warm-cache numbers are harder to predict — molecule test destroys the VM each run, so apt installs and fixture bundle install rerun even with a cached Lima image. I'll measure once the workflow lands and revise.
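
The CI shape described in the bullets above could be sketched as the workflow below. It is direction-only: action versions, the Python version, the udev rule filename, and the lima.yaml path are assumptions, and the Lima and Molecule installation steps are deliberately omitted because they depend on choices deferred to the productionization PR.

```yaml
# Sketch of .github/workflows/ansible.yml. Action versions, file paths, and
# install steps are assumptions, not the workflow that will ship.
name: ansible
on:
  pull_request:
    paths: ["ansible/**", ".github/workflows/ansible.yml"]
  push:
    paths: ["ansible/**", ".github/workflows/ansible.yml"]

jobs:
  lint:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"  # assumed version
      - run: pip install yamllint ansible-lint  # unpinned here; Phase 1 pins them
      - run: yamllint ansible/
      - run: ansible-lint ansible/

  molecule:
    needs: lint
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - name: Allow the runner user to open /dev/kvm
        run: |
          echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666"' \
            | sudo tee /etc/udev/rules.d/99-kvm.rules
          sudo udevadm control --reload-rules
          sudo udevadm trigger --name-match=kvm
      - name: Cache Lima images
        uses: actions/cache@v4
        with:
          path: ~/.cache/lima
          # hypothetical location of lima.yaml
          key: lima-${{ hashFiles('ansible/molecule/default/lima.yaml') }}
      # Lima and Molecule installation steps are omitted in this sketch.
      - run: molecule test
        working-directory: ansible
```

No step references secrets, which is what keeps the workflow safe to run on third-party PRs.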

Known gaps and Phase 2 commitments

The MVP is deliberately minimal. Before merging the productionization PR (Phase 2 in the plan), I'll close these gaps that surfaced during the spike or in pre-RFC review:

  • Drift prevention between main.yml and converge.yml. The converge playbook currently mirrors main.yml's task-import list. That's drift waiting to happen. Phase 2 will either factor the shared task imports into a single included file used by both, or add a small CI check that asserts the two import lists match.
  • Caddy reverse-proxy wiring is not actually exercised end-to-end. Today verify.yml checks Caddy is active and caddy validate passes. Phase 2 will add a cheap curl -fsS http://127.0.0.1:8080/admin/health through the test Caddyfile to the apiserver Unix socket — proving the proxy wiring without testing app semantics.
  • Production Caddyfile validation. The runtime scenario installs a test Caddyfile (to avoid ACME and pin the proxy route deterministically), which means caddy validate is currently checking the test config — not ansible/files/Caddyfile. A broken prod Caddyfile could pass today. Phase 2 will add a separate caddy validate step against the production Caddyfile rendered with dummy env values, so syntax/adapter regressions in the prod config are still caught.
  • Collection pinning. requirements.txt pins Python deps; community.general and ansible.posix are currently installed manually. Phase 2 will add ansible/requirements.yml pinning the collections and wire ansible-galaxy collection install -r requirements.yml into both contributor docs and CI. yamllint and ansible-lint will be pinned alongside (Phase 1).
  • Categorize the bootstrap packages. The MVP installs cron, ufw, acl, rsyslog, ruby-dev, build-essential in prepare.yml. These split into three categories: server runtime deps that probably belong in tasks/essentials.yml (cron, ufw, rsyslog); Ansible transport/become deps that legitimately stay in prepare.yml (acl); and fixture-build-only deps that stay in prepare.yml because prod ships pre-built vendor/bundle (ruby-dev, build-essential). I'll classify each before opening the productionization PR.
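
Two of the gaps above, the proxy-wiring check and the production Caddyfile validation, could be closed with verify tasks along these lines. This is a sketch under assumptions: it uses the uri module instead of curl and requires an exact 200 (slightly stricter than curl -f), the staged path for the production Caddyfile is hypothetical, and the environment variable name is a placeholder because this RFC does not list the real ones.

```yaml
# Sketch of two Phase 2 additions to the verify.yml task list.
- name: Request through the test Caddyfile reaches the apiserver socket
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/admin/health
    status_code: 200

- name: Production Caddyfile validates with dummy env values
  ansible.builtin.command: >-
    caddy validate --adapter caddyfile
    --config /tmp/prod-Caddyfile
  # /tmp/prod-Caddyfile is a hypothetical staged copy of ansible/files/Caddyfile.
  environment:
    EXAMPLE_SECRET: dummy  # hypothetical variable name
  changed_when: false
```

Keeping the production validation as a separate task leaves the runtime scenario on the test Caddyfile, so ACME is still never contacted.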

Bonus: latent prod issues surfaced

Running the MVP against a fresh Debian 12 cloud image surfaced eight real gaps in the playbook, masked in prod by the specific Hetzner image's preinstalled packages or by manual setup history. The most significant:

  • tasks/apiserver-deployer.yml is permanently non-idempotent. The "Create apiserver deployment directory" task used recurse: true with mode: 0755. The recursion walked into vendor/bundle/ruby/*/bin/ (gem-bin shims) and the latest symlink; both report 0777 from stat() because they're symlinks. Linux can't actually chmod a symlink, but Ansible still reports changed: true. As a result, every ansible-playbook run after the first apiserver release deploy has reported at least one changed task, and there was no idempotence test in CI to catch it. The fix is included in the spike branch: recurse: true is removed. Release tarballs are extracted by the deployer as itself, so ownership inside is already correct and the recurse was over-defensive.
  • Undeclared package dependencies surfaced on a fresh Debian cloud image. cron, ufw, rsyslog are server-runtime gaps the playbook assumes preinstalled; acl is an Ansible become-as-non-root transport dep; ruby-dev and build-essential are fixture-build-only (prod ships a pre-built vendor/bundle, so they aren't prod gaps). See "Categorize the bootstrap packages" above.
  • One ansible_architecture deprecation warning (will break in ansible-core 2.24)

Per the singleton-issue convention, I'd file these as separate focused issues in fullstaq-ruby/infra rather than bundle them into the MVP PR. Open to maintainer preference.
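
For context on the first and last findings, the fixed directory task and a fact lookup that avoids the deprecated variable might look like the sketch below. The path and owner are hypothetical, and the second task assumes the warning is the one ansible-core emits for top-level injected fact variables such as ansible_architecture.

```yaml
# Sketch only; path and owner are hypothetical placeholders.
- name: Create apiserver deployment directory
  ansible.builtin.file:
    path: /opt/apiserver       # hypothetical path
    state: directory
    owner: apiserver-deployer  # hypothetical user
    mode: "0755"
    # recurse: true removed, so the task no longer descends into symlinks
    # whose reported mode (0777) can never match 0755.

# Reading the fact through ansible_facts instead of the top-level
# ansible_architecture variable (assumed source of the deprecation warning).
- name: Show the architecture fact
  ansible.builtin.debug:
    msg: "{{ ansible_facts['architecture'] }}"
```

Without recurse, the file module only manages the top-level directory, so a second converge has nothing left to report as changed for this task.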

Non-goals (faithful to the plan)

  • No real ACME/TLS issuance (test Caddyfile uses auto_https off, listens :8080)
  • No real Azure Key Vault, GCS, or GitHub-release integration
  • No OIDC JWT verification (covered by infra/apiserver/ Ruby tests)
  • No live fail2ban banning (only verifies the unit starts and config parses)
  • No AppArmor profile management (today's task only installs the package)

Questions for maintainers

  1. Are the four molecule_test-gated edits in prod task files acceptable? They are no-ops in prod but introduce a test-mode concept into the playbook. This is the load-bearing approval the rest of the work depends on.
  2. Should bootstrap packages (cron, ufw, acl, rsyslog) be added to the production playbook (e.g. tasks/essentials.yml) so it's self-contained on a bare Debian image? The MVP installs them in Molecule's prepare.yml, preserving prod behavior but documenting the dependency.
  3. Bundle vs split the latent prod issues? Recommend: split (one issue per finding). Confirm preference.
  4. Two PRs, not one. I'll split the spike branch into (a) MVP scenario + the four seam edits and (b) the recurse: true idempotence fix. Distinct intent, distinct risk profile — either could land independently. Flagging in case a maintainer prefers a different grouping.
  5. Is the CI shape sketched above acceptable? Specifically: GitHub-hosted ubuntu-24.04 runners with Lima (vs. a self-hosted runner or a Hetzner-cloud fallback), and no-secrets-on-third-party-PRs preserved by the offline-friendly seam design. Direction-only sign-off is enough; concrete workflow lands in the productionization PR.

Proposed next steps

If maintainers approve the four test-seam edits:

  1. Open PR (a) against fullstaq-ruby/infra with the MVP scenario + the four seam edits
  2. Open PR (b) with the recurse: true idempotence fix (independent; can land in either order)
  3. File separate issues for the remaining latent findings (one per finding)
  4. Phases 1–4 from the plan (lint, productionize with the gaps above closed, CI integration, announce) — follow-up PRs
