# RFC: Ansible test coverage for the Hetzner playbook (Molecule + Lima)

**Author:** @abtreece
**Status:** Draft for maintainer sign-off
**Target repo:** `fullstaq-ruby/infra` (this issue lives in `umbrella` because it requests cross-cutting policy approval before the PR lands)
**Spike branch:** [`abtreece/infra@spike/ansible-molecule-mvp`](https://github.com/abtreece/infra/tree/spike/ansible-molecule-mvp) (fork; two commits — [`103026a`](https://github.com/abtreece/infra/commit/103026a) MVP scenario + seam edits, [`0a5d120`](https://github.com/abtreece/infra/commit/0a5d120) `recurse:true` idempotence fix)
**Plan + results:** posted alongside this RFC on request (the full plan is ~660 lines; the Phase 0 results are ~150 lines)

## Summary

I'm proposing we add a Molecule + Lima test scenario to `infra/ansible/` so we can converge the Ansible playbook against a real VM locally and in CI. A working MVP has been built and proves the approach end-to-end. Before I open the PR, I'd like maintainer sign-off on the addition. The scenario makes no live Azure / GCS / GitHub calls, and the test seam requires four small prod-file edits.

## Problem

`infra/ansible/` is the sole configuration management for `backend.fullstaqruby.org` (~351 lines across 11 task files) and has zero automated test coverage. `code-reviews.yml` validates EditorConfig and Terraform but does not exercise Ansible. Today's loop is: edit → PR → reviewer reads diff → merge → hope it converges on the live VM. That loop misses undefined vars, deprecated module syntax, template render errors, idempotence drift, missing handler notifications, and broken systemd units.

## Proposal

Add a single Molecule scenario (`ansible/molecule/default/`) using the `delegated` driver against a Lima-managed Debian 12 VM. The scenario:

- Runs `main.yml`'s task list with `molecule_test: true`, which gates the live external calls in four prod files so they are skipped or stubbed
- Stages a minimal service-contract fixture (Sinatra+Puma) so the apiserver systemd unit can start
- Asserts via `verify.yml` that all five services (`caddy`, `prometheus`, `fail2ban`, `ssh`, `apiserver`) are `systemctl is-active`, the apiserver Unix socket exists, `caddy validate` passes, UFW is active, SSH is hardened, and `unattended-upgrade --dry-run` succeeds
- Passes `molecule idempotence` (changed=0 on second converge)

One scenario, one intended source of truth (`main.yml`), one assertion file. Phase 2 drift prevention is required because `converge.yml` currently mirrors the task list (see "Known gaps" below). Lima is the VM backend both locally and in CI (GitHub-hosted `ubuntu-24.04` runners have exposed `/dev/kvm` since Apr 2024).
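
A minimal sketch of how the `verify.yml` assertions listed above might be written. This is not the spike's actual file: the Caddyfile path and the `apiserver_socket_path` value are hypothetical, and the SSH-hardening check is left out because this RFC does not say which settings it asserts.

```yaml
# Hypothetical molecule/default/verify.yml; paths and variable names are assumptions.
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  vars:
    apiserver_socket_path: /run/apiserver/apiserver.sock  # assumed path
  tasks:
    - name: Check that each service is active
      ansible.builtin.command: "systemctl is-active {{ item }}"
      changed_when: false
      loop:
        - caddy
        - prometheus
        - fail2ban
        - ssh
        - apiserver

    - name: Stat the apiserver Unix socket
      ansible.builtin.stat:
        path: "{{ apiserver_socket_path }}"
      register: socket_stat

    - name: Assert that the socket exists and is a socket
      ansible.builtin.assert:
        that:
          - socket_stat.stat.exists and socket_stat.stat.issock

    - name: Validate the installed Caddy config
      ansible.builtin.command: caddy validate --config /etc/caddy/Caddyfile  # assumed path
      changed_when: false

    - name: Check that UFW is active
      ansible.builtin.command: ufw status
      register: ufw_status
      changed_when: false
      failed_when: "'Status: active' not in ufw_status.stdout"

    - name: Check that the unattended-upgrade dry run succeeds
      ansible.builtin.command: unattended-upgrade --dry-run
      changed_when: false
```

Each `command` task sets `changed_when: false` so the verify play reports no changes and fails only on a non-zero exit code or an explicit `failed_when`.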

## Proof

Phase 0 (test-seam spike) is **complete**. `molecule test` runs the full lifecycle clean end-to-end: destroy → create → prepare → converge → idempotence → verify → destroy. Zero live Azure / GCS / GitHub-release calls. All five services active against the fixture.

Full results doc with exit-criteria checklist, architecture, tooling, and reproduction steps will be attached as a follow-up comment on this issue (collapsed `<details>` block) so it's one click away rather than a separate request.

## What this required (the load-bearing question)

Four prod files needed minimal edits to introduce a `molecule_test` test seam. **Total: ~10 lines added, all behind `when: not (molecule_test | default(false))` or a Jinja `{% if molecule_test %}` guard. Prod behavior is functionally identical when `molecule_test` is unset** — the seam edits alone do not change prod runtime. (Ansible's default `trim_blocks: true` should make the `caddy-env.j2` render byte-identical too, but I'll diff the rendered file against `main` before opening the productionization PR rather than rely on the default.)

| File | Change |
|---|---|
| `templates/caddy-env.j2` | Jinja conditional substitutes stub values for the two `lookup('pipe', az keyvault ...)` calls |
| `tasks/caddy.yml` | Gates the repo-version query script and Caddy config copy (Molecule installs stubs at the same paths) |
| `tasks/apiserver-deployer.yml` | Gates the deployer script copy (Molecule installs a no-op stub) |
| `tasks/unattended-upgrades.yml` | Gates the GitHub-release deb (`unattended-upgrades-prometheus-collector`) |

These four edits are the load-bearing change set this RFC asks maintainers to accept. The MVP collapses to lint-only without them.
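
To make the shape of the seam concrete, here is a hedged sketch of one gated task and of a converge play that switches the seam on. Only the `when:` expression is quoted from this RFC; the task name, `src`, `dest`, and the way `molecule_test` is set are illustrative assumptions, not the spike's actual code.

```yaml
# Hypothetical excerpt in the style of tasks/apiserver-deployer.yml.
# With molecule_test unset, default(false) keeps the condition true,
# so the task runs in prod exactly as before.
- name: Install deployer script (illustrative name and paths)
  ansible.builtin.copy:
    src: files/example-deployer.sh
    dest: /usr/local/bin/example-deployer
    mode: "0755"
  when: not (molecule_test | default(false))
---
# Hypothetical molecule/default/converge.yml: sets the flag, then imports
# the same task files that main.yml imports (one shown here).
- name: Converge
  hosts: all
  become: true
  vars:
    molecule_test: true
  tasks:
    - name: Import the Caddy tasks
      ansible.builtin.import_tasks: ../../tasks/caddy.yml
```

Because the guard uses `default(false)`, no inventory or `group_vars` change is needed on the prod side; the variable only exists when a Molecule play defines it.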

## CI plan (sketch)

Out of scope for the MVP PR — flagged here so maintainers can sign off on the direction without committing to specifics. Concrete `.github/workflows/ansible.yml` lands in the productionization PR.

- **Runner:** `ubuntu-24.04` (x64). GitHub-hosted x64 runners have exposed `/dev/kvm` since Apr 2024, which makes Lima viable in CI directly. Caveat: GitHub's docs label nested virtualization "experimental and unsupported" even though `/dev/kvm` is exposed (see fallback below).
- **Trigger:** path-filtered `pull_request` and `push` on `ansible/**` and the workflow file itself. No always-on cost.
- **Two jobs:**
  1. `lint` — `yamllint` + `ansible-lint` against `ansible/`. Cheap, runs first.
  2. `molecule` — `molecule test` (full lifecycle) on `ubuntu-24.04` against a Lima Debian 12 guest. `needs: lint`.
- **KVM access:** the runner user lacks `kvm` group membership by default; the workflow installs a one-line udev rule (`KERNEL=="kvm", GROUP="kvm", MODE="0666"`) before starting Lima.
- **Image cache:** `actions/cache` keyed on `lima.yaml` hash for `~/.cache/lima` (the Lima-actions setup helper does not cache by itself). Cuts steady-state runtime by avoiding a fresh image download per run.
- **No secrets:** the four seam edits ensure no live Azure / GCS / GitHub-release calls and no secrets are needed, so this is safe on third-party PRs without secret exposure — consistent with the existing `code-reviews.yml` no-secrets-on-third-party-PRs constraint. (The scenario still needs general internet access for apt, Debian image download, Caddy binary download, and fixture `bundle install` — it isn't air-gapped.)
- **Fallback if Lima-in-CI proves unreliable:** held in reserve, not built. Two options: (a) run the same scenario against an ephemeral Hetzner Cloud VM via a CI-only secret (loses the no-secrets property for third-party PRs), or (b) degrade CI to lint-only and keep `molecule test` as a contributor-only loop. Decision deferred until we have CI runtime data.
- **Estimated runtime (needs CI measurement):** lint ~30s; molecule ~5–8 min cold (image download + apt installs). Warm-cache numbers are harder to predict — `molecule test` destroys the VM each run, so apt installs and fixture `bundle install` rerun even with a cached Lima image. I'll measure once the workflow lands and revise.
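
The bullets above could translate into a workflow roughly like the following. This is a direction sketch only, not the workflow that will land: the `lima.yaml` and `requirements.txt` locations, the udev rule filename, and the Python version are assumptions, and `lima-vm/lima-actions/setup@v1` is my best understanding of the Lima setup helper's name and should be checked against its README.

```yaml
# Hypothetical .github/workflows/ansible.yml (sketch; paths and versions are assumptions).
name: Ansible
on:
  pull_request:
    paths:
      - "ansible/**"
      - ".github/workflows/ansible.yml"
  push:
    paths:
      - "ansible/**"
      - ".github/workflows/ansible.yml"

jobs:
  lint:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install linters (to be pinned in Phase 1)
        run: pip install yamllint ansible-lint
      - name: Lint
        run: |
          yamllint ansible/
          ansible-lint ansible/

  molecule:
    needs: lint
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Allow the runner user to open /dev/kvm
        run: |
          echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666"' | sudo tee /etc/udev/rules.d/99-kvm.rules
          sudo udevadm control --reload-rules
          sudo udevadm trigger --name-match=kvm
      - name: Cache Lima images
        uses: actions/cache@v4
        with:
          path: ~/.cache/lima
          key: lima-${{ hashFiles('ansible/molecule/default/lima.yaml') }}
      - name: Set up Lima
        uses: lima-vm/lima-actions/setup@v1
      - name: Install Python deps
        run: pip install -r ansible/requirements.txt
      - name: Run the full Molecule lifecycle
        working-directory: ansible
        run: molecule test
```

No `secrets.*` reference appears anywhere in the sketch, which is what keeps it safe to run on third-party PRs.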

## Known gaps and Phase 2 commitments

The MVP is deliberately minimal. Before merging the productionization PR (Phase 2 in the plan), I'll close these gaps that surfaced during the spike or in pre-RFC review:

- **Drift prevention between `main.yml` and `converge.yml`.** The converge playbook currently mirrors `main.yml`'s task-import list. That's drift waiting to happen. Phase 2 will either factor the shared task imports into a single included file used by both, or add a small CI check that asserts the two import lists match.
- **Caddy reverse-proxy wiring is not actually exercised end-to-end.** Today `verify.yml` checks Caddy is active and `caddy validate` passes. Phase 2 will add a cheap `curl -fsS http://127.0.0.1:8080/admin/health` through the test Caddyfile to the apiserver Unix socket — proving the proxy wiring without testing app semantics.
- **Production Caddyfile validation.** The runtime scenario installs a test Caddyfile (to avoid ACME and pin the proxy route deterministically), which means `caddy validate` is currently checking the test config — not `ansible/files/Caddyfile`. A broken prod Caddyfile could pass today. Phase 2 will add a separate `caddy validate` step against the production Caddyfile rendered with dummy env values, so syntax/adapter regressions in the prod config are still caught.
- **Collection pinning.** `requirements.txt` pins Python deps; `community.general` and `ansible.posix` are currently installed manually. Phase 2 will add `ansible/requirements.yml` pinning the collections and wire `ansible-galaxy collection install -r requirements.yml` into both contributor docs and CI. `yamllint` and `ansible-lint` will be pinned alongside (Phase 1).
- **Categorize the bootstrap packages.** The MVP installs `cron`, `ufw`, `acl`, `rsyslog`, `ruby-dev`, `build-essential` in `prepare.yml`. These split into three categories: server runtime deps that probably belong in `tasks/essentials.yml` (`cron`, `ufw`, `rsyslog`); Ansible transport/become deps that legitimately stay in `prepare.yml` (`acl`); and fixture-build-only deps that stay in `prepare.yml` because prod ships pre-built `vendor/bundle` (`ruby-dev`, `build-essential`). I'll classify each before opening the productionization PR.
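
As an illustration of the proxy-wiring check described above, the `curl` call could be expressed as a `verify.yml` task with `ansible.builtin.uri`. This is a sketch under the assumption that the test Caddyfile listens on `127.0.0.1:8080` and routes `/admin/health` to the apiserver socket, as the bullet states; the task name is hypothetical.

```yaml
# Hypothetical verify.yml task; equivalent in intent to
# `curl -fsS http://127.0.0.1:8080/admin/health`.
- name: Check the proxy route through Caddy to the apiserver socket
  ansible.builtin.uri:
    url: http://127.0.0.1:8080/admin/health
    status_code: 200
```

The task fails on any non-200 response, so it catches a broken `reverse_proxy` route without asserting anything about the response body.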

## Bonus: latent prod issues surfaced

Running the MVP against a fresh Debian 12 cloud image surfaced eight real gaps in the playbook, masked in prod by the specific Hetzner image's preinstalled packages or by manual setup history. The most significant:

- **`tasks/apiserver-deployer.yml` is permanently non-idempotent.** `Create apiserver deployment directory` used `recurse: true` with `mode: 0755`. The recurse walked into `vendor/bundle/ruby/*/bin/` (gem-bin shims) and the `latest` symlink. Both report `0777` from `lstat()` because they're symlinks. Linux can't actually `chmod` a symlink, but Ansible still reports `changed: true`. **Every `ansible-playbook` run after the first apiserver release deploy has been reporting at least one task changed, forever.** There was no idempotence test in CI to catch it. **Fix is included in the spike branch:** `recurse: true` is removed. Release tarballs are extracted by the deployer-as-itself, so ownership inside is already correct and the recurse was over-defensive.
- Undeclared package dependencies surfaced on a fresh Debian cloud image. `cron`, `ufw`, `rsyslog` are server-runtime gaps the playbook assumes preinstalled; `acl` is an Ansible become-as-non-root transport dep; `ruby-dev` and `build-essential` are fixture-build-only (prod ships a pre-built `vendor/bundle`, so they aren't prod gaps). See "Categorize the bootstrap packages" above.
- One `ansible_architecture` deprecation warning (will break in ansible-core 2.24)

Per the singleton-issue convention, I'd file these as separate focused issues in `fullstaq-ruby/infra` rather than bundle them into the MVP PR. Open to maintainer preference.
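
For reference, a hedged sketch of the shape of the first and last findings. The `apiserver_deploy_dir` variable is hypothetical, other parameters the real task may set (such as `owner`) are omitted, and the second task assumes the `ansible_architecture` warning is the deprecation of top-level injected fact variables, which this RFC does not state explicitly.

```yaml
# Hypothetical rewrite of the directory task: same mode, no recursion,
# so symlinks under the release tree are never touched.
- name: Create apiserver deployment directory
  ansible.builtin.file:
    path: "{{ apiserver_deploy_dir }}"  # hypothetical variable
    state: directory
    mode: "0755"
    # recurse: true  (removed; this is what reported changed on every run)

# Assumed fix for the deprecation warning: read the fact from ansible_facts
# instead of the injected top-level ansible_architecture variable.
- name: Show the CPU architecture
  ansible.builtin.debug:
    msg: "{{ ansible_facts['architecture'] }}"
```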

## Non-goals (faithful to the plan)

- No real ACME/TLS issuance (test Caddyfile uses `auto_https off`, listens `:8080`)
- No real Azure Key Vault, GCS, or GitHub-release integration
- No OIDC JWT verification (covered by `infra/apiserver/` Ruby tests)
- No live `fail2ban` banning (only verifies the unit starts and config parses)
- No AppArmor profile management (today's task only installs the package)
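
A sketch of how the test Caddyfile from the first non-goal might be staged by Molecule. Only `auto_https off` and the `:8080` listener come from this RFC; the destination path, the socket path, and the task itself are assumptions for illustration.

```yaml
# Hypothetical Molecule task that installs a test Caddyfile.
- name: Install test Caddyfile (no ACME, plain HTTP on :8080)
  ansible.builtin.copy:
    dest: /etc/caddy/Caddyfile  # assumed path
    mode: "0644"
    content: |
      {
          auto_https off
      }

      :8080 {
          # assumed socket path
          reverse_proxy unix//run/apiserver/apiserver.sock
      }
```

With automatic HTTPS disabled, Caddy never attempts certificate issuance, which is what keeps the scenario free of ACME traffic.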

## Questions for maintainers

1. **Are the four `molecule_test`-gated edits in prod files (three task files and one template) acceptable?** They are no-ops in prod but introduce a test-mode concept into the playbook. This is the load-bearing approval the rest of the work depends on.
2. **Should bootstrap packages (`cron`, `ufw`, `acl`, `rsyslog`) be added to the production playbook (e.g. `tasks/essentials.yml`) so it's self-contained on a bare Debian image?** The MVP installs them in Molecule's `prepare.yml`, preserving prod behavior but documenting the dependency.
3. **Bundle vs split the latent prod issues?** Recommend: split (one issue per finding). Confirm preference.
4. **Two PRs, not one.** I'll split the spike branch into (a) MVP scenario + the four seam edits and (b) the `recurse: true` idempotence fix. Distinct intent, distinct risk profile — either could land independently. Flagging in case a maintainer prefers a different grouping.
5. **Is the CI shape sketched above acceptable?** Specifically: GitHub-hosted `ubuntu-24.04` runners with Lima (vs. a self-hosted runner or a Hetzner-cloud fallback), and no-secrets-on-third-party-PRs preserved by the secret-free seam design. Direction-only sign-off is enough; the concrete workflow lands in the productionization PR.
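
To make question 2 concrete, this is one possible split of the bootstrap packages, following the three categories in "Known gaps". It is a sketch of an option, not a decision: the task names are hypothetical, and whether `acl` belongs in the prod playbook is exactly what the question leaves open.

```yaml
# Hypothetical addition to tasks/essentials.yml: server runtime packages
# the playbook currently assumes are preinstalled.
- name: Install server runtime packages
  ansible.builtin.apt:
    name:
      - cron
      - ufw
      - rsyslog
    state: present
    update_cache: true
---
# Hypothetical molecule/default/prepare.yml: packages only the test run needs.
- name: Prepare
  hosts: all
  become: true
  tasks:
    - name: Install Ansible transport and fixture-build packages
      ansible.builtin.apt:
        name:
          - acl              # become-as-non-root transport dep
          - ruby-dev         # fixture build only
          - build-essential  # fixture build only
        state: present
        update_cache: true
```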

## Proposed next steps

If maintainers approve the four test-seam edits:

1. Open PR (a) against `fullstaq-ruby/infra` with the MVP scenario + the four seam edits
2. Open PR (b) with the `recurse: true` idempotence fix (independent; can land in either order)
3. File separate issues for the remaining latent findings (one per finding)
4. Phases 1–4 from the plan (lint, productionize with the gaps above closed, CI integration, announce) — follow-up PRs


