RFC: Ansible test coverage for the Hetzner playbook (Molecule + Lima)
Author: @abtreece
Status: Draft for maintainer sign-off
Target repo: fullstaq-ruby/infra (this issue lives in umbrella because it requests cross-cutting policy approval before the PR lands)
Spike branch: abtreece/infra@spike/ansible-molecule-mvp (fork; two commits — 103026a MVP scenario + seam edits, 0a5d120 recurse:true idempotence fix)
Plan + results: posted alongside this RFC on request — full plan is ~660 lines, Phase 0 results is ~150 lines
Summary
I'm proposing we add a Molecule + Lima test scenario to infra/ansible/ so we can converge the Ansible playbook against a real VM locally and in CI. A working MVP has been built and proves the approach end-to-end. Before I open the PR, I'd like maintainer sign-off on the addition. The scenario makes no live Azure / GCS / GitHub calls; the test seam that makes this possible requires four small prod-file edits.
Problem
infra/ansible/ is the sole configuration management for backend.fullstaqruby.org (~351 lines across 11 task files) and has zero automated test coverage. code-reviews.yml validates EditorConfig and Terraform but does not exercise Ansible. Today's loop is: edit → PR → reviewer reads diff → merge → hope it converges on the live VM. That loop misses undefined vars, deprecated module syntax, template render errors, idempotence drift, missing handler notifications, and broken systemd units.
Proposal
Add a single Molecule scenario (ansible/molecule/default/) using the delegated driver against a Lima-managed Debian 12 VM. The scenario:
- Runs main.yml's task list with molecule_test: true, which causes four prod tasks to skip live external calls
- Stages a minimal service-contract fixture (Sinatra+Puma) so the apiserver systemd unit can start
- Asserts via verify.yml that all five services (caddy, prometheus, fail2ban, ssh, apiserver) are systemctl is-active, the apiserver Unix socket exists, caddy validate passes, UFW is active, SSH is hardened, and unattended-upgrade --dry-run succeeds
- Passes molecule idempotence (changed=0 on second converge)
One scenario, one intended source of truth (main.yml), and one assertion file. Phase 2 drift prevention is required because converge.yml currently mirrors the task list (see "Known gaps" below). Lima is the VM backend both locally and in CI (ubuntu-24.04 runners have exposed /dev/kvm since Apr 2024).
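As a rough illustration of the assertions listed above, a verify.yml along these lines would cover the service, socket, and Caddy checks. This is a sketch, not the file from the spike branch: the socket path and the Caddyfile path are hypothetical placeholders.

```yaml
# molecule/default/verify.yml (sketch; paths below are hypothetical)
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Each service reports active
      # systemctl is-active exits non-zero for anything but "active",
      # which fails the task and therefore the verify stage.
      ansible.builtin.command:
        argv: [systemctl, is-active, "{{ item }}"]
      changed_when: false
      loop: [caddy, prometheus, fail2ban, ssh, apiserver]

    - name: Stat the apiserver Unix socket
      ansible.builtin.stat:
        path: /run/apiserver/apiserver.sock  # hypothetical path
      register: apiserver_socket

    - name: Socket exists and is a socket
      ansible.builtin.assert:
        that:
          - apiserver_socket.stat.exists
          - apiserver_socket.stat.issock

    - name: Caddy config validates
      ansible.builtin.command:
        argv: [caddy, validate, --config, /etc/caddy/Caddyfile]  # assumed config path
      changed_when: false
```

Marking the read-only commands with changed_when: false keeps the verify output free of spurious "changed" results.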
Proof
Phase 0 (test-seam spike) is complete. molecule test runs the full lifecycle clean end-to-end: destroy → create → prepare → converge → idempotence → verify → destroy. Zero live Azure / GCS / GitHub-release calls. All five services active against the fixture.
Full results doc with exit-criteria checklist, architecture, tooling, and reproduction steps will be attached as a follow-up comment on this issue (collapsed <details> block) so it's one click away rather than a separate request.
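For readers unfamiliar with Molecule, the lifecycle above maps onto a scenario config roughly like the following. This is a minimal sketch under assumptions, not the spike branch's molecule.yml: the instance name is hypothetical, and setting molecule_test through inventory group_vars is one possible way to activate the seam. In recent Molecule releases the delegated driver is configured under the name default; older releases used delegated.

```yaml
# molecule/default/molecule.yml (sketch; not copied from the spike branch)
driver:
  name: default  # called "delegated" in older Molecule releases
platforms:
  - name: molecule-debian12  # hypothetical instance name for the Lima VM
provisioner:
  name: ansible
  inventory:
    group_vars:
      all:
        molecule_test: true  # assumed mechanism for enabling the test seam
verifier:
  name: ansible
scenario:
  # Same order as the lifecycle described above.
  test_sequence:
    - destroy
    - create
    - prepare
    - converge
    - idempotence
    - verify
    - destroy
```

With the delegated driver, create and destroy are ordinary playbooks in the scenario directory, which is where the Lima start and stop commands would live.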
What this required (the load-bearing question)
Four prod files needed minimal edits to introduce a molecule_test test seam. Total: ~10 lines added, all behind when: not (molecule_test | default(false)) or a Jinja {% if molecule_test %} guard. Prod behavior is functionally identical when molecule_test is unset — the seam edits alone do not change prod runtime. (Ansible's default trim_blocks: true should make the caddy-env.j2 render byte-identical too, but I'll diff the rendered file against main before opening the productionization PR rather than rely on the default.)
| File | Change |
|---|---|
| templates/caddy-env.j2 | Jinja conditional substitutes stub values for the two lookup('pipe', az keyvault ...) calls |
| tasks/caddy.yml | Gates the repo-version query script and Caddy config copy (Molecule installs stubs at the same paths) |
| tasks/apiserver-deployer.yml | Gates the deployer script copy (Molecule installs a no-op stub) |
| tasks/unattended-upgrades.yml | Gates the GitHub-release deb (unattended-upgrades-prometheus-collector) |
These four edits are the load-bearing change set this RFC asks maintainers to accept. The MVP collapses to lint-only without them.
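To make the shape of a seam edit concrete, here is what one gated task could look like. The when: expression is the one quoted above; the task name, destination, and ownership are hypothetical and only illustrate the pattern.

```yaml
# tasks/caddy.yml (illustrative excerpt; name, dest, and ownership are hypothetical)
- name: Install Caddy config
  ansible.builtin.copy:
    src: Caddyfile
    dest: /etc/caddy/Caddyfile
    owner: root
    group: root
    mode: "0644"
  # With molecule_test unset, default(false) makes this condition true,
  # so the task runs exactly as it does today. Only a run that sets
  # molecule_test: true skips it.
  when: not (molecule_test | default(false))
```

The default(false) filter is what keeps prod untouched: no inventory or vars file has to define molecule_test for the playbook to behave as before.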
CI plan (sketch)
Out of scope for the MVP PR — flagged here so maintainers can sign off on the direction without committing to specifics. Concrete .github/workflows/ansible.yml lands in the productionization PR.
- Runner: ubuntu-24.04 (x64). GitHub-hosted x64 runners have exposed /dev/kvm since Apr 2024, which makes Lima viable in CI directly. Caveat: GitHub's docs label nested virtualization "experimental and unsupported" even though /dev/kvm is exposed; see fallback below.
- Trigger: path-filtered pull_request and push on ansible/** and the workflow file itself. No always-on cost.
- Two jobs:
  - lint: yamllint + ansible-lint against ansible/. Cheap, runs first.
  - molecule: molecule test (full lifecycle) on ubuntu-24.04 against a Lima Debian 12 guest. needs: lint.
- KVM access: the runner user lacks kvm group membership by default; the workflow installs a one-line udev rule (KERNEL=="kvm", GROUP="kvm", MODE="0666") before starting Lima.
- Image cache: actions/cache keyed on lima.yaml hash for ~/.cache/lima (the Lima-actions setup helper does not cache by itself). Cuts steady-state runtime by avoiding a fresh image download per run.
- No secrets: the four seam edits ensure no live Azure / GCS / GitHub-release calls and no secrets are needed, so this is safe on third-party PRs without secret exposure, consistent with the existing code-reviews.yml no-secrets-on-third-party-PRs constraint. (The scenario still needs general internet access for apt, Debian image download, Caddy binary download, and fixture bundle install; it isn't air-gapped.)
- Fallback if Lima-in-CI proves unreliable: held in reserve, not built. Two options: (a) run the same scenario against an ephemeral Hetzner Cloud VM via a CI-only secret (loses the no-secrets property for third-party PRs), or (b) degrade CI to lint-only and keep molecule test as a contributor-only loop. Decision deferred until we have CI runtime data.
- Estimated runtime (needs CI measurement): lint ~30s; molecule ~5–8 min cold (image download + apt installs). Warm-cache numbers are harder to predict: molecule test destroys the VM each run, so apt installs and fixture bundle install rerun even with a cached Lima image. I'll measure once the workflow lands and revise.
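A direction-only sketch of the workflow described in this section follows. Several details are assumptions rather than decisions: the lima.yaml location, the udev rules filename, installing the linters and Molecule from ansible/requirements.txt, the Python version, and lima-vm/lima-actions/setup@v1 as the reference for the Lima-actions setup helper.

```yaml
# .github/workflows/ansible.yml (direction sketch only)
name: Ansible

on:
  pull_request:
    paths:
      - "ansible/**"
      - ".github/workflows/ansible.yml"
  push:
    paths:
      - "ansible/**"
      - ".github/workflows/ansible.yml"

jobs:
  lint:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"  # assumed version
      - name: Install pinned tooling  # assumes pins live in ansible/requirements.txt
        run: pip install -r ansible/requirements.txt
      - name: Lint
        working-directory: ansible
        run: |
          yamllint .
          ansible-lint

  molecule:
    needs: lint
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - name: Let the runner user open /dev/kvm
        run: |
          echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666"' | sudo tee /etc/udev/rules.d/99-kvm.rules
          sudo udevadm control --reload-rules
          sudo udevadm trigger --name-match=kvm
      - name: Cache Lima images
        uses: actions/cache@v4
        with:
          path: ~/.cache/lima
          key: lima-${{ hashFiles('ansible/molecule/default/lima.yaml') }}  # assumed file location
      - name: Install Lima
        uses: lima-vm/lima-actions/setup@v1  # assumed reference for the setup helper
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"  # assumed version
      - name: Install pinned tooling
        run: pip install -r ansible/requirements.txt
      - name: Run the full Molecule lifecycle
        working-directory: ansible
        run: molecule test
```

No secrets: or environment blocks appear anywhere in the sketch, which is what keeps it safe to run on third-party PRs.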
Known gaps and Phase 2 commitments
The MVP is deliberately minimal. Before merging the productionization PR (Phase 2 in the plan), I'll close these gaps that surfaced during the spike or in pre-RFC review:
- Drift prevention between main.yml and converge.yml. The converge playbook currently mirrors main.yml's task-import list. That's drift waiting to happen. Phase 2 will either factor the shared task imports into a single included file used by both, or add a small CI check that asserts the two import lists match.
- Caddy reverse-proxy wiring is not actually exercised end-to-end. Today verify.yml checks Caddy is active and caddy validate passes. Phase 2 will add a cheap curl -fsS http://127.0.0.1:8080/admin/health through the test Caddyfile to the apiserver Unix socket, proving the proxy wiring without testing app semantics.
- Production Caddyfile validation. The runtime scenario installs a test Caddyfile (to avoid ACME and pin the proxy route deterministically), which means caddy validate is currently checking the test config, not ansible/files/Caddyfile. A broken prod Caddyfile could pass today. Phase 2 will add a separate caddy validate step against the production Caddyfile rendered with dummy env values, so syntax/adapter regressions in the prod config are still caught.
- Collection pinning. requirements.txt pins Python deps; community.general and ansible.posix are currently installed manually. Phase 2 will add ansible/requirements.yml pinning the collections and wire ansible-galaxy collection install -r requirements.yml into both contributor docs and CI. yamllint and ansible-lint will be pinned alongside (Phase 1).
- Categorize the bootstrap packages. The MVP installs cron, ufw, acl, rsyslog, ruby-dev, build-essential in prepare.yml. These split into three categories: server runtime deps that probably belong in tasks/essentials.yml (cron, ufw, rsyslog); Ansible transport/become deps that legitimately stay in prepare.yml (acl); and fixture-build-only deps that stay in prepare.yml because prod ships pre-built vendor/bundle (ruby-dev, build-essential). I'll classify each before opening the productionization PR.
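The first drift-prevention option above (a single shared import file) could look roughly like this. It is a sketch only: the file name tasks/all.yml, the hosts values, and the order of imports are hypothetical, and only task files named elsewhere in this RFC are listed.

```yaml
# ansible/tasks/all.yml (hypothetical name): the one list of task imports
- name: Essentials
  ansible.builtin.import_tasks: essentials.yml
- name: Caddy
  ansible.builtin.import_tasks: caddy.yml
- name: Apiserver deployer
  ansible.builtin.import_tasks: apiserver-deployer.yml
- name: Unattended upgrades
  ansible.builtin.import_tasks: unattended-upgrades.yml
# The remaining task files would follow, in main.yml's current order.
---
# ansible/main.yml: the production play keeps its own hosts and vars
- name: Configure backend
  hosts: all  # hypothetical; keep whatever main.yml targets today
  tasks:
    - name: Import the shared task list
      ansible.builtin.import_tasks: tasks/all.yml
---
# ansible/molecule/default/converge.yml: same list, no second copy to maintain
- name: Converge
  hosts: all
  tasks:
    - name: Import the shared task list
      ansible.builtin.import_tasks: ../../tasks/all.yml
```

With this layout, adding a task file means editing one list, and converge.yml cannot silently fall behind main.yml.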
Bonus: latent prod issues surfaced
Running the MVP against a fresh Debian 12 cloud image surfaced eight real gaps in the playbook, masked in prod by the specific Hetzner image's preinstalled packages or by manual setup history. The most significant:
- tasks/apiserver-deployer.yml is permanently non-idempotent. The "Create apiserver deployment directory" task used recurse: true with mode: 0755. The recurse walked into vendor/bundle/ruby/*/bin/ (gem-bin shims) and the latest symlink; both report 0777 from stat() because they're symlinks. Linux can't actually chmod a symlink, but Ansible still reports changed: true. Every ansible-playbook run after the first apiserver release deploy has been reporting at least one task changed, forever, and there was no idempotence test in CI to catch it. The fix is included in the spike branch: recurse: true is removed. Release tarballs are extracted by the deployer-as-itself, so ownership inside is already correct and the recurse was over-defensive.
- Undeclared package dependencies surfaced on a fresh Debian cloud image. cron, ufw, rsyslog are server-runtime gaps the playbook assumes preinstalled; acl is an Ansible become-as-non-root transport dep; ruby-dev and build-essential are fixture-build-only (prod ships a pre-built vendor/bundle, so they aren't prod gaps). See "Categorize the bootstrap packages" above.
- One ansible_architecture deprecation warning (will break in ansible-core 2.24)
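For the recurse finding in the list above, the fixed task would have roughly this shape. The task name comes from the finding itself; the path, owner, and group are hypothetical stand-ins for whatever the playbook uses.

```yaml
# tasks/apiserver-deployer.yml (illustrative; path, owner, and group are hypothetical)
- name: Create apiserver deployment directory
  ansible.builtin.file:
    path: /opt/apiserver
    state: directory
    owner: apiserver-deployer
    group: apiserver-deployer
    mode: "0755"
    # recurse: true intentionally dropped. With it, the module descends into
    # the release tree, finds symlinks reporting 0777, and reports "changed"
    # on every run even though nothing can actually be modified.
```

Without recurse, the task only manages the top-level directory, so a second converge reports it as unchanged.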
Per the singleton-issue convention, I'd file these as separate focused issues in fullstaq-ruby/infra rather than bundle them into the MVP PR. Open to maintainer preference.
Non-goals (faithful to the plan)
- No real ACME/TLS issuance (test Caddyfile uses auto_https off, listens :8080)
- No real Azure Key Vault, GCS, or GitHub-release integration
- No OIDC JWT verification (covered by infra/apiserver/ Ruby tests)
- No live fail2ban banning (only verifies the unit starts and config parses)
- No AppArmor profile management (today's task only installs the package)
Questions for maintainers
- Are the four molecule_test-gated edits in prod task files acceptable? They are no-ops in prod but introduce a test-mode concept into the playbook. This is the load-bearing approval the rest of the work depends on.
- Should bootstrap packages (cron, ufw, acl, rsyslog) be added to the production playbook (e.g. tasks/essentials.yml) so it's self-contained on a bare Debian image? The MVP installs them in Molecule's prepare.yml, preserving prod behavior but documenting the dependency.
- Bundle vs split the latent prod issues? Recommend: split (one issue per finding). Confirm preference.
- Two PRs, not one. I'll split the spike branch into (a) MVP scenario + the four seam edits and (b) the recurse: true idempotence fix. Distinct intent, distinct risk profile; either could land independently. Flagging in case a maintainer prefers a different grouping.
- Is the CI shape sketched above acceptable? Specifically: GitHub-hosted ubuntu-24.04 runners with Lima (vs. a self-hosted runner or a Hetzner-cloud fallback), and no-secrets-on-third-party-PRs preserved by the offline-friendly seam design. Direction-only sign-off is enough; concrete workflow lands in the productionization PR.
Proposed next steps
If maintainers approve the four test-seam edits:
- Open PR (a) against fullstaq-ruby/infra with the MVP scenario + the four seam edits
- Open PR (b) with the recurse: true idempotence fix (independent; can land in either order)
- File separate issues for the remaining latent findings (one per finding)
- Phases 1–4 from the plan (lint, productionize with the gaps above closed, CI integration, announce) — follow-up PRs