
Stop cloning osism/testbed during deployment #2896

Draft
ideaship wants to merge 8 commits into main from sp1-kill-self-clone

@ideaship
Contributor

@ideaship ideaship commented May 25, 2026

Problem

The testbed deploy flow has a subtle self-defeating step: setup-testbed.py --prepare (triggered by make bootstrap) clones a fresh copy of osism/testbed
from GitHub into <checkout>/.src/github.com/osism/testbed. The
manager-part-1.yml synchronize task then pushes that GitHub clone,
not the developer's working checkout, to /opt/configuration on the manager.

The practical effect: local edits to the testbed checkout never reach the
manager. To test a local change you had to push it to GitHub first.

The Makefile worked around this by overriding basepath and repo_path
in the bootstrap deployment invocation, keeping local runs functional. But
this masked rather than fixed the problem: the override used $(PWD), which
make inherits from the calling shell's environment and does not update for
-C (unlike $(CURDIR)). It therefore names the caller's directory, not the
testbed checkout, so make -C /path/to/testbed deploy from a different CWD
would break. It also meant the playbook defaults were wrong for anyone not
using the Makefile.

Zuul was unaffected because it never runs --prepare; its basepath
resolved correctly via the old default (Zuul's checkout path matched the
hardcoded convention). So the problem was invisible in CI.

What changed

1. basepath derives from playbook_dir | dirname.

Every playbook used to compute basepath as
{{ ansible_user_dir }}/src/{{ repositories['testbed']['path'] }} — a
hardcoded path that only worked if the checkout lived at
~/src/github.com/osism/testbed. It now uses
{{ playbook_dir | dirname }}, Ansible's built-in variable for the
directory of the executing playbook. dirname strips one level to give
the checkout root.

This is correct for the intended invocation paths: Zuul's checkout, a
local developer's path, or a worktree at an arbitrary location.
playbook_dir resolves to where the playbook actually is. No convention,
symlink, or override needed. The sanity check (Change 4) catches
misconfigured invocations.
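
As a rough illustration of the derivation: Ansible's dirname filter behaves like Python's os.path.dirname, so the effect can be reproduced outside Ansible. The checkout location below is hypothetical, and the sketch assumes the playbooks sit one level below the checkout root, as described above.

```python
import os.path


def derive_basepath(playbook_dir):
    """Mimic `{{ playbook_dir | dirname }}`: strip the last path component."""
    return os.path.dirname(playbook_dir)


# Hypothetical worktree location; any other location behaves the same way.
playbook_dir = "/home/dev/worktrees/feature-x/playbooks"
print(derive_basepath(playbook_dir))  # /home/dev/worktrees/feature-x
```

The result depends only on where the playbook file lives, which is why no directory convention or Makefile override is needed.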

The Makefile's -e basepath=... and -e repo_path=... overrides are
removed; playbook_dir | dirname is now the single source of truth.
(deploy.yml's repo_path default is also fixed to derive from basepath
rather than ansible_user_dir, for the same reason.)

2. The testbed: entry is removed from playbooks/vars/repositories.yml.

setup-testbed.py --prepare clones every entry in repositories.yml into
<checkout>/.src/…. After Change 1, nothing reads repositories['testbed']
for its path, so the clone serves no purpose. Removing the entry stops
--prepare from pulling a fresh osism/testbed from GitHub on every run.

The remaining entries (ansible-collection-commons,
ansible-collection-services, terraform-base) are unaffected; they are
still needed for orchestrator-side ansible-galaxy collection install.
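
A minimal sketch of why removing the entry is enough to stop the clone, assuming --prepare simply loops over the entries and clones each one below <checkout>/.src/. The dictionary is a hypothetical stand-in for repositories.yml; the path values for the remaining entries are guesses, and the real file may carry more keys.

```python
import posixpath

# Hypothetical stand-in for playbooks/vars/repositories.yml after the change.
# The "path" values are assumptions made for illustration only.
repositories = {
    "ansible-collection-commons": {"path": "github.com/osism/ansible-collection-commons"},
    "ansible-collection-services": {"path": "github.com/osism/ansible-collection-services"},
    "terraform-base": {"path": "github.com/osism/terraform-base"},
}


def clone_targets(checkout, repos):
    """Return the directories a --prepare style loop would clone into."""
    return [posixpath.join(checkout, ".src", entry["path"]) for entry in repos.values()]


targets = clone_targets("/work/testbed", repositories)
for target in targets:
    print(target)
```

With no testbed key in the mapping, the loop never produces a .src/github.com/osism/testbed target.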

3. The synchronize task is replaced with an ephemeral worktree sync.

The old task pushed the contents of
{{ repo_path }}/osism/testbed/ — the stale GitHub clone — to
/opt/configuration on the manager. The replacement:

  1. Creates a temporary directory on the orchestrator.
  2. Adds an ephemeral detached git worktree at HEAD into that directory
    (git worktree add --detach).
  3. Rsyncs the worktree's contents to /opt/configuration (delete: true,
    --exclude=/.git).
  4. Removes the worktree in an always: block, even on sync failure.

A git worktree at HEAD materializes only tracked files in the state recorded
at that commit — no working-tree modifications, no untracked files, no
.git/ directory, no build artifacts. The source is always clean, so
delete: true is safe and leaves /opt/configuration in an exact, known
state.
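
The four steps can be sketched outside Ansible roughly as follows. This is an illustration, not the task from manager-part-1.yml: the function name, the injectable run parameter, and the tree subdirectory are assumptions, and try/finally stands in for the block/always construct.

```python
import os
import shutil
import subprocess
import tempfile


def sync_committed_tree(basepath, dest, run=subprocess.run):
    """Export HEAD of the checkout at `basepath` and mirror it to `dest`."""
    # Step 1: temporary directory on the orchestrator.
    tmp = tempfile.mkdtemp(prefix="testbed-sync-")
    tree = os.path.join(tmp, "tree")
    try:
        # Step 2: detached worktree at HEAD, so only committed content is exported.
        run(["git", "-C", basepath, "worktree", "add", "--detach", tree, "HEAD"],
            check=True)
        # Step 3: mirror to dest; --delete drops stale files, the worktree's
        # top-level .git entry is excluded.
        run(["rsync", "-a", "--delete", "--exclude=/.git", tree + "/", dest],
            check=True)
    finally:
        # Step 4: cleanup runs even if the sync above raised.
        run(["git", "-C", basepath, "worktree", "remove", "--force", tree],
            check=False)
        shutil.rmtree(tmp, ignore_errors=True)
```

A call such as sync_committed_tree("/work/testbed", "manager:/opt/configuration/") would use a hypothetical host alias; passing a fake run callable lets the command sequence be inspected without touching git or rsync.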

Development iteration goes through git commit (or --amend). The
manager's deployed state corresponds to a specific SHA, which is useful for
correlating manager behaviour with a particular tree version in logs and bug
reports.

4. A fast-fail sanity check is added to all seven playbooks.

A shared task list (playbooks/tasks/_basepath_check.yml) checks that
basepath points to a directory containing ansible/manager-part-1.yml.
Each playbook imports it as a pre_task. A misconfigured basepath now
fails immediately with a clear message instead of silently mis-deploying.
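
The idea behind the check can be shown in a few lines. This is a sketch of the logic only, not the contents of _basepath_check.yml; the function name and error text are made up for illustration.

```python
from pathlib import Path

# File whose presence marks a directory as a testbed checkout.
MARKER = Path("ansible") / "manager-part-1.yml"


def check_basepath(basepath):
    """Fail fast unless `basepath` looks like a testbed checkout."""
    marker = Path(basepath) / MARKER
    if not marker.is_file():
        raise ValueError(
            f"basepath {str(basepath)!r} is not a testbed checkout: {marker} not found"
        )
    return Path(basepath)
```

Calling check_basepath("/tmp") on a machine without /tmp/ansible/manager-part-1.yml raises immediately, which mirrors the basepath=/tmp case in the test plan.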

Scope

This change covers only the osism/testbed self-clone. The other repos in
repositories.yml (ansible-collection-*, terraform-base) still follow
the existing path: setup-testbed.py --prepare clones them, and
deploy.yml installs them via ansible-galaxy. They are out of scope here.

Before / after

| | Before | After |
|---|---|---|
| basepath source | Hardcoded ~/src/github.com/osism/testbed (Makefile override for local) | playbook_dir \| dirname (always the invoking checkout) |
| Testbed content on manager | GitHub main at clone time | Committed state of invoking checkout at deploy time |
| setup-testbed.py --prepare clones testbed | Yes | No |
| .src/github.com/osism/testbed created | Yes | No |
| .git/ in /opt/configuration | No | No |
| Local edits visible without push | No | Yes (after git commit) |
| Zuul behaviour | Unchanged | Unchanged |

Test plan

End-to-end verified on regiocloud (DEPLOY_MODE=manager, VERSION_MANAGER=latest):

  • Sanity check fires for all 7 playbooks when invoked with basepath=/tmp
  • setup-testbed.py --prepare no longer creates .src/github.com/osism/testbed
  • Run manager part 1 + 2 completed without error
  • A test commit visible in /opt/configuration/environments/manager/configuration.yml on manager (confirms worktree sync pushed committed content)
  • No .git/ directory on manager (test ! -d /opt/configuration/.git && echo OK)
  • No clone.*osism/testbed references in deploy log

🤖 Generated with Claude Code

@ideaship
Contributor Author

ansible-lint check failed due to a transient registry pull failure (registry.osism.tech/osism/ansible-lint:latest could not be fetched after 3 retries), not a lint violation. Re-checking.

@ideaship
Contributor Author

recheck

@ideaship ideaship marked this pull request as draft May 25, 2026 06:38
@ideaship ideaship added the zuul Release the dragons, run Zuul CI label May 25, 2026
@ideaship ideaship force-pushed the sp1-kill-self-clone branch from 3c89c87 to a65ca4a Compare May 25, 2026 10:16
@ideaship ideaship changed the title Kill GitHub-pull-of-self (SP-1) Stop cloning osism/testbed during deployment May 25, 2026
ideaship added 4 commits May 27, 2026 08:38
All seven playbooks derived basepath from
{{ ansible_user_dir }}/src/{{ repositories['testbed']['path'] }}, a
hardcoded path that only resolved correctly if the checkout lived at
~/src/github.com/osism/testbed. The Makefile worked around this with
-e basepath="$(PWD)", but $(PWD) resolves to the caller's directory,
not the testbed checkout, so make -C /path/to/testbed deploy from a
different CWD produced the wrong path.

Replace with {{ playbook_dir | dirname }}. playbook_dir is Ansible's
built-in for the directory of the executing playbook; dirname strips
one level to give the testbed checkout root. This resolves correctly
for local development, Zuul, and worktrees at arbitrary locations
without any convention or Makefile override.

Add playbooks/tasks/_basepath_check.yml, a shared task list that
verifies basepath contains ansible/manager-part-1.yml. Each playbook
imports it as a pre_task so a misconfigured basepath fails immediately
with a clear message instead of silently mis-deploying.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
contrib/setup-testbed.py --prepare iterates every entry in
repositories.yml and clones each into <testbed>/.src/{path}. For the
testbed entry this meant re-cloning osism/testbed from GitHub into
<checkout>/.src/github.com/osism/testbed on every bootstrap run.
manager-part-1.yml then pushed that GitHub clone — not the working
checkout — to /opt/configuration on the manager, making local edits
invisible to the deploy.

With basepath now derived from playbook_dir | dirname, nothing reads
repositories['testbed']. Removing the entry stops the redundant clone.

The ansible-collection-* and terraform-base entries stay; they still
feed setup-testbed.py --prepare and the orchestrator-side
ansible-galaxy install in deploy.yml.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
manager-part-1.yml needs the orchestrator-side testbed checkout path
to use as the source for syncing to /opt/configuration. Pass basepath
via -e basepath={{ basepath | quote }} alongside the existing
-e repo_path=... argument.

The | quote filter wraps the value in single quotes and escapes embedded
quotes so that checkout paths containing spaces survive the shell
expansion in ansible.builtin.shell's cmd:.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
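
For illustration, Ansible's quote filter is backed by Python's shlex.quote, so its effect on a path with a space can be reproduced directly. The command string and the checkout path below are hypothetical, not the actual invocation in deploy.yml.

```python
import shlex

# Hypothetical checkout path containing a space.
basepath = "/home/dev/my checkouts/testbed"

# Quote the value before embedding it in a shell command line.
cmd = f"ansible-playbook manager-part-1.yml -e basepath={shlex.quote(basepath)}"
print(cmd)

# A POSIX shell splits the line back into the intended arguments;
# shlex.split applies the same rules.
print(shlex.split(cmd)[-1])
```

Without the quoting, the shell would split the value at the space and ansible-playbook would receive a truncated basepath plus a stray argument.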
The previous synchronize task pushed the contents of
{{ repo_path }}/{{ repo }}/ — the orchestrator's
<testbed>/.src/github.com/osism/testbed GitHub clone — to
/opt/configuration on the manager. Local edits to the testbed checkout
were therefore never deployed; only the clone's content reached the
manager.

Replace it with a block/always sequence that:

1. Creates a temp dir on the orchestrator.
2. Adds an ephemeral detached git worktree at HEAD into that dir.
3. Runs ansible.posix.synchronize with delete: true and --exclude=/.git
   to push the worktree contents to /opt/configuration on the manager.
4. Removes the worktree in always:, even on sync failure.

A git worktree at HEAD materializes only tracked files in the state
recorded at that commit. .git/, venv/, .tox/, .src/, .terraform/, and
other untracked paths never reach /opt/configuration. Stale files are
removed cleanly via rsync --delete since the source tree has no junk.

Dev iteration goes through git commit (or --amend). Working-tree edits
that aren't committed do not deploy. The manager's state corresponds to
a specific SHA, useful for correlating manager behaviour with a
particular tree version in logs and bug reports.

Also remove the two tasks (Get home directory of ansible user, Set
repo_path fact) that existed only to compute repo_path for the
synchronize task replaced above. Nothing in manager-part-1.yml now
consumes repo_path: osism.commons.repository runs before the fact was
set and is unrelated to git clones; the supplementary-repo sync lives
in manager-part-0.yml. The -e repo_path=... argument that deploy.yml
passes to manager-part-1 is now vestigial but harmless.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship ideaship force-pushed the sp1-kill-self-clone branch from a65ca4a to 072baea Compare May 27, 2026 06:38
@ideaship ideaship added zuul Release the dragons, run Zuul CI and removed zuul Release the dragons, run Zuul CI labels May 28, 2026
@ideaship ideaship moved this from Ready to In progress in Human Board May 28, 2026
@ideaship ideaship added zuul Release the dragons, run Zuul CI and removed zuul Release the dragons, run Zuul CI labels May 28, 2026
ideaship added 4 commits May 28, 2026 16:50
The bootstrap target passed -e basepath="$(PWD)" and a derived
-e repo_path="$(PWD)/.src/github.com" to ansible-playbook. With
basepath now derived from playbook_dir | dirname in the playbooks,
these overrides defeat the new default and reintroduce hardcoded paths.

$(PWD) was also subtly wrong: GNU make resolves it to the caller's
directory, not the testbed checkout, so make -C /path/to/testbed deploy
from a different CWD set basepath to the wrong path.

Remove both overrides; playbook_dir | dirname is now the single source
of truth for basepath.

Removing the -e repo_path= override exposed the deploy.yml default,
which must account for two different supplementary-repo layouts:

- Local: setup-testbed.py --prepare clones ansible-collection-* and
  terraform-base into <checkout>/.src/github.com.
- Zuul: those repos arrive via required-projects under
  {{ ansible_user_dir }}/src/github.com; .src/ is never created.

A single static path cannot serve both, so repo_path selects by
context: {{ ansible_user_dir }}/src/github.com under Zuul (zuul is
defined), else {{ basepath }}/.src/github.com locally.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
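
The selection described in this commit message amounts to a two-way branch. A sketch in Python, where the function name and the example paths are assumptions and zuul=None stands for "zuul is not defined":

```python
def select_repo_path(ansible_user_dir, basepath, zuul=None):
    """Pick the directory that holds the supplementary repos."""
    if zuul is not None:
        # Zuul: repos arrive via required-projects under the user's src dir.
        return f"{ansible_user_dir}/src/github.com"
    # Local: setup-testbed.py --prepare cloned them below the checkout.
    return f"{basepath}/.src/github.com"


print(select_repo_path("/home/dev", "/work/testbed"))
print(select_repo_path("/home/zuul", "/home/zuul/src/github.com/osism/testbed", zuul={}))
```

Note that the branch tests whether the variable is defined at all, not whether it is truthy, which is why an empty mapping still selects the Zuul layout here.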
The custom osism-fqcn ansible-lint rule checks that every task module
appears in an allowlist of approved FQCNs. The rule's skip guard only
applies when a task has no action at all; block constructs have their
action set to "block/always/rescue" internally, which is not present
in the allowlist (the list contains "block", "always", and "rescue"
as separate entries).

The result is that any bare block: task triggers a spurious osism-fqcn
violation. The established codebase pattern, already applied in
ansible/manager-part-0.yml and ansible/manager-part-3.yml, is to
annotate the affected task name with # noqa: osism-fqcn.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
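
A toy model of the failure mode described in this commit message may help. It is not the osism-fqcn rule's code: the allowlist contents and the function name are invented, and only the comparison logic follows the description.

```python
# Invented allowlist: a few FQCNs plus the three separate keyword entries.
ALLOWLIST = {
    "ansible.builtin.command",
    "ansible.posix.synchronize",
    "block",
    "always",
    "rescue",
}


def is_violation(action):
    """Return True if the task's action would be flagged by the toy rule."""
    if not action:
        # Skip guard: only tasks without any action are exempt.
        return False
    return action not in ALLOWLIST


print(is_violation("ansible.builtin.command"))  # False
print(is_violation("block/always/rescue"))      # True, the spurious hit
```

Because the combined string never equals any single allowlist entry, every block: task is flagged unless it carries the noqa annotation.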
Replace the executor-side basepath check (_basepath_check.yml) with a
resolver task (_basepath.yml) that derives basepath where it is
consumed: on the orchestrator node.

The old approach assumed basepath was already set and validated it with
delegate_to/run_once, which pinned the check to the executor. In Zuul
CI the executor and the orchestrator node are separate hosts, so an
executor-side path does not exist on the node and CI deployments fail.
Locally they happen to be the same host, which masked the problem.

The new task sets basepath via set_fact (no delegation):
- In Zuul: ansible_user_dir / zuul.project.src_dir
- Locally: git rev-parse --show-toplevel relative to playbook_dir

A stat check on ansible/manager-part-1.yml guards against a basepath
that does not point to a testbed checkout.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
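
The two resolution branches can be sketched as follows. This mirrors the description above rather than the actual contents of _basepath.yml; the function name, the injectable run parameter, and the example values are assumptions.

```python
import posixpath
import subprocess


def resolve_basepath(ansible_user_dir, playbook_dir, zuul=None, run=subprocess.run):
    """Resolve the checkout root on the node that consumes it."""
    if zuul is not None:
        # Zuul: the project checkout lives below the node user's home.
        return posixpath.join(ansible_user_dir, zuul["project"]["src_dir"])
    # Local: ask git for the top level of the checkout containing the playbooks.
    result = run(
        ["git", "-C", playbook_dir, "rev-parse", "--show-toplevel"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


zuul_vars = {"project": {"src_dir": "src/github.com/osism/testbed"}}
print(resolve_basepath("/home/zuul", "/ignored", zuul=zuul_vars))
```

The local branch has to run on the machine that holds the checkout, which is the point of dropping the delegation in the shared task.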
Each deploy playbook previously defined basepath as a play var:

  basepath: "{{ playbook_dir | dirname }}"

This is an executor-side value, evaluated by Ansible on the machine
running the playbook. Under Zuul the executor is not the same host as
the orchestrator node where basepath is consumed, so the path was
wrong.

The previous commit introduced playbooks/tasks/_basepath.yml, which resolves
basepath node-side via set_fact (using zuul.project.src_dir in Zuul
environments, git rev-parse --show-toplevel locally) and then
verifies that the resulting path looks like a testbed checkout.

This commit wires the shared resolver into all 7 deploy playbooks:

- Remove the per-playbook `basepath: "{{ playbook_dir | dirname }}"`
  play var so the resolver's set_fact becomes the sole source of
  truth. Vars that reference {{ basepath }} (terraform_path,
  ansible_path, repo_path, etc.) are templated lazily at task time,
  after the set_fact in pre_tasks has run, so they continue to work.

- Replace the pre_tasks import of tasks/_basepath_check.yml with
  tasks/_basepath.yml in each playbook.

The -e basepath={{ basepath | quote }} argument passed to
manager-part-1 in deploy.yml is unaffected; basepath is now resolved
correctly before that task runs.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship ideaship force-pushed the sp1-kill-self-clone branch from debe579 to 230bb63 Compare May 28, 2026 14:50
@ideaship ideaship added zuul Release the dragons, run Zuul CI and removed zuul Release the dragons, run Zuul CI labels May 28, 2026
@berendt
Member

berendt commented May 29, 2026

/lgtm


Labels

zuul Release the dragons, run Zuul CI

Projects

Status: In progress


3 participants