docs: refresh infra docs for post-Hetzner architecture #57

Open
abtreece wants to merge 1 commit into fullstaq-ruby:main from abtreece:docs/post-hetzner-refresh

@abtreece
Collaborator

@abtreece commented Apr 29, 2026

Closes #55.

Summary

Refreshes docs/ to describe the current infrastructure (single Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus, provisioned by Ansible) instead of the pre-July-2024 architecture (GKE Autopilot + Nginx Ingress + Cloud Run apiserver). The infrastructure overview diagram is also refreshed: the stale infrastructure-overview.drawio.svg is replaced by a Mermaid block embedded in infrastructure-overview.md so future diagram changes are reviewable as text diffs.

Files changed

  • docs/infrastructure-overview.md — rewritten section by section against current IaC.

    • The two pre-existing GCP-service-account sections are folded into a single CI/CD authentication section, split per-caller:
      • fullstaq-ruby/server-edition → GCP via Workload Identity Federation (APT/YUM repo buckets + GCS CI artifacts bucket).
      • fullstaq-ruby/server-edition → Azure via Federated Identity Credentials (Azure Blob CI artifacts + CI cache containers + Key Vault GPG key).
      • fullstaq-ruby/infra → API server only, via a GitHub-issued OIDC JWT (audience backend.fullstaqruby.org) sent to POST /admin/upgrade_apiserver. The infra workflow does not authenticate to GCP or Azure APIs.
    • Caddy section: there is no backend.fullstaqruby.org vhost; both apt. and yum. vhosts handle /admin/* via reverse_proxy to the apiserver Unix socket (per ansible/files/Caddyfile). CI calls /admin/* via https://apt.fullstaqruby.org.
    • Google Cloud project section: corrected to a single project (fsruby-server-edition2, display name "Fullstaq Ruby Server Edition"), provisioned by terraform-hisec/gcloud_project.tf and populated by terraform/. The hisec/non-hisec boundary lives at the Terraform-state and access-group layer, not at a GCP project boundary.
    • API server section: Sinatra/Puma on a Unix socket under systemd; sibling apiserver-deployer.service performs self-update from a tarball attached to a GitHub Release.
    • VM (Hetzner) section: Terraform-managed forward DNS (backend.fullstaqruby.org, apt.fullstaqruby.org, yum.fullstaqruby.org) is distinguished from the manually-set Hetzner PTR record.
    • CI artifacts / cache sections: artifacts are dual-cloud (public GCS + private Azure container); cache is Azure-only.
    • Container registry section: dropped (no registry resources are managed in this repo).
    • GPG private key section: Key Vault name uses the templated form ${var.key_vault_prefix}infraowners (currently fsruby2infraowners).
  • docs/infrastructure-overview.drawio.svg: deleted. Replaced by the Mermaid block in infrastructure-overview.md.

  • docs/editing-diagrams.md: deleted. Mermaid is edited inline; no diagrams.net round-trip is needed.

  • docs/deploy.md — replaces the gcloud container clusters get-credentials + kubectl apply -k ../kubernetes steps with a single ansible-playbook step matching Step 11 of the bootstrapping guide. Adds a callout that apiserver code changes deploy via the GitHub Actions workflow.

  • docs/infrastructure-as-code.md — drops Kustomize and the kubernetes/ directory bullet; adds Ansible to the tools list and an ansible/ directory bullet.

  • docs/infrastructure-bootstrapping.md — intro updated to mention Terraform + Ansible (not Kubernetes/Kustomize). The rest of the file already reflected the post-migration setup.

  • docs/pull_request_template.md — diagram-update checkbox now points to the Mermaid block.

  • README.md — drops the link to the deleted editing-diagrams.md.

  • .editorconfig — removes the duplicate [config.ru] block (the tab/4 one); only the correct space/2 rule remains.
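
To make the CI-to-apiserver call described above concrete, here is a minimal Python sketch of the request shape. It is an illustration under assumptions, not the workflow's actual code: the URL is assembled from the host and path named in this description, the function name `build_upgrade_request` is hypothetical, the empty request body is a guess, and obtaining the GitHub OIDC token is left out entirely.

```python
import urllib.request

# Assumption: CI reaches /admin/* through the apt. vhost, as described above.
UPGRADE_URL = "https://apt.fullstaqruby.org/admin/upgrade_apiserver"


def build_upgrade_request(oidc_jwt: str) -> urllib.request.Request:
    """Build (but do not send) the POST that asks the API server to self-update.

    `oidc_jwt` is assumed to be a GitHub-issued OIDC token requested with
    audience "backend.fullstaqruby.org"; fetching it is out of scope here.
    """
    return urllib.request.Request(
        UPGRADE_URL,
        data=b"",  # hypothetical empty body; the real payload is not described here
        method="POST",
        headers={"Authorization": f"Bearer {oidc_jwt}"},
    )


# Placeholder token value, only to show how the request is assembled.
req = build_upgrade_request("example.jwt.value")
print(req.get_method(), req.full_url)
```

The request is only constructed, never sent, so the sketch can be run without touching the live host.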

Verification

  • eclint check $(git ls-files) passes.
  • grep -rin 'kubernetes\|kustomize\|kubectl\|gke\|nginx\|cloud run' docs/ returns only intentional historical mentions (e.g. "the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture").
  • Each rewritten claim is traceable to current IaC: terraform/{dns,gcloud_auth,backend,repo_buckets,ci_storage}.tf, terraform-hisec/{gcloud_project,key_vault,backend}.tf, ansible/main.yml, ansible/files/{Caddyfile,apiserver.service}, .github/workflows/apiserver.yml.
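
For anyone who prefers to script the stale-term check rather than run grep by hand, the following Python sketch mirrors the same pattern. It is only an approximation of the grep command above: the helper name `find_stale_mentions` and the sample text are made up for illustration, and it scans a string rather than walking docs/.

```python
import re

# Same alternation as the grep pattern above; IGNORECASE mirrors `grep -i`.
STALE_TERMS = re.compile(
    r"kubernetes|kustomize|kubectl|gke|nginx|cloud run", re.IGNORECASE
)


def find_stale_mentions(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that mention a pre-migration term."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if STALE_TERMS.search(line)
    ]


# Illustrative input only; real use would read each file under docs/.
sample = (
    "The VM is configured entirely by Ansible.\n"
    "The previous GKE Autopilot setup was replaced by this VM.\n"
)
hits = find_stale_mentions(sample)
print(hits)
```

Each hit still needs a human decision on whether the mention is an intentional historical reference.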

Note: PAT-based CI bot

The "Github CI bot account" section describes a PAT-based bot. Retiring/converting that account is already tracked in #18 ("Change fullstaq-ruby-ci-bot account into a Github app") and is therefore intentionally not in scope here. The text remains as-is so the doc reflects the current state until #18 lands.

@abtreece force-pushed the docs/post-hetzner-refresh branch from 66e915d to cfbe8a1 on April 29, 2026 02:56
@abtreece requested a review from FooBarWidget on May 15, 2026 03:30
Closes fullstaq-ruby#55.

Brings docs/ in line with the post-July-2024 architecture (single
Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus,
provisioned by Ansible), replacing references to the previous
GKE Autopilot + Nginx Ingress + Cloud Run apiserver setup.

Files changed:

- docs/infrastructure-overview.md — rewritten section by section.
  Every claim is grounded in current IaC. The two GCP-service-account
  sections are folded into a single "CI/CD authentication" section
  that splits per-caller: server-edition uses GCP WIF (APT/YUM repo
  buckets + GCS CI artifacts bucket) and Azure Federated Identity
  Credentials (Azure Blob CI artifacts + CI cache + Key Vault GPG
  key); infra repo's apiserver workflow only mints a GitHub OIDC JWT
  (audience backend.fullstaqruby.org) and POSTs to
  /admin/upgrade_apiserver — it does not authenticate to GCP or
  Azure APIs. The Caddy section is corrected: there is no
  backend.fullstaqruby.org vhost; both apt. and yum. vhosts handle
  /admin/* via reverse_proxy to the apiserver Unix socket. The
  "Google Cloud projects" claim of two projects is corrected — there
  is one project, fsruby-server-edition2, provisioned by
  terraform-hisec/gcloud_project.tf and populated by terraform/; the
  hisec/non-hisec separation lives at the Terraform-state and
  access-group layer. Container registry section dropped (no
  registry resources are managed in this repo). Key Vault name
  uses the templated form ${var.key_vault_prefix}infraowners
  (currently fsruby2infraowners). CI artifacts/cache split is now
  explicit (artifacts dual-cloud, cache Azure-only). VM section
  distinguishes Terraform-managed forward DNS from the
  manually-set Hetzner PTR record.

- docs/infrastructure-overview.drawio.svg — deleted. Replaced by an
  inline Mermaid diagram in infrastructure-overview.md so future
  diagram changes are reviewable as text diffs.

- docs/editing-diagrams.md — deleted (no longer needed without the
  drawio round-trip).

- docs/deploy.md — replaces the gcloud-clusters/kubectl steps with
  a single ansible-playbook step matching bootstrapping Step 11.
  Adds a callout that apiserver code changes deploy via the
  GitHub Actions workflow.

- docs/infrastructure-as-code.md — drops Kustomize and the
  kubernetes/ directory bullet; adds Ansible to the tool list and
  an ansible/ directory bullet.

- docs/infrastructure-bootstrapping.md — intro updated to mention
  Terraform + Ansible (not Kubernetes/Kustomize); the rest of the
  file already reflected the post-migration setup.

- docs/pull_request_template.md — diagram-update checkbox now
  points to the Mermaid block instead of the deleted drawio file.

- README.md — drops the link to the deleted editing-diagrams.md.

- .editorconfig — removes the duplicate [config.ru] block (the
  tab/4 one); only the correct space/2 rule remains.

Note: the "Github CI bot account" section is kept as-is. Retiring
that PAT-based bot is already tracked in fullstaq-ruby#18 and is therefore out
of scope here.
@abtreece force-pushed the docs/post-hetzner-refresh branch from cfbe8a1 to 85b83bf on May 15, 2026 03:48
@FooBarWidget
Member

I'll have a good look. So far my first impression is that the new diagram lacks a lot of detail that was in the older diagram. I'm also not sure whether a detailed but automatically rendered diagram is still readable compared to a manually drawn one.

# Infrastructure bootstrapping

We try to codify infrastructure as much as possible using Terraform and Kubernetes YAML. However:
We try to codify infrastructure as much as possible using Terraform and Ansible. However:
Member

There should be an instruction step in this document for deploying the API server.

* `ansible/` — Configuration of the backend VM (Caddy, the API server, Prometheus, and OS hardening). Administered by [Infra Maintainers](roles.md) and applied manually; see [Deployment guide](deploy.md).

* `.github/workflows/apiserver.yml` — Deploys the API server.
* `.github/workflows/apiserver.yml` — Builds and deploys the API server.
Member

Nowadays it's .github/workflows/ (multiple workflows that together do the build and deployment).

Comment thread docs/deploy.md
~~~bash
kubectl apply --context=gke_fullstaq-ruby_us-east4_fullstaq-ruby-autopilot -k ../kubernetes
~~~
> The API server itself is not deployed by this playbook. Code changes under `apiserver/` are released by the `.github/workflows/apiserver.yml` workflow, which packages a tarball, attaches it to a GitHub Release, and triggers `POST /admin/upgrade_apiserver` on the live host.
Member

Nowadays it's the entire .github/workflows/ folder (multiple workflows that together do the build and deployment)


All Google Cloud resources live in a single project, `fsruby-server-edition2` (display name "Fullstaq Ruby Server Edition"). The `google_project` resource itself is provisioned in `terraform-hisec/gcloud_project.tf` so that creating/deleting the project requires Infra Owner access, but resources _inside_ the project (buckets, IAM, Workload Identity Federation) are managed in `terraform/` by Infra Maintainers.

The hisec / non-hisec separation is enforced at the **Terraform state and access-group layer**, not via separate GCP projects. See [Terraform state (normal)](#terraform-state-normal) and [Terraform state (hisec)](#terraform-state-hisec).
Member

The section is good but I don't really understand this latter sentence. Maybe it makes sense for a reader who was familiar with the previous situation (two GCP projects) but the docs should be optimized for readers who will only be familiar with the current situation going forward, with no regard to past state.

- Can deploy new versions of the API server.
- **`fullstaq-ruby/server-edition` → Google Cloud** uses [Workload Identity Federation](https://cloud.google.com/iam/docs/workload-identity-federation) (defined in `terraform/gcloud_auth.tf`). Two pools (`github-ci-test`, `github-ci-deploy`) gate access by GitHub repository owner and Actions environment. Through these pools, server-edition CI jobs gain write access to the APT/YUM repo buckets and the GCP CI artifacts bucket — see `terraform/repo_buckets.tf` and `terraform/ci_storage.tf`. The CI cache lives in Azure (see below), not on GCP.
- **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key).
- **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish.
Member

Accurate section. But note apiserver.yml has changed.

- Administered by role: Infra Maintainers

The Kubernetes cluster runs our Nginx web server. This cluster is in Autopilot mode.
A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform.
Member

Calling it "every backend service" while including things like Prometheus, fail2ban, etc. is overstating it.

A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform.

## DNS, static IPs, Ingresses
The VM is configured entirely by Ansible (`ansible/main.yml`). The playbook covers OS hardening (SSH, fail2ban, AppArmor, ufw, autoreboot, unattended-upgrades) and the service stack (Prometheus, Caddy, apiserver-deployer, apiserver). There is no Kubernetes — the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture.
Member

We should not mention the specifics of OS hardening (SSH, fail2ban, etc.) because that makes the text too easy to drift from the playbook. The specifics are also not that relevant in this document. As for the "service stack", mentioning just Caddy and the API server is enough. Splitting the API server into 'apiserver' vs 'apiserver-deployer' is too fine-grained for this document. The text should also not mention Kubernetes.

The Server Edition's CI/CD system stores artifacts in this bucket, for the purpose of implementing [resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md). Objects in this bucket only live for 30 days.
The Server Edition's CI/CD system stores artifacts for [CI/CD resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md) in two buckets (see `terraform/ci_storage.tf`):

- A **GCS bucket** (`${var.gcloud_bucket_prefix}-server-edition-ci-artifacts`) — publicly readable; the `test` environment writes via WIF, the `deploy` environment reads. Objects expire after 30 days.
Member

We should just mention the interpolated name directly rather than putting the interpolation variable in this document.

WIF is not a commonly understood abbreviation so it should be spelled out.

The expiration policy should not be mentioned in detail to avoid drift from Terraform. Just saying that objects do expire is enough. Important: on the Azure Blob container we expire based on access time, not modification time.

If I recall correctly, the Azure Blob container here is not used... yet. We're still writing artifacts to the GCS bucket. The idea was to one day migrate that away to Azure, but no work has been done on that front so far. The network bandwidth and latency from GitHub-hosted runners to Azure are expected to be better than to GCS, but it is unclear whether the tooling is fast enough. The Azure CLI's startup time is quite long.

- Administered by role: Infra Owners, Infra Maintainers

The GPG private key is used to sign APT and YUM repositories. We store the canonical copy in Secrets Manager in the `fullstaq-ruby-hisec` Google Cloud project. We store a secondary copy in the Secret Manager in the `fullstaq-ruby` Google Cloud project.
The GPG private key is used to sign APT and YUM repositories. It is stored in the Azure Key Vault for Infra Owners — `${var.key_vault_prefix}infraowners`, currently `fsruby2infraowners` (see `terraform-hisec/key_vault.tf`). GitHub Actions in the `test` and `deploy` environments are granted read access via Entra ID Federated Identity Credentials.
Member

For consistency with earlier text we should call this Github OIDC.

## Google Cloud service account for Infrastructure CI/CD
The following diagram shows the major infrastructure components and how they relate to each other. The role that administers each component is given in the section heading below.

```mermaid
Member

I rendered the Mermaid diagram but I don't think it's easier to read (nor better-looking) than a hand-drawn one, so I prefer an update based on the hand-drawn diagram.

- **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key).
- **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish.

## API server
Member

apiserver/README.md also needs to be updated. It still mentions Google Cloud but it's no longer hosted there.

Development

Successfully merging this pull request may close these issues.

Update documentation to reflect current architecture

2 participants