docs: refresh infra docs for post-Hetzner architecture #57
Closes fullstaq-ruby#55. Brings docs/ in line with the post-July-2024 architecture (single Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus, provisioned by Ansible), replacing references to the previous GKE Autopilot + Nginx Ingress + Cloud Run apiserver setup.

Files changed:

- docs/infrastructure-overview.md: rewritten section by section. Every claim is grounded in current IaC.
  - The two GCP-service-account sections are folded into a single "CI/CD authentication" section that splits per caller: server-edition uses GCP WIF (APT/YUM repo buckets + GCS CI artifacts bucket) and Azure Federated Identity Credentials (Azure Blob CI artifacts + CI cache + Key Vault GPG key); the infra repo's apiserver workflow only mints a GitHub OIDC JWT (audience backend.fullstaqruby.org) and POSTs to /admin/upgrade_apiserver. It does not authenticate to GCP or Azure APIs.
  - The Caddy section is corrected: there is no backend.fullstaqruby.org vhost; both apt. and yum. vhosts handle /admin/* via reverse_proxy to the apiserver Unix socket.
  - The "Google Cloud projects" claim of two projects is corrected: there is one project, fsruby-server-edition2, provisioned by terraform-hisec/gcloud_project.tf and populated by terraform/; the hisec/non-hisec separation lives at the Terraform-state and access-group layer.
  - Container registry section dropped (no registry resources are managed in this repo).
  - Key Vault name uses the templated form ${var.key_vault_prefix}infraowners (currently fsruby2infraowners).
  - CI artifacts/cache split is now explicit (artifacts dual-cloud, cache Azure-only).
  - VM section distinguishes Terraform-managed forward DNS from the manually-set Hetzner PTR record.
- docs/infrastructure-overview.drawio.svg: deleted. Replaced by an inline Mermaid diagram in infrastructure-overview.md so future diagram changes are reviewable as text diffs.
- docs/editing-diagrams.md: deleted (no longer needed without the drawio round-trip).
- docs/deploy.md: replaces the gcloud-clusters/kubectl steps with a single ansible-playbook step matching bootstrapping Step 11. Adds a callout that apiserver code changes deploy via the GitHub Actions workflow.
- docs/infrastructure-as-code.md: drops Kustomize and the kubernetes/ directory bullet; adds Ansible to the tool list and an ansible/ directory bullet.
- docs/infrastructure-bootstrapping.md: intro updated to mention Terraform + Ansible (not Kubernetes/Kustomize); the rest of the file already reflected the post-migration setup.
- docs/pull_request_template.md: diagram-update checkbox now points to the Mermaid block instead of the deleted drawio file.
- README.md: drops the link to the deleted editing-diagrams.md.
- .editorconfig: removes the duplicate [config.ru] block (the tab/4 one); only the correct space/2 rule remains.

Note: the "Github CI bot account" section is kept as-is. Retiring that PAT-based bot is already tracked in fullstaq-ruby#18 and is therefore out of scope here.
I'll have a good look. So far my first impression is that the new diagram lacks a lot of detail that was in the older diagram. I'm also not sure whether a detailed but automatically rendered diagram is still readable compared to a manually drawn one.
| # Infrastructure bootstrapping | ||
| We try to codify infrastructure as much as possible using Terraform and Kubernetes YAML. However: | ||
| We try to codify infrastructure as much as possible using Terraform and Ansible. However: |
There should be an instruction step in this document for deploying the API server.
| * `ansible/` — Configuration of the backend VM (Caddy, the API server, Prometheus, and OS hardening). Administered by [Infra Maintainers](roles.md) and applied manually; see [Deployment guide](deploy.md). | ||
| * `.github/workflows/apiserver.yml` — Deploys the API server. | ||
| * `.github/workflows/apiserver.yml` — Builds and deploys the API server. |
Nowadays it's .github/workflows/ (multiple workflows that together do the build and deployment).
| ~~~bash | ||
| kubectl apply --context=gke_fullstaq-ruby_us-east4_fullstaq-ruby-autopilot -k ../kubernetes | ||
| ~~~ | ||
| > The API server itself is not deployed by this playbook. Code changes under `apiserver/` are released by the `.github/workflows/apiserver.yml` workflow, which packages a tarball, attaches it to a GitHub Release, and triggers `POST /admin/upgrade_apiserver` on the live host. |
Nowadays it's the entire .github/workflows/ folder (multiple workflows that together do the build and deployment)
| All Google Cloud resources live in a single project, `fsruby-server-edition2` (display name "Fullstaq Ruby Server Edition"). The `google_project` resource itself is provisioned in `terraform-hisec/gcloud_project.tf` so that creating/deleting the project requires Infra Owner access, but resources _inside_ the project (buckets, IAM, Workload Identity Federation) are managed in `terraform/` by Infra Maintainers. | ||
| The hisec / non-hisec separation is enforced at the **Terraform state and access-group layer**, not via separate GCP projects. See [Terraform state (normal)](#terraform-state-normal) and [Terraform state (hisec)](#terraform-state-hisec). |
The section is good but I don't really understand this latter sentence. Maybe it makes sense for a reader who was familiar with the previous situation (two GCP projects) but the docs should be optimized for readers who will only be familiar with the current situation going forward, with no regard to past state.
| - Can deploy new versions of the API server. | ||
| - **`fullstaq-ruby/server-edition` → Google Cloud** uses [Workload Identity Federation](https://cloud.google.com/iam/docs/workload-identity-federation) (defined in `terraform/gcloud_auth.tf`). Two pools (`github-ci-test`, `github-ci-deploy`) gate access by GitHub repository owner and Actions environment. Through these pools, server-edition CI jobs gain write access to the APT/YUM repo buckets and the GCP CI artifacts bucket — see `terraform/repo_buckets.tf` and `terraform/ci_storage.tf`. The CI cache lives in Azure (see below), not on GCP. | ||
| - **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key). | ||
| - **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish. |
Accurate section. But note apiserver.yml has changed.
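For readers following the `fullstaq-ruby/infra` → API server bullet in the hunk above, the JWT flow could be sketched roughly as follows. This is an illustrative Python sketch, not the workflow's actual implementation: the function names are hypothetical, the empty POST body is an assumption, and the `https://apt.fullstaqruby.org` host is taken from the PR summary's statement that CI calls `/admin/*` through that vhost. `ACTIONS_ID_TOKEN_REQUEST_URL` and `ACTIONS_ID_TOKEN_REQUEST_TOKEN` are the environment variables GitHub Actions exposes to jobs with `id-token: write` permission.

```python
import json
import os
import urllib.parse
import urllib.request

# Values taken from the PR text; everything else below is illustrative.
AUDIENCE = "backend.fullstaqruby.org"
UPGRADE_URL = "https://apt.fullstaqruby.org/admin/upgrade_apiserver"


def build_token_request(request_url: str, request_token: str,
                        audience: str = AUDIENCE) -> urllib.request.Request:
    """Build the request that asks GitHub for an OIDC JWT with our audience."""
    separator = "&" if "?" in request_url else "?"
    url = f"{request_url}{separator}audience={urllib.parse.quote(audience, safe='')}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {request_token}"}
    )


def fetch_jwt() -> str:
    """Fetch the JWT inside a GitHub Actions job (needs `id-token: write`)."""
    request = build_token_request(
        os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"],
        os.environ["ACTIONS_ID_TOKEN_REQUEST_TOKEN"],
    )
    with urllib.request.urlopen(request) as response:
        # GitHub returns the token in the "value" field of a JSON document.
        return json.load(response)["value"]


def build_upgrade_request(jwt: str, url: str = UPGRADE_URL) -> urllib.request.Request:
    """Build the bearer-authenticated POST (empty body is an assumption)."""
    return urllib.request.Request(
        url, data=b"", method="POST",
        headers={"Authorization": f"Bearer {jwt}"},
    )
```

Inside a workflow job, `urllib.request.urlopen(build_upgrade_request(fetch_jwt()))` would send the call. No GCP or Azure credentials appear anywhere in the sketch, which matches the point the section makes.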
| - Administered by role: Infra Maintainers | ||
| The Kubernetes cluster runs our Nginx web server. This cluster is in Autopilot mode. | ||
| A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform. |
Calling it "every backend service" while including things like Prometheus, fail2ban, etc. is overstating it.
| ## DNS, static IPs, Ingresses | ||
| The VM is configured entirely by Ansible (`ansible/main.yml`). The playbook covers OS hardening (SSH, fail2ban, AppArmor, ufw, autoreboot, unattended-upgrades) and the service stack (Prometheus, Caddy, apiserver-deployer, apiserver). There is no Kubernetes — the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture. |
We should avoid mentioning the specifics of OS hardening (SSH, fail2ban, etc.) because that makes it too easy for the text to drift from the playbook. The specifics are also not that relevant in this document. As for the "service stack", just mentioning Caddy and the API server is enough. Splitting the API server into 'apiserver' vs 'apiserver-deployer' is too fine-grained for this document. We should also not mention Kubernetes.
| The Server Edition's CI/CD system stores artifacts in this bucket, for the purpose of implementing [resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md). Objects in this bucket only live for 30 days. | ||
| The Server Edition's CI/CD system stores artifacts for [CI/CD resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md) in two buckets (see `terraform/ci_storage.tf`): | ||
| - A **GCS bucket** (`${var.gcloud_bucket_prefix}-server-edition-ci-artifacts`) — publicly readable; the `test` environment writes via WIF, the `deploy` environment reads. Objects expire after 30 days. |
We should just mention the interpolated name directly rather than putting the interpolation variable in this document.
WIF is not a commonly understood abbreviation so it should be spelled out.
The expiration policy should not be mentioned in detail to avoid drift from Terraform. Just saying that objects do expire is enough. Important: on the Azure Blob container we expire based on access time, not modification time.
If I recall correctly, the Azure Blob container here is not used... yet. We're still writing artifacts to the GCS bucket. The idea was to one day migrate that away to Azure, but no work has been done on that front so far. The network bandwidth and latency from GitHub-hosted runners to Azure are expected to be better than to GCS, but it's unclear whether the tooling is fast enough. The Azure CLI's startup time is quite long.
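To illustrate why the access-time versus modification-time distinction matters for the docs, here is a small Python sketch. It is purely illustrative: the `StoredObject` type and `is_expired` function are hypothetical, and the 30-day window is borrowed from the GCS text in the hunk above, not from the Azure configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredObject:
    """Hypothetical stand-in for an artifact in a bucket or blob container."""
    name: str
    modified_at: datetime
    last_accessed_at: datetime


def is_expired(obj: StoredObject, now: datetime, ttl: timedelta, basis: str) -> bool:
    """Return True if the object is older than `ttl`, measured from `basis`."""
    if basis == "modification":
        reference = obj.modified_at
    elif basis == "access":
        reference = obj.last_accessed_at
    else:
        raise ValueError(f"unknown expiry basis: {basis!r}")
    return now - reference > ttl


# An artifact written 40 days ago but read again 5 days ago.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
artifact = StoredObject(
    name="example-artifact",  # hypothetical name
    modified_at=now - timedelta(days=40),
    last_accessed_at=now - timedelta(days=5),
)
ttl = timedelta(days=30)  # assumed window, for illustration only

expired_by_modification = is_expired(artifact, now, ttl, "modification")
expired_by_access = is_expired(artifact, now, ttl, "access")
```

With these numbers the artifact is expired under a modification-based rule but still retained under an access-based one, so a CI run that keeps reading an old artifact keeps it alive.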
| - Administered by role: Infra Owners, Infra Maintainers | ||
| The GPG private key is used to sign APT and YUM repositories. We store the canonical copy in Secrets Manager in the `fullstaq-ruby-hisec` Google Cloud project. We store a secondary copy in the Secret Manager in the `fullstaq-ruby` Google Cloud project. | ||
| The GPG private key is used to sign APT and YUM repositories. It is stored in the Azure Key Vault for Infra Owners — `${var.key_vault_prefix}infraowners`, currently `fsruby2infraowners` (see `terraform-hisec/key_vault.tf`). GitHub Actions in the `test` and `deploy` environments are granted read access via Entra ID Federated Identity Credentials. |
For consistency with the earlier text, we should call this GitHub OIDC.
| ## Google Cloud service account for Infrastructure CI/CD | ||
| The following diagram shows the major infrastructure components and how they relate to each other. The role that administers each component is given in the section heading below. | ||
| ```mermaid |
I rendered the Mermaid diagram but I don't think it's easier to read (nor better-looking) than a hand-drawn one, so I prefer an update based on the hand-drawn diagram.
| ## API server |
apiserver/README.md also needs to be updated. It still mentions Google Cloud but it's no longer hosted there.
Closes #55.
Summary
Refreshes `docs/` to describe the current infrastructure (single Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus, provisioned by Ansible) instead of the pre-July-2024 architecture (GKE Autopilot + Nginx Ingress + Cloud Run apiserver). The infrastructure overview diagram is also refreshed: the stale `infrastructure-overview.drawio.svg` is replaced by a Mermaid block embedded in `infrastructure-overview.md` so future diagram changes are reviewable as text diffs.

Files changed

- `docs/infrastructure-overview.md`: rewritten section by section against current IaC.
  - CI/CD authentication is split per caller:
    - `fullstaq-ruby/server-edition` → GCP via Workload Identity Federation (APT/YUM repo buckets + GCS CI artifacts bucket).
    - `fullstaq-ruby/server-edition` → Azure via Federated Identity Credentials (Azure Blob CI artifacts + CI cache containers + Key Vault GPG key).
    - `fullstaq-ruby/infra` → API server only, via a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent to `POST /admin/upgrade_apiserver`. The infra workflow does not authenticate to GCP or Azure APIs.
  - Caddy: there is no `backend.fullstaqruby.org` vhost; both `apt.` and `yum.` vhosts handle `/admin/*` via reverse_proxy to the apiserver Unix socket (per `ansible/files/Caddyfile`). CI calls `/admin/*` via `https://apt.fullstaqruby.org`.
  - Google Cloud projects: there is one project (`fsruby-server-edition2`, display name "Fullstaq Ruby Server Edition"), provisioned by `terraform-hisec/gcloud_project.tf` and populated by `terraform/`. The hisec/non-hisec boundary lives at the Terraform-state and access-group layer, not at a GCP project boundary.
  - `apiserver-deployer.service` performs self-update from a tarball attached to a GitHub Release.
  - Terraform-managed forward DNS (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) is distinguished from the manually-set Hetzner PTR record.
  - Key Vault name uses the templated form `${var.key_vault_prefix}infraowners` (currently `fsruby2infraowners`).
- `docs/infrastructure-overview.drawio.svg`: deleted. Replaced by the Mermaid block in `infrastructure-overview.md`.
- `docs/editing-diagrams.md`: deleted. Mermaid is edited inline; no diagrams.net round-trip is needed.
- `docs/deploy.md`: replaces the `gcloud container clusters get-credentials` + `kubectl apply -k ../kubernetes` steps with a single `ansible-playbook` step matching Step 11 of the bootstrapping guide. Adds a callout that apiserver code changes deploy via the GitHub Actions workflow.
- `docs/infrastructure-as-code.md`: drops Kustomize and the `kubernetes/` directory bullet; adds Ansible to the tools list and an `ansible/` directory bullet.
- `docs/infrastructure-bootstrapping.md`: intro updated to mention Terraform + Ansible (not Kubernetes/Kustomize). The rest of the file already reflected the post-migration setup.
- `docs/pull_request_template.md`: diagram-update checkbox now points to the Mermaid block.
- `README.md`: drops the link to the deleted `editing-diagrams.md`.
- `.editorconfig`: removes the duplicate `[config.ru]` block (the tab/4 one); only the correct space/2 rule remains.

Verification

- `eclint check $(git ls-files)` passes.
- `grep -rin 'kubernetes\|kustomize\|kubectl\|gke\|nginx\|cloud run' docs/` returns only intentional historical mentions (e.g. "the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture").
- `terraform/{dns,gcloud_auth,backend,repo_buckets,ci_storage}.tf`, `terraform-hisec/{gcloud_project,key_vault,backend}.tf`, `ansible/main.yml`, `ansible/files/{Caddyfile,apiserver.service}`, `.github/workflows/apiserver.yml`.

Note: PAT-based CI bot
The "Github CI bot account" section describes a PAT-based bot. Retiring/converting that account is already tracked in #18 ("Change fullstaq-ruby-ci-bot account into a Github app") and is therefore intentionally not in scope here. The text remains as-is so the doc reflects the current state until #18 lands.
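The grep check listed under Verification could also be expressed as a small script, for example if it were ever turned into a repeatable check. The following is a hedged Python sketch of the same case-insensitive search; the function name `find_stale_mentions` is hypothetical and nothing like it exists in the repository as far as this PR shows.

```python
import re
from pathlib import Path

# Same terms as the grep pattern in the Verification section, case-insensitive.
STALE_TERMS = re.compile(
    r"kubernetes|kustomize|kubectl|gke|nginx|cloud run", re.IGNORECASE
)


def find_stale_mentions(root):
    """Return (relative path, line number, line) for every matching line."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files such as images
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STALE_TERMS.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Calling `find_stale_mentions("docs")` from the repository root would list each hit. As with the grep command, a non-empty result is not automatically a failure, since the PR intentionally keeps some historical mentions; the output still needs a human read.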