From 85b83bf2fd771c10f32bd751172cae78eeea3fa5 Mon Sep 17 00:00:00 2001
From: abtreece
Date: Tue, 28 Apr 2026 21:56:42 -0500
Subject: [PATCH] docs: refresh infra docs for post-Hetzner architecture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes #55.

Brings docs/ in line with the post-July-2024 architecture (single Hetzner VM running Caddy + Sinatra/Puma API server + Prometheus, provisioned by Ansible), replacing references to the previous GKE Autopilot + Nginx Ingress + Cloud Run apiserver setup.

Files changed:

- docs/infrastructure-overview.md — rewritten section by section. Every claim is grounded in current IaC.
  - The two GCP-service-account sections are folded into a single "CI/CD authentication" section that splits per-caller: server-edition uses GCP WIF (APT/YUM repo buckets + GCS CI artifacts bucket) and Azure Federated Identity Credentials (Azure Blob CI artifacts + CI cache + Key Vault GPG key); infra repo's apiserver workflow only mints a GitHub OIDC JWT (audience backend.fullstaqruby.org) and POSTs to /admin/upgrade_apiserver — it does not authenticate to GCP or Azure APIs.
  - The Caddy section is corrected: there is no backend.fullstaqruby.org vhost; both apt. and yum. vhosts handle /admin/* via reverse_proxy to the apiserver Unix socket.
  - The "Google Cloud projects" claim of two projects is corrected — there is one project, fsruby-server-edition2, provisioned by terraform-hisec/gcloud_project.tf and populated by terraform/; the hisec/non-hisec separation lives at the Terraform-state and access-group layer.
  - Container registry section dropped (no registry resources are managed in this repo).
  - Key Vault name uses the templated form ${var.key_vault_prefix}infraowners (currently fsruby2infraowners).
  - CI artifacts/cache split is now explicit (artifacts dual-cloud, cache Azure-only).
  - VM section distinguishes Terraform-managed forward DNS from the manually-set Hetzner PTR record.
- docs/infrastructure-overview.drawio.svg — deleted. Replaced by an inline Mermaid diagram in infrastructure-overview.md so future diagram changes are reviewable as text diffs.
- docs/editing-diagrams.md — deleted (no longer needed without the drawio round-trip).
- docs/deploy.md — replaces the gcloud-clusters/kubectl steps with a single ansible-playbook step matching bootstrapping Step 11. Adds a callout that apiserver code changes deploy via the GitHub Actions workflow.
- docs/infrastructure-as-code.md — drops Kustomize and the kubernetes/ directory bullet; adds Ansible to the tool list and an ansible/ directory bullet.
- docs/infrastructure-bootstrapping.md — intro updated to mention Terraform + Ansible (not Kubernetes/Kustomize); the rest of the file already reflected the post-migration setup.
- docs/pull_request_template.md — diagram-update checkbox now points to the Mermaid block instead of the deleted drawio file.
- README.md — drops the link to the deleted editing-diagrams.md.
- .editorconfig — removes the duplicate [config.ru] block (the tab/4 one); only the correct space/2 rule remains.

Note: the "Github CI bot account" section is kept as-is. Retiring that PAT-based bot is already tracked in #18 and is therefore out of scope here.
---
 .editorconfig                           |    4 -
 README.md                               |    1 -
 docs/deploy.md                          |   18 +-
 docs/editing-diagrams.md                |   12 -
 docs/infrastructure-as-code.md          |   12 +-
 docs/infrastructure-bootstrapping.md    |    2 +-
 docs/infrastructure-overview.drawio.svg | 1846 -----------------------
 docs/infrastructure-overview.md         |  162 +-
 docs/pull_request_template.md           |    3 +-
 9 files changed, 117 insertions(+), 1943 deletions(-)
 delete mode 100644 docs/editing-diagrams.md
 delete mode 100644 docs/infrastructure-overview.drawio.svg

diff --git a/.editorconfig b/.editorconfig
index 7a8a801..068b09a 100644
--- a/.editorconfig
+++ b/.editorconfig
@@ -16,10 +16,6 @@ indent_size = 2
 indent_style = space
 indent_size = 2

-[config.ru]
-indent_style = tab
-indent_size = 4
-
 [*.rb]
 indent_style = space
 indent_size = 2
diff --git a/README.md b/README.md
index 0d8720c..dbfba64 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,6 @@ Concepts:
 Tasks:

  * [Deployment guide](docs/deploy.md)
- * [Editing diagrams](docs/editing-diagrams.md)

 Organizational (for team members):

diff --git a/docs/deploy.md b/docs/deploy.md
index b3bd8ec..4c942ff 100644
--- a/docs/deploy.md
+++ b/docs/deploy.md
@@ -56,20 +56,12 @@ This guide explains how to deploy infrastructure updates. This guide is not for
     cd ..
     ~~~

- 4. Get the credentials for the Kubernetes cluster:
+ 4. Apply Ansible to the backend VM:

     ~~~bash
-    gcloud container clusters get-credentials fullstaq-ruby-autopilot --configuration fullstaq-ruby --region us-east4
+    cd ansible
+    ansible-playbook -i hosts.ini -v main.yml
+    cd ..
     ~~~

- 5. Set the default namespace:
-
-    ~~~bash
-    kubectl config set-context --current --namespace=fullstaq-ruby
-    ~~~
-
- 6. Apply the Kustomization:
-
-    ~~~bash
-    kubectl apply --context=gke_fullstaq-ruby_us-east4_fullstaq-ruby-autopilot -k ../kubernetes
-    ~~~
+> The API server itself is not deployed by this playbook. Code changes under `apiserver/` are released by the `.github/workflows/apiserver.yml` workflow, which packages a tarball, attaches it to a GitHub Release, and triggers `POST /admin/upgrade_apiserver` on the live host.
diff --git a/docs/editing-diagrams.md b/docs/editing-diagrams.md
deleted file mode 100644
index c032294..0000000
--- a/docs/editing-diagrams.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Editing diagrams
-
-All diagrams' sources are stored in `.drawio.svg` files — _SVG files with embedded Diagrams.net content_. They are made using [Diagrams.net](https://diagrams.net) (formerly Draw.io).
-
-The recommended way to edit these files locally is with [Visual Studio Code's Draw.io integration](https://marketplace.visualstudio.com/items?itemName=hediet.vscode-drawio).
-
-If you want to use Diagram.net's web app instead of the Visual Studio Code integration, then be sure to save it by exporting the diagram as SVG, using the following settings:
-
- * Zoom: 110%
- * Border width: 12
- * Check: Transparent Background
- * Uncheck: Include a copy of my diagram
diff --git a/docs/infrastructure-as-code.md b/docs/infrastructure-as-code.md
index 3ccaf76..0f3159c 100644
--- a/docs/infrastructure-as-code.md
+++ b/docs/infrastructure-as-code.md
@@ -3,19 +3,19 @@
 We define as much infrastructure as possible in the form of code, using:

  * [Terraform](https://terraform.io)
- * Kubernetes YAML, managed with [Kustomize](https://kustomize.io/)
+ * [Ansible](https://www.ansible.com/)
  * Github Actions

 The infrastructure-as-code is stored in the following directories:

- * `terraform/` — Infrastructure administered by [Infra Maintainers](roles.md), except for resources inside Kubernetes. Most of the infrastructure is defined here.
+ * `terraform/` — Infrastructure administered by [Infra Maintainers](roles.md). Most of the cloud-side infrastructure is defined here.

- * `terraform-hisec/` — Infrastructure administered by [Infra Owners](roles.md). This covers for example resources in the `fullstaq-ruby-hisec` Google Cloud project.
+ * `terraform-hisec/` — Infrastructure administered by [Infra Owners](roles.md). This covers for example sensitive resources such as the GPG signing key in Azure Key Vault, and the high-security Terraform state backend.

-   Because we don't expect the infrastructure in this directory to change very often, we've chosen — for security reasons — not to run Terraform in a CI/CD pipeline. This way we don't have to worry about the security of the CI/CD pipeline's service account. Instead, an [Infra Owner](roles.md) runs Terraform manually, using that person's personal Google Cloud credentials.
+   Because we don't expect the infrastructure in this directory to change very often, we've chosen — for security reasons — not to run Terraform in a CI/CD pipeline. This way we don't have to worry about the security of any CI/CD pipeline credentials. Instead, an [Infra Owner](roles.md) runs Terraform manually, using their personal cloud credentials.

- * `kubernetes/` — Kubernetes resources administered by [Infra Maintainers](roles.md).
+ * `ansible/` — Configuration of the backend VM (Caddy, the API server, Prometheus, and OS hardening). Administered by [Infra Maintainers](roles.md) and applied manually; see [Deployment guide](deploy.md).

- * `.github/workflows/apiserver.yml` — Deploys the API server.
+ * `.github/workflows/apiserver.yml` — Builds and deploys the API server.

 Note that not all infrastructure can, or (for security reasons) should, be managed via code. Learn more at [Infrastructure bootstrapping](infrastructure-bootstrapping.md).
diff --git a/docs/infrastructure-bootstrapping.md b/docs/infrastructure-bootstrapping.md
index c8c8510..6efceda 100644
--- a/docs/infrastructure-bootstrapping.md
+++ b/docs/infrastructure-bootstrapping.md
@@ -1,6 +1,6 @@
 # Infrastructure bootstrapping

-We try to codify infrastructure as much as possible using Terraform and Kubernetes YAML. However:
+We try to codify infrastructure as much as possible using Terraform and Ansible. However:

 - Not everything _can_ be automated. For example, we need to setup Azure Blob Storage for storing Terraform state, before we can use Terraform.
 - Not everything _should_ be automated. For example, the `fullstaq-ruby-hisec` project contains such sensitive data, that giving access to CI/CD systems would pose a security risk.
diff --git a/docs/infrastructure-overview.drawio.svg b/docs/infrastructure-overview.drawio.svg
deleted file mode 100644
index 6362db3..0000000
--- a/docs/infrastructure-overview.drawio.svg
+++ /dev/null
@@ -1,1846 +0,0 @@
[1,846 deleted lines of draw.io SVG markup omitted. Labels recoverable from the removed diagram: Google Cloud projects `fullstaq-ruby` and `fullstaq-ruby-hisec`; Kubernetes Engine (autopilot) Deployment running the Nginx web server ("gateway") behind a TLS-enabled Ingress and static IPs, with virtual hosts `apt.fullstaqruby.org` and `yum.fullstaqruby.org` that each issue an HTTP redirect to `/versions/$LATEST_VERSION/public/$URI`; Cloud DNS zone `fullstaqruby.org` (A and NS records); `apiserver` on Cloud Run; Cloud Logging; Container Registry; Cloud Storage buckets `fullstaq-ruby-infra-terraform-state`, `fullstaq-ruby-infra-hisec-terraform-state`, `fullstaq-ruby-server-edition-ci-artifacts`, `fullstaq-ruby-server-edition-apt-repo` and `fullstaq-ruby-server-edition-yum-repo`; Secret Manager holding the GPG private key (canonical copy plus a copy); service accounts `infra-ci-bot` and `server-edition-ci-bot`; Github account `fullstaq-labs` with the `fullstaq-ruby-server-edition` and `fullstaq-ruby-infra` repos and their Github Actions; Github account `fullstaq-ruby-ci-bot` (email `fullstaq-ruby-ci-bot@fullstaq.com`, password, personal access token); TransIP (account `fullstaq`) holding the domain name; Github Pages `fullstaq-ruby-website`; Azure tenant Fullstaq B.V. with resource group `fullstaq-ruby-server-edition`, storage account `fsrubyseredci1`, Blob Storage container `server-edition-ci-cache` and a shared access key. Role bubbles: infra owners, infra maintainers.]
diff --git a/docs/infrastructure-overview.md b/docs/infrastructure-overview.md
index f95bab9..938b337 100644
--- a/docs/infrastructure-overview.md
+++ b/docs/infrastructure-overview.md
@@ -1,82 +1,134 @@
 ## Infrastructure overview

-The following diagram shows which infrastructure components exist and how they relate to each other. The bubbles on the corner of a component shows which [role](roles.md) administers that component.
-
-![Infrastructure overview diagram](infrastructure-overview.drawio.svg)
-
-## Google Cloud projects
-
-- Administrated by role: Infra Owners
-
-All Google Cloud resources are contained in two projects:
-
-- `fullstaq-ruby` for normal resources. Infra Maintainers have full access to all resources _inside_ this project.
-- `fullstaq-ruby-hisec` for especially sensitive resources. Only Infra Owners have access to the resources inside this project.
-
-## Google Cloud service account for Server Edition CI/CD
-
-- Administered by role: Infra Maintainers
-
-The `fullstaq-ruby` Google Cloud project has a service account. It's used by the Server Edition's CI/CD system. This service account:
-
-- Has full access to the Server Edition CI artifacts store.
-- Has full access to the container registry, in order to access and store [build environment images](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/build-environments.md).
-- Has full access to the APT and YUM repo buckets, in order to publish new packages.
-
-## Google Cloud service account for Infrastructure CI/CD
+The following diagram shows the major infrastructure components and how they relate to each other. The role that administers each component is given in the section heading below.
+
+```mermaid
+flowchart LR
+    U([End users<br/>apt-get / yum])
+
+    subgraph GH["GitHub Actions"]
+        SE[server-edition CI]
+        INF[infra CI<br/>apiserver workflow]
+    end
+
+    subgraph HZ["Hetzner VM  ·  Ubuntu ≥ 24.04"]
+        Caddy[Caddy<br/>apt./yum. vhosts]
+        API[API server<br/>Sinatra + Puma]
+        DEPL[apiserver-deployer]
+        Prom[Prometheus<br/>+ node_exporter]
+    end
+
+    subgraph GCP["Google Cloud  ·  fsruby-server-edition2"]
+        APT[(APT repo bucket)]
+        YUM[(YUM repo bucket)]
+        CIA[(CI artifacts bucket<br/>public read)]
+        WIF[Workload Identity<br/>Federation pools]
+    end
+
+    subgraph AZ["Azure  ·  Entra ID tenant"]
+        DNS[Azure DNS<br/>fullstaqruby.org<br/>+ delegated apt./yum. zones]
+        KV[Key Vault<br/>GPG key, ACME SP creds]
+        BS[(Blob Storage<br/>Terraform state,<br/>CI artifacts, CI cache)]
+        FIC[Entra ID FIC apps]
+    end
+
+    GP[GitHub Pages<br/>fullstaqruby.org / www.]
+    TR[TransIP<br/>domain registrar]
+
+    U -->|HTTPS| Caddy
+    Caddy -->|redirect| APT
+    Caddy -->|redirect| YUM
+    Caddy -->|/admin/*| API
+    Caddy -. ACME DNS-01 .-> DNS
+    API --> DEPL
+    DEPL -. systemctl restart .-> API
+    DEPL -. systemctl restart .-> Caddy
+
+    SE -->|OIDC| WIF
+    WIF --> APT
+    WIF --> YUM
+    WIF --> CIA
+    SE -->|OIDC JWT| Caddy
+    INF -->|OIDC JWT| Caddy
+    SE -->|OIDC| FIC
+    FIC --> KV
+
+    TR -. NS delegation .-> DNS
+    GP -. A/AAAA .-> DNS
+    HZ -. A/AAAA .-> DNS
+```
+
+## Google Cloud project
+
+- Administered by role: Infra Owners (project resource); Infra Maintainers (resources within)
+
+All Google Cloud resources live in a single project, `fsruby-server-edition2` (display name "Fullstaq Ruby Server Edition"). The `google_project` resource itself is provisioned in `terraform-hisec/gcloud_project.tf` so that creating/deleting the project requires Infra Owner access, but resources _inside_ the project (buckets, IAM, Workload Identity Federation) are managed in `terraform/` by Infra Maintainers.
+
+The hisec / non-hisec separation is enforced at the **Terraform state and access-group layer**, not via separate GCP projects. See [Terraform state (normal)](#terraform-state-normal) and [Terraform state (hisec)](#terraform-state-hisec).
+
+## CI/CD authentication

 - Administered by role: Infra Maintainers

-The `fullstaq-ruby` Google Cloud project has a service account. It's used by the Infrastructure repository's CI/CD system. This service account:
+CI/CD authenticates using short-lived GitHub-issued OIDC tokens — there are no long-lived service-account keys.

-- Has full access to the container registry, in order to publish new API server images.
-- Can deploy new versions of the API server.
+- **`fullstaq-ruby/server-edition` → Google Cloud** uses [Workload Identity Federation](https://cloud.google.com/iam/docs/workload-identity-federation) (defined in `terraform/gcloud_auth.tf`). Two pools (`github-ci-test`, `github-ci-deploy`) gate access by GitHub repository owner and Actions environment. Through these pools, server-edition CI jobs gain write access to the APT/YUM repo buckets and the GCP CI artifacts bucket — see `terraform/repo_buckets.tf` and `terraform/ci_storage.tf`. The CI cache lives in Azure (see below), not on GCP.
+- **`fullstaq-ruby/server-edition` → Azure** uses [Federated Identity Credentials](https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation) on Entra ID applications (defined in `terraform-hisec/`). These authenticate workflows that read or write Azure Blob Storage (the CI artifacts and CI cache containers) and Azure Key Vault (the GPG signing key).
+- **`fullstaq-ruby/infra` → API server** uses a GitHub-issued OIDC JWT (audience `backend.fullstaqruby.org`) sent as a bearer token to `POST /admin/upgrade_apiserver`. The infra repo's `apiserver.yml` workflow does **not** authenticate to GCP or Azure APIs — the rollout mechanism is entirely on the VM (see [API server](#api-server)). The same JWT mechanism is used by `server-edition` to call `/admin/restart_web_server` after a publish.

 ## API server

 - Administered by role: Infra Maintainers

-The API server is a service that allows performing limited management operations on the infrastructure. It mainly exists to securely allow the Server Edition's CI to tell Nginx about the fact that new packages have been deployed. It's hosted on Google Cloud Run.
+The API server is a small Sinatra service that exposes a narrow set of `/admin/*` endpoints used by CI/CD: notably `restart_web_server` (called by Server Edition's CI after a publish so Caddy picks up the new repo version) and `upgrade_apiserver` (called by this repo's CI to roll out a new API server build). All endpoints authenticate the caller via GitHub Actions OIDC JWTs.

-The API server's source code lives in the Infrastructure repository, and is deployed by that repository's CI/CD system.
+The API server runs on the backend VM (see [VM (Hetzner)](#vm-hetzner)) under systemd, with Puma listening on a Unix socket that Caddy proxies to. It is provisioned by Ansible (`ansible/tasks/apiserver.yml`, unit file `ansible/files/apiserver.service`).

-The API server is not defined in Terraform, but is defined in the Infrastructure's CI/CD system in the form of a `gcloud` call.
+Releases are built and published by `.github/workflows/apiserver.yml`: the workflow packages the source as a tarball, attaches it to a GitHub Release tagged `apiserver-N`, and then calls `POST /admin/upgrade_apiserver`. That request is handled by a sibling `apiserver-deployer.service` (provisioned alongside) which fetches the release tarball into `/opt/apiserver/versions/` and restarts the API server. The deployer exists so the API server can replace itself without leaving an unreachable gap.

-## Nginx web server
+## Caddy web server

 - Administered by role: Infra Maintainers

-We run one Nginx instance, which serves these virtual hosts:
+Caddy runs on the backend VM (see [VM (Hetzner)](#vm-hetzner)) and serves two virtual hosts: `apt.fullstaqruby.org` and `yum.fullstaqruby.org` (see `ansible/files/Caddyfile`). Each vhost has the same shape:
+
+- `/admin/*` is reverse-proxied over a local Unix socket to the API server.
+- All other paths redirect to the corresponding GCS repo bucket, with the current published version baked into the redirect target.
+
+There is no separate `backend.fullstaqruby.org` virtual host — the hostname exists as a DNS record (and as the OIDC token audience claim) but is not terminated by Caddy. CI/CD calls the `/admin/*` endpoints via `https://apt.fullstaqruby.org/admin/*`.

-- `apt.fullstaqruby.org` — redirects everything to the APT repo bucket.
-- `yum.fullstaqruby.org` — redirects everything to the YUM repo bucket.
+The redirect target's version is read from each bucket's `latest_version.txt` once at startup (via the `query-latest-repo-versions.rb` `ExecStartPre` in the systemd unit). Caddy must therefore be restarted after a publish so it picks up the new version — that restart is what the API server's `restart_web_server` endpoint exists to trigger.

-Nginx's only purpose is to redirect traffic. Users interact primarily with `{apt,yum}.fullstaqruby.org` instead of with the APT and YUM repo buckets directly. This decouples users from our APT and YUM repos are actually hosted, allowing us to change the hosting mechanism without breaking users' URLs.
+Users interact with `{apt,yum}.fullstaqruby.org` rather than the buckets directly. This decouples users from where the repos are actually hosted, allowing us to change the hosting mechanism without breaking users' repository URLs.

 > Historic note: our APT and YUM repos used to be hosted on Bintray. But Bintray shut down on March 1 2021. The HTTP redirection mechanism allowed us to move away from Bintray with minimal downtime, and without breaking users' repository URLs.

-TLS is configured through the Ingress, using [Google-managed TLS certificates](https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs).
+TLS certificates are obtained via the ACME DNS-01 challenge against Azure DNS. Caddy authenticates to Azure DNS using a service principal whose credentials are stored in Azure Key Vault and injected into the Caddy systemd unit via an environment file.

-## Kubernetes cluster
+## VM (Hetzner)

 - Administered by role: Infra Maintainers

-The Kubernetes cluster runs our Nginx web server. This cluster is in Autopilot mode.
+A single Ubuntu (≥ 24.04) VPS hosted at Hetzner runs every backend service (Caddy, the API server, the API server deployer, Prometheus + node_exporter, fail2ban, AppArmor, unattended-upgrades, ufw). Its forward DNS records (`backend.fullstaqruby.org`, `apt.fullstaqruby.org`, `yum.fullstaqruby.org`) are managed in `terraform/dns.tf`; its static IPs are referenced from `terraform/variables.tf`. The PTR record (`backend.fullstaqruby.org`) is set manually at the Hetzner provider during bootstrapping (see [bootstrapping](infrastructure-bootstrapping.md) Step 7), not via Terraform.

-## DNS, static IPs, Ingresses
+The VM is configured entirely by Ansible (`ansible/main.yml`). The playbook covers OS hardening (SSH, fail2ban, AppArmor, ufw, autoreboot, unattended-upgrades) and the service stack (Prometheus, Caddy, apiserver-deployer, apiserver). There is no Kubernetes — the previous GKE Autopilot setup was replaced by this VM in the July 2024 rearchitecture.
+
+## DNS

 - Administered by role: Infra Maintainers

-All DNS entries for `fullstaqruby.org` are managed through Google Cloud DNS, inside the `fullstaq-ruby` project.
+All DNS records for `fullstaqruby.org` are managed in Azure DNS (`terraform/dns.tf`).
+
+- `fullstaqruby.org` and `www.fullstaqruby.org` point to GitHub Pages, where we host the [website](https://github.com/fullstaq-ruby/website).
+- `backend.fullstaqruby.org`, `apt.fullstaqruby.org`, and `yum.fullstaqruby.org` all resolve to the backend VM's IP addresses.

-- `fullstaqruby.org` points to Github Pages, where we host the [website](https://github.com/fullstaq-ruby/website).
-- `{apt,yum}.fullstaqruby.org` point to Nginx's two virtual hosts, via two different static IPs and two different Kubernetes Ingresses.
+`apt.fullstaqruby.org` and `yum.fullstaqruby.org` are also delegated as their own Azure DNS zones so that Caddy's ACME service principal has DNS-write access scoped only to those subdomains, not the parent zone.

 ## Domain name

 - Administrated by role: Infra Owners

-The `fullstaqruby.org` domain is registered at [TransIP](https://www.transip.nl/). It's registered using [Fullstaq](https://www.fullstaq.com)'s account. The DNS zone is not managed at TransIP, but at Google Cloud DNS.
+The `fullstaqruby.org` domain is registered at [TransIP](https://www.transip.nl/). It's registered using [Fullstaq](https://www.fullstaq.com)'s account. The DNS zone is not managed at TransIP, but at Azure DNS (see [DNS](#dns)).

 ## Github CI bot account

@@ -88,13 +140,13 @@ This Github bot account is used by the Server Edition's CI/CD system. The accoun

 - Administered by role: Infra Maintainers

-The Terraform state for normal infrastructure is stored in a Google Cloud Storage bucket, inside the `fullstaq-ruby` project.
+The Terraform state for normal infrastructure is stored in an Azure Blob Storage container in the `fullstaq-ruby-terraform` resource group (see `terraform/backend.tf`).

 ## Terraform state (hisec)

 - Administered by role: Infra Owners

-The Terraform state for sensitive infrastructure is stored in a Google Cloud Storage bucket, inside the `fullstaq-ruby-hisec` project.
+The Terraform state for sensitive infrastructure is stored in an Azure Blob Storage container in the `fullstaq-ruby-terraform-hisec` resource group (see `terraform-hisec/backend.tf`).

 ## Server Edition APT & YUM repo buckets

@@ -102,31 +154,25 @@ The Terraform state for sensitive infrastructure is stored in a Google Cloud Sto

 The Server Edition's APT and YUM repositories are stored inside these buckets. These buckets are publicly readable.

-Users don't access these buckets directly. Instead, they access `apt.fullstaqruby.org` and `yum.fullstaqruby.org` (served by the Nginx web servers), which redirect to these buckets.
-
-## Container registry
-
-- Administered by role: Infra Maintainers
-
-The `fullstaq-ruby` project has a container registry. This registry has two uses:
-
-- It's used by developers' CI/CD systems, for example to store [build environment images](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/build-environments.md).
-- It's used to store the API server's images.
+Users don't access these buckets directly. Instead, they access `apt.fullstaqruby.org` and `yum.fullstaqruby.org` (served by Caddy on the backend VM), which redirect to these buckets.

 ## Server Edition CI artifacts store

 - Administered by role: Infra Maintainers

-The Server Edition's CI/CD system stores artifacts in this bucket, for the purpose of implementing [resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md). Objects in this bucket only live for 30 days.
+The Server Edition's CI/CD system stores artifacts for [CI/CD resumption](https://github.com/fullstaq-ruby/server-edition/blob/main/dev-handbook/ci-cd-resumption.md) in two buckets (see `terraform/ci_storage.tf`):
+
+- A **GCS bucket** (`${var.gcloud_bucket_prefix}-server-edition-ci-artifacts`) — publicly readable; the `test` environment writes via WIF, the `deploy` environment reads. Objects expire after 30 days.
+- An **Azure Blob container** (`server-edition-ci-artifacts` inside the `${var.storage_account_prefix}seredci1` storage account) — private; the `test` environment writes via Federated Identity Credentials, the `deploy` environment reads. Objects expire after 30 days.

 ## Server Edition CI cache store

 - Administered by role: Infra Maintainers

-The Server Edition's CI/CD system stores caches in this bucket. Objects are automatically deleted 90 days after last access.
+The Server Edition's CI/CD system stores caches in an **Azure Blob container** (`server-edition-ci-cache` inside the same `${var.storage_account_prefix}seredci1` storage account; see `terraform/ci_storage.tf`). Only the `test` environment writes; objects are automatically deleted 90 days after last access. There is no equivalent cache bucket on GCP.

 ## GPG private key

 - Administered by role: Infra Owners, Infra Maintainers

-The GPG private key is used to sign APT and YUM repositories. We store the canonical copy in Secrets Manager in the `fullstaq-ruby-hisec` Google Cloud project. We store a secondary copy in the Secret Manager in the `fullstaq-ruby` Google Cloud project.
+The GPG private key is used to sign APT and YUM repositories. It is stored in the Azure Key Vault for Infra Owners — `${var.key_vault_prefix}infraowners`, currently `fsruby2infraowners` (see `terraform-hisec/key_vault.tf`). GitHub Actions in the `test` and `deploy` environments are granted read access via Entra ID Federated Identity Credentials.
diff --git a/docs/pull_request_template.md b/docs/pull_request_template.md
index 96bd077..b7bb3ee 100644
--- a/docs/pull_request_template.md
+++ b/docs/pull_request_template.md
@@ -28,8 +28,7 @@ Check a box by filling in 'x', like so...
 -->

 - [ ] Documentation updated
-  - [ ] Infrastructure overview diagram updated
-
+  - [ ] Infrastructure overview diagram updated (Mermaid block in `docs/infrastructure-overview.md`)
 - [ ] The following content/files are kept in sync with each other:
   + The Terraform version specified in `**/providers.tf`, `.github/workflows/code-reviews.yml` and `docs/required-devtools.md`
   + The list of responsibilities specified in `.github/ISSUE_TEMPLATE/apply_join_team.md` and `docs/responsibilities-expectations.md`
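
Reviewer sketch, not part of the patch: the "CI/CD authentication" section above says the infra workflow mints a GitHub OIDC JWT (audience `backend.fullstaqruby.org`) and sends it as a bearer token to `POST /admin/upgrade_apiserver`, reached through the `apt.` vhost. The Python below illustrates that flow under stated assumptions. The function names are hypothetical, the empty request body is an assumption, and the real `apiserver.yml` workflow may be implemented quite differently. `ACTIONS_ID_TOKEN_REQUEST_URL` and `ACTIONS_ID_TOKEN_REQUEST_TOKEN` are the environment variables GitHub Actions provides for requesting an OIDC token.

```python
import json
import os
import urllib.parse
import urllib.request

# Both values are quoted from the docs in this patch.
AUDIENCE = "backend.fullstaqruby.org"
UPGRADE_URL = "https://apt.fullstaqruby.org/admin/upgrade_apiserver"


def build_token_request(request_url, request_token, audience=AUDIENCE):
    """Build the GET request that asks GitHub Actions for an OIDC JWT."""
    # GitHub's request URL already carries a query string, so the audience
    # is appended with '&'.
    url = f"{request_url}&audience={urllib.parse.quote(audience, safe='')}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {request_token}"}
    )


def build_upgrade_request(jwt, url=UPGRADE_URL):
    """Build the POST that presents the JWT as a bearer token.

    Assumption: the endpoint needs no request body.
    """
    return urllib.request.Request(
        url, data=b"", method="POST", headers={"Authorization": f"Bearer {jwt}"}
    )


def trigger_upgrade():
    """Mint a token and call the endpoint; only works inside a workflow job."""
    token_request = build_token_request(
        os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"],
        os.environ["ACTIONS_ID_TOKEN_REQUEST_TOKEN"],
    )
    with urllib.request.urlopen(token_request) as response:
        jwt = json.load(response)["value"]  # GitHub returns {"value": "<jwt>"}
    with urllib.request.urlopen(build_upgrade_request(jwt)) as response:
        return response.status
```

`trigger_upgrade()` would only succeed in a job that declares `permissions: id-token: write`, since GitHub sets the two environment variables only in that case. Splitting request construction from sending keeps the header and audience handling checkable without network access.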