108 changes: 108 additions & 0 deletions docs/holmes-investigation.md
@@ -0,0 +1,108 @@
# Holmes Investigation API

The Holmes investigation API is an admin endpoint that runs [HolmesGPT](https://github.com/robusta-dev/holmesgpt) diagnostic investigations on ARO clusters. It creates a short-lived pod on the Hive AKS cluster that connects to the target cluster, runs diagnostic queries, and streams the results back to the caller.

**Endpoint:** `POST /admin/subscriptions/{subscriptionId}/resourcegroups/{resourceGroup}/providers/Microsoft.RedHatOpenShift/openShiftClusters/{clusterName}/investigate`
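
A request needs only the path and a one-field JSON body. The sketch below builds both; the subscription, resource group, and cluster values are placeholders, and the helper names are illustrative rather than taken from the RP code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// investigatePath builds the admin API path for a cluster; the segment
// layout mirrors the endpoint above.
func investigatePath(subscriptionID, resourceGroup, clusterName string) string {
	return fmt.Sprintf(
		"/admin/subscriptions/%s/resourcegroups/%s/providers/Microsoft.RedHatOpenShift/openShiftClusters/%s/investigate",
		subscriptionID, resourceGroup, clusterName)
}

// investigateBody marshals the request payload; the {"question": ...} shape
// matches the jq invocation in hack/test-holmes-investigate.sh.
func investigateBody(question string) ([]byte, error) {
	return json.Marshal(map[string]string{"question": question})
}

func main() {
	path := investigatePath("sub-id", "v4-eastus", "my-cluster")
	body, _ := investigateBody("what is the cluster health status?")
	fmt.Println(path)
	fmt.Println(string(body))
}
```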

## Configuration Reference

| Config | Env Var | Key Vault Secret (prod) | Default | Required |
|--------|---------|------------------------|---------|----------|
| Azure OpenAI API key | `HOLMES_AZURE_API_KEY` | `holmes-azure-api-key` | — | Yes |
| Azure OpenAI endpoint | `HOLMES_AZURE_API_BASE` | `holmes-azure-api-base` | — | Yes |
| HolmesGPT container image | `HOLMES_IMAGE` | — | `quay.io/haoran/holmesgpt:latest` | No |
| Azure OpenAI API version | `HOLMES_AZURE_API_VERSION` | — | `2025-04-01-preview` | No |
| LLM model name | `HOLMES_MODEL` | — | `azure/gpt-5.2` | No |
| Pod timeout (seconds) | `HOLMES_DEFAULT_TIMEOUT` | — | `600` | No |
| Max concurrent investigations per RP | `HOLMES_MAX_CONCURRENT` | — | `20` | No |

## Config Loading

Configuration is loaded once at RP startup in `NewFrontend` (`pkg/frontend/frontend.go`).

**Development mode** (`RP_MODE=development`): All values are read from environment variables via `NewHolmesConfigFromEnv()`.

**Production mode**: Sensitive values (API key, API base) are read from the service Key Vault (`{KEYVAULT_PREFIX}-svc`). Non-secret values (image, model, timeout, concurrency) are read from environment variables. This uses `NewHolmesConfig(ctx, serviceKeyvault)`.

**Soft-load behavior**: If loading fails (e.g., Key Vault secrets not provisioned), the RP logs a warning and starts normally. Only the investigate endpoint returns an error ("Holmes investigation is not configured"). This allows the RP to operate without Holmes configured.

The loaded config is stored on the `frontend` struct as `holmesConfig *holmes.HolmesConfig` and reused for all investigation requests.

## How Config Reaches the Pod

When an investigation request arrives, the RP creates three Kubernetes resources in the cluster's Hive namespace:

1. **Secret** (`holmes-kubeconfig-{id}`) — Contains:
- `config`: Short-lived (1h) kubeconfig for `system:aro-diagnostics` identity
- `azure-api-key`: From `holmesConfig.AzureAPIKey`
- `azure-api-base`: From `holmesConfig.AzureAPIBase`
- `azure-api-version`: From `holmesConfig.AzureAPIVersion`

2. **ConfigMap** (`holmes-config-{id}`) — Embedded toolset config from `pkg/hive/staticresources/holmes-config.yaml` (defines which kubectl commands Holmes can use)

3. **Pod** (`holmes-investigate-{id}`) — Runs:
```
python holmes_cli.py ask "<question>" -n --model=<Model> --config=/etc/holmes/config.yaml
```
- Image from `holmesConfig.Image`
- `ActiveDeadlineSeconds` from `holmesConfig.DefaultTimeout`
- Azure credentials injected as environment variables from the Secret
- Kubeconfig mounted at `/etc/kubeconfig/config`

All three resources are cleaned up after the investigation completes (or fails).
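
Pictured as a manifest, the investigation pod looks roughly like the following. Resource names, env var names, and field values here are illustrative, not copied from `pkg/hive/investigate.go`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: holmes-investigate-abc123        # {id} is unique per investigation
  namespace: aro-cluster-namespace       # the cluster's Hive namespace
spec:
  activeDeadlineSeconds: 600             # holmesConfig.DefaultTimeout
  restartPolicy: Never
  containers:
  - name: holmes
    image: quay.io/haoran/holmesgpt:latest   # holmesConfig.Image
    command: ["python", "holmes_cli.py", "ask", "<question>", "-n",
              "--model=azure/gpt-5.2", "--config=/etc/holmes/config.yaml"]
    env:                                 # credentials injected from the Secret;
    - name: AZURE_API_KEY                # env var names here are assumptions
      valueFrom:
        secretKeyRef:
          name: holmes-kubeconfig-abc123
          key: azure-api-key
    volumeMounts:
    - name: kubeconfig
      mountPath: /etc/kubeconfig         # kubeconfig available at /etc/kubeconfig/config
  volumes:
  - name: kubeconfig
    secret:
      secretName: holmes-kubeconfig-abc123
```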

## Development Setup

1. Ensure prerequisites: VPN connected, `secrets/env` generated, `aks.kubeconfig` generated

2. Export Holmes environment variables:
```bash
source env && source secrets/env
export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig)
export ARO_INSTALL_VIA_HIVE=true
export ARO_ADOPT_BY_HIVE=true
export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest"
export HOLMES_AZURE_API_KEY="<your-azure-openai-key>"
export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>"
```

3. Start the local RP: `make runlocal-rp`

4. Run an investigation:
```bash
./hack/test-holmes-investigate.sh <cluster-name> "what is the cluster health status?"
```

## Key Vault Provisioning (Staging/Production)

Create the following secrets in the service Key Vault (`{KEYVAULT_PREFIX}-svc`):

| Secret Name | Value |
|-------------|-------|
| `holmes-azure-api-key` | Azure OpenAI API key |
| `holmes-azure-api-base` | Azure OpenAI endpoint URL (e.g., `https://<resource>.openai.azure.com`) |

Only the secret Holmes settings come from Key Vault. Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, `HOLMES_AZURE_API_VERSION`, `HOLMES_DEFAULT_TIMEOUT`, `HOLMES_MAX_CONCURRENT`) is **not yet wired through the deploy generator/templates via ARM deployment parameters** (`pkg/deploy/generator/resources_rp.go`), so staging/production will use the application defaults until that deployment wiring is added.

## Security

- **Cluster access**: Investigation pods use a `system:aro-diagnostics` identity with read-only RBAC (get/list/watch only). The kubeconfig certificate expires after 1 hour.
- **Pod security**: Runs as non-root (UID 1000), no privilege escalation, all capabilities dropped, service account token not mounted.
- **Toolset restrictions**: Destructive commands (`kubectl delete`, `kubectl apply`, `kubectl exec`, `rm`) are blocked in the Holmes toolset config.
- **Rate limiting**: Per-RP-instance atomic counter limits concurrent investigations (default 20).
- **Input validation**: Question limited to 1000 characters, control characters rejected, model name validated against safe character pattern.
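
The last two bullets can be sketched as follows; the function names, exact character pattern, and error messages are assumptions, not the RP's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"sync/atomic"
	"unicode"
)

// modelPattern approximates the "safe character" check; the exact pattern
// used by the RP may differ.
var modelPattern = regexp.MustCompile(`^[a-zA-Z0-9._/-]+$`)

// validateRequest applies the documented limits: 1000-character question,
// no control characters, model restricted to a safe character set.
func validateRequest(question, model string) error {
	if len(question) == 0 || len(question) > 1000 {
		return errors.New("question must be 1-1000 characters")
	}
	for _, r := range question {
		if unicode.IsControl(r) {
			return errors.New("question contains control characters")
		}
	}
	if !modelPattern.MatchString(model) {
		return errors.New("invalid model name")
	}
	return nil
}

// investigations is the per-RP-instance counter; acquire returns false once
// maxConcurrent (default 20) in-flight investigations are reached.
var investigations int64

func acquire(maxConcurrent int64) bool {
	if atomic.AddInt64(&investigations, 1) > maxConcurrent {
		atomic.AddInt64(&investigations, -1) // roll back the optimistic increment
		return false
	}
	return true
}

func release() { atomic.AddInt64(&investigations, -1) }

func main() {
	fmt.Println(validateRequest("what is the cluster health status?", "azure/gpt-5.2"))
	if acquire(20) {
		defer release()
		fmt.Println("investigation slot acquired")
	}
}
```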

## Code Locations

| Component | File |
|-----------|------|
| Config struct and loaders | `pkg/util/holmes/config.go` |
| Config loading at startup | `pkg/frontend/frontend.go` (search `holmesConfig`) |
| Admin API handler | `pkg/frontend/admin_openshiftcluster_investigate.go` |
| Kubeconfig generation | `pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go` |
| Pod creation and streaming | `pkg/hive/investigate.go` |
| Kubeconfig transformation (dev) | `pkg/util/holmes/kubeconfig.go` |
| Holmes toolset config | `pkg/hive/staticresources/holmes-config.yaml` |
| RBAC ClusterRole | `pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml` |
| RBAC ClusterRoleBinding | `pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml` |
| E2E test script | `hack/test-holmes-investigate.sh` |
91 changes: 91 additions & 0 deletions hack/test-holmes-investigate.sh
@@ -0,0 +1,91 @@
#!/bin/bash
# Test script for the Holmes investigation admin API endpoint.
#
# Prerequisites:
# 1. VPN connected to the dev environment
# 2. secrets/ folder generated: SECRET_SA_ACCOUNT_NAME=rharosecretsdev make secrets
# 3. AKS kubeconfig generated: make aks.kubeconfig
# 4. A test cluster created via: CLUSTER=<name> go run ./hack/cluster create
# 5. Local RP running with Hive enabled (see below)
#
# Usage:
# ./hack/test-holmes-investigate.sh <cluster-name> [question]
#
# Examples:
# ./hack/test-holmes-investigate.sh haowang-holmes-test
# ./hack/test-holmes-investigate.sh haowang-holmes-test "why is pod X crashing?"
# ./hack/test-holmes-investigate.sh haowang-holmes-test "check node memory usage"
#
# To start the local RP with Hive + Holmes enabled:
#
# source env && source secrets/env
# export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig)
# export ARO_INSTALL_VIA_HIVE=true
# export ARO_ADOPT_BY_HIVE=true
# export ARO_PODMAN_SOCKET="unix://$(podman machine inspect --format '{{.ConnectionInfo.PodmanSocket.Path}}')"
# export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest"
# export HOLMES_AZURE_API_KEY="<your-azure-openai-key>"
# export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>"
# export HOLMES_AZURE_API_VERSION="2025-04-01-preview"
# export HOLMES_MODEL="azure/gpt-5.2"
# make runlocal-rp

set -euo pipefail

CLUSTER_NAME="${1:-}"
QUESTION="${2:-what is the cluster health status?}"

if [[ -z "$CLUSTER_NAME" ]]; then
echo "Usage: $0 <cluster-name> [question]"
echo ""
echo "Examples:"
echo " $0 haowang-holmes-test"
echo " $0 haowang-holmes-test 'why is pod X crashing?'"
exit 1
fi

# Source env if not already loaded
if [[ -z "${AZURE_SUBSCRIPTION_ID:-}" ]]; then
if [[ -f env ]] && [[ -f secrets/env ]]; then
source env
source secrets/env
else
echo "Error: AZURE_SUBSCRIPTION_ID not set and env files not found."
echo "Run from the repo root, or source env && source secrets/env first."
exit 1
fi
fi

RESOURCEGROUP="${RESOURCEGROUP:-v4-eastus}"
RP_URL="https://localhost:8443"
API_PATH="/admin/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCEGROUP}/providers/Microsoft.RedHatOpenShift/openShiftClusters/${CLUSTER_NAME}/investigate"

echo "============================================"
echo " Holmes Investigation Test"
echo "============================================"
echo " Cluster: ${CLUSTER_NAME}"
echo " RG: ${RESOURCEGROUP}"
echo " Question: ${QUESTION}"
echo " Endpoint: POST ${RP_URL}${API_PATH}"
echo "============================================"
echo ""

# Check RP is running
if ! curl -sfk -o /dev/null "${RP_URL}/healthz" 2>/dev/null; then
echo "Error: Local RP is not running at ${RP_URL}"
echo "Start it with: make runlocal-rp (see header comments for full env setup)"
exit 1
fi

echo "Sending investigation request..."
echo "Streaming results (this may take 1-5 minutes):"
echo "--------------------------------------------"

curl -sk --no-buffer -X POST \
"${RP_URL}${API_PATH}" \
-H "Content-Type: application/json" \
-d "$(jq -n --arg q "${QUESTION}" '{question: $q}')"

echo ""
echo "--------------------------------------------"
echo "Investigation complete."
9 changes: 5 additions & 4 deletions pkg/cluster/kubeconfig.go
@@ -25,13 +25,13 @@ import (
// kubeconfig for the ARO service, based on the admin kubeconfig found in the
// graph.
func (m *manager) generateAROServiceKubeconfig(pg graph.PersistedGraph) ([]byte, error) {
-	return generateKubeconfig(pg, "system:aro-service", []string{"system:masters"}, installer.TenYears, true)
+	return GenerateKubeconfig(pg, "system:aro-service", []string{"system:masters"}, installer.TenYears, true)
}

// generateAROSREKubeconfig generates additional admin credentials and a
// kubeconfig for ARO SREs, based on the admin kubeconfig found in the graph.
func (m *manager) generateAROSREKubeconfig(pg graph.PersistedGraph) ([]byte, error) {
-	return generateKubeconfig(pg, "system:aro-sre", nil, installer.TenYears, true)
+	return GenerateKubeconfig(pg, "system:aro-sre", nil, installer.TenYears, true)
}

// checkUserAdminKubeconfigUpdated checks if the user kubeconfig is
@@ -82,7 +82,7 @@ func (m *manager) checkUserAdminKubeconfigUpdated() bool {
// generateUserAdminKubeconfig generates additional admin credentials and a
// kubeconfig for ARO User, based on the admin kubeconfig found in the graph.
func (m *manager) generateUserAdminKubeconfig(pg graph.PersistedGraph) ([]byte, error) {
-	return generateKubeconfig(pg, "system:admin", nil, installer.OneYear, false)
+	return GenerateKubeconfig(pg, "system:admin", nil, installer.OneYear, false)
}

func (m *manager) generateKubeconfigs(ctx context.Context) error {
@@ -127,7 +127,8 @@ func (m *manager) generateKubeconfigs(ctx context.Context) error {
return err
}

-func generateKubeconfig(pg graph.PersistedGraph, commonName string, organization []string, validity time.Duration, internal bool) ([]byte, error) {
+// GenerateKubeconfig generates a kubeconfig with a client certificate signed by the cluster CA.
+func GenerateKubeconfig(pg graph.PersistedGraph, commonName string, organization []string, validity time.Duration, internal bool) ([]byte, error) {
var ca *installer.AdminKubeConfigSignerCertKey
var adminInternalClient *installer.AdminInternalClient
err := pg.GetByName(false, "*tls.AdminKubeConfigSignerCertKey", &ca)
130 changes: 130 additions & 0 deletions pkg/cluster/kubeconfig_test.go
Original file line number Diff line number Diff line change
@@ -143,6 +143,136 @@ func TestGenerateAROServiceKubeconfig(t *testing.T) {
}
}

func TestGenerateDiagnosticsKubeconfig(t *testing.T) {
validCaKey, validCaCerts, err := utiltls.GenerateKeyAndCertificate("validca", nil, nil, true, false)
if err != nil {
t.Fatal(err)
}
encodedKey, err := utilpem.Encode(validCaKey)
if err != nil {
t.Fatal(err)
}
encodedCert, err := utilpem.Encode(validCaCerts[0])
if err != nil {
t.Fatal(err)
}
ca := &installer.AdminKubeConfigSignerCertKey{
SelfSignedCertKey: installer.SelfSignedCertKey{
CertKey: installer.CertKey{
CertRaw: encodedCert,
KeyRaw: encodedKey,
},
},
}

apiserverURL := "https://api-int.hash.rg.mydomain:6443"
clusterName := "api-hash-rg-mydomain:6443"
diagnosticsName := "system:aro-diagnostics"

adminInternalClient := &installer.AdminInternalClient{}
adminInternalClient.Config = &clientcmdv1.Config{
Clusters: []clientcmdv1.NamedCluster{
{
Name: clusterName,
Cluster: clientcmdv1.Cluster{
Server: apiserverURL,
CertificateAuthorityData: []byte("internal API Cert"),
},
},
},
AuthInfos: []clientcmdv1.NamedAuthInfo{},
Contexts: []clientcmdv1.NamedContext{
{
Name: diagnosticsName,
Context: clientcmdv1.Context{
Cluster: clusterName,
AuthInfo: diagnosticsName,
},
},
},
CurrentContext: diagnosticsName,
}

pg := graph.PersistedGraph{}

caData, err := json.Marshal(ca)
if err != nil {
t.Fatal(err)
}
clientData, err := json.Marshal(adminInternalClient)
if err != nil {
t.Fatal(err)
}
pg["*kubeconfig.AdminInternalClient"] = clientData
pg["*tls.AdminKubeConfigSignerCertKey"] = caData

// Generate a 1-hour kubeconfig for system:aro-diagnostics
aroDiagnosticsClient, err := GenerateKubeconfig(pg, diagnosticsName, nil, time.Hour, true)
if err != nil {
t.Fatal(err)
}

var got *clientcmdv1.Config
err = yaml.Unmarshal(aroDiagnosticsClient, &got)
if err != nil {
t.Fatal(err)
}

innerpem := string(got.AuthInfos[0].AuthInfo.ClientCertificateData) + string(got.AuthInfos[0].AuthInfo.ClientKeyData)
_, innercert, err := utilpem.Parse([]byte(innerpem))
if err != nil {
t.Fatal(err)
}

err = innercert[0].CheckSignatureFrom(validCaCerts[0])
if err != nil {
t.Fatal(err)
}

issuer := innercert[0].Issuer.String()
if issuer != "CN=validca" {
t.Error(issuer)
}

subject := innercert[0].Subject.String()
if subject != "CN=system:aro-diagnostics" {
t.Error(subject)
}

// Verify no organization (no system:masters group)
if len(innercert[0].Subject.Organization) != 0 {
t.Errorf("expected no organization, got %v", innercert[0].Subject.Organization)
}

// Verify ~1 hour validity (not 10 years)
expectedExpiry := time.Now().Add(time.Hour)
if innercert[0].NotAfter.After(expectedExpiry.Add(5 * time.Minute)) {
t.Errorf("certificate expires too far in the future: %v", innercert[0].NotAfter)
}
if innercert[0].NotAfter.Before(expectedExpiry.Add(-5 * time.Minute)) {
t.Errorf("certificate expires too soon: %v", innercert[0].NotAfter)
}

keyUsage := innercert[0].KeyUsage
expectedKeyUsage := x509.KeyUsageKeyEncipherment | x509.KeyUsageDigitalSignature
if keyUsage != expectedKeyUsage {
t.Error("Invalid keyUsage.")
}

// Verify internal URL is preserved (not rewritten to external)
if got.Clusters[0].Cluster.Server != apiserverURL {
t.Errorf("expected server %s, got %s", apiserverURL, got.Clusters[0].Cluster.Server)
}

// validate the rest of the struct
got.AuthInfos = []clientcmdv1.NamedAuthInfo{}
want := adminInternalClient.Config

if !reflect.DeepEqual(got, want) {
t.Fatal(cmp.Diff(got, want))
}
}

func TestGenerateUserAdminKubeconfig(t *testing.T) {
validCaKey, validCaCerts, err := utiltls.GenerateKeyAndCertificate("validca", nil, nil, true, false)
if err != nil {