108 changes: 108 additions & 0 deletions docs/holmes-investigation.md
@@ -0,0 +1,108 @@
# Holmes Investigation API

The Holmes investigation API is an admin endpoint that runs [HolmesGPT](https://github.com/robusta-dev/holmesgpt) diagnostic investigations on ARO clusters. It creates a short-lived pod on the Hive AKS cluster that connects to the target cluster, runs diagnostic queries, and streams the results back to the caller.

**Endpoint:** `POST /admin/subscriptions/{subscriptionId}/resourcegroups/{resourceGroup}/providers/Microsoft.RedHatOpenShift/openShiftClusters/{clusterName}/investigate`
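
A request needs only the path and a one-field JSON body. The sketch below builds both; the subscription, resource group, and cluster values are placeholders, and the helper names are illustrative rather than taken from the RP code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// investigatePath builds the admin API path for a cluster; the segment
// layout mirrors the endpoint above.
func investigatePath(subscriptionID, resourceGroup, clusterName string) string {
	return fmt.Sprintf(
		"/admin/subscriptions/%s/resourcegroups/%s/providers/Microsoft.RedHatOpenShift/openShiftClusters/%s/investigate",
		subscriptionID, resourceGroup, clusterName)
}

// investigateBody marshals the request payload; the {"question": ...} shape
// matches the jq invocation in hack/test-holmes-investigate.sh.
func investigateBody(question string) ([]byte, error) {
	return json.Marshal(map[string]string{"question": question})
}

func main() {
	path := investigatePath("sub-id", "v4-eastus", "my-cluster")
	body, _ := investigateBody("what is the cluster health status?")
	fmt.Println(path)
	fmt.Println(string(body))
}
```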

## Configuration Reference

| Config | Env Var | Key Vault Secret (prod) | Default | Required |
|--------|---------|------------------------|---------|----------|
| Azure OpenAI API key | `HOLMES_AZURE_API_KEY` | `holmes-azure-api-key` | — | Yes |
| Azure OpenAI endpoint | `HOLMES_AZURE_API_BASE` | `holmes-azure-api-base` | — | Yes |
| HolmesGPT container image | `HOLMES_IMAGE` | — | `quay.io/haoran/holmesgpt:latest` | No |
| Azure OpenAI API version | `HOLMES_AZURE_API_VERSION` | — | `2025-04-01-preview` | No |
| LLM model name | `HOLMES_MODEL` | — | `azure/gpt-5.2` | No |
| Pod timeout (seconds) | `HOLMES_DEFAULT_TIMEOUT` | — | `600` | No |
| Max concurrent investigations per RP | `HOLMES_MAX_CONCURRENT` | — | `20` | No |

## Config Loading

Configuration is loaded once at RP startup in `NewFrontend` (`pkg/frontend/frontend.go`).

**Development mode** (`RP_MODE=development`): All values are read from environment variables via `NewHolmesConfigFromEnv()`.

**Production mode**: Sensitive values (API key, API base) are read from the service Key Vault (`{KEYVAULT_PREFIX}-svc`). Non-secret values (image, model, timeout, concurrency) are read from environment variables. This uses `NewHolmesConfig(ctx, serviceKeyvault)`.

**Soft-load behavior**: If loading fails (e.g., Key Vault secrets not provisioned), the RP logs a warning and starts normally. Only the investigate endpoint returns an error ("Holmes investigation is not configured"). This allows the RP to operate without Holmes configured.

The loaded config is stored on the `frontend` struct as `holmesConfig *holmes.HolmesConfig` and reused for all investigation requests.

## How Config Reaches the Pod

When an investigation request arrives, the RP creates three Kubernetes resources in the cluster's Hive namespace:

1. **Secret** (`holmes-kubeconfig-{id}`) — Contains:
- `config`: Short-lived (1h) kubeconfig for `system:aro-diagnostics` identity
- `azure-api-key`: From `holmesConfig.AzureAPIKey`
- `azure-api-base`: From `holmesConfig.AzureAPIBase`
- `azure-api-version`: From `holmesConfig.AzureAPIVersion`

2. **ConfigMap** (`holmes-config-{id}`) — Embedded toolset config from `pkg/hive/staticresources/holmes-config.yaml` (defines which kubectl commands Holmes can use)

3. **Pod** (`holmes-investigate-{id}`) — Runs:
```
python holmes_cli.py ask "<question>" -n --model=<Model> --config=/etc/holmes/config.yaml
```
- Image from `holmesConfig.Image`
- `ActiveDeadlineSeconds` from `holmesConfig.DefaultTimeout`
- Azure credentials injected as environment variables from the Secret
- Kubeconfig mounted at `/etc/kubeconfig/config`

All three resources are cleaned up after the investigation completes (or fails).
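
Pictured as a manifest, the investigation pod looks roughly like the following. Resource names, env var names, and field values here are illustrative, not copied from `pkg/hive/investigate.go`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: holmes-investigate-abc123        # {id} is unique per investigation
  namespace: aro-cluster-namespace       # the cluster's Hive namespace
spec:
  activeDeadlineSeconds: 600             # holmesConfig.DefaultTimeout
  restartPolicy: Never
  containers:
  - name: holmes
    image: quay.io/haoran/holmesgpt:latest   # holmesConfig.Image
    command: ["python", "holmes_cli.py", "ask", "<question>", "-n",
              "--model=azure/gpt-5.2", "--config=/etc/holmes/config.yaml"]
    env:                                 # credentials injected from the Secret;
    - name: AZURE_API_KEY                # env var names here are assumptions
      valueFrom:
        secretKeyRef:
          name: holmes-kubeconfig-abc123
          key: azure-api-key
    volumeMounts:
    - name: kubeconfig
      mountPath: /etc/kubeconfig         # kubeconfig available at /etc/kubeconfig/config
  volumes:
  - name: kubeconfig
    secret:
      secretName: holmes-kubeconfig-abc123
```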

## Development Setup

1. Ensure prerequisites: VPN connected, `secrets/env` generated, `aks.kubeconfig` generated

2. Export Holmes environment variables:
```bash
source env && source secrets/env
export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig)
export ARO_INSTALL_VIA_HIVE=true
export ARO_ADOPT_BY_HIVE=true
export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest"
export HOLMES_AZURE_API_KEY="<your-azure-openai-key>"
export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>"
```

3. Start the local RP: `make runlocal-rp`

4. Run an investigation:
```bash
./hack/test-holmes-investigate.sh <cluster-name> "what is the cluster health status?"
```

## Key Vault Provisioning (Staging/Production)

Create the following secrets in the service Key Vault (`{KEYVAULT_PREFIX}-svc`):

| Secret Name | Value |
|-------------|-------|
| `holmes-azure-api-key` | Azure OpenAI API key |
| `holmes-azure-api-base` | Azure OpenAI endpoint URL (e.g., `https://<resource>.openai.azure.com`) |

Only the secret Holmes settings come from Key Vault. Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, `HOLMES_AZURE_API_VERSION`, `HOLMES_DEFAULT_TIMEOUT`, `HOLMES_MAX_CONCURRENT`) is **not yet wired through the deploy generator/templates via ARM deployment parameters** (`pkg/deploy/generator/resources_rp.go`), so staging/production will use the application defaults until that deployment wiring is added.

## Security

- **Cluster access**: Investigation pods use a `system:aro-diagnostics` identity with read-only RBAC (get/list/watch only). The kubeconfig certificate expires after 1 hour.
- **Pod security**: Runs as non-root (UID 1000), no privilege escalation, all capabilities dropped, service account token not mounted.
- **Toolset restrictions**: Destructive commands (`kubectl delete`, `kubectl apply`, `kubectl exec`, `rm`) are blocked in the Holmes toolset config.
- **Rate limiting**: Per-RP-instance atomic counter limits concurrent investigations (default 20).
- **Input validation**: Question limited to 1000 characters, control characters rejected, model name validated against safe character pattern.
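
The last two bullets can be sketched as follows; the function names, exact character pattern, and error messages are assumptions, not the RP's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"sync/atomic"
	"unicode"
)

// modelPattern approximates the "safe character" check; the exact pattern
// used by the RP may differ.
var modelPattern = regexp.MustCompile(`^[a-zA-Z0-9._/-]+$`)

// validateRequest applies the documented limits: 1000-character question,
// no control characters, model restricted to a safe character set.
func validateRequest(question, model string) error {
	if len(question) == 0 || len(question) > 1000 {
		return errors.New("question must be 1-1000 characters")
	}
	for _, r := range question {
		if unicode.IsControl(r) {
			return errors.New("question contains control characters")
		}
	}
	if !modelPattern.MatchString(model) {
		return errors.New("invalid model name")
	}
	return nil
}

// investigations is the per-RP-instance counter; acquire returns false once
// maxConcurrent (default 20) in-flight investigations are reached.
var investigations int64

func acquire(maxConcurrent int64) bool {
	if atomic.AddInt64(&investigations, 1) > maxConcurrent {
		atomic.AddInt64(&investigations, -1) // roll back the optimistic increment
		return false
	}
	return true
}

func release() { atomic.AddInt64(&investigations, -1) }

func main() {
	fmt.Println(validateRequest("what is the cluster health status?", "azure/gpt-5.2"))
	if acquire(20) {
		defer release()
		fmt.Println("investigation slot acquired")
	}
}
```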

## Code Locations

| Component | File |
|-----------|------|
| Config struct and loaders | `pkg/util/holmes/config.go` |
| Config loading at startup | `pkg/frontend/frontend.go` (search `holmesConfig`) |
| Admin API handler | `pkg/frontend/admin_openshiftcluster_investigate.go` |
| Kubeconfig generation | `pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go` |
| Pod creation and streaming | `pkg/hive/investigate.go` |
| Kubeconfig transformation (dev) | `pkg/util/holmes/kubeconfig.go` |
| Holmes toolset config | `pkg/hive/staticresources/holmes-config.yaml` |
| RBAC ClusterRole | `pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml` |
| RBAC ClusterRoleBinding | `pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml` |
| E2E test script | `hack/test-holmes-investigate.sh` |
91 changes: 91 additions & 0 deletions hack/test-holmes-investigate.sh
@@ -0,0 +1,91 @@
#!/bin/bash
# Test script for the Holmes investigation admin API endpoint.
#
# Prerequisites:
# 1. VPN connected to the dev environment
# 2. secrets/ folder generated: SECRET_SA_ACCOUNT_NAME=rharosecretsdev make secrets
# 3. AKS kubeconfig generated: make aks.kubeconfig
# 4. A test cluster created via: CLUSTER=<name> go run ./hack/cluster create
# 5. Local RP running with Hive enabled (see below)
#
# Usage:
# ./hack/test-holmes-investigate.sh <cluster-name> [question]
#
# Examples:
# ./hack/test-holmes-investigate.sh haowang-holmes-test
# ./hack/test-holmes-investigate.sh haowang-holmes-test "why is pod X crashing?"
# ./hack/test-holmes-investigate.sh haowang-holmes-test "check node memory usage"
#
# To start the local RP with Hive + Holmes enabled:
#
# source env && source secrets/env
# export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig)
# export ARO_INSTALL_VIA_HIVE=true
# export ARO_ADOPT_BY_HIVE=true
# export ARO_PODMAN_SOCKET="unix://$(podman machine inspect --format '{{.ConnectionInfo.PodmanSocket.Path}}')"
# export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest"
# export HOLMES_AZURE_API_KEY="<your-azure-openai-key>"
# export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>"
# export HOLMES_AZURE_API_VERSION="2025-04-01-preview"
# export HOLMES_MODEL="azure/gpt-5.2"
# make runlocal-rp

set -euo pipefail

CLUSTER_NAME="${1:-}"
QUESTION="${2:-what is the cluster health status?}"

if [[ -z "$CLUSTER_NAME" ]]; then
echo "Usage: $0 <cluster-name> [question]"
echo ""
echo "Examples:"
echo " $0 haowang-holmes-test"
echo " $0 haowang-holmes-test 'why is pod X crashing?'"
exit 1
fi

# Source env if not already loaded
if [[ -z "${AZURE_SUBSCRIPTION_ID:-}" ]]; then
if [[ -f env ]] && [[ -f secrets/env ]]; then
source env
source secrets/env
else
echo "Error: AZURE_SUBSCRIPTION_ID not set and env files not found."
echo "Run from the repo root, or source env && source secrets/env first."
exit 1
fi
fi

RESOURCEGROUP="${RESOURCEGROUP:-v4-eastus}"
RP_URL="https://localhost:8443"
API_PATH="/admin/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCEGROUP}/providers/Microsoft.RedHatOpenShift/openShiftClusters/${CLUSTER_NAME}/investigate"

echo "============================================"
echo " Holmes Investigation Test"
echo "============================================"
echo " Cluster: ${CLUSTER_NAME}"
echo " RG: ${RESOURCEGROUP}"
echo " Question: ${QUESTION}"
echo " Endpoint: POST ${RP_URL}${API_PATH}"
echo "============================================"
echo ""

# Check RP is running
if ! curl -sfk -o /dev/null "${RP_URL}/healthz" 2>/dev/null; then
echo "Error: Local RP is not running at ${RP_URL}"
echo "Start it with: make runlocal-rp (see header comments for full env setup)"
exit 1
fi

echo "Sending investigation request..."
echo "Streaming results (this may take 1-5 minutes):"
echo "--------------------------------------------"

curl -sk --no-buffer -X POST \
"${RP_URL}${API_PATH}" \
-H "Content-Type: application/json" \
-d "$(jq -n --arg q "${QUESTION}" '{question: $q}')"

echo ""
echo "--------------------------------------------"
echo "Investigation complete."
9 changes: 5 additions & 4 deletions pkg/cluster/kubeconfig.go
@@ -25,13 +25,13 @@ import (
// kubeconfig for the ARO service, based on the admin kubeconfig found in the
// graph.
func (m *manager) generateAROServiceKubeconfig(pg graph.PersistedGraph) ([]byte, error) {
-	return generateKubeconfig(pg, "system:aro-service", []string{"system:masters"}, installer.TenYears, true)
+	return GenerateKubeconfig(pg, "system:aro-service", []string{"system:masters"}, installer.TenYears, true)
}

// generateAROSREKubeconfig generates additional admin credentials and a
// kubeconfig for ARO SREs, based on the admin kubeconfig found in the graph.
func (m *manager) generateAROSREKubeconfig(pg graph.PersistedGraph) ([]byte, error) {
-	return generateKubeconfig(pg, "system:aro-sre", nil, installer.TenYears, true)
+	return GenerateKubeconfig(pg, "system:aro-sre", nil, installer.TenYears, true)
}

// checkUserAdminKubeconfigUpdated checks if the user kubeconfig is
@@ -82,7 +82,7 @@ func (m *manager) checkUserAdminKubeconfigUpdated() bool {
// generateUserAdminKubeconfig generates additional admin credentials and a
// kubeconfig for ARO User, based on the admin kubeconfig found in the graph.
func (m *manager) generateUserAdminKubeconfig(pg graph.PersistedGraph) ([]byte, error) {
-	return generateKubeconfig(pg, "system:admin", nil, installer.OneYear, false)
+	return GenerateKubeconfig(pg, "system:admin", nil, installer.OneYear, false)
}

func (m *manager) generateKubeconfigs(ctx context.Context) error {
@@ -127,7 +127,8 @@ func (m *manager) generateKubeconfigs(ctx context.Context) error {
return err
}

-func generateKubeconfig(pg graph.PersistedGraph, commonName string, organization []string, validity time.Duration, internal bool) ([]byte, error) {
+// GenerateKubeconfig generates a kubeconfig with a client certificate signed by the cluster CA.
+func GenerateKubeconfig(pg graph.PersistedGraph, commonName string, organization []string, validity time.Duration, internal bool) ([]byte, error) {
var ca *installer.AdminKubeConfigSignerCertKey
var adminInternalClient *installer.AdminInternalClient
err := pg.GetByName(false, "*tls.AdminKubeConfigSignerCertKey", &ca)
130 changes: 130 additions & 0 deletions pkg/cluster/kubeconfig_test.go
Original file line number Diff line number Diff line change
@@ -143,6 +143,136 @@ func TestGenerateAROServiceKubeconfig(t *testing.T) {
}
}

func TestGenerateDiagnosticsKubeconfig(t *testing.T) {
validCaKey, validCaCerts, err := utiltls.GenerateKeyAndCertificate("validca", nil, nil, true, false)
if err != nil {
t.Fatal(err)
}
encodedKey, err := utilpem.Encode(validCaKey)
if err != nil {
t.Fatal(err)
}
encodedCert, err := utilpem.Encode(validCaCerts[0])
if err != nil {
t.Fatal(err)
}
ca := &installer.AdminKubeConfigSignerCertKey{
SelfSignedCertKey: installer.SelfSignedCertKey{
CertKey: installer.CertKey{
CertRaw: encodedCert,
KeyRaw: encodedKey,
},
},
}

apiserverURL := "https://api-int.hash.rg.mydomain:6443"
clusterName := "api-hash-rg-mydomain:6443"
diagnosticsName := "system:aro-diagnostics"

adminInternalClient := &installer.AdminInternalClient{}
adminInternalClient.Config = &clientcmdv1.Config{
Clusters: []clientcmdv1.NamedCluster{
{
Name: clusterName,
Cluster: clientcmdv1.Cluster{
Server: apiserverURL,
CertificateAuthorityData: []byte("internal API Cert"),
},
},
},
AuthInfos: []clientcmdv1.NamedAuthInfo{},
Contexts: []clientcmdv1.NamedContext{
{
Name: diagnosticsName,
Context: clientcmdv1.Context{
Cluster: clusterName,
AuthInfo: diagnosticsName,
},
},
},
CurrentContext: diagnosticsName,
}

pg := graph.PersistedGraph{}

caData, err := json.Marshal(ca)
if err != nil {
t.Fatal(err)
}
clientData, err := json.Marshal(adminInternalClient)
if err != nil {
t.Fatal(err)
}
pg["*kubeconfig.AdminInternalClient"] = clientData
pg["*tls.AdminKubeConfigSignerCertKey"] = caData

// Generate a 1-hour kubeconfig for system:aro-diagnostics
aroDiagnosticsClient, err := GenerateKubeconfig(pg, diagnosticsName, nil, time.Hour, true)
if err != nil {
t.Fatal(err)
}

var got *clientcmdv1.Config
err = yaml.Unmarshal(aroDiagnosticsClient, &got)
if err != nil {
t.Fatal(err)
}

innerpem := string(got.AuthInfos[0].AuthInfo.ClientCertificateData) + string(got.AuthInfos[0].AuthInfo.ClientKeyData)
_, innercert, err := utilpem.Parse([]byte(innerpem))
if err != nil {
t.Fatal(err)
}

err = innercert[0].CheckSignatureFrom(validCaCerts[0])
if err != nil {
t.Fatal(err)
}

issuer := innercert[0].Issuer.String()
if issuer != "CN=validca" {
t.Error(issuer)
}

subject := innercert[0].Subject.String()
if subject != "CN=system:aro-diagnostics" {
t.Error(subject)
}

// Verify no organization (no system:masters group)
if len(innercert[0].Subject.Organization) != 0 {
t.Errorf("expected no organization, got %v", innercert[0].Subject.Organization)
}

// Verify ~1 hour validity (not 10 years)
expectedExpiry := time.Now().Add(time.Hour)
if innercert[0].NotAfter.After(expectedExpiry.Add(5 * time.Minute)) {
t.Errorf("certificate expires too far in the future: %v", innercert[0].NotAfter)
}
if innercert[0].NotAfter.Before(expectedExpiry.Add(-5 * time.Minute)) {
t.Errorf("certificate expires too soon: %v", innercert[0].NotAfter)
}

keyUsage := innercert[0].KeyUsage
expectedKeyUsage := x509.KeyUsageKeyEncipherment | x509.KeyUsageDigitalSignature
if keyUsage != expectedKeyUsage {
t.Error("Invalid keyUsage.")
}

// Verify internal URL is preserved (not rewritten to external)
if got.Clusters[0].Cluster.Server != apiserverURL {
t.Errorf("expected server %s, got %s", apiserverURL, got.Clusters[0].Cluster.Server)
}

// validate the rest of the struct
got.AuthInfos = []clientcmdv1.NamedAuthInfo{}
want := adminInternalClient.Config

if !reflect.DeepEqual(got, want) {
t.Fatal(cmp.Diff(got, want))
}
}

func TestGenerateUserAdminKubeconfig(t *testing.T) {
validCaKey, validCaCerts, err := utiltls.GenerateKeyAndCertificate("validca", nil, nil, true, false)
if err != nil {