feat: add Holmes investigation admin API endpoint (ARO-25791) #4754
Open: wanghaoran1988 wants to merge 7 commits into master from haowang/ARO-25791/holmes-admin-api
+2,009
−9
7 commits
0834253 feat: integrate Holmes investigation API into ARO-RP admin API (wanghaoran1988)
a2b783a fix: address code review findings for Holmes investigate API (wanghaoran1988)
717b3f0 fix: regenerate bindata with correct octal literal formatting (wanghaoran1988)
3184970 fix: address Copilot PR review feedback (wanghaoran1988)
f3a7829 fix: address remaining Copilot PR review feedback (wanghaoran1988)
13cc9c2 fix: address second round of Copilot PR review feedback (wanghaoran1988)
b67adfa fix: address tuxerrante's PR review feedback (wanghaoran1988)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| # Holmes Investigation API | ||
|
|
||
| The Holmes investigation API is an admin endpoint that runs [HolmesGPT](https://github.com/robusta-dev/holmesgpt) diagnostic investigations on ARO clusters. It creates a short-lived pod on the Hive AKS cluster that connects to the target cluster, runs diagnostic queries, and streams the results back to the caller. | ||
|
|
||
| **Endpoint:** `POST /admin/subscriptions/{subscriptionId}/resourcegroups/{resourceGroup}/providers/Microsoft.RedHatOpenShift/openShiftClusters/{clusterName}/investigate` | ||
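As a rough illustration of how a caller might assemble this request, the Go sketch below builds the path from the endpoint template and encodes a JSON body. The function names `investigatePath` and `investigateBody` are hypothetical; the `question` field name is taken from the test script in this PR, and the subscription ID and cluster name in `main` are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// investigatePath fills in the endpoint template. The function name is
// hypothetical; only the path layout comes from the documentation.
func investigatePath(subscriptionID, resourceGroup, clusterName string) string {
	return fmt.Sprintf(
		"/admin/subscriptions/%s/resourcegroups/%s/providers/Microsoft.RedHatOpenShift/openShiftClusters/%s/investigate",
		url.PathEscape(subscriptionID),
		url.PathEscape(resourceGroup),
		url.PathEscape(clusterName),
	)
}

// investigateBody encodes the request payload with a single "question" field.
func investigateBody(question string) ([]byte, error) {
	return json.Marshal(map[string]string{"question": question})
}

func main() {
	// Placeholder identifiers, not real resources.
	path := investigatePath("00000000-0000-0000-0000-000000000000", "v4-eastus", "my-cluster")
	body, err := investigateBody("what is the cluster health status?")
	if err != nil {
		panic(err)
	}
	fmt.Println("POST", path)
	fmt.Println(string(body))
}
```

Escaping each path segment is a defensive choice in this sketch; it is not something the PR states about the real client or handler.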
|
|
||
| ## Configuration Reference | ||
|
|
||
| | Config | Env Var | Key Vault Secret (prod) | Default | Required | | ||
| |--------|---------|------------------------|---------|----------| | ||
| | Azure OpenAI API key | `HOLMES_AZURE_API_KEY` | `holmes-azure-api-key` | — | Yes | | ||
| | Azure OpenAI endpoint | `HOLMES_AZURE_API_BASE` | `holmes-azure-api-base` | — | Yes | | ||
| | HolmesGPT container image | `HOLMES_IMAGE` | — | `quay.io/haoran/holmesgpt:latest` | No | | ||
| | Azure OpenAI API version | `HOLMES_AZURE_API_VERSION` | — | `2025-04-01-preview` | No | | ||
| | LLM model name | `HOLMES_MODEL` | — | `azure/gpt-5.2` | No | | ||
| | Pod timeout (seconds) | `HOLMES_DEFAULT_TIMEOUT` | — | `600` | No | | ||
| | Max concurrent investigations per RP | `HOLMES_MAX_CONCURRENT` | — | `20` | No | | ||
|
|
||
| ## Config Loading | ||
|
|
||
| Configuration is loaded once at RP startup in `NewFrontend` (`pkg/frontend/frontend.go`). | ||
|
|
||
| **Development mode** (`RP_MODE=development`): All values are read from environment variables via `NewHolmesConfigFromEnv()`. | ||
|
|
||
| **Production mode**: Sensitive values (API key, API base) are read from the service Key Vault (`{KEYVAULT_PREFIX}-svc`); non-secret values (image, model, timeout, concurrency) are read from environment variables. The loader for this mode is `NewHolmesConfig(ctx, serviceKeyvault)`. | ||
|
|
||
| **Soft-load behavior**: If loading fails (e.g., the Key Vault secrets are not provisioned), the RP logs a warning and starts normally. Only the investigate endpoint is affected: it returns an error ("Holmes investigation is not configured"), so the RP can operate without Holmes configured. | ||
|
|
||
| The loaded config is stored on the `frontend` struct as `holmesConfig *holmes.HolmesConfig` and reused for all investigation requests. | ||
|
|
||
| ## How Config Reaches the Pod | ||
|
|
||
| When an investigation request arrives, the RP creates three Kubernetes resources in the cluster's Hive namespace: | ||
|
|
||
| 1. **Secret** (`holmes-kubeconfig-{id}`) — Contains: | ||
| - `config`: Short-lived (1h) kubeconfig for `system:aro-diagnostics` identity | ||
| - `azure-api-key`: From `holmesConfig.AzureAPIKey` | ||
| - `azure-api-base`: From `holmesConfig.AzureAPIBase` | ||
| - `azure-api-version`: From `holmesConfig.AzureAPIVersion` | ||
|
|
||
| 2. **ConfigMap** (`holmes-config-{id}`) — Embedded toolset config from `pkg/hive/staticresources/holmes-config.yaml` (defines which kubectl commands Holmes can use) | ||
|
|
||
| 3. **Pod** (`holmes-investigate-{id}`) — Runs: | ||
| ``` | ||
| python holmes_cli.py ask "<question>" -n --model=<Model> --config=/etc/holmes/config.yaml | ||
| ``` | ||
| - Image from `holmesConfig.Image` | ||
| - `ActiveDeadlineSeconds` from `holmesConfig.DefaultTimeout` | ||
| - Azure credentials injected as environment variables from the Secret | ||
| - Kubeconfig mounted at `/etc/kubeconfig/config` | ||
|
|
||
| All three resources are cleaned up after the investigation completes (or fails). | ||
|
|
||
| ## Development Setup | ||
|
|
||
| 1. Ensure prerequisites: VPN connected, `secrets/env` generated, `aks.kubeconfig` generated | ||
|
|
||
| 2. Export Holmes environment variables: | ||
| ```bash | ||
| source env && source secrets/env | ||
| export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig) | ||
| export ARO_INSTALL_VIA_HIVE=true | ||
| export ARO_ADOPT_BY_HIVE=true | ||
| export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest" | ||
| export HOLMES_AZURE_API_KEY="<your-azure-openai-key>" | ||
| export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>" | ||
| ``` | ||
|
|
||
| 3. Start the local RP: `make runlocal-rp` | ||
|
|
||
| 4. Run an investigation: | ||
| ```bash | ||
| ./hack/test-holmes-investigate.sh <cluster-name> "what is the cluster health status?" | ||
| ``` | ||
|
|
||
| ## Key Vault Provisioning (Staging/Production) | ||
|
|
||
| Create the following secrets in the service Key Vault (`{KEYVAULT_PREFIX}-svc`): | ||
|
|
||
| | Secret Name | Value | | ||
| |-------------|-------| | ||
| | `holmes-azure-api-key` | Azure OpenAI API key | | ||
| | `holmes-azure-api-base` | Azure OpenAI endpoint URL (e.g., `https://<resource>.openai.azure.com`) | | ||
|
|
||
| Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template. | ||
|
|
||
| ## Security | ||
|
|
||
| - **Cluster access**: Investigation pods use a `system:aro-diagnostics` identity with read-only RBAC (get/list/watch only). The kubeconfig certificate expires after 1 hour. | ||
| - **Pod security**: Runs as non-root (UID 1000), no privilege escalation, all capabilities dropped, service account token not mounted. | ||
| - **Toolset restrictions**: Destructive commands (`kubectl delete`, `kubectl apply`, `kubectl exec`, `rm`) are blocked in the Holmes toolset config. | ||
| - **Rate limiting**: Per-RP-instance atomic counter limits concurrent investigations (default 20). | ||
| - **Input validation**: Question limited to 1000 characters, control characters rejected, model name validated against a safe character pattern. | ||
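The rate-limiting and input-validation bullets can be sketched in Go as below. Several details are assumptions rather than facts from the PR: the `limiter` type and its method names, counting the 1000-character limit in runes rather than bytes, and the `modelPattern` regular expression, which is only one plausible "safe character pattern".

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"sync/atomic"
	"unicode"
	"unicode/utf8"
)

// limiter caps in-flight investigations for one RP process.
type limiter struct {
	active int64
	max    int64
}

// tryAcquire reserves a slot, or reports false when the cap is reached.
func (l *limiter) tryAcquire() bool {
	if atomic.AddInt64(&l.active, 1) > l.max {
		atomic.AddInt64(&l.active, -1) // undo the optimistic increment
		return false
	}
	return true
}

// release frees a slot taken by a successful tryAcquire.
func (l *limiter) release() { atomic.AddInt64(&l.active, -1) }

const maxQuestionLen = 1000

// modelPattern is an assumed allow-list; the real pattern may differ.
var modelPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._/-]*$`)

// validateQuestion enforces the length limit and rejects control characters.
func validateQuestion(q string) error {
	if utf8.RuneCountInString(q) > maxQuestionLen {
		return fmt.Errorf("question exceeds %d characters", maxQuestionLen)
	}
	for _, r := range q {
		if unicode.IsControl(r) {
			return errors.New("question contains control characters")
		}
	}
	return nil
}

// validateModel accepts only names that match the allow-list pattern.
func validateModel(m string) error {
	if !modelPattern.MatchString(m) {
		return fmt.Errorf("invalid model name %q", m)
	}
	return nil
}

func main() {
	l := &limiter{max: 20}
	if l.tryAcquire() {
		defer l.release()
	}
	fmt.Println(validateQuestion("why is pod X crashing?"))
	fmt.Println(validateQuestion("bad\x00input"))
	fmt.Println(validateModel("azure/gpt-5.2"))
	fmt.Println(validateModel("gpt; rm -rf /"))
}
```

Because the counter lives in process memory, the cap applies per RP instance, which matches the "Per-RP-instance" wording; a fleet-wide limit would need shared state.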
|
|
||
| ## Code Locations | ||
|
|
||
| | Component | File | | ||
| |-----------|------| | ||
| | Config struct and loaders | `pkg/util/holmes/config.go` | | ||
| | Config loading at startup | `pkg/frontend/frontend.go` (search `holmesConfig`) | | ||
| | Admin API handler | `pkg/frontend/admin_openshiftcluster_investigate.go` | | ||
| | Kubeconfig generation | `pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go` | | ||
| | Pod creation and streaming | `pkg/hive/investigate.go` | | ||
| | Kubeconfig transformation (dev) | `pkg/util/holmes/kubeconfig.go` | | ||
| | Holmes toolset config | `pkg/hive/staticresources/holmes-config.yaml` | | ||
| | RBAC ClusterRole | `pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml` | | ||
| | RBAC ClusterRoleBinding | `pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml` | | ||
| | E2E test script | `hack/test-holmes-investigate.sh` | | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| #!/bin/bash | ||
| # Test script for the Holmes investigation admin API endpoint. | ||
| # | ||
| # Prerequisites: | ||
| # 1. VPN connected to the dev environment | ||
| # 2. secrets/ folder generated: SECRET_SA_ACCOUNT_NAME=rharosecretsdev make secrets | ||
| # 3. AKS kubeconfig generated: make aks.kubeconfig | ||
| # 4. A test cluster created via: CLUSTER=<name> go run ./hack/cluster create | ||
| # 5. Local RP running with Hive enabled (see below) | ||
| # | ||
| # Usage: | ||
| # ./hack/test-holmes-investigate.sh <cluster-name> [question] | ||
| # | ||
| # Examples: | ||
| # ./hack/test-holmes-investigate.sh haowang-holmes-test | ||
| # ./hack/test-holmes-investigate.sh haowang-holmes-test "why is pod X crashing?" | ||
| # ./hack/test-holmes-investigate.sh haowang-holmes-test "check node memory usage" | ||
| # | ||
| # To start the local RP with Hive + Holmes enabled: | ||
| # | ||
| # source env && source secrets/env | ||
| # export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig) | ||
| # export ARO_INSTALL_VIA_HIVE=true | ||
| # export ARO_ADOPT_BY_HIVE=true | ||
| # export ARO_PODMAN_SOCKET="unix://$(podman machine inspect --format '{{.ConnectionInfo.PodmanSocket.Path}}')" | ||
| # export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest" | ||
| # export HOLMES_AZURE_API_KEY="<your-azure-openai-key>" | ||
| # export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>" | ||
| # export HOLMES_AZURE_API_VERSION="2025-04-01-preview" | ||
| # export HOLMES_MODEL="azure/gpt-5.2" | ||
| # make runlocal-rp | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| CLUSTER_NAME="${1:-}" | ||
| QUESTION="${2:-what is the cluster health status?}" | ||
|
|
||
| if [[ -z "$CLUSTER_NAME" ]]; then | ||
| echo "Usage: $0 <cluster-name> [question]" | ||
| echo "" | ||
| echo "Examples:" | ||
| echo " $0 haowang-holmes-test" | ||
| echo " $0 haowang-holmes-test 'why is pod X crashing?'" | ||
| exit 1 | ||
| fi | ||
|
|
||
| # Source env if not already loaded | ||
| if [[ -z "${AZURE_SUBSCRIPTION_ID:-}" ]]; then | ||
| if [[ -f env ]] && [[ -f secrets/env ]]; then | ||
| source env | ||
| source secrets/env | ||
| else | ||
| echo "Error: AZURE_SUBSCRIPTION_ID not set and env files not found." | ||
| echo "Run from the repo root, or source env && source secrets/env first." | ||
| exit 1 | ||
| fi | ||
| fi | ||
|
|
||
| RESOURCEGROUP="${RESOURCEGROUP:-v4-eastus}" | ||
| RP_URL="https://localhost:8443" | ||
| API_PATH="/admin/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCEGROUP}/providers/Microsoft.RedHatOpenShift/openShiftClusters/${CLUSTER_NAME}/investigate" | ||
|
|
||
| echo "============================================" | ||
| echo " Holmes Investigation Test" | ||
| echo "============================================" | ||
| echo " Cluster: ${CLUSTER_NAME}" | ||
| echo " RG: ${RESOURCEGROUP}" | ||
| echo " Question: ${QUESTION}" | ||
| echo " Endpoint: POST ${RP_URL}${API_PATH}" | ||
| echo "============================================" | ||
| echo "" | ||
|
|
||
| # Check RP is running | ||
| if ! curl -sk -o /dev/null -w '' "${RP_URL}/healthz" 2>/dev/null; then | ||
| echo "Error: Local RP is not running at ${RP_URL}" | ||
| echo "Start it with: make runlocal-rp (see header comments for full env setup)" | ||
| exit 1 | ||
| fi | ||
|
|
||
| echo "Sending investigation request..." | ||
| echo "Streaming results (this may take 1-5 minutes):" | ||
| echo "--------------------------------------------" | ||
|
|
||
| curl -sk --no-buffer -X POST \ | ||
| "${RP_URL}${API_PATH}" \ | ||
| -H "Content-Type: application/json" \ | ||
| -d "$(jq -n --arg q "${QUESTION}" '{question: $q}')" | ||
|
|
||
| echo "" | ||
| echo "--------------------------------------------" | ||
| echo "Investigation complete." |
This doc says non-secret Holmes config is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go`, but this PR currently contains no `HOLMES_*` references in the deploy generator or templates. Either add the deployment wiring for these env vars, or reword the doc to state how and where they are configured in staging/production today, so that defaults are not used unintentionally.