feat: add Holmes investigation admin API endpoint (ARO-25791) #4754
wanghaoran1988 wants to merge 7 commits into master
Skipping CI for Draft Pull Request.
Add POST /admin/.../openShiftClusters/{name}/investigate endpoint that
runs HolmesGPT diagnostic investigations on ARO clusters.
The endpoint:
- Generates a short-lived (1h) read-only kubeconfig for system:aro-diagnostics
on each request using the cluster CA from the persisted graph
- Creates an investigation pod on the Hive AKS cluster
- Mounts the ephemeral kubeconfig as a temporary secret
- Streams pod logs back to the client in real-time
- Cleans up pod, configmap, and secret after completion
The diagnostics identity uses a dedicated ClusterRole with read-only
(get/list/watch) permissions, following the principle of least privilege.
No long-lived credentials are stored in CosmosDB.
Relates: ARO-25791
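The create, stream, and clean-up lifecycle described in this commit message can be sketched roughly as below. This is a minimal illustration, not the code in `pkg/hive/investigate.go`: the `step` and `recorder` types and the `runInvestigation` function are hypothetical names, and the secret, configmap, pod creation order is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one hypothetical stage of the investigation lifecycle
// (for example: secret, configmap, pod).
type step struct {
	name    string
	create  func() error
	cleanup func()
}

// runInvestigation creates each resource in order and then calls stream.
// Every resource that was created is cleaned up in reverse order, whether
// creation of a later resource or the streaming itself fails.
func runInvestigation(steps []step, stream func() error) error {
	var created []step
	defer func() {
		for i := len(created) - 1; i >= 0; i-- {
			created[i].cleanup()
		}
	}()
	for _, s := range steps {
		if err := s.create(); err != nil {
			return fmt.Errorf("creating %s: %w", s.name, err)
		}
		created = append(created, s)
	}
	return stream()
}

// recorder builds steps that only log what they do, for demonstration.
type recorder struct{ events []string }

func (r *recorder) step(name string, fail bool) step {
	return step{
		name: name,
		create: func() error {
			if fail {
				return errors.New("simulated failure")
			}
			r.events = append(r.events, "create "+name)
			return nil
		},
		cleanup: func() { r.events = append(r.events, "delete "+name) },
	}
}

func main() {
	r := &recorder{}
	steps := []step{r.step("secret", false), r.step("configmap", false), r.step("pod", false)}
	err := runInvestigation(steps, func() error {
		r.events = append(r.events, "stream logs")
		return nil
	})
	fmt.Println(r.events, err)
}
```

Tracking only the resources that were actually created keeps the deferred clean-up from issuing deletes for objects that never existed.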
- Wrap all errors with context in kubeconfig generation
- Move activeInvestigations counter from global to frontend struct
- Use pointerutils.ToPtr() instead of &[]bool{true}[0]
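For readers unfamiliar with the last item, a generic pointer helper might look like the sketch below. The `ToPtr` defined here is a stand-in written for illustration; the actual signature of `pointerutils.ToPtr` is assumed, not confirmed by this PR.

```go
package main

import "fmt"

// ToPtr returns a pointer to a copy of v. Hypothetical stand-in for the
// pointerutils.ToPtr helper named in the commit message.
func ToPtr[T any](v T) *T {
	return &v
}

func main() {
	old := &[]bool{true}[0] // the slice-literal trick the commit replaces
	readOnly := ToPtr(true) // same result, clearer intent
	fmt.Println(*old, *readOnly)
}
```

Both forms yield a `*bool`, but the helper avoids allocating a throwaway slice and reads more clearly in pod spec literals.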
Pull request overview
This PR introduces an admin endpoint to run an AI-powered “Holmes” investigation against an OpenShift cluster by launching a transient pod on the Hive cluster, streaming the pod logs back to the caller, and cleaning up resources afterward.
Changes:
- Adds `POST .../investigate` admin handler with basic request validation and a concurrent-investigation rate limit.
- Adds Hive-side pod orchestration and log streaming, plus embedded Holmes toolset configuration.
- Adds support utilities for Holmes configuration and kubeconfig rewriting, along with new diagnostics RBAC resources.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| pkg/frontend/frontend.go | Adds an activeInvestigations counter and wires the new /investigate route. |
| pkg/frontend/admin_openshiftcluster_investigate.go | Implements the admin investigate handler (validation, rate limiting, streaming call into Hive). |
| pkg/frontend/admin_openshiftcluster_investigate_test.go | Adds handler unit tests (currently focused on error paths). |
| pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go | Generates a short-lived diagnostics kubeconfig from persisted graph data and optionally rewrites it. |
| pkg/hive/manager.go | Extends the Hive ClusterManager interface with InvestigateCluster. |
| pkg/hive/investigate.go | Creates Secret/ConfigMap/Pod, waits for start, streams logs, and performs cleanup. |
| pkg/hive/staticresources/holmes-config.yaml | Adds embedded Holmes toolset configuration (allow/deny lists). |
| pkg/util/holmes/config.go | Reads Holmes configuration from environment variables. |
| pkg/util/holmes/kubeconfig.go | Adds helper to rewrite kubeconfig server URL and adjust TLS settings. |
| pkg/util/holmes/kubeconfig_test.go | Unit tests for kubeconfig rewriting behavior. |
| pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml | Adds read-only ClusterRole for system:aro-diagnostics. |
| pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml | Binds the diagnostics ClusterRole to the diagnostics user. |
| pkg/operator/controllers/rbac/bindata.go | Updates generated bindata to include the new RBAC static resources. |
| pkg/cluster/kubeconfig.go | Exports GenerateKubeconfig for reuse (used by diagnostics kubeconfig generation). |
| pkg/cluster/kubeconfig_test.go | Adds a test for generating a short-lived diagnostics kubeconfig. |
| pkg/util/mocks/hive/hive.go | Updates mock Hive manager to include the new InvestigateCluster method. |
| hack/test-holmes-investigate.sh | Adds a manual script for exercising the new admin endpoint locally. |
gofumpt requires 0o755 octal prefix instead of 0755.
- Fix Content-Type handling: only set text/plain before streaming, let adminReply set JSON content-type on error paths
- Only call adminReply on error to avoid corrupting streamed response
- Add automountServiceAccountToken: false to investigation pod
- Add 30s timeout to cleanup context to prevent hanging deletes
- Add GoDoc comment on exported GenerateKubeconfig function
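The bounded cleanup context mentioned above could be built along these lines. This is a sketch under assumptions: `newCleanupContext` is a hypothetical name, and rooting the context in `context.Background()` rather than the request context is this sketch's choice, not something the commit message states.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cleanupTimeout mirrors the 30s bound named in the commit message.
const cleanupTimeout = 30 * time.Second

// newCleanupContext returns a context for delete calls. It is rooted in
// context.Background() (an assumption of this sketch) so that a cancelled
// request does not cancel cleanup, and it carries a deadline so that a
// hanging delete cannot block forever.
func newCleanupContext() (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.Background(), cleanupTimeout)
}

func main() {
	reqCtx, cancelReq := context.WithCancel(context.Background())
	cancelReq() // simulate the client disconnecting mid-stream

	ctx, cancel := newCleanupContext()
	defer cancel()

	fmt.Println("request context error:", reqCtx.Err())
	fmt.Println("cleanup context error:", ctx.Err())
}
```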
- Validate Holmes config (API key, base, image) before creating K8s resources
- Only set InsecureSkipTLSVerify when api-int→api rewrite actually occurred
- Set test env vars for Holmes config validation in unit tests
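The api-int→api rewrite and its "did it actually happen" signal might be sketched as follows. `rewriteServer` is a hypothetical function, not the helper in `pkg/util/holmes/kubeconfig.go`, and the sketch assumes the internal hostname is recognisable by an `api-int.` prefix; the example hostnames are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteServer converts an internal API server URL (host starting with
// "api-int.") to its external form ("api."). The boolean reports whether
// a rewrite took place, so the caller can adjust TLS settings only then.
func rewriteServer(server string) (string, bool, error) {
	u, err := url.Parse(server)
	if err != nil {
		return "", false, fmt.Errorf("parsing server URL %q: %w", server, err)
	}
	if !strings.HasPrefix(u.Host, "api-int.") {
		return server, false, nil
	}
	u.Host = "api." + strings.TrimPrefix(u.Host, "api-int.")
	return u.String(), true, nil
}

func main() {
	for _, s := range []string{
		"https://api-int.example.com:6443", // placeholder hostnames
		"https://api.example.com:6443",
	} {
		out, rewritten, err := rewriteServer(s)
		fmt.Println(out, rewritten, err)
	}
}
```

Returning the boolean alongside the URL keeps the TLS decision at the call site, which matches the intent of setting InsecureSkipTLSVerify only when the server was really changed.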
- Mount only kubeconfig key from secret volume (not Azure creds)
- Use CAS loop for rate limiter to avoid inflating counter on rejection
- Fix misleading comment about buffered reader in streamPodLogs
- Escape question parameter in test script using jq
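A compare-and-swap admission loop of the kind the second item describes could look like this. The `limiter` type and its method names are hypothetical; the PR's actual counter lives on the frontend struct and may be shaped differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// limiter caps the number of concurrent investigations on one process.
type limiter struct {
	active atomic.Int64
	max    int64
}

// tryAcquire reserves a slot only if one is free. Unlike an
// "increment, check, decrement on rejection" scheme, the counter is never
// raised for a request that ends up rejected, so concurrent rejections
// cannot temporarily inflate it.
func (l *limiter) tryAcquire() bool {
	for {
		cur := l.active.Load()
		if cur >= l.max {
			return false
		}
		if l.active.CompareAndSwap(cur, cur+1) {
			return true
		}
		// Another goroutine changed the counter; reload and retry.
	}
}

// release frees a slot previously obtained with tryAcquire.
func (l *limiter) release() { l.active.Add(-1) }

func main() {
	l := &limiter{max: 2}
	for i := 1; i <= 3; i++ {
		fmt.Printf("request %d admitted: %v\n", i, l.tryAcquire())
	}
	l.release()
	fmt.Println("after release:", l.tryAcquire())
}
```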
/hold
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
    // Reject control characters that could affect CLI argument parsing.
    for _, ch := range req.Question {
        if ch < 0x20 && ch != ' ' {
The control-character validation only rejects runes < 0x20 (except space). ASCII DEL (0x7f) is also a control character and is currently allowed, which can still cause terminal/log issues. Consider rejecting 0x7f as well (e.g., treat it as a control character in the same check).
Suggested change:
- Before: `if ch < 0x20 && ch != ' ' {`
- After: `if (ch < 0x20 && ch != ' ') || ch == 0x7f {`
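One alternative way to express this check is with the standard library's `unicode.IsControl`, sketched below. This is a swapped-in technique, not the reviewer's suggestion: it is broader, since it also rejects the C1 range (0x80 to 0x9f) in addition to the C0 range and DEL. `hasControlChar` is a hypothetical helper name. Note also that the space character is 0x20, so the `ch != ' '` clause in the quoted condition is already implied by `ch < 0x20`.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasControlChar reports whether s contains any Unicode control character.
// unicode.IsControl covers U+0000..U+001F, U+007F (DEL), and U+0080..U+009F,
// which is a superset of what the suggested condition rejects.
func hasControlChar(s string) bool {
	for _, ch := range s {
		if unicode.IsControl(ch) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasControlChar("why is etcd slow?"))
	fmt.Println(hasControlChar("bad\x7finput"))
}
```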
| `holmes-azure-api-key` | Azure OpenAI API key |
| `holmes-azure-api-base` | Azure OpenAI endpoint URL (e.g., `https://<resource>.openai.azure.com`) |

Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template.
This doc says non-secret Holmes config is set via ARM deployment parameters in pkg/deploy/generator/resources_rp.go, but there are currently no HOLMES_* references in the deploy generator/templates in this PR. Either add the deployment wiring for these env vars or adjust the wording to clarify how/where they are configured in staging/production today to avoid defaults being used unintentionally.
Suggested change:
- Before: Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template.
- After: In this PR, only the secret Holmes settings are documented as coming from Key Vault. Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, `HOLMES_AZURE_API_VERSION`, `HOLMES_DEFAULT_TIMEOUT`, `HOLMES_MAX_CONCURRENT`) is **not yet wired through the deploy generator/templates via ARM deployment parameters**, so staging/production will use the application defaults unless explicit deployment wiring is added separately.
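One way to make unintended defaults visible is for the config loader to report which variables fell back, as in the sketch below. Everything here is hypothetical: `holmesConfig`, `loadConfig`, and both placeholder default values are invented for illustration and do not reflect `pkg/util/holmes/config.go`; only the environment variable names come from the comment above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Placeholder defaults, invented for this sketch only.
const (
	placeholderImage         = "registry.invalid/holmes:placeholder"
	placeholderMaxConcurrent = 5
)

// lookupFunc matches the signature of os.LookupEnv so tests can swap it.
type lookupFunc func(key string) (string, bool)

type holmesConfig struct {
	Image         string
	MaxConcurrent int
}

// loadConfig reads two of the HOLMES_* variables and returns the names of
// those that were unset, so the caller can log that defaults are in use.
func loadConfig(lookup lookupFunc) (holmesConfig, []string, error) {
	cfg := holmesConfig{Image: placeholderImage, MaxConcurrent: placeholderMaxConcurrent}
	var defaulted []string

	if v, ok := lookup("HOLMES_IMAGE"); ok && v != "" {
		cfg.Image = v
	} else {
		defaulted = append(defaulted, "HOLMES_IMAGE")
	}

	if v, ok := lookup("HOLMES_MAX_CONCURRENT"); ok && v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			return holmesConfig{}, nil, fmt.Errorf("HOLMES_MAX_CONCURRENT must be a positive integer, got %q", v)
		}
		cfg.MaxConcurrent = n
	} else {
		defaulted = append(defaulted, "HOLMES_MAX_CONCURRENT")
	}
	return cfg, defaulted, nil
}

func main() {
	cfg, defaulted, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("config: %+v, using defaults for: %v\n", cfg, defaulted)
}
```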
- Add trackingResponseWriter to avoid adminReply after streaming starts
- Document per-RP-instance rate limiter behavior
- Set ImagePullPolicy: PullAlways explicitly on investigation pod
- Replace manual polling loop with wait.PollImmediateUntil
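A response writer that remembers whether anything has been sent might look roughly like this. The type below reuses the name from the commit message, but its fields and the `respond` and `record` helpers are assumptions made for this sketch; the JSON error body stands in for whatever `adminReply` actually writes.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// trackingResponseWriter records whether headers or body bytes were sent.
type trackingResponseWriter struct {
	http.ResponseWriter
	written bool
}

func (t *trackingResponseWriter) WriteHeader(code int) {
	t.written = true
	t.ResponseWriter.WriteHeader(code)
}

func (t *trackingResponseWriter) Write(p []byte) (int, error) {
	t.written = true
	return t.ResponseWriter.Write(p)
}

// Flush forwards to the wrapped writer when it supports flushing; embedding
// the http.ResponseWriter interface alone would not expose Flush.
func (t *trackingResponseWriter) Flush() {
	if f, ok := t.ResponseWriter.(http.Flusher); ok {
		t.written = true
		f.Flush()
	}
}

// respond runs investigate and writes a JSON error only if nothing has been
// sent yet. Once streaming has started, the error is returned but not written,
// so the plain-text stream is not corrupted.
func respond(w http.ResponseWriter, investigate func(http.ResponseWriter) error) error {
	tw := &trackingResponseWriter{ResponseWriter: w}
	err := investigate(tw)
	if err != nil && !tw.written {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusInternalServerError)
		_ = json.NewEncoder(w).Encode(map[string]string{"error": err.Error()})
	}
	return err
}

// record exercises respond against an in-memory recorder. When streamFirst
// is true the fake investigation emits output before failing.
func record(streamFirst bool) (int, string) {
	rec := httptest.NewRecorder()
	_ = respond(rec, func(w http.ResponseWriter) error {
		if streamFirst {
			w.Header().Set("Content-Type", "text/plain")
			fmt.Fprint(w, "partial output\n")
		}
		return errors.New("pod failed")
	})
	return rec.Code, rec.Body.String()
}

func main() {
	fmt.Println(record(false))
	fmt.Println(record(true))
}
```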
Summary
- `POST /admin/.../investigate` to trigger AI-powered cluster investigations using HolmesGPT
- … (`system:aro-diagnostics`) with readonly permissions for the investigation identity

Investigation Flow
- `POST /investigate` with `{"question": "..."}`
- `holmes_cli.py ask <question>` against the customer cluster

Files
- `pkg/frontend/admin_openshiftcluster_investigate.go`: HTTP handler with rate limiting and request validation
- `pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go`: Kubeconfig retrieval and conversion
- `pkg/hive/investigate.go`: Pod orchestration, log streaming, cleanup
- `pkg/hive/staticresources/holmes-config.yaml`: HolmesGPT toolset configuration (bash whitelist/denylist)
- `pkg/util/holmes/config.go`: Configuration from environment variables
- `pkg/util/holmes/kubeconfig.go`: Convert internal kubeconfig (`api-int.`) to external (`api.`)
- `pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml`: Readonly RBAC for `system:aro-diagnostics`
- `hack/test-holmes-investigate.sh`: Manual e2e test script

Security
- … (`system:aro-diagnostics`, get/list/watch only)

Test plan
- … (`holmes-e2e`) with local RP

Jira
ARO-25791