Skip to content

feat: add Holmes investigation admin API endpoint (ARO-25791)#4754

Open
wanghaoran1988 wants to merge 7 commits intomasterfrom
haowang/ARO-25791/holmes-admin-api
Open

feat: add Holmes investigation admin API endpoint (ARO-25791)#4754
wanghaoran1988 wants to merge 7 commits intomasterfrom
haowang/ARO-25791/holmes-admin-api

Conversation

@wanghaoran1988
Copy link
Copy Markdown
Collaborator

@wanghaoran1988 wanghaoran1988 commented Apr 7, 2026

Summary

  • Add admin API endpoint POST /admin/.../investigate to trigger AI-powered cluster investigations using HolmesGPT
  • Investigation runs as a transient pod on the Hive AKS cluster with readonly kubeconfig mounted
  • Results are streamed back to the caller in real-time via text/plain response
  • Includes RBAC (ClusterRole system:aro-diagnostics) with readonly permissions for the investigation identity

Investigation Flow

  1. Admin API receives POST /investigate with {"question": "..."}
  2. RP retrieves cluster's kubeconfig from CosmosDB and converts to external URL
  3. Creates a Secret (kubeconfig + Azure OpenAI creds), ConfigMap (Holmes toolset config), and Pod on Hive cluster
  4. Pod runs holmes_cli.py ask <question> against the customer cluster
  5. Pod logs are streamed back to the HTTP response in real-time
  6. Pod, Secret, and ConfigMap are cleaned up after completion

Files

  • pkg/frontend/admin_openshiftcluster_investigate.go — HTTP handler with rate limiting and request validation
  • pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go — Kubeconfig retrieval and conversion
  • pkg/hive/investigate.go — Pod orchestration, log streaming, cleanup
  • pkg/hive/staticresources/holmes-config.yaml — HolmesGPT toolset configuration (bash whitelist/denylist)
  • pkg/util/holmes/config.go — Configuration from environment variables
  • pkg/util/holmes/kubeconfig.go — Convert internal kubeconfig (api-int.) to external (api.)
  • pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml — Readonly RBAC for system:aro-diagnostics
  • hack/test-holmes-investigate.sh — Manual e2e test script

Security

  • Investigation pod runs as non-root (UID 1000), no privilege escalation, all capabilities dropped
  • Kubeconfig is readonly (system:aro-diagnostics — get/list/watch only)
  • Bash commands restricted: kubectl read operations allowed, write operations (delete/apply/create/exec) denied
  • Secrets cleaned up immediately after pod completion
  • Rate limiting via atomic counter (configurable max concurrent investigations)

Test plan

  • E2E tested against real ARO cluster (holmes-e2e) with local RP
  • All 5/6 toolsets loaded successfully
  • Investigation returned meaningful cluster health analysis
  • Tool calls completed in <1.1s each
  • Pod cleanup verified
  • Unit tests for handler and kubeconfig conversion
  • CI pipeline validation

Jira

ARO-25791

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 6add5b5 to e8ef255 Compare April 7, 2026 03:02
Add POST /admin/.../openShiftClusters/{name}/investigate endpoint that
runs HolmesGPT diagnostic investigations on ARO clusters.

The endpoint:
- Generates a short-lived (1h) read-only kubeconfig for system:aro-diagnostics
  on each request using the cluster CA from the persisted graph
- Creates an investigation pod on the Hive AKS cluster
- Mounts the ephemeral kubeconfig as a temporary secret
- Streams pod logs back to the client in real-time
- Cleans up pod, configmap, and secret after completion

The diagnostics identity uses a dedicated ClusterRole with read-only
(get/list/watch) permissions, following the principle of least privilege.
No long-lived credentials are stored in CosmosDB.

Relates: ARO-25791
@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from e8ef255 to 0834253 Compare April 7, 2026 03:25
- Wrap all errors with context in kubeconfig generation
- Move activeInvestigations counter from global to frontend struct
- Use pointerutils.ToPtr() instead of &[]bool{true}[0]
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an admin endpoint to run an AI-powered “Holmes” investigation against an OpenShift cluster by launching a transient pod on the Hive cluster, streaming the pod logs back to the caller, and cleaning up resources afterward.

Changes:

  • Adds POST .../investigate admin handler with basic request validation and a concurrent-investigation rate limit.
  • Adds Hive-side pod orchestration and log streaming, plus embedded Holmes toolset configuration.
  • Adds support utilities for Holmes configuration and kubeconfig rewriting, along with new diagnostics RBAC resources.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
pkg/frontend/frontend.go Adds an activeInvestigations counter and wires the new /investigate route.
pkg/frontend/admin_openshiftcluster_investigate.go Implements the admin investigate handler (validation, rate limiting, streaming call into Hive).
pkg/frontend/admin_openshiftcluster_investigate_test.go Adds handler unit tests (currently focused on error paths).
pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go Generates a short-lived diagnostics kubeconfig from persisted graph data and optionally rewrites it.
pkg/hive/manager.go Extends the Hive ClusterManager interface with InvestigateCluster.
pkg/hive/investigate.go Creates Secret/ConfigMap/Pod, waits for start, streams logs, and performs cleanup.
pkg/hive/staticresources/holmes-config.yaml Adds embedded Holmes toolset configuration (allow/deny lists).
pkg/util/holmes/config.go Reads Holmes configuration from environment variables.
pkg/util/holmes/kubeconfig.go Adds helper to rewrite kubeconfig server URL and adjust TLS settings.
pkg/util/holmes/kubeconfig_test.go Unit tests for kubeconfig rewriting behavior.
pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml Adds read-only ClusterRole for system:aro-diagnostics.
pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml Binds the diagnostics ClusterRole to the diagnostics user.
pkg/operator/controllers/rbac/bindata.go Updates generated bindata to include the new RBAC static resources.
pkg/cluster/kubeconfig.go Exports GenerateKubeconfig for reuse (used by diagnostics kubeconfig generation).
pkg/cluster/kubeconfig_test.go Adds a test for generating a short-lived diagnostics kubeconfig.
pkg/util/mocks/hive/hive.go Updates mock Hive manager to include the new InvestigateCluster method.
hack/test-holmes-investigate.sh Adds a manual script for exercising the new admin endpoint locally.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gofumpt requires 0o755 octal prefix instead of 0755.
- Fix Content-Type handling: only set text/plain before streaming,
  let adminReply set JSON content-type on error paths
- Only call adminReply on error to avoid corrupting streamed response
- Add automountServiceAccountToken: false to investigation pod
- Add 30s timeout to cleanup context to prevent hanging deletes
- Add GoDoc comment on exported GenerateKubeconfig function
Copilot AI review requested due to automatic review settings April 7, 2026 04:35
- Validate Holmes config (API key, base, image) before creating K8s resources
- Only set InsecureSkipTLSVerify when api-int→api rewrite actually occurred
- Set test env vars for Holmes config validation in unit tests
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Mount only kubeconfig key from secret volume (not Azure creds)
- Use CAS loop for rate limiter to avoid inflating counter on rejection
- Fix misleading comment about buffered reader in streamPodLogs
- Escape question parameter in test script using jq
@wanghaoran1988
Copy link
Copy Markdown
Collaborator Author

/hold
We need update the holmesgpt image after the image is built by our pipeline and pushed to acr.
We also need to update the RP configs after the new model is deployed.

Copilot AI review requested due to automatic review settings April 7, 2026 23:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from c3050d2 to 7893e9b Compare April 9, 2026 00:02
Copilot AI review requested due to automatic review settings April 9, 2026 00:38
@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 7893e9b to 2b23ef2 Compare April 9, 2026 00:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 2b23ef2 to f1915b4 Compare April 9, 2026 01:07
Copilot AI review requested due to automatic review settings April 9, 2026 01:39
@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from f1915b4 to 70ca660 Compare April 9, 2026 01:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 70ca660 to a187c79 Compare April 9, 2026 01:55
Copilot AI review requested due to automatic review settings April 9, 2026 02:09
@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from a187c79 to fef41fe Compare April 9, 2026 02:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


// Reject control characters that could affect CLI argument parsing.
for _, ch := range req.Question {
if ch < 0x20 && ch != ' ' {
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The control-character validation only rejects runes < 0x20 (except space). ASCII DEL (0x7f) is also a control character and is currently allowed, which can still cause terminal/log issues. Consider rejecting 0x7f as well (e.g., treat it as a control character in the same check).

Suggested change
if ch < 0x20 && ch != ' ' {
if (ch < 0x20 && ch != ' ') || ch == 0x7f {

Copilot uses AI. Check for mistakes.
| `holmes-azure-api-key` | Azure OpenAI API key |
| `holmes-azure-api-base` | Azure OpenAI endpoint URL (e.g., `https://<resource>.openai.azure.com`) |

Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template.
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc says non-secret Holmes config is set via ARM deployment parameters in pkg/deploy/generator/resources_rp.go, but there are currently no HOLMES_* references in the deploy generator/templates in this PR. Either add the deployment wiring for these env vars or adjust the wording to clarify how/where they are configured in staging/production today to avoid defaults being used unintentionally.

Suggested change
Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template.
In this PR, only the secret Holmes settings are documented as coming from Key Vault. Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, `HOLMES_AZURE_API_VERSION`, `HOLMES_DEFAULT_TIMEOUT`, `HOLMES_MAX_CONCURRENT`) is **not yet wired through the deploy generator/templates via ARM deployment parameters**, so staging/production will use the application defaults unless explicit deployment wiring is added separately.

Copilot uses AI. Check for mistakes.
- Add trackingResponseWriter to avoid adminReply after streaming starts
- Document per-RP-instance rate limiter behavior
- Set ImagePullPolicy: PullAlways explicitly on investigation pod
- Replace manual polling loop with wait.PollImmediateUntil
@wanghaoran1988 wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from fef41fe to b67adfa Compare April 9, 2026 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants