feat: add Holmes investigation admin API endpoint (ARO-25791) by wanghaoran1988 · Pull Request #4754 · Azure/ARO-RP

wanghaoran1988 · 2026-04-07T02:42:17Z

Summary

Add admin API endpoint POST /admin/.../investigate to trigger AI-powered cluster investigations using HolmesGPT
Investigation runs as a transient pod on the Hive AKS cluster with readonly kubeconfig mounted
Results are streamed back to the caller in real-time via text/plain response
Includes RBAC (ClusterRole system:aro-diagnostics) with readonly permissions for the investigation identity

Investigation Flow

Admin API receives POST /investigate with {"question": "..."}
RP retrieves cluster's kubeconfig from CosmosDB and converts to external URL
Creates a Secret (kubeconfig + Azure OpenAI creds), ConfigMap (Holmes toolset config), and Pod on Hive cluster
Pod runs holmes_cli.py ask <question> against the customer cluster
Pod logs are streamed back to the HTTP response in real-time
Pod, Secret, and ConfigMap are cleaned up after completion

Files

pkg/frontend/admin_openshiftcluster_investigate.go — HTTP handler with rate limiting and request validation
pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go — Kubeconfig retrieval and conversion
pkg/hive/investigate.go — Pod orchestration, log streaming, cleanup
pkg/hive/staticresources/holmes-config.yaml — HolmesGPT toolset configuration (bash whitelist/denylist)
pkg/util/holmes/config.go — Configuration from environment variables
pkg/util/holmes/kubeconfig.go — Convert internal kubeconfig (api-int.) to external (api.)
pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml — Readonly RBAC for system:aro-diagnostics
hack/test-holmes-investigate.sh — Manual e2e test script

Security

Investigation pod runs as non-root (UID 1000), no privilege escalation, all capabilities dropped
Kubeconfig is readonly (system:aro-diagnostics — get/list/watch only)
Bash commands restricted: kubectl read operations allowed, write operations (delete/apply/create/exec) denied
Secrets cleaned up immediately after pod completion
Rate limiting via atomic counter (configurable max concurrent investigations)

Test plan

E2E tested against real ARO cluster (holmes-e2e) with local RP
All 5/6 toolsets loaded successfully
Investigation returned meaningful cluster health analysis
Tool calls completed in <1.1s each
Pod cleanup verified
Unit tests for handler and kubeconfig conversion
CI pipeline validation

Jira

ARO-25791

openshift-ci · 2026-04-07T02:42:20Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Add POST /admin/.../openShiftClusters/{name}/investigate endpoint that runs HolmesGPT diagnostic investigations on ARO clusters. The endpoint: - Generates a short-lived (1h) read-only kubeconfig for system:aro-diagnostics on each request using the cluster CA from the persisted graph - Creates an investigation pod on the Hive AKS cluster - Mounts the ephemeral kubeconfig as a temporary secret - Streams pod logs back to the client in real-time - Cleans up pod, configmap, and secret after completion The diagnostics identity uses a dedicated ClusterRole with read-only (get/list/watch) permissions, following the principle of least privilege. No long-lived credentials are stored in CosmosDB. Relates: ARO-25791

- Wrap all errors with context in kubeconfig generation - Move activeInvestigations counter from global to frontend struct - Use pointerutils.ToPtr() instead of &[]bool{true}[0]

Copilot

Pull request overview

This PR introduces an admin endpoint to run an AI-powered “Holmes” investigation against an OpenShift cluster by launching a transient pod on the Hive cluster, streaming the pod logs back to the caller, and cleaning up resources afterward.

Changes:

Adds POST .../investigate admin handler with basic request validation and a concurrent-investigation rate limit.
Adds Hive-side pod orchestration and log streaming, plus embedded Holmes toolset configuration.
Adds support utilities for Holmes configuration and kubeconfig rewriting, along with new diagnostics RBAC resources.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
pkg/frontend/frontend.go	Adds an `activeInvestigations` counter and wires the new `/investigate` route.
pkg/frontend/admin_openshiftcluster_investigate.go	Implements the admin investigate handler (validation, rate limiting, streaming call into Hive).
pkg/frontend/admin_openshiftcluster_investigate_test.go	Adds handler unit tests (currently focused on error paths).
pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go	Generates a short-lived diagnostics kubeconfig from persisted graph data and optionally rewrites it.
pkg/hive/manager.go	Extends the Hive `ClusterManager` interface with `InvestigateCluster`.
pkg/hive/investigate.go	Creates Secret/ConfigMap/Pod, waits for start, streams logs, and performs cleanup.
pkg/hive/staticresources/holmes-config.yaml	Adds embedded Holmes toolset configuration (allow/deny lists).
pkg/util/holmes/config.go	Reads Holmes configuration from environment variables.
pkg/util/holmes/kubeconfig.go	Adds helper to rewrite kubeconfig server URL and adjust TLS settings.
pkg/util/holmes/kubeconfig_test.go	Unit tests for kubeconfig rewriting behavior.
pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml	Adds read-only ClusterRole for `system:aro-diagnostics`.
pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml	Binds the diagnostics ClusterRole to the diagnostics user.
pkg/operator/controllers/rbac/bindata.go	Updates generated bindata to include the new RBAC static resources.
pkg/cluster/kubeconfig.go	Exports `GenerateKubeconfig` for reuse (used by diagnostics kubeconfig generation).
pkg/cluster/kubeconfig_test.go	Adds a test for generating a short-lived diagnostics kubeconfig.
pkg/util/mocks/hive/hive.go	Updates mock Hive manager to include the new InvestigateCluster method.
hack/test-holmes-investigate.sh	Adds a manual script for exercising the new admin endpoint locally.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_investigate.go

pkg/frontend/admin_openshiftcluster_investigate_test.go

pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go

pkg/hive/investigate.go

pkg/util/holmes/kubeconfig.go

pkg/util/holmes/config.go

pkg/cluster/kubeconfig.go

gofumpt requires 0o755 octal prefix instead of 0755.

- Fix Content-Type handling: only set text/plain before streaming, let adminReply set JSON content-type on error paths - Only call adminReply on error to avoid corrupting streamed response - Add automountServiceAccountToken: false to investigation pod - Add 30s timeout to cleanup context to prevent hanging deletes - Add GoDoc comment on exported GenerateKubeconfig function

- Validate Holmes config (API key, base, image) before creating K8s resources - Only set InsecureSkipTLSVerify when api-int→api rewrite actually occurred - Set test env vars for Holmes config validation in unit tests

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/hive/investigate.go

pkg/frontend/admin_openshiftcluster_investigate.go

pkg/frontend/admin_openshiftcluster_investigate_test.go

hack/test-holmes-investigate.sh

pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml

- Mount only kubeconfig key from secret volume (not Azure creds) - Use CAS loop for rate limiter to avoid inflating counter on rejection - Fix misleading comment about buffered reader in streamPodLogs - Escape question parameter in test script using jq

wanghaoran1988 · 2026-04-07T05:38:56Z

/hold
We need update the holmesgpt image after the image is built by our pipeline and pushed to acr.
We also need to update the RP configs after the new model is deployed.

pkg/frontend/admin_openshiftcluster_investigate.go

pkg/hive/investigate.go

pkg/frontend/admin_openshiftcluster_investigate.go

pkg/hive/investigate.go

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_investigate.go

pkg/util/holmes/config.go

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_investigate.go

pkg/util/holmes/config.go

Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/util/holmes/config.go

pkg/hive/investigate.go

Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-09T02:12:54Z

pkg/frontend/admin_openshiftcluster_investigate.go

+
+	// Reject control characters that could affect CLI argument parsing.
+	for _, ch := range req.Question {
+		if ch < 0x20 && ch != ' ' {


The control-character validation only rejects runes < 0x20 (except space). ASCII DEL (0x7f) is also a control character and is currently allowed, which can still cause terminal/log issues. Consider rejecting 0x7f as well (e.g., treat it as a control character in the same check).

Suggested change

if ch < 0x20 && ch != ' ' {

if (ch < 0x20 && ch != ' ') || ch == 0x7f {

Copilot · 2026-04-09T02:12:55Z

docs/holmes-investigation.md

+| `holmes-azure-api-key` | Azure OpenAI API key |
+| `holmes-azure-api-base` | Azure OpenAI endpoint URL (e.g., `https://<resource>.openai.azure.com`) |
+
+Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template.


This doc says non-secret Holmes config is set via ARM deployment parameters in pkg/deploy/generator/resources_rp.go, but there are currently no HOLMES_* references in the deploy generator/templates in this PR. Either add the deployment wiring for these env vars or adjust the wording to clarify how/where they are configured in staging/production today to avoid defaults being used unintentionally.

Suggested change

Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template.

In this PR, only the secret Holmes settings are documented as coming from Key Vault. Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, `HOLMES_AZURE_API_VERSION`, `HOLMES_DEFAULT_TIMEOUT`, `HOLMES_MAX_CONCURRENT`) is **not yet wired through the deploy generator/templates via ARM deployment parameters**, so staging/production will use the application defaults unless explicit deployment wiring is added separately.

- Add trackingResponseWriter to avoid adminReply after streaming starts - Document per-RP-instance rate limiter behavior - Set ImagePullPolicy: PullAlways explicitly on investigation pod - Replace manual polling loop with wait.PollImmediateUntil

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 6add5b5 to e8ef255 Compare April 7, 2026 03:02

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from e8ef255 to 0834253 Compare April 7, 2026 03:25

fix: address code review findings for Holmes investigate API

a2b783a

- Wrap all errors with context in kubeconfig generation - Move activeInvestigations counter from global to frontend struct - Use pointerutils.ToPtr() instead of &[]bool{true}[0]

wanghaoran1988 marked this pull request as ready for review April 7, 2026 03:51

Copilot AI review requested due to automatic review settings April 7, 2026 03:51

wanghaoran1988 requested review from alcasim, bennerv, cadenmarchese, hawkowl, hlipsig, jharrington22, kevinobriendotca, kimorris27, mociarain, mrWinston, rogbas, sankur-codes, tiguelu, tsatam, tuxerrante, ventifus and yjst2012 as code owners April 7, 2026 03:51

Copilot started reviewing on behalf of wanghaoran1988 April 7, 2026 03:52 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

wanghaoran1988 added 2 commits April 7, 2026 14:32

fix: regenerate bindata with correct octal literal formatting

717b3f0

gofumpt requires 0o755 octal prefix instead of 0755.

Copilot AI review requested due to automatic review settings April 7, 2026 04:35

Copilot started reviewing on behalf of wanghaoran1988 April 7, 2026 04:36 View session

fix: address remaining Copilot PR review feedback

f3a7829

- Validate Holmes config (API key, base, image) before creating K8s resources - Only set InsecureSkipTLSVerify when api-int→api rewrite actually occurred - Set test env vars for Holmes config validation in unit tests

Copilot AI reviewed Apr 7, 2026

View reviewed changes

tuxerrante reviewed Apr 7, 2026

View reviewed changes

pkg/frontend/admin_openshiftcluster_investigate.go Outdated Show resolved Hide resolved

tuxerrante reviewed Apr 7, 2026

View reviewed changes

pkg/hive/investigate.go Show resolved Hide resolved

tuxerrante reviewed Apr 7, 2026

View reviewed changes

pkg/hive/investigate.go Show resolved Hide resolved

tuxerrante reviewed Apr 7, 2026

View reviewed changes

pkg/frontend/admin_openshiftcluster_investigate.go Outdated Show resolved Hide resolved

tuxerrante reviewed Apr 7, 2026

View reviewed changes

pkg/hive/investigate.go Outdated Show resolved Hide resolved

tuxerrante reviewed Apr 7, 2026

View reviewed changes

pkg/hive/investigate.go Outdated Show resolved Hide resolved

Copilot AI review requested due to automatic review settings April 7, 2026 23:38

Copilot started reviewing on behalf of wanghaoran1988 April 7, 2026 23:39 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

pkg/frontend/admin_openshiftcluster_investigate.go Show resolved Hide resolved

pkg/util/holmes/config.go Show resolved Hide resolved

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from c3050d2 to 7893e9b Compare April 9, 2026 00:02

Copilot AI review requested due to automatic review settings April 9, 2026 00:38

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 7893e9b to 2b23ef2 Compare April 9, 2026 00:38

Copilot started reviewing on behalf of wanghaoran1988 April 9, 2026 00:39 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 2b23ef2 to f1915b4 Compare April 9, 2026 01:07

Copilot AI review requested due to automatic review settings April 9, 2026 01:39

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from f1915b4 to 70ca660 Compare April 9, 2026 01:39

Copilot started reviewing on behalf of wanghaoran1988 April 9, 2026 01:40 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

pkg/util/holmes/config.go Outdated Show resolved Hide resolved

pkg/hive/investigate.go Show resolved Hide resolved

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from 70ca660 to a187c79 Compare April 9, 2026 01:55

Copilot AI review requested due to automatic review settings April 9, 2026 02:09

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from a187c79 to fef41fe Compare April 9, 2026 02:09

Copilot started reviewing on behalf of wanghaoran1988 April 9, 2026 02:09 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

wanghaoran1988 force-pushed the haowang/ARO-25791/holmes-admin-api branch from fef41fe to b67adfa Compare April 9, 2026 05:07

	if ch < 0x20 && ch != ' ' {
	if (ch < 0x20 && ch != ' ') \|\| ch == 0x7f {

Conversation

wanghaoran1988 commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Investigation Flow

Files

Security

Test plan

Jira

Uh oh!

openshift-ci bot commented Apr 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wanghaoran1988 commented Apr 7, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

wanghaoran1988 commented Apr 7, 2026 •

edited

Loading