feat: add Holmes investigation admin API endpoint (ARO-25791) #4754
Open: wanghaoran1988 wants to merge 7 commits into master from haowang/ARO-25791/holmes-admin-api
Changes from 6 commits
Commits (all by wanghaoran1988):
0834253 feat: integrate Holmes investigation API into ARO-RP admin API
a2b783a fix: address code review findings for Holmes investigate API
717b3f0 fix: regenerate bindata with correct octal literal formatting
3184970 fix: address Copilot PR review feedback
f3a7829 fix: address remaining Copilot PR review feedback
13cc9c2 fix: address second round of Copilot PR review feedback
b67adfa fix: address tuxerrante's PR review feedback
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| #!/bin/bash | ||
| # Test script for the Holmes investigation admin API endpoint. | ||
| # | ||
| # Prerequisites: | ||
| # 1. VPN connected to the dev environment | ||
| # 2. secrets/ folder generated: SECRET_SA_ACCOUNT_NAME=rharosecretsdev make secrets | ||
| # 3. AKS kubeconfig generated: make aks.kubeconfig | ||
| # 4. A test cluster created via: CLUSTER=<name> go run ./hack/cluster create | ||
| # 5. Local RP running with Hive enabled (see below) | ||
| # | ||
| # Usage: | ||
| # ./hack/test-holmes-investigate.sh <cluster-name> [question] | ||
| # | ||
| # Examples: | ||
| # ./hack/test-holmes-investigate.sh haowang-holmes-test | ||
| # ./hack/test-holmes-investigate.sh haowang-holmes-test "why is pod X crashing?" | ||
| # ./hack/test-holmes-investigate.sh haowang-holmes-test "check node memory usage" | ||
| # | ||
| # To start the local RP with Hive + Holmes enabled: | ||
| # | ||
| # source env && source secrets/env | ||
| # export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig) | ||
| # export ARO_INSTALL_VIA_HIVE=true | ||
| # export ARO_ADOPT_BY_HIVE=true | ||
| # export ARO_PODMAN_SOCKET="unix://$(podman machine inspect --format '{{.ConnectionInfo.PodmanSocket.Path}}')" | ||
| # export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest" | ||
| # export HOLMES_AZURE_API_KEY="<your-azure-openai-key>" | ||
| # export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>" | ||
| # export HOLMES_AZURE_API_VERSION="2025-04-01-preview" | ||
| # export HOLMES_MODEL="azure/gpt-5.2" | ||
| # make runlocal-rp | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| CLUSTER_NAME="${1:-}" | ||
| QUESTION="${2:-what is the cluster health status?}" | ||
|
|
||
| if [[ -z "$CLUSTER_NAME" ]]; then | ||
| echo "Usage: $0 <cluster-name> [question]" | ||
| echo "" | ||
| echo "Examples:" | ||
| echo " $0 haowang-holmes-test" | ||
| echo " $0 haowang-holmes-test 'why is pod X crashing?'" | ||
| exit 1 | ||
| fi | ||
|
|
||
| # Source env if not already loaded | ||
| if [[ -z "${AZURE_SUBSCRIPTION_ID:-}" ]]; then | ||
| if [[ -f env ]] && [[ -f secrets/env ]]; then | ||
| source env | ||
| source secrets/env | ||
| else | ||
| echo "Error: AZURE_SUBSCRIPTION_ID not set and env files not found." | ||
| echo "Run from the repo root, or source env && source secrets/env first." | ||
| exit 1 | ||
| fi | ||
| fi | ||
|
|
||
| RESOURCEGROUP="${RESOURCEGROUP:-v4-eastus}" | ||
| RP_URL="https://localhost:8443" | ||
| API_PATH="/admin/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCEGROUP}/providers/Microsoft.RedHatOpenShift/openShiftClusters/${CLUSTER_NAME}/investigate" | ||
|
|
||
| echo "============================================" | ||
| echo " Holmes Investigation Test" | ||
| echo "============================================" | ||
| echo " Cluster: ${CLUSTER_NAME}" | ||
| echo " RG: ${RESOURCEGROUP}" | ||
| echo " Question: ${QUESTION}" | ||
| echo " Endpoint: POST ${RP_URL}${API_PATH}" | ||
| echo "============================================" | ||
| echo "" | ||
|
|
||
| # Check RP is running | ||
| if ! curl -sk -o /dev/null -w '' "${RP_URL}/healthz" 2>/dev/null; then | ||
| echo "Error: Local RP is not running at ${RP_URL}" | ||
| echo "Start it with: make runlocal-rp (see header comments for full env setup)" | ||
| exit 1 | ||
| fi | ||
|
|
||
| echo "Sending investigation request..." | ||
| echo "Streaming results (this may take 1-5 minutes):" | ||
| echo "--------------------------------------------" | ||
|
|
||
| curl -sk --no-buffer -X POST \ | ||
| "${RP_URL}${API_PATH}" \ | ||
| -H "Content-Type: application/json" \ | ||
| -d "$(jq -n --arg q "${QUESTION}" '{question: $q}')" | ||
|
|
||
| echo "" | ||
| echo "--------------------------------------------" | ||
| echo "Investigation complete." |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| package frontend | ||
|
|
||
| // Copyright (c) Microsoft Corporation. | ||
| // Licensed under the Apache License 2.0. | ||
|
|
||
| import ( | ||
| "context" | ||
| "encoding/json" | ||
| "fmt" | ||
| "net/http" | ||
| "path/filepath" | ||
| "strings" | ||
| "sync/atomic" | ||
|
|
||
| "github.com/go-chi/chi/v5" | ||
| "github.com/sirupsen/logrus" | ||
|
|
||
| "github.com/Azure/ARO-RP/pkg/api" | ||
| "github.com/Azure/ARO-RP/pkg/database/cosmosdb" | ||
| "github.com/Azure/ARO-RP/pkg/frontend/middleware" | ||
| "github.com/Azure/ARO-RP/pkg/util/holmes" | ||
| ) | ||
|
|
||
| type investigateRequest struct { | ||
| Question string `json:"question"` | ||
| } | ||
|
|
||
| func (f *frontend) postAdminOpenShiftClusterInvestigate(w http.ResponseWriter, r *http.Request) { | ||
| ctx := r.Context() | ||
| log := ctx.Value(middleware.ContextKeyLog).(*logrus.Entry) | ||
| r.URL.Path = filepath.Dir(r.URL.Path) | ||
|
|
||
| err := f._postAdminOpenShiftClusterInvestigate(ctx, r, log, w) | ||
| if err != nil { | ||
| // Only set Content-Type and call adminReply on error, since on success | ||
| // the response was already streamed as text/plain by InvestigateCluster. | ||
| adminReply(log, w, nil, nil, err) | ||
| } | ||
| } | ||
|
|
||
| func (f *frontend) _postAdminOpenShiftClusterInvestigate(ctx context.Context, r *http.Request, log *logrus.Entry, w http.ResponseWriter) error { | ||
| resType, resName, resGroupName := chi.URLParam(r, "resourceType"), chi.URLParam(r, "resourceName"), chi.URLParam(r, "resourceGroupName") | ||
|
|
||
| // Parse request body from context (middleware buffers the body). | ||
| body := r.Context().Value(middleware.ContextKeyBody).([]byte) | ||
| var req investigateRequest | ||
| err := json.Unmarshal(body, &req) | ||
| if err != nil { | ||
| return api.NewCloudError(http.StatusBadRequest, api.CloudErrorCodeInvalidRequestContent, "", fmt.Sprintf("The request body could not be parsed: %v.", err)) | ||
| } | ||
|
|
||
| if req.Question == "" { | ||
| return api.NewCloudError(http.StatusBadRequest, api.CloudErrorCodeInvalidParameter, "question", "The question parameter is required and must be non-empty.") | ||
| } | ||
|
|
||
| const maxQuestionLength = 1000 | ||
| if len(req.Question) > maxQuestionLength { | ||
| return api.NewCloudError(http.StatusBadRequest, api.CloudErrorCodeInvalidParameter, "question", fmt.Sprintf("The question must not exceed %d characters.", maxQuestionLength)) | ||
| } | ||
|
|
||
| holmesConfig := holmes.NewHolmesConfigFromEnv() | ||
| if err := holmesConfig.Validate(); err != nil { | ||
| return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", fmt.Sprintf("Holmes configuration error: %v", err)) | ||
| } | ||
|
|
||
| // Rate limit: reject if too many concurrent investigations are running. | ||
| // Use CAS loop so rejected requests don't temporarily inflate the counter. | ||
| maxConcurrent := int64(holmesConfig.MaxConcurrentInvestigations) | ||
| for { | ||
| current := atomic.LoadInt64(&f.activeInvestigations) | ||
| if current >= maxConcurrent { | ||
| return api.NewCloudError(http.StatusTooManyRequests, api.CloudErrorCodeThrottlingLimitExceeded, "", fmt.Sprintf("Too many concurrent investigations (%d). Please try again later.", holmesConfig.MaxConcurrentInvestigations)) | ||
| } | ||
| if atomic.CompareAndSwapInt64(&f.activeInvestigations, current, current+1) { | ||
| break | ||
| } | ||
| } | ||
| defer atomic.AddInt64(&f.activeInvestigations, -1) | ||
|
|
||
| resourceID := strings.TrimPrefix(r.URL.Path, "/admin") | ||
|
|
||
| dbOpenShiftClusters, err := f.dbGroup.OpenShiftClusters() | ||
| if err != nil { | ||
| return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", err.Error()) | ||
| } | ||
|
|
||
| doc, err := dbOpenShiftClusters.Get(ctx, resourceID) | ||
| switch { | ||
| case cosmosdb.IsErrorStatusCode(err, http.StatusNotFound): | ||
| return api.NewCloudError(http.StatusNotFound, api.CloudErrorCodeResourceNotFound, "", fmt.Sprintf("The Resource '%s/%s' under resource group '%s' was not found.", resType, resName, resGroupName)) | ||
| case err != nil: | ||
| return err | ||
| } | ||
|
|
||
| if f.hiveClusterManager == nil { | ||
| return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", "hive is not enabled") | ||
| } | ||
|
|
||
| hiveNamespace := doc.OpenShiftCluster.Properties.HiveProfile.Namespace | ||
| if hiveNamespace == "" { | ||
| return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", "cluster does not have a Hive namespace configured") | ||
| } | ||
|
|
||
| // Generate a short-lived (1h) read-only kubeconfig for the diagnostics identity. | ||
| // This uses the cluster CA from the persisted graph to sign a fresh client cert, | ||
| // then converts to the external API endpoint since Hive cannot resolve api-int.*. | ||
| kubeconfig, err := f.generateDiagnosticsKubeconfig(ctx, log, doc) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to generate diagnostics kubeconfig: %w", err) | ||
| } | ||
|
|
||
| log.Infof("starting Holmes investigation for cluster %s with question: %s", resourceID, req.Question) | ||
|
|
||
| // Set Content-Type before streaming begins. Once bytes are written to w, | ||
| // the response is committed and errors cannot be reported via adminReply. | ||
| w.Header().Set("Content-Type", "text/plain") | ||
|
|
||
| err = f.hiveClusterManager.InvestigateCluster(ctx, hiveNamespace, kubeconfig, holmesConfig, req.Question, w) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to investigate cluster: %w", err) | ||
| } | ||
|
|
||
| return nil | ||
| } | ||