diff --git a/docs/holmes-investigation.md b/docs/holmes-investigation.md new file mode 100644 index 00000000000..c104c4378b2 --- /dev/null +++ b/docs/holmes-investigation.md @@ -0,0 +1,108 @@ +# Holmes Investigation API + +The Holmes investigation API is an admin endpoint that runs [HolmesGPT](https://github.com/robusta-dev/holmesgpt) diagnostic investigations on ARO clusters. It creates a short-lived pod on the Hive AKS cluster that connects to the target cluster, runs diagnostic queries, and streams the results back to the caller. + +**Endpoint:** `POST /admin/subscriptions/{subscriptionId}/resourcegroups/{resourceGroup}/providers/Microsoft.RedHatOpenShift/openShiftClusters/{clusterName}/investigate` + +## Configuration Reference + +| Config | Env Var | Key Vault Secret (prod) | Default | Required | +|--------|---------|------------------------|---------|----------| +| Azure OpenAI API key | `HOLMES_AZURE_API_KEY` | `holmes-azure-api-key` | — | Yes | +| Azure OpenAI endpoint | `HOLMES_AZURE_API_BASE` | `holmes-azure-api-base` | — | Yes | +| HolmesGPT container image | `HOLMES_IMAGE` | — | `quay.io/haoran/holmesgpt:latest` | No | +| Azure OpenAI API version | `HOLMES_AZURE_API_VERSION` | — | `2025-04-01-preview` | No | +| LLM model name | `HOLMES_MODEL` | — | `azure/gpt-5.2` | No | +| Pod timeout (seconds) | `HOLMES_DEFAULT_TIMEOUT` | — | `600` | No | +| Max concurrent investigations per RP | `HOLMES_MAX_CONCURRENT` | — | `20` | No | + +## Config Loading + +Configuration is loaded once at RP startup in `NewFrontend` (`pkg/frontend/frontend.go`). + +**Development mode** (`RP_MODE=development`): All values are read from environment variables via `NewHolmesConfigFromEnv()`. + +**Production mode**: Sensitive values (API key, API base) are read from the service Key Vault (`{KEYVAULT_PREFIX}-svc`). Non-secret values (image, model, timeout, concurrency) are read from environment variables. This uses `NewHolmesConfig(ctx, serviceKeyvault)`. 
+**Soft-load behavior**: If loading fails (e.g., Key Vault secrets not provisioned), the RP logs a warning and starts normally. Only the investigate endpoint returns an error ("Holmes investigation is not configured"). This allows the RP to operate without Holmes configured.
+
+The loaded config is stored on the `frontend` struct as `holmesConfig *holmes.HolmesConfig` and reused for all investigation requests.
+
+## How Config Reaches the Pod
+
+When an investigation request arrives, the RP creates three Kubernetes resources in the cluster's Hive namespace:
+
+1. **Secret** (`holmes-kubeconfig-{id}`) — Contains:
+   - `config`: Short-lived (1h) kubeconfig for the `system:aro-diagnostics` identity
+   - `azure-api-key`: From `holmesConfig.AzureAPIKey`
+   - `azure-api-base`: From `holmesConfig.AzureAPIBase`
+   - `azure-api-version`: From `holmesConfig.AzureAPIVersion`
+
+2. **ConfigMap** (`holmes-config-{id}`) — Embedded toolset config from `pkg/hive/staticresources/holmes-config.yaml` (defines which kubectl commands Holmes can use)
+
+3. **Pod** (`holmes-investigate-{id}`) — Runs:
+   ```
+   python holmes_cli.py ask "<question>" -n <namespace> --model=<model> --config=/etc/holmes/config.yaml
+   ```
+   - Image from `holmesConfig.Image`
+   - `ActiveDeadlineSeconds` from `holmesConfig.DefaultTimeout`
+   - Azure credentials injected as environment variables from the Secret
+   - Kubeconfig mounted at `/etc/kubeconfig/config`
+
+All three resources are cleaned up after the investigation completes (or fails).
+
+## Development Setup
+
+1. Ensure prerequisites: VPN connected, `secrets/env` generated, `aks.kubeconfig` generated
+
+2. Export Holmes environment variables:
+   ```bash
+   source env && source secrets/env
+   export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig)
+   export ARO_INSTALL_VIA_HIVE=true
+   export ARO_ADOPT_BY_HIVE=true
+   export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest"
+   export HOLMES_AZURE_API_KEY="<your-azure-openai-api-key>"
+   export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>"
+   ```
+
+3. Start the local RP: `make runlocal-rp`
+
+4. 
Run an investigation:
+   ```bash
+   ./hack/test-holmes-investigate.sh <cluster-name> "what is the cluster health status?"
+   ```
+
+## Key Vault Provisioning (Staging/Production)
+
+Create the following secrets in the service Key Vault (`{KEYVAULT_PREFIX}-svc`):
+
+| Secret Name | Value |
+|-------------|-------|
+| `holmes-azure-api-key` | Azure OpenAI API key |
+| `holmes-azure-api-base` | Azure OpenAI endpoint URL (e.g., `https://<resource-name>.openai.azure.com`) |
+
+Non-secret config (`HOLMES_IMAGE`, `HOLMES_MODEL`, etc.) is set via ARM deployment parameters in `pkg/deploy/generator/resources_rp.go` when added to the deployment template.
+
+## Security
+
+- **Cluster access**: Investigation pods use a `system:aro-diagnostics` identity with read-only RBAC (get/list/watch only). The kubeconfig certificate expires after 1 hour.
+- **Pod security**: Runs as non-root (UID 1000), no privilege escalation, all capabilities dropped, service account token not mounted.
+- **Toolset restrictions**: Destructive commands (`kubectl delete`, `kubectl apply`, `kubectl exec`, `rm`) are blocked in the Holmes toolset config.
+- **Rate limiting**: Per-RP-instance atomic counter limits concurrent investigations (default 20).
+- **Input validation**: Question limited to 1000 characters, control characters rejected, model name validated against safe character pattern.
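The input-validation and rate-limiting rules listed above can be sketched in Go. This is an illustrative condensation, not the handler's actual code (which lives in `pkg/frontend/admin_openshiftcluster_investigate.go`); `validateQuestion` and `tryAcquire` are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// validateQuestion condenses the input rules: non-empty, at most 1000
// characters, and no control characters that could affect CLI parsing.
func validateQuestion(q string) error {
	if q == "" {
		return errors.New("the question parameter is required")
	}
	if len(q) > 1000 {
		return errors.New("the question must not exceed 1000 characters")
	}
	for _, ch := range q {
		if ch < 0x20 {
			return errors.New("the question must not contain control characters")
		}
	}
	return nil
}

// tryAcquire sketches the per-RP-instance limiter: a compare-and-swap
// loop on an atomic counter, so rejected requests never inflate it.
func tryAcquire(active *int64, max int64) bool {
	for {
		current := atomic.LoadInt64(active)
		if current >= max {
			return false
		}
		if atomic.CompareAndSwapInt64(active, current, current+1) {
			return true
		}
	}
}

func main() {
	fmt.Println(validateQuestion("what is the cluster health status?")) // nil: passes all checks
	fmt.Println(validateQuestion("bad\x00question") != nil)             // control character rejected

	var active int64
	fmt.Println(tryAcquire(&active, 1)) // first caller admitted
	fmt.Println(tryAcquire(&active, 1)) // rejected while the slot is held
	atomic.AddInt64(&active, -1)        // release, as the handler's defer does
}
```

Note the CAS loop mirrors the handler's design choice: a plain increment-then-check would let a burst of rejected requests temporarily push the counter over the limit.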
+
+## Code Locations
+
+| Component | File |
+|-----------|------|
+| Config struct and loaders | `pkg/util/holmes/config.go` |
+| Config loading at startup | `pkg/frontend/frontend.go` (search `holmesConfig`) |
+| Admin API handler | `pkg/frontend/admin_openshiftcluster_investigate.go` |
+| Kubeconfig generation | `pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go` |
+| Pod creation and streaming | `pkg/hive/investigate.go` |
+| Kubeconfig transformation (dev) | `pkg/util/holmes/kubeconfig.go` |
+| Holmes toolset config | `pkg/hive/staticresources/holmes-config.yaml` |
+| RBAC ClusterRole | `pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml` |
+| RBAC ClusterRoleBinding | `pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml` |
+| E2E test script | `hack/test-holmes-investigate.sh` |
diff --git a/hack/test-holmes-investigate.sh b/hack/test-holmes-investigate.sh
new file mode 100755
index 00000000000..1cb1948f3d9
--- /dev/null
+++ b/hack/test-holmes-investigate.sh
@@ -0,0 +1,91 @@
+#!/bin/bash
+# Test script for the Holmes investigation admin API endpoint.
+#
+# Prerequisites:
+# 1. VPN connected to the dev environment
+# 2. secrets/ folder generated: SECRET_SA_ACCOUNT_NAME=rharosecretsdev make secrets
+# 3. AKS kubeconfig generated: make aks.kubeconfig
+# 4. A test cluster created via: CLUSTER=<name> go run ./hack/cluster create
+# 5. Local RP running with Hive enabled (see below)
+#
+# Usage:
+# ./hack/test-holmes-investigate.sh <cluster-name> [question]
+#
+# Examples:
+# ./hack/test-holmes-investigate.sh haowang-holmes-test
+# ./hack/test-holmes-investigate.sh haowang-holmes-test "why is pod X crashing?"
+# ./hack/test-holmes-investigate.sh haowang-holmes-test "check node memory usage"
+#
+# To start the local RP with Hive + Holmes enabled:
+#
+# source env && source secrets/env
+# export HIVE_KUBE_CONFIG_PATH=$(realpath aks.kubeconfig)
+# export ARO_INSTALL_VIA_HIVE=true
+# export ARO_ADOPT_BY_HIVE=true
+# export ARO_PODMAN_SOCKET="unix://$(podman machine inspect --format '{{.ConnectionInfo.PodmanSocket.Path}}')"
+# export HOLMES_IMAGE="quay.io/haoran/holmesgpt:latest"
+# export HOLMES_AZURE_API_KEY="<your-azure-openai-api-key>"
+# export HOLMES_AZURE_API_BASE="<your-azure-openai-endpoint>"
+# export HOLMES_AZURE_API_VERSION="2025-04-01-preview"
+# export HOLMES_MODEL="azure/gpt-5.2"
+# make runlocal-rp
+
+set -euo pipefail
+
+CLUSTER_NAME="${1:-}"
+QUESTION="${2:-what is the cluster health status?}"
+
+if [[ -z "$CLUSTER_NAME" ]]; then
+  echo "Usage: $0 <cluster-name> [question]"
+  echo ""
+  echo "Examples:"
+  echo "  $0 haowang-holmes-test"
+  echo "  $0 haowang-holmes-test 'why is pod X crashing?'"
+  exit 1
+fi
+
+# Source env if not already loaded
+if [[ -z "${AZURE_SUBSCRIPTION_ID:-}" ]]; then
+  if [[ -f env ]] && [[ -f secrets/env ]]; then
+    source env
+    source secrets/env
+  else
+    echo "Error: AZURE_SUBSCRIPTION_ID not set and env files not found."
+    echo "Run from the repo root, or source env && source secrets/env first."
+    exit 1
+  fi
+fi
+
+RESOURCEGROUP="${RESOURCEGROUP:-v4-eastus}"
+RP_URL="https://localhost:8443"
+API_PATH="/admin/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourcegroups/${RESOURCEGROUP}/providers/Microsoft.RedHatOpenShift/openShiftClusters/${CLUSTER_NAME}/investigate"
+
+echo "============================================"
+echo " Holmes Investigation Test"
+echo "============================================"
+echo " Cluster: ${CLUSTER_NAME}"
+echo " RG: ${RESOURCEGROUP}"
+echo " Question: ${QUESTION}"
+echo " Endpoint: POST ${RP_URL}${API_PATH}"
+echo "============================================"
+echo ""
+
+# Check RP is running
+if ! 
curl -sk -o /dev/null -w '' "${RP_URL}/healthz" 2>/dev/null; then + echo "Error: Local RP is not running at ${RP_URL}" + echo "Start it with: make runlocal-rp (see header comments for full env setup)" + exit 1 +fi + +echo "Sending investigation request..." +echo "Streaming results (this may take 1-5 minutes):" +echo "--------------------------------------------" + +curl -sk --no-buffer -X POST \ + "${RP_URL}${API_PATH}" \ + -H "Content-Type: application/json" \ + -d "$(jq -n --arg q "${QUESTION}" '{question: $q}')" + +echo "" +echo "--------------------------------------------" +echo "Investigation complete." diff --git a/pkg/cluster/kubeconfig.go b/pkg/cluster/kubeconfig.go index 69e14eaef80..e87342dc272 100644 --- a/pkg/cluster/kubeconfig.go +++ b/pkg/cluster/kubeconfig.go @@ -25,13 +25,13 @@ import ( // kubeconfig for the ARO service, based on the admin kubeconfig found in the // graph. func (m *manager) generateAROServiceKubeconfig(pg graph.PersistedGraph) ([]byte, error) { - return generateKubeconfig(pg, "system:aro-service", []string{"system:masters"}, installer.TenYears, true) + return GenerateKubeconfig(pg, "system:aro-service", []string{"system:masters"}, installer.TenYears, true) } // generateAROSREKubeconfig generates additional admin credentials and a // kubeconfig for ARO SREs, based on the admin kubeconfig found in the graph. func (m *manager) generateAROSREKubeconfig(pg graph.PersistedGraph) ([]byte, error) { - return generateKubeconfig(pg, "system:aro-sre", nil, installer.TenYears, true) + return GenerateKubeconfig(pg, "system:aro-sre", nil, installer.TenYears, true) } // checkUserAdminKubeconfigUpdated checks if the user kubeconfig is @@ -82,7 +82,7 @@ func (m *manager) checkUserAdminKubeconfigUpdated() bool { // generateUserAdminKubeconfig generates additional admin credentials and a // kubeconfig for ARO User, based on the admin kubeconfig found in the graph. 
func (m *manager) generateUserAdminKubeconfig(pg graph.PersistedGraph) ([]byte, error) { - return generateKubeconfig(pg, "system:admin", nil, installer.OneYear, false) + return GenerateKubeconfig(pg, "system:admin", nil, installer.OneYear, false) } func (m *manager) generateKubeconfigs(ctx context.Context) error { @@ -127,7 +127,8 @@ func (m *manager) generateKubeconfigs(ctx context.Context) error { return err } -func generateKubeconfig(pg graph.PersistedGraph, commonName string, organization []string, validity time.Duration, internal bool) ([]byte, error) { +// GenerateKubeconfig generates a kubeconfig with a client certificate signed by the cluster CA. +func GenerateKubeconfig(pg graph.PersistedGraph, commonName string, organization []string, validity time.Duration, internal bool) ([]byte, error) { var ca *installer.AdminKubeConfigSignerCertKey var adminInternalClient *installer.AdminInternalClient err := pg.GetByName(false, "*tls.AdminKubeConfigSignerCertKey", &ca) diff --git a/pkg/cluster/kubeconfig_test.go b/pkg/cluster/kubeconfig_test.go index c006908dfc7..cf0f88890ba 100644 --- a/pkg/cluster/kubeconfig_test.go +++ b/pkg/cluster/kubeconfig_test.go @@ -143,6 +143,136 @@ func TestGenerateAROServiceKubeconfig(t *testing.T) { } } +func TestGenerateDiagnosticsKubeconfig(t *testing.T) { + validCaKey, validCaCerts, err := utiltls.GenerateKeyAndCertificate("validca", nil, nil, true, false) + if err != nil { + t.Fatal(err) + } + encodedKey, err := utilpem.Encode(validCaKey) + if err != nil { + t.Fatal(err) + } + encodedCert, err := utilpem.Encode(validCaCerts[0]) + if err != nil { + t.Fatal(err) + } + ca := &installer.AdminKubeConfigSignerCertKey{ + SelfSignedCertKey: installer.SelfSignedCertKey{ + CertKey: installer.CertKey{ + CertRaw: encodedCert, + KeyRaw: encodedKey, + }, + }, + } + + apiserverURL := "https://api-int.hash.rg.mydomain:6443" + clusterName := "api-hash-rg-mydomain:6443" + diagnosticsName := "system:aro-diagnostics" + + adminInternalClient := 
&installer.AdminInternalClient{} + adminInternalClient.Config = &clientcmdv1.Config{ + Clusters: []clientcmdv1.NamedCluster{ + { + Name: clusterName, + Cluster: clientcmdv1.Cluster{ + Server: apiserverURL, + CertificateAuthorityData: []byte("internal API Cert"), + }, + }, + }, + AuthInfos: []clientcmdv1.NamedAuthInfo{}, + Contexts: []clientcmdv1.NamedContext{ + { + Name: diagnosticsName, + Context: clientcmdv1.Context{ + Cluster: clusterName, + AuthInfo: diagnosticsName, + }, + }, + }, + CurrentContext: diagnosticsName, + } + + pg := graph.PersistedGraph{} + + caData, err := json.Marshal(ca) + if err != nil { + t.Fatal(err) + } + clientData, err := json.Marshal(adminInternalClient) + if err != nil { + t.Fatal(err) + } + pg["*kubeconfig.AdminInternalClient"] = clientData + pg["*tls.AdminKubeConfigSignerCertKey"] = caData + + // Generate a 1-hour kubeconfig for system:aro-diagnostics + aroDiagnosticsClient, err := GenerateKubeconfig(pg, diagnosticsName, nil, time.Hour, true) + if err != nil { + t.Fatal(err) + } + + var got *clientcmdv1.Config + err = yaml.Unmarshal(aroDiagnosticsClient, &got) + if err != nil { + t.Fatal(err) + } + + innerpem := string(got.AuthInfos[0].AuthInfo.ClientCertificateData) + string(got.AuthInfos[0].AuthInfo.ClientKeyData) + _, innercert, err := utilpem.Parse([]byte(innerpem)) + if err != nil { + t.Fatal(err) + } + + err = innercert[0].CheckSignatureFrom(validCaCerts[0]) + if err != nil { + t.Fatal(err) + } + + issuer := innercert[0].Issuer.String() + if issuer != "CN=validca" { + t.Error(issuer) + } + + subject := innercert[0].Subject.String() + if subject != "CN=system:aro-diagnostics" { + t.Error(subject) + } + + // Verify no organization (no system:masters group) + if len(innercert[0].Subject.Organization) != 0 { + t.Errorf("expected no organization, got %v", innercert[0].Subject.Organization) + } + + // Verify ~1 hour validity (not 10 years) + expectedExpiry := time.Now().Add(time.Hour) + if 
innercert[0].NotAfter.After(expectedExpiry.Add(5 * time.Minute)) { + t.Errorf("certificate expires too far in the future: %v", innercert[0].NotAfter) + } + if innercert[0].NotAfter.Before(expectedExpiry.Add(-5 * time.Minute)) { + t.Errorf("certificate expires too soon: %v", innercert[0].NotAfter) + } + + keyUsage := innercert[0].KeyUsage + expectedKeyUsage := x509.KeyUsageKeyEncipherment | x509.KeyUsageDigitalSignature + if keyUsage != expectedKeyUsage { + t.Error("Invalid keyUsage.") + } + + // Verify internal URL is preserved (not rewritten to external) + if got.Clusters[0].Cluster.Server != apiserverURL { + t.Errorf("expected server %s, got %s", apiserverURL, got.Clusters[0].Cluster.Server) + } + + // validate the rest of the struct + got.AuthInfos = []clientcmdv1.NamedAuthInfo{} + want := adminInternalClient.Config + + if !reflect.DeepEqual(got, want) { + t.Fatal(cmp.Diff(got, want)) + } +} + func TestGenerateUserAdminKubeconfig(t *testing.T) { validCaKey, validCaCerts, err := utiltls.GenerateKeyAndCertificate("validca", nil, nil, true, false) if err != nil { diff --git a/pkg/frontend/admin_openshiftcluster_investigate.go b/pkg/frontend/admin_openshiftcluster_investigate.go new file mode 100644 index 00000000000..da48d7fa0a3 --- /dev/null +++ b/pkg/frontend/admin_openshiftcluster_investigate.go @@ -0,0 +1,159 @@ +package frontend + +// Copyright (c) Microsoft Corporation. +// Licensed under the Apache License 2.0. + +import ( + "context" + "encoding/json" + "fmt" + "net/http" + "path/filepath" + "strings" + "sync/atomic" + + "github.com/go-chi/chi/v5" + "github.com/sirupsen/logrus" + + "github.com/Azure/ARO-RP/pkg/api" + "github.com/Azure/ARO-RP/pkg/database/cosmosdb" + "github.com/Azure/ARO-RP/pkg/frontend/middleware" +) + +type investigateRequest struct { + Question string `json:"question"` +} + +// trackingResponseWriter wraps http.ResponseWriter to track whether any bytes +// have been written. 
This is used to avoid calling adminReply (which writes +// JSON) after streaming has already started (which writes text/plain). +type trackingResponseWriter struct { + http.ResponseWriter + written int64 +} + +func (tw *trackingResponseWriter) Write(b []byte) (int, error) { + n, err := tw.ResponseWriter.Write(b) + atomic.AddInt64(&tw.written, int64(n)) + return n, err +} + +func (tw *trackingResponseWriter) Flush() { + if flusher, ok := tw.ResponseWriter.(http.Flusher); ok { + flusher.Flush() + } +} + +func (f *frontend) postAdminOpenShiftClusterInvestigate(w http.ResponseWriter, r *http.Request) { + ctx := r.Context() + log := ctx.Value(middleware.ContextKeyLog).(*logrus.Entry) + r.URL.Path = filepath.Dir(r.URL.Path) + + tw := &trackingResponseWriter{ResponseWriter: w} + err := f._postAdminOpenShiftClusterInvestigate(ctx, r, log, tw) + if err != nil { + if atomic.LoadInt64(&tw.written) > 0 { + // Streaming already started — can't send a JSON error response. + // Log the error server-side instead. + log.WithError(err).Warn("investigation failed after streaming started") + return + } + adminReply(log, tw, nil, nil, err) + } +} + +func (f *frontend) _postAdminOpenShiftClusterInvestigate(ctx context.Context, r *http.Request, log *logrus.Entry, w http.ResponseWriter) error { + resType, resName, resGroupName := chi.URLParam(r, "resourceType"), chi.URLParam(r, "resourceName"), chi.URLParam(r, "resourceGroupName") + + // Parse request body from context (middleware buffers the body). 
+ body := r.Context().Value(middleware.ContextKeyBody).([]byte) + var req investigateRequest + err := json.Unmarshal(body, &req) + if err != nil { + return api.NewCloudError(http.StatusBadRequest, api.CloudErrorCodeInvalidRequestContent, "", fmt.Sprintf("The request body could not be parsed: %v.", err)) + } + + if req.Question == "" { + return api.NewCloudError(http.StatusBadRequest, api.CloudErrorCodeInvalidParameter, "question", "The question parameter is required and must be non-empty.") + } + + const maxQuestionLength = 1000 + if len(req.Question) > maxQuestionLength { + return api.NewCloudError(http.StatusBadRequest, api.CloudErrorCodeInvalidParameter, "question", fmt.Sprintf("The question must not exceed %d characters.", maxQuestionLength)) + } + + // Reject control characters that could affect CLI argument parsing. + for _, ch := range req.Question { + if ch < 0x20 && ch != ' ' { + return api.NewCloudError(http.StatusBadRequest, api.CloudErrorCodeInvalidParameter, "question", "The question must not contain control characters.") + } + } + + if f.holmesConfig == nil { + return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", "Holmes investigation is not configured") + } + + // Rate limit: reject if too many concurrent investigations are running. + // Use CAS loop so rejected requests don't temporarily inflate the counter. + // NOTE: This limit is per-RP-instance (in-memory atomic counter). With N + // replicas, the effective global limit is N * MaxConcurrentInvestigations. + // A distributed limiter (e.g., CosmosDB-backed) can be added if global + // quota enforcement is needed. + maxConcurrent := int64(f.holmesConfig.MaxConcurrentInvestigations) + for { + current := atomic.LoadInt64(&f.activeInvestigations) + if current >= maxConcurrent { + return api.NewCloudError(http.StatusTooManyRequests, api.CloudErrorCodeThrottlingLimitExceeded, "", fmt.Sprintf("Too many concurrent investigations (%d). 
Please try again later.", f.holmesConfig.MaxConcurrentInvestigations)) + } + if atomic.CompareAndSwapInt64(&f.activeInvestigations, current, current+1) { + break + } + } + defer atomic.AddInt64(&f.activeInvestigations, -1) + + resourceID := strings.TrimPrefix(r.URL.Path, "/admin") + + dbOpenShiftClusters, err := f.dbGroup.OpenShiftClusters() + if err != nil { + return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", err.Error()) + } + + doc, err := dbOpenShiftClusters.Get(ctx, resourceID) + switch { + case cosmosdb.IsErrorStatusCode(err, http.StatusNotFound): + return api.NewCloudError(http.StatusNotFound, api.CloudErrorCodeResourceNotFound, "", fmt.Sprintf("The Resource '%s/%s' under resource group '%s' was not found.", resType, resName, resGroupName)) + case err != nil: + return err + } + + if f.hiveClusterManager == nil { + return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", "hive is not enabled") + } + + hiveNamespace := doc.OpenShiftCluster.Properties.HiveProfile.Namespace + if hiveNamespace == "" { + return api.NewCloudError(http.StatusInternalServerError, api.CloudErrorCodeInternalServerError, "", "cluster does not have a Hive namespace configured") + } + + // Generate a short-lived (1h) read-only kubeconfig for the diagnostics identity. + // This uses the cluster CA from the persisted graph to sign a fresh client cert. + // In development mode, the endpoint is rewritten from api-int.* to api.* since + // the Hive cluster cannot resolve private DNS there. + kubeconfig, err := f.generateDiagnosticsKubeconfig(ctx, log, doc) + if err != nil { + return fmt.Errorf("failed to generate diagnostics kubeconfig: %w", err) + } + + log.Infof("starting Holmes investigation for cluster %s (question_length=%d)", resourceID, len(req.Question)) + + // Set Content-Type before streaming begins. 
Once bytes are written to w, + // the response is committed and errors cannot be reported via adminReply. + w.Header().Set("Content-Type", "text/plain") + + err = f.hiveClusterManager.InvestigateCluster(ctx, hiveNamespace, kubeconfig, f.holmesConfig, req.Question, w) + if err != nil { + return fmt.Errorf("failed to investigate cluster: %w", err) + } + + return nil +} diff --git a/pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go b/pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go new file mode 100644 index 00000000000..c2ca492b949 --- /dev/null +++ b/pkg/frontend/admin_openshiftcluster_investigate_kubeconfig.go @@ -0,0 +1,80 @@ +package frontend + +// Copyright (c) Microsoft Corporation. +// Licensed under the Apache License 2.0. + +import ( + "context" + "fmt" + "time" + + "github.com/sirupsen/logrus" + + "github.com/Azure/ARO-RP/pkg/api" + "github.com/Azure/ARO-RP/pkg/cluster" + "github.com/Azure/ARO-RP/pkg/cluster/graph" + "github.com/Azure/ARO-RP/pkg/env" + "github.com/Azure/ARO-RP/pkg/util/encryption" + "github.com/Azure/ARO-RP/pkg/util/holmes" + "github.com/Azure/ARO-RP/pkg/util/storage" + "github.com/Azure/ARO-RP/pkg/util/stringutils" +) + +// generateDiagnosticsKubeconfig creates a short-lived (1 hour) kubeconfig for +// the system:aro-diagnostics identity. The kubeconfig is generated on each +// request using the cluster's CA from the persisted graph, so no long-lived +// credentials are stored in CosmosDB. 
+func (f *frontend) generateDiagnosticsKubeconfig(ctx context.Context, log *logrus.Entry, doc *api.OpenShiftClusterDocument) ([]byte, error) { + subscriptionDoc, err := f.getSubscriptionDocument(ctx, doc.Key) + if err != nil { + return nil, fmt.Errorf("failed to get subscription document: %w", err) + } + + credential, err := f.env.FPNewClientCertificateCredential(subscriptionDoc.Subscription.Properties.TenantID, nil) + if err != nil { + return nil, fmt.Errorf("failed to create FP credential: %w", err) + } + + options := f.env.Environment().ArmClientOptions() + storageManager, err := storage.NewManager( + subscriptionDoc.ID, + f.env.Environment().StorageEndpointSuffix, + credential, + doc.OpenShiftCluster.UsesWorkloadIdentity(), + options, + ) + if err != nil { + return nil, fmt.Errorf("failed to create storage manager: %w", err) + } + + clusterAead, err := encryption.NewMulti(ctx, f.env.ServiceKeyvault(), env.EncryptionSecretV2Name, env.EncryptionSecretName) + if err != nil { + return nil, fmt.Errorf("failed to create encryption client: %w", err) + } + + graphManager := graph.NewManager(f.env, log, clusterAead, storageManager) + resourceGroup := stringutils.LastTokenByte(doc.OpenShiftCluster.Properties.ClusterProfile.ResourceGroupID, '/') + account := "cluster" + doc.OpenShiftCluster.Properties.StorageSuffix + + pg, err := graphManager.LoadPersisted(ctx, resourceGroup, account) + if err != nil { + return nil, fmt.Errorf("failed to load persisted graph: %w", err) + } + + kubeconfig, err := cluster.GenerateKubeconfig(pg, "system:aro-diagnostics", nil, time.Hour, true) + if err != nil { + return nil, fmt.Errorf("failed to generate diagnostics kubeconfig: %w", err) + } + + // In development mode, the Hive cluster cannot resolve api-int.* private DNS + // names, so we rewrite to the external api.* endpoint. In production, the + // Hive cluster has proper network connectivity and should use api-int.* directly. 
+ if f.env.IsLocalDevelopmentMode() { + kubeconfig, err = holmes.MakeExternalKubeconfig(kubeconfig) + if err != nil { + return nil, fmt.Errorf("failed to convert to external kubeconfig: %w", err) + } + } + + return kubeconfig, nil +} diff --git a/pkg/frontend/admin_openshiftcluster_investigate_test.go b/pkg/frontend/admin_openshiftcluster_investigate_test.go new file mode 100644 index 00000000000..34c8f31855a --- /dev/null +++ b/pkg/frontend/admin_openshiftcluster_investigate_test.go @@ -0,0 +1,256 @@ +package frontend + +// Copyright (c) Microsoft Corporation. +// Licensed under the Apache License 2.0. + +import ( + "context" + "encoding/json" + "fmt" + "io" + "net/http" + "net/http/httptest" + "strings" + "testing" + + "github.com/go-chi/chi/v5" + "github.com/sirupsen/logrus" + "github.com/stretchr/testify/require" + "go.uber.org/mock/gomock" + + "github.com/Azure/ARO-RP/pkg/api" + "github.com/Azure/ARO-RP/pkg/frontend/middleware" + "github.com/Azure/ARO-RP/pkg/metrics/noop" + "github.com/Azure/ARO-RP/pkg/util/holmes" + mock_hive "github.com/Azure/ARO-RP/pkg/util/mocks/hive" + testdatabase "github.com/Azure/ARO-RP/test/database" +) + +const ( + mockInvestigateSubID = "00000000-0000-0000-0000-000000000001" + mockInvestigateTenantID = "00000000-0000-0000-0000-000000000002" +) + +var testHolmesConfig = &holmes.HolmesConfig{ + Image: "quay.io/test/holmesgpt:latest", + AzureAPIKey: "test-key", + AzureAPIBase: "https://test.openai.azure.com", + AzureAPIVersion: "2025-04-01-preview", + Model: "azure/gpt-4o", + DefaultTimeout: 600, + MaxConcurrentInvestigations: 20, +} + +func investigateDatabaseFixture(dbFixture *testdatabase.Fixture) { + dbFixture.AddOpenShiftClusterDocuments(&api.OpenShiftClusterDocument{ + Key: strings.ToLower(testdatabase.GetResourcePath(mockInvestigateSubID, "testCluster")), + OpenShiftCluster: &api.OpenShiftCluster{ + ID: strings.ToLower(testdatabase.GetResourcePath(mockInvestigateSubID, "testCluster")), + Properties: 
api.OpenShiftClusterProperties{ + ClusterProfile: api.ClusterProfile{ + ResourceGroupID: fmt.Sprintf("/subscriptions/%s/resourceGroups/test-cluster", mockInvestigateSubID), + }, + HiveProfile: api.HiveProfile{ + Namespace: "aro-00000000-0000-0000-0000-000000000001", + }, + StorageSuffix: "abcdef", + }, + }, + }) + + dbFixture.AddSubscriptionDocuments(&api.SubscriptionDocument{ + ID: mockInvestigateSubID, + Subscription: &api.Subscription{ + State: api.SubscriptionStateRegistered, + Properties: &api.SubscriptionProperties{ + TenantID: mockInvestigateTenantID, + }, + }, + }) +} + +func investigateDatabaseFixtureNoHiveNamespace(dbFixture *testdatabase.Fixture) { + dbFixture.AddOpenShiftClusterDocuments(&api.OpenShiftClusterDocument{ + Key: strings.ToLower(testdatabase.GetResourcePath(mockInvestigateSubID, "testCluster")), + OpenShiftCluster: &api.OpenShiftCluster{ + ID: strings.ToLower(testdatabase.GetResourcePath(mockInvestigateSubID, "testCluster")), + Properties: api.OpenShiftClusterProperties{ + ClusterProfile: api.ClusterProfile{ + ResourceGroupID: fmt.Sprintf("/subscriptions/%s/resourceGroups/test-cluster", mockInvestigateSubID), + }, + }, + }, + }) + + dbFixture.AddSubscriptionDocuments(&api.SubscriptionDocument{ + ID: mockInvestigateSubID, + Subscription: &api.Subscription{ + State: api.SubscriptionStateRegistered, + Properties: &api.SubscriptionProperties{ + TenantID: mockInvestigateTenantID, + }, + }, + }) +} + +func TestPostAdminOpenShiftClusterInvestigate(t *testing.T) { + resourceID := strings.ToLower(testdatabase.GetResourcePath(mockInvestigateSubID, "testCluster")) + + tests := []struct { + name string + body string + resourceID string + fixture func(*testdatabase.Fixture) + hiveEnabled bool + holmesConfig *holmes.HolmesConfig + mocks func(*mock_hive.MockClusterManager) + wantStatusCode int + wantError string + }{ + { + name: "empty body returns bad request", + body: "", + resourceID: resourceID, + fixture: investigateDatabaseFixture, + hiveEnabled: 
true, + holmesConfig: testHolmesConfig, + wantStatusCode: http.StatusBadRequest, + wantError: "The request body could not be parsed", + }, + { + name: "empty question returns bad request", + body: `{"question":""}`, + resourceID: resourceID, + fixture: investigateDatabaseFixture, + hiveEnabled: true, + holmesConfig: testHolmesConfig, + wantStatusCode: http.StatusBadRequest, + wantError: "The question parameter is required", + }, + { + name: "question with control characters returns bad request", + body: `{"question":"what is\nthe status?"}`, + resourceID: resourceID, + fixture: investigateDatabaseFixture, + hiveEnabled: true, + holmesConfig: testHolmesConfig, + wantStatusCode: http.StatusBadRequest, + wantError: "must not contain control characters", + }, + { + name: "question too long returns bad request", + body: `{"question":"` + strings.Repeat("a", 1001) + `"}`, + resourceID: resourceID, + fixture: investigateDatabaseFixture, + hiveEnabled: true, + holmesConfig: testHolmesConfig, + wantStatusCode: http.StatusBadRequest, + wantError: "The question must not exceed 1000 characters", + }, + { + name: "holmes not configured returns internal error", + body: `{"question":"what is wrong?"}`, + resourceID: resourceID, + fixture: investigateDatabaseFixture, + hiveEnabled: true, + holmesConfig: nil, + wantStatusCode: http.StatusInternalServerError, + wantError: "Holmes investigation is not configured", + }, + { + name: "cluster not found returns not found", + body: `{"question":"what is wrong?"}`, + resourceID: strings.ToLower(testdatabase.GetResourcePath(mockInvestigateSubID, "nonexistent")), + fixture: investigateDatabaseFixture, + hiveEnabled: true, + holmesConfig: testHolmesConfig, + wantStatusCode: http.StatusNotFound, + wantError: "was not found", + }, + { + name: "hive not enabled returns internal error", + body: `{"question":"what is wrong?"}`, + resourceID: resourceID, + fixture: investigateDatabaseFixture, + hiveEnabled: false, + holmesConfig: testHolmesConfig, 
+ wantStatusCode: http.StatusInternalServerError, + wantError: "hive is not enabled", + }, + { + name: "no hive namespace returns internal error", + body: `{"question":"what is wrong?"}`, + resourceID: resourceID, + fixture: investigateDatabaseFixtureNoHiveNamespace, + hiveEnabled: true, + holmesConfig: testHolmesConfig, + wantStatusCode: http.StatusInternalServerError, + wantError: "cluster does not have a Hive namespace configured", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + ti := newTestInfra(t).WithOpenShiftClusters().WithSubscriptions() + defer ti.done() + + err := ti.buildFixtures(tt.fixture) + if err != nil { + t.Fatal(err) + } + + var f *frontend + + if tt.hiveEnabled { + controller := gomock.NewController(t) + defer controller.Finish() + clusterManager := mock_hive.NewMockClusterManager(controller) + if tt.mocks != nil { + tt.mocks(clusterManager) + } + f, err = NewFrontend(context.Background(), ti.auditLog, ti.log, ti.otelAudit, ti.env, ti.dbGroup, api.APIs, &noop.Noop{}, &noop.Noop{}, nil, clusterManager, nil, nil, nil, nil, nil) + } else { + f, err = NewFrontend(context.Background(), ti.auditLog, ti.log, ti.otelAudit, ti.env, ti.dbGroup, api.APIs, &noop.Noop{}, &noop.Noop{}, nil, nil, nil, nil, nil, nil, nil) + } + if err != nil { + t.Fatal(err) + } + + // Override holmesConfig — NewFrontend soft-loads it (may be nil in test env). + f.holmesConfig = tt.holmesConfig + + recorder := httptest.NewRecorder() + // The URL must include /investigate — the outer handler strips it via filepath.Dir. 
+ request := httptest.NewRequest(http.MethodPost, "/admin"+tt.resourceID+"/investigate", nil) + + ctx := context.WithValue(request.Context(), middleware.ContextKeyLog, logrus.NewEntry(logrus.StandardLogger())) + ctx = context.WithValue(ctx, middleware.ContextKeyBody, []byte(tt.body)) + ctx = context.WithValue(ctx, chi.RouteCtxKey, &chi.Context{ + URLParams: chi.RouteParams{ + Keys: []string{"resourceType", "resourceName", "resourceGroupName"}, + Values: []string{"openshiftcluster", "testCluster", "resourceGroup"}, + }, + }) + request = request.WithContext(ctx) + + f.postAdminOpenShiftClusterInvestigate(recorder, request) + + response := recorder.Result() + require.Equal(t, tt.wantStatusCode, response.StatusCode) + + if tt.wantError != "" { + bodyBytes, err := io.ReadAll(response.Body) + require.NoError(t, err) + + var cloudErr struct { + Error struct { + Message string `json:"message"` + } `json:"error"` + } + err = json.Unmarshal(bodyBytes, &cloudErr) + require.NoError(t, err) + require.Contains(t, cloudErr.Error.Message, tt.wantError) + } + }) + } +} diff --git a/pkg/frontend/frontend.go b/pkg/frontend/frontend.go index 7e2adfcf0bf..d0ae3e0130f 100644 --- a/pkg/frontend/frontend.go +++ b/pkg/frontend/frontend.go @@ -36,6 +36,7 @@ import ( "github.com/Azure/ARO-RP/pkg/util/clusterdata" "github.com/Azure/ARO-RP/pkg/util/encryption" "github.com/Azure/ARO-RP/pkg/util/heartbeat" + "github.com/Azure/ARO-RP/pkg/util/holmes" utillog "github.com/Azure/ARO-RP/pkg/util/log" "github.com/Azure/ARO-RP/pkg/util/log/audit" "github.com/Azure/ARO-RP/pkg/util/recover" @@ -94,6 +95,8 @@ type frontend struct { hiveClusterManager hive.ClusterManager hiveSyncSetManager hive.SyncSetManager + holmesConfig *holmes.HolmesConfig + activeInvestigations int64 kubeActionsFactory kubeActionsFactory azureActionsFactory azureActionsFactory appLensActionsFactory appLensActionsFactory @@ -202,6 +205,18 @@ func NewFrontend(ctx context.Context, streamResponder: defaultResponder{}, } + // Load Holmes 
config: secrets from Key Vault in prod, env vars in dev. + var holmesErr error + if _env.IsLocalDevelopmentMode() { + f.holmesConfig, holmesErr = holmes.NewHolmesConfigFromEnv() + } else { + f.holmesConfig, holmesErr = holmes.NewHolmesConfig(ctx, _env.ServiceKeyvault()) + } + if holmesErr != nil { + baseLog.WithError(holmesErr).Warning("Holmes config not available; investigations will be disabled") + f.holmesConfig = nil + } + l, err := f.env.Listen() if err != nil { return nil, err @@ -406,6 +421,8 @@ func (f *frontend) chiAuthenticatedRoutes(router chi.Router) { }) }) r.Get("/selectors", f.getAdminOpenShiftClusterSelectors) + + r.Post("/investigate", f.postAdminOpenShiftClusterInvestigate) }) }) diff --git a/pkg/frontend/security_test.go b/pkg/frontend/security_test.go index 60821da67c2..7fa4baf9e40 100644 --- a/pkg/frontend/security_test.go +++ b/pkg/frontend/security_test.go @@ -8,6 +8,7 @@ import ( "crypto/rsa" "crypto/tls" "crypto/x509" + "fmt" "net/http" "testing" "time" @@ -65,6 +66,7 @@ func TestSecurity(t *testing.T) { keyvault := mock_azsecrets.NewMockClient(controller) keyvault.EXPECT().GetSecret(gomock.Any(), env.RPServerSecretName, "", nil).AnyTimes().Return(azsecrets.GetSecretResponse{Secret: azsecrets.Secret{Value: pointerutils.ToPtr(string(serverPki))}}, nil) + keyvault.EXPECT().GetSecret(gomock.Any(), gomock.Not(gomock.Eq(env.RPServerSecretName)), gomock.Any(), gomock.Any()).AnyTimes().Return(azsecrets.GetSecretResponse{}, fmt.Errorf("secret not found")) _env := mock_env.NewMockInterface(controller) _env.EXPECT().IsLocalDevelopmentMode().AnyTimes().Return(false) diff --git a/pkg/frontend/shared_test.go b/pkg/frontend/shared_test.go index 120356e8fd2..d368943e6a0 100644 --- a/pkg/frontend/shared_test.go +++ b/pkg/frontend/shared_test.go @@ -114,6 +114,8 @@ func newTestInfraWithFeatures(t *testing.T, features map[env.Feature]bool) *test keyvault := mock_azsecrets.NewMockClient(controller) keyvault.EXPECT().GetSecret(gomock.Any(), 
env.RPServerSecretName, "", nil).AnyTimes().Return(azsecrets.GetSecretResponse{Secret: azsecrets.Secret{Value: pointerutils.ToPtr(string(serverPki))}}, nil) + // Return "not found" for any secret other than RPServerSecretName (e.g., Holmes config secrets). + keyvault.EXPECT().GetSecret(gomock.Any(), gomock.Not(gomock.Eq(env.RPServerSecretName)), gomock.Any(), gomock.Any()).AnyTimes().Return(azsecrets.GetSecretResponse{}, fmt.Errorf("secret not found")) log := logrus.NewEntry(logrus.StandardLogger()) diff --git a/pkg/hive/investigate.go b/pkg/hive/investigate.go new file mode 100644 index 00000000000..58cae1ff455 --- /dev/null +++ b/pkg/hive/investigate.go @@ -0,0 +1,320 @@ +package hive + +// Copyright (c) Microsoft Corporation. +// Licensed under the Apache License 2.0. + +import ( + "context" + "fmt" + "io" + "time" + + _ "embed" + + "github.com/google/uuid" + + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/api/resource" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/util/wait" + + "github.com/Azure/ARO-RP/pkg/util/holmes" + "github.com/Azure/ARO-RP/pkg/util/pointerutils" +) + +//go:embed staticresources/holmes-config.yaml +var holmesConfigYAML string + +// InvestigateCluster creates an investigation pod on the Hive cluster, streams its logs, and cleans up. +// It accepts kubeconfig bytes, creates a temporary secret to hold them, and removes +// the secret (along with the pod and configmap) when the investigation completes. +func (hr *clusterManager) InvestigateCluster(ctx context.Context, hiveNamespace string, kubeconfig []byte, holmesConfig *holmes.HolmesConfig, question string, w io.Writer) error { + id := uuid.New().String()[:8] + configMapName := "holmes-config-" + id + podName := "holmes-investigate-" + id + kubeconfigSecretName := "holmes-kubeconfig-" + id + + hr.log.Infof("starting Holmes investigation %s in namespace %s", id, hiveNamespace) + + // Ensure cleanup of the secret, ConfigMap, and pod on exit. 
+ defer func() { + cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + + hr.log.Infof("cleaning up investigation pod %s", podName) + err := hr.kubernetescli.CoreV1().Pods(hiveNamespace).Delete(cleanupCtx, podName, metav1.DeleteOptions{}) + if err != nil { + hr.log.Warningf("failed to delete investigation pod %s: %v", podName, err) + } + + hr.log.Infof("cleaning up investigation configmap %s", configMapName) + err = hr.kubernetescli.CoreV1().ConfigMaps(hiveNamespace).Delete(cleanupCtx, configMapName, metav1.DeleteOptions{}) + if err != nil { + hr.log.Warningf("failed to delete investigation configmap %s: %v", configMapName, err) + } + + hr.log.Infof("cleaning up investigation secret %s", kubeconfigSecretName) + err = hr.kubernetescli.CoreV1().Secrets(hiveNamespace).Delete(cleanupCtx, kubeconfigSecretName, metav1.DeleteOptions{}) + if err != nil { + hr.log.Warningf("failed to delete investigation secret %s: %v", kubeconfigSecretName, err) + } + }() + + // 0. Create the temporary secret holding the kubeconfig. + kubeconfigSecret := &corev1.Secret{ + ObjectMeta: metav1.ObjectMeta{ + Name: kubeconfigSecretName, + Namespace: hiveNamespace, + }, + Data: map[string][]byte{ + "config": kubeconfig, + "azure-api-key": []byte(holmesConfig.AzureAPIKey), + "azure-api-base": []byte(holmesConfig.AzureAPIBase), + "azure-api-version": []byte(holmesConfig.AzureAPIVersion), + }, + } + + _, err := hr.kubernetescli.CoreV1().Secrets(hiveNamespace).Create(ctx, kubeconfigSecret, metav1.CreateOptions{}) + if err != nil { + return fmt.Errorf("failed to create investigation kubeconfig secret: %w", err) + } + + // 1. Create the ConfigMap with Holmes toolsets config. 
+ configMap := &corev1.ConfigMap{ + ObjectMeta: metav1.ObjectMeta{ + Name: configMapName, + Namespace: hiveNamespace, + }, + Data: map[string]string{ + "config.yaml": holmesConfigYAML, + }, + } + + _, err = hr.kubernetescli.CoreV1().ConfigMaps(hiveNamespace).Create(ctx, configMap, metav1.CreateOptions{}) + if err != nil { + return fmt.Errorf("failed to create investigation configmap: %w", err) + } + + // 2. Create the investigation pod. + activeDeadlineSeconds := int64(holmesConfig.DefaultTimeout) + runAsUser := int64(1000) + pod := &corev1.Pod{ + ObjectMeta: metav1.ObjectMeta{ + Name: podName, + Namespace: hiveNamespace, + }, + Spec: corev1.PodSpec{ + AutomountServiceAccountToken: pointerutils.ToPtr(false), + ActiveDeadlineSeconds: &activeDeadlineSeconds, + RestartPolicy: corev1.RestartPolicyNever, + Containers: []corev1.Container{ + { + Name: "holmes", + Image: holmesConfig.Image, + ImagePullPolicy: corev1.PullAlways, + Command: []string{"python", "holmes_cli.py"}, + Args: []string{"ask", question, "-n", "--model=" + holmesConfig.Model, "--config=/etc/holmes/config.yaml"}, + Env: []corev1.EnvVar{ + { + Name: "AZURE_API_KEY", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: kubeconfigSecretName}, + Key: "azure-api-key", + }, + }, + }, + { + Name: "AZURE_API_BASE", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: kubeconfigSecretName}, + Key: "azure-api-base", + }, + }, + }, + { + Name: "AZURE_API_VERSION", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: kubeconfigSecretName}, + Key: "azure-api-version", + }, + }, + }, + { + Name: "KUBECONFIG", + Value: "/etc/kubeconfig/config", + }, + }, + VolumeMounts: []corev1.VolumeMount{ + { + Name: "kubeconfig", + MountPath: "/etc/kubeconfig", + ReadOnly: true, + }, + 
{ + Name: "holmes-config", + MountPath: "/etc/holmes/config.yaml", + SubPath: "config.yaml", + ReadOnly: true, + }, + { + Name: "tmp", + MountPath: "/tmp", + }, + { + Name: "holmes-cache", + MountPath: "/.holmes", + }, + }, + SecurityContext: &corev1.SecurityContext{ + RunAsUser: &runAsUser, + RunAsNonRoot: pointerutils.ToPtr(true), + AllowPrivilegeEscalation: pointerutils.ToPtr(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + }, + }, + Resources: corev1.ResourceRequirements{ + Requests: corev1.ResourceList{ + corev1.ResourceCPU: resource.MustParse("100m"), + corev1.ResourceMemory: resource.MustParse("256Mi"), + }, + Limits: corev1.ResourceList{ + corev1.ResourceCPU: resource.MustParse("1"), + corev1.ResourceMemory: resource.MustParse("2Gi"), + }, + }, + }, + }, + Volumes: []corev1.Volume{ + { + Name: "kubeconfig", + VolumeSource: corev1.VolumeSource{ + Secret: &corev1.SecretVolumeSource{ + SecretName: kubeconfigSecretName, + Items: []corev1.KeyToPath{ + { + Key: "config", + Path: "config", + }, + }, + }, + }, + }, + { + Name: "holmes-config", + VolumeSource: corev1.VolumeSource{ + ConfigMap: &corev1.ConfigMapVolumeSource{ + LocalObjectReference: corev1.LocalObjectReference{ + Name: configMapName, + }, + }, + }, + }, + { + Name: "tmp", + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{}, + }, + }, + { + Name: "holmes-cache", + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{}, + }, + }, + }, + }, + } + + _, err = hr.kubernetescli.CoreV1().Pods(hiveNamespace).Create(ctx, pod, metav1.CreateOptions{}) + if err != nil { + return fmt.Errorf("failed to create investigation pod: %w", err) + } + + // 3. Wait for the pod to be running. + err = hr.waitForPodRunning(ctx, hiveNamespace, podName, 60*time.Second) + if err != nil { + return fmt.Errorf("failed waiting for investigation pod to start: %w", err) + } + + // 4. Stream pod logs. 
+ err = hr.streamPodLogs(ctx, hiveNamespace, podName, w) + if err != nil { + return fmt.Errorf("failed to stream investigation pod logs: %w", err) + } + + return nil +} + +func (hr *clusterManager) waitForPodRunning(ctx context.Context, namespace, name string, timeout time.Duration) error { + timeoutCtx, cancel := context.WithTimeout(ctx, timeout) + defer cancel() + + return wait.PollImmediateUntil(2*time.Second, func() (bool, error) { + pod, err := hr.kubernetescli.CoreV1().Pods(namespace).Get(timeoutCtx, name, metav1.GetOptions{}) + if err != nil { + return false, fmt.Errorf("failed to get pod %s: %w", name, err) + } + + switch pod.Status.Phase { + case corev1.PodRunning, corev1.PodSucceeded: + return true, nil + case corev1.PodFailed: + reason := pod.Status.Reason + message := pod.Status.Message + if len(pod.Status.ContainerStatuses) > 0 { + cs := pod.Status.ContainerStatuses[0] + if cs.State.Terminated != nil { + reason = cs.State.Terminated.Reason + message = cs.State.Terminated.Message + } else if cs.State.Waiting != nil { + reason = cs.State.Waiting.Reason + message = cs.State.Waiting.Message + } + } + return false, fmt.Errorf("pod %s failed: reason=%s message=%s", name, reason, message) + } + + return false, nil + }, timeoutCtx.Done()) +} + +func (hr *clusterManager) streamPodLogs(ctx context.Context, namespace, name string, w io.Writer) error { + req := hr.kubernetescli.CoreV1().Pods(namespace).GetLogs(name, &corev1.PodLogOptions{ + Follow: true, + }) + + stream, err := req.Stream(ctx) + if err != nil { + return fmt.Errorf("failed to open log stream for pod %s: %w", name, err) + } + defer stream.Close() + + // Read the log stream in chunks and flush after each write so the client + // sees output in real-time instead of only when the pod exits. 
+ flusher, canFlush := w.(interface{ Flush() }) + buf := make([]byte, 4096) + for { + n, readErr := stream.Read(buf) + if n > 0 { + _, writeErr := w.Write(buf[:n]) + if writeErr != nil { + return fmt.Errorf("failed to write log stream for pod %s: %w", name, writeErr) + } + if canFlush { + flusher.Flush() + } + } + if readErr != nil { + if readErr == io.EOF { + break + } + return fmt.Errorf("failed to read log stream for pod %s: %w", name, readErr) + } + } + + return nil +} diff --git a/pkg/hive/manager.go b/pkg/hive/manager.go index 673692afc79..6200ce6bd72 100644 --- a/pkg/hive/manager.go +++ b/pkg/hive/manager.go @@ -7,6 +7,7 @@ import ( "context" "errors" "fmt" + "io" "reflect" "sort" "strings" @@ -31,6 +32,7 @@ import ( "github.com/Azure/ARO-RP/pkg/env" "github.com/Azure/ARO-RP/pkg/hive/failure" "github.com/Azure/ARO-RP/pkg/util/dynamichelper" + "github.com/Azure/ARO-RP/pkg/util/holmes" utillog "github.com/Azure/ARO-RP/pkg/util/log" ) @@ -53,6 +55,7 @@ type ClusterManager interface { GetClusterSync(ctx context.Context, oc *api.OpenShiftCluster) (*hivev1alpha1.ClusterSync, error) ListHiveK8sObjects(ctx context.Context, resource, namespace string) ([]byte, error) GetHiveK8sObject(ctx context.Context, resource, namespace, name string) ([]byte, error) + InvestigateCluster(ctx context.Context, hiveNamespace string, kubeconfig []byte, holmesConfig *holmes.HolmesConfig, question string, w io.Writer) error } type clusterManager struct { diff --git a/pkg/hive/staticresources/holmes-config.yaml b/pkg/hive/staticresources/holmes-config.yaml new file mode 100644 index 00000000000..b921e634508 --- /dev/null +++ b/pkg/hive/staticresources/holmes-config.yaml @@ -0,0 +1,128 @@ +toolsets: + # ========== ENABLED ========== + kubectl-run: + enabled: true + kubernetes/kube-prometheus-stack: + enabled: true + kubernetes/logs: + enabled: true + kubernetes/core: + enabled: true + kubernetes/live-metrics: + enabled: true + bash: + enabled: true + config: + builtin_allowlist: extended 
+ allow: + - "kubectl get" + - "kubectl describe" + - "kubectl logs" + - "kubectl top" + - "kubectl cluster-info" + - "kubectl explain" + - "kubectl api-resources" + - "kubectl version" + - "egrep" + deny: + - "kubectl delete" + - "kubectl apply" + - "kubectl create" + - "kubectl edit" + - "kubectl exec" + - "kubectl patch" + - "kubectl scale" + - "kubectl drain" + - "kubectl cordon" + - "kubectl taint" + - "kubectl debug" + - "rm" + - "oc" + + # ========== DISABLED ========== + core_investigation: + enabled: false + openshift/core: + enabled: false + openshift/logs: + enabled: false + openshift/security: + enabled: false + openshift/live-metrics: + enabled: false + runbook: + enabled: false + internet: + enabled: false + connectivity_check: + enabled: false + robusta: + enabled: false + kubernetes/krew-extras: + enabled: false + kubernetes/kube-lineage-extras: + enabled: false + aks/core: + enabled: false + aks/node-health: + enabled: false + argocd/core: + enabled: false + cilium/core: + enabled: false + hubble/observability: + enabled: false + docker/core: + enabled: false + helm/core: + enabled: false + grafana/dashboards: + enabled: false + grafana/loki: + enabled: false + grafana/tempo: + enabled: false + prometheus/metrics: + enabled: false + datadog/logs: + enabled: false + datadog/general: + enabled: false + datadog/metrics: + enabled: false + datadog/traces: + enabled: false + elasticsearch/data: + enabled: false + elasticsearch/cluster: + enabled: false + opensearch/query_assist: + enabled: false + coralogix: + enabled: false + newrelic: + enabled: false + kafka/admin: + enabled: false + rabbitmq/core: + enabled: false + notion: + enabled: false + confluence: + enabled: false + slab: + enabled: false + servicenow/tables: + enabled: false + azure/sql: + enabled: false + MongoDBAtlas: + enabled: false + database/sql: + enabled: false + inspektor-gadget/node: + enabled: false + inspektor-gadget/tcpdump: + enabled: false + kubevela/core: + enabled: false 
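The `bash` toolset hunk above ends with explicit `allow`/`deny` command lists. As a rough sketch of the semantics such a configuration implies — this is illustrative only and not HolmesGPT's actual enforcement code, which may match differently — deny entries take precedence over allow entries, and both match by command prefix:

```go
package main

import (
	"fmt"
	"strings"
)

// allowed reports whether cmd matches some allow prefix and no deny prefix.
// Deny entries are checked first, so a deny always wins over an allow.
func allowed(cmd string, allow, deny []string) bool {
	for _, d := range deny {
		if strings.HasPrefix(cmd, d) {
			return false
		}
	}
	for _, a := range allow {
		if strings.HasPrefix(cmd, a) {
			return true
		}
	}
	// Commands matching neither list are rejected by default.
	return false
}

func main() {
	allow := []string{"kubectl get", "kubectl describe", "kubectl logs"}
	deny := []string{"kubectl delete", "kubectl exec", "rm", "oc"}

	fmt.Println(allowed("kubectl get pods -A", allow, deny))  // true
	fmt.Println(allowed("kubectl delete pod x", allow, deny)) // false
	fmt.Println(allowed("rm -rf /tmp/scratch", allow, deny))  // false
}
```

Under this reading, the read-only `kubectl get`/`describe`/`logs`/`top` invocations the investigation needs pass through, while anything mutating (`kubectl delete`, `kubectl exec`, bare `rm`, or any `oc` call) is rejected before it reaches the cluster.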
diff --git a/pkg/operator/controllers/rbac/bindata.go b/pkg/operator/controllers/rbac/bindata.go index e4f8ca6b9b3..4fc4562c6c9 100644 --- a/pkg/operator/controllers/rbac/bindata.go +++ b/pkg/operator/controllers/rbac/bindata.go @@ -1,6 +1,8 @@ // Code generated for package rbac by go-bindata DO NOT EDIT. (@generated) // sources: +// staticresources/clusterrole-diagnostics.yaml // staticresources/clusterrole.yaml +// staticresources/clusterrolebinding-diagnostics.yaml // staticresources/clusterrolebinding.yaml package rbac @@ -78,6 +80,26 @@ func (fi bindataFileInfo) Sys() interface{} { return nil } +var _clusterroleDiagnosticsYaml = []byte("\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff\xa4\x56\x3d\x6f\xf3\x46\x0c\xde\xfd\x2b\x84\x77\x09\x50\x20\x0e\xba\x15\x5e\x3b\x74\xe9\x14\xa0\xdd\xe9\x13\x2d\xb1\x3e\x1d\xaf\x24\x4f\x81\xf3\xeb\x0b\xe9\x28\x3b\x8e\xe3\xfa\x85\xb2\x91\x3c\xea\xf8\x3c\xfc\x3a\x41\xa6\xbf\x51\x94\x38\xed\x1a\xd9\x43\xd8\x42\xb1\x9e\x85\xde\xc1\x88\xd3\xf6\xf8\x9b\x6e\x89\x5f\xc6\x5f\x37\x47\x4a\xed\xae\xf9\x3d\x16\x35\x94\x57\x8e\xb8\x19\xd0\xa0\x05\x83\xdd\xa6\x69\x12\x0c\xb8\x6b\xf4\xa4\x86\xc3\x0e\x84\x9f\x5b\x82\x2e\xb1\x1a\x05\xdd\x48\x89\xa8\xbb\xcd\x73\x03\x99\xfe\x10\x2e\x59\xa7\x4f\x9e\x9b\x1f\x3f\x36\x4d\x23\xa8\x5c\x24\xa0\xdb\x32\xb7\x7a\x16\x5e\x22\x77\xb3\x92\xb8\xc5\x6a\x56\x94\x91\x82\x2b\x98\xda\xcc\x94\xac\x6a\x81\xd3\x81\xba\x01\xb2\x1f\x8e\xb8\x9c\x4c\xe8\x34\xc3\xf2\x59\x9e\x08\xab\x61\xb2\x91\x63\x19\x30\x44\xa0\xa1\x1e\x09\xe6\x48\x61\xe6\x1e\x38\x99\x70\x8c\x28\xcb\x51\x05\xfa\x6f\x61\x83\x2b\x30\x10\x02\x97\x25\x56\xa4\x81\x4c\x20\x75\x73\xb0\x11\x65\xef\xcc\x3a\x34\x77\xd0\x2a\xbc\x81\x85\xfe\x36\x2b\x90\x67\x02\x9f\xf2\xd2\x62\x8e\x7c\x1a\xce\x94\x5a\xc0\x81\x93\xa2\xab\x6a\x60\x78\x28\xf1\x6c\x70\x22\xae\xaf\x80\xb1\x9f\xcd\x37\x38\xfe\xe1\xbd\x67\x5b\x38\xb9\xb2\xe2\xf6\x84\xf6\xc6\x72\xa4\xd4\x79\x8b\xdd\x46\x72\x97\xcc\x91\x02\x79\xe5\x28\x75\x82\xaa\x6b\x53\x3b\xdf\x75\xfa\xb2\xe9\x5a\x52\x29\x79\xaa\xfb\xbe\xb4\xdd\xea\x
ac\xa9\xb1\x40\x87\x77\x49\xf9\x79\x88\xe0\x2c\x6e\xbb\xb1\x5a\xab\x0c\x66\x10\xfa\x4b\xd5\x83\xd2\x65\x14\x82\x52\x2b\x34\xd6\xfe\x5c\xd3\x68\xc5\x58\x03\x44\x4a\xdd\x2d\xd0\x79\x07\x70\x32\x88\x99\xdb\xc5\x73\x75\xa8\xbb\x9b\xe5\x36\xb0\x70\x74\x7e\x93\xb4\xa7\xd4\x52\xea\x9c\x70\x5d\x3e\x17\x8f\x0f\x86\x0f\x8e\x2b\x67\x6e\xcb\x19\x93\xf6\x74\xb0\x2f\x71\x5d\x06\xb0\xee\x99\xb5\x99\xe0\x62\xf8\x20\xd4\xec\xb3\xf2\xfe\x01\x42\x4f\xe9\x51\x04\xf7\xd2\x2b\x65\x59\x1d\xae\xf7\x08\xd1\xfa\xd0\x63\x38\xae\xc4\x52\x13\xf5\x00\x8a\xd7\x70\xac\x8f\xd0\x55\x5d\x39\xa3\x80\xb1\x2c\xb3\x7f\x10\x50\x93\x12\xac\xc8\x37\xd3\x53\x91\x15\xa9\xad\xf8\x53\xb9\xba\x54\xfd\x93\x29\x33\xc7\xb5\x68\x38\x91\xb1\x4c\x5b\x30\xb0\x20\xeb\x36\xf0\xf0\xc5\x7a\x12\x1e\xd0\x7a\x2c\x3a\xbf\xa4\x1f\x9f\x1e\xbf\xa1\xda\xa6\x01\xb5\x01\x12\x74\xab\x07\x75\x49\xf9\x83\x9c\x3c\xfd\xf2\xf4\xbd\xfb\xf5\x7f\x09\x6b\xd9\x6b\x10\x9a\xd7\xf1\x55\x47\x38\xeb\xab\x66\xa1\xa4\x06\x31\xe6\x08\x6e\x58\x62\x74\x73\xdc\x95\x95\x41\x13\x0a\x7a\xff\x71\x3a\x6f\x61\xff\x61\xb9\x17\x63\x72\x4d\xaf\xfe\xf1\x5f\xaf\x7f\xba\xcf\x4b\x9d\xae\xf7\xaa\x44\x1a\xd1\x45\x41\x68\x4f\x2e\x3b\xcd\xaa\x38\xa2\xdb\x50\xff\x05\x00\x00\xff\xff\x1d\xa1\x45\xf4\xc2\x09\x00\x00") + +func clusterroleDiagnosticsYamlBytes() ([]byte, error) { + return bindataRead( + _clusterroleDiagnosticsYaml, + "clusterrole-diagnostics.yaml", + ) +} + +func clusterroleDiagnosticsYaml() (*asset, error) { + bytes, err := clusterroleDiagnosticsYamlBytes() + if err != nil { + return nil, err + } + + info := bindataFileInfo{name: "clusterrole-diagnostics.yaml", size: 0, mode: os.FileMode(0), modTime: time.Unix(0, 0)} + a := &asset{bytes: bytes, info: info} + return a, nil +} + var _clusterroleYaml = 
[]byte("\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff\xbc\x5a\x5f\x8f\xe3\xb8\x0d\x7f\x9f\x4f\x61\x5c\x1f\x0e\x28\x30\x59\xf4\xad\x98\x3e\xde\x15\x45\x81\xa2\x07\x2c\xb6\x7d\x67\x64\xc6\xe6\x45\x16\xb5\x14\x95\xd9\xf4\xd3\x17\x92\x65\xc7\xce\x3f\x67\x9c\xdb\x7d\x1a\x9b\x92\xf9\x23\x29\x8a\xff\x32\x7f\xaa\x7e\xe1\x1a\xab\x06\x1d\x0a\x28\xd6\xd5\xf6\x58\xb5\x60\xf6\x9f\x1a\x74\x35\x05\xc3\x07\x94\xa3\x01\xd3\xe2\xdf\xaa\x5f\x7f\xab\xfe\xfd\xdb\x97\xea\xef\xbf\xfe\xf3\xcb\xe6\x05\x3c\xfd\x17\x25\x10\xbb\xb7\x4a\xb6\x60\x36\x10\xb5\x65\xa1\xff\x81\x12\xbb\xcd\xfe\xaf\x61\x43\xfc\xe9\xf0\x97\x97\x3d\xb9\xfa\xad\xfa\xc5\xc6\xa0\x28\x9f\xd9\xe2\x4b\x87\x0a\x35\x28\xbc\xbd\x54\x95\x11\xcc\x1f\x7c\xa1\x0e\x83\x42\xe7\xdf\x2a\x17\xad\x7d\xa9\x2a\x07\x1d\xbe\x55\xe1\x18\x14\xbb\x37\x10\x7e\x0d\x82\x2f\x12\x2d\x86\xb7\x97\xd7\x0a\x3c\xfd\x43\x38\xfa\x90\x98\xbc\x56\x3f\xfd\xf4\x52\x55\x82\x81\xa3\x18\x2c\x34\xc3\x9d\x67\x87\x4e\x83\x82\xc6\x80\xe1\xa5\xaa\x0e\x28\xdb\xb2\xdc\xa0\xe6\xbf\x96\x82\x3e\xca\xd0\xed\xa8\xe9\xc0\x87\xfc\x8a\xae\xf6\x4c\x4e\xcb\xdb\x01\x87\x47\x4b\x1d\xa9\x80\x6b\xb0\x7f\x4f\x9a\x04\x0f\x66\x78\xe5\xba\x3c\xf9\x64\xc0\xa0\xe8\xf4\xc0\x36\x76\x68\x2c\x50\x77\x7d\xa9\x50\xb9\x1e\x1f\x14\x3b\x6f\x41\xcb\x8a\xa0\xb7\x64\xb2\x29\x0d\x3b\x15\xb6\x16\x65\x58\xea\xb5\xf8\x1a\x59\xa1\x27\x05\x94\x03\x19\x04\x63\x38\x0e\x52\x17\xda\x3d\x2b\xa5\x87\x77\x50\xd3\x3e\x66\xaf\x24\xed\x27\xcb\xcd\x25\xc7\x8b\xcf\xa1\xee\x28\x24\x67\x12\x6c\x28\xa8\x4c\x9d\xe8\x92\x71\x17\x15\x94\x5c\xf3\x8e\xdb\x96\x79\xdf\x9f\x4b\xec\x3f\xea\x95\x39\x80\xa5\xfa\xee\x9e\x15\x3a\x82\x27\xfc\xa6\xe8\x92\x9c\xe1\xa6\x70\x26\x06\xe5\x6e\x20\xd6\xb8\x23\x47\xcf\x81\x3e\x64\x13\xf0\xf4\xdc\x09\x16\x06\x28\x1b\xf6\xe8\x42\x4b\x3b\xbd\x05\x24\xf8\x35\x62\xd0\xd1\x79\x56\xa1\xe5\x5b\x74\x79\xc3\x8a\xeb\x0a\x1e\x28\x8c\xc7\x59\x03\x76\xec\x02\x16\x57\xad\xd1\x5b\x3e\x76\xe3\x85\x2b\xce\x3f\xae\xa7\x0b\x8f\xbb\x68\x0b\x61\xa5\x78\x0b\x76\x38\x09\xd1\xfb\xd6\x0f\x44\x7a\xf4\x52\x09\x2f\x70\x36\x7d\x54
\x5e\x2b\x7a\xd4\x16\x9d\x96\xb0\x73\xd3\x33\x95\xf7\xe8\xd2\x79\xe2\xfb\x19\x50\x0e\xfe\x78\x9d\xf1\x79\x2a\xb9\xe4\x1b\xd0\xee\x42\xdc\xfe\x8e\x46\xc1\x18\x0c\xe1\x84\x31\x5b\xcc\x39\x63\xb6\x76\xfd\xa3\x0f\x0b\xf6\x90\x6d\x85\x2d\x6e\xc9\xd5\xe4\x9a\x70\x4e\x2f\xde\x7b\xbe\x63\x58\x7a\x38\x59\x7d\x44\xac\xe1\xf5\x8a\xc9\x7e\x88\x59\x26\xda\x0a\x06\x15\x32\xcf\x04\xc7\xa8\x1c\x0c\x58\x72\xcd\x25\x52\x16\x89\x9d\x82\xf5\x5c\x0f\x3b\x9f\x71\xf6\x01\xea\xb1\x83\x9f\x23\xbe\x56\x1d\x98\x96\x1c\x3e\x2d\xc8\x36\x93\x2f\x51\x85\xdd\xef\xbc\xed\xb1\xca\xc3\x1a\xee\x91\x6c\xbd\xa0\x60\xde\x73\x0a\x7a\x85\xf0\xbd\x01\x1f\x8d\x7a\x06\x45\x69\x97\x82\x12\xde\x49\xd2\x93\x4d\xd4\xb8\xec\x8c\x39\xa7\xad\xd4\xc2\x58\x8e\xf5\xc6\xa1\xbe\xb3\xec\x17\xfd\x83\x63\xed\x85\x0e\xa0\x48\xfe\xa9\xe4\x91\x59\x19\xc1\x3a\x85\x61\xb0\x4b\xc0\xe3\xc6\xf0\xa4\xb6\x59\xe6\xe5\x3a\xa1\xaf\x27\xc2\x95\x64\x71\xe6\x37\xe3\xa5\x61\x9f\xda\x10\x96\x19\xf1\xd0\xf7\x19\x61\xc0\x0e\x63\xe8\xac\x5d\x28\x4f\x3b\x04\x8d\x82\xcd\x58\x10\x53\x07\x0d\xa6\x7a\x02\x9d\x7a\xb6\x64\x68\xba\x50\x1e\xdd\x4e\x20\xa8\x44\x93\xbe\x1d\x68\x29\x2e\x0d\x6c\xcb\x89\x9e\x17\xee\x9c\xb4\x29\x8f\x45\xe2\x36\x96\xbb\xe7\x85\x53\x18\x1d\x5f\xbe\x0d\xc0\xc1\xb4\x58\xc7\xf5\x17\xbf\x68\xbe\x74\xc8\xfd\x2e\x63\xa9\xe6\x77\x67\x19\xea\x99\xdd\x52\x01\x2b\x0e\xac\xe5\xc6\x92\xdb\xcf\xd6\x2e\x08\x8e\xcb\x25\x39\xb7\xbe\xb7\xb1\xa1\x39\xe9\x6b\x24\xb3\x0f\x0a\xa2\x33\xf2\x11\x3a\x1b\xa0\xf3\xf7\x33\xda\x7d\xad\x53\x49\xe8\x2d\xb8\xac\x7a\x36\xf6\x82\x0d\x3c\xd7\xe5\xe0\x0c\x3b\x87\x46\xe9\x40\x7a\x34\x2d\x9a\xfd\x6a\x29\x58\x6a\x72\xf7\x0b\x12\x8b\x70\xbf\xcb\xbc\x03\x30\xf6\xda\x37\xb9\x8f\xed\xa6\x5d\x5f\xe3\xf7\x4d\xea\x6d\x88\xa1\x87\x5d\xc1\x7a\x67\xf9\xbd\x9c\xd5\xe6\xd4\x4a\xdc\x42\x4a\xbb\xd3\x7d\xe8\x60\xb8\x27\xc4\x42\x7a\xb4\x78\x40\xfb\x47\xf4\x6a\x2d\xda\x6e\xc1\x4b\xd2\x16\xd3\x82\xa8\xa0\xe7\x40\xca\x32\xdc\xd4\x72\x85\x6f\x6d\x58\x21\x4e\x0e\x3a\x0b\xf2\x4c\x03\x53\x7e\x54\x41\xe8\xbe\x3b\x60\x46\x19\xb1\x97\xb2\xeb
\x07\xf9\x2a\x34\x13\x8d\xca\xdb\xc3\x75\x6d\xfe\xa8\x74\xbf\xc7\x47\x2f\xff\xb4\x30\xc9\x0c\xbc\x44\xb7\x3a\xe8\x96\x74\xf0\x28\x78\xed\x82\xa0\x61\x59\x5b\x0d\xa5\xfb\x62\x1c\x6d\x8c\x33\xbb\xab\x00\x25\xae\xbd\x82\x2a\x98\x36\xf5\x84\xaf\x4f\x8f\x18\x12\x28\x1f\xdc\x86\xe5\x4a\x15\x8d\x59\xff\x1d\x09\xbe\x83\xb5\x61\x42\x23\x3f\x7d\xfb\xca\xab\x63\x5f\xa9\x8c\x17\x4c\x5b\x76\xb5\x08\x56\xdb\x31\x96\x8f\xf4\xf9\x0b\x9e\xb2\xd0\x98\x3d\xe6\x6b\xeb\xe5\x9c\x85\xa7\x65\x67\x54\x20\x87\x22\xd1\x29\x75\x38\x75\xce\xd3\xb0\x63\x4a\xdd\xc7\x2d\x5a\xd4\x29\x69\x86\xeb\x99\xed\x15\xf2\x5a\x95\x30\x75\x60\xb7\x33\xc2\x64\x56\xc9\x77\x7d\xfa\x92\x33\x35\x0b\x13\xab\xa0\x2c\x39\x50\x8c\x35\x5b\xa1\x94\x7a\x6f\xe4\xb0\x56\x37\x76\x39\x66\xbb\x66\x63\x58\x90\xc3\xc6\x70\x77\xa5\x4e\xb5\x28\xda\x81\x4b\xa1\x66\x6a\xf5\x29\x7d\x34\x41\xe1\x39\xa6\x89\xed\x29\x63\x74\xa8\x2d\xc6\x70\x41\xc8\xe3\x87\x5e\xbd\x7e\x3e\x37\xe3\xa1\x2d\x38\xce\x7b\xd6\x46\xa8\x49\xd3\xf1\x48\x84\xea\x6f\xab\x70\xd4\x41\xad\xe1\x43\xbf\xa7\xe7\x44\x48\x96\xbe\x75\xd8\x25\x8e\x1a\x0b\x63\x6d\x7d\xb5\xd2\x9e\x94\xea\x6b\x04\xe1\x1a\x6f\x8a\x30\x5c\xc1\x51\x84\x15\x00\x0f\x1a\xf9\x5a\xb3\x73\xd6\xab\xcd\x3a\x1c\x13\xa8\x16\x1a\x7b\xa5\xb3\x28\x71\xea\x77\x4c\xa0\xe0\xc0\x87\x96\xf5\x7c\xca\x7f\x6a\x85\x50\x4d\x7d\xd9\x03\xf5\xf2\xcd\x3b\xa1\x72\x26\x67\x9c\x52\x04\x3a\x6b\xde\x12\xe9\xb4\x6d\x76\x29\xd2\xd2\xac\xb3\x29\xa4\x2b\x57\x79\xf0\xf9\x59\x53\x35\x1a\xf2\x0c\x72\xa4\xdf\xc0\x2d\x97\xc9\xc0\x2c\x74\x3c\x79\xac\xe1\x6e\xa4\x30\xa0\x60\xb9\x29\xb4\xe9\xf9\x15\x61\x66\x8d\x2a\xb9\xa0\x60\x73\xea\x29\x1a\xd9\x6e\x7a\xb0\x03\xa6\x61\x57\xd3\xc9\x4d\x06\x72\x93\xc5\x9b\x4b\xd7\x6b\x1a\xb7\xc1\x08\xf9\x27\x22\xa3\x07\xb3\x4f\xc6\xda\x3c\xa6\x77\xd9\xde\x81\xa3\xdd\xc2\xd0\xe0\x12\x0a\x65\xc7\xe9\xe4\xcc\x52\x7e\x9f\xec\xf4\xc2\x3b\x5a\xdd\x31\x66\x1f\x3f\x5e\xed\x0a\x6b\x0a\x12\xb3\xe5\xb6\xb1\x6e\x86\x22\x21\xa5\x36\x34\x31\xf5\x1f\xcf\x85\x1f\xdf\x4f\x4d\x37\xcb\x83\xf7\xb2\xb3\x4c\x2d\x56\xc3\xe5\x36\x65
\x11\x2a\xef\xba\x3e\xf1\xb9\x71\x68\x1f\x61\xbc\x52\xf8\xfc\x6b\xe4\xe2\x08\xc9\x5b\xc2\x7a\x98\x98\x9f\xff\x8e\xf9\xb0\x13\x3e\x82\xf5\x51\x90\x3b\xaa\xdd\xfc\x39\xfc\x47\xfe\x48\x70\x4f\xbe\x94\xfc\x17\xc7\xf4\x51\xd7\xf2\x2f\xf3\x9e\x8f\x37\x6c\x6b\xb0\xfa\xd4\x73\xaf\xf4\x18\x66\x0b\x4f\x25\xfe\x21\x40\x6c\xc8\xf5\x03\xb4\x25\xfb\x81\x6b\x10\xac\x65\xf3\x4c\x09\x3b\xa2\x7e\x18\xec\xf4\x6d\xce\xff\xdf\x52\x1a\x0d\x2a\x40\xab\x07\x3c\x43\xe1\xb1\x29\xa9\xf6\xa6\xbd\xcb\x3f\x2c\x0c\x75\xca\xa4\xe2\x3b\x5b\x29\x85\xc9\x95\xa5\xb5\x22\x2e\x48\x76\x5e\x66\x05\x3a\xf5\x37\xa9\xb4\xea\x3f\x37\xe0\xc1\x90\xd2\xbc\x29\xb9\xd4\xe3\xd4\x82\xaf\x14\x77\xf8\x0f\x8e\xa5\x5f\x3f\x84\xf7\x28\xc3\xe6\x5c\x56\xb8\xa1\xfe\xb8\x4f\x5d\x2b\x57\x74\xb8\xf4\x93\xcc\x24\x45\x97\x0f\x56\x82\x45\xdf\x08\xd4\xb8\xe9\x8b\xbb\x25\xd8\xb2\xfb\xa9\x90\x11\xc3\xe2\xff\x38\x4c\x8a\x2f\xca\xf5\xfa\xe8\x0d\xe9\xe3\xef\x87\x9b\xb6\x14\xc0\x63\x07\xde\x97\x68\xbf\x34\x94\x7b\x6f\x51\x10\xb6\x1c\x75\x61\x7a\x44\xfe\x34\x3e\xe0\x03\x8a\xed\x31\x72\xfc\x20\x2f\x98\xea\xd8\x0f\xc5\x2b\xc7\xee\x73\x81\xf8\xcf\xe7\x7f\x95\xdd\x3f\xff\xf9\xe7\xcb\xcf\xff\x1f\x00\x00\xff\xff\xc3\x85\xb5\xcb\x69\x26\x00\x00") func clusterroleYamlBytes() ([]byte, error) { @@ -98,6 +120,26 @@ func clusterroleYaml() (*asset, error) { return a, nil } +var _clusterrolebindingDiagnosticsYaml = []byte("\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff\x84\xcd\xb1\xaa\xc2\x50\x0c\x80\xe1\xfd\x3c\x45\x5e\xa0\xbd\xdc\x4d\xce\xa8\x83\x7b\x41\xf7\xb4\x27\xd6\xd8\x36\x29\x49\x8e\xa0\x4f\x2f\x82\x9b\x60\xf7\xff\xe7\xc3\x95\xcf\x64\xce\x2a\x19\xac\xc7\xa1\xc5\x1a\x57\x35\x7e\x62\xb0\x4a\x3b\xed\xbc\x65\xfd\xbb\xff\xa7\x89\xa5\x64\x38\xcc\xd5\x83\xac\xd3\x99\xf6\x2c\x85\x65\x4c\x0b\x05\x16\x0c\xcc\x09\x40\x70\xa1\x0c\xfe\xf0\xa0\x25\xa3\x69\x53\x18\x47\x51\x0f\x1e\x3c\x99\xce\xd4\xd1\xe5\xdd\xe1\xca\x47\xd3\xba\xfe\x30\x13\xc0\x17\xb9\x25\x78\xed\x6f\x34\x84\xe7\xd4\x7c\xe6\x93\x93\x6d\x5d\xaf\x00\x00\x00\xff\xff\xe6\x2c\x81\x7a\x03\x01\x00\x00") 
+ +func clusterrolebindingDiagnosticsYamlBytes() ([]byte, error) { + return bindataRead( + _clusterrolebindingDiagnosticsYaml, + "clusterrolebinding-diagnostics.yaml", + ) +} + +func clusterrolebindingDiagnosticsYaml() (*asset, error) { + bytes, err := clusterrolebindingDiagnosticsYamlBytes() + if err != nil { + return nil, err + } + + info := bindataFileInfo{name: "clusterrolebinding-diagnostics.yaml", size: 0, mode: os.FileMode(0), modTime: time.Unix(0, 0)} + a := &asset{bytes: bytes, info: info} + return a, nil +} + var _clusterrolebindingYaml = []byte("\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff\x7c\xcd\xbd\x0a\xc2\x40\x0c\x07\xf0\xfd\x9e\x22\x2f\xd0\x8a\x9b\xdc\xa8\x83\x7b\x41\xf7\xb4\x8d\x1a\xdb\x26\x47\x92\x13\xf4\xe9\x45\x70\x93\x3a\xff\x3f\x7e\x58\xf8\x4c\xe6\xac\x92\xc1\x7a\x1c\x5a\xac\x71\x53\xe3\x17\x06\xab\xb4\xd3\xce\x5b\xd6\xcd\x63\x9b\x26\x96\x31\xc3\x61\xae\x1e\x64\x9d\xce\xb4\x67\x19\x59\xae\x69\xa1\xc0\x11\x03\x73\x02\x10\x5c\x28\x83\x3f\x3d\x68\xc9\x68\xda\xb8\x51\x32\x9d\xa9\xa3\xcb\x27\xc7\xc2\x47\xd3\x5a\xfe\x58\x09\xe0\x87\x5a\x7b\xf6\xda\xdf\x69\x08\xcf\xa9\xf9\x8e\x4e\x4e\xb6\xd6\x7e\x07\x00\x00\xff\xff\xc4\xb6\x1b\x05\xeb\x00\x00\x00") func clusterrolebindingYamlBytes() ([]byte, error) { @@ -170,8 +212,10 @@ func AssetNames() []string { // _bindata is a table, holding each asset generator, mapped to its name. 
var _bindata = map[string]func() (*asset, error){
-	"clusterrole.yaml":        clusterroleYaml,
-	"clusterrolebinding.yaml": clusterrolebindingYaml,
+	"clusterrole-diagnostics.yaml":        clusterroleDiagnosticsYaml,
+	"clusterrole.yaml":                    clusterroleYaml,
+	"clusterrolebinding-diagnostics.yaml": clusterrolebindingDiagnosticsYaml,
+	"clusterrolebinding.yaml":             clusterrolebindingYaml,
 }
 
 // AssetDir returns the file names below a certain
@@ -217,8 +261,10 @@ type bintree struct {
 }
 
 var _bintree = &bintree{nil, map[string]*bintree{
-	"clusterrole.yaml":        {clusterroleYaml, map[string]*bintree{}},
-	"clusterrolebinding.yaml": {clusterrolebindingYaml, map[string]*bintree{}},
+	"clusterrole-diagnostics.yaml":        {clusterroleDiagnosticsYaml, map[string]*bintree{}},
+	"clusterrole.yaml":                    {clusterroleYaml, map[string]*bintree{}},
+	"clusterrolebinding-diagnostics.yaml": {clusterrolebindingDiagnosticsYaml, map[string]*bintree{}},
+	"clusterrolebinding.yaml":             {clusterrolebindingYaml, map[string]*bintree{}},
 }}
 
 // RestoreAsset restores an asset under the given directory
@@ -231,7 +277,7 @@ func RestoreAsset(dir, name string) error {
 	if err != nil {
 		return err
 	}
-	err = os.MkdirAll(_filePath(dir, filepath.Dir(name)), os.FileMode(0o755))
+	err = os.MkdirAll(_filePath(dir, filepath.Dir(name)), os.FileMode(0755))
 	if err != nil {
 		return err
 	}
diff --git a/pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml b/pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml
new file mode 100644
index 00000000000..44f5993ba58
--- /dev/null
+++ b/pkg/operator/controllers/rbac/staticresources/clusterrole-diagnostics.yaml
@@ -0,0 +1,183 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: system:aro-diagnostics
+rules:
+- apiGroups:
+  - ""
+  resources:
+  - pods
+  - pods/log
+  - nodes
+  - services
+  - endpoints
+  - configmaps
+  - events
+  - namespaces
+  - persistentvolumeclaims
+  - replicationcontrollers
+  - resourcequotas
+  - serviceaccounts
+  - limitranges
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - apps
+  resources:
+  - deployments
+  - daemonsets
+  - statefulsets
+  - replicasets
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - batch
+  resources:
+  - jobs
+  - cronjobs
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - networking.k8s.io
+  resources:
+  - networkpolicies
+  - ingresses
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - policy
+  resources:
+  - poddisruptionbudgets
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - storage.k8s.io
+  resources:
+  - storageclasses
+  - persistentvolumes
+  - volumeattachments
+  - csinodes
+  - csidrivers
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - autoscaling
+  resources:
+  - horizontalpodautoscalers
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - rbac.authorization.k8s.io
+  resources:
+  - roles
+  - rolebindings
+  - clusterroles
+  - clusterrolebindings
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - apps.openshift.io
+  resources:
+  - deploymentconfigs
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - route.openshift.io
+  resources:
+  - routes
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - machine.openshift.io
+  resources:
+  - machines
+  - machinesets
+  - machinehealthchecks
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - config.openshift.io
+  resources:
+  - clusterversions
+  - clusteroperators
+  - infrastructures
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - machineconfiguration.openshift.io
+  resources:
+  - machineconfigs
+  - machineconfigpools
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - monitoring.coreos.com
+  resources:
+  - prometheusrules
+  - servicemonitors
+  - alertmanagers
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - operator.openshift.io
+  resources:
+  - '*'
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - operators.coreos.com
+  resources:
+  - subscriptions
+  - clusterserviceversions
+  - installplans
+  - operatorgroups
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - metrics.k8s.io
+  resources:
+  - nodes
+  - pods
+  verbs:
+  - get
+  - list
+- nonResourceURLs:
+  - /healthz
+  - /livez
+  - /readyz
+  - /version
+  - /metrics
+  verbs:
+  - get
diff --git a/pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml b/pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml
new file mode 100644
index 00000000000..cbe674395bb
--- /dev/null
+++ b/pkg/operator/controllers/rbac/staticresources/clusterrolebinding-diagnostics.yaml
@@ -0,0 +1,11 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: system:aro-diagnostics
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: system:aro-diagnostics
+subjects:
+- kind: User
+  name: system:aro-diagnostics
diff --git a/pkg/util/holmes/config.go b/pkg/util/holmes/config.go
new file mode 100644
index 00000000000..75fd6df73be
--- /dev/null
+++ b/pkg/util/holmes/config.go
@@ -0,0 +1,142 @@
+package holmes
+
+// Copyright (c) Microsoft Corporation.
+// Licensed under the Apache License 2.0.
+
+import (
+	"context"
+	"fmt"
+	"os"
+	"regexp"
+	"strconv"
+
+	"github.com/Azure/ARO-RP/pkg/util/azureclient/azuresdk/azsecrets"
+)
+
+// modelPattern validates the model name contains only safe characters
+// (alphanumeric, slashes, dots, colons, hyphens, underscores).
+var modelPattern = regexp.MustCompile(`^[a-zA-Z0-9/.:_-]+$`)
+
+const (
+	// Key Vault secret names for Holmes configuration.
+	holmesAzureAPIKeySecretName  = "holmes-azure-api-key"
+	holmesAzureAPIBaseSecretName = "holmes-azure-api-base"
+)
+
+// HolmesConfig holds configuration for HolmesGPT investigation pods.
+type HolmesConfig struct {
+	Image                       string
+	AzureAPIKey                 string
+	AzureAPIBase                string
+	AzureAPIVersion             string
+	Model                       string
+	DefaultTimeout              int
+	MaxConcurrentInvestigations int
+}
+
+// NewHolmesConfigFromEnv loads all config from environment variables.
+// Used in local development mode (RP_MODE=development).
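The `modelPattern` guard defined above exists because the model string is later interpolated into the investigation pod's command line; anything outside the allow-list is rejected before it can reach a shell. A standalone sketch of that check (same pattern as in `config.go`; the sample inputs are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// Mirrors modelPattern from config.go: only alphanumerics, slashes,
// dots, colons, hyphens and underscores are allowed.
var modelPattern = regexp.MustCompile(`^[a-zA-Z0-9/.:_-]+$`)

func main() {
	for _, m := range []string{"azure/gpt-5.2", "azure/gpt; rm -rf /"} {
		fmt.Printf("%q allowed=%v\n", m, modelPattern.MatchString(m))
	}
	// "azure/gpt-5.2" allowed=true
	// "azure/gpt; rm -rf /" allowed=false
}
```

Note that the anchored `^…+$` also rejects the empty string, which is why `Validate` needs no separate empty-model check.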
+func NewHolmesConfigFromEnv() (*HolmesConfig, error) {
+	c, err := newHolmesConfigBase()
+	if err != nil {
+		return nil, err
+	}
+	c.AzureAPIKey = os.Getenv("HOLMES_AZURE_API_KEY")
+	c.AzureAPIBase = os.Getenv("HOLMES_AZURE_API_BASE")
+	if err := c.Validate(); err != nil {
+		return nil, err
+	}
+	return c, nil
+}
+
+// NewHolmesConfig loads non-secret config from env vars and secrets from Key Vault.
+// Used in production mode.
+func NewHolmesConfig(ctx context.Context, serviceKeyvault azsecrets.Client) (*HolmesConfig, error) {
+	apiKeyResp, err := serviceKeyvault.GetSecret(ctx, holmesAzureAPIKeySecretName, "", nil)
+	if err != nil {
+		return nil, fmt.Errorf("failed to get %s from keyvault: %w", holmesAzureAPIKeySecretName, err)
+	}
+	if apiKeyResp.Value == nil {
+		return nil, fmt.Errorf("keyvault secret %s has no value", holmesAzureAPIKeySecretName)
+	}
+
+	apiBaseResp, err := serviceKeyvault.GetSecret(ctx, holmesAzureAPIBaseSecretName, "", nil)
+	if err != nil {
+		return nil, fmt.Errorf("failed to get %s from keyvault: %w", holmesAzureAPIBaseSecretName, err)
+	}
+	if apiBaseResp.Value == nil {
+		return nil, fmt.Errorf("keyvault secret %s has no value", holmesAzureAPIBaseSecretName)
+	}
+
+	c, err := newHolmesConfigBase()
+	if err != nil {
+		return nil, err
+	}
+	c.AzureAPIKey = *apiKeyResp.Value
+	c.AzureAPIBase = *apiBaseResp.Value
+	if err := c.Validate(); err != nil {
+		return nil, err
+	}
+	return c, nil
+}
+
+// newHolmesConfigBase loads the non-secret configuration from environment variables.
+func newHolmesConfigBase() (*HolmesConfig, error) {
+	defaultTimeout, err := envOrDefaultInt("HOLMES_DEFAULT_TIMEOUT", 600)
+	if err != nil {
+		return nil, err
+	}
+	maxConcurrent, err := envOrDefaultInt("HOLMES_MAX_CONCURRENT", 20)
+	if err != nil {
+		return nil, err
+	}
+	return &HolmesConfig{
+		Image:                       envOrDefault("HOLMES_IMAGE", "quay.io/haoran/holmesgpt:latest"),
+		AzureAPIVersion:             envOrDefault("HOLMES_AZURE_API_VERSION", "2025-04-01-preview"),
+		Model:                       envOrDefault("HOLMES_MODEL", "azure/gpt-5.2"),
+		DefaultTimeout:              defaultTimeout,
+		MaxConcurrentInvestigations: maxConcurrent,
+	}, nil
+}
+
+// Validate checks that required configuration values are set.
+func (c *HolmesConfig) Validate() error {
+	if c.AzureAPIKey == "" {
+		return fmt.Errorf("holmes Azure API key is required")
+	}
+	if c.AzureAPIBase == "" {
+		return fmt.Errorf("holmes Azure API base is required")
+	}
+	if c.Image == "" {
+		return fmt.Errorf("holmes image is required")
+	}
+	if !modelPattern.MatchString(c.Model) {
+		return fmt.Errorf("holmes model name contains invalid characters")
+	}
+	if c.DefaultTimeout <= 0 {
+		return fmt.Errorf("holmes default timeout must be greater than 0")
+	}
+	if c.MaxConcurrentInvestigations <= 0 {
+		return fmt.Errorf("holmes max concurrent investigations must be greater than 0")
+	}
+	return nil
+}
+
+func envOrDefault(key, defaultValue string) string {
+	if v := os.Getenv(key); v != "" {
+		return v
+	}
+	return defaultValue
+}
+
+func envOrDefaultInt(key string, defaultValue int) (int, error) {
+	v := os.Getenv(key)
+	if v == "" {
+		return defaultValue, nil
+	}
+	i, err := strconv.Atoi(v)
+	if err != nil {
+		return 0, fmt.Errorf("invalid integer value for %s: %w", key, err)
+	}
+	return i, nil
+}
diff --git a/pkg/util/holmes/config_test.go b/pkg/util/holmes/config_test.go
new file mode 100644
index 00000000000..79569ac5860
--- /dev/null
+++ b/pkg/util/holmes/config_test.go
@@ -0,0 +1,166 @@
+package holmes
+
+// Copyright (c) Microsoft Corporation.
+// Licensed under the Apache License 2.0.
+
+import (
+	"context"
+	"fmt"
+	"testing"
+
+	"github.com/stretchr/testify/require"
+	"go.uber.org/mock/gomock"
+
+	"github.com/Azure/azure-sdk-for-go/sdk/security/keyvault/azsecrets"
+
+	mock_azsecrets "github.com/Azure/ARO-RP/pkg/util/mocks/azureclient/azuresdk/azsecrets"
+)
+
+func TestNewHolmesConfigFromEnv(t *testing.T) {
+	tests := []struct {
+		name    string
+		envVars map[string]string
+		wantErr bool
+	}{
+		{
+			name: "valid config with all required env vars",
+			envVars: map[string]string{
+				"HOLMES_AZURE_API_KEY":  "test-key",
+				"HOLMES_AZURE_API_BASE": "https://test.openai.azure.com",
+			},
+		},
+		{
+			name: "missing API key returns error",
+			envVars: map[string]string{
+				"HOLMES_AZURE_API_BASE": "https://test.openai.azure.com",
+			},
+			wantErr: true,
+		},
+		{
+			name: "missing API base returns error",
+			envVars: map[string]string{
+				"HOLMES_AZURE_API_KEY": "test-key",
+			},
+			wantErr: true,
+		},
+		{
+			name: "custom values override defaults",
+			envVars: map[string]string{
+				"HOLMES_AZURE_API_KEY":     "custom-key",
+				"HOLMES_AZURE_API_BASE":    "https://custom.openai.azure.com",
+				"HOLMES_IMAGE":             "custom-image:v1",
+				"HOLMES_MODEL":             "azure/gpt-4o",
+				"HOLMES_DEFAULT_TIMEOUT":   "300",
+				"HOLMES_MAX_CONCURRENT":    "5",
+				"HOLMES_AZURE_API_VERSION": "2024-01-01",
+			},
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			// Clear all Holmes env vars, then set test values.
+			for _, key := range []string{
+				"HOLMES_AZURE_API_KEY", "HOLMES_AZURE_API_BASE", "HOLMES_IMAGE",
+				"HOLMES_MODEL", "HOLMES_DEFAULT_TIMEOUT", "HOLMES_MAX_CONCURRENT",
+				"HOLMES_AZURE_API_VERSION",
+			} {
+				t.Setenv(key, "")
+			}
+			for k, v := range tt.envVars {
+				t.Setenv(k, v)
+			}
+
+			cfg, err := NewHolmesConfigFromEnv()
+			if tt.wantErr {
+				require.Error(t, err)
+				return
+			}
+			require.NoError(t, err)
+			require.Equal(t, tt.envVars["HOLMES_AZURE_API_KEY"], cfg.AzureAPIKey)
+			require.Equal(t, tt.envVars["HOLMES_AZURE_API_BASE"], cfg.AzureAPIBase)
+
+			if tt.envVars["HOLMES_IMAGE"] != "" {
+				require.Equal(t, tt.envVars["HOLMES_IMAGE"], cfg.Image)
+			}
+			if tt.envVars["HOLMES_MODEL"] != "" {
+				require.Equal(t, tt.envVars["HOLMES_MODEL"], cfg.Model)
+			}
+			if tt.envVars["HOLMES_DEFAULT_TIMEOUT"] != "" {
+				require.Equal(t, 300, cfg.DefaultTimeout)
+			}
+			if tt.envVars["HOLMES_MAX_CONCURRENT"] != "" {
+				require.Equal(t, 5, cfg.MaxConcurrentInvestigations)
+			}
+		})
+	}
+}
+
+func TestNewHolmesConfig(t *testing.T) {
+	ctx := context.Background()
+
+	apiKey := "keyvault-api-key"
+	apiBase := "https://keyvault.openai.azure.com"
+
+	tests := []struct {
+		name    string
+		mocks   func(*mock_azsecrets.MockClient)
+		wantErr bool
+	}{
+		{
+			name: "reads secrets from keyvault",
+			mocks: func(m *mock_azsecrets.MockClient) {
+				m.EXPECT().GetSecret(ctx, holmesAzureAPIKeySecretName, "", nil).
+					Return(azsecrets.GetSecretResponse{
+						Secret: azsecrets.Secret{Value: &apiKey},
+					}, nil)
+				m.EXPECT().GetSecret(ctx, holmesAzureAPIBaseSecretName, "", nil).
+					Return(azsecrets.GetSecretResponse{
+						Secret: azsecrets.Secret{Value: &apiBase},
+					}, nil)
+			},
+		},
+		{
+			name: "API key not found in keyvault returns error",
+			mocks: func(m *mock_azsecrets.MockClient) {
+				m.EXPECT().GetSecret(ctx, holmesAzureAPIKeySecretName, "", nil).
+					Return(azsecrets.GetSecretResponse{}, fmt.Errorf("secret not found"))
+			},
+			wantErr: true,
+		},
+		{
+			name: "API base not found in keyvault returns error",
+			mocks: func(m *mock_azsecrets.MockClient) {
+				m.EXPECT().GetSecret(ctx, holmesAzureAPIKeySecretName, "", nil).
+					Return(azsecrets.GetSecretResponse{
+						Secret: azsecrets.Secret{Value: &apiKey},
+					}, nil)
+				m.EXPECT().GetSecret(ctx, holmesAzureAPIBaseSecretName, "", nil).
+					Return(azsecrets.GetSecretResponse{}, fmt.Errorf("secret not found"))
+			},
+			wantErr: true,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			controller := gomock.NewController(t)
+			defer controller.Finish()
+
+			mockKV := mock_azsecrets.NewMockClient(controller)
+			tt.mocks(mockKV)
+
+			cfg, err := NewHolmesConfig(ctx, mockKV)
+			if tt.wantErr {
+				require.Error(t, err)
+				return
+			}
+			require.NoError(t, err)
+			require.Equal(t, apiKey, cfg.AzureAPIKey)
+			require.Equal(t, apiBase, cfg.AzureAPIBase)
+			// Non-secret values should still come from env/defaults
+			require.NotEmpty(t, cfg.Image)
+			require.NotEmpty(t, cfg.Model)
+		})
+	}
+}
diff --git a/pkg/util/holmes/kubeconfig.go b/pkg/util/holmes/kubeconfig.go
new file mode 100644
index 00000000000..1041a053493
--- /dev/null
+++ b/pkg/util/holmes/kubeconfig.go
@@ -0,0 +1,41 @@
+package holmes
+
+// Copyright (c) Microsoft Corporation.
+// Licensed under the Apache License 2.0.
+
+import (
+	"fmt"
+	"strings"
+
+	clientcmdv1 "k8s.io/client-go/tools/clientcmd/api/v1"
+
+	"sigs.k8s.io/yaml"
+)
+
+// MakeExternalKubeconfig takes an internal kubeconfig (api-int.*) and converts
+// it to use the external API endpoint (api.*) with insecure-skip-tls-verify.
+// This is needed because the Hive AKS cluster cannot resolve api-int.* DNS
+// names (Azure Private DNS is only linked to the cluster's VNet).
+func MakeExternalKubeconfig(internalKubeconfig []byte) ([]byte, error) {
+	var cfg clientcmdv1.Config
+	err := yaml.Unmarshal(internalKubeconfig, &cfg)
+	if err != nil {
+		return nil, fmt.Errorf("failed to unmarshal kubeconfig: %w", err)
+	}
+
+	for i := range cfg.Clusters {
+		originalServer := cfg.Clusters[i].Cluster.Server
+		rewrittenServer := strings.Replace(originalServer, "https://api-int.", "https://api.", 1)
+		cfg.Clusters[i].Cluster.Server = rewrittenServer
+
+		if rewrittenServer != originalServer {
+			// The self-signed CA does not cover the external endpoint's cert,
+			// so skip TLS verification. The client cert is still used for
+			// authentication (mTLS for identity, not for server verification).
+			cfg.Clusters[i].Cluster.InsecureSkipTLSVerify = true
+			cfg.Clusters[i].Cluster.CertificateAuthorityData = nil
+		}
+	}
+
+	return yaml.Marshal(cfg)
+}
diff --git a/pkg/util/holmes/kubeconfig_test.go b/pkg/util/holmes/kubeconfig_test.go
new file mode 100644
index 00000000000..e1770063748
--- /dev/null
+++ b/pkg/util/holmes/kubeconfig_test.go
@@ -0,0 +1,98 @@
+package holmes
+
+// Copyright (c) Microsoft Corporation.
+// Licensed under the Apache License 2.0.
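The core of `MakeExternalKubeconfig` above is a single anchored prefix replacement: the TLS stripping only happens when the replacement actually changed the server URL, so already-external kubeconfigs pass through untouched. A standalone sketch of that rewrite on a bare server URL (sample host; not the function's full kubeconfig handling):

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteServer swaps the internal API prefix for the external one, the
// same way MakeExternalKubeconfig does for each cluster entry. The
// `changed` result is what gates the insecure-skip-tls-verify fallback.
func rewriteServer(server string) (rewritten string, changed bool) {
	rewritten = strings.Replace(server, "https://api-int.", "https://api.", 1)
	return rewritten, rewritten != server
}

func main() {
	s, changed := rewriteServer("https://api-int.test.example.com:6443")
	fmt.Println(s, changed) // https://api.test.example.com:6443 true

	s, changed = rewriteServer("https://api.test.example.com:6443")
	fmt.Println(s, changed) // https://api.test.example.com:6443 false
}
```

Because the replacement is anchored to the `https://api-int.` prefix with a count of 1, a hostname that merely contains `api-int` elsewhere is left alone.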
+
+import (
+	"testing"
+
+	"github.com/stretchr/testify/require"
+
+	clientcmdv1 "k8s.io/client-go/tools/clientcmd/api/v1"
+
+	"sigs.k8s.io/yaml"
+)
+
+func TestMakeExternalKubeconfig(t *testing.T) {
+	internalConfig := &clientcmdv1.Config{
+		Clusters: []clientcmdv1.NamedCluster{
+			{
+				Name: "test-cluster",
+				Cluster: clientcmdv1.Cluster{
+					Server:                   "https://api-int.test.example.com:6443",
+					CertificateAuthorityData: []byte("some-ca-data"),
+				},
+			},
+		},
+		AuthInfos: []clientcmdv1.NamedAuthInfo{
+			{
+				Name: "system:aro-diagnostics",
+				AuthInfo: clientcmdv1.AuthInfo{
+					ClientCertificateData: []byte("cert-data"),
+					ClientKeyData:         []byte("key-data"),
+				},
+			},
+		},
+		Contexts: []clientcmdv1.NamedContext{
+			{
+				Name: "system:aro-diagnostics",
+				Context: clientcmdv1.Context{
+					Cluster:  "test-cluster",
+					AuthInfo: "system:aro-diagnostics",
+				},
+			},
+		},
+		CurrentContext: "system:aro-diagnostics",
+	}
+
+	internalKubeconfig, err := yaml.Marshal(internalConfig)
+	require.NoError(t, err)
+
+	externalKubeconfig, err := MakeExternalKubeconfig(internalKubeconfig)
+	require.NoError(t, err)
+
+	var got clientcmdv1.Config
+	err = yaml.Unmarshal(externalKubeconfig, &got)
+	require.NoError(t, err)
+
+	// Server should be rewritten from api-int.* to api.*
+	require.Equal(t, "https://api.test.example.com:6443", got.Clusters[0].Cluster.Server)
+
+	// CA data should be stripped
+	require.Nil(t, got.Clusters[0].Cluster.CertificateAuthorityData)
+
+	// InsecureSkipTLSVerify should be set
+	require.True(t, got.Clusters[0].Cluster.InsecureSkipTLSVerify)
+
+	// Client credentials should be preserved
+	require.Equal(t, []byte("cert-data"), got.AuthInfos[0].AuthInfo.ClientCertificateData)
+	require.Equal(t, []byte("key-data"), got.AuthInfos[0].AuthInfo.ClientKeyData)
+}
+
+func TestMakeExternalKubeconfigNoRewriteNeeded(t *testing.T) {
+	// If the server already uses api.* (not api-int.*), it should not be changed
+	config := &clientcmdv1.Config{
+		Clusters: []clientcmdv1.NamedCluster{
+			{
+				Name: "test-cluster",
+				Cluster: clientcmdv1.Cluster{
+					Server:                   "https://api.test.example.com:6443",
+					CertificateAuthorityData: []byte("some-ca-data"),
+				},
+			},
+		},
+	}
+
+	kubeconfig, err := yaml.Marshal(config)
+	require.NoError(t, err)
+
+	result, err := MakeExternalKubeconfig(kubeconfig)
+	require.NoError(t, err)
+
+	var got clientcmdv1.Config
+	err = yaml.Unmarshal(result, &got)
+	require.NoError(t, err)
+
+	// Server should remain unchanged
+	require.Equal(t, "https://api.test.example.com:6443", got.Clusters[0].Cluster.Server)
+}
diff --git a/pkg/util/mocks/hive/hive.go b/pkg/util/mocks/hive/hive.go
index 58cd5325359..ceb8189833e 100644
--- a/pkg/util/mocks/hive/hive.go
+++ b/pkg/util/mocks/hive/hive.go
@@ -11,6 +11,7 @@ package mock_hive
 import (
 	context "context"
+	io "io"
 	reflect "reflect"
 
 	gomock "go.uber.org/mock/gomock"
@@ -22,6 +23,7 @@ import (
 	v1alpha1 "github.com/openshift/hive/apis/hiveinternal/v1alpha1"
 
 	api "github.com/Azure/ARO-RP/pkg/api"
+	holmes "github.com/Azure/ARO-RP/pkg/util/holmes"
 )
 
 // MockClusterManager is a mock of ClusterManager interface.
@@ -150,6 +152,20 @@ func (mr *MockClusterManagerMockRecorder) Install(ctx, sub, doc, version, custom
 	return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Install", reflect.TypeOf((*MockClusterManager)(nil).Install), ctx, sub, doc, version, customManifests)
 }
 
+// InvestigateCluster mocks base method.
+func (m *MockClusterManager) InvestigateCluster(ctx context.Context, hiveNamespace string, kubeconfig []byte, holmesConfig *holmes.HolmesConfig, question string, w io.Writer) error {
+	m.ctrl.T.Helper()
+	ret := m.ctrl.Call(m, "InvestigateCluster", ctx, hiveNamespace, kubeconfig, holmesConfig, question, w)
+	ret0, _ := ret[0].(error)
+	return ret0
+}
+
+// InvestigateCluster indicates an expected call of InvestigateCluster.
+func (mr *MockClusterManagerMockRecorder) InvestigateCluster(ctx, hiveNamespace, kubeconfig, holmesConfig, question, w any) *gomock.Call {
+	mr.mock.ctrl.T.Helper()
+	return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "InvestigateCluster", reflect.TypeOf((*MockClusterManager)(nil).InvestigateCluster), ctx, hiveNamespace, kubeconfig, holmesConfig, question, w)
+}
+
 // IsClusterDeploymentReady mocks base method.
 func (m *MockClusterManager) IsClusterDeploymentReady(ctx context.Context, doc *api.OpenShiftClusterDocument) (bool, error) {
 	m.ctrl.T.Helper()