From 9318963a9ff145132b57caaee653d0eccefcac1d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Tue, 31 Mar 2026 14:47:35 +0200 Subject: [PATCH 01/12] docs: add CI Medic Guide for e2e test failure investigation Comprehensive guide covering Prow job anatomy, artifact navigation, job lifecycle phases, failure triage workflows, and the AI Test Triager. Includes an internal companion (local.md) with Vault, ReportPortal, DevLake, and artifact unredaction details. Co-Authored-By: Claude Opus 4.6 --- docs/e2e-tests/CI-medic-guide.md | 725 +++++++++++++++++++++++++++++++ 1 file changed, 725 insertions(+) create mode 100644 docs/e2e-tests/CI-medic-guide.md diff --git a/docs/e2e-tests/CI-medic-guide.md b/docs/e2e-tests/CI-medic-guide.md new file mode 100644 index 0000000000..bd83042e3f --- /dev/null +++ b/docs/e2e-tests/CI-medic-guide.md @@ -0,0 +1,725 @@ +# CI Medic Guide + +A practical guide for investigating test failures in RHDH nightly jobs and PR checks. + +## Table of Contents + +- [Overview](#overview) +- [Anatomy of a Prow Job](#anatomy-of-a-prow-job) +- [Where to Find Logs and Artifacts](#where-to-find-logs-and-artifacts) +- [Job Lifecycle and Failure Points](#job-lifecycle-and-failure-points) +- [Job Types Reference](#job-types-reference) +- [Identifying Failure Types](#identifying-failure-types) +- [Common Failure Patterns (Cheat Sheet)](#common-failure-patterns-cheat-sheet) +- [Useful Links and Tools](#useful-links-and-tools) +- [AI Test Triager](#ai-test-triager-nightly-test-alerts) + +--- + +## Overview + +### What is a CI Medic? + +The CI medic is a **weekly rotating role** responsible for maintaining the health of PR checks and nightly E2E test jobs. When your rotation starts, you'll receive a Slack message with your responsibilities. + +### Core Responsibilities + +1. **Monitor PR Checks**: Keep an eye on the status and the queue to ensure they remain passing. +2. 
**Monitor Nightly Jobs**: Watch the `#rhdh-e2e-alerts` Slack channel and dedicated release channels.
3. **Triage Failures**:
   - Use the **AI Test Triager** (`@Nightly Test Alerts` Slack app) as your starting point -- it automatically analyzes failed nightly jobs and provides root cause analysis, screenshot interpretation, and links to similar Jira issues. You can also invoke it manually by tagging `@Nightly Test Alerts` in Slack.
   - Check [Jira](https://redhat.atlassian.net/jira/dashboards/21388#v=1&d=21388&rf=acef7fac-ada0-4363-b3fb-9aad7ae021f0&static=f0579c09-f63e-45aa-87b9-05e042eee707&g=60993:view@0a7ec296-c2fd-4ddc-b7cb-64de0540e8ba) for existing issues with the **`ci-fail`** label.
   - If it's a **new issue**, create a bug and assign it to the responsible team or person. The AI triager can also create Jira bugs directly.
   - If the failure **blocks PRs**, mark the test as skipped (`test.fixme`) until it is fixed.
4. **Monitor Infrastructure**: Watch `#announce-testplatform` for general OpenShift CI outages and issues. Get help in `#forum-ocp-testplatform`.
5. **Quality Cabal Call**: Attend the call and report on the current health of CI.

### Where Do Alerts Come In?

- **Main branch**: `#rhdh-e2e-alerts` Slack channel
- **Release branches**: Dedicated channels like `#rhdh-e2e-alerts-1-8`, `#rhdh-e2e-alerts-1-9`, etc.
- **Infrastructure announcements**: `#announce-testplatform` (general OpenShift CI status)
- **Getting help**: `#forum-ocp-testplatform` (ask questions about CI platform issues)

Each alert includes links to the job logs, artifacts, and a summary of which deployments/tests passed or failed. Check the bookmarks/folders in the `#rhdh-e2e-alerts` channel for additional resources.
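Before digging through artifacts by hand, a quick first pass over a downloaded `build-log.txt` usually tells you which class of failure you are dealing with. A minimal sketch, assuming the log has already been downloaded locally; the helper name and its signature list (drawn from the failure patterns described later in this guide) are illustrative, not part of the CI tooling:

```shell
# classify_failure: hypothetical first-pass triage helper (not part of .ci/pipelines).
# Greps a downloaded build-log.txt for a few failure signatures from this guide;
# the signature list is illustrative and deliberately not exhaustive.
classify_failure() {
  local log="$1"
  if grep -q "failed to acquire cluster lease" "$log"; then
    echo "infra: cluster pool exhausted -- re-trigger the job"
  elif grep -qE "CrashLoopBackOff|ImagePullBackOff|Failed to reach Backstage" "$log"; then
    echo "deploy: RHDH did not come up -- check pod_logs/ and Kubernetes events"
  elif grep -q "Failed to install subscription" "$log"; then
    echo "infra/deploy: OperatorHub issue -- re-trigger and check OLM"
  else
    echo "test: deployment looks healthy -- read the JUnit XML and Playwright report"
  fi
}
```

The classification only narrows the search; always confirm against the actual artifacts before filing or closing an issue.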
### Two Types of CI Jobs

| | Nightly (Periodic) Jobs | PR Check (Presubmit) Jobs |
|---|---|---|
| **Trigger** | Scheduled (usually once per night) | On PR creation/update, or `/ok-to-test` |
| **Scope** | Full suite: showcase, RBAC, runtime, sanity plugins, localization, auth providers | Smaller scope: showcase + RBAC only |
| **Platforms** | OCP (multiple versions), AKS, EKS, GKE, OSD-GCP | OCP only (single version) |
| **Install methods** | Helm and Operator | Helm only |
| **Alert channel** | `#rhdh-e2e-alerts` / `#rhdh-e2e-alerts-{version}` | PR status checks on GitHub |

**Triggering jobs on a PR**: All nightly job variants can also be triggered on a PR by commenting `/test <job-name>`. Use `/test ?` to list all available jobs for that PR. This is useful for verifying a fix against a specific platform or install method before merging.

---

## Anatomy of a Prow Job

### Job Naming Convention

Nightly jobs follow this pattern:

```
periodic-ci-redhat-developer-rhdh-{BRANCH}-e2e-{PLATFORM}-{INSTALL_METHOD}[-{VARIANT}]-nightly
```

Breaking it down:

| Segment | Values | Meaning |
|---------|--------|---------|
| `{BRANCH}` | `main`, `release-1.9`, `release-1.10` | Git branch being tested |
| `{PLATFORM}` | `ocp`, `ocp-v4-{VER}`, `aks`, `eks`, `gke`, `osd-gcp` | Target platform (OCP versions rotate as new releases come out) |
| `{INSTALL_METHOD}` | `helm`, `operator` | Installation method |
| `{VARIANT}` | `auth-providers`, `upgrade` | Optional -- specialized test scenario |

Examples:

- `periodic-ci-redhat-developer-rhdh-main-e2e-ocp-helm-nightly` -- OCP nightly with Helm on main
- `periodic-ci-redhat-developer-rhdh-release-1.9-e2e-aks-helm-nightly` -- AKS nightly for release 1.9
- `periodic-ci-redhat-developer-rhdh-main-e2e-ocp-operator-nightly` -- OCP nightly with Operator
- `periodic-ci-redhat-developer-rhdh-main-e2e-ocp-operator-auth-providers-nightly` -- Auth provider tests
- `periodic-ci-redhat-developer-rhdh-main-e2e-ocp-helm-upgrade-nightly` -- Upgrade scenario tests

PR check jobs use the `pull-ci-` prefix instead of `periodic-ci-`.

### How the Pipeline Works

[Prow](https://docs.ci.openshift.org/docs/architecture/prow/) is the CI scheduler. It triggers [ci-operator](https://docs.ci.openshift.org/docs/architecture/ci-operator/), which orchestrates the entire workflow:

```
Prow (scheduler)
 └── ci-operator (orchestrator)                                       ── openshift/release repo
      ├── 1. Claim/provision cluster:                                 ── (ci-operator config
      │      - OCP: ephemeral cluster from Hive                       ──  + step registry)
      │      - AKS/EKS: provisioned on demand via Mapt
      │      - GKE: long-running shared cluster
      ├── 2. Clone rhdh repo & wait for RHDH image (if being built)   ── openshift/release repo
      ├── 3. Run test step in e2e-runner image                        ── rhdh repo
      │      ├── a. Install operators (Tekton, etc.)                  ── (.ci/pipelines/
      │      ├── b. Deploy RHDH (Helm or Operator)                    ──  openshift-ci-tests.sh)
      │      ├── c. Wait for deployment health check
      │      ├── d. Run Playwright tests
      │      └── e. Collect artifacts
      ├── 4. Run post-steps                                           ── openshift/release repo
      │      (send Slack alert, collect must-gather)                  ──  (step registry)
      └── 5. Release cluster
```

The test steps (2 and 3 above) run inside the [`e2e-runner`](https://quay.io/repository/rhdh-community/rhdh-e2e-runner?tab=tags) image, which is built by a [GitHub Actions workflow](../../.github/workflows/push-e2e-runner.yaml) and mirrored into OpenShift CI.

Each phase can fail independently. Knowing *where* in this pipeline the failure occurred is the first step in triage.

---

## Where to Find Logs and Artifacts

### Navigating the Prow UI

When you click on a failed job (from a Slack alert or the Prow dashboard), you land on the **Spyglass** view.
This page shows: + +- **Job metadata**: branch, duration, result +- **Build log**: the top-level `build-log.txt` (ci-operator output) +- **JUnit results**: parsed test results if available (if Playwright ran and test cases failed) +- **Artifacts link**: link to the full GCS artifact tree + +### Monitoring a Running PR Check in Real Time + +While a PR check is running, you can monitor its live progress, logs, and system resource usage directly in the OpenShift CI cluster console. + +**How to find the link:** + +1. Open the Prow job page for the PR check (e.g., from the GitHub PR status check "Details" link). The URL looks like: + ``` + https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/redhat-developer_rhdh/{PR_NUMBER}/{JOB_NAME}/{BUILD_ID} + ``` +2. In the **build log**, look for a line near the top like: + ``` + Using namespace https://console.build08.ci.openshift.org/k8s/cluster/projects/ci-op-XXXXXXXX + ``` +3. Click that link to open the OpenShift console for the CI namespace where the job is running. + +**What you can see in the CI namespace:** + +- **Pods**: All pods running for the job (test container, sidecar containers, etc.) +- **Pod logs**: Live streaming logs from each container +- **Events**: Kubernetes events (scheduling, image pulls, failures) +- **Resource usage**: CPU and memory metrics for the running pods +- **Terminal**: You can open a terminal into a running pod for live debugging + +This is especially useful when: +- A job is hanging and you want to see what it's doing right now +- You need to check pod resource consumption (OOM suspicion) +- You want to watch deployment progress in real time rather than waiting for artifacts + +**Logging into the claimed cluster (OCP jobs):** While a job is executing, you can also log into the ephemeral OCP cluster where RHDH is being deployed and tested. 
Use the [`ocp-cluster-claim-login.sh`](../../.ci/pipelines/ocp-cluster-claim-login.sh) script: + +```bash +# Provide the Prow job URL +.ci/pipelines/ocp-cluster-claim-login.sh "https://prow.ci.openshift.org/view/gs/..." +``` + +This gives you direct `oc` access to the cluster, allowing you to inspect pods, check logs, describe resources, and debug issues live. See [Cluster Access](#cluster-access-ocp-jobs-only) for details. + +**Prerequisite**: You must be a member of the `openshift` GitHub organization. Request access at [DevServices GitHub Access Request](https://devservices.dpp.openshift.com/support/github_access_request/). For cluster login, you also need to be in the `rhdh-pool-admins` [Rover group](https://rover.redhat.com/groups/search?q=rhdh-pool-admins). + +### Artifact Directory Structure + +``` +artifacts/ +├── ci-operator.log # ci-operator orchestration log +├── ci-operator-step-graph.json # Step execution graph with timing +├── {TEST_NAME}/ # e.g., e2e-ocp-helm-nightly/ +│ ├── redhat-developer-rhdh-{STEP}/ # Main test step +│ │ ├── build-log.txt # Full output of openshift-ci-tests.sh +│ │ ├── finished.json # Exit code and timing +│ │ └── artifacts/ # Test-generated artifacts +│ │ ├── reporting/ # Status files consumed by the Slack reporter (`Nightly Test Alerts`) +│ │ ├── showcase/ # Per-project artifacts +│ │ │ ├── junit-results-showcase.xml +│ │ │ ├── test-log.html # Playwright output (colorized) +│ │ │ ├── playwright-report/ # Interactive HTML report +│ │ │ ├── test-results/ # Videos, traces per test +│ │ │ └── pod_logs/ # Logs from all pods +│ │ ├── showcase-rbac/ # Same structure as above +│ │ ├── showcase-runtime/ +│ │ ├── showcase-sanity-plugins/ +│ │ ├── showcase-localization-fr/ +│ │ ├── showcase-localization-it/ +│ │ └── showcase-localization-ja/ +│ ├── gather-must-gather/ # Cluster diagnostics +│ └── redhat-developer-rhdh-send-alert/ # Slack notification step (`Nightly Test Alerts`) +├── build-resources/ # Build pod info +│ ├── pods.json +│ 
└── events.json +└── clone-log.txt # Repo cloning output +``` + +### Key Files to Check (In Order) + +1. **`build-log.txt`** (in test step) -- Full script output. Search for `❌` or `Error` to find failures. +2. **Playwright HTML report** -- Detailed test results with screenshots and videos. +3. **`pod_logs/`** -- Pod logs from the RHDH deployment (only collected on failure). + +### How to View the Playwright HTML Report + +The Playwright report is in `artifacts/{project}/`. To view it: + +Open `index.html` in a browser from the GCS artifacts. The report contains per-test pass/fail status with duration, screenshots on failure, video recordings of each failed test, and [trace files](https://playwright.dev/docs/trace-viewer). + +--- + +## Job Lifecycle and Failure Points + +### Phase 1: Cluster Provisioning + +**What happens**: ci-operator requests a cluster from a pool (OCP) or provisions one via cloud APIs (AKS/EKS/GKE). + +**OCP cluster pools** (ephemeral, AWS us-east-2): RHDH uses dedicated Hive cluster pools with the `rhdh` prefix. You can find the current list by filtering for `rhdh` in the [existing cluster pools](https://docs.ci.openshift.org/how-tos/cluster-claim/#existing-cluster-pools) page. See also [`.ci/pipelines/README.md`](../../.ci/pipelines/README.md) for which pool is used by which job. + +**What can go wrong**: +- Cluster pool exhausted (no available clusters) +- Cluster claim timeout +- Cluster in unhealthy state + +**How to tell**: +- **OCP**: The job shows status `error` (not `failure`) in Prow. Check `build-log.txt` at the top level for cluster provisioning errors. +- **AKS/EKS**: Look for the `create` step in the Prow job artifacts — this is where Mapt provisions the cloud cluster. If it failed, the cluster was never created. + +**Action**: Re-trigger the job. This is purely infrastructure. + +### Phase 2: Repository Cloning and Test Runner Image + +**What happens**: ci-operator clones the repo. 
The test runner image ([`quay.io/rhdh-community/rhdh-e2e-runner`](https://quay.io/repository/rhdh-community/rhdh-e2e-runner?tab=tags)) is mirrored into OpenShift CI and used to run all test steps starting from `openshift-ci-tests.sh`. The image is built by a [GitHub Actions workflow](../../.github/workflows/push-e2e-runner.yaml) from [`.ci/images/Dockerfile`](../../.ci/images/Dockerfile) and pushed to Quay on every push to `main` or `release-*` branches. + +**What can go wrong**: +- Git clone failures (network/GitHub issues) +- Image mirror delay or failure (new image not yet available in CI) + +**How to tell**: Check `clone-log.txt` for clone errors. Check `build-resources/builds.json` for image issues. + +**Action**: Usually transient -- re-trigger. If the Dockerfile or GitHub Actions workflow changed recently, check the [workflow runs](https://github.com/redhat-developer/rhdh/actions/workflows/push-e2e-runner.yaml) to verify the image was built and pushed successfully. + +### Phase 3: Cluster Setup (Operators and Prerequisites) + +**What happens**: The [test script](../../.ci/pipelines/openshift-ci-tests.sh) installs required operators and infrastructure (see [operators.sh](../../.ci/pipelines/lib/operators.sh)): +- OpenShift Pipelines (Tekton) operator +- Crunchy PostgreSQL operator +- Orchestrator infrastructure (conditionally, see [orchestrator.sh](../../.ci/pipelines/lib/orchestrator.sh)) + +**What can go wrong**: +- Operator installation timeout (OperatorHub/Marketplace issues) +- CRD not becoming available +- Tekton webhook deployment not ready + +**How to tell**: Search `build-log.txt` for: +- `Failed to install subscription` +- Timeout waiting for operator CRDs +- `Tekton` or `pipeline` related errors early in the log + +**Action**: Usually infrastructure -- re-trigger. If operators were recently upgraded, investigate compatibility. + +### Phase 4: RHDH Deployment + +**What happens**: RHDH is deployed via Helm chart or Operator CR. 
Health checks poll the Backstage URL. + +**Helm deployment flow** (see [helm.sh](../../.ci/pipelines/lib/helm.sh)): +1. Create namespace, RBAC resources, ConfigMaps (see [config.sh](../../.ci/pipelines/lib/config.sh)) +2. Deploy Redis cache +3. Deploy PostgreSQL (for RBAC namespace) +4. Deploy RHDH via `helm upgrade --install` +5. Poll health endpoint (up to 30 attempts, 30 seconds apart) via [testing.sh](../../.ci/pipelines/lib/testing.sh) + +**Operator deployment flow** (see [operator.sh](../../.ci/pipelines/install-methods/operator.sh)): +1. Install RHDH Operator +2. Wait for `backstages.rhdh.redhat.com` CRD (300s timeout) +3. Create ConfigMaps for dynamic plugins +4. Apply Backstage CR ([`rhdh-start.yaml`](../../.ci/pipelines/resources/rhdh-operator/rhdh-start.yaml) or [`rhdh-start-rbac.yaml`](../../.ci/pipelines/resources/rhdh-operator/rhdh-start-rbac.yaml)) +5. Poll health endpoint + +**What can go wrong**: +- Helm chart errors (invalid values, missing CRDs) +- Pod stuck in `CrashLoopBackOff` (bad config, missing secrets, image pull failure) +- Health check timeout (`Failed to reach Backstage after N attempts`) +- PostgreSQL operator fails to create user secret (`postgress-external-db-pguser-janus-idp`) + +**How to tell**: Search `build-log.txt` for: +- `CrashLoopBackOff` -- pod is crash-looping +- `Failed to reach Backstage` -- health check timeout +- `helm upgrade` failures +- `Crunchy Postgres operator failed to create the user` -- PostgreSQL setup issue +- Check `pod_logs/` for application-level errors + +**Action**: Check pod logs and events in artifacts. May be a config issue (real bug) or transient infra (re-trigger). + +### Phase 5: Test Execution + +**What happens**: Playwright tests run inside the test container against the deployed RHDH instance (see [testing.sh](../../.ci/pipelines/lib/testing.sh)). 
+ +```bash +yarn playwright test --project="${playwright_project}" +``` + +Tests are configured in [`playwright.config.ts`](../../e2e-tests/playwright.config.ts) with: +- **Timeout**: 90 seconds per test +- **Retries**: 2 on CI (1 for auth-providers) +- **Workers**: 3 parallel +- **Viewport**: 1920x1080 + +Project names are defined in [`projects.json`](../../e2e-tests/playwright/projects.json) (single source of truth) and loaded by CI via [`playwright-projects.sh`](../../.ci/pipelines/playwright-projects.sh). + +**What can go wrong**: +- Individual test failures (assertions, timeouts, element not found) +- Authentication/login failures (Keycloak issues) +- API timeouts (external service dependencies) +- Flaky tests (pass on retry but show up in JUnit XML as failures) + +**How to tell**: This is the most common scenario. Look at: +- `junit-results-{project}.xml` -- which tests failed +- Playwright HTML report -- detailed failure info with screenshots/videos +- `test-log.html` -- full Playwright console output + +**Important**: The Playwright exit code is the source of truth. Exit code `0` means all tests ultimately passed (even if some were retried). JUnit XML may still report initial failures for retried tests. + +**Action**: Review the specific test failures. Check if the failure is: +- **Flaky**: Passed on retry -- file a flaky test ticket +- **Consistent**: Fails across retries -- real bug, investigate further +- **Broad**: Many tests fail in the same way -- likely a deployment/config issue, not individual test bugs + +### Phase 6: Artifact Collection and Reporting + +**What happens**: Test results, pod logs, screenshots, and videos are collected. Status files are written (see [reporting.sh](../../.ci/pipelines/reporting.sh) and [test-run-tracker.sh](../../.ci/pipelines/lib/test-run-tracker.sh)). A Slack alert is sent via the [send-alert step](https://github.com/openshift/release/tree/master/ci-operator/step-registry/redhat-developer/rhdh/send/alert). 
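The status files written by this phase are what the Slack reporter consumes, and they are also handy when summarizing a downloaded run locally. A minimal sketch (hypothetical helper; the file names match those described under "Identifying Failure Types" below, but see `reporting.sh` for the authoritative layout):

```shell
# summarize_run: hypothetical helper that reads the reporting status files
# for one downloaded test-project directory and prints a one-line verdict.
# File names follow the STATUS_* files referenced in this guide.
summarize_run() {
  local dir="$1"
  if [ "$(cat "$dir/STATUS_FAILED_TO_DEPLOY.txt" 2>/dev/null)" = "true" ]; then
    echo "deployment failed -- triage as a deployment failure"
  elif [ "$(cat "$dir/STATUS_TEST_FAILED.txt" 2>/dev/null)" = "true" ]; then
    echo "tests failed: $(cat "$dir/STATUS_NUMBER_OF_TEST_FAILED.txt" 2>/dev/null) failing test(s)"
  else
    echo "passed"
  fi
}
```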
+ +**What can go wrong**: Rarely fails, but if it does, you may not get artifacts or Slack notification. Check the Prow UI directly. + +--- + +## Job Types Reference + +### OCP Nightly (`ocp-nightly`) + +The most comprehensive nightly job. Runs on OpenShift using ephemeral cluster claims. See [`ocp-nightly.sh`](../../.ci/pipelines/jobs/ocp-nightly.sh). + +**Namespaces**: `showcase-ci-nightly`, `showcase-rbac-nightly`, `postgress-external-db-nightly`, plus a runtime namespace for `showcase-runtime` tests + +**Test suites run (in order)**: +1. **Standard deployment tests** (`showcase`, `showcase-rbac`) -- core functionality with and without RBAC +2. **Runtime config change tests** (`showcase-runtime`) -- tests that modify RHDH configuration at runtime +3. **Sanity plugins check** (`showcase-sanity-plugins`) -- validates plugin loading and basic functionality +4. **Localization tests** (`showcase-localization-fr`, `showcase-localization-it`, `showcase-localization-ja`) -- UI translations + +**OSD-GCP variant**: When the job name contains `osd-gcp`, orchestrator is disabled and localization tests are skipped. + +### OCP Operator (`ocp-operator`) + +Same as OCP nightly but deploys RHDH using the Operator instead of Helm. See [`ocp-operator.sh`](../../.ci/pipelines/jobs/ocp-operator.sh). + +**Namespaces**: `showcase`, `showcase-rbac`, `showcase-runtime` (when runtime tests are enabled) + +**Test suites**: `showcase-operator`, `showcase-operator-rbac` + +**Key differences**: +- Installs RHDH Operator and waits for `backstages.rhdh.redhat.com` CRD (300s timeout) +- Uses Backstage CR (`rhdh-start.yaml`) instead of Helm release +- Orchestrator workflows currently disabled (tracked in RHDHBUGS-2184) +- Runtime config tests currently commented out (tracked in RHDHBUGS-2608) + +### OCP PR Check (`ocp-pull`) + +Runs on every PR that modifies e2e test code. Smaller scope for faster feedback. See [`ocp-pull.sh`](../../.ci/pipelines/jobs/ocp-pull.sh). 
+ +**Namespaces**: `showcase`, `showcase-rbac` + +**Test suites**: `showcase`, `showcase-rbac` only + +**Key differences**: +- No runtime, sanity plugin, or localization tests +- No orchestrator infrastructure setup +- Deploys test Backstage customization provider + +### Auth Providers (`auth-providers`) + +Tests authentication provider integrations. Has a completely different deployment approach. See [`auth-providers.sh`](../../.ci/pipelines/jobs/auth-providers.sh). + +**Namespace**: `showcase-auth-providers` (dedicated) + +**Release name**: `rhdh-auth-providers` + +**Providers tested**: +- OIDC via Red Hat Backstage Keycloak (RHBK) +- Microsoft OAuth2 +- GitHub authentication +- LDAP / Active Directory (may be commented out) + +**Key differences**: +- Uses RHDH **Operator** for deployment (not Helm) +- TypeScript-based test configuration (not Bash scripts) -- see [auth-providers test directory](../../e2e-tests/playwright/e2e/auth-providers/) +- Dedicated values file: [`values_showcase-auth-providers.yaml`](../../.ci/pipelines/value_files/values_showcase-auth-providers.yaml) +- Only **1 retry** (vs 2 for other projects) -- due to complex auth setup/teardown +- Dedicated logs folder: `e2e-tests/auth-providers-logs` +- Requires specific plugins: `keycloak-dynamic`, `github-org-dynamic`, `msgraph-dynamic`, `rbac` + +### Upgrade (`upgrade`) + +Tests upgrading RHDH from a previous version to the current one. See [`upgrade.sh`](../../.ci/pipelines/jobs/upgrade.sh). + +**Namespace**: `showcase-upgrade-nightly` + +**Flow**: +1. Dynamically determine the previous release version +2. Deploy RHDH at the previous version +3. Deploy orchestrator workflows on the previous version +4. Upgrade to the current version +5. Run upgrade-specific Playwright tests + +**Common failures**: Version detection issues, database migration failures during upgrade, backward compatibility problems. + +### AKS Helm / AKS Operator + +Tests on Azure Kubernetes Service. 
See [`aks-helm.sh`](../../.ci/pipelines/jobs/aks-helm.sh) / [`aks-operator.sh`](../../.ci/pipelines/jobs/aks-operator.sh). + +**Namespaces**: `showcase-k8s-ci-nightly`, `showcase-rbac-k8s-ci-nightly` + +**Test suites**: `showcase-k8s`, `showcase-rbac-k8s` + +**Platform specifics**: +- Uses Azure Spot VMs -- pods may be preempted mid-test (tolerations/affinity patches via [`aks-spot-patch.yaml`](../../.ci/pipelines/cluster/aks/patch/aks-spot-patch.yaml)) +- Ingress via Azure Web App Routing controller (`webapprouting.kubernetes.azure.com`) -- see [`aks-operator-ingress.yaml`](../../.ci/pipelines/cluster/aks/manifest/aks-operator-ingress.yaml) +- Gets LoadBalancer IP from `app-routing-system` namespace (`nginx` service) +- Image pull secrets from Red Hat registry required + +**Common failures**: +- Spot VM preemption causing pod evictions +- LoadBalancer IP not obtained (check `app-routing-system` namespace) +- Azure API throttling +- Image pull failures from Red Hat registry + +### EKS Helm / EKS Operator + +Tests on AWS Elastic Kubernetes Service. See [`eks-helm.sh`](../../.ci/pipelines/jobs/eks-helm.sh) / [`eks-operator.sh`](../../.ci/pipelines/jobs/eks-operator.sh). AWS utilities in [`aws.sh`](../../.ci/pipelines/cluster/eks/aws.sh). + +**Namespaces**: `showcase-k8s-ci-nightly`, `showcase-rbac-k8s-ci-nightly` + +**Test suites**: `showcase-k8s`, `showcase-rbac-k8s` + +**Platform specifics** (DNS/cert logic in [`aws.sh`](../../.ci/pipelines/cluster/eks/aws.sh)): +- **Dynamic DNS**: Generates domain names (`eks-ci-{N}.{region}.{parent-domain}`), tries up to 50 numbers +- **AWS Certificate Manager**: Requests/retrieves SSL certificates per domain. DNS validation with Route53. 
+- **ALB ingress controller**: AWS Application Load Balancer with SSL redirect -- see [`eks-operator-ingress.yaml`](../../.ci/pipelines/cluster/eks/manifest/eks-operator-ingress.yaml) +- **External DNS**: Automatically creates Route53 records from ingress annotations + +**Network setup flow**: +1. Generate unique domain name and reserve in Route53 +2. Request certificate from ACM, wait for DNS validation (up to 30 minutes) +3. Deploy with ALB ingress, get LoadBalancer hostname +4. Update Route53 CNAME to point to ALB +5. Verify DNS resolution (30 attempts, 15 second intervals) + +**Common failures**: +- Domain number exhaustion (50 limit) +- Certificate issuance delays or validation failures (ACM) +- DNS propagation delays (can take 15-30 minutes) +- Route53 API throttling +- ALB creation/deletion race conditions + +**Cleanup**: Route53 DNS records are deleted after test completion. + +### GKE Helm / GKE Operator + +Tests on Google Kubernetes Engine. See [`gke-helm.sh`](../../.ci/pipelines/jobs/gke-helm.sh) / [`gke-operator.sh`](../../.ci/pipelines/jobs/gke-operator.sh). GCP utilities in [`gcloud.sh`](../../.ci/pipelines/cluster/gke/gcloud.sh). 
+ +**Namespaces**: `showcase-k8s-ci-nightly`, `showcase-rbac-k8s-ci-nightly` + +**Test suites**: `showcase-k8s`, `showcase-rbac-k8s` + +**Platform specifics** (cert logic in [`gcloud.sh`](../../.ci/pipelines/cluster/gke/gcloud.sh)): +- Uses a **long-running cluster** (not ephemeral like OCP) +- Pre-provisioned static IP: `rhdh-static-ip` +- Google-managed SSL certificates via `gcloud` +- GCE ingress class with FrontendConfig for SSL policy and HTTPS redirect -- see [`frontend-config.yaml`](../../.ci/pipelines/cluster/gke/manifest/frontend-config.yaml) and [`gke-operator-ingress.yaml`](../../.ci/pipelines/cluster/gke/manifest/gke-operator-ingress.yaml) +- Ingress annotation: `ingress.gcp.kubernetes.io/pre-shared-cert` + +**Common failures**: +- SSL certificate creation delays (CA issuance timing) +- Static IP already in use or unavailable +- GCP quota limits on certificates/IPs +- Cloud Load Balancer propagation delays +- FrontendConfig not applying (timing issues) + +--- + +## Identifying Failure Types + +### Infrastructure Failure + +The job never got to run tests. Something went wrong with the CI platform itself. + +**Indicators**: +- Prow shows the job as `error` (red circle) rather than `failure` (red X) +- Failure is in `build-log.txt` (top level), not in the test step +- `ci-operator.log` shows provisioning or setup errors +- No test artifacts exist at all + +**Where to look**: +- Top-level `build-log.txt` +- `ci-operator.log` +- `ci-operator-step-graph.json` -- shows which step failed + +**Common causes**: +- Cluster pool exhaustion +- Cloud provider API failures (AKS/EKS/GKE auth, quota) +- Operator marketplace down +- Network/DNS issues at the CI level +- Image registry unavailable + +**Action**: Re-trigger the job. If it persists across multiple runs, escalate to CI platform team. + +### Deployment Failure + +The cluster was provisioned, but RHDH failed to deploy or start properly. 
+ +**Indicators**: +- `STATUS_FAILED_TO_DEPLOY.txt` contains `true` for one or more namespaces +- `build-log.txt` (test step) shows deployment errors before any test execution +- `pod_logs/` contain application crash logs +- No JUnit XML or Playwright report exists for that namespace + +**Where to look**: +- Test step `build-log.txt` -- search for `CrashLoopBackOff`, `Failed to reach Backstage`, `helm upgrade` errors +- `pod_logs/` -- check RHDH container logs for startup errors +- Kubernetes events -- look for `ImagePullBackOff`, `FailedScheduling`, etc. + +**Common causes**: +- Bad configuration in ConfigMaps (see [`resources/config_map/`](../../.ci/pipelines/resources/config_map/)) or values files (see [`value_files/`](../../.ci/pipelines/value_files/)) +- Missing secrets (especially PostgreSQL user secret for RBAC) +- Image pull failures (wrong tag, registry auth, rate limiting) +- Resource constraints (OOM, CPU limits) +- Operator CRD not available in time + +**Action**: Investigate the specific error. If it's a config change in a recent PR, that PR likely caused it. If it's transient (image pull timeout), re-trigger. + +### Test Failure + +RHDH deployed successfully, but one or more Playwright tests failed. 
+ +**Indicators**: +- `STATUS_FAILED_TO_DEPLOY.txt` is `false` (deployment succeeded) +- `STATUS_TEST_FAILED.txt` is `true` +- JUnit XML and Playwright report exist with specific test failures +- `STATUS_NUMBER_OF_TEST_FAILED.txt` shows the count + +**Where to look**: +- `junit-results-{project}.xml` -- which tests failed +- Playwright HTML report -- screenshots, videos, error messages +- `test-log.html` -- full console output of the test run +- `pod_logs/` -- if the test failure suggests a backend issue + +**Subcategories**: + +| Pattern | Likely Cause | Action | +|---------|-------------|--------| +| Single test fails, passes on retry | Flaky test | File flaky test ticket | +| Single test fails consistently | Real test bug or app regression | Investigate, file bug | +| Login/auth tests fail | Keycloak or auth provider issue | Check Keycloak pod logs | +| Many tests timeout | App slow or partially broken | Check pod logs, resource usage | +| All tests fail uniformly | Deployment issue not caught by health check | Treat as deployment failure | + +--- + +## Common Failure Patterns (Cheat Sheet) + +| Symptom | Type | Where to Look | Likely Cause | Action | +|---------|------|---------------|--------------|--------| +| Job status is `error` (not `failure`) | Infra | Top-level `build-log.txt` | Cluster provisioning failed | Re-trigger | +| `failed to acquire cluster lease` | Infra | `ci-operator.log` | Cluster pool exhausted | Wait and re-trigger | +| `CrashLoopBackOff` in test step log | Deploy | `pod_logs/`, K8s events | Bad config, missing secret, OOM | Check pod logs | +| `Failed to reach Backstage after N attempts` | Deploy | Test step `build-log.txt` | Pod didn't start or health check path wrong | Check pod logs, events | +| `postgress-external-db-pguser-janus-idp` secret timeout | Deploy | Test step log | Crunchy Postgres operator issue | Check operator logs | +| `Failed to install subscription` | Infra/Deploy | Test step `build-log.txt` | 
OperatorHub/Marketplace issue | Re-trigger, check OLM | +| `ImagePullBackOff` or `ErrImagePull` | Deploy | K8s events, pod describe | Wrong image tag or registry auth | Verify image exists, check pull secrets | +| `helm upgrade` command fails | Deploy | Test step `build-log.txt` | Invalid values, missing CRDs | Check recent values file changes | +| Playwright timeout on login page | Test | HTML report, videos | Keycloak down or misconfigured | Check Keycloak pod logs | +| `backstages.rhdh.redhat.com` CRD timeout | Deploy | Test step log | RHDH Operator not installed | Check operator subscription | +| Test passes on retry (flaky) | Test | JUnit XML (failures > 0 but exit 0) | Non-deterministic test | File flaky test ticket | +| All tests fail with same error | Deploy | Pod logs, HTML report | App not functional despite health check | Investigate app state | +| Certificate issuance timeout (EKS/GKE) | Infra | Test step `build-log.txt` | ACM/GCP cert delays | Re-trigger | +| DNS resolution failure (EKS) | Infra | Test step `build-log.txt` | Route53 propagation delay | Re-trigger | +| Spot VM preemption (AKS) | Infra | K8s events | Azure reclaimed spot instance | Re-trigger | +| `LoadBalancer` IP not obtained (K8s) | Infra | Test step `build-log.txt` | Ingress controller issue | Check ingress controller pods | +| Domain number exhaustion (EKS) | Infra | Test step `build-log.txt` | All 50 domain slots taken | Manual DNS cleanup needed | + +--- + +## Useful Links and Tools + +### AI Test Triager (`@Nightly Test Alerts`) + +The **AI Test Triager** is an automated analysis tool integrated into the `@Nightly Test Alerts` Slack app. It significantly speeds up the triage process by doing much of the investigation work for you. + +**How it works**: +- **Automatically triggered** on every failed nightly job -- the analysis appears alongside the failure alert in Slack. +- **Manually invoked** by tagging `@Nightly Test Alerts` in Slack when you want to analyze a specific failure. 
+ +**What it does**: + +| Capability | Description | +|------------|-------------| +| **Artifact inspection** | Reads `build-log.txt`, locates JUnit results, screenshots, and pod logs | +| **JUnit parsing** | Extracts only failed test cases with clean error messages | +| **Screenshot analysis** | Uses AI vision to interpret failure screenshots and identify what went wrong on screen | +| **Root cause analysis** | Provides a concise 1-2 sentence diagnosis of each failure | +| **Duplicate detection** | Searches Jira for semantically similar existing issues to avoid duplicates | +| **Bug creation** | Can create or update Jira bug tickets with detailed findings | + +**Recommended workflow**: +1. A nightly job fails and the alert appears in Slack with the AI analysis. +2. Review the AI triager's root cause analysis and similar Jira issues. +3. If it's a known issue, confirm and move on. +4. If it's a new issue, use the triager's output to create a Jira bug (it can do this for you) or investigate further manually. + +### Prow Dashboard + +| Link | Description | +|------|-------------| +| [Nightly Jobs (main)](https://prow.ci.openshift.org/?type=periodic&job=periodic-ci-redhat-developer-rhdh-main-e2e-*) | All main branch nightly jobs | +| [Nightly Jobs (all branches)](https://prow.ci.openshift.org/?type=periodic&job=periodic-ci-redhat-developer-rhdh-*-e2e-*) | All nightly jobs across branches | +| [PR Check Jobs](https://prow.ci.openshift.org/?type=presubmit&job=pull-ci-redhat-developer-rhdh-*-e2e-*) | PR presubmit jobs | +| [Configured Jobs](https://prow.ci.openshift.org/configured-jobs/redhat-developer/rhdh) | All configured jobs for the repo | +| [Job History (example)](https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-redhat-developer-rhdh-main-e2e-ocp-helm-nightly) | Historical runs for a specific job | + +### Accessing Artifacts Directly + +Artifacts are stored in GCS. 
You can browse them via:
+
+- **Spyglass** (Prow UI): Click on a job run, then navigate the artifacts tree
+- **GCS Web**: `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/{JOB_NAME}/{BUILD_ID}/`
+
+### Cluster Access (OCP Jobs Only)
+
+To log into the ephemeral cluster of a running or recent OCP job:
+
+```bash
+.ci/pipelines/ocp-cluster-claim-login.sh
+# Or provide the Prow URL directly:
+.ci/pipelines/ocp-cluster-claim-login.sh "https://prow.ci.openshift.org/view/gs/..."
+```
+
+The script will:
+1. Extract the cluster namespace from the Prow build log
+2. Log into the hosted-mgmt cluster
+3. Retrieve `kubeadmin` credentials
+4. Log into the ephemeral cluster
+5. Offer to open the web console (copies password to clipboard)
+
+**Requirements**: You must be a member of the `rhdh-pool-admins` [Rover group](https://rover.redhat.com/groups/search?q=rhdh-pool-admins).
+
+**Important**: Ephemeral clusters are deleted when the CI job terminates. You can only access them while the job is running or shortly after.
+
+### Re-triggering a Nightly Job
+
+Use the trigger script to re-run a failed nightly job:
+
+```bash
+# Basic re-trigger
+.ci/pipelines/trigger-nightly-job.sh --job periodic-ci-redhat-developer-rhdh-main-e2e-ocp-helm-nightly
+
+# Dry run (preview without triggering)
+.ci/pipelines/trigger-nightly-job.sh --job <job-name> --dry-run
+
+# With custom image (e.g., RC verification)
+.ci/pipelines/trigger-nightly-job.sh --job <job-name> --quay-repo rhdh/rhdh-hub-rhel9 --tag 1.9-123
+
+# With Slack alerts enabled
+.ci/pipelines/trigger-nightly-job.sh --job <job-name> --send-alerts
+```
+
+**Authentication**: The script uses a dedicated kubeconfig at `~/.config/openshift-ci/kubeconfig`. If the token is expired, it will open a browser for SSO login.
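The GCS Web path pattern shown under *Accessing Artifacts Directly* can be scripted when you triage several runs in a row. A small helper sketch (`gcs_artifacts_url` is a hypothetical function, not part of the repo; the build ID below is illustrative):

```shell
#!/usr/bin/env bash
# Sketch: build the gcsweb artifacts URL for a job run, following the
# {JOB_NAME}/{BUILD_ID} pattern documented above. Not part of the repo.
set -euo pipefail

gcs_artifacts_url() {
  local job_name="$1" build_id="$2"
  printf 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/%s/%s/\n' \
    "$job_name" "$build_id"
}

# Print the artifacts root for a nightly run (build ID is made up here):
gcs_artifacts_url "periodic-ci-redhat-developer-rhdh-main-e2e-ocp-helm-nightly" "1866766753132974080"
```

From the printed URL, `build-log.txt` and the `artifacts/` tree sit directly underneath, so you can append a path and fetch it with `curl` instead of clicking through Spyglass.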
+ +### CI Configuration (openshift/release repo) + +The Prow job definitions and ci-operator configs live in the [openshift/release](https://github.com/openshift/release) repo: + +| Path | Description | +|------|-------------| +| [`ci-operator/config/redhat-developer/rhdh/`](https://github.com/openshift/release/tree/master/ci-operator/config/redhat-developer/rhdh) | ci-operator configuration files | +| [`ci-operator/jobs/redhat-developer/rhdh/`](https://github.com/openshift/release/tree/master/ci-operator/jobs/redhat-developer/rhdh) | Generated Prow job definitions | +| [`ci-operator/step-registry/redhat-developer/rhdh/`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/redhat-developer/rhdh) | Step registry (test steps, alert sending) | + +### Documentation + +| Resource | Link | +|----------|------| +| OpenShift CI Documentation | [docs.ci.openshift.org](https://docs.ci.openshift.org/) | +| ci-operator Architecture | [ci-operator docs](https://docs.ci.openshift.org/docs/architecture/ci-operator/) | +| Artifacts Documentation | [Artifacts how-to](https://docs.ci.openshift.org/docs/how-tos/artifacts/) | +| Prow Overview | [Prow docs](https://docs.ci.openshift.org/docs/architecture/prow/) | +| Cluster Pools & Claims | [Cluster pools docs](https://docs.ci.openshift.org/docs/how-tos/cluster-claim/) | +| RHDH CI Pipeline README | [`.ci/pipelines/README.md`](../../.ci/pipelines/README.md) | +| E2E Testing CI Documentation | [`CI.md`](CI.md) | +| Playwright Documentation | [playwright.dev](https://playwright.dev/) | +| Playwright Trace Viewer | [Trace viewer docs](https://playwright.dev/docs/trace-viewer) | + +### Key Files in This Repo + +| File | Purpose | +|------|---------| +| [`.ci/pipelines/openshift-ci-tests.sh`](../../.ci/pipelines/openshift-ci-tests.sh) | Main entry point -- dispatches to job handlers | +| [`.ci/pipelines/lib/testing.sh`](../../.ci/pipelines/lib/testing.sh) | Test execution, health checks, artifact collection | +| 
[`.ci/pipelines/lib/log.sh`](../../.ci/pipelines/lib/log.sh) | Structured logging (log levels, colors, sections) | +| [`.ci/pipelines/reporting.sh`](../../.ci/pipelines/reporting.sh) | Status tracking and result persistence | +| [`.ci/pipelines/env_variables.sh`](../../.ci/pipelines/env_variables.sh) | Environment variables and secrets | +| [`.ci/pipelines/jobs/`](../../.ci/pipelines/jobs/) | Per-job-type handlers (ocp-nightly, aks-helm, etc.) | +| [`.ci/pipelines/trigger-nightly-job.sh`](../../.ci/pipelines/trigger-nightly-job.sh) | Manual nightly job trigger via Gangway API | +| [`.ci/pipelines/ocp-cluster-claim-login.sh`](../../.ci/pipelines/ocp-cluster-claim-login.sh) | Cluster access for debugging | +| [`e2e-tests/playwright/projects.json`](../../e2e-tests/playwright/projects.json) | Playwright project definitions (source of truth) | +| [`e2e-tests/playwright.config.ts`](../../e2e-tests/playwright.config.ts) | Playwright configuration (timeouts, retries, workers) | +| [`.ci/pipelines/lib/config.sh`](../../.ci/pipelines/lib/config.sh) | ConfigMap selection and app-config management | +| [`.ci/pipelines/lib/operators.sh`](../../.ci/pipelines/lib/operators.sh) | Operator/OLM installation functions | +| [`.ci/pipelines/lib/helm.sh`](../../.ci/pipelines/lib/helm.sh) | Helm chart operations and value merging | +| [`.ci/pipelines/lib/namespace.sh`](../../.ci/pipelines/lib/namespace.sh) | Namespace lifecycle and image pull secrets | +| [`.ci/pipelines/cleanup.sh`](../../.ci/pipelines/cleanup.sh) | Exit trap for cleanup | +| [`.ci/pipelines/resources/config_map/`](../../.ci/pipelines/resources/config_map/) | App-config YAML files (RBAC and non-RBAC variants) | +| [`.ci/pipelines/value_files/`](../../.ci/pipelines/value_files/) | Helm values overrides for different platforms | From 46721d63f0f833badd62d0fc3b0a1b09ba5673cf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Tue, 31 Mar 2026 14:47:45 +0200 Subject: [PATCH 02/12] docs: fix stale 
references in CI docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix Slack channel name: #rhdh-e2e-test-alerts → #rhdh-e2e-alerts - Replace janus-idp org with openshift org for /ok-to-test - Remove IBM Cloud nightly tests (no longer used) - Add missing platforms: EKS, OSD-GCP, OCP Operator - Fix function refs: run_tests() in utils.sh → testing::run_tests() in lib/testing.sh - Fix typos (environmentr, test calimed) - Add pointer to CI Medic Guide for detailed triage Co-Authored-By: Claude Opus 4.6 --- docs/e2e-tests/CI.md | 35 +++++++++++++++---------- docs/e2e-tests/enhanced-ci-reporting.md | 8 +++--- 2 files changed, 25 insertions(+), 18 deletions(-) diff --git a/docs/e2e-tests/CI.md b/docs/e2e-tests/CI.md index 907a130ae3..64cd404fef 100644 --- a/docs/e2e-tests/CI.md +++ b/docs/e2e-tests/CI.md @@ -17,13 +17,13 @@ For scenarios where tests are not automatically triggered, or when you need to m 1. **Commenting `/ok-to-test`:** - **Purpose:** This command is used to validate a PR for testing, especially important for external contributors or when tests are not automatically triggered. - - **Who Can Use It:** Only members of the [janus-idp](https://github.com/janus-idp) GitHub organization can mark the PR with this comment. + - **Who Can Use It:** Only members of the [openshift](https://github.com/openshift) GitHub organization can mark the PR with this comment. - **Use Cases:** - **External Contributors:** For PRs from contributors outside the organization, a member needs to comment `/ok-to-test` to initiate tests. - **More Details:** For additional information about `/ok-to-test`, please refer to the [Kubernetes Community Pull Requests Guide](https://github.com/kubernetes/community/blob/master/contributors/guide/pull-requests.md#more-about-ok-to-test). 2. 
**Triggering Tests Post-Validation:** - - After a janus-idp member has validated the PR with `/ok-to-test`, anyone can trigger tests using the following commands: + - After an `openshift` org member has validated the PR with `/ok-to-test`, anyone can trigger tests using the following commands: - `/test ?` to get a list of all available jobs - `/test e2e-ocp-helm` for mandatory PR checks - **Note:** Avoid using `/test all` as it may trigger unnecessary jobs and consume CI resources. Instead, use `/test ?` to see available options and trigger only the specific tests you need. @@ -38,7 +38,7 @@ For scenarios where tests are not automatically triggered, or when you need to m Use `/test ?` to see the complete list of available jobs for your specific branch and PR context. -These interactions are picked up by the OpenShift-CI service, which sets up a test environmentr. The configurations and steps for setting up this environment are defined in the `openshift-ci-tests.sh` script. For more details, see the [High-Level Overview of `openshift-ci-tests.sh`](#high-level-overview-of-openshift-ci-testssh). +These interactions are picked up by the OpenShift-CI service, which sets up a test environment. The configurations and steps for setting up this environment are defined in the `openshift-ci-tests.sh` script. For more details, see the [High-Level Overview of `openshift-ci-tests.sh`](#high-level-overview-of-openshift-ci-testssh). ### Retrying Tests @@ -51,11 +51,11 @@ If the initial automatically triggered tests fail, OpenShift-CI will add a comme - **Purpose:** Validate new PRs for code quality, functionality, and integration. - **Trigger:** - **Automatic:** When a PR includes code changes affecting tests (excluding doc-only changes), tests are automatically triggered. - - **Manual:** When `/ok-to-test` is commented by a janus-idp member for external contributors or when `/test`, `/test images`, or `/test e2e-ocp-helm` is commented after validation. 
+ - **Manual:** When `/ok-to-test` is commented by an `openshift` org member for external contributors or when `/test`, `/test images`, or `/test e2e-ocp-helm` is commented after validation. - **Environment:** Runs on ephemeral OpenShift clusters managed by Hive. Kubernetes jobs use ephemeral EKS and AKS clusters on spot instances managed by [Mapt](https://github.com/redhat-developer/mapt). GKE uses a long-running cluster. - **Configurations:** - Tests are executed on both **RBAC** (Role-Based Access Control) and **non-RBAC** namespaces. Different sets of tests are executed for both the **non-RBAC RHDH instance** and the **RBAC RHDH instance**, each deployed in separate namespaces. -- **Access:** In order to access the environment, you can run the bash at `.ci/pipelines/ocp-cluster-claim-login.sh`. You will be prompted the prow url (the url from the openshift agent, which looks like https://prow.ci.openshift.org/...). Once you test calimed a cluster, this script will forward the cluster web console url along with the credentials. +- **Access:** In order to access the environment, you can run the bash at `.ci/pipelines/ocp-cluster-claim-login.sh`. You will be prompted the prow url (the url from the openshift agent, which looks like https://prow.ci.openshift.org/...). Once you have claimed a cluster, this script will forward the cluster web console url along with the credentials. - **Steps:** 1. **Detection:** OpenShift-CI detects the PR event. 2. **Environment Setup:** The test environment is set up using the `openshift-ci-tests.sh` script (see the [High-Level Overview](#high-level-overview-of-openshift-ci-testssh)). @@ -86,15 +86,18 @@ Nightly tests are run to ensure the stability and reliability of our codebase ov ### Nightly Test Environments -- **AKS Nightly Tests:** Nightly tests for Azure Kubernetes Service (AKS) run on a dedicated cluster. We do not have AKS PR checks; the AKS environment is exclusively used for nightly runs. 
-- **IBM Cloud Tests:** All nightly tests for the `main` and `release-1.n`(depending on the latest release versions) branches run against OpenShift clusters on IBM Cloud. -- **GKE Nightly Tests:** Nightly tests on top of Google Kubernetes Engine. As the AKS, GKE is only used for nightly tests. +- **OCP Nightly Tests:** Run on ephemeral OpenShift clusters provisioned via Hive cluster pools on AWS. Multiple OCP versions are tested. +- **OCP Operator Nightly Tests:** Same as OCP nightly but deploy RHDH using the Operator instead of Helm. +- **AKS Nightly Tests:** Nightly tests on Azure Kubernetes Service (AKS) using ephemeral clusters provisioned by [Mapt](https://github.com/redhat-developer/mapt). AKS is exclusively used for nightly runs (no PR checks). +- **EKS Nightly Tests:** Nightly tests on AWS Elastic Kubernetes Service (EKS) using ephemeral clusters provisioned by Mapt. +- **GKE Nightly Tests:** Nightly tests on Google Kubernetes Engine using a long-running shared cluster. +- **OSD-GCP Nightly Tests:** Nightly tests on OpenShift Dedicated on GCP. Orchestrator is disabled and localization tests are skipped in this environment. ### Additional Nightly Jobs for Main Branch The nightly job for the `main` branch also runs against three OpenShift Container Platform (OCP) versions to ensure compatibility and stability across multiple versions. We maintain testing on the three most recent OCP versions. As new OCP versions are released, we will update our testing pipeline to include the latest versions and drop support for older ones accordingly. -> **Note:** The output of the nightly runs, including test results and any relevant notifications, is posted on the Slack channel **`#rhdh-e2e-test-alerts`**. +> **Note:** The output of the nightly runs, including test results and any relevant notifications, is posted on the Slack channel **`#rhdh-e2e-alerts`**. 
### Localization Tests @@ -115,8 +118,10 @@ The localization test implementation is in `.ci/pipelines/jobs/ocp-nightly.sh` ( - **Purpose:** Ensure ongoing stability and detect regressions in different environments. - **Trigger:** Scheduled to run every night. - **Environments:** - - **AKS Nightly Tests:** Runs on the dedicated AKS cluster. - - **IBM Cloud Nightly Tests:** Runs on OpenShift clusters on IBM Cloud, covering the most recent OCP versions. + - **OCP:** Ephemeral clusters from Hive pools (multiple OCP versions). + - **AKS/EKS:** Ephemeral clusters provisioned by [Mapt](https://github.com/redhat-developer/mapt). + - **GKE:** Long-running shared cluster. + - **OSD-GCP:** OpenShift Dedicated on GCP. - **Configurations:** - Tests are executed on both **RBAC** (Role-Based Access Control) and **non-RBAC** namespaces. - **Steps:** @@ -133,11 +138,13 @@ The localization test implementation is in `.ci/pipelines/jobs/ocp-nightly.sh` ( - Collects and aggregates results. - Stores artifacts for later review for a retention period of **6 months**. 5. **Reporting:** - - Posts outputs to Slack channel `#rhdh-e2e-test-alerts`. + - Posts outputs to Slack channel `#rhdh-e2e-alerts`. - Generates report. - **Artifacts:** Comprehensive test reports, logs, screenshots. - **Notifications:** Results posted on Slack. +> **Note:** The nightly testing diagram below is a simplified overview. For detailed job types, failure investigation, and triage workflows, see the [CI Medic Guide](CI-medic-guide.md). + ### Nightly Testing Diagram ![Nightly Testing Diagram](../images/nightly_diagram.svg) @@ -166,7 +173,7 @@ The OpenShift CI definitions for PR checks and nightly runs, as well as executio - **Testing:** Runs end-to-end tests with Playwright. - **Cleanup and Reporting:** Cleans up resources and collects artifacts after testing. -Detailed steps on how the tests and reports are managed can be found in the `run_tests()` function within the `utils.sh` script. 
The CI pipeline executes tests directly using Playwright's `--project` flag (e.g., `yarn playwright test --project=showcase`) rather than yarn script aliases. The `check_and_test()` and `run_tests()` functions accept an explicit Playwright project argument, decoupling the namespace from the test project name for more flexible reuse. +Detailed steps on how the tests and reports are managed can be found in the `testing::run_tests()` and `testing::check_and_test()` functions in `.ci/pipelines/lib/testing.sh`. The CI pipeline executes tests directly using Playwright's `--project` flag (e.g., `yarn playwright test --project=showcase`) rather than yarn script aliases. These functions accept an explicit Playwright project argument, decoupling the namespace from the test project name for more flexible reuse. ### Playwright Project Names (Single Source of Truth) @@ -201,7 +208,7 @@ When the test run is complete, the status will be reported under your PR checks. - The script cleans up all temporary resources after tests. - **Reporting:** - Generates reports and stores artifacts for **6 months**. - - Nightly test results are posted to Slack channel `#rhdh-e2e-test-alerts`. + - Nightly test results are posted to Slack channel `#rhdh-e2e-alerts`. ### Configuration Details diff --git a/docs/e2e-tests/enhanced-ci-reporting.md b/docs/e2e-tests/enhanced-ci-reporting.md index 6e0c4b1486..c71956fc2b 100644 --- a/docs/e2e-tests/enhanced-ci-reporting.md +++ b/docs/e2e-tests/enhanced-ci-reporting.md @@ -4,7 +4,7 @@ This document describes the enhanced CI reporting system that provides detailed ## Overview -The enhanced CI reporting system uses the [`.ci/pipelines/reporting.sh`](../../.ci/pipelines/reporting.sh) script to track various aspects of test execution and deployment status. Results are stored in the `SHARED_DIR` for use by OpenShift CI steps and are formatted into Slack notifications sent to the `#rhdh-e2e-test-alerts` channel. 
+The enhanced CI reporting system uses the [`.ci/pipelines/reporting.sh`](../../.ci/pipelines/reporting.sh) script to track various aspects of test execution and deployment status. Results are stored in the `SHARED_DIR` for use by OpenShift CI steps and are formatted into Slack notifications sent to the `#rhdh-e2e-alerts` channel. **Note:** The `SHARED_DIR` can only contain files. No directories or nested structures are supported. @@ -115,7 +115,7 @@ The reporting system integrates with OpenShift CI through: 1. **Step Registry**: OpenShift CI steps can read the status files from `SHARED_DIR` 2. **Artifact Collection**: Status files are preserved in artifacts for debugging -3. **Slack Notifications**: Results are formatted and sent to `#rhdh-e2e-test-alerts` +3. **Slack Notifications**: Results are formatted and sent to `#rhdh-e2e-alerts` ### The `redhat-developer-rhdh-send-alert` step @@ -124,7 +124,7 @@ The `redhat-developer-rhdh-send-alert` step is defined in the [OpenShift release - Runs as a post-step in OpenShift CI jobs - Reads the status files from `SHARED_DIR` that were written by the reporting functions - Formats the collected status information into structured Slack messages -- Sends notifications to the `#rhdh-e2e-test-alerts` channel +- Sends notifications to the `#rhdh-e2e-alerts` channel - Handles multiple deployments and their individual test results - Provides links to job logs, artifacts, and ReportPortal results @@ -132,7 +132,7 @@ The step is configured in job definitions to run after test execution completes, ## Slack Notifications -For nightly runs, the system automatically sends notifications to the `#rhdh-e2e-test-alerts` Slack channel. The message format includes: +For nightly runs, the system automatically sends notifications to the `#rhdh-e2e-alerts` Slack channel. 
The message format includes: - **Job Header**: Job name with overall status - **Logs Link**: Direct link to job logs From eb37c07623ad161795a707ef44b7345886060a0b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Tue, 31 Mar 2026 14:48:45 +0200 Subject: [PATCH 03/12] docs: update CI pipeline README and reporting docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Replace hardcoded cluster pool list with link to OpenShift CI docs - Remove stale IBM Cloud migration history - Fix example Prow URL (janus-idp_backstage-showcase → redhat-developer-rhdh) - Replace inline module list with link to lib/README.md - Add release branch Slack channels to enhanced-ci-reporting.md - Add link to CI Medic Guide Co-Authored-By: Claude Opus 4.6 --- .ci/pipelines/README.md | 58 ++++++++----------------- docs/e2e-tests/enhanced-ci-reporting.md | 2 +- 2 files changed, 18 insertions(+), 42 deletions(-) diff --git a/.ci/pipelines/README.md b/.ci/pipelines/README.md index 537f7c12d3..0cc9f73903 100644 --- a/.ci/pipelines/README.md +++ b/.ci/pipelines/README.md @@ -2,17 +2,16 @@ ## Overview -The RHDH deployment for end-to-end (e2e) tests in CI has been updated to use **ephemeral clusters** -on OpenShift Container Platform (OCP) instead of persistent clusters. +The RHDH deployment for end-to-end (e2e) tests in CI uses **ephemeral clusters** on OpenShift +Container Platform (OCP). -### Key Updates +### Key Details -- Starting from version **1.5**, ephemeral clusters are used for: - - OCP nightly jobs (v4.17, v4.16, and v4.14). - - PR checks on the main branch. -- Previously, RHDH PR checks utilized persistent clusters created on IBM Cloud. -- Now, ephemeral clusters are provisioned using the **OpenShift CI cluster claim** on AWS via the +- Ephemeral OCP clusters are provisioned using the **OpenShift CI cluster claim** on AWS via the RHDH-QE account in the `us-east-2` region. 
+- Used for OCP nightly jobs (multiple OCP versions) and PR checks on the main branch. +- Non-OCP platforms (AKS, EKS) use ephemeral clusters provisioned by + [Mapt](https://github.com/redhat-developer/mapt). GKE uses a long-running shared cluster. --- @@ -28,23 +27,12 @@ To access ephemeral clusters, you must: ## Cluster Pools -The following cluster pools are available for different OCP versions: +RHDH uses dedicated Hive cluster pools with the `rhdh` prefix. Pool versions rotate as new OCP +releases come out. -- **RHDH-4-19-US-EAST-2** - - Usage: OCP v4.19 nightly jobs. - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-19-0-amd64-aws-us-east-2_clusterpool.yaml). - -- **RHDH-4-18-US-EAST-2** - - Usage: OCP v4.18 nightly jobs. - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-18-0-amd64-aws-us-east-2_clusterpool.yaml). - -- **RHDH-4-17-US-EAST-2** - - Usage: PR checks on the main branch and OCP v4.17 nightly jobs. - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-17-0-amd64-aws-us-east-2_clusterpool.yaml). - -- **RHDH-4-16-US-EAST-2** - - Usage: OCP v4.16 nightly jobs. - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-16-0-amd64-aws-us-east-2_clusterpool.yaml). +To find the current list of available pools, filter for `rhdh` in the +[existing cluster pools](https://docs.ci.openshift.org/how-tos/cluster-claim/#existing-cluster-pools) +page. --- @@ -87,7 +75,7 @@ ephemeral environment credentials. .ci/pipelines/ocp-cluster-claim-login.sh ``` 2. 
Provide the Prow log URL when prompted, for example: - `https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/janus-idp_backstage-showcase/2089/pull-ci-janus-idp-backstage-showcase-main-e2e-tests/1866766753132974080 ` + `https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-redhat-developer-rhdh-main-e2e-ocp-helm-nightly/` 3. The script will: - Log in to the hosted-mgmt cluster, which manages ephemeral cluster creation. - Retrieve admin credentials and log in to the ephemeral cluster. @@ -153,20 +141,8 @@ yarn shellcheck ### Modular Architecture -Pipeline utilities are organized into modules in `.ci/pipelines/lib/`: - -- `log.sh` - Logging functions -- `common.sh` - Common utilities (oc_login, sed_inplace, etc.) -- `k8s-wait.sh` - Kubernetes wait/polling operations -- `operators.sh` - Operator installations - -Usage example: - -```bash -# Using modular functions -k8s_wait::deployment "namespace" "deployment" -common::oc_login -operator::install_pipelines -``` +Pipeline utilities are organized into modules in `.ci/pipelines/lib/`. See +[`lib/README.md`](lib/README.md) for the full list of modules, function signatures, and conventions. -See `lib/README.md` for module details. +For detailed triage and failure investigation, see the +[CI Medic Guide](../../docs/e2e-tests/CI-medic-guide.md). diff --git a/docs/e2e-tests/enhanced-ci-reporting.md b/docs/e2e-tests/enhanced-ci-reporting.md index c71956fc2b..b66ba4e6cb 100644 --- a/docs/e2e-tests/enhanced-ci-reporting.md +++ b/docs/e2e-tests/enhanced-ci-reporting.md @@ -132,7 +132,7 @@ The step is configured in job definitions to run after test execution completes, ## Slack Notifications -For nightly runs, the system automatically sends notifications to the `#rhdh-e2e-alerts` Slack channel. 
The message format includes: +For nightly runs, the system automatically sends notifications to the `#rhdh-e2e-alerts` Slack channel (main branch) or `#rhdh-e2e-alerts-{VERSION}` channels for release branches (e.g., `#rhdh-e2e-alerts-1-9`, `#rhdh-e2e-alerts-1-10`). The message format includes: - **Job Header**: Job name with overall status - **Logs Link**: Direct link to job logs From 53ddf9f816601554515995d9a18c702fe18f8e16 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Tue, 31 Mar 2026 14:48:54 +0200 Subject: [PATCH 04/12] docs: update rulesync CI rule with current refs and job handlers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix function refs: run_tests() → testing::run_tests() in lib/testing.sh - Fix Slack channel name: #rhdh-e2e-test-alerts → #rhdh-e2e-alerts - Replace hardcoded cluster pools with link to OpenShift CI docs - Add all missing job handlers (aks/eks/gke-operator, osd-gcp, upgrade) - Add showcase-runtime-db to Playwright project list Co-Authored-By: Claude Opus 4.6 --- .claude/memories/ci-e2e-testing.md | 2 +- .claude/rules/ci-e2e-testing.md | 46 ++++++++++-------------------- .cursor/rules/ci-e2e-testing.mdc | 46 ++++++++++-------------------- .rulesync/rules/ci-e2e-testing.md | 46 ++++++++++-------------------- 4 files changed, 46 insertions(+), 94 deletions(-) diff --git a/.claude/memories/ci-e2e-testing.md b/.claude/memories/ci-e2e-testing.md index 48053c7593..302ebc9751 100644 --- a/.claude/memories/ci-e2e-testing.md +++ b/.claude/memories/ci-e2e-testing.md @@ -297,7 +297,7 @@ Available cluster pools for different OCP versions: #### Nightly Tests - **Schedule**: Automated nightly runs - **Environments**: Multiple OCP versions, AKS, GKE -- **Reporting**: Slack notifications to `#rhdh-e2e-test-alerts` +- **Reporting**: Slack notifications to `#rhdh-e2e-alerts` ### Test Execution Environment diff --git a/.claude/rules/ci-e2e-testing.md 
b/.claude/rules/ci-e2e-testing.md index 48053c7593..53f7d4607c 100644 --- a/.claude/rules/ci-e2e-testing.md +++ b/.claude/rules/ci-e2e-testing.md @@ -77,6 +77,7 @@ test.beforeAll(async ({ }, testInfo) => { - `showcase-operator`: General functionality tests with base deployment using Operator - `showcase-operator-rbac`: General functionality tests with RBAC-enabled deployment using Operator - `showcase-runtime`: Runtime environment tests + - `showcase-runtime-db`: Runtime database tests - `showcase-sanity-plugins`: Plugin sanity checks - `showcase-upgrade`: Upgrade scenario tests - `showcase-localization-fr`: French localization tests @@ -125,18 +126,18 @@ test.beforeAll(async ({ }, testInfo) => { #### CI/CD Pipeline Execution -In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via the `run_tests()` function in `.ci/pipelines/utils.sh`: +In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via `testing::run_tests()` in `.ci/pipelines/lib/testing.sh`: ```bash yarn playwright test --project="${playwright_project}" ``` -The namespace and Playwright project are decoupled, allowing flexible reuse. The `check_and_test()` and `run_tests()` functions accept an explicit `playwright_project` argument: +The namespace and Playwright project are decoupled, allowing flexible reuse. 
The `testing::check_and_test()` and `testing::run_tests()` functions accept an explicit `playwright_project` argument: ```bash # Function signatures: -check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] -run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" +testing::check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] +testing::run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" ``` #### Local Development Scripts @@ -266,25 +267,7 @@ Check the readme at `.ci/pipelines/README.md` ### Cluster Pools -Available cluster pools for different OCP versions: - -- **RHDH-4-19-US-EAST-2** - - Usage: OCP v4.19 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-19-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-18-US-EAST-2** - - Usage: OCP v4.18 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-18-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-17-US-EAST-2** - - Usage: PR checks on main branch and OCP v4.17 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-17-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-16-US-EAST-2** - - Usage: OCP v4.16 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-16-0-amd64-aws-us-east-2_clusterpool.yaml) - -**Note:** This is subject to change. Use `.ci/pipelines/README.md` as a source of truth. +RHDH uses dedicated Hive cluster pools with the `rhdh` prefix on AWS `us-east-2`. Pool versions rotate as new OCP releases come out. 
Find the current list by filtering for `rhdh` in the [existing cluster pools](https://docs.ci.openshift.org/how-tos/cluster-claim/#existing-cluster-pools) page. See also `.ci/pipelines/README.md`. ### CI Job Types @@ -297,7 +280,7 @@ Available cluster pools for different OCP versions: #### Nightly Tests - **Schedule**: Automated nightly runs - **Environments**: Multiple OCP versions, AKS, GKE -- **Reporting**: Slack notifications to `#rhdh-e2e-test-alerts` +- **Reporting**: Slack notifications to `#rhdh-e2e-alerts` ### Test Execution Environment @@ -347,14 +330,15 @@ yarn prettier:fix # Fix formatting for shell, markdown, and YAML file - Functions may temporarily disable/re-enable error handling with `set +e` / `set -e` pattern #### Job Handlers -The main script handles different job types: -- `handle_aks_helm`: AKS Helm deployment -- `handle_eks_helm`: EKS Helm deployment -- `handle_gke_helm`: GKE Helm deployment -- `handle_ocp_operator`: OCP Operator deployment -- `handle_ocp_nightly`: OCP nightly tests -- `handle_ocp_pull`: OCP PR tests +The main script dispatches to job-specific handlers in `.ci/pipelines/jobs/`: +- `handle_aks_helm` / `handle_aks_operator`: AKS Helm/Operator deployment +- `handle_eks_helm` / `handle_eks_operator`: EKS Helm/Operator deployment +- `handle_gke_helm` / `handle_gke_operator`: GKE Helm/Operator deployment +- `handle_ocp_nightly`: OCP Helm nightly tests (also handles OSD-GCP Helm) +- `handle_ocp_operator`: OCP Operator nightly tests (also handles OSD-GCP Operator) +- `handle_ocp_pull`: OCP PR checks - `handle_auth_providers`: Auth provider tests +- `handle_ocp_helm_upgrade`: Upgrade scenario tests #### Special Case: showcase-auth-providers Deployment diff --git a/.cursor/rules/ci-e2e-testing.mdc b/.cursor/rules/ci-e2e-testing.mdc index d89f8e0ab3..1924c829ba 100644 --- a/.cursor/rules/ci-e2e-testing.mdc +++ b/.cursor/rules/ci-e2e-testing.mdc @@ -80,6 +80,7 @@ test.beforeAll(async ({ }, testInfo) => { - `showcase-operator`: General 
functionality tests with base deployment using Operator - `showcase-operator-rbac`: General functionality tests with RBAC-enabled deployment using Operator - `showcase-runtime`: Runtime environment tests + - `showcase-runtime-db`: Runtime database tests - `showcase-sanity-plugins`: Plugin sanity checks - `showcase-upgrade`: Upgrade scenario tests - `showcase-localization-fr`: French localization tests @@ -128,18 +129,18 @@ test.beforeAll(async ({ }, testInfo) => { #### CI/CD Pipeline Execution -In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via the `run_tests()` function in `.ci/pipelines/utils.sh`: +In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via `testing::run_tests()` in `.ci/pipelines/lib/testing.sh`: ```bash yarn playwright test --project="${playwright_project}" ``` -The namespace and Playwright project are decoupled, allowing flexible reuse. The `check_and_test()` and `run_tests()` functions accept an explicit `playwright_project` argument: +The namespace and Playwright project are decoupled, allowing flexible reuse. 
The `testing::check_and_test()` and `testing::run_tests()` functions accept an explicit `playwright_project` argument: ```bash # Function signatures: -check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] -run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" +testing::check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] +testing::run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" ``` #### Local Development Scripts @@ -269,25 +270,7 @@ Check the readme at `.ci/pipelines/README.md` ### Cluster Pools -Available cluster pools for different OCP versions: - -- **RHDH-4-19-US-EAST-2** - - Usage: OCP v4.19 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-19-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-18-US-EAST-2** - - Usage: OCP v4.18 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-18-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-17-US-EAST-2** - - Usage: PR checks on main branch and OCP v4.17 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-17-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-16-US-EAST-2** - - Usage: OCP v4.16 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-16-0-amd64-aws-us-east-2_clusterpool.yaml) - -**Note:** This is subject to change. Use `.ci/pipelines/README.md` as a source of truth. +RHDH uses dedicated Hive cluster pools with the `rhdh` prefix on AWS `us-east-2`. Pool versions rotate as new OCP releases come out. 
Find the current list by filtering for `rhdh` in the [existing cluster pools](https://docs.ci.openshift.org/how-tos/cluster-claim/#existing-cluster-pools) page. See also `.ci/pipelines/README.md`. ### CI Job Types @@ -300,7 +283,7 @@ Available cluster pools for different OCP versions: #### Nightly Tests - **Schedule**: Automated nightly runs - **Environments**: Multiple OCP versions, AKS, GKE -- **Reporting**: Slack notifications to `#rhdh-e2e-test-alerts` +- **Reporting**: Slack notifications to `#rhdh-e2e-alerts` ### Test Execution Environment @@ -350,14 +333,15 @@ yarn prettier:fix # Fix formatting for shell, markdown, and YAML file - Functions may temporarily disable/re-enable error handling with `set +e` / `set -e` pattern #### Job Handlers -The main script handles different job types: -- `handle_aks_helm`: AKS Helm deployment -- `handle_eks_helm`: EKS Helm deployment -- `handle_gke_helm`: GKE Helm deployment -- `handle_ocp_operator`: OCP Operator deployment -- `handle_ocp_nightly`: OCP nightly tests -- `handle_ocp_pull`: OCP PR tests +The main script dispatches to job-specific handlers in `.ci/pipelines/jobs/`: +- `handle_aks_helm` / `handle_aks_operator`: AKS Helm/Operator deployment +- `handle_eks_helm` / `handle_eks_operator`: EKS Helm/Operator deployment +- `handle_gke_helm` / `handle_gke_operator`: GKE Helm/Operator deployment +- `handle_ocp_nightly`: OCP Helm nightly tests (also handles OSD-GCP Helm) +- `handle_ocp_operator`: OCP Operator nightly tests (also handles OSD-GCP Operator) +- `handle_ocp_pull`: OCP PR checks - `handle_auth_providers`: Auth provider tests +- `handle_ocp_helm_upgrade`: Upgrade scenario tests #### Special Case: showcase-auth-providers Deployment diff --git a/.rulesync/rules/ci-e2e-testing.md b/.rulesync/rules/ci-e2e-testing.md index c9d0d983cc..09f6d20ded 100644 --- a/.rulesync/rules/ci-e2e-testing.md +++ b/.rulesync/rules/ci-e2e-testing.md @@ -83,6 +83,7 @@ test.beforeAll(async ({ }, testInfo) => { - `showcase-operator`: 
General functionality tests with base deployment using Operator - `showcase-operator-rbac`: General functionality tests with RBAC-enabled deployment using Operator - `showcase-runtime`: Runtime environment tests + - `showcase-runtime-db`: Runtime database tests - `showcase-sanity-plugins`: Plugin sanity checks - `showcase-upgrade`: Upgrade scenario tests - `showcase-localization-fr`: French localization tests @@ -131,18 +132,18 @@ test.beforeAll(async ({ }, testInfo) => { #### CI/CD Pipeline Execution -In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via the `run_tests()` function in `.ci/pipelines/utils.sh`: +In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via `testing::run_tests()` in `.ci/pipelines/lib/testing.sh`: ```bash yarn playwright test --project="${playwright_project}" ``` -The namespace and Playwright project are decoupled, allowing flexible reuse. The `check_and_test()` and `run_tests()` functions accept an explicit `playwright_project` argument: +The namespace and Playwright project are decoupled, allowing flexible reuse. 
The `testing::check_and_test()` and `testing::run_tests()` functions accept an explicit `playwright_project` argument: ```bash # Function signatures: -check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] -run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" +testing::check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] +testing::run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" ``` #### Local Development Scripts @@ -272,25 +273,7 @@ Check the readme at `.ci/pipelines/README.md` ### Cluster Pools -Available cluster pools for different OCP versions: - -- **RHDH-4-19-US-EAST-2** - - Usage: OCP v4.19 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-19-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-18-US-EAST-2** - - Usage: OCP v4.18 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-18-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-17-US-EAST-2** - - Usage: PR checks on main branch and OCP v4.17 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-17-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-16-US-EAST-2** - - Usage: OCP v4.16 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-16-0-amd64-aws-us-east-2_clusterpool.yaml) - -**Note:** This is subject to change. Use `.ci/pipelines/README.md` as a source of truth. +RHDH uses dedicated Hive cluster pools with the `rhdh` prefix on AWS `us-east-2`. Pool versions rotate as new OCP releases come out. 
Find the current list by filtering for `rhdh` in the [existing cluster pools](https://docs.ci.openshift.org/how-tos/cluster-claim/#existing-cluster-pools) page. See also `.ci/pipelines/README.md`. ### CI Job Types @@ -303,7 +286,7 @@ Available cluster pools for different OCP versions: #### Nightly Tests - **Schedule**: Automated nightly runs - **Environments**: Multiple OCP versions, AKS, GKE -- **Reporting**: Slack notifications to `#rhdh-e2e-test-alerts` +- **Reporting**: Slack notifications to `#rhdh-e2e-alerts` ### Test Execution Environment @@ -353,14 +336,15 @@ yarn prettier:fix # Fix formatting for shell, markdown, and YAML file - Functions may temporarily disable/re-enable error handling with `set +e` / `set -e` pattern #### Job Handlers -The main script handles different job types: -- `handle_aks_helm`: AKS Helm deployment -- `handle_eks_helm`: EKS Helm deployment -- `handle_gke_helm`: GKE Helm deployment -- `handle_ocp_operator`: OCP Operator deployment -- `handle_ocp_nightly`: OCP nightly tests -- `handle_ocp_pull`: OCP PR tests +The main script dispatches to job-specific handlers in `.ci/pipelines/jobs/`: +- `handle_aks_helm` / `handle_aks_operator`: AKS Helm/Operator deployment +- `handle_eks_helm` / `handle_eks_operator`: EKS Helm/Operator deployment +- `handle_gke_helm` / `handle_gke_operator`: GKE Helm/Operator deployment +- `handle_ocp_nightly`: OCP Helm nightly tests (also handles OSD-GCP Helm) +- `handle_ocp_operator`: OCP Operator nightly tests (also handles OSD-GCP Operator) +- `handle_ocp_pull`: OCP PR checks - `handle_auth_providers`: Auth provider tests +- `handle_ocp_helm_upgrade`: Upgrade scenario tests #### Special Case: showcase-auth-providers Deployment From 1795ffc93f2b6e89daf40c67e85b7dc8ab06b247 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Wed, 1 Apr 2026 11:42:20 +0200 Subject: [PATCH 05/12] docs: streamline CI Medic Guide - reduce duplication, add usage chapter - Add "How to Use This Guide" chapter 
with day-1 setup and rotation lookup table - Replace static cheat sheet with pointers to AI Test Triager and resolved Jira ci-fail issues - Replace duplicated script usage sections with pointers to source docs - Simplify cloud platform sections with practical failure patterns - Consolidate documentation/key files tables into concise related docs section Co-Authored-By: Claude Opus 4.6 --- docs/e2e-tests/CI-medic-guide.md | 250 ++++++++++--------------------- 1 file changed, 81 insertions(+), 169 deletions(-) diff --git a/docs/e2e-tests/CI-medic-guide.md b/docs/e2e-tests/CI-medic-guide.md index bd83042e3f..6e808a0478 100644 --- a/docs/e2e-tests/CI-medic-guide.md +++ b/docs/e2e-tests/CI-medic-guide.md @@ -5,12 +5,13 @@ A practical guide for investigating test failures in RHDH nightly jobs and PR ch ## Table of Contents - [Overview](#overview) +- [How to Use This Guide](#how-to-use-this-guide) - [Anatomy of a Prow Job](#anatomy-of-a-prow-job) - [Where to Find Logs and Artifacts](#where-to-find-logs-and-artifacts) - [Job Lifecycle and Failure Points](#job-lifecycle-and-failure-points) - [Job Types Reference](#job-types-reference) - [Identifying Failure Types](#identifying-failure-types) -- [Common Failure Patterns (Cheat Sheet)](#common-failure-patterns-cheat-sheet) +- [Finding Past Failures](#finding-past-failures) - [Useful Links and Tools](#useful-links-and-tools) - [AI Test Triager](#ai-test-triager-nightly-test-alerts) @@ -20,7 +21,7 @@ A practical guide for investigating test failures in RHDH nightly jobs and PR ch ### What is a CI Medic? -The CI medic is a **weekly rotating role** responsible for maintaining the health of PR checks and nightly E2E test jobs. When your rotation starts, you'll receive a Slack message with your responsibilities. +The CI medic is a **weekly rotating role** responsible for maintaining the health of PR checks and nightly E2E test jobs. When your rotation starts, you'll receive a Slack message reminding you of your responsibilities.
The complete role description is available in [this Google Doc](https://docs.google.com/document/d/1CjqSQYA6g35-95OpHXobcJdWFRGS5yu-MV8-mfuDmQA/edit?usp=sharing). ### Core Responsibilities @@ -39,7 +40,7 @@ The CI medic is a **weekly rotating role** responsible for maintaining the healt - **Main branch**: `#rhdh-e2e-alerts` Slack channel - **Release branches**: Dedicated channels like `#rhdh-e2e-alerts-1-8`, `#rhdh-e2e-alerts-1-9`, etc. - **Infrastructure announcements**: `#announce-testplatform` (general OpenShift CI status) -- **Getting help**: `#forum-ocp-testplatform` (ask questions about CI platform issues) +- **Getting help**: `#forum-ocp-testplatform` (ask questions about CI platform issues, or check whether others are hitting the same problem) Each alert includes links to the job logs, artifacts, and a summary of which deployments/tests passed or failed. Check the bookmarks/folders in the `#rhdh-e2e-alerts` channel for additional resources. @@ -57,6 +58,53 @@ Each alert includes links to the job logs, artifacts, and a summary of which dep --- +## How to Use This Guide + +This guide is a **reference**, not a textbook. You don't need to read it cover-to-cover before your rotation starts. Instead, use it as a companion that you come back to as situations arise during the week. + +### Getting Started (Day 1) + +When your rotation begins: + +1. **Read the [Overview](#overview)** above to understand the role and where alerts come in. +2. **Familiarize yourself with the [Useful Links and Tools](#useful-links-and-tools)** section -- open the Prow dashboards, join the Slack channels, and make sure you have access. +3. **Try the [AI Test Triager](#ai-test-triager-nightly-test-alerts)** on a recent failure in `#rhdh-e2e-alerts` to see how it works. It will handle most of the initial analysis for you. + +That's enough to start triaging.
+ +### During Your Rotation + +Use the rest of the guide on demand as you encounter specific situations: + +| Situation | Section to consult | +|-----------|-------------------| +| A job failed and you need to find the logs | [Where to Find Logs and Artifacts](#where-to-find-logs-and-artifacts) | +| You can't tell *where* in the pipeline it broke | [Job Lifecycle and Failure Points](#job-lifecycle-and-failure-points) | +| You need to understand what a specific job does | [Job Types Reference](#job-types-reference) | +| You're unsure if it's infra, deployment, or a test bug | [Identifying Failure Types](#identifying-failure-types) | +| You need to re-trigger a job or access a cluster | [Useful Links and Tools](#useful-links-and-tools) | + +### Understanding the CI Scripts + +The guide links heavily to scripts in `.ci/pipelines/`. You don't need to read those scripts upfront either. When you're investigating a failure and need to understand what a specific phase does, follow the links from the relevant [Job Lifecycle](#job-lifecycle-and-failure-points) or [Job Types](#job-types-reference) section to the source code. + +Key entry points if you do want to explore: +- [`.ci/pipelines/openshift-ci-tests.sh`](../../.ci/pipelines/openshift-ci-tests.sh) -- the main dispatcher, start here to understand how jobs are routed +- [`.ci/pipelines/jobs/`](../../.ci/pipelines/jobs/) -- one handler per job type, each is self-contained +- [`.ci/pipelines/lib/testing.sh`](../../.ci/pipelines/lib/testing.sh) -- how tests are executed, health-checked, and artifacts collected + +### Improving This Guide + +This guide is a living document. When you finish your rotation: + +- **Update outdated information** -- job names, namespaces, and platform details change over time. +- **Clarify anything that confused you** -- if you had to figure something out the hard way, save the next person the trouble. 
+- **Remove stale content** -- if a job type or failure mode no longer exists, remove it rather than leaving it to confuse future medics. + +Small, incremental improvements after each rotation keep this guide accurate and useful. + +--- + ## Anatomy of a Prow Job ### Job Naming Convention @@ -155,16 +203,9 @@ This is especially useful when: - You need to check pod resource consumption (OOM suspicion) - You want to watch deployment progress in real time rather than waiting for artifacts -**Logging into the claimed cluster (OCP jobs):** While a job is executing, you can also log into the ephemeral OCP cluster where RHDH is being deployed and tested. Use the [`ocp-cluster-claim-login.sh`](../../.ci/pipelines/ocp-cluster-claim-login.sh) script: - -```bash -# Provide the Prow job URL -.ci/pipelines/ocp-cluster-claim-login.sh "https://prow.ci.openshift.org/view/gs/..." -``` - -This gives you direct `oc` access to the cluster, allowing you to inspect pods, check logs, describe resources, and debug issues live. See [Cluster Access](#cluster-access-ocp-jobs-only) for details. +**Logging into the claimed cluster (OCP jobs):** While a job is executing, you can also log into the ephemeral OCP cluster using [`ocp-cluster-claim-login.sh`](../../.ci/pipelines/ocp-cluster-claim-login.sh). See [`.ci/pipelines/README.md`](../../.ci/pipelines/README.md) for prerequisites, access requirements, and usage. -**Prerequisite**: You must be a member of the `openshift` GitHub organization. Request access at [DevServices GitHub Access Request](https://devservices.dpp.openshift.com/support/github_access_request/). For cluster login, you also need to be in the `rhdh-pool-admins` [Rover group](https://rover.redhat.com/groups/search?q=rhdh-pool-admins). +**Prerequisite**: You must be a member of the `openshift` GitHub organization. Request access at [DevServices GitHub Access Request](https://devservices.dpp.openshift.com/support/github_access_request/). 
### Artifact Directory Structure @@ -297,19 +338,7 @@ Open `index.html` in a browser from the GCS artifacts. The report contains per-t ### Phase 5: Test Execution -**What happens**: Playwright tests run inside the test container against the deployed RHDH instance (see [testing.sh](../../.ci/pipelines/lib/testing.sh)). - -```bash -yarn playwright test --project="${playwright_project}" -``` - -Tests are configured in [`playwright.config.ts`](../../e2e-tests/playwright.config.ts) with: -- **Timeout**: 90 seconds per test -- **Retries**: 2 on CI (1 for auth-providers) -- **Workers**: 3 parallel -- **Viewport**: 1920x1080 - -Project names are defined in [`projects.json`](../../e2e-tests/playwright/projects.json) (single source of truth) and loaded by CI via [`playwright-projects.sh`](../../.ci/pipelines/playwright-projects.sh). +**What happens**: Playwright tests run inside the test container against the deployed RHDH instance (see [testing.sh](../../.ci/pipelines/lib/testing.sh)). For test configuration details (timeouts, retries, workers), see [`playwright.config.ts`](../../e2e-tests/playwright.config.ts). For project names, see [`projects.json`](../../e2e-tests/playwright/projects.json). **What can go wrong**: - Individual test failures (assertions, timeouts, element not found) @@ -425,17 +454,7 @@ Tests on Azure Kubernetes Service. 
See [`aks-helm.sh`](../../.ci/pipelines/jobs/ **Test suites**: `showcase-k8s`, `showcase-rbac-k8s` -**Platform specifics**: -- Uses Azure Spot VMs -- pods may be preempted mid-test (tolerations/affinity patches via [`aks-spot-patch.yaml`](../../.ci/pipelines/cluster/aks/patch/aks-spot-patch.yaml)) -- Ingress via Azure Web App Routing controller (`webapprouting.kubernetes.azure.com`) -- see [`aks-operator-ingress.yaml`](../../.ci/pipelines/cluster/aks/manifest/aks-operator-ingress.yaml) -- Gets LoadBalancer IP from `app-routing-system` namespace (`nginx` service) -- Image pull secrets from Red Hat registry required - -**Common failures**: -- Spot VM preemption causing pod evictions -- LoadBalancer IP not obtained (check `app-routing-system` namespace) -- Azure API throttling -- Image pull failures from Red Hat registry +**Common failures**: Most failures are either [Mapt](https://github.com/redhat-developer/mapt) failing to create the cluster (check the `create` step in artifacts) or the cluster being slower than OCP, causing timeouts during deployment or networking setup. Re-trigger in both cases. ### EKS Helm / EKS Operator @@ -451,21 +470,7 @@ Tests on AWS Elastic Kubernetes Service. See [`eks-helm.sh`](../../.ci/pipelines - **ALB ingress controller**: AWS Application Load Balancer with SSL redirect -- see [`eks-operator-ingress.yaml`](../../.ci/pipelines/cluster/eks/manifest/eks-operator-ingress.yaml) - **External DNS**: Automatically creates Route53 records from ingress annotations -**Network setup flow**: -1. Generate unique domain name and reserve in Route53 -2. Request certificate from ACM, wait for DNS validation (up to 30 minutes) -3. Deploy with ALB ingress, get LoadBalancer hostname -4. Update Route53 CNAME to point to ALB -5. 
Verify DNS resolution (30 attempts, 15 second intervals) - -**Common failures**: -- Domain number exhaustion (50 limit) -- Certificate issuance delays or validation failures (ACM) -- DNS propagation delays (can take 15-30 minutes) -- Route53 API throttling -- ALB creation/deletion race conditions - -**Cleanup**: Route53 DNS records are deleted after test completion. +**Common failures**: Usually AWS resource limits (domain slots, certificates, Route53 throttling). If persistent, check the job handler for which resource is exhausted. ### GKE Helm / GKE Operator @@ -477,17 +482,10 @@ Tests on Google Kubernetes Engine. See [`gke-helm.sh`](../../.ci/pipelines/jobs/ **Platform specifics** (cert logic in [`gcloud.sh`](../../.ci/pipelines/cluster/gke/gcloud.sh)): - Uses a **long-running cluster** (not ephemeral like OCP) -- Pre-provisioned static IP: `rhdh-static-ip` - Google-managed SSL certificates via `gcloud` - GCE ingress class with FrontendConfig for SSL policy and HTTPS redirect -- see [`frontend-config.yaml`](../../.ci/pipelines/cluster/gke/manifest/frontend-config.yaml) and [`gke-operator-ingress.yaml`](../../.ci/pipelines/cluster/gke/manifest/gke-operator-ingress.yaml) -- Ingress annotation: `ingress.gcp.kubernetes.io/pre-shared-cert` -**Common failures**: -- SSL certificate creation delays (CA issuance timing) -- Static IP already in use or unavailable -- GCP quota limits on certificates/IPs -- Cloud Load Balancer propagation delays -- FrontendConfig not applying (timing issues) +**Common failures**: Since GKE uses a long-running shared cluster, most issues stem from stale state -- a previous job exited without proper cleanup, or two jobs were triggered at the same time and collided on shared resources (namespaces, certificates, static IP). If jobs overlap, adjust the cron schedule in the [ci-operator config](https://github.com/openshift/release/tree/master/ci-operator/config/redhat-developer/rhdh) to space them out. 
--- @@ -501,7 +499,7 @@ The job never got to run tests. Something went wrong with the CI platform itself - Prow shows the job as `error` (red circle) rather than `failure` (red X) - Failure is in `build-log.txt` (top level), not in the test step - `ci-operator.log` shows provisioning or setup errors -- No test artifacts exist at all +- No test artifacts for RHDH exist at all **Where to look**: - Top-level `build-log.txt` @@ -511,9 +509,8 @@ The job never got to run tests. Something went wrong with the CI platform itself **Common causes**: - Cluster pool exhaustion - Cloud provider API failures (AKS/EKS/GKE auth, quota) -- Operator marketplace down - Network/DNS issues at the CI level -- Image registry unavailable +- Image or the image registry unavailable **Action**: Re-trigger the job. If it persists across multiple runs, escalate to CI platform team. @@ -522,7 +519,6 @@ The job never got to run tests. Something went wrong with the CI platform itself The cluster was provisioned, but RHDH failed to deploy or start properly. **Indicators**: -- `STATUS_FAILED_TO_DEPLOY.txt` contains `true` for one or more namespaces - `build-log.txt` (test step) shows deployment errors before any test execution - `pod_logs/` contain application crash logs - No JUnit XML or Playwright report exists for that namespace @@ -534,7 +530,6 @@ The cluster was provisioned, but RHDH failed to deploy or start properly. **Common causes**: - Bad configuration in ConfigMaps (see [`resources/config_map/`](../../.ci/pipelines/resources/config_map/)) or values files (see [`value_files/`](../../.ci/pipelines/value_files/)) -- Missing secrets (especially PostgreSQL user secret for RBAC) - Image pull failures (wrong tag, registry auth, rate limiting) - Resource constraints (OOM, CPU limits) - Operator CRD not available in time @@ -546,13 +541,9 @@ The cluster was provisioned, but RHDH failed to deploy or start properly. RHDH deployed successfully, but one or more Playwright tests failed. 
**Indicators**: -- `STATUS_FAILED_TO_DEPLOY.txt` is `false` (deployment succeeded) -- `STATUS_TEST_FAILED.txt` is `true` - JUnit XML and Playwright report exist with specific test failures -- `STATUS_NUMBER_OF_TEST_FAILED.txt` shows the count **Where to look**: -- `junit-results-{project}.xml` -- which tests failed - Playwright HTML report -- screenshots, videos, error messages - `test-log.html` -- full console output of the test run - `pod_logs/` -- if the test failure suggests a backend issue @@ -563,33 +554,26 @@ RHDH deployed successfully, but one or more Playwright tests failed. |---------|-------------|--------| | Single test fails, passes on retry | Flaky test | File flaky test ticket | | Single test fails consistently | Real test bug or app regression | Investigate, file bug | -| Login/auth tests fail | Keycloak or auth provider issue | Check Keycloak pod logs | | Many tests timeout | App slow or partially broken | Check pod logs, resource usage | | All tests fail uniformly | Deployment issue not caught by health check | Treat as deployment failure | --- -## Common Failure Patterns (Cheat Sheet) - -| Symptom | Type | Where to Look | Likely Cause | Action | -|---------|------|---------------|--------------|--------| -| Job status is `error` (not `failure`) | Infra | Top-level `build-log.txt` | Cluster provisioning failed | Re-trigger | -| `failed to acquire cluster lease` | Infra | `ci-operator.log` | Cluster pool exhausted | Wait and re-trigger | -| `CrashLoopBackOff` in test step log | Deploy | `pod_logs/`, K8s events | Bad config, missing secret, OOM | Check pod logs | -| `Failed to reach Backstage after N attempts` | Deploy | Test step `build-log.txt` | Pod didn't start or health check path wrong | Check pod logs, events | -| `postgress-external-db-pguser-janus-idp` secret timeout | Deploy | Test step log | Crunchy Postgres operator issue | Check operator logs | -| `Failed to install subscription` | Infra/Deploy | Test step `build-log.txt` | 
OperatorHub/Marketplace issue | Re-trigger, check OLM | -| `ImagePullBackOff` or `ErrImagePull` | Deploy | K8s events, pod describe | Wrong image tag or registry auth | Verify image exists, check pull secrets | -| `helm upgrade` command fails | Deploy | Test step `build-log.txt` | Invalid values, missing CRDs | Check recent values file changes | -| Playwright timeout on login page | Test | HTML report, videos | Keycloak down or misconfigured | Check Keycloak pod logs | -| `backstages.rhdh.redhat.com` CRD timeout | Deploy | Test step log | RHDH Operator not installed | Check operator subscription | -| Test passes on retry (flaky) | Test | JUnit XML (failures > 0 but exit 0) | Non-deterministic test | File flaky test ticket | -| All tests fail with same error | Deploy | Pod logs, HTML report | App not functional despite health check | Investigate app state | -| Certificate issuance timeout (EKS/GKE) | Infra | Test step `build-log.txt` | ACM/GCP cert delays | Re-trigger | -| DNS resolution failure (EKS) | Infra | Test step `build-log.txt` | Route53 propagation delay | Re-trigger | -| Spot VM preemption (AKS) | Infra | K8s events | Azure reclaimed spot instance | Re-trigger | -| `LoadBalancer` IP not obtained (K8s) | Infra | Test step `build-log.txt` | Ingress controller issue | Check ingress controller pods | -| Domain number exhaustion (EKS) | Infra | Test step `build-log.txt` | All 50 domain slots taken | Manual DNS cleanup needed | +## Finding Past Failures + +Instead of maintaining a static cheat sheet that goes stale, use these two sources to find how similar failures were investigated and resolved in the past: + +### AI Test Triager + +The **AI Test Triager** (`@Nightly Test Alerts` Slack app) is your first stop for any failure. It automatically analyzes failed nightly jobs, provides root cause analysis, and searches Jira for similar existing issues. See [AI Test Triager](#ai-test-triager-nightly-test-alerts) for details. 
+ +### Resolved Jira `ci-fail` Issues + +Previously resolved CI failures are tracked in Jira with the **`ci-fail`** label. Search for resolved issues to find patterns, root causes, and fixes for failures you're seeing: + +- [Resolved `ci-fail` issues (RHDHBUGS)](https://redhat.atlassian.net/issues/?jql=project%20%3D%20RHDHBUGS%20AND%20labels%20%3D%20ci-fail%20AND%20status%20in%20(Done%2C%20Closed)%20ORDER%20BY%20resolved%20DESC) + +When investigating a failure, search these resolved issues for keywords from the error message (e.g., `CrashLoopBackOff`, `Failed to reach Backstage`, `ImagePullBackOff`). The resolution comments often describe exactly what was wrong and how it was fixed. --- @@ -639,87 +623,15 @@ Artifacts are stored in GCS. You can browse them via: ### Cluster Access (OCP Jobs Only) -To log into the ephemeral cluster of a running or recent OCP job: - -```bash -.ci/pipelines/ocp-cluster-claim-login.sh -# Or provide the Prow URL directly: -.ci/pipelines/ocp-cluster-claim-login.sh "https://prow.ci.openshift.org/view/gs/..." -``` - -The script will: -1. Extract the cluster namespace from the Prow build log -2. Log into the hosted-mgmt cluster -3. Retrieve `kubeadmin` credentials -4. Log into the ephemeral cluster -5. Offer to open the web console (copies password to clipboard) - -**Requirements**: You must be a member of the `rhdh-pool-admins` [Rover group](https://rover.redhat.com/groups/search?q=rhdh-pool-admins). - -**Important**: Ephemeral clusters are deleted when the CI job terminates. You can only access them while the job is running or shortly after. +Use [`.ci/pipelines/ocp-cluster-claim-login.sh`](../../.ci/pipelines/ocp-cluster-claim-login.sh) to log into the ephemeral cluster of a running or recent OCP job. See [`.ci/pipelines/README.md`](../../.ci/pipelines/README.md) for prerequisites and usage. 
### Re-triggering a Nightly Job -Use the trigger script to re-run a failed nightly job: - -```bash -# Basic re-trigger -.ci/pipelines/trigger-nightly-job.sh --job periodic-ci-redhat-developer-rhdh-main-e2e-ocp-helm-nightly - -# Dry run (preview without triggering) -.ci/pipelines/trigger-nightly-job.sh --job --dry-run +Use [`.ci/pipelines/trigger-nightly-job.sh`](../../.ci/pipelines/trigger-nightly-job.sh) to re-run a failed nightly job. Run with `--help` for all options. You can also use the `/trigger-nightly-job` AI command to trigger jobs interactively. -# With custom image (e.g., RC verification) -.ci/pipelines/trigger-nightly-job.sh --job --quay-repo rhdh/rhdh-hub-rhel9 --tag 1.9-123 +### Related Documentation -# With Slack alerts enabled -.ci/pipelines/trigger-nightly-job.sh --job --send-alerts -``` - -**Authentication**: The script uses a dedicated kubeconfig at `~/.config/openshift-ci/kubeconfig`. If the token is expired, it will open a browser for SSO login. - -### CI Configuration (openshift/release repo) - -The Prow job definitions and ci-operator configs live in the [openshift/release](https://github.com/openshift/release) repo: - -| Path | Description | -|------|-------------| -| [`ci-operator/config/redhat-developer/rhdh/`](https://github.com/openshift/release/tree/master/ci-operator/config/redhat-developer/rhdh) | ci-operator configuration files | -| [`ci-operator/jobs/redhat-developer/rhdh/`](https://github.com/openshift/release/tree/master/ci-operator/jobs/redhat-developer/rhdh) | Generated Prow job definitions | -| [`ci-operator/step-registry/redhat-developer/rhdh/`](https://github.com/openshift/release/tree/master/ci-operator/step-registry/redhat-developer/rhdh) | Step registry (test steps, alert sending) | - -### Documentation - -| Resource | Link | -|----------|------| -| OpenShift CI Documentation | [docs.ci.openshift.org](https://docs.ci.openshift.org/) | -| ci-operator Architecture | [ci-operator 
docs](https://docs.ci.openshift.org/docs/architecture/ci-operator/) | -| Artifacts Documentation | [Artifacts how-to](https://docs.ci.openshift.org/docs/how-tos/artifacts/) | -| Prow Overview | [Prow docs](https://docs.ci.openshift.org/docs/architecture/prow/) | -| Cluster Pools & Claims | [Cluster pools docs](https://docs.ci.openshift.org/docs/how-tos/cluster-claim/) | -| RHDH CI Pipeline README | [`.ci/pipelines/README.md`](../../.ci/pipelines/README.md) | -| E2E Testing CI Documentation | [`CI.md`](CI.md) | -| Playwright Documentation | [playwright.dev](https://playwright.dev/) | -| Playwright Trace Viewer | [Trace viewer docs](https://playwright.dev/docs/trace-viewer) | - -### Key Files in This Repo - -| File | Purpose | -|------|---------| -| [`.ci/pipelines/openshift-ci-tests.sh`](../../.ci/pipelines/openshift-ci-tests.sh) | Main entry point -- dispatches to job handlers | -| [`.ci/pipelines/lib/testing.sh`](../../.ci/pipelines/lib/testing.sh) | Test execution, health checks, artifact collection | -| [`.ci/pipelines/lib/log.sh`](../../.ci/pipelines/lib/log.sh) | Structured logging (log levels, colors, sections) | -| [`.ci/pipelines/reporting.sh`](../../.ci/pipelines/reporting.sh) | Status tracking and result persistence | -| [`.ci/pipelines/env_variables.sh`](../../.ci/pipelines/env_variables.sh) | Environment variables and secrets | -| [`.ci/pipelines/jobs/`](../../.ci/pipelines/jobs/) | Per-job-type handlers (ocp-nightly, aks-helm, etc.) 
| -| [`.ci/pipelines/trigger-nightly-job.sh`](../../.ci/pipelines/trigger-nightly-job.sh) | Manual nightly job trigger via Gangway API | -| [`.ci/pipelines/ocp-cluster-claim-login.sh`](../../.ci/pipelines/ocp-cluster-claim-login.sh) | Cluster access for debugging | -| [`e2e-tests/playwright/projects.json`](../../e2e-tests/playwright/projects.json) | Playwright project definitions (source of truth) | -| [`e2e-tests/playwright.config.ts`](../../e2e-tests/playwright.config.ts) | Playwright configuration (timeouts, retries, workers) | -| [`.ci/pipelines/lib/config.sh`](../../.ci/pipelines/lib/config.sh) | ConfigMap selection and app-config management | -| [`.ci/pipelines/lib/operators.sh`](../../.ci/pipelines/lib/operators.sh) | Operator/OLM installation functions | -| [`.ci/pipelines/lib/helm.sh`](../../.ci/pipelines/lib/helm.sh) | Helm chart operations and value merging | -| [`.ci/pipelines/lib/namespace.sh`](../../.ci/pipelines/lib/namespace.sh) | Namespace lifecycle and image pull secrets | -| [`.ci/pipelines/cleanup.sh`](../../.ci/pipelines/cleanup.sh) | Exit trap for cleanup | -| [`.ci/pipelines/resources/config_map/`](../../.ci/pipelines/resources/config_map/) | App-config YAML files (RBAC and non-RBAC variants) | -| [`.ci/pipelines/value_files/`](../../.ci/pipelines/value_files/) | Helm values overrides for different platforms | +- [`.ci/pipelines/README.md`](../../.ci/pipelines/README.md) -- cluster pools, access requirements, development guidelines +- [`.ci/pipelines/lib/README.md`](../../.ci/pipelines/lib/README.md) -- full list of pipeline library modules and function signatures +- [`CI.md`](CI.md) -- CI testing processes, job definitions, openshift/release repo links +- [OpenShift CI Documentation](https://docs.ci.openshift.org/) -- Prow, ci-operator, cluster pools, artifacts From 17888ba3fbe1e451b89b79208c82b7e9cee551ac Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Wed, 1 Apr 2026 11:57:19 +0200 Subject: [PATCH 06/12] docs: 
add link to internal resources doc in CI Medic Guide Co-Authored-By: Claude Opus 4.6 --- docs/e2e-tests/CI-medic-guide.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/e2e-tests/CI-medic-guide.md b/docs/e2e-tests/CI-medic-guide.md index 6e808a0478..bc74f8c058 100644 --- a/docs/e2e-tests/CI-medic-guide.md +++ b/docs/e2e-tests/CI-medic-guide.md @@ -68,7 +68,8 @@ When your rotation begins: 1. **Read the [Overview](#overview)** above to understand the role and where alerts come in. 2. **Familiarize yourself with the [Useful Links and Tools](#useful-links-and-tools)** section -- open the Prow dashboards, join the Slack channels, and make sure you have access. -3. **Try the [AI Test Triager](#ai-test-triager-nightly-test-alerts)** on a recent failure in `#rhdh-e2e-alerts` to see how it works. It will handle most of the initial analysis for you. +3. **Review the [Internal Resources doc](https://docs.google.com/document/d/1yiMU-u2v8_rC-TBawcaJwV5jAvWcbTjhspuTe3KNcCo/edit?usp=sharing)** -- it covers Vault secrets, ReportPortal dashboards, DevLake analytics, and how to unredact artifacts. These are internal tools you'll need during triage. +4. **Try the [AI Test Triager](#ai-test-triager-nightly-test-alerts)** on a recent failure in `#rhdh-e2e-alerts` to see how it works. It will handle most of the initial analysis for you. That's enough to start triaging. 
@@ -631,6 +632,7 @@ Use [`.ci/pipelines/trigger-nightly-job.sh`](../../.ci/pipelines/trigger-nightly ### Related Documentation +- [Internal Resources (Google Doc)](https://docs.google.com/document/d/1yiMU-u2v8_rC-TBawcaJwV5jAvWcbTjhspuTe3KNcCo/edit?usp=sharing) -- Vault secrets, ReportPortal, DevLake, unredacting artifacts (Red Hat internal) - [`.ci/pipelines/README.md`](../../.ci/pipelines/README.md) -- cluster pools, access requirements, development guidelines - [`.ci/pipelines/lib/README.md`](../../.ci/pipelines/lib/README.md) -- full list of pipeline library modules and function signatures - [`CI.md`](CI.md) -- CI testing processes, job definitions, openshift/release repo links From 8819ae08d6416a0f52585bf9d5b5b7c5fd7902a6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Wed, 1 Apr 2026 11:59:10 +0200 Subject: [PATCH 07/12] docs: fix PR review comments in CI docs - Simplify OSD-GCP nightly test description - Fix /ok-to-test org scope to include both openshift and redhat-developer Co-Authored-By: Claude Opus 4.6 --- docs/e2e-tests/CI.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/e2e-tests/CI.md b/docs/e2e-tests/CI.md index 64cd404fef..1625807b20 100644 --- a/docs/e2e-tests/CI.md +++ b/docs/e2e-tests/CI.md @@ -17,13 +17,13 @@ For scenarios where tests are not automatically triggered, or when you need to m 1. **Commenting `/ok-to-test`:** - **Purpose:** This command is used to validate a PR for testing, especially important for external contributors or when tests are not automatically triggered. - - **Who Can Use It:** Only members of the [openshift](https://github.com/openshift) GitHub organization can mark the PR with this comment. + - **Who Can Use It:** Only members of the [openshift](https://github.com/openshift) or [redhat-developer](https://github.com/redhat-developer) GitHub organizations can mark the PR with this comment. 
- **Use Cases:** - **External Contributors:** For PRs from contributors outside the organization, a member needs to comment `/ok-to-test` to initiate tests. - **More Details:** For additional information about `/ok-to-test`, please refer to the [Kubernetes Community Pull Requests Guide](https://github.com/kubernetes/community/blob/master/contributors/guide/pull-requests.md#more-about-ok-to-test). 2. **Triggering Tests Post-Validation:** - - After an `openshift` org member has validated the PR with `/ok-to-test`, anyone can trigger tests using the following commands: + - After an `openshift` or `redhat-developer` org member has validated the PR with `/ok-to-test`, anyone can trigger tests using the following commands: - `/test ?` to get a list of all available jobs - `/test e2e-ocp-helm` for mandatory PR checks - **Note:** Avoid using `/test all` as it may trigger unnecessary jobs and consume CI resources. Instead, use `/test ?` to see available options and trigger only the specific tests you need. @@ -51,7 +51,7 @@ If the initial automatically triggered tests fail, OpenShift-CI will add a comme - **Purpose:** Validate new PRs for code quality, functionality, and integration. - **Trigger:** - **Automatic:** When a PR includes code changes affecting tests (excluding doc-only changes), tests are automatically triggered. - - **Manual:** When `/ok-to-test` is commented by an `openshift` org member for external contributors or when `/test`, `/test images`, or `/test e2e-ocp-helm` is commented after validation. + - **Manual:** When `/ok-to-test` is commented by an `openshift` or `redhat-developer` org member for external contributors or when `/test`, `/test images`, or `/test e2e-ocp-helm` is commented after validation. - **Environment:** Runs on ephemeral OpenShift clusters managed by Hive. Kubernetes jobs use ephemeral EKS and AKS clusters on spot instances managed by [Mapt](https://github.com/redhat-developer/mapt). GKE uses a long-running cluster. 
- **Configurations:** - Tests are executed on both **RBAC** (Role-Based Access Control) and **non-RBAC** namespaces. Different sets of tests are executed for both the **non-RBAC RHDH instance** and the **RBAC RHDH instance**, each deployed in separate namespaces. @@ -91,7 +91,7 @@ Nightly tests are run to ensure the stability and reliability of our codebase ov - **AKS Nightly Tests:** Nightly tests on Azure Kubernetes Service (AKS) using ephemeral clusters provisioned by [Mapt](https://github.com/redhat-developer/mapt). AKS is exclusively used for nightly runs (no PR checks). - **EKS Nightly Tests:** Nightly tests on AWS Elastic Kubernetes Service (EKS) using ephemeral clusters provisioned by Mapt. - **GKE Nightly Tests:** Nightly tests on Google Kubernetes Engine using a long-running shared cluster. -- **OSD-GCP Nightly Tests:** Nightly tests on OpenShift Dedicated on GCP. Orchestrator is disabled and localization tests are skipped in this environment. +- **OSD-GCP Nightly Tests:** Nightly tests on OpenShift Dedicated on GCP. ### Additional Nightly Jobs for Main Branch From 7e88ade1391980fac6f8c03869ddb1792ffc1474 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Wed, 1 Apr 2026 12:01:20 +0200 Subject: [PATCH 08/12] docs: fix org membership requirement and OSD-GCP description - /ok-to-test requires membership in both openshift and redhat-developer orgs - Add OSD-GCP description in CI Medic Guide job types Co-Authored-By: Claude Opus 4.6 --- docs/e2e-tests/CI-medic-guide.md | 2 +- docs/e2e-tests/CI.md | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/e2e-tests/CI-medic-guide.md b/docs/e2e-tests/CI-medic-guide.md index bc74f8c058..b459df3873 100644 --- a/docs/e2e-tests/CI-medic-guide.md +++ b/docs/e2e-tests/CI-medic-guide.md @@ -381,7 +381,7 @@ The most comprehensive nightly job. Runs on OpenShift using ephemeral cluster cl 3. 
**Sanity plugins check** (`showcase-sanity-plugins`) -- validates plugin loading and basic functionality 4. **Localization tests** (`showcase-localization-fr`, `showcase-localization-it`, `showcase-localization-ja`) -- UI translations -**OSD-GCP variant**: When the job name contains `osd-gcp`, orchestrator is disabled and localization tests are skipped. +**OSD-GCP variant**: Nightly tests on OpenShift Dedicated on GCP. Uses the same handler but orchestrator is disabled and localization tests are skipped. ### OCP Operator (`ocp-operator`) diff --git a/docs/e2e-tests/CI.md b/docs/e2e-tests/CI.md index 1625807b20..321883e120 100644 --- a/docs/e2e-tests/CI.md +++ b/docs/e2e-tests/CI.md @@ -17,13 +17,13 @@ For scenarios where tests are not automatically triggered, or when you need to m 1. **Commenting `/ok-to-test`:** - **Purpose:** This command is used to validate a PR for testing, especially important for external contributors or when tests are not automatically triggered. - - **Who Can Use It:** Only members of the [openshift](https://github.com/openshift) or [redhat-developer](https://github.com/redhat-developer) GitHub organizations can mark the PR with this comment. + - **Who Can Use It:** Only members of both the [openshift](https://github.com/openshift) and [redhat-developer](https://github.com/redhat-developer) GitHub organizations can mark the PR with this comment. - **Use Cases:** - **External Contributors:** For PRs from contributors outside the organization, a member needs to comment `/ok-to-test` to initiate tests. - **More Details:** For additional information about `/ok-to-test`, please refer to the [Kubernetes Community Pull Requests Guide](https://github.com/kubernetes/community/blob/master/contributors/guide/pull-requests.md#more-about-ok-to-test). 2. 
**Triggering Tests Post-Validation:**
-   - After an `openshift` or `redhat-developer` org member has validated the PR with `/ok-to-test`, anyone can trigger tests using the following commands:
+   - After a member of both the `openshift` and `redhat-developer` orgs has validated the PR with `/ok-to-test`, anyone can trigger tests using the following commands:
     - `/test ?` to get a list of all available jobs
     - `/test e2e-ocp-helm` for mandatory PR checks
     - **Note:** Avoid using `/test all` as it may trigger unnecessary jobs and consume CI resources. Instead, use `/test ?` to see available options and trigger only the specific tests you need.
@@ -51,7 +51,7 @@ If the initial automatically triggered tests fail, OpenShift-CI will add a comme
   - **Purpose:** Validate new PRs for code quality, functionality, and integration.
   - **Trigger:**
     - **Automatic:** When a PR includes code changes affecting tests (excluding doc-only changes), tests are automatically triggered.
-    - **Manual:** When `/ok-to-test` is commented by an `openshift` or `redhat-developer` org member for external contributors or when `/test`, `/test images`, or `/test e2e-ocp-helm` is commented after validation.
+    - **Manual:** When `/ok-to-test` is commented by a member of both the `openshift` and `redhat-developer` orgs for external contributors or when `/test`, `/test images`, or `/test e2e-ocp-helm` is commented after validation.
   - **Environment:** Runs on ephemeral OpenShift clusters managed by Hive. Kubernetes jobs use ephemeral EKS and AKS clusters on spot instances managed by [Mapt](https://github.com/redhat-developer/mapt). GKE uses a long-running cluster.
   - **Configurations:**
     - Tests are executed on both **RBAC** (Role-Based Access Control) and **non-RBAC** namespaces. Different sets of tests are executed for both the **non-RBAC RHDH instance** and the **RBAC RHDH instance**, each deployed in separate namespaces.
From 522573cc3044b842ef9f2a3c9a17072c55543335 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Wed, 1 Apr 2026 12:05:27 +0200 Subject: [PATCH 09/12] docs: sync rulesync opencode output Co-Authored-By: Claude Opus 4.6 --- .claude/rules/ci-e2e-testing.md | 2 -- .cursor/rules/ci-e2e-testing.mdc | 2 -- .opencode/memories/ci-e2e-testing.md | 48 +++++++++------------------- .rulesync/rules/ci-e2e-testing.md | 2 -- 4 files changed, 15 insertions(+), 39 deletions(-) diff --git a/.claude/rules/ci-e2e-testing.md b/.claude/rules/ci-e2e-testing.md index 53f7d4607c..debff26ed0 100644 --- a/.claude/rules/ci-e2e-testing.md +++ b/.claude/rules/ci-e2e-testing.md @@ -528,5 +528,3 @@ The choice of config map depends on the **Playwright test project** being execut ### **Configuration Deployment Process** The config maps are deployed as Kubernetes ConfigMaps during CI/CD pipeline execution and are mounted into the RHDH pods to provide runtime configuration. The pipeline selects the appropriate config map based on the test project being executed. - ---- diff --git a/.cursor/rules/ci-e2e-testing.mdc b/.cursor/rules/ci-e2e-testing.mdc index 1924c829ba..07a6af3dd1 100644 --- a/.cursor/rules/ci-e2e-testing.mdc +++ b/.cursor/rules/ci-e2e-testing.mdc @@ -531,5 +531,3 @@ The choice of config map depends on the **Playwright test project** being execut ### **Configuration Deployment Process** The config maps are deployed as Kubernetes ConfigMaps during CI/CD pipeline execution and are mounted into the RHDH pods to provide runtime configuration. The pipeline selects the appropriate config map based on the test project being executed. 
- ---- diff --git a/.opencode/memories/ci-e2e-testing.md b/.opencode/memories/ci-e2e-testing.md index 48053c7593..debff26ed0 100644 --- a/.opencode/memories/ci-e2e-testing.md +++ b/.opencode/memories/ci-e2e-testing.md @@ -77,6 +77,7 @@ test.beforeAll(async ({ }, testInfo) => { - `showcase-operator`: General functionality tests with base deployment using Operator - `showcase-operator-rbac`: General functionality tests with RBAC-enabled deployment using Operator - `showcase-runtime`: Runtime environment tests + - `showcase-runtime-db`: Runtime database tests - `showcase-sanity-plugins`: Plugin sanity checks - `showcase-upgrade`: Upgrade scenario tests - `showcase-localization-fr`: French localization tests @@ -125,18 +126,18 @@ test.beforeAll(async ({ }, testInfo) => { #### CI/CD Pipeline Execution -In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via the `run_tests()` function in `.ci/pipelines/utils.sh`: +In the CI/CD pipeline, tests are executed directly using Playwright's `--project` flag via `testing::run_tests()` in `.ci/pipelines/lib/testing.sh`: ```bash yarn playwright test --project="${playwright_project}" ``` -The namespace and Playwright project are decoupled, allowing flexible reuse. The `check_and_test()` and `run_tests()` functions accept an explicit `playwright_project` argument: +The namespace and Playwright project are decoupled, allowing flexible reuse. 
The `testing::check_and_test()` and `testing::run_tests()` functions accept an explicit `playwright_project` argument: ```bash # Function signatures: -check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] -run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" +testing::check_and_test "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" [max_attempts] [wait_seconds] +testing::run_tests "${RELEASE_NAME}" "${NAMESPACE}" "${PLAYWRIGHT_PROJECT}" "${URL}" ``` #### Local Development Scripts @@ -266,25 +267,7 @@ Check the readme at `.ci/pipelines/README.md` ### Cluster Pools -Available cluster pools for different OCP versions: - -- **RHDH-4-19-US-EAST-2** - - Usage: OCP v4.19 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-19-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-18-US-EAST-2** - - Usage: OCP v4.18 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-18-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-17-US-EAST-2** - - Usage: PR checks on main branch and OCP v4.17 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-17-0-amd64-aws-us-east-2_clusterpool.yaml) - -- **RHDH-4-16-US-EAST-2** - - Usage: OCP v4.16 nightly jobs - - [Cluster Pool Configuration](https://github.com/openshift/release/blob/master/clusters/hosted-mgmt/hive/pools/rhdh/rhdh-ocp-4-16-0-amd64-aws-us-east-2_clusterpool.yaml) - -**Note:** This is subject to change. Use `.ci/pipelines/README.md` as a source of truth. +RHDH uses dedicated Hive cluster pools with the `rhdh` prefix on AWS `us-east-2`. Pool versions rotate as new OCP releases come out. 
Find the current list by filtering for `rhdh` in the [existing cluster pools](https://docs.ci.openshift.org/how-tos/cluster-claim/#existing-cluster-pools) page. See also `.ci/pipelines/README.md`. ### CI Job Types @@ -297,7 +280,7 @@ Available cluster pools for different OCP versions: #### Nightly Tests - **Schedule**: Automated nightly runs - **Environments**: Multiple OCP versions, AKS, GKE -- **Reporting**: Slack notifications to `#rhdh-e2e-test-alerts` +- **Reporting**: Slack notifications to `#rhdh-e2e-alerts` ### Test Execution Environment @@ -347,14 +330,15 @@ yarn prettier:fix # Fix formatting for shell, markdown, and YAML file - Functions may temporarily disable/re-enable error handling with `set +e` / `set -e` pattern #### Job Handlers -The main script handles different job types: -- `handle_aks_helm`: AKS Helm deployment -- `handle_eks_helm`: EKS Helm deployment -- `handle_gke_helm`: GKE Helm deployment -- `handle_ocp_operator`: OCP Operator deployment -- `handle_ocp_nightly`: OCP nightly tests -- `handle_ocp_pull`: OCP PR tests +The main script dispatches to job-specific handlers in `.ci/pipelines/jobs/`: +- `handle_aks_helm` / `handle_aks_operator`: AKS Helm/Operator deployment +- `handle_eks_helm` / `handle_eks_operator`: EKS Helm/Operator deployment +- `handle_gke_helm` / `handle_gke_operator`: GKE Helm/Operator deployment +- `handle_ocp_nightly`: OCP Helm nightly tests (also handles OSD-GCP Helm) +- `handle_ocp_operator`: OCP Operator nightly tests (also handles OSD-GCP Operator) +- `handle_ocp_pull`: OCP PR checks - `handle_auth_providers`: Auth provider tests +- `handle_ocp_helm_upgrade`: Upgrade scenario tests #### Special Case: showcase-auth-providers Deployment @@ -544,5 +528,3 @@ The choice of config map depends on the **Playwright test project** being execut ### **Configuration Deployment Process** The config maps are deployed as Kubernetes ConfigMaps during CI/CD pipeline execution and are mounted into the RHDH pods to provide runtime 
configuration. The pipeline selects the appropriate config map based on the test project being executed. - ---- diff --git a/.rulesync/rules/ci-e2e-testing.md b/.rulesync/rules/ci-e2e-testing.md index 09f6d20ded..985d6a0d7b 100644 --- a/.rulesync/rules/ci-e2e-testing.md +++ b/.rulesync/rules/ci-e2e-testing.md @@ -534,5 +534,3 @@ The choice of config map depends on the **Playwright test project** being execut ### **Configuration Deployment Process** The config maps are deployed as Kubernetes ConfigMaps during CI/CD pipeline execution and are mounted into the RHDH pods to provide runtime configuration. The pipeline selects the appropriate config map based on the test project being executed. - ---- From 2742a2310f886ae9108643868d6f9219e2fe3fe2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Wed, 1 Apr 2026 12:08:48 +0200 Subject: [PATCH 10/12] chore: gitignore all *.local.md files Co-Authored-By: Claude Opus 4.6 --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 0d9b7d0575..84ac1c7a9c 100644 --- a/.gitignore +++ b/.gitignore @@ -42,6 +42,7 @@ site # Local configuration files *.local.yaml +*.local.md # Sensitive credentials *-credentials.local.yaml From 2a98bb28b820d648b5b85d9cc7f49aa0691e8a00 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?= Date: Wed, 1 Apr 2026 12:12:12 +0200 Subject: [PATCH 11/12] docs: fix stale references in nightly testing SVG diagram Replace IBM Cloud references with OCP and fix Slack channel name from #rhdh-e2e-test-alerts to #rhdh-e2e-alerts. Co-Authored-By: Claude Opus 4.6 --- docs/images/nightly_diagram.svg | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/images/nightly_diagram.svg b/docs/images/nightly_diagram.svg index 02b01130ad..b3a4dd3072 100644 --- a/docs/images/nightly_diagram.svg +++ b/docs/images/nightly_diagram.svg @@ -1,3 +1,3 @@ -

[SVG diagram diff: the surrounding markup was lost in extraction; only the text labels survive. The single-line SVG is replaced wholesale.
 Removed-line labels: Nightly; Reporting; Artifact Collection; Env Setup - IBM Cloud; Env Setup - AKS; Generate Reports; Trigger Nightly Test Job; AKS Cluster: Run Tests; IBM Cloud OCP Cluster: Run Tests on Recent Versions; Configure AKS Environment; Deploy RHDH Instance with Helm; Run Test Suites on AKS; Configure IBM Cloud RBAC & non-RBAC Namespaces; Deploy RHDH Instance with Helm; Run Test Suites on IBM Cloud OCP; Collect Logs, Screenshots, Recordings; Store Artifacts for 6 Months; Post Results to #rhdh-e2e-test-alerts on Slack.
 Added-line labels: identical, except "IBM Cloud" references become "OCP" and the channel becomes #rhdh-e2e-alerts.]

From 6a86212913e3054447d2c1fd9ce7032c0cc9b91 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Zbyn=C4=9Bk=20Dr=C3=A1pela?=
Date: Thu, 2 Apr 2026 11:07:51 +0200
Subject: [PATCH 12/12] docs: clarify CI Medic Guide focus on RHDH Core

---
 docs/e2e-tests/CI-medic-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/e2e-tests/CI-medic-guide.md b/docs/e2e-tests/CI-medic-guide.md
index b459df3873..52c83cfe67 100644
--- a/docs/e2e-tests/CI-medic-guide.md
+++ b/docs/e2e-tests/CI-medic-guide.md
@@ -1,6 +1,6 @@
 # CI Medic Guide

-A practical guide for investigating test failures in RHDH nightly jobs and PR checks.
+A practical guide for investigating test failures in RHDH Core (`redhat-developer/rhdh`) nightly jobs and PR checks.

 ## Table of Contents