GIE-505: add observability toolset evals #241
openshift-merge-bot[bot] merged 3 commits into openshift:main
Conversation
📝 Walkthrough
This pull request adds a comprehensive observability evaluation suite by introducing task set configurations across three agent implementations (Claude, Gemini, OpenAI) and defining 26 new observability test tasks spanning alert management, metric discovery, label exploration, and cluster health queries.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Skipping CI for Draft Pull Request.
The structure, organization, file layout, and eval config wiring are all solid. The weak spot is that the llmJudge.contains values use too many generic words that already appear in the prompt, so an agent could pass without actually calling tools or discovering anything.
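To make that concern concrete, here is a minimal sketch (using the same llmJudge/contains verify schema these tasks already use; the metric name in the second check is only illustrative) contrasting a prompt-leaked token with a more discriminative one:

```yaml
verify:
  # Weak: "pod" already appears in the prompt, so any answer that echoes the
  # question passes without the agent ever calling a tool.
  - llmJudge:
      contains: "pod"
      reason: "Verify the response identifies specific pods"
  # Stronger: require a token that realistically only shows up in real query output.
  - llmJudge:
      contains: "container_cpu_usage_seconds_total"
      reason: "Verify the agent actually queried the CPU usage metric"
```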
@nader-ziada @tremes I will look at this today and have a PR in obs-mcp to update the evals; we can update it here once merged.

@slashpai thanks for the updates, LGTM. Can you move the PR out of draft?

Thank you. I have updated it in the main repo. Once merged, @tremes can update this branch.
Actionable comments posted: 14
🧹 Nitpick comments (5)
evals/tasks/observability/queries/memory-usage.yaml (1)
20-25: Top-5 requirement is not explicitly validated. Current checks confirm metric intent but not that the answer actually returns five ranked results. Consider adding a stricter judge token/pattern tied to ranking output (e.g., ordinal/list structure plus five pod entries).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/memory-usage.yaml` around lines 20 - 25: The current llmJudge only checks for the word "pod" and doesn't validate that the answer returns a top-5 ranked list; update the judge block (llmJudge) to require a pattern that enforces five distinct, ordered pod entries (for example add a new token/pattern field or replace contains with a regex that matches five ordinal or numbered lines like "1. <pod>", "2. <pod>" through "5.") so the evaluation asserts both count and ranking in the response.

evals/tasks/observability/labels/label-names.yaml (1)
21-23: `pod` check is weak for this prompt context. Since `kube_pod_info` appears in the prompt, `pod` can appear without real discovery. Consider replacing this token with a less prompt-derived label (for example `node`, `uid`, or `created_by_kind`) to reduce false-positive passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/labels/label-names.yaml` around lines 21 - 23: The llmJudge label check currently looks for the token "pod" (llmJudge -> contains -> "pod"), which is too easily satisfied by the prompt; update the label under llmJudge to check for a less prompt-derived field such as "node", "uid", or "created_by_kind" instead and update the reason text accordingly (e.g., change contains to "node" and reason to "Verify the output includes the node label") so the check reduces false-positive passes.

evals/tasks/observability/queries/network-traffic.yaml (1)
20-22: Consider replacing generic `pod` token with stronger evidence. `pod` alone is easy to satisfy with templated language. A more discriminative check (for example presence of pod-like identifiers or ranked result wording) would reduce false positives.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/network-traffic.yaml` around lines 20 - 22: The llmJudge check currently uses a weak token (contains: "pod") which yields false positives; update the llmJudge validation to require stronger evidence such as a pod-like identifier or structured phrasing (e.g., "pod/<name>" or Kubernetes-style names), or require ranked/result wording like "top N pods" or explicit pod names, replacing the simple contains: "pod" condition with a more discriminative match so the judge only passes when real pod identifiers or ranked results are present.

evals/tasks/observability/queries/nonexistent-metric.yaml (1)
17-20: The"not found"substring match is brittle—use a semantic judge or enumerate acceptable phrases.The
containsfield performs literal substring matching, not semantic evaluation. Agents may naturally say "does not exist", "no such metric", "empty", or "unavailable", causing the check to fail. Either use an LLM judge with a rubric (e.g.,llmJudge.llm(rubric="...")) to evaluate semantic meaning, or add multiplellmJudgeentries with differentcontainsvalues (OR-combined per framework semantics).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/nonexistent-metric.yaml` around lines 17 - 20: The current verify block uses a brittle literal match via the contains field ("not found"); update the verify to perform a semantic check by replacing the single contains check with an llmJudge that uses an LLM rubric (e.g., llmJudge.llm(rubric="...")) describing acceptable semantic equivalents like "does not exist", "no such metric", "unavailable", or "empty", or alternatively add multiple llmJudge entries each checking different acceptable phrases so the verification uses semantic judgment rather than a single substring match; look for the verify -> - llmJudge / contains usage in the file to make this change.

evals/tasks/observability/queries/high-cardinality-rejection.yaml (1)
17-20: Verification is narrower than the stated reason. The `reason` claims to verify both (a) the guardrail rejection and (b) that the agent "suggests a scoped alternative", but `contains: "guardrail"` only checks the first. Consider adding a second check (e.g., `contains: "namespace"` or `contains: "scope"`/`"smaller"`/`"resolution"`), or replace the `contains` check with an `llmJudge` prompt-based rubric that actually asserts both conditions, so a response that just complains about a guardrail without proposing an alternative doesn't pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/high-cardinality-rejection.yaml` around lines 17 - 20, The verification only checks for "guardrail" but the reason requires both recognizing the guardrail rejection and proposing a scoped alternative; update the `verify.llmJudge` section in high-cardinality-rejection.yaml to assert both conditions by either adding a second `contains` check (e.g., `contains: "namespace"` or `contains: "scope"` / `"smaller"`) alongside the existing `contains: "guardrail"`, or replace `contains` with a prompt-based `llmJudge` rubric that explicitly requires the response to mention the guardrail rejection and to suggest a concrete, scoped alternative (use the `verify.llmJudge` key to host the new rubric).
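One possible shape for that two-part check, sketched against the same contains-based schema (the second token is illustrative; any scoping hint such as "scope" or "smaller" would serve the same purpose):

```yaml
verify:
  - llmJudge:
      contains: "guardrail"
      reason: "Verify the agent recognized the guardrail rejection"
  - llmJudge:
      contains: "namespace"
      reason: "Verify the agent suggested a scoped alternative, e.g. restricting the query to a namespace"
```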
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/tasks/observability/alerts/alert-investigation.yaml`:
- Around line 21-23: Replace the redundant llmJudge `contains: "alertmanager"`
check with a concrete Alertmanager metric assertion so the judge verifies metric
investigation; specifically update the llmJudge entry (the second one shown) to
assert for a specific metric name such as
`alertmanager_notifications_failed_total` or
`alertmanager_config_last_reload_successful` instead of the substring
"alertmanager" so the check ensures the agent queried Alertmanager metrics.
In `@evals/tasks/observability/alerts/filtered-alerts.yaml`:
- Around line 17-24: The verify step only checks for the presence of
AlertmanagerReceiversNotConfigured in the output and doesn’t assert that the
agent passed a severity=warning filter to the get_alerts tool; update the verify
block to add an assertions.toolsUsed entry that matches toolPattern "get_alerts"
and includes args with severity: "warning" so the eval framework verifies the
agent actually supplied the severity filter, and consider replacing or making
AlertmanagerReceiversNotConfigured configurable (or use a more reliable alert
like "Watchdog") to avoid false failures when that specific alert is absent.
In `@evals/tasks/observability/alerts/get-alerts.yaml`:
- Around line 17-19: The llmJudge assertion is brittle because it checks for a
literal "Watchdog"; change the test to assert on alert structure/content instead
of a single name: modify the llmJudge block to validate that the retrieved
alerts include at least one entry with an "alertname" field and a "status" equal
to "firing" (or equivalent key/value pair), and update the "reason" to reflect
this structural check; reference the llmJudge assertion and its "contains" usage
to replace the string match with a structural/regex or JSON-path style check
that looks for alertname + status=firing.
In `@evals/tasks/observability/alerts/get-silences.yaml`:
- Around line 17-19: The llmJudge check is too generic and subject to
prompt-leakage because it only asserts contains: "silences"; update the llmJudge
assertion to look for more specific, non-leaked indicators such as a phrase or
regex that proves the agent inspected silences (e.g. require "active silences"
or "No active silences exist" OR require presence of matcher fields like
"matchers:" or a matcher key pattern), by replacing the simple contains with a
stricter string or regex match (refer to llmJudge and the contains entry in the
YAML) so only genuine instrumented responses pass.
In `@evals/tasks/observability/labels/get-series.yaml`:
- Around line 17-26: The current llmJudge checks are too weak (they allow
restating "kube_pod_info" or generic "namespace"); update the verify section to
require evidence only a real get_series call can produce by (1) adding an
llmJudge that asserts a numeric cardinality (e.g., a regex or type check for an
integer count) for the kube_pod_info series, (2) requiring at least one concrete
label value observed (e.g., "kube-system" or "openshift-monitoring") and at
least one specific label key from the API (e.g., "pod", "uid", "host_ip"), and
(3) adding a tool-call assertion that the agent invoked get_series (or
list_metrics) so the judge only passes when the tool was actually used. Ensure
references to the metric name kube_pod_info and the tool name get_series are
present in the updated verify checks.
In `@evals/tasks/observability/labels/series-by-namespace.yaml`:
- Around line 19-28: The llmJudge checks ("contains: \"container\"" and
"contains: \"pod\"") are too generic and can be satisfied by the prompt itself;
update the assertions in this YAML to require evidence only obtainable from a
real get_series call (for example assert concrete label values like
namespace="openshift-monitoring", a real pod name such as prometheus-k8s-0, or a
container value like prometheus), or assert a numeric cardinality for
container_cpu_usage_seconds_total in openshift-monitoring; additionally add a
paired tool-call assertion that verifies the get_series tool was invoked for the
openshift-monitoring namespace to prevent false positives (referencing the
llmJudge entries and the prompt inline for where to change the checks).
In `@evals/tasks/observability/metrics/list-metrics.yaml`:
- Around line 12-14: Update the task description and agent prompt to remove
references to the non-existent list_metrics tool and the name_regex parameter
and instead instruct the agent to use the existing prometheus_query (or
prometheus_query_range) tool with a PromQL name-match filter; specifically
mention using a query like `{__name__=~"kube.*"}` to list metrics containing
"kube", and reference the observable tool names prometheus_query and
prometheus_query_range so the task aligns with the actual observability toolset.
In `@evals/tasks/observability/metrics/list-node-metrics.yaml`:
- Around line 17-19: Replace the weak llmJudge.contains "node_" check with a
stricter assertion: either add a second llmJudge entry that checks for a
specific, less-likely-to-be-hallucinated metric name (e.g.,
"node_cpu_seconds_total" or another long metric string) or tighten tool-usage
validation to assert the tool pattern equals "list_metrics" (in addition to the
existing minToolCalls: 1) so the task requires an actual call to list_metrics;
update the list-node-metrics.yaml llmJudge and taskSet entries accordingly to
reference llmJudge.contains and the tool pattern "list_metrics" or add a new
llmJudge with the discriminating substring.
In `@evals/tasks/observability/queries/backend-reachability.yaml`:
- Around line 18-23: The current llmJudge `contains: "prometheus-k8s"` can be
spoofed by echoing the prompt token; update the validation to require evidence
only obtainable from an actual Prometheus query — for example change the judge
rule to assert the numeric up value or an `instance=` label (e.g., expect `"1"`
or `"instance="` in the response) or require the text "up" / "reachable"
alongside a numeric result, and pair this with a tool-usage assertion for the
Prometheus query; modify the `llmJudge` block and the `prompt.inline` (the
`up{job="prometheus-k8s"}` query) so the judge checks the query result (numeric
or instance label) rather than just the literal job name.
In `@evals/tasks/observability/queries/cpu-usage.yaml`:
- Around line 20-22: The llmJudge assertion `contains: "pod"` is too weak and
causes false positives; update the llmJudge in cpu-usage.yaml to assert a
concrete pod identifier (for example require `pod_name`, `pod=`, or a Kubernetes
UID pattern) or remove the `contains: "pod"` check and rely on the existing
`container_cpu_usage_seconds_total` assertion instead so the judge only passes
when an actual pod identifier is present. Target the `llmJudge` entry and
replace the `contains: "pod"` condition with a stricter token such as `contains:
"pod_name"` or `contains: "pod="` (or delete that assertion) to ensure responses
list real pod identifiers.
In `@evals/tasks/observability/queries/diagnose-cluster-health.yaml`:
- Around line 22-24: The llmJudge rule using contains: "kube_" is too
loose—update the llmJudge block to assert a concrete metric (e.g., contains:
"kube_node_status_condition" or "kube_pod_status_phase") and/or add a
complementary tool-usage assertion that checks the metrics query tool output
(for example assert that the metrics tool returned results or that a field like
metrics_query_response contains those metric names); target the llmJudge entry
and replace the generic contains: "kube_" with the chosen concrete metric string
and/or add an additional assertion verifying the tool invocation/response so the
judge cannot pass on narrative mentions alone.
In `@evals/tasks/observability/queries/nonexistent-namespace.yaml`:
- Around line 17-23: The llmJudge check uses a narrow case-sensitive contains:
"no data" which will miss equivalent valid responses; update the check for
llmJudge (the contains criterion) to either match the canonical empty-result
marker returned by the tool (e.g. the JSON token `"result":[]`), or broaden it
to a case-insensitive/semantic matcher that accepts any of the expected phrases
from the reason ("no data", "no results", "no pods found")—for example replace
the single contains with a regex/array of acceptable strings or a
case-insensitive match so the prompt inline "Show me the memory usage..." will
be accepted when the namespace is empty.
In `@evals/tasks/observability/queries/time-range-query.yaml`:
- Around line 21-26: The llmJudge assertion "contains: 'pod'" is too weak
because the prompt already mentions "pods"; update the check to verify actual
query execution or real results by replacing or augmenting llmJudge to (a)
assert a concrete pod-name pattern or list (e.g., require regex like
"pod-[a-z0-9]+"), or (b) assert a tool-call to execute_range_query with
start/end/step parameters (verify presence of execute_range_query and its args),
or (c) require a structured series-formatted result string; locate and change
the llmJudge block (symbol: llmJudge and its contains/reason fields) and/or add
a tool-call assertion referencing execute_range_query to ensure the response
reflects real query output rather than echoing the prompt.
In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml`:
- Around line 12-15: The task description incorrectly references a non-existent
tool show_timeseries; update the description in visualize-cpu-usage.yaml to
reference the actual observability tools (e.g., prometheus_query_range for
time-series retrieval) or alternatively register a new show_timeseries tool in
the observability toolset; specifically, change the text mentioning
show_timeseries to mention prometheus_query_range (or list available tools:
prometheus_query, prometheus_query_range, alertmanager_alerts) so the task
matches the real toolset and can retrieve time-series data programmatically.
---
Nitpick comments:
In `@evals/tasks/observability/labels/label-names.yaml`:
- Around line 21-23: The llmJudge label check currently looks for the token
"pod" (llmJudge -> contains -> "pod"), which is too easily satisfied by the
prompt; update the label under llmJudge to check for a less prompt-derived field
such as "node", "uid", or "created_by_kind" instead and update the reason text
accordingly (e.g., change contains to "node" and reason to "Verify the output
includes the node label") so the check reduces false-positive passes.
In `@evals/tasks/observability/queries/high-cardinality-rejection.yaml`:
- Around line 17-20: The verification only checks for "guardrail" but the reason
requires both recognizing the guardrail rejection and proposing a scoped
alternative; update the `verify.llmJudge` section in
high-cardinality-rejection.yaml to assert both conditions by either adding a
second `contains` check (e.g., `contains: "namespace"` or `contains: "scope"` /
`"smaller"`) alongside the existing `contains: "guardrail"`, or replace
`contains` with a prompt-based `llmJudge` rubric that explicitly requires the
response to mention the guardrail rejection and to suggest a concrete, scoped
alternative (use the `verify.llmJudge` key to host the new rubric).
In `@evals/tasks/observability/queries/memory-usage.yaml`:
- Around line 20-25: The current llmJudge only checks for the word "pod" and
doesn't validate that the answer returns a top-5 ranked list; update the judge
block (llmJudge) to require a pattern that enforces five distinct, ordered pod
entries (for example add a new token/pattern field or replace contains with a
regex that matches five ordinal or numbered lines like "1. <pod>", "2. <pod>"
through "5.") so the evaluation asserts both count and ranking in the response.
In `@evals/tasks/observability/queries/network-traffic.yaml`:
- Around line 20-22: The llmJudge check currently uses a weak token (contains:
"pod") which yields false positives; update the llmJudge validation to require
stronger evidence such as a pod-like identifier or structured phrasing (e.g.,
"pod/<name>" or Kubernetes-style names), or require ranked/result wording like
"top N pods" or explicit pod names, replacing the simple contains: "pod"
condition with a more discriminative match so the judge only passes when real
pod identifiers or ranked results are present.
In `@evals/tasks/observability/queries/nonexistent-metric.yaml`:
- Around line 17-20: The current verify block uses a brittle literal match via
the contains field ("not found"); update the verify to perform a semantic check
by replacing the single contains check with an llmJudge that uses an LLM rubric
(e.g., llmJudge.llm(rubric="...")) describing acceptable semantic equivalents
like "does not exist", "no such metric", "unavailable", or "empty", or
alternatively add multiple llmJudge entries each checking different acceptable
phrases so the verification uses semantic judgment rather than a single
substring match; look for the verify -> - llmJudge / contains usage in the file
to make this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: bedf71e1-ce58-4a89-8d3c-6cb91605ff9b
📒 Files selected for processing (31)
- evals/claude-code/eval.yaml
- evals/gemini-agent/eval.yaml
- evals/openai-agent/eval.yaml
- evals/tasks/observability/alerts/alert-investigation.yaml
- evals/tasks/observability/alerts/filtered-alerts.yaml
- evals/tasks/observability/alerts/get-alerts.yaml
- evals/tasks/observability/alerts/get-silences.yaml
- evals/tasks/observability/labels/get-series.yaml
- evals/tasks/observability/labels/label-names.yaml
- evals/tasks/observability/labels/label-values.yaml
- evals/tasks/observability/labels/series-by-namespace.yaml
- evals/tasks/observability/metrics/list-metrics.yaml
- evals/tasks/observability/metrics/list-node-metrics.yaml
- evals/tasks/observability/queries/backend-reachability.yaml
- evals/tasks/observability/queries/cpu-usage.yaml
- evals/tasks/observability/queries/crashlooping-pods.yaml
- evals/tasks/observability/queries/diagnose-cluster-health.yaml
- evals/tasks/observability/queries/high-cardinality-rejection.yaml
- evals/tasks/observability/queries/memory-usage.yaml
- evals/tasks/observability/queries/namespace-pod-count.yaml
- evals/tasks/observability/queries/namespace-resource-usage.yaml
- evals/tasks/observability/queries/network-traffic.yaml
- evals/tasks/observability/queries/nonexistent-metric.yaml
- evals/tasks/observability/queries/nonexistent-namespace.yaml
- evals/tasks/observability/queries/pending-pods.yaml
- evals/tasks/observability/queries/pods-created.yaml
- evals/tasks/observability/queries/prometheus-head-series.yaml
- evals/tasks/observability/queries/prometheus-requests.yaml
- evals/tasks/observability/queries/prometheus-wal-size.yaml
- evals/tasks/observability/queries/time-range-query.yaml
- evals/tasks/observability/queries/visualize-cpu-usage.yaml
```yaml
- llmJudge:
    contains: "alertmanager"
    reason: "Verify the agent investigated Alertmanager-related metrics for the alert"
```
Second judge contains: "alertmanager" is redundant and weak.
Substring alertmanager is already contained in AlertmanagerReceiversNotConfigured (checked above) and appears naturally in any narrative about Alertmanager, so it doesn't independently verify that the agent investigated related metrics. Consider asserting on a concrete Alertmanager metric name the agent should have queried (e.g., alertmanager_notifications_failed_total, alertmanager_config_last_reload_successful) to actually validate the "investigate related metrics" step.
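A minimal sketch of that adjustment, reusing the verify schema from these tasks (either of the metric names mentioned above would work):

```yaml
verify:
  - llmJudge:
      contains: "AlertmanagerReceiversNotConfigured"
      reason: "Verify the agent identified the firing alert"
  - llmJudge:
      contains: "alertmanager_notifications_failed_total"
      reason: "Verify the agent queried a concrete Alertmanager metric while investigating"
```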
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/alert-investigation.yaml` around lines 21 -
23, Replace the redundant llmJudge `contains: "alertmanager"` check with a
concrete Alertmanager metric assertion so the judge verifies metric
investigation; specifically update the llmJudge entry (the second one shown) to
assert for a specific metric name such as
`alertmanager_notifications_failed_total` or
`alertmanager_config_last_reload_successful` instead of the substring
"alertmanager" so the check ensures the agent queried Alertmanager metrics.
```yaml
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent filtered for warning-severity alerts and found AlertmanagerReceiversNotConfigured"
  prompt:
    inline: |
      Show me only the active alerts with severity=warning.
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check whether the eval framework supports asserting tool-call arguments (not just tool patterns/counts).
rg -nP -C3 '(toolPattern|minToolCalls|maxToolCalls|toolArgs|arguments)' --type=yaml -g 'evals/**'
rg -nP -C3 '(ToolPattern|ToolArgs|Arguments|MinToolCalls)' -g '!vendor/**' -g '!**/testdata/**'

Repository: openshift/openshift-mcp-server
Length of output: 50389
🏁 Script executed:
cat -n evals/tasks/observability/alerts/filtered-alerts.yaml

Repository: openshift/openshift-mcp-server
Length of output: 1010
🏁 Script executed:
# Search for task.yaml files with tool argument assertions to understand the proper structure
find evals/tasks -name "*.yaml" -type f | head -20 | xargs -I {} sh -c 'echo "=== {} ===" && grep -A 10 "toolPattern\|args:" {} | head -15'

Repository: openshift/openshift-mcp-server
Length of output: 1473
🏁 Script executed:
# Look at the netedge example that showed args field
cat -n evals/tasks/netedge/query_prometheus/task.yaml

Repository: openshift/openshift-mcp-server
Length of output: 519
🏁 Script executed:
# Search for other observability tasks to understand the pattern
find evals/tasks/observability -name "*.yaml" -type f | xargs cat

Repository: openshift/openshift-mcp-server
Length of output: 20635
Use tool-call argument assertion to verify severity=warning filter is passed to get_alerts.
The current verify section only checks that AlertmanagerReceiversNotConfigured appears in the output; it does not verify the agent passed a severity=warning filter to the tool. An agent that fetched all alerts would still pass this check. The eval framework supports argument assertions (see evals/tasks/netedge/query_prometheus/task.yaml for the pattern), so add an args constraint under assertions.toolsUsed to verify the severity parameter was actually supplied:
```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_alerts"
      args:
        severity: "warning"
```

Also note: relying on AlertmanagerReceiversNotConfigured to always fire may be fragile. If a cluster configures Alertmanager receivers, this alert will be absent and the eval will fail unrelated to agent behavior. Consider using a more reliable alert (like Watchdog, which is standard across OpenShift clusters) or make the alert name configurable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/filtered-alerts.yaml` around lines 17 - 24,
The verify step only checks for the presence of
AlertmanagerReceiversNotConfigured in the output and doesn’t assert that the
agent passed a severity=warning filter to the get_alerts tool; update the verify
block to add an assertions.toolsUsed entry that matches toolPattern "get_alerts"
and includes args with severity: "warning" so the eval framework verifies the
agent actually supplied the severity filter, and consider replacing or making
AlertmanagerReceiversNotConfigured configurable (or use a more reliable alert
like "Watchdog") to avoid false failures when that specific alert is absent.
```yaml
- llmJudge:
    contains: "Watchdog"
    reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
```
Watchdog-only assertion is brittle across environments.
Requiring a specific alert name can fail valid runs on clusters where Watchdog is not firing, and can still pass via hallucinated output. Prefer asserting on retrieved alert structure/content (for example alertname + status=firing) rather than one fixed alert name.
Suggested adjustment
```diff
 verify:
   - llmJudge:
-      contains: "Watchdog"
-      reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
+      contains: "alertname"
+      reason: "Verify the response includes concrete alert fields from retrieved firing alerts"
+  - llmJudge:
+      contains: "firing"
+      reason: "Verify the response reports firing-state alerts from Alertmanager output"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/get-alerts.yaml` around lines 17 - 19, The
llmJudge assertion is brittle because it checks for a literal "Watchdog"; change
the test to assert on alert structure/content instead of a single name: modify
the llmJudge block to validate that the retrieved alerts include at least one
entry with an "alertname" field and a "status" equal to "firing" (or equivalent
key/value pair), and update the "reason" to reflect this structural check;
reference the llmJudge assertion and its "contains" usage to replace the string
match with a structural/regex or JSON-path style check that looks for alertname
+ status=firing.
```yaml
- llmJudge:
    contains: "silences"
    reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
```
Judge token is prompt-leaked and allows false positives.
contains: "silences" is too generic because the same term is in the prompt, so a non-instrumented answer can pass.
Suggested adjustment
```diff
 verify:
   - llmJudge:
-      contains: "silences"
-      reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
+      contains: "matchers"
+      reason: "Verify the response includes concrete silence details when active silences exist"
+  - llmJudge:
+      contains: "no active silences"
+      reason: "Allow explicit empty-state reporting when no silences are present"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The committable suggestion replaces:

```yaml
- llmJudge:
    contains: "silences"
    reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
```

with:

```yaml
- llmJudge:
    contains: "matchers"
    reason: "Verify the response includes concrete silence details when active silences exist"
- llmJudge:
    contains: "no active silences"
    reason: "Allow explicit empty-state reporting when no silences are present"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/get-silences.yaml` around lines 17 - 19, The
llmJudge check is too generic and subject to prompt-leakage because it only
asserts contains: "silences"; update the llmJudge assertion to look for more
specific, non-leaked indicators such as a phrase or regex that proves the agent
inspected silences (e.g. require "active silences" or "No active silences exist"
OR require presence of matcher fields like "matchers:" or a matcher key
pattern), by replacing the simple contains with a stricter string or regex match
(refer to llmJudge and the contains entry in the YAML) so only genuine
instrumented responses pass.
```yaml
verify:
  - llmJudge:
      contains: "namespace"
      reason: "Verify the agent retrieved actual series data containing label dimensions like namespace"
  - llmJudge:
      contains: "kube_pod_info"
      reason: "Verify the agent queried the kube_pod_info metric and reported its cardinality"
prompt:
  inline: |
    How many time series exist for the kube_pod_info metric? Show the count and list the label names present.
```
Both judge tokens are weak: one is in the prompt, the other is generic.
kube_pod_info is mentioned verbatim in the prompt, so the agent can pass that judge by merely restating the metric name without ever invoking list_metrics or get_series. namespace is a generic word that is almost certainly in any reasonable answer (and is implied by the prompt's "label names"), so it also doesn't prove real tool usage.
Suggest requiring evidence only a real get_series result can produce — e.g., a numeric cardinality count in the response, concrete label values observed in this cluster (kube-system, openshift-monitoring), or specific label keys returned by the API (pod, uid, host_ip). Complementing with a tool-call assertion on get_series would strengthen this further.
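A sketch of that combination, assuming the assertions.toolsUsed support demonstrated in the filtered-alerts thread also applies here (the label value and tool name come from this comment and are not verified against the file):

```yaml
verify:
  - llmJudge:
      contains: "openshift-monitoring"
      reason: "Verify the response includes a concrete label value observed in the cluster"
  - llmJudge:
      contains: "uid"
      reason: "Verify the response lists real label keys returned for kube_pod_info"
# Placement of assertions alongside verify follows the filtered-alerts suggestion.
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_series"
```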
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/labels/get-series.yaml` around lines 17 - 26, The
current llmJudge checks are too weak (they allow restating "kube_pod_info" or
generic "namespace"); update the verify section to require evidence only a real
get_series call can produce by (1) adding an llmJudge that asserts a numeric
cardinality (e.g., a regex or type check for an integer count) for the
kube_pod_info series, (2) requiring at least one concrete label value observed
(e.g., "kube-system" or "openshift-monitoring") and at least one specific label
key from the API (e.g., "pod", "uid", "host_ip"), and (3) adding a tool-call
assertion that the agent invoked get_series (or list_metrics) so the judge only
passes when the tool was actually used. Ensure references to the metric name
kube_pod_info and the tool name get_series are present in the updated verify
checks.
```yaml
- llmJudge:
    contains: "pod"
    reason: "Verify the response identifies specific pods"
```
False-positive risk: contains: "pod" is trivially satisfied.
This is exactly the concern raised by reviewers on the PR: the prompt is "Which pods are using the most CPU?", so any response that merely restates or references the prompt (including a refusal or a wrong answer) will contain the substring pod and pass this assertion without the agent actually identifying real pods.
Consider replacing this with a stricter check — e.g., require a concrete pod-identifying field/label such as pod= or pod_name, or rely solely on the container_cpu_usage_seconds_total assertion and remove this one.
Suggested diff
```diff
 - llmJudge:
-    contains: "pod"
-    reason: "Verify the response identifies specific pods"
+    contains: "pod="
+    reason: "Verify the response identifies specific pods by label (e.g., pod=\"<name>\")"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The committable suggestion replaces:

```yaml
- llmJudge:
    contains: "pod"
    reason: "Verify the response identifies specific pods"
```

with:

```yaml
- llmJudge:
    contains: "pod="
    reason: "Verify the response identifies specific pods by label (e.g., pod=\"<name>\")"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/cpu-usage.yaml` around lines 20 - 22, The
llmJudge assertion `contains: "pod"` is too weak and causes false positives;
update the llmJudge in cpu-usage.yaml to assert a concrete pod identifier (for
example require `pod_name`, `pod=`, or a Kubernetes UID pattern) or remove the
`contains: "pod"` check and rely on the existing
`container_cpu_usage_seconds_total` assertion instead so the judge only passes
when an actual pod identifier is present. Target the `llmJudge` entry and
replace the `contains: "pod"` condition with a stricter token such as `contains:
"pod_name"` or `contains: "pod="` (or delete that assertion) to ensure responses
list real pod identifiers.
```yaml
- llmJudge:
    contains: "kube_"
    reason: "Verify the agent queried Kubernetes metrics (e.g. kube_node_status_condition) for health assessment"
```
contains: "kube_" is a very loose marker.
kube_ will match any casual mention of a kube_* metric name in the agent's narrative (including hypothetical references like "I would check kube_node_status_condition") without proof that the agent actually invoked a metrics tool. Consider tightening to a concrete metric the agent is expected to return results from (e.g., kube_node_status_condition or kube_pod_status_phase), or complement with a tool-usage assertion against the metrics query tool so the judge doesn't pass on prose alone.
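For example, under the same verify schema (metric names taken from this comment; a sketch of one possible tightening, not the only valid one):

```yaml
verify:
  - llmJudge:
      contains: "kube_node_status_condition"
      reason: "Verify the agent queried node health metrics rather than only mentioning them"
  - llmJudge:
      contains: "kube_pod_status_phase"
      reason: "Verify the agent checked pod phase metrics as part of the health assessment"
```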
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/diagnose-cluster-health.yaml` around lines
22 - 24, The llmJudge rule using contains: "kube_" is too loose—update the
llmJudge block to assert a concrete metric (e.g., contains:
"kube_node_status_condition" or "kube_pod_status_phase") and/or add a
complementary tool-usage assertion that checks the metrics query tool output
(for example assert that the metrics tool returned results or that a field like
metrics_query_response contains those metric names); target the llmJudge entry
and replace the generic contains: "kube_" with the chosen concrete metric string
and/or add an additional assertion verifying the tool invocation/response so the
judge cannot pass on narrative mentions alone.
```yaml
verify:
  - llmJudge:
      contains: "no data"
      reason: "Verify the agent reports no data, no results, or no pods found in the nonexistent namespace"
prompt:
  inline: |
    Show me the memory usage for all pods in the namespace called totally-fake-namespace-12345.
```
contains: "no data" is narrow and may miss valid correct responses.
The stated reason accepts "no data, no results, or no pods found", but a strict substring match on "no data" will fail any of those equivalent phrasings (and is also case-sensitive on literal n/N). Either pick the most universal phrasing used by the tool's actual empty response, or use a broader/semantic judge criterion. If the PromQL/Prometheus tool returns a well-known empty-result marker (e.g. "result": []), asserting on that verbatim marker would be more reliable.
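One hedged option, assuming multiple llmJudge entries are OR-combined as the nonexistent-metric comment suggests (if they are AND-combined, a single rubric-based judge would be the safer route):

```yaml
verify:
  # Each entry accepts one of the equivalent empty-result phrasings from the reason text.
  - llmJudge:
      contains: "no data"
      reason: "Accept an explicit no-data report"
  - llmJudge:
      contains: "no results"
      reason: "Accept equivalent no-results phrasing"
  - llmJudge:
      contains: "no pods found"
      reason: "Accept equivalent no-pods-found phrasing"
```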
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/nonexistent-namespace.yaml` around lines 17
- 23, The llmJudge check uses a narrow case-sensitive contains: "no data" which
will miss equivalent valid responses; update the check for llmJudge (the
contains criterion) to either match the canonical empty-result marker returned
by the tool (e.g. the JSON token `"result":[]`), or broaden it to a
case-insensitive/semantic matcher that accepts any of the expected phrases from
the reason ("no data", "no results", "no pods found")—for example replace the
single contains with a regex/array of acceptable strings or a case-insensitive
match so the prompt inline "Show me the memory usage..." will be accepted when
the namespace is empty.
```yaml
- llmJudge:
    contains: "pod"
    reason: "Verify the response includes specific pod names from the query results"
prompt:
  inline: |
    Show me the CPU usage trend for pods in the default namespace over the last 30 minutes.
```
contains: "pod" is trivially satisfied by the prompt wording.
The prompt itself mentions "pods in the default namespace", so any reasonable agent response will repeat the word pod regardless of whether it actually executed a range query or returned real results. This does not verify "specific pod names from the query results" as the reason claims. Consider a stricter assertion such as requiring a concrete pod-name pattern, a namespace label value returned by the query, or a series-formatted result string; alternatively, assert the use of execute_range_query with start/end/step via a tool-call assertion rather than relying on output substrings.
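A sketch of the tool-call variant, borrowing the assertions.toolsUsed pattern from the filtered-alerts suggestion; the exact parameters accepted by execute_range_query are an assumption here, not confirmed against the toolset:

```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "execute_range_query"
      # If the framework supports argument matching, start/end/step could be pinned
      # here as well; their exact names in the real tool schema are assumed.
```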
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/time-range-query.yaml` around lines 21 -
26, The llmJudge assertion "contains: 'pod'" is too weak because the prompt
already mentions "pods"; update the check to verify actual query execution or
real results by replacing or augmenting llmJudge to (a) assert a concrete
pod-name pattern or list (e.g., require regex like "pod-[a-z0-9]+"), or (b)
assert a tool-call to execute_range_query with start/end/step parameters (verify
presence of execute_range_query and its args), or (c) require a structured
series-formatted result string; locate and change the llmJudge block (symbol:
llmJudge and its contains/reason fields) and/or add a tool-call assertion
referencing execute_range_query to ensure the response reflects real query
output rather than echoing the prompt.
```yaml
description: |
  Tests if the agent uses the show_timeseries tool to visualize CPU usage
  as a chart. The agent should discover the metric, then use show_timeseries
  to render a time-series visualization.
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP -C2 '"(show_timeseries|list_metrics|get_alerts|list_node_metrics|get_series|get_silences|label_names|label_values)"' pkg/
ast-grep --pattern 'func init$_() []api.ServerTool { $$$ }'

Repository: openshift/openshift-mcp-server
Length of output: 56
🏁 Script executed:
#!/bin/bash
# Find the toolset.go file
echo "=== Finding toolset.go ==="
find . -name "toolset.go" -path "*/observability/*" 2>/dev/null
# Find the toolset_test.go file
echo "=== Finding toolset_test.go ==="
find . -name "toolset_test.go" -path "*/observability/*" 2>/dev/null
# Find the YAML file
echo "=== Finding visualize-cpu-usage.yaml ==="
find . -name "visualize-cpu-usage.yaml" 2>/dev/nullRepository: openshift/openshift-mcp-server
Length of output: 319
🏁 Script executed:
#!/bin/bash
# Read the YAML file to see the exact description
echo "=== Content of visualize-cpu-usage.yaml ==="
cat evals/tasks/observability/queries/visualize-cpu-usage.yaml
echo -e "\n=== Content of toolset.go ==="
cat pkg/toolsets/observability/toolset.go
echo -e "\n=== Content of toolset_test.go ==="
cat pkg/toolsets/observability/toolset_test.go

Repository: openshift/openshift-mcp-server
Length of output: 6418
🏁 Script executed:
#!/bin/bash
# Search for show_timeseries anywhere in the repo
echo "=== Searching for show_timeseries ==="
rg -r "show_timeseries" . 2>/dev/null | head -50
# Search for what tools are actually defined/registered
echo -e "\n=== Searching for tool registration patterns ==="
rg -n "prometheus_query|alertmanager_alerts|list_metrics" pkg/toolsets/observability/ 2>/dev/null | head -30Repository: openshift/openshift-mcp-server
Length of output: 27625
Description references a tool that does not exist in the observability toolset.
The description references show_timeseries, but the observability toolset only registers three tools: prometheus_query, prometheus_query_range, and alertmanager_alerts (confirmed in pkg/toolsets/observability/toolset_test.go line 51). Either add the show_timeseries tool to the toolset, or update the description to reference actual tools like prometheus_query_range for retrieving time-series data. As-is, the task can only pass by the judge matching the hardcoded metric name, not by actual visualization capability.
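A hedged rewrite of the description that names the tools the toolset actually registers (keeping the rest of the task unchanged):

```yaml
description: |
  Tests if the agent uses the prometheus_query_range tool to retrieve CPU usage
  over time. The agent should discover the metric, then run a range query to
  produce the time-series data a chart would be built from.
```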
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml` around lines 12 -
15, The task description incorrectly references a non-existent tool
show_timeseries; update the description in visualize-cpu-usage.yaml to reference
the actual observability tools (e.g., prometheus_query_range for time-series
retrieval) or alternatively register a new show_timeseries tool in the
observability toolset; specifically, change the text mentioning show_timeseries
to mention prometheus_query_range (or list available tools: prometheus_query,
prometheus_query_range, alertmanager_alerts) so the task matches the real
toolset and can retrieve time-series data programmatically.
@nader-ziada can you review again? It's synced now.

@tremes do we need to install anything into the openshift cluster for these evals to work? Or is everything included by default? If we need to install anything, can you add a make target?

We don't need to install anything extra for these evals.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Cali0707, tremes

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"

@Cali0707: Overrode contexts on behalf of Cali0707: Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@tremes can you set a JIRA reference for this PR?

/cherry-pick release-0.3
@slashpai: once the present PR merges, I will cherry-pick it on top of release-0.3.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retitle GIE-505: add observability toolset evals

@tremes: This pull request references GIE-505 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@tremes: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Merged commit 9077d51 into openshift:main.

@slashpai: new pull request created: #260

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Requires #226
I have been trying the following so far
```
make run-server
make run-evals EVAL_LABEL_SELECTOR="suite=observability" EVAL_TASK_FILTER="get-alerts"
```

Summary by CodeRabbit
Release Notes