
GIE-505: add observability toolset evals #241

Merged
openshift-merge-bot[bot] merged 3 commits into openshift:main from tremes:observability-evals
Apr 24, 2026

Conversation


@tremes tremes commented Apr 20, 2026

Requires #226

I have been trying the following so far

make run-server
make run-evals EVAL_LABEL_SELECTOR="suite=observability" EVAL_TASK_FILTER="get-alerts"

Summary by CodeRabbit

Release Notes

  • New Features
    • Added comprehensive observability evaluation suite across multiple agent platforms
    • Introduced 30+ new evaluation tasks testing agent capabilities for monitoring scenarios including alerts, metrics discovery, label exploration, and observability queries
    • Tasks validate proper interaction with Kubernetes monitoring systems and error handling scenarios


coderabbitai Bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

This pull request adds a comprehensive observability evaluation suite by introducing task set configurations across three agent implementations (Claude, Gemini, OpenAI) and defining 26 new observability test tasks spanning alert management, metric discovery, label exploration, and cluster health queries.

Changes

Cohort / File(s) Summary
Eval Configuration - Observability Task Sets
evals/claude-code/eval.yaml, evals/gemini-agent/eval.yaml, evals/openai-agent/eval.yaml
Added new observability task set configurations that target ../tasks/observability/*/*.yaml files, apply suite: observability label selector, and assert Kubernetes tool usage with tool call bounds of 1–20.
Observability Alert Tasks
evals/tasks/observability/alerts/alert-investigation.yaml, evals/tasks/observability/alerts/filtered-alerts.yaml, evals/tasks/observability/alerts/get-alerts.yaml, evals/tasks/observability/alerts/get-silences.yaml
Introduced four alert management tasks: alert investigation with Alertmanager metrics, filtered severity-based alerts, retrieval of firing alerts (Watchdog detection), and silence status querying.
Observability Label & Exploration Tasks
evals/tasks/observability/labels/get-series.yaml, evals/tasks/observability/labels/label-names.yaml, evals/tasks/observability/labels/label-values.yaml, evals/tasks/observability/labels/series-by-namespace.yaml
Added four label discovery tasks: series cardinality (kube_pod_info), label enumeration, namespace value discovery, and namespace-scoped series retrieval.
Observability Metric Discovery Tasks
evals/tasks/observability/metrics/list-metrics.yaml, evals/tasks/observability/metrics/list-node-metrics.yaml
Introduced metric discovery tasks: listing Prometheus metrics containing kube prefix and node-related metrics (e.g., node_cpu_seconds_total).
Observability Query Tasks - Core Metrics
evals/tasks/observability/queries/cpu-usage.yaml, evals/tasks/observability/queries/memory-usage.yaml, evals/tasks/observability/queries/network-traffic.yaml, evals/tasks/observability/queries/pending-pods.yaml, evals/tasks/observability/queries/namespace-pod-count.yaml
Added resource usage and pod state queries: CPU usage by pod, memory utilization, network traffic, pending pod detection, and namespace pod counting.
Observability Query Tasks - Health & Diagnostics
evals/tasks/observability/queries/diagnose-cluster-health.yaml, evals/tasks/observability/queries/backend-reachability.yaml, evals/tasks/observability/queries/crashlooping-pods.yaml, evals/tasks/observability/queries/pods-created.yaml
Introduced cluster health diagnostics: multi-metric cluster assessment, Prometheus backend reachability, crashlooping pod detection over time ranges, and pod creation tracking.
Observability Query Tasks - Error Handling & Advanced Metrics
evals/tasks/observability/queries/high-cardinality-rejection.yaml, evals/tasks/observability/queries/nonexistent-metric.yaml, evals/tasks/observability/queries/nonexistent-namespace.yaml, evals/tasks/observability/queries/namespace-resource-usage.yaml, evals/tasks/observability/queries/time-range-query.yaml, evals/tasks/observability/queries/visualize-cpu-usage.yaml
Added error handling and advanced query tasks: high-cardinality guardrail handling, missing metric detection, empty namespace handling, multi-metric namespace aggregation, time-range CPU trending, and metric visualization.
Observability Query Tasks - Prometheus Internals
evals/tasks/observability/queries/prometheus-head-series.yaml, evals/tasks/observability/queries/prometheus-requests.yaml, evals/tasks/observability/queries/prometheus-wal-size.yaml
Introduced Prometheus TSDB monitoring tasks: head series count, HTTP request rate calculation, and WAL storage size retrieval.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Whiskers twitching, code compiled clear,
Observability tasks now appear!
Alerts and metrics, queries galore,
Twenty-six new tests to explore!
Kubernetes hops through eval delight,
Cluster health monitored just right!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Title check | ✅ Passed | The PR title "GIE-505: add observability toolset evals" clearly and specifically describes the main change (adding observability toolset evaluations). It is concise, directly related to the changeset, and uses a meaningful tracking reference.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2026

openshift-ci Bot commented Apr 20, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@nader-ziada

The structure, organization, file layout, and eval config wiring are all solid. The weak spot is that the llmJudge.contains values use too many generic words that already appear in the prompt, so an agent could pass without actually calling tools or discovering anything.

@slashpai
Member

The structure, organization, file layout, and eval config wiring are all solid. The weak spot is that the llmJudge.contains values use too many generic words that already appear in the prompt, so an agent could pass without actually calling tools or discovering anything.

@nader-ziada @tremes I will look at this today and have PR in obs-mcp to update evals and we can update it here once merged.

@nader-ziada

@slashpai thanks for the updates, LGTM

Can you move the PR out of draft?

@slashpai
Member

Thank you. I have updated the main repo. Once merged, @tremes can update this branch.

@tremes tremes marked this pull request as ready for review April 23, 2026 13:59
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2026
@openshift-ci openshift-ci Bot requested review from Cali0707 and matzew April 23, 2026 14:00

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (5)
evals/tasks/observability/queries/memory-usage.yaml (1)

20-25: Top-5 requirement is not explicitly validated.

Current checks confirm metric intent but not that the answer actually returns five ranked results. Consider adding a stricter judge token/pattern tied to ranking output (e.g., ordinal/list structure plus five pod entries).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/memory-usage.yaml` around lines 20 - 25,
The current llmJudge only checks for the word "pod" and doesn't validate that
the answer returns a top-5 ranked list; update the judge block (llmJudge) to
require a pattern that enforces five distinct, ordered pod entries (for example
add a new token/pattern field or replace contains with a regex that matches five
ordinal or numbered lines like "1. <pod>", "2. <pod>" through "5.") so the
evaluation asserts both count and ranking in the response.
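As a sketch of the stricter pattern this comment suggests, a regex that only passes when the answer actually lists five numbered entries might look like this (the helper name and the exact matching rules are illustrative, not the eval framework's real judge syntax):

```python
import re

# Match lines that start with an ordinal 1-5 followed by "." or ")" and a token,
# e.g. "1. prometheus-k8s-0" - an assumed shape for a ranked top-5 answer.
RANKED_LINE = re.compile(r"^\s*[1-5][.)]\s+\S+", re.MULTILINE)

def has_top_five(answer: str) -> bool:
    """Return True when the answer contains at least five numbered entries."""
    return len(RANKED_LINE.findall(answer)) >= 5

answer = "\n".join(f"{i}. pod-{i} uses {100 - i} MiB" for i in range(1, 6))
print(has_top_five(answer))                      # → True
print(has_top_five("The top pod is etcd-0."))    # → False
```

A check like this asserts both the count and the ranked structure, which a bare `contains: "pod"` cannot.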
evals/tasks/observability/labels/label-names.yaml (1)

21-23: pod check is weak for this prompt context.

Since kube_pod_info appears in the prompt, pod can appear without real discovery. Consider replacing this token with a less prompt-derived label (for example node, uid, or created_by_kind) to reduce false-positive passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/labels/label-names.yaml` around lines 21 - 23, The
llmJudge label check currently looks for the token "pod" (llmJudge -> contains
-> "pod"), which is too easily satisfied by the prompt; update the label under
llmJudge to check for a less prompt-derived field such as "node", "uid", or
"created_by_kind" instead and update the reason text accordingly (e.g., change
contains to "node" and reason to "Verify the output includes the node label") so
the check reduces false-positive passes.
evals/tasks/observability/queries/network-traffic.yaml (1)

20-22: Consider replacing generic pod token with stronger evidence.

pod alone is easy to satisfy with templated language. A more discriminative check (for example presence of pod-like identifiers or ranked result wording) would reduce false positives.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/network-traffic.yaml` around lines 20 - 22,
The llmJudge check currently uses a weak token (contains: "pod") which yields
false positives; update the llmJudge validation to require stronger evidence
such as a pod-like identifier or structured phrasing (e.g., "pod/<name>" or
Kubernetes-style names), or require ranked/result wording like "top N pods" or
explicit pod names, replacing the simple contains: "pod" condition with a more
discriminative match so the judge only passes when real pod identifiers or
ranked results are present.
evals/tasks/observability/queries/nonexistent-metric.yaml (1)

17-20: The "not found" substring match is brittle—use a semantic judge or enumerate acceptable phrases.

The contains field performs literal substring matching, not semantic evaluation. Agents may naturally say "does not exist", "no such metric", "empty", or "unavailable", causing the check to fail. Either use an LLM judge with a rubric (e.g., llmJudge.llm(rubric="...")) to evaluate semantic meaning, or add multiple llmJudge entries with different contains values (OR-combined per framework semantics).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/nonexistent-metric.yaml` around lines 17 -
20, The current verify block uses a brittle literal match via the contains field
("not found"); update the verify to perform a semantic check by replacing the
single contains check with an llmJudge that uses an LLM rubric (e.g.,
llmJudge.llm(rubric="...")) describing acceptable semantic equivalents like
"does not exist", "no such metric", "unavailable", or "empty", or alternatively
add multiple llmJudge entries each checking different acceptable phrases so the
verification uses semantic judgment rather than a single substring match; look
for the verify -> - llmJudge / contains usage in the file to make this change.
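Under the OR-combined judge semantics this comment assumes, the verify block could enumerate the acceptable phrasings like so (a sketch only; field names mirror the YAML snippets quoted elsewhere in this review):

```yaml
verify:
  - llmJudge:
      contains: "not found"
      reason: "Agent reports the metric was not found"
  - llmJudge:
      contains: "does not exist"
      reason: "Agent reports the metric does not exist"
  - llmJudge:
      contains: "no such metric"
      reason: "Agent reports there is no such metric"
```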
evals/tasks/observability/queries/high-cardinality-rejection.yaml (1)

17-20: Verification is narrower than the stated reason.

The reason claims to verify both (a) the guardrail rejection and (b) that the agent "suggests a scoped alternative", but contains: "guardrail" only checks the first. Consider adding a second check (e.g., contains: "namespace" or contains: "scope"/"smaller"/"resolution") — or replace the contains check with a llmJudge prompt-based rubric that actually asserts both conditions, so a response that just complains about a guardrail without proposing an alternative doesn't pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/high-cardinality-rejection.yaml` around
lines 17 - 20, The verification only checks for "guardrail" but the reason
requires both recognizing the guardrail rejection and proposing a scoped
alternative; update the `verify.llmJudge` section in
high-cardinality-rejection.yaml to assert both conditions by either adding a
second `contains` check (e.g., `contains: "namespace"` or `contains: "scope"` /
`"smaller"`) alongside the existing `contains: "guardrail"`, or replace
`contains` with a prompt-based `llmJudge` rubric that explicitly requires the
response to mention the guardrail rejection and to suggest a concrete, scoped
alternative (use the `verify.llmJudge` key to host the new rubric).
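A minimal sketch of the two-condition variant, assuming the same OR/AND semantics as the other verify blocks quoted in this review (field names are illustrative):

```yaml
verify:
  - llmJudge:
      contains: "guardrail"
      reason: "Agent recognizes the guardrail rejection"
  - llmJudge:
      contains: "namespace"
      reason: "Agent proposes a namespace-scoped alternative query"
```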
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/tasks/observability/alerts/alert-investigation.yaml`:
- Around line 21-23: Replace the redundant llmJudge `contains: "alertmanager"`
check with a concrete Alertmanager metric assertion so the judge verifies metric
investigation; specifically update the llmJudge entry (the second one shown) to
assert for a specific metric name such as
`alertmanager_notifications_failed_total` or
`alertmanager_config_last_reload_successful` instead of the substring
"alertmanager" so the check ensures the agent queried Alertmanager metrics.

In `@evals/tasks/observability/alerts/filtered-alerts.yaml`:
- Around line 17-24: The verify step only checks for the presence of
AlertmanagerReceiversNotConfigured in the output and doesn’t assert that the
agent passed a severity=warning filter to the get_alerts tool; update the verify
block to add an assertions.toolsUsed entry that matches toolPattern "get_alerts"
and includes args with severity: "warning" so the eval framework verifies the
agent actually supplied the severity filter, and consider replacing or making
AlertmanagerReceiversNotConfigured configurable (or use a more reliable alert
like "Watchdog") to avoid false failures when that specific alert is absent.

In `@evals/tasks/observability/alerts/get-alerts.yaml`:
- Around line 17-19: The llmJudge assertion is brittle because it checks for a
literal "Watchdog"; change the test to assert on alert structure/content instead
of a single name: modify the llmJudge block to validate that the retrieved
alerts include at least one entry with an "alertname" field and a "status" equal
to "firing" (or equivalent key/value pair), and update the "reason" to reflect
this structural check; reference the llmJudge assertion and its "contains" usage
to replace the string match with a structural/regex or JSON-path style check
that looks for alertname + status=firing.
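The structural check this comment describes could be sketched as follows; the alert payload shape (a list of objects with `labels.alertname` and `status`) is an assumption about what get_alerts returns, not a confirmed schema:

```python
import json

def has_firing_alert(alerts_json: str) -> bool:
    """True when at least one alert carries an alertname and status 'firing'."""
    alerts = json.loads(alerts_json)
    return any(
        a.get("labels", {}).get("alertname") and a.get("status") == "firing"
        for a in alerts
    )

sample = json.dumps([
    {"labels": {"alertname": "Watchdog", "severity": "none"}, "status": "firing"}
])
print(has_firing_alert(sample))  # → True
print(has_firing_alert("[]"))    # → False
```

Asserting on structure rather than the literal name "Watchdog" keeps the task valid on clusters where a different alert happens to be firing.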

In `@evals/tasks/observability/alerts/get-silences.yaml`:
- Around line 17-19: The llmJudge check is too generic and subject to
prompt-leakage because it only asserts contains: "silences"; update the llmJudge
assertion to look for more specific, non-leaked indicators such as a phrase or
regex that proves the agent inspected silences (e.g. require "active silences"
or "No active silences exist" OR require presence of matcher fields like
"matchers:" or a matcher key pattern), by replacing the simple contains with a
stricter string or regex match (refer to llmJudge and the contains entry in the
YAML) so only genuine instrumented responses pass.

In `@evals/tasks/observability/labels/get-series.yaml`:
- Around line 17-26: The current llmJudge checks are too weak (they allow
restating "kube_pod_info" or generic "namespace"); update the verify section to
require evidence only a real get_series call can produce by (1) adding an
llmJudge that asserts a numeric cardinality (e.g., a regex or type check for an
integer count) for the kube_pod_info series, (2) requiring at least one concrete
label value observed (e.g., "kube-system" or "openshift-monitoring") and at
least one specific label key from the API (e.g., "pod", "uid", "host_ip"), and
(3) adding a tool-call assertion that the agent invoked get_series (or
list_metrics) so the judge only passes when the tool was actually used. Ensure
references to the metric name kube_pod_info and the tool name get_series are
present in the updated verify checks.

In `@evals/tasks/observability/labels/series-by-namespace.yaml`:
- Around line 19-28: The llmJudge checks ("contains: \"container\"" and
"contains: \"pod\"") are too generic and can be satisfied by the prompt itself;
update the assertions in this YAML to require evidence only obtainable from a
real get_series call (for example assert concrete label values like
namespace="openshift-monitoring", a real pod name such as prometheus-k8s-0, or a
container value like prometheus), or assert a numeric cardinality for
container_cpu_usage_seconds_total in openshift-monitoring; additionally add a
paired tool-call assertion that verifies the get_series tool was invoked for the
openshift-monitoring namespace to prevent false positives (referencing the
llmJudge entries and the prompt inline for where to change the checks).

In `@evals/tasks/observability/metrics/list-metrics.yaml`:
- Around line 12-14: Update the task description and agent prompt to remove
references to the non-existent list_metrics tool and the name_regex parameter
and instead instruct the agent to use the existing prometheus_query (or
prometheus_query_range) tool with a PromQL name-match filter; specifically
mention using a query like `{__name__=~"kube.*"}` to list metrics containing
"kube", and reference the observable tool names prometheus_query and
prometheus_query_range so the task aligns with the actual observability toolset.

In `@evals/tasks/observability/metrics/list-node-metrics.yaml`:
- Around line 17-19: Replace the weak llmJudge.contains "node_" check with a
stricter assertion: either add a second llmJudge entry that checks for a
specific, less-likely-to-be-hallucinated metric name (e.g.,
"node_cpu_seconds_total" or another long metric string) or tighten tool-usage
validation to assert the tool pattern equals "list_metrics" (in addition to the
existing minToolCalls: 1) so the task requires an actual call to list_metrics;
update the list-node-metrics.yaml llmJudge and taskSet entries accordingly to
reference llmJudge.contains and the tool pattern "list_metrics" or add a new
llmJudge with the discriminating substring.

In `@evals/tasks/observability/queries/backend-reachability.yaml`:
- Around line 18-23: The current llmJudge `contains: "prometheus-k8s"` can be
spoofed by echoing the prompt token; update the validation to require evidence
only obtainable from an actual Prometheus query — for example change the judge
rule to assert the numeric up value or an `instance=` label (e.g., expect `"1"`
or `"instance="` in the response) or require the text "up" / "reachable"
alongside a numeric result, and pair this with a tool-usage assertion for the
Prometheus query; modify the `llmJudge` block and the `prompt.inline` (the
`up{job="prometheus-k8s"}` query) so the judge checks the query result (numeric
or instance label) rather than just the literal job name.

In `@evals/tasks/observability/queries/cpu-usage.yaml`:
- Around line 20-22: The llmJudge assertion `contains: "pod"` is too weak and
causes false positives; update the llmJudge in cpu-usage.yaml to assert a
concrete pod identifier (for example require `pod_name`, `pod=`, or a Kubernetes
UID pattern) or remove the `contains: "pod"` check and rely on the existing
`container_cpu_usage_seconds_total` assertion instead so the judge only passes
when an actual pod identifier is present. Target the `llmJudge` entry and
replace the `contains: "pod"` condition with a stricter token such as `contains:
"pod_name"` or `contains: "pod="` (or delete that assertion) to ensure responses
list real pod identifiers.

In `@evals/tasks/observability/queries/diagnose-cluster-health.yaml`:
- Around line 22-24: The llmJudge rule using contains: "kube_" is too
loose—update the llmJudge block to assert a concrete metric (e.g., contains:
"kube_node_status_condition" or "kube_pod_status_phase") and/or add a
complementary tool-usage assertion that checks the metrics query tool output
(for example assert that the metrics tool returned results or that a field like
metrics_query_response contains those metric names); target the llmJudge entry
and replace the generic contains: "kube_" with the chosen concrete metric string
and/or add an additional assertion verifying the tool invocation/response so the
judge cannot pass on narrative mentions alone.

In `@evals/tasks/observability/queries/nonexistent-namespace.yaml`:
- Around line 17-23: The llmJudge check uses a narrow case-sensitive contains:
"no data" which will miss equivalent valid responses; update the check for
llmJudge (the contains criterion) to either match the canonical empty-result
marker returned by the tool (e.g. the JSON token `"result":[]`), or broaden it
to a case-insensitive/semantic matcher that accepts any of the expected phrases
from the reason ("no data", "no results", "no pods found")—for example replace
the single contains with a regex/array of acceptable strings or a
case-insensitive match so the prompt inline "Show me the memory usage..." will
be accepted when the namespace is empty.
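The broadened matcher suggested here might look like the following sketch; the phrase list and the `"result":[]` token are illustrative assumptions about acceptable empty-namespace responses:

```python
import re

# Accept any semantically equivalent "empty result" phrasing, case-insensitively,
# plus the raw empty-result token a Prometheus-style JSON response would contain.
ACCEPTABLE = re.compile(
    r'no data|no results|no pods found|"result":\s*\[\]',
    re.IGNORECASE,
)

def accepts_empty_namespace(answer: str) -> bool:
    return bool(ACCEPTABLE.search(answer))

print(accepts_empty_namespace("No data was returned for that namespace."))  # → True
```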

In `@evals/tasks/observability/queries/time-range-query.yaml`:
- Around line 21-26: The llmJudge assertion "contains: 'pod'" is too weak
because the prompt already mentions "pods"; update the check to verify actual
query execution or real results by replacing or augmenting llmJudge to (a)
assert a concrete pod-name pattern or list (e.g., require regex like
"pod-[a-z0-9]+"), or (b) assert a tool-call to execute_range_query with
start/end/step parameters (verify presence of execute_range_query and its args),
or (c) require a structured series-formatted result string; locate and change
the llmJudge block (symbol: llmJudge and its contains/reason fields) and/or add
a tool-call assertion referencing execute_range_query to ensure the response
reflects real query output rather than echoing the prompt.

In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml`:
- Around line 12-15: The task description incorrectly references a non-existent
tool show_timeseries; update the description in visualize-cpu-usage.yaml to
reference the actual observability tools (e.g., prometheus_query_range for
time-series retrieval) or alternatively register a new show_timeseries tool in
the observability toolset; specifically, change the text mentioning
show_timeseries to mention prometheus_query_range (or list available tools:
prometheus_query, prometheus_query_range, alertmanager_alerts) so the task
matches the real toolset and can retrieve time-series data programmatically.

---

Nitpick comments:
In `@evals/tasks/observability/labels/label-names.yaml`:
- Around line 21-23: The llmJudge label check currently looks for the token
"pod" (llmJudge -> contains -> "pod"), which is too easily satisfied by the
prompt; update the label under llmJudge to check for a less prompt-derived field
such as "node", "uid", or "created_by_kind" instead and update the reason text
accordingly (e.g., change contains to "node" and reason to "Verify the output
includes the node label") so the check reduces false-positive passes.

In `@evals/tasks/observability/queries/high-cardinality-rejection.yaml`:
- Around line 17-20: The verification only checks for "guardrail" but the reason
requires both recognizing the guardrail rejection and proposing a scoped
alternative; update the `verify.llmJudge` section in
high-cardinality-rejection.yaml to assert both conditions by either adding a
second `contains` check (e.g., `contains: "namespace"` or `contains: "scope"` /
`"smaller"`) alongside the existing `contains: "guardrail"`, or replace
`contains` with a prompt-based `llmJudge` rubric that explicitly requires the
response to mention the guardrail rejection and to suggest a concrete, scoped
alternative (use the `verify.llmJudge` key to host the new rubric).

In `@evals/tasks/observability/queries/memory-usage.yaml`:
- Around line 20-25: The current llmJudge only checks for the word "pod" and
doesn't validate that the answer returns a top-5 ranked list; update the judge
block (llmJudge) to require a pattern that enforces five distinct, ordered pod
entries (for example add a new token/pattern field or replace contains with a
regex that matches five ordinal or numbered lines like "1. <pod>", "2. <pod>"
through "5.") so the evaluation asserts both count and ranking in the response.

In `@evals/tasks/observability/queries/network-traffic.yaml`:
- Around line 20-22: The llmJudge check currently uses a weak token (contains:
"pod") which yields false positives; update the llmJudge validation to require
stronger evidence such as a pod-like identifier or structured phrasing (e.g.,
"pod/<name>" or Kubernetes-style names), or require ranked/result wording like
"top N pods" or explicit pod names, replacing the simple contains: "pod"
condition with a more discriminative match so the judge only passes when real
pod identifiers or ranked results are present.

In `@evals/tasks/observability/queries/nonexistent-metric.yaml`:
- Around line 17-20: The current verify block uses a brittle literal match via
the contains field ("not found"); update the verify to perform a semantic check
by replacing the single contains check with an llmJudge that uses an LLM rubric
(e.g., llmJudge.llm(rubric="...")) describing acceptable semantic equivalents
like "does not exist", "no such metric", "unavailable", or "empty", or
alternatively add multiple llmJudge entries each checking different acceptable
phrases so the verification uses semantic judgment rather than a single
substring match; look for the verify -> - llmJudge / contains usage in the file
to make this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bedf71e1-ce58-4a89-8d3c-6cb91605ff9b

📥 Commits

Reviewing files that changed from the base of the PR and between 118141f and f0113dd.

📒 Files selected for processing (31)
  • evals/claude-code/eval.yaml
  • evals/gemini-agent/eval.yaml
  • evals/openai-agent/eval.yaml
  • evals/tasks/observability/alerts/alert-investigation.yaml
  • evals/tasks/observability/alerts/filtered-alerts.yaml
  • evals/tasks/observability/alerts/get-alerts.yaml
  • evals/tasks/observability/alerts/get-silences.yaml
  • evals/tasks/observability/labels/get-series.yaml
  • evals/tasks/observability/labels/label-names.yaml
  • evals/tasks/observability/labels/label-values.yaml
  • evals/tasks/observability/labels/series-by-namespace.yaml
  • evals/tasks/observability/metrics/list-metrics.yaml
  • evals/tasks/observability/metrics/list-node-metrics.yaml
  • evals/tasks/observability/queries/backend-reachability.yaml
  • evals/tasks/observability/queries/cpu-usage.yaml
  • evals/tasks/observability/queries/crashlooping-pods.yaml
  • evals/tasks/observability/queries/diagnose-cluster-health.yaml
  • evals/tasks/observability/queries/high-cardinality-rejection.yaml
  • evals/tasks/observability/queries/memory-usage.yaml
  • evals/tasks/observability/queries/namespace-pod-count.yaml
  • evals/tasks/observability/queries/namespace-resource-usage.yaml
  • evals/tasks/observability/queries/network-traffic.yaml
  • evals/tasks/observability/queries/nonexistent-metric.yaml
  • evals/tasks/observability/queries/nonexistent-namespace.yaml
  • evals/tasks/observability/queries/pending-pods.yaml
  • evals/tasks/observability/queries/pods-created.yaml
  • evals/tasks/observability/queries/prometheus-head-series.yaml
  • evals/tasks/observability/queries/prometheus-requests.yaml
  • evals/tasks/observability/queries/prometheus-wal-size.yaml
  • evals/tasks/observability/queries/time-range-query.yaml
  • evals/tasks/observability/queries/visualize-cpu-usage.yaml

Comment on lines +21 to +23
- llmJudge:
    contains: "alertmanager"
    reason: "Verify the agent investigated Alertmanager-related metrics for the alert"

⚠️ Potential issue | 🟡 Minor

Second judge contains: "alertmanager" is redundant and weak.

Substring alertmanager is already contained in AlertmanagerReceiversNotConfigured (checked above) and appears naturally in any narrative about Alertmanager, so it doesn't independently verify that the agent investigated related metrics. Consider asserting on a concrete Alertmanager metric name the agent should have queried (e.g., alertmanager_notifications_failed_total, alertmanager_config_last_reload_successful) to actually validate the "investigate related metrics" step.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/alert-investigation.yaml` around lines 21 -
23, Replace the redundant llmJudge `contains: "alertmanager"` check with a
concrete Alertmanager metric assertion so the judge verifies metric
investigation; specifically update the llmJudge entry (the second one shown) to
assert for a specific metric name such as
`alertmanager_notifications_failed_total` or
`alertmanager_config_last_reload_successful` instead of the substring
"alertmanager" so the check ensures the agent queried Alertmanager metrics.

Comment on lines +17 to +24
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent filtered for warning-severity alerts and found AlertmanagerReceiversNotConfigured"
  prompt:
    inline: |
      Show me only the active alerts with severity=warning.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether the eval framework supports asserting tool-call arguments (not just tool patterns/counts).
rg -nP -C3 '(toolPattern|minToolCalls|maxToolCalls|toolArgs|arguments)' --type=yaml -g 'evals/**'
rg -nP -C3 '(ToolPattern|ToolArgs|Arguments|MinToolCalls)' -g '!vendor/**' -g '!**/testdata/**'

Repository: openshift/openshift-mcp-server

Length of output: 50389


🏁 Script executed:

cat -n evals/tasks/observability/alerts/filtered-alerts.yaml

Repository: openshift/openshift-mcp-server

Length of output: 1010


🏁 Script executed:

# Search for task.yaml files with tool argument assertions to understand the proper structure
find evals/tasks -name "*.yaml" -type f | head -20 | xargs -I {} sh -c 'echo "=== {} ===" && grep -A 10 "toolPattern\|args:" {} | head -15'

Repository: openshift/openshift-mcp-server

Length of output: 1473


🏁 Script executed:

# Look at the netedge example that showed args field
cat -n evals/tasks/netedge/query_prometheus/task.yaml

Repository: openshift/openshift-mcp-server

Length of output: 519


🏁 Script executed:

# Search for other observability tasks to understand the pattern
find evals/tasks/observability -name "*.yaml" -type f | xargs cat

Repository: openshift/openshift-mcp-server

Length of output: 20635


Use tool-call argument assertion to verify severity=warning filter is passed to get_alerts.

The current verify section only checks that AlertmanagerReceiversNotConfigured appears in the output; it does not verify the agent passed a severity=warning filter to the tool. An agent that fetched all alerts would still pass this check. The eval framework supports argument assertions (see evals/tasks/netedge/query_prometheus/task.yaml for the pattern), so add an args constraint under assertions.toolsUsed to verify the severity parameter was actually supplied:

assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_alerts"
      args:
        severity: "warning"

Also note: relying on AlertmanagerReceiversNotConfigured to always fire may be fragile. If a cluster configures Alertmanager receivers, this alert will be absent and the eval will fail unrelated to agent behavior. Consider using a more reliable alert (like Watchdog, which is standard across OpenShift clusters) or make the alert name configurable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/filtered-alerts.yaml` around lines 17 - 24,
The verify step only checks for the presence of
AlertmanagerReceiversNotConfigured in the output and doesn’t assert that the
agent passed a severity=warning filter to the get_alerts tool; update the verify
block to add an assertions.toolsUsed entry that matches toolPattern "get_alerts"
and includes args with severity: "warning" so the eval framework verifies the
agent actually supplied the severity filter, and consider replacing or making
AlertmanagerReceiversNotConfigured configurable (or use a more reliable alert
like "Watchdog") to avoid false failures when that specific alert is absent.

Comment on lines +17 to +19
    - llmJudge:
        contains: "Watchdog"
        reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"

⚠️ Potential issue | 🟠 Major

Watchdog-only assertion is brittle across environments.

Requiring a specific alert name can fail valid runs on clusters where Watchdog is not firing, and can still pass via hallucinated output. Prefer asserting on retrieved alert structure/content (for example alertname + status=firing) rather than one fixed alert name.

Suggested adjustment
   verify:
     - llmJudge:
-        contains: "Watchdog"
-        reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
+        contains: "alertname"
+        reason: "Verify the response includes concrete alert fields from retrieved firing alerts"
+    - llmJudge:
+        contains: "firing"
+        reason: "Verify the response reports firing-state alerts from Alertmanager output"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/get-alerts.yaml` around lines 17 - 19, The
llmJudge assertion is brittle because it checks for a literal "Watchdog"; change
the test to assert on alert structure/content instead of a single name: modify
the llmJudge block to validate that the retrieved alerts include at least one
entry with an "alertname" field and a "status" equal to "firing" (or equivalent
key/value pair), and update the "reason" to reflect this structural check;
reference the llmJudge assertion and its "contains" usage to replace the string
match with a structural/regex or JSON-path style check that looks for alertname
+ status=firing.

Comment on lines +17 to +19
    - llmJudge:
        contains: "silences"
        reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"

⚠️ Potential issue | 🟠 Major

Judge token is prompt-leaked and allows false positives.

contains: "silences" is too generic because the same term is in the prompt, so a non-instrumented answer can pass.

Suggested adjustment
   verify:
     - llmJudge:
-        contains: "silences"
-        reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
+        contains: "matchers"
+        reason: "Verify the response includes concrete silence details when active silences exist"
+    - llmJudge:
+        contains: "no active silences"
+        reason: "Allow explicit empty-state reporting when no silences are present"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/get-silences.yaml` around lines 17 - 19, The
llmJudge check is too generic and subject to prompt-leakage because it only
asserts contains: "silences"; update the llmJudge assertion to look for more
specific, non-leaked indicators such as a phrase or regex that proves the agent
inspected silences (e.g. require "active silences" or "No active silences exist"
OR require presence of matcher fields like "matchers:" or a matcher key
pattern), by replacing the simple contains with a stricter string or regex match
(refer to llmJudge and the contains entry in the YAML) so only genuine
instrumented responses pass.

Comment on lines +17 to +26
  verify:
    - llmJudge:
        contains: "namespace"
        reason: "Verify the agent retrieved actual series data containing label dimensions like namespace"
    - llmJudge:
        contains: "kube_pod_info"
        reason: "Verify the agent queried the kube_pod_info metric and reported its cardinality"
  prompt:
    inline: |
      How many time series exist for the kube_pod_info metric? Show the count and list the label names present.

⚠️ Potential issue | 🟠 Major

Both judge tokens are weak: one is in the prompt, the other is generic.

kube_pod_info is mentioned verbatim in the prompt, so the agent can pass that judge by merely restating the metric name without ever invoking list_metrics or get_series. namespace is a generic word that is almost certainly in any reasonable answer (and is implied by the prompt's "label names"), so it also doesn't prove real tool usage.

Suggest requiring evidence only a real get_series result can produce — e.g., a numeric cardinality count in the response, concrete label values observed in this cluster (kube-system, openshift-monitoring), or specific label keys returned by the API (pod, uid, host_ip). Complementing with a tool-call assertion on get_series would strengthen this further.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/labels/get-series.yaml` around lines 17 - 26, The
current llmJudge checks are too weak (they allow restating "kube_pod_info" or
generic "namespace"); update the verify section to require evidence only a real
get_series call can produce by (1) adding an llmJudge that asserts a numeric
cardinality (e.g., a regex or type check for an integer count) for the
kube_pod_info series, (2) requiring at least one concrete label value observed
(e.g., "kube-system" or "openshift-monitoring") and at least one specific label
key from the API (e.g., "pod", "uid", "host_ip"), and (3) adding a tool-call
assertion that the agent invoked get_series (or list_metrics) so the judge only
passes when the tool was actually used. Ensure references to the metric name
kube_pod_info and the tool name get_series are present in the updated verify
checks.
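Under those constraints, a sketch combining a structural judge check with a tool-call assertion (the toolsUsed shape mirrors evals/tasks/netedge/query_prometheus/task.yaml quoted earlier; treat exact field support as an assumption to verify against the framework):

```yaml
# Sketch only: "uid" is a label key kube_pod_info actually exports, so it is
# unlikely to appear in a response that never ran the query; the toolsUsed
# assertion follows the netedge task's pattern.
verify:
  - llmJudge:
      contains: "uid"
      reason: "Verify the response lists a label key only a real get_series result exposes"
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_series"
```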

Comment on lines +20 to +22
    - llmJudge:
        contains: "pod"
        reason: "Verify the response identifies specific pods"

⚠️ Potential issue | 🟠 Major

False-positive risk: contains: "pod" is trivially satisfied.

This is exactly the concern raised by reviewers on the PR: the prompt is "Which pods are using the most CPU?", so any response that merely restates or references the prompt (including a refusal or a wrong answer) will contain the substring pod and pass this assertion without the agent actually identifying real pods.

Consider replacing this with a stricter check — e.g., require a concrete pod-identifying field/label such as pod= or pod_name, or rely solely on the container_cpu_usage_seconds_total assertion and remove this one.

Suggested diff
     - llmJudge:
-        contains: "pod"
-        reason: "Verify the response identifies specific pods"
+        contains: "pod="
+        reason: "Verify the response identifies specific pods by label (e.g., pod=\"<name>\")"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/cpu-usage.yaml` around lines 20 - 22, The
llmJudge assertion `contains: "pod"` is too weak and causes false positives;
update the llmJudge in cpu-usage.yaml to assert a concrete pod identifier (for
example require `pod_name`, `pod=`, or a Kubernetes UID pattern) or remove the
`contains: "pod"` check and rely on the existing
`container_cpu_usage_seconds_total` assertion instead so the judge only passes
when an actual pod identifier is present. Target the `llmJudge` entry and
replace the `contains: "pod"` condition with a stricter token such as `contains:
"pod_name"` or `contains: "pod="` (or delete that assertion) to ensure responses
list real pod identifiers.

Comment on lines +22 to +24
    - llmJudge:
        contains: "kube_"
        reason: "Verify the agent queried Kubernetes metrics (e.g. kube_node_status_condition) for health assessment"

⚠️ Potential issue | 🟡 Minor

contains: "kube_" is a very loose marker.

kube_ will match any casual mention of a kube_* metric name in the agent's narrative (including hypothetical references like "I would check kube_node_status_condition") without proof that the agent actually invoked a metrics tool. Consider tightening to a concrete metric the agent is expected to return results from (e.g., kube_node_status_condition or kube_pod_status_phase), or complement with a tool-usage assertion against the metrics query tool so the judge doesn't pass on prose alone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/diagnose-cluster-health.yaml` around lines
22 - 24, The llmJudge rule using contains: "kube_" is too loose—update the
llmJudge block to assert a concrete metric (e.g., contains:
"kube_node_status_condition" or "kube_pod_status_phase") and/or add a
complementary tool-usage assertion that checks the metrics query tool output
(for example assert that the metrics tool returned results or that a field like
metrics_query_response contains those metric names); target the llmJudge entry
and replace the generic contains: "kube_" with the chosen concrete metric string
and/or add an additional assertion verifying the tool invocation/response so the
judge cannot pass on narrative mentions alone.
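A sketch of the tightened check: pin the judge to one concrete metric and require a metrics tool call as well. Field names follow the patterns already used in this suite; the tool name in toolPattern is a placeholder to be replaced with the suite's actual metrics query tool:

```yaml
# Sketch: concrete metric assertion plus a tool-usage assertion so prose
# mentions of kube_* names alone cannot pass. "execute_query" below is a
# hypothetical tool name — substitute the real metrics query tool.
verify:
  - llmJudge:
      contains: "kube_node_status_condition"
      reason: "Verify the agent reported results from a concrete node-health metric"
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "execute_query"
```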

Comment on lines +17 to +23
  verify:
    - llmJudge:
        contains: "no data"
        reason: "Verify the agent reports no data, no results, or no pods found in the nonexistent namespace"
  prompt:
    inline: |
      Show me the memory usage for all pods in the namespace called totally-fake-namespace-12345.

⚠️ Potential issue | 🟡 Minor

contains: "no data" is narrow and may miss valid correct responses.

The stated reason accepts "no data, no results, or no pods found", but a strict substring match on "no data" rejects every equivalent phrasing except that one exact string (and is case-sensitive, so even "No data" fails). Either pick the phrasing the tool's actual empty response uses, or use a broader/semantic judge criterion. If the PromQL/Prometheus tool returns a well-known empty-result marker (e.g. "result": []), asserting on that verbatim marker would be more reliable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/nonexistent-namespace.yaml` around lines 17
- 23, The llmJudge check uses a narrow case-sensitive contains: "no data" which
will miss equivalent valid responses; update the check for llmJudge (the
contains criterion) to either match the canonical empty-result marker returned
by the tool (e.g. the JSON token `"result":[]`), or broaden it to a
case-insensitive/semantic matcher that accepts any of the expected phrases from
the reason ("no data", "no results", "no pods found")—for example replace the
single contains with a regex/array of acceptable strings or a case-insensitive
match so the prompt inline "Show me the memory usage..." will be accepted when
the namespace is empty.
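A minimal sketch of the marker-based variant. It assumes the judge can see the raw Prometheus response passthrough, in which an instant query with no matches carries an empty result array — confirm this against the tool's actual output format:

```yaml
# Sketch: assert the verbatim Prometheus empty-result marker instead of one
# narrow English phrase; the marker shape is an assumption about what the
# tool surfaces to the judge.
verify:
  - llmJudge:
      contains: '"result":[]'
      reason: "Verify the query returned an empty result set for the nonexistent namespace"
```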

Comment on lines +21 to +26
    - llmJudge:
        contains: "pod"
        reason: "Verify the response includes specific pod names from the query results"
  prompt:
    inline: |
      Show me the CPU usage trend for pods in the default namespace over the last 30 minutes.

⚠️ Potential issue | 🟡 Minor

contains: "pod" is trivially satisfied by the prompt wording.

The prompt itself mentions "pods in the default namespace", so any reasonable agent response will repeat the word pod regardless of whether it actually executed a range query or returned real results. This does not verify "specific pod names from the query results" as the reason claims. Consider a stricter assertion such as requiring a concrete pod-name pattern, a namespace label value returned by the query, or a series-formatted result string; alternatively, assert the use of execute_range_query with start/end/step via a tool-call assertion rather than relying on output substrings.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/time-range-query.yaml` around lines 21 -
26, The llmJudge assertion "contains: 'pod'" is too weak because the prompt
already mentions "pods"; update the check to verify actual query execution or
real results by replacing or augmenting llmJudge to (a) assert a concrete
pod-name pattern or list (e.g., require regex like "pod-[a-z0-9]+"), or (b)
assert a tool-call to execute_range_query with start/end/step parameters (verify
presence of execute_range_query and its args), or (c) require a structured
series-formatted result string; locate and change the llmJudge block (symbol:
llmJudge and its contains/reason fields) and/or add a tool-call assertion
referencing execute_range_query to ensure the response reflects real query
output rather than echoing the prompt.
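Option (b) can be sketched directly with the args pattern from the netedge task; the execute_range_query name and its start/end/step parameters are taken from the finding itself, so verify them against the toolset before adopting:

```yaml
# Sketch: assert the range-query tool was actually invoked rather than
# matching output substrings the prompt already contains.
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "execute_range_query"
```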

Comment on lines +12 to +15
  description: |
    Tests if the agent uses the show_timeseries tool to visualize CPU usage
    as a chart. The agent should discover the metric, then use show_timeseries
    to render a time-series visualization.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -nP -C2 '"(show_timeseries|list_metrics|get_alerts|list_node_metrics|get_series|get_silences|label_names|label_values)"' pkg/
ast-grep --pattern 'func init$_() []api.ServerTool { $$$ }'

Repository: openshift/openshift-mcp-server

Length of output: 56


🏁 Script executed:

#!/bin/bash

# Find the toolset.go file
echo "=== Finding toolset.go ==="
find . -name "toolset.go" -path "*/observability/*" 2>/dev/null

# Find the toolset_test.go file
echo "=== Finding toolset_test.go ==="
find . -name "toolset_test.go" -path "*/observability/*" 2>/dev/null

# Find the YAML file
echo "=== Finding visualize-cpu-usage.yaml ==="
find . -name "visualize-cpu-usage.yaml" 2>/dev/null

Repository: openshift/openshift-mcp-server

Length of output: 319


🏁 Script executed:

#!/bin/bash

# Read the YAML file to see the exact description
echo "=== Content of visualize-cpu-usage.yaml ==="
cat evals/tasks/observability/queries/visualize-cpu-usage.yaml

echo -e "\n=== Content of toolset.go ==="
cat pkg/toolsets/observability/toolset.go

echo -e "\n=== Content of toolset_test.go ==="
cat pkg/toolsets/observability/toolset_test.go

Repository: openshift/openshift-mcp-server

Length of output: 6418


🏁 Script executed:

#!/bin/bash

# Search for show_timeseries anywhere in the repo
echo "=== Searching for show_timeseries ==="
rg -r "show_timeseries" . 2>/dev/null | head -50

# Search for what tools are actually defined/registered
echo -e "\n=== Searching for tool registration patterns ==="
rg -n "prometheus_query|alertmanager_alerts|list_metrics" pkg/toolsets/observability/ 2>/dev/null | head -30

Repository: openshift/openshift-mcp-server

Length of output: 27625


Description references a tool that does not exist in the observability toolset.

The description references show_timeseries, but the observability toolset only registers three tools: prometheus_query, prometheus_query_range, and alertmanager_alerts (confirmed in pkg/toolsets/observability/toolset_test.go line 51). Either add the show_timeseries tool to the toolset, or update the description to reference actual tools like prometheus_query_range for retrieving time-series data. As-is, the task can only pass by the judge matching the hardcoded metric name, not by actual visualization capability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml` around lines 12 -
15, The task description incorrectly references a non-existent tool
show_timeseries; update the description in visualize-cpu-usage.yaml to reference
the actual observability tools (e.g., prometheus_query_range for time-series
retrieval) or alternatively register a new show_timeseries tool in the
observability toolset; specifically, change the text mentioning show_timeseries
to mention prometheus_query_range (or list available tools: prometheus_query,
prometheus_query_range, alertmanager_alerts) so the task matches the real
toolset and can retrieve time-series data programmatically.
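If the description route is taken, a sketch of the corrected wording, naming a tool the observability toolset actually registers per the script output above:

```yaml
# Sketch of a corrected task description referencing a registered tool
# (prometheus_query_range) instead of the nonexistent show_timeseries.
description: |
  Tests if the agent uses the prometheus_query_range tool to retrieve CPU
  usage as time-series data over a window, suitable for charting.
```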

@slashpai
Member

@nader-ziada can you review again? It's synced now


@nader-ziada nader-ziada left a comment


LGTM

@Cali0707

@tremes do we need to install anything into the openshift cluster for these evals to work? Or is everything included by default?

If we need to install anything, can you add a make target?

@slashpai
Member

we don't need to install anything extra for these evals


@Cali0707 Cali0707 left a comment


/lgtm
/approve

@tremes can you backport this to release-0.3 ?

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2026
@openshift-ci

openshift-ci Bot commented Apr 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Cali0707, tremes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2026
@Cali0707

/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"

@openshift-ci

openshift-ci Bot commented Apr 23, 2026

@Cali0707: Overrode contexts on behalf of Cali0707: Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request

Details

In response to this:

/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Cali0707

@tremes can you set a JIRA reference for this PR?

@slashpai
Member

/cherry-pick release-0.3

@openshift-cherrypick-robot

@slashpai: once the present PR merges, I will cherry-pick it on top of release-0.3 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-0.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@slashpai
Member

slashpai commented Apr 24, 2026

/retitle GIE-505: add observability toolset evals

@openshift-ci openshift-ci Bot changed the title feat: add observability toolset evals GIE-505: add observability toolset evals Apr 24, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 24, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 24, 2026

@tremes: This pull request references GIE-505 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Requires #226

I have been trying the following so far

make run-server
make run-evals EVAL_LABEL_SELECTOR="suite=observability" EVAL_TASK_FILTER="get-alerts"

Summary by CodeRabbit

Release Notes

  • New Features
  • Added comprehensive observability evaluation suite across multiple agent platforms
  • Introduced 30+ new evaluation tasks testing agent capabilities for monitoring scenarios including alerts, metrics discovery, label exploration, and observability queries
  • Tasks validate proper interaction with Kubernetes monitoring systems and error handling scenarios

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Apr 24, 2026

@tremes: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 9077d51 into openshift:main Apr 24, 2026
10 of 11 checks passed
@openshift-cherrypick-robot

@slashpai: new pull request created: #260

Details

In response to this:

/cherry-pick release-0.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

