GIE-505: add observability toolset evals #241
openshift-merge-bot[bot] merged 3 commits into openshift:main
Conversation
📝 Walkthrough
This pull request adds a comprehensive observability evaluation suite by introducing task set configurations across three agent implementations (Claude, Gemini, OpenAI) and defining 26 new observability test tasks spanning alert management, metric discovery, label exploration, and cluster health queries.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Skipping CI for Draft Pull Request.
The structure, organization, file layout, and eval config wiring are all solid. The weak spot is that the llmJudge.contains values use too many generic words that already appear in the prompt, so an agent could pass without actually calling tools or discovering anything.
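To make that concern concrete, here is a minimal sketch (using the same llmJudge/contains verify schema these tasks already use; the metric name in the second check is only illustrative) contrasting a prompt-leaked token with a more discriminative one:

```yaml
verify:
  # Weak: "pod" already appears in the prompt, so any answer that echoes the
  # question passes without the agent ever calling a tool.
  - llmJudge:
      contains: "pod"
      reason: "Verify the response identifies specific pods"
  # Stronger: require a token that realistically only shows up in real query output.
  - llmJudge:
      contains: "container_cpu_usage_seconds_total"
      reason: "Verify the agent actually queried the CPU usage metric"
```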
@nader-ziada @tremes I will look at this today and have a PR in obs-mcp to update the evals; we can update it here once merged.

@slashpai thanks for the updates, LGTM. Can you move the PR out of draft?

Thank you. I have updated it in the main repo. Once merged, @tremes can update this branch.
Actionable comments posted: 14
🧹 Nitpick comments (5)
evals/tasks/observability/queries/memory-usage.yaml (1)
20-25: Top-5 requirement is not explicitly validated. Current checks confirm metric intent but not that the answer actually returns five ranked results. Consider adding a stricter judge token/pattern tied to ranking output (e.g., ordinal/list structure plus five pod entries).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/memory-usage.yaml` around lines 20 - 25: The current llmJudge only checks for the word "pod" and doesn't validate that the answer returns a top-5 ranked list; update the judge block (llmJudge) to require a pattern that enforces five distinct, ordered pod entries (for example add a new token/pattern field or replace contains with a regex that matches five ordinal or numbered lines like "1. <pod>", "2. <pod>" through "5.") so the evaluation asserts both count and ranking in the response.

evals/tasks/observability/labels/label-names.yaml (1)
21-23: `pod` check is weak for this prompt context. Since `kube_pod_info` appears in the prompt, `pod` can appear without real discovery. Consider replacing this token with a less prompt-derived label (for example `node`, `uid`, or `created_by_kind`) to reduce false-positive passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/labels/label-names.yaml` around lines 21 - 23: The llmJudge label check currently looks for the token "pod" (llmJudge -> contains -> "pod"), which is too easily satisfied by the prompt; update the label under llmJudge to check for a less prompt-derived field such as "node", "uid", or "created_by_kind" instead and update the reason text accordingly (e.g., change contains to "node" and reason to "Verify the output includes the node label") so the check reduces false-positive passes.

evals/tasks/observability/queries/network-traffic.yaml (1)
20-22: Consider replacing generic `pod` token with stronger evidence. `pod` alone is easy to satisfy with templated language. A more discriminative check (for example presence of pod-like identifiers or ranked result wording) would reduce false positives.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/network-traffic.yaml` around lines 20 - 22: The llmJudge check currently uses a weak token (contains: "pod") which yields false positives; update the llmJudge validation to require stronger evidence such as a pod-like identifier or structured phrasing (e.g., "pod/<name>" or Kubernetes-style names), or require ranked/result wording like "top N pods" or explicit pod names, replacing the simple contains: "pod" condition with a more discriminative match so the judge only passes when real pod identifiers or ranked results are present.

evals/tasks/observability/queries/nonexistent-metric.yaml (1)
17-20: The"not found"substring match is brittle—use a semantic judge or enumerate acceptable phrases.The
containsfield performs literal substring matching, not semantic evaluation. Agents may naturally say "does not exist", "no such metric", "empty", or "unavailable", causing the check to fail. Either use an LLM judge with a rubric (e.g.,llmJudge.llm(rubric="...")) to evaluate semantic meaning, or add multiplellmJudgeentries with differentcontainsvalues (OR-combined per framework semantics).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/nonexistent-metric.yaml` around lines 17 - 20: The current verify block uses a brittle literal match via the contains field ("not found"); update the verify to perform a semantic check by replacing the single contains check with an llmJudge that uses an LLM rubric (e.g., llmJudge.llm(rubric="...")) describing acceptable semantic equivalents like "does not exist", "no such metric", "unavailable", or "empty", or alternatively add multiple llmJudge entries each checking different acceptable phrases so the verification uses semantic judgment rather than a single substring match; look for the verify -> - llmJudge / contains usage in the file to make this change.

evals/tasks/observability/queries/high-cardinality-rejection.yaml (1)
17-20: Verification is narrower than the stated reason. The `reason` claims to verify both (a) the guardrail rejection and (b) that the agent "suggests a scoped alternative", but `contains: "guardrail"` only checks the first. Consider adding a second check (e.g., `contains: "namespace"` or `contains: "scope"`/`"smaller"`/`"resolution"`), or replace the `contains` check with an `llmJudge` prompt-based rubric that actually asserts both conditions, so a response that just complains about a guardrail without proposing an alternative doesn't pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/high-cardinality-rejection.yaml` around lines 17 - 20, The verification only checks for "guardrail" but the reason requires both recognizing the guardrail rejection and proposing a scoped alternative; update the `verify.llmJudge` section in high-cardinality-rejection.yaml to assert both conditions by either adding a second `contains` check (e.g., `contains: "namespace"` or `contains: "scope"` / `"smaller"`) alongside the existing `contains: "guardrail"`, or replace `contains` with a prompt-based `llmJudge` rubric that explicitly requires the response to mention the guardrail rejection and to suggest a concrete, scoped alternative (use the `verify.llmJudge` key to host the new rubric).
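One possible shape for that two-part check, sketched against the same contains-based schema (the second token is illustrative; any scoping hint such as "scope" or "smaller" would serve the same purpose):

```yaml
verify:
  - llmJudge:
      contains: "guardrail"
      reason: "Verify the agent recognized the guardrail rejection"
  - llmJudge:
      contains: "namespace"
      reason: "Verify the agent suggested a scoped alternative, e.g. restricting the query to a namespace"
```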
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/tasks/observability/alerts/alert-investigation.yaml`:
- Around line 21-23: Replace the redundant llmJudge `contains: "alertmanager"`
check with a concrete Alertmanager metric assertion so the judge verifies metric
investigation; specifically update the llmJudge entry (the second one shown) to
assert for a specific metric name such as
`alertmanager_notifications_failed_total` or
`alertmanager_config_last_reload_successful` instead of the substring
"alertmanager" so the check ensures the agent queried Alertmanager metrics.
In `@evals/tasks/observability/alerts/filtered-alerts.yaml`:
- Around line 17-24: The verify step only checks for the presence of
AlertmanagerReceiversNotConfigured in the output and doesn’t assert that the
agent passed a severity=warning filter to the get_alerts tool; update the verify
block to add an assertions.toolsUsed entry that matches toolPattern "get_alerts"
and includes args with severity: "warning" so the eval framework verifies the
agent actually supplied the severity filter, and consider replacing or making
AlertmanagerReceiversNotConfigured configurable (or use a more reliable alert
like "Watchdog") to avoid false failures when that specific alert is absent.
In `@evals/tasks/observability/alerts/get-alerts.yaml`:
- Around line 17-19: The llmJudge assertion is brittle because it checks for a
literal "Watchdog"; change the test to assert on alert structure/content instead
of a single name: modify the llmJudge block to validate that the retrieved
alerts include at least one entry with an "alertname" field and a "status" equal
to "firing" (or equivalent key/value pair), and update the "reason" to reflect
this structural check; reference the llmJudge assertion and its "contains" usage
to replace the string match with a structural/regex or JSON-path style check
that looks for alertname + status=firing.
In `@evals/tasks/observability/alerts/get-silences.yaml`:
- Around line 17-19: The llmJudge check is too generic and subject to
prompt-leakage because it only asserts contains: "silences"; update the llmJudge
assertion to look for more specific, non-leaked indicators such as a phrase or
regex that proves the agent inspected silences (e.g. require "active silences"
or "No active silences exist" OR require presence of matcher fields like
"matchers:" or a matcher key pattern), by replacing the simple contains with a
stricter string or regex match (refer to llmJudge and the contains entry in the
YAML) so only genuine instrumented responses pass.
In `@evals/tasks/observability/labels/get-series.yaml`:
- Around line 17-26: The current llmJudge checks are too weak (they allow
restating "kube_pod_info" or generic "namespace"); update the verify section to
require evidence only a real get_series call can produce by (1) adding an
llmJudge that asserts a numeric cardinality (e.g., a regex or type check for an
integer count) for the kube_pod_info series, (2) requiring at least one concrete
label value observed (e.g., "kube-system" or "openshift-monitoring") and at
least one specific label key from the API (e.g., "pod", "uid", "host_ip"), and
(3) adding a tool-call assertion that the agent invoked get_series (or
list_metrics) so the judge only passes when the tool was actually used. Ensure
references to the metric name kube_pod_info and the tool name get_series are
present in the updated verify checks.
In `@evals/tasks/observability/labels/series-by-namespace.yaml`:
- Around line 19-28: The llmJudge checks ("contains: \"container\"" and
"contains: \"pod\"") are too generic and can be satisfied by the prompt itself;
update the assertions in this YAML to require evidence only obtainable from a
real get_series call (for example assert concrete label values like
namespace="openshift-monitoring", a real pod name such as prometheus-k8s-0, or a
container value like prometheus), or assert a numeric cardinality for
container_cpu_usage_seconds_total in openshift-monitoring; additionally add a
paired tool-call assertion that verifies the get_series tool was invoked for the
openshift-monitoring namespace to prevent false positives (referencing the
llmJudge entries and the prompt inline for where to change the checks).
In `@evals/tasks/observability/metrics/list-metrics.yaml`:
- Around line 12-14: Update the task description and agent prompt to remove
references to the non-existent list_metrics tool and the name_regex parameter
and instead instruct the agent to use the existing prometheus_query (or
prometheus_query_range) tool with a PromQL name-match filter; specifically
mention using a query like `{__name__=~"kube.*"}` to list metrics containing
"kube", and reference the observable tool names prometheus_query and
prometheus_query_range so the task aligns with the actual observability toolset.
In `@evals/tasks/observability/metrics/list-node-metrics.yaml`:
- Around line 17-19: Replace the weak llmJudge.contains "node_" check with a
stricter assertion: either add a second llmJudge entry that checks for a
specific, less-likely-to-be-hallucinated metric name (e.g.,
"node_cpu_seconds_total" or another long metric string) or tighten tool-usage
validation to assert the tool pattern equals "list_metrics" (in addition to the
existing minToolCalls: 1) so the task requires an actual call to list_metrics;
update the list-node-metrics.yaml llmJudge and taskSet entries accordingly to
reference llmJudge.contains and the tool pattern "list_metrics" or add a new
llmJudge with the discriminating substring.
In `@evals/tasks/observability/queries/backend-reachability.yaml`:
- Around line 18-23: The current llmJudge `contains: "prometheus-k8s"` can be
spoofed by echoing the prompt token; update the validation to require evidence
only obtainable from an actual Prometheus query — for example change the judge
rule to assert the numeric up value or an `instance=` label (e.g., expect `"1"`
or `"instance="` in the response) or require the text "up" / "reachable"
alongside a numeric result, and pair this with a tool-usage assertion for the
Prometheus query; modify the `llmJudge` block and the `prompt.inline` (the
`up{job="prometheus-k8s"}` query) so the judge checks the query result (numeric
or instance label) rather than just the literal job name.
In `@evals/tasks/observability/queries/cpu-usage.yaml`:
- Around line 20-22: The llmJudge assertion `contains: "pod"` is too weak and
causes false positives; update the llmJudge in cpu-usage.yaml to assert a
concrete pod identifier (for example require `pod_name`, `pod=`, or a Kubernetes
UID pattern) or remove the `contains: "pod"` check and rely on the existing
`container_cpu_usage_seconds_total` assertion instead so the judge only passes
when an actual pod identifier is present. Target the `llmJudge` entry and
replace the `contains: "pod"` condition with a stricter token such as `contains:
"pod_name"` or `contains: "pod="` (or delete that assertion) to ensure responses
list real pod identifiers.
In `@evals/tasks/observability/queries/diagnose-cluster-health.yaml`:
- Around line 22-24: The llmJudge rule using contains: "kube_" is too
loose—update the llmJudge block to assert a concrete metric (e.g., contains:
"kube_node_status_condition" or "kube_pod_status_phase") and/or add a
complementary tool-usage assertion that checks the metrics query tool output
(for example assert that the metrics tool returned results or that a field like
metrics_query_response contains those metric names); target the llmJudge entry
and replace the generic contains: "kube_" with the chosen concrete metric string
and/or add an additional assertion verifying the tool invocation/response so the
judge cannot pass on narrative mentions alone.
In `@evals/tasks/observability/queries/nonexistent-namespace.yaml`:
- Around line 17-23: The llmJudge check uses a narrow case-sensitive contains:
"no data" which will miss equivalent valid responses; update the check for
llmJudge (the contains criterion) to either match the canonical empty-result
marker returned by the tool (e.g. the JSON token `"result":[]`), or broaden it
to a case-insensitive/semantic matcher that accepts any of the expected phrases
from the reason ("no data", "no results", "no pods found")—for example replace
the single contains with a regex/array of acceptable strings or a
case-insensitive match so the prompt inline "Show me the memory usage..." will
be accepted when the namespace is empty.
In `@evals/tasks/observability/queries/time-range-query.yaml`:
- Around line 21-26: The llmJudge assertion "contains: 'pod'" is too weak
because the prompt already mentions "pods"; update the check to verify actual
query execution or real results by replacing or augmenting llmJudge to (a)
assert a concrete pod-name pattern or list (e.g., require regex like
"pod-[a-z0-9]+"), or (b) assert a tool-call to execute_range_query with
start/end/step parameters (verify presence of execute_range_query and its args),
or (c) require a structured series-formatted result string; locate and change
the llmJudge block (symbol: llmJudge and its contains/reason fields) and/or add
a tool-call assertion referencing execute_range_query to ensure the response
reflects real query output rather than echoing the prompt.
In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml`:
- Around line 12-15: The task description incorrectly references a non-existent
tool show_timeseries; update the description in visualize-cpu-usage.yaml to
reference the actual observability tools (e.g., prometheus_query_range for
time-series retrieval) or alternatively register a new show_timeseries tool in
the observability toolset; specifically, change the text mentioning
show_timeseries to mention prometheus_query_range (or list available tools:
prometheus_query, prometheus_query_range, alertmanager_alerts) so the task
matches the real toolset and can retrieve time-series data programmatically.
---
Nitpick comments:
In `@evals/tasks/observability/labels/label-names.yaml`:
- Around line 21-23: The llmJudge label check currently looks for the token
"pod" (llmJudge -> contains -> "pod"), which is too easily satisfied by the
prompt; update the label under llmJudge to check for a less prompt-derived field
such as "node", "uid", or "created_by_kind" instead and update the reason text
accordingly (e.g., change contains to "node" and reason to "Verify the output
includes the node label") so the check reduces false-positive passes.
In `@evals/tasks/observability/queries/high-cardinality-rejection.yaml`:
- Around line 17-20: The verification only checks for "guardrail" but the reason
requires both recognizing the guardrail rejection and proposing a scoped
alternative; update the `verify.llmJudge` section in
high-cardinality-rejection.yaml to assert both conditions by either adding a
second `contains` check (e.g., `contains: "namespace"` or `contains: "scope"` /
`"smaller"`) alongside the existing `contains: "guardrail"`, or replace
`contains` with a prompt-based `llmJudge` rubric that explicitly requires the
response to mention the guardrail rejection and to suggest a concrete, scoped
alternative (use the `verify.llmJudge` key to host the new rubric).
In `@evals/tasks/observability/queries/memory-usage.yaml`:
- Around line 20-25: The current llmJudge only checks for the word "pod" and
doesn't validate that the answer returns a top-5 ranked list; update the judge
block (llmJudge) to require a pattern that enforces five distinct, ordered pod
entries (for example add a new token/pattern field or replace contains with a
regex that matches five ordinal or numbered lines like "1. <pod>", "2. <pod>"
through "5.") so the evaluation asserts both count and ranking in the response.
In `@evals/tasks/observability/queries/network-traffic.yaml`:
- Around line 20-22: The llmJudge check currently uses a weak token (contains:
"pod") which yields false positives; update the llmJudge validation to require
stronger evidence such as a pod-like identifier or structured phrasing (e.g.,
"pod/<name>" or Kubernetes-style names), or require ranked/result wording like
"top N pods" or explicit pod names, replacing the simple contains: "pod"
condition with a more discriminative match so the judge only passes when real
pod identifiers or ranked results are present.
In `@evals/tasks/observability/queries/nonexistent-metric.yaml`:
- Around line 17-20: The current verify block uses a brittle literal match via
the contains field ("not found"); update the verify to perform a semantic check
by replacing the single contains check with an llmJudge that uses an LLM rubric
(e.g., llmJudge.llm(rubric="...")) describing acceptable semantic equivalents
like "does not exist", "no such metric", "unavailable", or "empty", or
alternatively add multiple llmJudge entries each checking different acceptable
phrases so the verification uses semantic judgment rather than a single
substring match; look for the verify -> - llmJudge / contains usage in the file
to make this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: bedf71e1-ce58-4a89-8d3c-6cb91605ff9b
📒 Files selected for processing (31)
- evals/claude-code/eval.yaml
- evals/gemini-agent/eval.yaml
- evals/openai-agent/eval.yaml
- evals/tasks/observability/alerts/alert-investigation.yaml
- evals/tasks/observability/alerts/filtered-alerts.yaml
- evals/tasks/observability/alerts/get-alerts.yaml
- evals/tasks/observability/alerts/get-silences.yaml
- evals/tasks/observability/labels/get-series.yaml
- evals/tasks/observability/labels/label-names.yaml
- evals/tasks/observability/labels/label-values.yaml
- evals/tasks/observability/labels/series-by-namespace.yaml
- evals/tasks/observability/metrics/list-metrics.yaml
- evals/tasks/observability/metrics/list-node-metrics.yaml
- evals/tasks/observability/queries/backend-reachability.yaml
- evals/tasks/observability/queries/cpu-usage.yaml
- evals/tasks/observability/queries/crashlooping-pods.yaml
- evals/tasks/observability/queries/diagnose-cluster-health.yaml
- evals/tasks/observability/queries/high-cardinality-rejection.yaml
- evals/tasks/observability/queries/memory-usage.yaml
- evals/tasks/observability/queries/namespace-pod-count.yaml
- evals/tasks/observability/queries/namespace-resource-usage.yaml
- evals/tasks/observability/queries/network-traffic.yaml
- evals/tasks/observability/queries/nonexistent-metric.yaml
- evals/tasks/observability/queries/nonexistent-namespace.yaml
- evals/tasks/observability/queries/pending-pods.yaml
- evals/tasks/observability/queries/pods-created.yaml
- evals/tasks/observability/queries/prometheus-head-series.yaml
- evals/tasks/observability/queries/prometheus-requests.yaml
- evals/tasks/observability/queries/prometheus-wal-size.yaml
- evals/tasks/observability/queries/time-range-query.yaml
- evals/tasks/observability/queries/visualize-cpu-usage.yaml
```yaml
- llmJudge:
    contains: "alertmanager"
    reason: "Verify the agent investigated Alertmanager-related metrics for the alert"
```
Second judge contains: "alertmanager" is redundant and weak.
Substring alertmanager is already contained in AlertmanagerReceiversNotConfigured (checked above) and appears naturally in any narrative about Alertmanager, so it doesn't independently verify that the agent investigated related metrics. Consider asserting on a concrete Alertmanager metric name the agent should have queried (e.g., alertmanager_notifications_failed_total, alertmanager_config_last_reload_successful) to actually validate the "investigate related metrics" step.
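A minimal sketch of that adjustment, reusing the verify schema from these tasks (either of the metric names mentioned above would work):

```yaml
verify:
  - llmJudge:
      contains: "AlertmanagerReceiversNotConfigured"
      reason: "Verify the agent identified the firing alert"
  - llmJudge:
      contains: "alertmanager_notifications_failed_total"
      reason: "Verify the agent queried a concrete Alertmanager metric while investigating"
```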
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/alert-investigation.yaml` around lines 21 -
23, Replace the redundant llmJudge `contains: "alertmanager"` check with a
concrete Alertmanager metric assertion so the judge verifies metric
investigation; specifically update the llmJudge entry (the second one shown) to
assert for a specific metric name such as
`alertmanager_notifications_failed_total` or
`alertmanager_config_last_reload_successful` instead of the substring
"alertmanager" so the check ensures the agent queried Alertmanager metrics.
```yaml
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent filtered for warning-severity alerts and found AlertmanagerReceiversNotConfigured"
  prompt:
    inline: |
      Show me only the active alerts with severity=warning.
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check whether the eval framework supports asserting tool-call arguments (not just tool patterns/counts).
rg -nP -C3 '(toolPattern|minToolCalls|maxToolCalls|toolArgs|arguments)' --type=yaml -g 'evals/**'
rg -nP -C3 '(ToolPattern|ToolArgs|Arguments|MinToolCalls)' -g '!vendor/**' -g '!**/testdata/**'

Repository: openshift/openshift-mcp-server
Length of output: 50389
🏁 Script executed:
cat -n evals/tasks/observability/alerts/filtered-alerts.yaml

Repository: openshift/openshift-mcp-server
Length of output: 1010
🏁 Script executed:
# Search for task.yaml files with tool argument assertions to understand the proper structure
find evals/tasks -name "*.yaml" -type f | head -20 | xargs -I {} sh -c 'echo "=== {} ===" && grep -A 10 "toolPattern\|args:" {} | head -15'

Repository: openshift/openshift-mcp-server
Length of output: 1473
🏁 Script executed:
# Look at the netedge example that showed args field
cat -n evals/tasks/netedge/query_prometheus/task.yaml

Repository: openshift/openshift-mcp-server
Length of output: 519
🏁 Script executed:
# Search for other observability tasks to understand the pattern
find evals/tasks/observability -name "*.yaml" -type f | xargs cat

Repository: openshift/openshift-mcp-server
Length of output: 20635
Use tool-call argument assertion to verify severity=warning filter is passed to get_alerts.
The current verify section only checks that AlertmanagerReceiversNotConfigured appears in the output; it does not verify the agent passed a severity=warning filter to the tool. An agent that fetched all alerts would still pass this check. The eval framework supports argument assertions (see evals/tasks/netedge/query_prometheus/task.yaml for the pattern), so add an args constraint under assertions.toolsUsed to verify the severity parameter was actually supplied:
```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_alerts"
      args:
        severity: "warning"
```

Also note: relying on AlertmanagerReceiversNotConfigured to always fire may be fragile. If a cluster configures Alertmanager receivers, this alert will be absent and the eval will fail unrelated to agent behavior. Consider using a more reliable alert (like Watchdog, which is standard across OpenShift clusters) or make the alert name configurable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/filtered-alerts.yaml` around lines 17 - 24,
The verify step only checks for the presence of
AlertmanagerReceiversNotConfigured in the output and doesn’t assert that the
agent passed a severity=warning filter to the get_alerts tool; update the verify
block to add an assertions.toolsUsed entry that matches toolPattern "get_alerts"
and includes args with severity: "warning" so the eval framework verifies the
agent actually supplied the severity filter, and consider replacing or making
AlertmanagerReceiversNotConfigured configurable (or use a more reliable alert
like "Watchdog") to avoid false failures when that specific alert is absent.
```yaml
- llmJudge:
    contains: "Watchdog"
    reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
```
Watchdog-only assertion is brittle across environments.
Requiring a specific alert name can fail valid runs on clusters where Watchdog is not firing, and can still pass via hallucinated output. Prefer asserting on retrieved alert structure/content (for example alertname + status=firing) rather than one fixed alert name.
Suggested adjustment
```diff
 verify:
   - llmJudge:
-      contains: "Watchdog"
-      reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
+      contains: "alertname"
+      reason: "Verify the response includes concrete alert fields from retrieved firing alerts"
+  - llmJudge:
+      contains: "firing"
+      reason: "Verify the response reports firing-state alerts from Alertmanager output"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/get-alerts.yaml` around lines 17 - 19, The
llmJudge assertion is brittle because it checks for a literal "Watchdog"; change
the test to assert on alert structure/content instead of a single name: modify
the llmJudge block to validate that the retrieved alerts include at least one
entry with an "alertname" field and a "status" equal to "firing" (or equivalent
key/value pair), and update the "reason" to reflect this structural check;
reference the llmJudge assertion and its "contains" usage to replace the string
match with a structural/regex or JSON-path style check that looks for alertname
+ status=firing.
```yaml
- llmJudge:
    contains: "silences"
    reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
```
Judge token is prompt-leaked and allows false positives.
contains: "silences" is too generic because the same term is in the prompt, so a non-instrumented answer can pass.
Suggested adjustment
```diff
 verify:
   - llmJudge:
-      contains: "silences"
-      reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
+      contains: "matchers"
+      reason: "Verify the response includes concrete silence details when active silences exist"
+  - llmJudge:
+      contains: "no active silences"
+      reason: "Allow explicit empty-state reporting when no silences are present"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The committable suggestion replaces:

```yaml
- llmJudge:
    contains: "silences"
    reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
```

with:

```yaml
- llmJudge:
    contains: "matchers"
    reason: "Verify the response includes concrete silence details when active silences exist"
- llmJudge:
    contains: "no active silences"
    reason: "Allow explicit empty-state reporting when no silences are present"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/alerts/get-silences.yaml` around lines 17 - 19, The
llmJudge check is too generic and subject to prompt-leakage because it only
asserts contains: "silences"; update the llmJudge assertion to look for more
specific, non-leaked indicators such as a phrase or regex that proves the agent
inspected silences (e.g. require "active silences" or "No active silences exist"
OR require presence of matcher fields like "matchers:" or a matcher key
pattern), by replacing the simple contains with a stricter string or regex match
(refer to llmJudge and the contains entry in the YAML) so only genuine
instrumented responses pass.
```yaml
verify:
  - llmJudge:
      contains: "namespace"
      reason: "Verify the agent retrieved actual series data containing label dimensions like namespace"
  - llmJudge:
      contains: "kube_pod_info"
      reason: "Verify the agent queried the kube_pod_info metric and reported its cardinality"
prompt:
  inline: |
    How many time series exist for the kube_pod_info metric? Show the count and list the label names present.
```
Both judge tokens are weak: one is in the prompt, the other is generic.
kube_pod_info is mentioned verbatim in the prompt, so the agent can pass that judge by merely restating the metric name without ever invoking list_metrics or get_series. namespace is a generic word that is almost certainly in any reasonable answer (and is implied by the prompt's "label names"), so it also doesn't prove real tool usage.
Suggest requiring evidence only a real get_series result can produce — e.g., a numeric cardinality count in the response, concrete label values observed in this cluster (kube-system, openshift-monitoring), or specific label keys returned by the API (pod, uid, host_ip). Complementing with a tool-call assertion on get_series would strengthen this further.
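A sketch of that combination, assuming the assertions.toolsUsed support demonstrated in the filtered-alerts thread also applies here (the label value and tool name come from this comment and are not verified against the file):

```yaml
verify:
  - llmJudge:
      contains: "openshift-monitoring"
      reason: "Verify the response includes a concrete label value observed in the cluster"
  - llmJudge:
      contains: "uid"
      reason: "Verify the response lists real label keys returned for kube_pod_info"
# Placement of assertions alongside verify follows the filtered-alerts suggestion.
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_series"
```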
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/labels/get-series.yaml` around lines 17 - 26, The
current llmJudge checks are too weak (they allow restating "kube_pod_info" or
generic "namespace"); update the verify section to require evidence only a real
get_series call can produce by (1) adding an llmJudge that asserts a numeric
cardinality (e.g., a regex or type check for an integer count) for the
kube_pod_info series, (2) requiring at least one concrete label value observed
(e.g., "kube-system" or "openshift-monitoring") and at least one specific label
key from the API (e.g., "pod", "uid", "host_ip"), and (3) adding a tool-call
assertion that the agent invoked get_series (or list_metrics) so the judge only
passes when the tool was actually used. Ensure references to the metric name
kube_pod_info and the tool name get_series are present in the updated verify
checks.
```yaml
- llmJudge:
    contains: "pod"
    reason: "Verify the response identifies specific pods"
```
False-positive risk: contains: "pod" is trivially satisfied.
This is exactly the concern raised by reviewers on the PR: the prompt is "Which pods are using the most CPU?", so any response that merely restates or references the prompt (including a refusal or a wrong answer) will contain the substring pod and pass this assertion without the agent actually identifying real pods.
Consider replacing this with a stricter check — e.g., require a concrete pod-identifying field/label such as pod= or pod_name, or rely solely on the container_cpu_usage_seconds_total assertion and remove this one.
Suggested diff
```diff
 - llmJudge:
-    contains: "pod"
-    reason: "Verify the response identifies specific pods"
+    contains: "pod="
+    reason: "Verify the response identifies specific pods by label (e.g., pod=\"<name>\")"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The committable suggestion replaces:

```yaml
- llmJudge:
    contains: "pod"
    reason: "Verify the response identifies specific pods"
```

with:

```yaml
- llmJudge:
    contains: "pod="
    reason: "Verify the response identifies specific pods by label (e.g., pod=\"<name>\")"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/cpu-usage.yaml` around lines 20 - 22, The
llmJudge assertion `contains: "pod"` is too weak and causes false positives;
update the llmJudge in cpu-usage.yaml to assert a concrete pod identifier (for
example require `pod_name`, `pod=`, or a Kubernetes UID pattern) or remove the
`contains: "pod"` check and rely on the existing
`container_cpu_usage_seconds_total` assertion instead so the judge only passes
when an actual pod identifier is present. Target the `llmJudge` entry and
replace the `contains: "pod"` condition with a stricter token such as `contains:
"pod_name"` or `contains: "pod="` (or delete that assertion) to ensure responses
list real pod identifiers.
```yaml
- llmJudge:
    contains: "kube_"
    reason: "Verify the agent queried Kubernetes metrics (e.g. kube_node_status_condition) for health assessment"
```
contains: "kube_" is a very loose marker.
kube_ will match any casual mention of a kube_* metric name in the agent's narrative (including hypothetical references like "I would check kube_node_status_condition") without proof that the agent actually invoked a metrics tool. Consider tightening to a concrete metric the agent is expected to return results from (e.g., kube_node_status_condition or kube_pod_status_phase), or complement with a tool-usage assertion against the metrics query tool so the judge doesn't pass on prose alone.
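For example, under the same verify schema (metric names taken from this comment; a sketch of one possible tightening, not the only valid one):

```yaml
verify:
  - llmJudge:
      contains: "kube_node_status_condition"
      reason: "Verify the agent queried node health metrics rather than only mentioning them"
  - llmJudge:
      contains: "kube_pod_status_phase"
      reason: "Verify the agent checked pod phase metrics as part of the health assessment"
```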
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/diagnose-cluster-health.yaml` around lines
22 - 24, The llmJudge rule using contains: "kube_" is too loose—update the
llmJudge block to assert a concrete metric (e.g., contains:
"kube_node_status_condition" or "kube_pod_status_phase") and/or add a
complementary tool-usage assertion that checks the metrics query tool output
(for example assert that the metrics tool returned results or that a field like
metrics_query_response contains those metric names); target the llmJudge entry
and replace the generic contains: "kube_" with the chosen concrete metric string
and/or add an additional assertion verifying the tool invocation/response so the
judge cannot pass on narrative mentions alone.
```yaml
verify:
  - llmJudge:
      contains: "no data"
      reason: "Verify the agent reports no data, no results, or no pods found in the nonexistent namespace"
prompt:
  inline: |
    Show me the memory usage for all pods in the namespace called totally-fake-namespace-12345.
```
contains: "no data" is narrow and may miss valid correct responses.
The stated reason accepts "no data, no results, or no pods found", but a strict substring match on "no data" will fail any of those equivalent phrasings (and is also case-sensitive on literal n/N). Either pick the most universal phrasing used by the tool's actual empty response, or use a broader/semantic judge criterion. If the PromQL/Prometheus tool returns a well-known empty-result marker (e.g. "result": []), asserting on that verbatim marker would be more reliable.
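One hedged option, assuming multiple llmJudge entries are OR-combined as the nonexistent-metric comment suggests (if they are AND-combined, a single rubric-based judge would be the safer route):

```yaml
verify:
  # Each entry accepts one of the equivalent empty-result phrasings from the reason text.
  - llmJudge:
      contains: "no data"
      reason: "Accept an explicit no-data report"
  - llmJudge:
      contains: "no results"
      reason: "Accept equivalent no-results phrasing"
  - llmJudge:
      contains: "no pods found"
      reason: "Accept equivalent no-pods-found phrasing"
```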
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/nonexistent-namespace.yaml` around lines 17
- 23, The llmJudge check uses a narrow case-sensitive contains: "no data" which
will miss equivalent valid responses; update the check for llmJudge (the
contains criterion) to either match the canonical empty-result marker returned
by the tool (e.g. the JSON token `"result":[]`), or broaden it to a
case-insensitive/semantic matcher that accepts any of the expected phrases from
the reason ("no data", "no results", "no pods found")—for example replace the
single contains with a regex/array of acceptable strings or a case-insensitive
match so the prompt inline "Show me the memory usage..." will be accepted when
the namespace is empty.
```yaml
- llmJudge:
    contains: "pod"
    reason: "Verify the response includes specific pod names from the query results"
prompt:
  inline: |
    Show me the CPU usage trend for pods in the default namespace over the last 30 minutes.
```
contains: "pod" is trivially satisfied by the prompt wording.
The prompt itself mentions "pods in the default namespace", so any reasonable agent response will repeat the word pod regardless of whether it actually executed a range query or returned real results. This does not verify "specific pod names from the query results" as the reason claims. Consider a stricter assertion such as requiring a concrete pod-name pattern, a namespace label value returned by the query, or a series-formatted result string; alternatively, assert the use of execute_range_query with start/end/step via a tool-call assertion rather than relying on output substrings.
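A sketch of the tool-call variant, borrowing the assertions.toolsUsed pattern from the filtered-alerts suggestion; the exact parameters accepted by execute_range_query are an assumption here, not confirmed against the toolset:

```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "execute_range_query"
      # If the framework supports argument matching, start/end/step could be pinned
      # here as well; their exact names in the real tool schema are assumed.
```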
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/time-range-query.yaml` around lines 21 -
26, The llmJudge assertion "contains: 'pod'" is too weak because the prompt
already mentions "pods"; update the check to verify actual query execution or
real results by replacing or augmenting llmJudge to (a) assert a concrete
pod-name pattern or list (e.g., require regex like "pod-[a-z0-9]+"), or (b)
assert a tool-call to execute_range_query with start/end/step parameters (verify
presence of execute_range_query and its args), or (c) require a structured
series-formatted result string; locate and change the llmJudge block (symbol:
llmJudge and its contains/reason fields) and/or add a tool-call assertion
referencing execute_range_query to ensure the response reflects real query
output rather than echoing the prompt.
```yaml
description: |
  Tests if the agent uses the show_timeseries tool to visualize CPU usage
  as a chart. The agent should discover the metric, then use show_timeseries
  to render a time-series visualization.
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP -C2 '"(show_timeseries|list_metrics|get_alerts|list_node_metrics|get_series|get_silences|label_names|label_values)"' pkg/
ast-grep --pattern 'func init$_() []api.ServerTool { $$$ }'

Repository: openshift/openshift-mcp-server
Length of output: 56
🏁 Script executed:
#!/bin/bash
# Find the toolset.go file
echo "=== Finding toolset.go ==="
find . -name "toolset.go" -path "*/observability/*" 2>/dev/null
# Find the toolset_test.go file
echo "=== Finding toolset_test.go ==="
find . -name "toolset_test.go" -path "*/observability/*" 2>/dev/null
# Find the YAML file
echo "=== Finding visualize-cpu-usage.yaml ==="
find . -name "visualize-cpu-usage.yaml" 2>/dev/nullRepository: openshift/openshift-mcp-server
Length of output: 319
🏁 Script executed:
#!/bin/bash
# Read the YAML file to see the exact description
echo "=== Content of visualize-cpu-usage.yaml ==="
cat evals/tasks/observability/queries/visualize-cpu-usage.yaml
echo -e "\n=== Content of toolset.go ==="
cat pkg/toolsets/observability/toolset.go
echo -e "\n=== Content of toolset_test.go ==="
cat pkg/toolsets/observability/toolset_test.go

Repository: openshift/openshift-mcp-server
Length of output: 6418
🏁 Script executed:
#!/bin/bash
# Search for show_timeseries anywhere in the repo
echo "=== Searching for show_timeseries ==="
rg -r "show_timeseries" . 2>/dev/null | head -50
# Search for what tools are actually defined/registered
echo -e "\n=== Searching for tool registration patterns ==="
rg -n "prometheus_query|alertmanager_alerts|list_metrics" pkg/toolsets/observability/ 2>/dev/null | head -30Repository: openshift/openshift-mcp-server
Length of output: 27625
Description references a tool that does not exist in the observability toolset.
The description references show_timeseries, but the observability toolset only registers three tools: prometheus_query, prometheus_query_range, and alertmanager_alerts (confirmed in pkg/toolsets/observability/toolset_test.go line 51). Either add the show_timeseries tool to the toolset, or update the description to reference actual tools like prometheus_query_range for retrieving time-series data. As-is, the task can only pass by the judge matching the hardcoded metric name, not by actual visualization capability.
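A hedged rewrite of the description that names the tools the toolset actually registers (keeping the rest of the task unchanged):

```yaml
description: |
  Tests if the agent uses the prometheus_query_range tool to retrieve CPU usage
  over time. The agent should discover the metric, then run a range query to
  produce the time-series data a chart would be built from.
```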
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml` around lines 12 -
15, The task description incorrectly references a non-existent tool
show_timeseries; update the description in visualize-cpu-usage.yaml to reference
the actual observability tools (e.g., prometheus_query_range for time-series
retrieval) or alternatively register a new show_timeseries tool in the
observability toolset; specifically, change the text mentioning show_timeseries
to mention prometheus_query_range (or list available tools: prometheus_query,
prometheus_query_range, alertmanager_alerts) so the task matches the real
toolset and can retrieve time-series data programmatically.
@nader-ziada can you review again? It's synced now.

@tremes do we need to install anything into the openshift cluster for these evals to work? Or is everything included by default? If we need to install anything, can you add a make target?

We don't need to install anything extra for these evals.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Cali0707, tremes

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"

@Cali0707: Overrode contexts on behalf of Cali0707: Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@tremes can you set a JIRA reference for this PR?

/cherry-pick release-0.3
@slashpai: once the present PR merges, I will cherry-pick it on top of release-0.3.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retitle GIE-505: add observability toolset evals

@tremes: This pull request references GIE-505 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@tremes: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Merged commit 9077d51 into openshift:main.

@slashpai: new pull request created: #260

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Requires #226
I have been trying the following so far
```
make run-server
make run-evals EVAL_LABEL_SELECTOR="suite=observability" EVAL_TASK_FILTER="get-alerts"
```

Summary by CodeRabbit
Release Notes