10 changes: 10 additions & 0 deletions evals/claude-code/eval.yaml
@@ -72,3 +72,13 @@ config:
          toolPattern: ".*"
          minToolCalls: 1
          maxToolCalls: 20
  # Observability tasks
  - glob: ../tasks/observability/*/*.yaml
    labelSelector:
      suite: observability
    assertions:
      toolsUsed:
        - server: kubernetes
          toolPattern: ".*"
          minToolCalls: 1
          maxToolCalls: 20
10 changes: 10 additions & 0 deletions evals/gemini-agent/eval.yaml
@@ -17,3 +17,13 @@ config:
      toolsUsed:
        - server: kubernetes
          toolPattern: ".*"
  # Observability tasks
  - glob: ../tasks/observability/*/*.yaml
    labelSelector:
      suite: observability
    assertions:
      toolsUsed:
        - server: kubernetes
          toolPattern: ".*"
          minToolCalls: 1
          maxToolCalls: 20
10 changes: 10 additions & 0 deletions evals/openai-agent/eval.yaml
@@ -72,3 +72,13 @@ config:
          toolPattern: ".*"
          minToolCalls: 1
          maxToolCalls: 20
  # Observability tasks
  - glob: ../tasks/observability/*/*.yaml
    labelSelector:
      suite: observability
    assertions:
      toolsUsed:
        - server: kubernetes
          toolPattern: ".*"
          minToolCalls: 1
          maxToolCalls: 20
28 changes: 28 additions & 0 deletions evals/tasks/observability/alerts/alert-investigation.yaml
@@ -0,0 +1,28 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "alert-investigation"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: multi-step
  description: |
    Tests if the agent can perform multi-step alert triage: first retrieving
    alerts from Alertmanager, then investigating related metrics for a
    firing alert using queries.
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent retrieved firing alerts and identified AlertmanagerReceiversNotConfigured"
    - llmJudge:
        contains: "alertmanager"
        reason: "Verify the agent investigated Alertmanager-related metrics for the alert"
Comment on lines +21 to +23
⚠️ Potential issue | 🟡 Minor

The second judge's contains: "alertmanager" check is redundant and weak.

Substring alertmanager is already contained in AlertmanagerReceiversNotConfigured (checked above) and appears naturally in any narrative about Alertmanager, so it doesn't independently verify that the agent investigated related metrics. Consider asserting on a concrete Alertmanager metric name the agent should have queried (e.g., alertmanager_notifications_failed_total, alertmanager_config_last_reload_successful) to actually validate the "investigate related metrics" step.
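For illustration, the second llmJudge entry could become something like the following, where alertmanager_notifications_failed_total is an assumed example of a metric the agent would query (any concrete Alertmanager metric works):

  - llmJudge:
      contains: "alertmanager_notifications_failed_total"
      reason: "Verify the agent queried a concrete Alertmanager metric while investigating the alert"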

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/alert-investigation.yaml` around lines 21-23, replace the redundant llmJudge `contains: "alertmanager"` check with a
concrete Alertmanager metric assertion so the judge verifies metric
investigation; specifically update the llmJudge entry (the second one shown) to
assert for a specific metric name such as
`alertmanager_notifications_failed_total` or
`alertmanager_config_last_reload_successful` instead of the substring
"alertmanager" so the check ensures the agent queried Alertmanager metrics.

  prompt:
    inline: |
      Check if there are any firing alerts. If there are, investigate
      the related metrics for the most critical alert and summarize
      what's happening.
24 changes: 24 additions & 0 deletions evals/tasks/observability/alerts/filtered-alerts.yaml
@@ -0,0 +1,24 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "filtered-alerts"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: alertmanager
  description: |
    Tests if the agent uses the get_alerts tool with filter parameters
    to retrieve only active alerts matching a specific severity. The agent
    should pass appropriate filter arguments rather than fetching all alerts
    and filtering client-side.
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent filtered for warning-severity alerts and found AlertmanagerReceiversNotConfigured"
  prompt:
    inline: |
      Show me only the active alerts with severity=warning.
Comment on lines +17 to +24
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether the eval framework supports asserting tool-call arguments (not just tool patterns/counts).
rg -nP -C3 '(toolPattern|minToolCalls|maxToolCalls|toolArgs|arguments)' --type=yaml -g 'evals/**'
rg -nP -C3 '(ToolPattern|ToolArgs|Arguments|MinToolCalls)' -g '!vendor/**' -g '!**/testdata/**'

Repository: openshift/openshift-mcp-server

Length of output: 50389


🏁 Script executed:

cat -n evals/tasks/observability/alerts/filtered-alerts.yaml

Repository: openshift/openshift-mcp-server

Length of output: 1010


🏁 Script executed:

# Search for task.yaml files with tool argument assertions to understand the proper structure
find evals/tasks -name "*.yaml" -type f | head -20 | xargs -I {} sh -c 'echo "=== {} ===" && grep -A 10 "toolPattern\|args:" {} | head -15'

Repository: openshift/openshift-mcp-server

Length of output: 1473


🏁 Script executed:

# Look at the netedge example that showed args field
cat -n evals/tasks/netedge/query_prometheus/task.yaml

Repository: openshift/openshift-mcp-server

Length of output: 519


🏁 Script executed:

# Search for other observability tasks to understand the pattern
find evals/tasks/observability -name "*.yaml" -type f | xargs cat

Repository: openshift/openshift-mcp-server

Length of output: 20635


Use a tool-call argument assertion to verify the severity=warning filter is passed to get_alerts.

The current verify section only checks that AlertmanagerReceiversNotConfigured appears in the output; it does not verify the agent passed a severity=warning filter to the tool. An agent that fetched all alerts would still pass this check. The eval framework supports argument assertions (see evals/tasks/netedge/query_prometheus/task.yaml for the pattern), so add an args constraint under assertions.toolsUsed to verify the severity parameter was actually supplied:

assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_alerts"
      args:
        severity: "warning"

Also note: relying on AlertmanagerReceiversNotConfigured to always fire may be fragile. If a cluster configures Alertmanager receivers, this alert will be absent and the eval will fail unrelated to agent behavior. Consider using a more reliable alert (like Watchdog, which is standard across OpenShift clusters) or make the alert name configurable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/filtered-alerts.yaml` around lines 17-24: the verify step only checks for the presence of
AlertmanagerReceiversNotConfigured in the output and doesn’t assert that the
agent passed a severity=warning filter to the get_alerts tool; update the verify
block to add an assertions.toolsUsed entry that matches toolPattern "get_alerts"
and includes args with severity: "warning" so the eval framework verifies the
agent actually supplied the severity filter, and consider replacing or making
AlertmanagerReceiversNotConfigured configurable (or use a more reliable alert
like "Watchdog") to avoid false failures when that specific alert is absent.

22 changes: 22 additions & 0 deletions evals/tasks/observability/alerts/get-alerts.yaml
@@ -0,0 +1,22 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "get-alerts"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: alertmanager
  description: |
    Tests if the agent can discover and use the get_alerts tool to retrieve
    currently firing alerts from Alertmanager.
spec:
  verify:
    - llmJudge:
        contains: "Watchdog"
        reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
Comment on lines +17 to +19
⚠️ Potential issue | 🟠 Major

Watchdog-only assertion is brittle across environments.

Requiring a specific alert name can fail valid runs on clusters where Watchdog is not firing, and can still pass via hallucinated output. Prefer asserting on retrieved alert structure/content (for example alertname + status=firing) rather than one fixed alert name.

Suggested adjustment
   verify:
     - llmJudge:
-        contains: "Watchdog"
-        reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
+        contains: "alertname"
+        reason: "Verify the response includes concrete alert fields from retrieved firing alerts"
+    - llmJudge:
+        contains: "firing"
+        reason: "Verify the response reports firing-state alerts from Alertmanager output"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/get-alerts.yaml` around lines 17-19: the
llmJudge assertion is brittle because it checks for a literal "Watchdog"; change
the test to assert on alert structure/content instead of a single name: modify
the llmJudge block to validate that the retrieved alerts include at least one
entry with an "alertname" field and a "status" equal to "firing" (or equivalent
key/value pair), and update the "reason" to reflect this structural check;
reference the llmJudge assertion and its "contains" usage to replace the string
match with a structural/regex or JSON-path style check that looks for alertname
+ status=firing.

  prompt:
    inline: |
      Check the cluster for any firing alerts and report what you find.
22 changes: 22 additions & 0 deletions evals/tasks/observability/alerts/get-silences.yaml
@@ -0,0 +1,22 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "get-silences"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: alertmanager
  description: |
    Tests if the agent can discover and use the get_silences tool to retrieve
    active silences from Alertmanager.
spec:
  verify:
    - llmJudge:
        contains: "silences"
        reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
Comment on lines +17 to +19
⚠️ Potential issue | 🟠 Major

Judge token is prompt-leaked and allows false positives.

contains: "silences" is too generic because the same term is in the prompt, so a non-instrumented answer can pass.

Suggested adjustment
   verify:
     - llmJudge:
-        contains: "silences"
-        reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
+        contains: "matchers"
+        reason: "Verify the response includes concrete silence details when active silences exist"
+    - llmJudge:
+        contains: "no active silences"
+        reason: "Allow explicit empty-state reporting when no silences are present"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/alerts/get-silences.yaml` around lines 17-19: the
llmJudge check is too generic and subject to prompt-leakage because it only
asserts contains: "silences"; update the llmJudge assertion to look for more
specific, non-leaked indicators such as a phrase or regex that proves the agent
inspected silences (e.g. require "active silences" or "No active silences exist"
OR require presence of matcher fields like "matchers:" or a matcher key
pattern), by replacing the simple contains with a stricter string or regex match
(refer to llmJudge and the contains entry in the YAML) so only genuine
instrumented responses pass.

  prompt:
    inline: |
      Are there any active silences in Alertmanager?
26 changes: 26 additions & 0 deletions evals/tasks/observability/labels/get-series.yaml
@@ -0,0 +1,26 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "get-series-cardinality"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests if the agent can use the get_series tool to check cardinality for a metric.
    The agent should first verify the metric exists via list_metrics, then use
    get_series to retrieve matching time series and report the count.
spec:
  verify:
    - llmJudge:
        contains: "namespace"
        reason: "Verify the agent retrieved actual series data containing label dimensions like namespace"
    - llmJudge:
        contains: "kube_pod_info"
        reason: "Verify the agent queried the kube_pod_info metric and reported its cardinality"
  prompt:
    inline: |
      How many time series exist for the kube_pod_info metric? Show the count and list the label names present.
Comment on lines +17 to +26
⚠️ Potential issue | 🟠 Major

Both judge tokens are weak: one is in the prompt, the other is generic.

kube_pod_info is mentioned verbatim in the prompt, so the agent can pass that judge by merely restating the metric name without ever invoking list_metrics or get_series. namespace is a generic word that is almost certainly in any reasonable answer (and is implied by the prompt's "label names"), so it also doesn't prove real tool usage.

Suggest requiring evidence only a real get_series result can produce — e.g., a numeric cardinality count in the response, concrete label values observed in this cluster (kube-system, openshift-monitoring), or specific label keys returned by the API (pod, uid, host_ip). Complementing with a tool-call assertion on get_series would strengthen this further.
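One possible shape for this, where the label key uid and the namespace value kube-system are assumptions about what a live get_series result for kube_pod_info would contain:

verify:
  - llmJudge:
      contains: "uid"
      reason: "Verify the response lists label keys (e.g. uid, host_ip) that only a real get_series result would surface"
  - llmJudge:
      contains: "kube-system"
      reason: "Verify the response includes concrete namespace values observed in the series data"
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_series"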

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/labels/get-series.yaml` around lines 17-26: the
current llmJudge checks are too weak (they allow restating "kube_pod_info" or
generic "namespace"); update the verify section to require evidence only a real
get_series call can produce by (1) adding an llmJudge that asserts a numeric
cardinality (e.g., a regex or type check for an integer count) for the
kube_pod_info series, (2) requiring at least one concrete label value observed
(e.g., "kube-system" or "openshift-monitoring") and at least one specific label
key from the API (e.g., "pod", "uid", "host_ip"), and (3) adding a tool-call
assertion that the agent invoked get_series (or list_metrics) so the judge only
passes when the tool was actually used. Ensure references to the metric name
kube_pod_info and the tool name get_series are present in the updated verify
checks.

26 changes: 26 additions & 0 deletions evals/tasks/observability/labels/label-names.yaml
@@ -0,0 +1,26 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "label-names"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests if the agent follows the correct workflow: first calling list_metrics to
    verify kube_pod_info exists, then calling get_label_names to discover available
    labels for that metric.
spec:
  verify:
    - llmJudge:
        contains: "namespace"
        reason: "Verify the output includes the namespace label which is a standard Kubernetes label"
    - llmJudge:
        contains: "pod"
        reason: "Verify the output includes the pod label"
  prompt:
    inline: |
      What labels are available for the kube_pod_info metric?
22 changes: 22 additions & 0 deletions evals/tasks/observability/labels/label-values.yaml
@@ -0,0 +1,22 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "label-values"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests the full discovery workflow: list_metrics to verify the metric, then
    get_label_values to retrieve unique namespace values for kube_pod_info.
spec:
  verify:
    - llmJudge:
        contains: "kube-system"
        reason: "Verify the output lists actual namespace values from the cluster such as kube-system"
  prompt:
    inline: |
      What are the unique namespace values for the kube_pod_info metric?
28 changes: 28 additions & 0 deletions evals/tasks/observability/labels/series-by-namespace.yaml
@@ -0,0 +1,28 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "series-by-namespace"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests if the agent can use the get_series tool with a label selector
    to find time series scoped to a specific namespace. The agent should
    first verify the metric exists, then use get_series with a namespace
    matcher to report the cardinality within that scope.
spec:
  verify:
    - llmJudge:
        contains: "pod"
        reason: "Verify the agent retrieved actual series data containing label dimensions like pod"
    - llmJudge:
        contains: "container"
        reason: "Verify the agent reported series with container label values from the namespace"
  prompt:
    inline: |
      How many time series exist for container_cpu_usage_seconds_total
      in the openshift-monitoring namespace?
Comment on lines +19 to +28
⚠️ Potential issue | 🟠 Major

contains: "container" is trivially satisfied by the prompt.

container_cpu_usage_seconds_total is in the prompt itself, so any response that even restates the metric name passes the container judge without proving the agent ever called get_series or returned real label values. pod is also a very generic token likely to appear in any plausible narrative.

Recommend asserting on evidence only obtainable from a real get_series call against openshift-monitoring, e.g. concrete label values (namespace="openshift-monitoring", a real pod name like prometheus-k8s-0, or a container value like prometheus), or assert a numeric cardinality is reported. A paired tool-call assertion on get_series would further reduce false positives.
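A sketch of that approach, where prometheus-k8s-0 is an assumed pod name that live series data from openshift-monitoring would contain:

verify:
  - llmJudge:
      contains: "prometheus-k8s-0"
      reason: "Verify the response names a real pod from openshift-monitoring, which only live get_series data can supply"
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_series"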

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/labels/series-by-namespace.yaml` around lines 19-28: the llmJudge checks ("contains: \"container\"" and "contains: \"pod\"") are
too generic and can be satisfied by the prompt itself; update the assertions in
this YAML to require evidence only obtainable from a real get_series call (for
example assert concrete label values like namespace="openshift-monitoring", a
real pod name such as prometheus-k8s-0, or a container value like prometheus),
or assert a numeric cardinality for container_cpu_usage_seconds_total in
openshift-monitoring; additionally add a paired tool-call assertion that
verifies the get_series tool was invoked for the openshift-monitoring namespace
to prevent false positives (referencing the llmJudge entries and the prompt
inline for where to change the checks).

22 changes: 22 additions & 0 deletions evals/tasks/observability/metrics/list-metrics.yaml
@@ -0,0 +1,22 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "list-kube-metrics"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: metrics
    suite: observability
    toolType: discovery
  description: |
    Tests if the agent can discover the list_metrics tool to find Kubernetes-related
    metrics. The agent should use the name_regex parameter to filter for kube metrics.
Comment on lines +12 to +14
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -nP -C2 '"list_metrics"|name_regex' pkg/
rg -nP -C3 'initPrometheus|initAlertmanager' pkg/toolsets/observability/

Repository: openshift/openshift-mcp-server

Length of output: 3364


🏁 Script executed:

cat -n evals/tasks/observability/metrics/list-metrics.yaml | head -25

Repository: openshift/openshift-mcp-server

Length of output: 882


🏁 Script executed:

rg -nP 'label.?values|__name__|ListMetrics' pkg/toolsets/observability/ pkg/prometheus/

Repository: openshift/openshift-mcp-server

Length of output: 474


🏁 Script executed:

sed -n '23,60p' pkg/toolsets/observability/prometheus.go
sed -n '63,120p' pkg/toolsets/observability/prometheus.go

Repository: openshift/openshift-mcp-server

Length of output: 3696


Description references a list_metrics tool that doesn't exist in the observability toolset.

The registered tools are only prometheus_query, prometheus_query_range, and alertmanager_alerts. There is no list_metrics tool and no name_regex parameter. To list metrics matching a pattern (e.g., containing 'kube'), the agent would need to use a PromQL query like {__name__=~"kube.*"} with the available prometheus_query tool. Update the task description and prompt to match the actual tools exposed by the server.
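If the finding holds, the description could be reworded roughly like this (phrasing illustrative; the prompt needs no change since it doesn't name a tool):

  description: |
    Tests if the agent can discover Kubernetes-related metrics by running a
    PromQL name-match query such as {__name__=~"kube.*"} via the
    prometheus_query tool.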

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/metrics/list-metrics.yaml` around lines 12-14, update the task description and agent prompt to remove references to the
non-existent list_metrics tool and the name_regex parameter and instead instruct
the agent to use the existing prometheus_query (or prometheus_query_range) tool
with a PromQL name-match filter; specifically mention using a query like
`{__name__=~"kube.*"}` to list metrics containing "kube", and reference the
observable tool names prometheus_query and prometheus_query_range so the task
aligns with the actual observability toolset.

spec:
  verify:
    - llmJudge:
        contains: "kube_pod_info"
        reason: "Verify the output lists specific kube metrics discovered from Prometheus"
  prompt:
    inline: |
      List all available Prometheus metrics that contain 'kube' in the name.
22 changes: 22 additions & 0 deletions evals/tasks/observability/metrics/list-node-metrics.yaml
@@ -0,0 +1,22 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "list-node-metrics"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: metrics
    suite: observability
    toolType: discovery
  description: |
    Tests if the agent can discover node-related metrics using the list_metrics tool
    with a regex filter for node metrics.
spec:
  verify:
    - llmJudge:
        contains: "node_"
        reason: "Verify the output lists specific node metrics discovered from Prometheus (e.g. node_cpu_seconds_total)"
Comment on lines +17 to +19
⚠️ Potential issue | 🟡 Minor

Weak llmJudge.contains assertion — prone to false positives.

The substring node_ is a very common Prometheus metric prefix and can easily appear in a response even when the agent does not actually call list_metrics (e.g., the model can recall node_cpu_seconds_total from training). This echoes the reviewer feedback on the PR about overly generic contains values allowing agents to pass without real tool use. Consider pairing with a tool-usage assertion in the taskSet (minToolCalls: 1 is already enforced, but you may want to further constrain the tool pattern to list_metrics for discovery tasks) or adding a second llmJudge with a more discriminating substring (e.g., a metric name unlikely to be hallucinated, or a phrase only present in live list_metrics output).
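For instance, a second judge pinned to a full metric name, assuming node_cpu_seconds_total is exported by this cluster's node exporter:

  - llmJudge:
      contains: "node_cpu_seconds_total"
      reason: "Verify the output includes a complete node metric name as returned by a real list_metrics call"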

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/metrics/list-node-metrics.yaml` around lines 17-19, replace the weak llmJudge.contains "node_" check with a stricter assertion:
either add a second llmJudge entry that checks for a specific,
less-likely-to-be-hallucinated metric name (e.g., "node_cpu_seconds_total" or
another long metric string) or tighten tool-usage validation to assert the tool
pattern equals "list_metrics" (in addition to the existing minToolCalls: 1) so
the task requires an actual call to list_metrics; update the
list-node-metrics.yaml llmJudge and taskSet entries accordingly to reference
llmJudge.contains and the tool pattern "list_metrics" or add a new llmJudge with
the discriminating substring.

  prompt:
    inline: |
      What node-related metrics are available in Prometheus?
23 changes: 23 additions & 0 deletions evals/tasks/observability/queries/backend-reachability.yaml
@@ -0,0 +1,23 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "backend-reachability"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: smoke-test
  description: |
    Smoke test to verify the agent can reach the metrics backend and get a
    valid response. The agent should query the up metric and confirm the
    backend is reachable before any complex eval tasks run.
spec:
  verify:
    - llmJudge:
        contains: "prometheus-k8s"
        reason: "Verify the agent queried the up metric and reported on the prometheus-k8s target"
  prompt:
    inline: |
      Is the metrics backend reachable? Query up{job="prometheus-k8s"} and report the status.
Comment on lines +18 to +23
⚠️ Potential issue | 🟠 Major

False-positive risk: judge token is present verbatim in the prompt.

The prompt contains up{job="prometheus-k8s"}, so the agent can satisfy contains: "prometheus-k8s" by merely repeating the prompt (e.g., "I will query up{job=\"prometheus-k8s\"}…") without actually reaching the metrics backend. This is the same class of false-positive pass that earlier reviewers flagged on other tasks.

Consider validating on something that can only be known from a successful tool call — for example, the numeric result (1 / up state), an instance= label value returned by Prometheus, or an explicit statement that the target is up/reachable. Pairing with a tool-usage assertion on the Prometheus query tool would make this much more robust.
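A sketch combining both ideas; the toolPattern value assumes the query tool is registered as prometheus_query, per the list-metrics finding above:

verify:
  - llmJudge:
      contains: "instance="
      reason: "Verify the response includes an instance label value that only a live Prometheus result would contain"
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "prometheus_query"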

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/backend-reachability.yaml` around lines 18-23: the current llmJudge `contains: "prometheus-k8s"` can be spoofed by
echoing the prompt token; update the validation to require evidence only
obtainable from an actual Prometheus query — for example change the judge rule
to assert the numeric up value or an `instance=` label (e.g., expect `"1"` or
`"instance="` in the response) or require the text "up" / "reachable" alongside
a numeric result, and pair this with a tool-usage assertion for the Prometheus
query; modify the `llmJudge` block and the `prompt.inline` (the
`up{job="prometheus-k8s"}` query) so the judge checks the query result (numeric
or instance label) rather than just the literal job name.

25 changes: 25 additions & 0 deletions evals/tasks/observability/queries/cpu-usage.yaml
@@ -0,0 +1,25 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "cpu-usage"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: query
  description: |
    Tests if the agent can find the appropriate CPU usage metric and query it to
    determine which pods are consuming the most CPU resources.
spec:
  verify:
    - llmJudge:
        contains: "container_cpu_usage_seconds_total"
        reason: "Verify the agent queried the correct CPU metric"
    - llmJudge:
        contains: "pod"
        reason: "Verify the response identifies specific pods"
Comment on lines +20 to +22
⚠️ Potential issue | 🟠 Major

False-positive risk: contains: "pod" is trivially satisfied.

This is exactly the concern raised by reviewers on the PR: the prompt is "Which pods are using the most CPU?", so any response that merely restates or references the prompt (including a refusal or a wrong answer) will contain the substring pod and pass this assertion without the agent actually identifying real pods.

Consider replacing this with a stricter check — e.g., require a concrete pod-identifying field/label such as pod= or pod_name, or rely solely on the container_cpu_usage_seconds_total assertion and remove this one.

Suggested diff
     - llmJudge:
-        contains: "pod"
-        reason: "Verify the response identifies specific pods"
+        contains: "pod="
+        reason: "Verify the response identifies specific pods by label (e.g., pod=\"<name>\")"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/cpu-usage.yaml` around lines 20-22: the
llmJudge assertion `contains: "pod"` is too weak and causes false positives;
update the llmJudge in cpu-usage.yaml to assert a concrete pod identifier (for
example require `pod_name`, `pod=`, or a Kubernetes UID pattern) or remove the
`contains: "pod"` check and rely on the existing
`container_cpu_usage_seconds_total` assertion instead so the judge only passes
when an actual pod identifier is present. Target the `llmJudge` entry and
replace the `contains: "pod"` condition with a stricter token such as `contains:
"pod_name"` or `contains: "pod="` (or delete that assertion) to ensure responses
list real pod identifiers.

  prompt:
    inline: |
      Which pods are using the most CPU?
22 changes: 22 additions & 0 deletions evals/tasks/observability/queries/crashlooping-pods.yaml
@@ -0,0 +1,22 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "crashlooping-pods"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: range-query
  description: |
    Tests if the agent can identify crashlooping pods by discovering and querying
    the kube_pod_container_status_restarts_total metric over a time range.
spec:
  verify:
    - llmJudge:
        contains: "kube_pod_container_status_restarts_total"
        reason: "Verify the agent used the correct metric for container restart counts"
  prompt:
    inline: |
      Which pods were crashlooping in the last 5 minutes?
28 changes: 28 additions & 0 deletions evals/tasks/observability/queries/diagnose-cluster-health.yaml
@@ -0,0 +1,28 @@
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "diagnose-cluster-health"
  difficulty: hard
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: diagnostic
  description: |
    Tests the agent's ability to handle an ambiguous diagnostic prompt.
    The agent must autonomously decide which tools and metrics to check,
    ideally covering alerts, node status, and pod health without being
    told exactly what to look for.
spec:
  verify:
    - llmJudge:
        contains: "Watchdog"
        reason: "Verify the agent checked alerts and reported specific alert names such as Watchdog"
    - llmJudge:
        contains: "kube_"
        reason: "Verify the agent queried Kubernetes metrics (e.g. kube_node_status_condition) for health assessment"
Comment on lines +22 to +24
⚠️ Potential issue | 🟡 Minor

contains: "kube_" is a very loose marker.

kube_ will match any casual mention of a kube_* metric name in the agent's narrative (including hypothetical references like "I would check kube_node_status_condition") without proof that the agent actually invoked a metrics tool. Consider tightening to a concrete metric the agent is expected to return results from (e.g., kube_node_status_condition or kube_pod_status_phase), or complement with a tool-usage assertion against the metrics query tool so the judge doesn't pass on prose alone.
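A tightened second judge, assuming kube_node_status_condition is among the metrics a correct run would actually query:

  - llmJudge:
      contains: "kube_node_status_condition"
      reason: "Verify the agent queried a concrete node-health metric rather than merely mentioning kube_ names"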

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/diagnose-cluster-health.yaml` around lines
22-24: the llmJudge rule using contains: "kube_" is too loose; update the
llmJudge block to assert a concrete metric (e.g., contains:
"kube_node_status_condition" or "kube_pod_status_phase") and/or add a
complementary tool-usage assertion that checks the metrics query tool output
(for example assert that the metrics tool returned results or that a field like
metrics_query_response contains those metric names); target the llmJudge entry
and replace the generic contains: "kube_" with the chosen concrete metric string
and/or add an additional assertion verifying the tool invocation/response so the
judge cannot pass on narrative mentions alone.

  prompt:
    inline: |
      Is the cluster healthy? Give me an overview of any issues.
      Check alerts, node status, and pod health.