evals: replace generic llmJudge contains values with tool-derived checks by slashpai · Pull Request #71 · rhobs/obs-mcp

slashpai · 2026-04-22T05:43:28Z

address openshift/openshift-mcp-server#241 (comment)

openshift-ci · 2026-04-22T05:43:31Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

tremes · 2026-04-22T08:00:18Z

-        contains: "alert"
-        reason: "Verify the agent retrieved alerts from Alertmanager"
+        contains: "alertname"
+        reason: "Verify the agent retrieved alerts and reported specific alert names from Alertmanager"


What about something like this? 🤔

Suggested change

reason: "Verify the agent retrieved alerts and reported specific alert names from Alertmanager"

contains: "alertname"

reason: "Verify the agent called get_alerts and reported the results: either listing specific alert names from Alertmanager, or explicitly confirming that no alerts are currently firing"

Generic contains values like "alert", "cpu", "memory", "namespace", and "node" already appear in the prompt, so an agent could pass by echoing the prompt without calling tools. Replace with specific metric names, label values, and data patterns that require actual tool interaction.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
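The echo problem described in this commit message can be checked mechanically. The following is a minimal sketch, not part of the PR or of mcpchecker: `echoable_values` is a hypothetical helper, and case-insensitive substring matching is an assumption about how a `contains` value might be compared.

```python
def echoable_values(prompt: str, contains_values: list[str]) -> list[str]:
    """Return the contains values that already occur in the prompt.

    Such values are weak checks: an agent could satisfy them by
    repeating the prompt without ever calling a tool.
    Matching is case-insensitive substring matching (an assumption).
    """
    lowered = prompt.lower()
    return [value for value in contains_values if value.lower() in lowered]


# Prompt text taken from the review thread; the candidate values are illustrative.
prompt = "Are there any firing alerts with severity=critical? Show only active alerts."
weak = echoable_values(prompt, ["alert", "Watchdog"])
print(weak)  # ['alert']
```

A value such as "Watchdog" is not flagged because it can only come from the tool output, which is the property the commit is after.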

Use OpenShift-specific alert names (Watchdog, AlertmanagerReceiversNotConfigured) in contains checks since evals run against OpenShift clusters where these alerts reliably fire. Remove the no-alerts branch from alert-investigation to avoid the agent asking for permission instead of acting.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

- series-by-namespace: use openshift-monitoring namespace instead of non-existent monitoring namespace
- get-series: check for kube_pod_info metric name instead of node label which agents omit when reporting cardinality counts
- diagnose-cluster-health: check for Watchdog alert name instead of alertname label key which agents don't echo literally

Signed-off-by: Jayapriya Pai <janantha@redhat.com>

- Accept either container_memory_working_set_bytes or container_memory_usage_bytes in memory-usage and namespace-resource-usage tasks
- Update get-series prompt to ask for label names so the namespace contains check has something to match

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
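The "accept either metric" rule in this commit amounts to an any-of check. A rough sketch of that logic follows, assuming plain substring matching; `mentions_any` is a hypothetical helper and does not reflect how mcpchecker evaluates its checks.

```python
# Metric names taken from the commit message above.
ACCEPTED_MEMORY_METRICS = (
    "container_memory_working_set_bytes",
    "container_memory_usage_bytes",
)


def mentions_any(response: str, candidates: tuple[str, ...]) -> bool:
    """Return True if the response names at least one accepted candidate."""
    return any(candidate in response for candidate in candidates)


# Illustrative agent responses, not taken from a real run.
used_working_set = "I queried container_memory_working_set_bytes for each pod."
used_nothing = "Memory usage looks fine."

print(mentions_any(used_working_set, ACCEPTED_MEMORY_METRICS))
print(mentions_any(used_nothing, ACCEPTED_MEMORY_METRICS))
```

Accepting both names keeps the check tied to real tool output while not penalizing an agent for choosing the other memory metric.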

slashpai · 2026-04-23T07:22:46Z

Tested on ROSA OpenShift 4.21. There is no Thanos TSDB status endpoint, so the high-cardinality-rejection failure is expected.

make run-no-guardrails

$ make run-mcpchecker-eval
cd evals/mcpchecker && /Users/jayapriyapai/github.com/slashpai/obs-mcp/tmp/bin/mcpchecker check eval.yaml --runs 1 --parallel 4

=== Starting Evaluation ===

[crashlooping-pods] Starting (parallel, medium)

[high-cardinality-rejection] Starting (parallel, medium)

[namespace-pod-count] Starting (parallel, medium)

[label-names] Starting (parallel, easy)
[crashlooping-pods] → Running agent...
[label-names] → Running agent...
[high-cardinality-rejection] → Running agent...
[namespace-pod-count] → Running agent...
2026/04/23 12:31:51 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:31:51 INFO connection closed cause="io: read/write on closed pipe"
[label-names] → Verifying results...
2026/04/23 12:31:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:31:58 INFO connection closed cause="io: read/write on closed pipe"
[namespace-pod-count] → Verifying results...
2026/04/23 12:32:14 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:14 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:31 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:31 INFO connection closed cause="io: read/write on closed pipe"
[crashlooping-pods] → Verifying results...
2026/04/23 12:32:47 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:47 INFO connection closed cause="io: read/write on closed pipe"
[crashlooping-pods] ✓ Task passed

[time-range-query] Starting (parallel, medium)
[time-range-query] → Running agent...
2026/04/23 12:33:18 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:18 INFO connection closed cause="io: read/write on closed pipe"
[high-cardinality-rejection] → Verifying results...
2026/04/23 12:33:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
[high-cardinality-rejection] ✗ Task failed
[high-cardinality-rejection]   Error: one or more verification steps failed

[diagnose-cluster-health] Starting (parallel, hard)
[diagnose-cluster-health] → Running agent...
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
[label-names] ✓ Task passed

[visualize-cpu-usage] Starting (parallel, medium)
[visualize-cpu-usage] → Running agent...
2026/04/23 12:34:08 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:08 INFO connection closed cause="io: read/write on closed pipe"
[namespace-pod-count] ✗ Task failed
[namespace-pod-count]   Error: one or more verification steps failed

[filtered-alerts] Starting (parallel, medium)
[filtered-alerts] → Running agent...
2026/04/23 12:34:28 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:28 INFO connection closed cause="io: read/write on closed pipe"
[visualize-cpu-usage] → Verifying results...
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
[time-range-query] → Verifying results...
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
[filtered-alerts] → Verifying results...
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
[filtered-alerts] ✓ Task passed

[get-alerts] Starting (parallel, easy)
[get-alerts] → Running agent...
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
[diagnose-cluster-health] → Verifying results...
2026/04/23 12:34:53 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:53 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:57 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:57 INFO connection closed cause="io: read/write on closed pipe"
[visualize-cpu-usage] ✓ Task passed

[alert-investigation] Starting (parallel, medium)
[alert-investigation] → Running agent...
2026/04/23 12:35:05 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:05 INFO connection closed cause="io: read/write on closed pipe"
[get-alerts] → Verifying results...
2026/04/23 12:35:14 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:14 INFO connection closed cause="io: read/write on closed pipe"
[time-range-query] ✓ Task passed

[nonexistent-metric] Starting (parallel, easy)
[nonexistent-metric] → Running agent...
2026/04/23 12:35:16 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:16 INFO connection closed cause="io: read/write on closed pipe"
[get-alerts] ✓ Task passed

[memory-usage] Starting (parallel, medium)
[memory-usage] → Running agent...
2026/04/23 12:35:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:29 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-metric] → Verifying results...
2026/04/23 12:35:48 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:48 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-metric] ✓ Task passed

[label-values] Starting (parallel, medium)
[label-values] → Running agent...
2026/04/23 12:36:06 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:06 INFO connection closed cause="io: read/write on closed pipe"
[label-values] → Verifying results...
2026/04/23 12:36:13 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:13 INFO connection closed cause="io: read/write on closed pipe"
[alert-investigation] → Verifying results...
2026/04/23 12:36:17 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:17 INFO connection closed cause="io: read/write on closed pipe"
[memory-usage] → Verifying results...
2026/04/23 12:36:22 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:22 INFO connection closed cause="io: read/write on closed pipe"
[label-values] ✓ Task passed

[get-series-cardinality] Starting (parallel, medium)
[get-series-cardinality] → Running agent...
2026/04/23 12:36:26 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:26 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:37 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:37 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:44 INFO connection closed cause="io: read/write on closed pipe"
[get-series-cardinality] → Verifying results...
2026/04/23 12:36:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:01 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:01 INFO connection closed cause="io: read/write on closed pipe"
[memory-usage] ✓ Task passed

[series-by-namespace] Starting (parallel, medium)
[series-by-namespace] → Running agent...
2026/04/23 12:37:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:21 INFO connection closed cause="io: read/write on closed pipe"
[series-by-namespace] → Verifying results...
2026/04/23 12:37:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:29 INFO connection closed cause="io: read/write on closed pipe"
[alert-investigation] ✓ Task passed

[pending-pods] Starting (parallel, medium)
[pending-pods] → Running agent...
2026/04/23 12:37:39 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:39 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:38:27 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:38:27 INFO connection closed cause="io: read/write on closed pipe"
[pending-pods] → Verifying results...
2026/04/23 12:38:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:38:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:07 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:07 INFO connection closed cause="io: read/write on closed pipe"
[get-series-cardinality] ✓ Task passed

[cpu-usage] Starting (parallel, medium)
[cpu-usage] → Running agent...
2026/04/23 12:39:10 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:10 INFO connection closed cause="io: read/write on closed pipe"
[pending-pods] ✓ Task passed

[nonexistent-namespace] Starting (parallel, easy)
[nonexistent-namespace] → Running agent...
2026/04/23 12:39:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:44 INFO connection closed cause="io: read/write on closed pipe"
[cpu-usage] → Verifying results...
2026/04/23 12:40:09 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:09 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-namespace] → Verifying results...
2026/04/23 12:40:28 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:28 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-namespace] ✓ Task passed

[get-silences] Starting (parallel, easy)
[get-silences] → Running agent...
2026/04/23 12:40:32 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:32 INFO connection closed cause="io: read/write on closed pipe"
[diagnose-cluster-health] ✗ Task failed
[diagnose-cluster-health]   Error: one or more verification steps failed

[namespace-resource-usage] Starting (parallel, hard)
[namespace-resource-usage] → Running agent...
2026/04/23 12:40:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:43 INFO connection closed cause="io: read/write on closed pipe"
[get-silences] → Verifying results...
2026/04/23 12:41:05 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:05 INFO connection closed cause="io: read/write on closed pipe"
[get-silences] ✓ Task passed

[prometheus-requests] Starting (parallel, medium)
[prometheus-requests] → Running agent...
2026/04/23 12:41:10 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:10 INFO connection closed cause="io: read/write on closed pipe"
[series-by-namespace] ✗ Task failed
[series-by-namespace]   Error: one or more verification steps failed

[network-traffic] Starting (parallel, medium)
[network-traffic] → Running agent...
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
[network-traffic] → Verifying results...
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-requests] → Verifying results...
2026/04/23 12:41:54 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:54 INFO connection closed cause="io: read/write on closed pipe"
[namespace-resource-usage] → Verifying results...
2026/04/23 12:42:17 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:17 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:18 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:18 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-requests] ✓ Task passed

[prometheus-head-series] Starting (parallel, easy)
[prometheus-head-series] → Running agent...
2026/04/23 12:42:40 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:40 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-head-series] → Verifying results...
2026/04/23 12:42:41 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:41 INFO connection closed cause="io: read/write on closed pipe"
[cpu-usage] ✓ Task passed

[backend-reachability] Starting (parallel, easy)
[backend-reachability] → Running agent...
2026/04/23 12:42:42 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:42 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:58 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-head-series] ✗ Task failed
[prometheus-head-series]   Error: one or more verification steps failed

[prometheus-wal-size] Starting (parallel, easy)
[prometheus-wal-size] → Running agent...
2026/04/23 12:42:59 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:59 INFO connection closed cause="io: read/write on closed pipe"
[backend-reachability] → Verifying results...
2026/04/23 12:43:01 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:01 INFO connection closed cause="io: read/write on closed pipe"
[network-traffic] ✓ Task passed

[list-node-metrics] Starting (parallel, easy)
[list-node-metrics] → Running agent...
2026/04/23 12:43:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:29 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-wal-size] → Verifying results...
2026/04/23 12:43:31 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:31 INFO connection closed cause="io: read/write on closed pipe"
[list-node-metrics] → Verifying results...
2026/04/23 12:43:33 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:33 INFO connection closed cause="io: read/write on closed pipe"
[backend-reachability] ✓ Task passed

[pods-created] Starting (parallel, medium)
[pods-created] → Running agent...
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-wal-size] ✓ Task passed

[list-kube-metrics] Starting (parallel, easy)
[list-kube-metrics] → Running agent...
2026/04/23 12:43:49 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:49 INFO connection closed cause="io: read/write on closed pipe"
[list-node-metrics] ✓ Task passed
2026/04/23 12:44:27 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:44:27 INFO connection closed cause="io: read/write on closed pipe"
[pods-created] → Verifying results...
2026/04/23 12:44:50 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:44:50 INFO connection closed cause="io: read/write on closed pipe"
[list-kube-metrics] → Verifying results...
2026/04/23 12:45:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:45:21 INFO connection closed cause="io: read/write on closed pipe"
[list-kube-metrics] ✓ Task passed
2026/04/23 12:45:45 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:45:45 INFO connection closed cause="io: read/write on closed pipe"
[namespace-resource-usage] ✗ Task failed
[namespace-resource-usage]   Error: one or more verification steps failed
2026/04/23 12:46:12 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:46:12 INFO connection closed cause="io: read/write on closed pipe"
[pods-created] ~ Task passed but assertions failed

=== Evaluation Complete ===

📄 Results saved to: mcpchecker-obs-mcp-tools-out.json

=== Results Summary ===

Task: crashlooping-pods
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/crashlooping-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: high-cardinality-rejection
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/high-cardinality-rejection.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (3/3)

Task: label-names
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-names.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: namespace-pod-count
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-pod-count.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: filtered-alerts
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/filtered-alerts.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: visualize-cpu-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/visualize-cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: time-range-query
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/time-range-query.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-alerts
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-alerts.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: nonexistent-metric
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-metric.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: label-values
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-values.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: memory-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/memory-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: alert-investigation
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/alert-investigation.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: get-series-cardinality
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/get-series.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: pending-pods
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pending-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: nonexistent-namespace
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-namespace.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: diagnose-cluster-health
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/diagnose-cluster-health.yaml
  Difficulty: hard
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (3/3)

Task: get-silences
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-silences.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: series-by-namespace
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/series-by-namespace.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: prometheus-requests
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-requests.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: cpu-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: prometheus-head-series
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-head-series.yaml
  Difficulty: easy
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: network-traffic
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/network-traffic.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: backend-reachability
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/backend-reachability.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: prometheus-wal-size
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-wal-size.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: list-node-metrics
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-node-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: list-kube-metrics
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: namespace-resource-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-resource-usage.yaml
  Difficulty: hard
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: pods-created
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pods-created.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: FAILED (2/4)
    - ToolsUsed: Required tool not called: server=obs, tool=execute_range_query, pattern=
    - CallOrder: Expected call order not satisfied. Got to 1/2

=== Overall Statistics ===
Total Tasks: 28
Tasks Passed: 22/28
Assertions Passed: 100/102

Tasks where verification failed but assertions passed: 6
  Assertions in these tasks: 22/22
Tokens:     ~747266 (estimate - excludes system prompt & cache)
MCP schemas: ~83216 (included in token total)

=== Statistics by Difficulty ===

easy:
  Tasks: 9/10
  Assertions: 34/34

medium:
  Tasks: 13/16
  Assertions: 59/61

hard:
  Tasks: 0/2
  Assertions: 7/7
⏱️  Completed in 14m35s

22 out of 28 — 78.6% pass rate.

If you exclude high-cardinality-rejection (expected fail without guardrails), it's 22 out of 27 — 81.5%.
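The two percentages can be reproduced from the task counts in the summary above. A small sketch, where `pass_rate` is a helper introduced only for illustration:

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


# 22 of 28 tasks passed overall.
overall = pass_rate(22, 28)
# Dropping the one expected failure leaves 27 tasks and the same 22 passes.
without_expected_failure = pass_rate(22, 27)

print(overall)                   # 78.6
print(without_expected_failure)  # 81.5
```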

slashpai · 2026-04-23T08:02:34Z

@nader-ziada Could you please take a look at this? This change is to address openshift/openshift-mcp-server#241 (comment)

slashpai · 2026-04-23T08:13:21Z

/hold for review from mcp team as well

tremes · 2026-04-23T08:16:15Z

  prompt:
    inline: |
-      Are there any firing alerts with severity=critical? Show only active alerts.
+      Show me only the active alerts with severity=warning.


Should we have the severity=warning here? 🤔

I was trying to be specific here, since we know that two warning alerts always fire in a default cluster: Watchdog and AlertmanagerReceiversNotConfigured.

openshift-ci · 2026-04-23T13:31:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iNecas, slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [iNecas,slashpai]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

slashpai · 2026-04-23T13:32:14Z

/hold cancel

openshift-ci Bot added do-not-merge/work-in-progress approved labels Apr 22, 2026

tremes reviewed Apr 22, 2026

slashpai force-pushed the mcp-evals branch 2 times, most recently from f93c8c7 to 26de704 Compare April 22, 2026 10:37

slashpai added 4 commits April 23, 2026 12:03

slashpai force-pushed the mcp-evals branch from 4a7146c to 519583b Compare April 23, 2026 07:19

slashpai marked this pull request as ready for review April 23, 2026 08:01

slashpai requested a review from a team April 23, 2026 08:01

openshift-ci Bot removed the do-not-merge/work-in-progress label Apr 23, 2026

openshift-ci Bot requested review from iNecas and rexagod April 23, 2026 08:01

openshift-ci Bot added the do-not-merge/hold label Apr 23, 2026

tremes reviewed Apr 23, 2026

iNecas approved these changes Apr 23, 2026

openshift-ci Bot assigned iNecas Apr 23, 2026

openshift-ci Bot added the lgtm label Apr 23, 2026

openshift-ci Bot removed the do-not-merge/hold label Apr 23, 2026

slashpai merged commit 0533989 into rhobs:main Apr 23, 2026
6 of 7 checks passed

slashpai merged 4 commits into rhobs:main from slashpai:mcp-evals
