
evals: replace generic llmJudge contains values with tool-derived checks #71

Merged

slashpai merged 4 commits into rhobs:main from slashpai:mcp-evals on Apr 23, 2026
Conversation

@slashpai (Member)

openshift-ci Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

contains: "alert"
reason: "Verify the agent retrieved alerts from Alertmanager"
contains: "alertname"
reason: "Verify the agent retrieved alerts and reported specific alert names from Alertmanager"
Contributor

What about something like this? 🤔

Suggested change
reason: "Verify the agent retrieved alerts and reported specific alert names from Alertmanager"
contains: "alertname"
reason: "Verify the agent called get_alerts and reported the results: either listing specific alert names from Alertmanager, or explicitly confirming that no alerts are currently firing"

slashpai force-pushed the mcp-evals branch 2 times, most recently from f93c8c7 to 26de704 on April 22, 2026 10:37
Generic contains values like "alert", "cpu", "memory", "namespace", and
"node" already appear in the prompt, so an agent could pass by echoing
the prompt without calling tools. Replace with specific metric names,
label values, and data patterns that require actual tool interaction.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
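
As an illustration of the shift described above (the contains/reason keys mirror the task snippets quoted earlier in this thread; the exact metric name is a hypothetical example, not necessarily one used by this commit):

  # Before: passes even if the agent just echoes the prompt
  contains: "cpu"
  reason: "Verify the agent reported CPU usage"

  # After: only appears in output if a PromQL query was actually executed
  contains: "container_cpu_usage_seconds_total"
  reason: "Verify the agent executed a query using container_cpu_usage_seconds_total and reported the results"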
Use OpenShift-specific alert names (Watchdog,
AlertmanagerReceiversNotConfigured) in contains checks since evals
run against OpenShift clusters where these alerts reliably fire.
Remove the no-alerts branch from alert-investigation to avoid the
agent asking for permission instead of acting.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
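
A sketch of what the OpenShift-specific alert checks could look like (the reason wording is illustrative; Watchdog and AlertmanagerReceiversNotConfigured are the alerts named in the commit):

  contains: "Watchdog"
  reason: "Verify the agent called get_alerts and reported the always-firing Watchdog alert by name"
  contains: "AlertmanagerReceiversNotConfigured"
  reason: "Verify the agent listed the AlertmanagerReceiversNotConfigured alert returned by Alertmanager"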
- series-by-namespace: use openshift-monitoring namespace instead of
  non-existent monitoring namespace
- get-series: check for kube_pod_info metric name instead of node label
  which agents omit when reporting cardinality counts
- diagnose-cluster-health: check for Watchdog alert name instead of
  alertname label key which agents don't echo literally

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- Accept either container_memory_working_set_bytes or
  container_memory_usage_bytes in memory-usage and
  namespace-resource-usage tasks
- Update get-series prompt to ask for label names so the
  namespace contains check has something to match

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
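
One way the either/or acceptance could be expressed, assuming the contains check is a plain substring match (an assumption about mcpchecker, not something stated in this PR), is to match the prefix shared by both metric names:

  contains: "container_memory"
  reason: "Verify the agent queried container_memory_working_set_bytes or container_memory_usage_bytes and reported per-pod memory usage"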
@slashpai (Member Author)

Ran on ROSA OpenShift 4.21, which has no Thanos TSDB status endpoint, so the high-cardinality-rejection failure is expected.

make run-no-guardrails

make run-mcpchecker-eval
cd evals/mcpchecker && /Users/jayapriyapai/github.com/slashpai/obs-mcp/tmp/bin/mcpchecker check eval.yaml --runs 1 --parallel 4

=== Starting Evaluation ===

[crashlooping-pods] Starting (parallel, medium)

[high-cardinality-rejection] Starting (parallel, medium)

[namespace-pod-count] Starting (parallel, medium)

[label-names] Starting (parallel, easy)
[crashlooping-pods] → Running agent...
[label-names] → Running agent...
[high-cardinality-rejection] → Running agent...
[namespace-pod-count] → Running agent...
2026/04/23 12:31:51 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:31:51 INFO connection closed cause="io: read/write on closed pipe"
[label-names] → Verifying results...
2026/04/23 12:31:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:31:58 INFO connection closed cause="io: read/write on closed pipe"
[namespace-pod-count] → Verifying results...
2026/04/23 12:32:14 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:14 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:31 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:31 INFO connection closed cause="io: read/write on closed pipe"
[crashlooping-pods] → Verifying results...
2026/04/23 12:32:47 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:32:47 INFO connection closed cause="io: read/write on closed pipe"
[crashlooping-pods] ✓ Task passed

[time-range-query] Starting (parallel, medium)
[time-range-query] → Running agent...
2026/04/23 12:33:18 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:18 INFO connection closed cause="io: read/write on closed pipe"
[high-cardinality-rejection] → Verifying results...
2026/04/23 12:33:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
[high-cardinality-rejection] ✗ Task failed
[high-cardinality-rejection]   Error: one or more verification steps failed

[diagnose-cluster-health] Starting (parallel, hard)
[diagnose-cluster-health] → Running agent...
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:33:35 INFO connection closed cause="io: read/write on closed pipe"
[label-names] ✓ Task passed

[visualize-cpu-usage] Starting (parallel, medium)
[visualize-cpu-usage] → Running agent...
2026/04/23 12:34:08 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:08 INFO connection closed cause="io: read/write on closed pipe"
[namespace-pod-count] ✗ Task failed
[namespace-pod-count]   Error: one or more verification steps failed

[filtered-alerts] Starting (parallel, medium)
[filtered-alerts] → Running agent...
2026/04/23 12:34:28 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:28 INFO connection closed cause="io: read/write on closed pipe"
[visualize-cpu-usage] → Verifying results...
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
[time-range-query] → Verifying results...
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:29 INFO connection closed cause="io: read/write on closed pipe"
[filtered-alerts] → Verifying results...
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
[filtered-alerts] ✓ Task passed

[get-alerts] Starting (parallel, easy)
[get-alerts] → Running agent...
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:43 INFO connection closed cause="io: read/write on closed pipe"
[diagnose-cluster-health] → Verifying results...
2026/04/23 12:34:53 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:53 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:57 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:34:57 INFO connection closed cause="io: read/write on closed pipe"
[visualize-cpu-usage] ✓ Task passed

[alert-investigation] Starting (parallel, medium)
[alert-investigation] → Running agent...
2026/04/23 12:35:05 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:05 INFO connection closed cause="io: read/write on closed pipe"
[get-alerts] → Verifying results...
2026/04/23 12:35:14 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:14 INFO connection closed cause="io: read/write on closed pipe"
[time-range-query] ✓ Task passed

[nonexistent-metric] Starting (parallel, easy)
[nonexistent-metric] → Running agent...
2026/04/23 12:35:16 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:16 INFO connection closed cause="io: read/write on closed pipe"
[get-alerts] ✓ Task passed

[memory-usage] Starting (parallel, medium)
[memory-usage] → Running agent...
2026/04/23 12:35:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:29 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-metric] → Verifying results...
2026/04/23 12:35:48 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:35:48 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-metric] ✓ Task passed

[label-values] Starting (parallel, medium)
[label-values] → Running agent...
2026/04/23 12:36:06 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:06 INFO connection closed cause="io: read/write on closed pipe"
[label-values] → Verifying results...
2026/04/23 12:36:13 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:13 INFO connection closed cause="io: read/write on closed pipe"
[alert-investigation] → Verifying results...
2026/04/23 12:36:17 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:17 INFO connection closed cause="io: read/write on closed pipe"
[memory-usage] → Verifying results...
2026/04/23 12:36:22 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:22 INFO connection closed cause="io: read/write on closed pipe"
[label-values] ✓ Task passed

[get-series-cardinality] Starting (parallel, medium)
[get-series-cardinality] → Running agent...
2026/04/23 12:36:26 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:26 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:37 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:37 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:44 INFO connection closed cause="io: read/write on closed pipe"
[get-series-cardinality] → Verifying results...
2026/04/23 12:36:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:36:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:01 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:01 INFO connection closed cause="io: read/write on closed pipe"
[memory-usage] ✓ Task passed

[series-by-namespace] Starting (parallel, medium)
[series-by-namespace] → Running agent...
2026/04/23 12:37:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:21 INFO connection closed cause="io: read/write on closed pipe"
[series-by-namespace] → Verifying results...
2026/04/23 12:37:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:29 INFO connection closed cause="io: read/write on closed pipe"
[alert-investigation] ✓ Task passed

[pending-pods] Starting (parallel, medium)
[pending-pods] → Running agent...
2026/04/23 12:37:39 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:37:39 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:38:27 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:38:27 INFO connection closed cause="io: read/write on closed pipe"
[pending-pods] → Verifying results...
2026/04/23 12:38:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:38:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:07 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:07 INFO connection closed cause="io: read/write on closed pipe"
[get-series-cardinality] ✓ Task passed

[cpu-usage] Starting (parallel, medium)
[cpu-usage] → Running agent...
2026/04/23 12:39:10 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:10 INFO connection closed cause="io: read/write on closed pipe"
[pending-pods] ✓ Task passed

[nonexistent-namespace] Starting (parallel, easy)
[nonexistent-namespace] → Running agent...
2026/04/23 12:39:44 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:39:44 INFO connection closed cause="io: read/write on closed pipe"
[cpu-usage] → Verifying results...
2026/04/23 12:40:09 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:09 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-namespace] → Verifying results...
2026/04/23 12:40:28 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:28 INFO connection closed cause="io: read/write on closed pipe"
[nonexistent-namespace] ✓ Task passed

[get-silences] Starting (parallel, easy)
[get-silences] → Running agent...
2026/04/23 12:40:32 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:32 INFO connection closed cause="io: read/write on closed pipe"
[diagnose-cluster-health] ✗ Task failed
[diagnose-cluster-health]   Error: one or more verification steps failed

[namespace-resource-usage] Starting (parallel, hard)
[namespace-resource-usage] → Running agent...
2026/04/23 12:40:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:40:43 INFO connection closed cause="io: read/write on closed pipe"
[get-silences] → Verifying results...
2026/04/23 12:41:05 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:05 INFO connection closed cause="io: read/write on closed pipe"
[get-silences] ✓ Task passed

[prometheus-requests] Starting (parallel, medium)
[prometheus-requests] → Running agent...
2026/04/23 12:41:10 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:10 INFO connection closed cause="io: read/write on closed pipe"
[series-by-namespace] ✗ Task failed
[series-by-namespace]   Error: one or more verification steps failed

[network-traffic] Starting (parallel, medium)
[network-traffic] → Running agent...
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
[network-traffic] → Verifying results...
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:49 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-requests] → Verifying results...
2026/04/23 12:41:54 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:41:54 INFO connection closed cause="io: read/write on closed pipe"
[namespace-resource-usage] → Verifying results...
2026/04/23 12:42:17 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:17 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:18 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:18 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-requests] ✓ Task passed

[prometheus-head-series] Starting (parallel, easy)
[prometheus-head-series] → Running agent...
2026/04/23 12:42:40 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:40 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-head-series] → Verifying results...
2026/04/23 12:42:41 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:41 INFO connection closed cause="io: read/write on closed pipe"
[cpu-usage] ✓ Task passed

[backend-reachability] Starting (parallel, easy)
[backend-reachability] → Running agent...
2026/04/23 12:42:42 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:42 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:58 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:58 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-head-series] ✗ Task failed
[prometheus-head-series]   Error: one or more verification steps failed

[prometheus-wal-size] Starting (parallel, easy)
[prometheus-wal-size] → Running agent...
2026/04/23 12:42:59 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:42:59 INFO connection closed cause="io: read/write on closed pipe"
[backend-reachability] → Verifying results...
2026/04/23 12:43:01 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:01 INFO connection closed cause="io: read/write on closed pipe"
[network-traffic] ✓ Task passed

[list-node-metrics] Starting (parallel, easy)
[list-node-metrics] → Running agent...
2026/04/23 12:43:29 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:29 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-wal-size] → Verifying results...
2026/04/23 12:43:31 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:31 INFO connection closed cause="io: read/write on closed pipe"
[list-node-metrics] → Verifying results...
2026/04/23 12:43:33 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:33 INFO connection closed cause="io: read/write on closed pipe"
[backend-reachability] ✓ Task passed

[pods-created] Starting (parallel, medium)
[pods-created] → Running agent...
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:43 INFO connection closed cause="io: read/write on closed pipe"
[prometheus-wal-size] ✓ Task passed

[list-kube-metrics] Starting (parallel, easy)
[list-kube-metrics] → Running agent...
2026/04/23 12:43:49 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:43:49 INFO connection closed cause="io: read/write on closed pipe"
[list-node-metrics] ✓ Task passed
2026/04/23 12:44:27 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:44:27 INFO connection closed cause="io: read/write on closed pipe"
[pods-created] → Verifying results...
2026/04/23 12:44:50 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:44:50 INFO connection closed cause="io: read/write on closed pipe"
[list-kube-metrics] → Verifying results...
2026/04/23 12:45:21 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:45:21 INFO connection closed cause="io: read/write on closed pipe"
[list-kube-metrics] ✓ Task passed
2026/04/23 12:45:45 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:45:45 INFO connection closed cause="io: read/write on closed pipe"
[namespace-resource-usage] ✗ Task failed
[namespace-resource-usage]   Error: one or more verification steps failed
2026/04/23 12:46:12 INFO connection closed cause="io: read/write on closed pipe"
2026/04/23 12:46:12 INFO connection closed cause="io: read/write on closed pipe"
[pods-created] ~ Task passed but assertions failed

=== Evaluation Complete ===

📄 Results saved to: mcpchecker-obs-mcp-tools-out.json

=== Results Summary ===

Task: crashlooping-pods
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/crashlooping-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: high-cardinality-rejection
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/high-cardinality-rejection.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (3/3)

Task: label-names
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-names.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: namespace-pod-count
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-pod-count.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: filtered-alerts
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/filtered-alerts.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: visualize-cpu-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/visualize-cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: time-range-query
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/time-range-query.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-alerts
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-alerts.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: nonexistent-metric
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-metric.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: label-values
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-values.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: memory-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/memory-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: alert-investigation
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/alert-investigation.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: get-series-cardinality
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/get-series.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: pending-pods
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pending-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: nonexistent-namespace
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-namespace.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: diagnose-cluster-health
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/diagnose-cluster-health.yaml
  Difficulty: hard
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (3/3)

Task: get-silences
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-silences.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: series-by-namespace
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/series-by-namespace.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: prometheus-requests
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-requests.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: cpu-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: prometheus-head-series
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-head-series.yaml
  Difficulty: easy
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: network-traffic
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/network-traffic.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: backend-reachability
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/backend-reachability.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: prometheus-wal-size
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-wal-size.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: list-node-metrics
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-node-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: list-kube-metrics
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: namespace-resource-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-resource-usage.yaml
  Difficulty: hard
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: pods-created
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pods-created.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: FAILED (2/4)
    - ToolsUsed: Required tool not called: server=obs, tool=execute_range_query, pattern=
    - CallOrder: Expected call order not satisfied. Got to 1/2

=== Overall Statistics ===
Total Tasks: 28
Tasks Passed: 22/28
Assertions Passed: 100/102

Tasks where verification failed but assertions passed: 6
  Assertions in these tasks: 22/22
Tokens:     ~747266 (estimate - excludes system prompt & cache)
MCP schemas: ~83216 (included in token total)

=== Statistics by Difficulty ===

easy:
  Tasks: 9/10
  Assertions: 34/34

medium:
  Tasks: 13/16
  Assertions: 59/61

hard:
  Tasks: 0/2
  Assertions: 7/7
⏱️  Completed in 14m35s

22 out of 28 — 78.6% pass rate.

If you exclude high-cardinality-rejection (expected fail without guardrails), it's 22 out of 27 — 81.5%.

slashpai marked this pull request as ready for review on April 23, 2026 08:01
slashpai requested a review from a team on April 23, 2026 08:01
openshift-ci bot requested review from iNecas and rexagod on April 23, 2026 08:01
@slashpai (Member Author)

@nader-ziada Could you please take a look at this? This change is to address openshift/openshift-mcp-server#241 (comment)

@slashpai (Member Author)

/hold for review from mcp team as well

prompt:
inline: |
Are there any firing alerts with severity=critical? Show only active alerts.
Show me only the active alerts with severity=warning.
Contributor

Should we have the severity=warning here? 🤔

Member Author

I was trying to be specific here, since we know there are two warning alerts that always fire in a default cluster: Watchdog and AlertmanagerReceiversNotConfigured.

openshift-ci Bot commented Apr 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iNecas, slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slashpai (Member Author)

/hold cancel

slashpai merged commit 0533989 into rhobs:main on Apr 23, 2026
6 of 7 checks passed
