GIE-505: add observability toolset evals #241
**alert-investigation** (new file, 28 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "alert-investigation"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: multi-step
  description: |
    Tests if the agent can perform multi-step alert triage: first retrieving
    alerts from Alertmanager, then investigating related metrics for a
    firing alert using queries.
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent retrieved firing alerts and identified AlertmanagerReceiversNotConfigured"
    - llmJudge:
        contains: "alertmanager"
        reason: "Verify the agent investigated Alertmanager-related metrics for the alert"
  prompt:
    inline: |
      Check if there are any firing alerts. If there are, investigate
      the related metrics for the most critical alert and summarize
      what's happening.
```
**filtered-alerts** (new file, 24 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "filtered-alerts"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: alertmanager
  description: |
    Tests if the agent uses the get_alerts tool with filter parameters
    to retrieve only active alerts matching a specific severity. The agent
    should pass appropriate filter arguments rather than fetching all alerts
    and filtering client-side.
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent filtered for warning-severity alerts and found AlertmanagerReceiversNotConfigured"
  prompt:
    inline: |
      Show me only the active alerts with severity=warning.
```
**Review comment (filtered-alerts, lines +17 to +24)**

🧩 Analysis chain: scripts executed against openshift/openshift-mcp-server:

```bash
#!/bin/bash
# Check whether the eval framework supports asserting tool-call arguments (not just tool patterns/counts).
rg -nP -C3 '(toolPattern|minToolCalls|maxToolCalls|toolArgs|arguments)' --type=yaml -g 'evals/**'
rg -nP -C3 '(ToolPattern|ToolArgs|Arguments|MinToolCalls)' -g '!vendor/**' -g '!**/testdata/**'

cat -n evals/tasks/observability/alerts/filtered-alerts.yaml

# Search for task.yaml files with tool argument assertions to understand the proper structure
find evals/tasks -name "*.yaml" -type f | head -20 | xargs -I {} sh -c 'echo "=== {} ===" && grep -A 10 "toolPattern\|args:" {} | head -15'

# Look at the netedge example that showed an args field
cat -n evals/tasks/netedge/query_prometheus/task.yaml

# Search for other observability tasks to understand the pattern
find evals/tasks/observability -name "*.yaml" -type f | xargs cat
```

**Use a tool-call argument assertion to verify the filtering.** The current `verify` section only checks the response text; it cannot confirm that the agent actually passed filter arguments to `get_alerts` rather than filtering client-side. The framework supports asserting on tool calls and their arguments:

```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_alerts"
      args:
        severity: "warning"
```
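Applied to this task, the tightened spec might look like the sketch below; placing `assertions` alongside `verify` is an assumption drawn from the netedge example above, not a confirmed schema:

```yaml
spec:
  verify:
    - llmJudge:
        contains: "AlertmanagerReceiversNotConfigured"
        reason: "Verify the agent filtered for warning-severity alerts and found AlertmanagerReceiversNotConfigured"
  assertions:
    toolsUsed:
      - server: kubernetes
        toolPattern: "get_alerts"
        args:
          severity: "warning"  # assumed argument shape; confirm against the get_alerts tool schema
  prompt:
    inline: |
      Show me only the active alerts with severity=warning.
```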
**get-alerts** (new file, 22 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "get-alerts"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: alertmanager
  description: |
    Tests if the agent can discover and use the get_alerts tool to retrieve
    currently firing alerts from Alertmanager.
spec:
  verify:
    - llmJudge:
        contains: "Watchdog"
        reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
  prompt:
    inline: |
      Check the cluster for any firing alerts and report what you find.
```
**Review comment (get-alerts, lines +17 to +19)**

Requiring a specific alert name can fail valid runs on clusters where `Watchdog` is absent or renamed. Suggested adjustment:

```diff
 verify:
   - llmJudge:
-      contains: "Watchdog"
-      reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
+      contains: "alertname"
+      reason: "Verify the response includes concrete alert fields from retrieved firing alerts"
+  - llmJudge:
+      contains: "firing"
+      reason: "Verify the response reports firing-state alerts from Alertmanager output"
```
**get-silences** (new file, 22 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "get-silences"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: alerts
    suite: observability
    toolType: alertmanager
  description: |
    Tests if the agent can discover and use the get_silences tool to retrieve
    active silences from Alertmanager.
spec:
  verify:
    - llmJudge:
        contains: "silences"
        reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
  prompt:
    inline: |
      Are there any active silences in Alertmanager?
```
**Review comment (get-silences, lines +17 to +19)**

**Judge token is prompt-leaked and allows false positives.** Suggested adjustment:

```diff
 verify:
   - llmJudge:
-      contains: "silences"
-      reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
+      contains: "matchers"
+      reason: "Verify the response includes concrete silence details when active silences exist"
+  - llmJudge:
+      contains: "no active silences"
+      reason: "Allow explicit empty-state reporting when no silences are present"
```
**get-series-cardinality** (new file, 26 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "get-series-cardinality"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests if the agent can use the get_series tool to check cardinality for a metric.
    The agent should first verify the metric exists via list_metrics, then use
    get_series to retrieve matching time series and report the count.
spec:
  verify:
    - llmJudge:
        contains: "namespace"
        reason: "Verify the agent retrieved actual series data containing label dimensions like namespace"
    - llmJudge:
        contains: "kube_pod_info"
        reason: "Verify the agent queried the kube_pod_info metric and reported its cardinality"
  prompt:
    inline: |
      How many time series exist for the kube_pod_info metric? Show the count and list the label names present.
```
**Review comment (get-series-cardinality, lines +17 to +26)**

**Both judge tokens are weak: one is in the prompt, the other is generic.** `kube_pod_info` appears verbatim in the prompt, and `namespace` can occur in almost any answer. Suggest requiring evidence only a real `get_series` call can produce.
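One possible tightening, sketched under the assumption that `kube_pod_info` carries its usual kube-state-metrics labels (`uid`, `node`, `host_ip`), which show up in real series output but never in the prompt:

```yaml
verify:
  - llmJudge:
      contains: "uid"
      reason: "The uid label appears in kube_pod_info series data but not in the prompt, so it evidences a real get_series result"
  - llmJudge:
      contains: "kube_pod_info"
      reason: "Verify the agent queried the kube_pod_info metric and reported its cardinality"
```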
**label-names** (new file, 26 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "label-names"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests if the agent follows the correct workflow: first calling list_metrics to
    verify kube_pod_info exists, then calling get_label_names to discover available
    labels for that metric.
spec:
  verify:
    - llmJudge:
        contains: "namespace"
        reason: "Verify the output includes the namespace label which is a standard Kubernetes label"
    - llmJudge:
        contains: "pod"
        reason: "Verify the output includes the pod label"
  prompt:
    inline: |
      What labels are available for the kube_pod_info metric?
```
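The description mandates a two-call workflow (`list_metrics`, then `get_label_names`), but the judges only inspect the final text. If the `toolsUsed` assertion from the netedge example applies here, the workflow itself could be asserted; a sketch, assuming multiple `toolsUsed` entries are supported by the framework:

```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "list_metrics"
    - server: kubernetes
      toolPattern: "get_label_names"
```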
**label-values** (new file, 22 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "label-values"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests the full discovery workflow: list_metrics to verify the metric, then
    get_label_values to retrieve unique namespace values for kube_pod_info.
spec:
  verify:
    - llmJudge:
        contains: "kube-system"
        reason: "Verify the output lists actual namespace values from the cluster such as kube-system"
  prompt:
    inline: |
      What are the unique namespace values for the kube_pod_info metric?
```
**series-by-namespace** (new file, 28 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "series-by-namespace"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: labels
    suite: observability
    toolType: exploration
  description: |
    Tests if the agent can use the get_series tool with a label selector
    to find time series scoped to a specific namespace. The agent should
    first verify the metric exists, then use get_series with a namespace
    matcher to report the cardinality within that scope.
spec:
  verify:
    - llmJudge:
        contains: "pod"
        reason: "Verify the agent retrieved actual series data containing label dimensions like pod"
    - llmJudge:
        contains: "container"
        reason: "Verify the agent reported series with container label values from the namespace"
  prompt:
    inline: |
      How many time series exist for container_cpu_usage_seconds_total
      in the openshift-monitoring namespace?
```
**Review comment (series-by-namespace, lines +19 to +28)**

Both judge tokens (`pod`, `container`) are generic enough to match a restated prompt or a wrong answer. Recommend asserting on evidence only obtainable from a real `get_series` call.
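A sketch of a stricter check, reusing the `toolsUsed`/`args` shape from the netedge example; the `match` argument name is illustrative and should be confirmed against the `get_series` tool schema:

```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      toolPattern: "get_series"
      args:
        # hypothetical argument name; verify against the actual get_series schema
        match: 'container_cpu_usage_seconds_total{namespace="openshift-monitoring"}'
```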
**list-kube-metrics** (new file, 22 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "list-kube-metrics"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: metrics
    suite: observability
    toolType: discovery
  description: |
    Tests if the agent can discover the list_metrics tool to find Kubernetes-related
    metrics. The agent should use the name_regex parameter to filter for kube metrics.
spec:
  verify:
    - llmJudge:
        contains: "kube_pod_info"
        reason: "Verify the output lists specific kube metrics discovered from Prometheus"
  prompt:
    inline: |
      List all available Prometheus metrics that contain 'kube' in the name.
```
**Review comment (list-kube-metrics, lines +12 to +14)**

🧩 Analysis chain: scripts executed against openshift/openshift-mcp-server:

```bash
#!/bin/bash
rg -nP -C2 '"list_metrics"|name_regex' pkg/
rg -nP -C3 'initPrometheus|initAlertmanager' pkg/toolsets/observability/

cat -n evals/tasks/observability/metrics/list-metrics.yaml | head -25

rg -nP 'label.?values|__name__|ListMetrics' pkg/toolsets/observability/ pkg/prometheus/

sed -n '23,60p' pkg/toolsets/observability/prometheus.go
sed -n '63,120p' pkg/toolsets/observability/prometheus.go
```

**Description references a `name_regex` parameter; confirm the registered tool actually exposes it.** The registered tools are only those defined in `pkg/toolsets/observability/prometheus.go`; if `list_metrics` takes no `name_regex` argument, reword the description to name the tool's real filtering mechanism.
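If the mismatch is confirmed, one possible rewording, a sketch that avoids committing to any parameter name until the tool schema is checked:

```yaml
description: |
  Tests if the agent can discover the list_metrics tool to find Kubernetes-related
  metrics, narrowing the results to metric names containing "kube".
```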
**list-node-metrics** (new file, 22 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "list-node-metrics"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: metrics
    suite: observability
    toolType: discovery
  description: |
    Tests if the agent can discover node-related metrics using the list_metrics tool
    with a regex filter for node metrics.
spec:
  verify:
    - llmJudge:
        contains: "node_"
        reason: "Verify the output lists specific node metrics discovered from Prometheus (e.g. node_cpu_seconds_total)"
  prompt:
    inline: |
      What node-related metrics are available in Prometheus?
```
**Review comment (list-node-metrics, lines +17 to +19)**

**Weak judge token.** The substring `node_` is satisfied by almost any answer that names or paraphrases a node metric, so it provides little evidence of a real `list_metrics` call.
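A tighter check might require the concrete metric the reason field already cites; a sketch, assuming the cluster exposes the standard node exporter metrics:

```yaml
verify:
  - llmJudge:
      contains: "node_cpu_seconds_total"
      reason: "Verify the output names a concrete node exporter metric discovered from Prometheus"
```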
**backend-reachability** (new file, 23 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "backend-reachability"
  difficulty: easy
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: smoke-test
  description: |
    Smoke test to verify the agent can reach the metrics backend and get a
    valid response. The agent should query the up metric and confirm the
    backend is reachable before any complex eval tasks run.
spec:
  verify:
    - llmJudge:
        contains: "prometheus-k8s"
        reason: "Verify the agent queried the up metric and reported on the prometheus-k8s target"
  prompt:
    inline: |
      Is the metrics backend reachable? Query up{job="prometheus-k8s"} and report the status.
```
**Review comment (backend-reachability, lines +18 to +23)**

**False-positive risk: the judge token is present verbatim in the prompt.** The prompt contains `prometheus-k8s`, so a response that merely echoes the question passes. Consider validating on something that can only be known from a successful tool call — for example, the numeric result of the `up` query.
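One possible tightening, sketched with the suite's tool-call assertion; the tool name here is a placeholder and must be checked against the registered Prometheus toolset:

```yaml
assertions:
  toolsUsed:
    - server: kubernetes
      # "execute_query" is hypothetical; substitute the actual instant-query tool name
      toolPattern: "execute_query"
      args:
        query: 'up{job="prometheus-k8s"}'
```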
**cpu-usage** (new file, 25 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "cpu-usage"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: query
  description: |
    Tests if the agent can find the appropriate CPU usage metric and query it to
    determine which pods are consuming the most CPU resources.
spec:
  verify:
    - llmJudge:
        contains: "container_cpu_usage_seconds_total"
        reason: "Verify the agent queried the correct CPU metric"
    - llmJudge:
        contains: "pod"
        reason: "Verify the response identifies specific pods"
  prompt:
    inline: |
      Which pods are using the most CPU?
```
**Review comment (cpu-usage, lines +20 to +22)**

**False-positive risk.** This is exactly the concern raised by reviewers on the PR: the prompt is "Which pods are using the most CPU?", so any response that merely restates or references the prompt (including a refusal or a wrong answer) will contain the substring `pod`. Consider replacing this with a stricter check — e.g., require a concrete pod-identifying field/label such as `pod=`. Suggested diff:

```diff
 - llmJudge:
-    contains: "pod"
-    reason: "Verify the response identifies specific pods"
+    contains: "pod="
+    reason: "Verify the response identifies specific pods by label (e.g., pod=\"<name>\")"
```
**crashlooping-pods** (new file, 22 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "crashlooping-pods"
  difficulty: medium
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: range-query
  description: |
    Tests if the agent can identify crashlooping pods by discovering and querying
    the kube_pod_container_status_restarts_total metric over a time range.
spec:
  verify:
    - llmJudge:
        contains: "kube_pod_container_status_restarts_total"
        reason: "Verify the agent used the correct metric for container restart counts"
  prompt:
    inline: |
      Which pods were crashlooping in the last 5 minutes?
```
**diagnose-cluster-health** (new file, 28 lines)

```yaml
kind: Task
apiVersion: mcpchecker/v1alpha2
metadata:
  name: "diagnose-cluster-health"
  difficulty: hard
  parallel: true
  runs: 1
  labels:
    category: queries
    suite: observability
    toolType: diagnostic
  description: |
    Tests the agent's ability to handle an ambiguous diagnostic prompt.
    The agent must autonomously decide which tools and metrics to check,
    ideally covering alerts, node status, and pod health without being
    told exactly what to look for.
spec:
  verify:
    - llmJudge:
        contains: "Watchdog"
        reason: "Verify the agent checked alerts and reported specific alert names such as Watchdog"
    - llmJudge:
        contains: "kube_"
        reason: "Verify the agent queried Kubernetes metrics (e.g. kube_node_status_condition) for health assessment"
  prompt:
    inline: |
      Is the cluster healthy? Give me an overview of any issues.
      Check alerts, node status, and pod health.
```
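The `Watchdog` and `kube_` tokens here carry the same weak-substring risks flagged on other tasks in this review; a tighter second judge could require the concrete metric its reason field already cites. A sketch:

```yaml
- llmJudge:
    contains: "kube_node_status_condition"
    reason: "Verify the agent queried a concrete Kubernetes health metric for the assessment"
```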
**Review comment (alert-investigation, second judge)**

**Second judge `contains: "alertmanager"` is redundant and weak.** The substring `alertmanager` is already contained in `AlertmanagerReceiversNotConfigured` (checked above) and appears naturally in any narrative about Alertmanager, so it doesn't independently verify that the agent investigated related metrics. Consider asserting on a concrete Alertmanager metric name the agent should have queried (e.g., `alertmanager_notifications_failed_total`, `alertmanager_config_last_reload_successful`) to actually validate the "investigate related metrics" step.