Merged
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@ CONTAINER_CLI ?= docker
 IMAGE ?= ghcr.io/rhobs/obs-mcp
 TAG ?= $(shell git rev-parse --short HEAD)
 TOOLS_DIR := hack/tools
-MCPCHECKER_VERSION ?= 0.0.15
+MCPCHECKER_VERSION ?= 0.0.16

 ROOT_DIR := $(shell pwd)
 TOOLS_BIN_DIR := $(ROOT_DIR)/tmp/bin
2 changes: 1 addition & 1 deletion evals/mcpchecker/README.md
@@ -4,7 +4,7 @@ Evaluations for obs-mcp using [mcpchecker](https://github.com/mcpchecker/mcpchec

 ## Pre-requisites

-- [mcpchecker](https://github.com/mcpchecker/mcpchecker#install) installed (v0.0.15+) — run `make install-mcpchecker` from the repo root
+- [mcpchecker](https://github.com/mcpchecker/mcpchecker#install) installed (v0.0.16+) — run `make install-mcpchecker` from the repo root
 - A Kubernetes/OpenShift cluster with Prometheus and Alertmanager running
 - obs-mcp server deployed and accessible (see [Backend Setup](#backend-setup))
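The `v0.0.16+` prerequisite is a minimum-version constraint. As a minimal sketch of how that constraint could be checked in a script, the helper below compares dotted version strings numerically. How mcpchecker reports its installed version is not shown in this PR, so the input strings and both function names are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v0.0.16' or '0.0.16' into (0, 0, 16) for numeric comparison."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def meets_minimum(installed: str, minimum: str = "0.0.16") -> bool:
    # Compare as integer tuples so that 0.0.9 sorts below 0.0.16;
    # a plain string comparison would get that case wrong.
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("0.0.15", "v0.0.16", "0.1.0"):
        print(candidate, meets_minimum(candidate))
```

The default minimum mirrors the `MCPCHECKER_VERSION` bump in the Makefile; pre-release suffixes such as `-rc1` are not handled by this sketch.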
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/alerts/alert-investigation.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: alerts
+suite: observability
 toolType: multi-step
 description: |
 Tests if the agent can perform multi-step alert triage: first retrieving
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/alerts/filtered-alerts.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: alerts
+suite: observability
 toolType: alertmanager
 description: |
 Tests if the agent uses the get_alerts tool with filter parameters
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/alerts/get-alerts.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: alerts
+suite: observability
 toolType: alertmanager
 description: |
 Tests if the agent can discover and use the get_alerts tool to retrieve
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/alerts/get-silences.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: alerts
+suite: observability
 toolType: alertmanager
 description: |
 Tests if the agent can discover and use the get_silences tool to retrieve
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/labels/get-series.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: labels
+suite: observability
 toolType: exploration
 description: |
 Tests if the agent can use the get_series tool to check cardinality for a metric.
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/labels/label-names.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: labels
+suite: observability
 toolType: exploration
 description: |
 Tests if the agent follows the correct workflow: first calling list_metrics to
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/labels/label-values.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: labels
+suite: observability
 toolType: exploration
 description: |
 Tests the full discovery workflow: list_metrics to verify the metric, then
4 changes: 3 additions & 1 deletion evals/mcpchecker/tasks/labels/series-by-namespace.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: labels
+suite: observability
 toolType: exploration
 description: |
 Tests if the agent can use the get_series tool with a label selector
@@ -24,4 +25,5 @@ spec:
 prompt:
 inline: |
 How many time series exist for container_cpu_usage_seconds_total
-in the openshift-monitoring namespace?
+in the openshift-monitoring namespace? List the label dimensions
+(such as pod, container) present in the series.
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/metrics/list-metrics.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: metrics
+suite: observability
 toolType: discovery
 description: |
 Tests if the agent can discover the list_metrics tool to find Kubernetes-related
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/metrics/list-node-metrics.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: metrics
+suite: observability
 toolType: discovery
 description: |
 Tests if the agent can discover node-related metrics using the list_metrics tool
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/queries/backend-reachability.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: smoke-test
 description: |
 Smoke test to verify the agent can reach the metrics backend and get a
4 changes: 3 additions & 1 deletion evals/mcpchecker/tasks/queries/cpu-usage.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: query
 description: |
 Tests if the agent can find the appropriate CPU usage metric and query it to
@@ -21,4 +22,5 @@ spec:
 reason: "Verify the response identifies specific pods"
 prompt:
 inline: |
-Which pods are using the most CPU?
+Which pods are using the most CPU? Show me the top 5.
+Use Prometheus metrics and include the metric name and PromQL query you used.
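The revised prompt asks for a top-5 ranking plus the PromQL behind it. One plausible shape for such a query is sketched below as a small string builder. The metric `container_cpu_usage_seconds_total` is borrowed from the series-by-namespace task; the 5m rate window, the grouping by `namespace` and `pod`, and the function name are assumptions, and nothing in this diff says the task's assertions expect this exact expression.

```python
def top_pods_by_cpu(limit: int = 5, window: str = "5m") -> str:
    """Build a PromQL expression that ranks pods by CPU usage rate."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # rate() converts the cumulative CPU-seconds counter into a per-second rate;
    # sum by (namespace, pod) folds the per-container series into one per pod.
    inner = (
        "sum by (namespace, pod) "
        f"(rate(container_cpu_usage_seconds_total[{window}]))"
    )
    # topk() keeps only the `limit` series with the highest values.
    return f"topk({limit}, {inner})"


if __name__ == "__main__":
    print(top_pods_by_cpu())
```

An agent could reach the same answer with a different metric or grouping, which is presumably why the prompt asks it to report the metric name and query it actually used.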
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/crashlooping-pods.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: range-query
 description: |
 Tests if the agent can identify crashlooping pods by discovering and querying
@@ -19,3 +20,4 @@ spec:
 prompt:
 inline: |
 Which pods were crashlooping in the last 5 minutes?
+Use Prometheus metrics and include the metric name and PromQL query you used.
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: diagnostic
 description: |
 Tests the agent's ability to handle an ambiguous diagnostic prompt.
@@ -24,4 +25,5 @@ spec:
 prompt:
 inline: |
 Is the cluster healthy? Give me an overview of any issues.
-Check alerts, node status, and pod health.
+Check alerts, node status, and pod health using Prometheus metrics.
+Include the metric names and PromQL queries you used.
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: error-handling
 description: |
 Tests agent behavior when a query is rejected by obs-mcp guardrails
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/memory-usage.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: query
 description: |
 Tests if the agent can find the appropriate memory usage metric and query it
@@ -22,3 +23,4 @@ spec:
 prompt:
 inline: |
 Which pods are using the most memory? Show me the top 5.
+Use Prometheus metrics and include the metric name and PromQL query you used.
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/namespace-pod-count.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: multi-step
 description: |
 Tests multi-step reasoning: the agent must discover a suitable metric
@@ -24,3 +25,4 @@ spec:
 prompt:
 inline: |
 Which namespaces have the most running pods? Show me the top 5.
+Use Prometheus metrics and include the metric name and PromQL query you used.
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: multi-step
 description: |
 Tests multi-step reasoning: the agent must discover CPU and memory metrics,
@@ -25,3 +26,4 @@ spec:
 inline: |
 Which namespace is consuming the most CPU and memory?
 Show me the top namespace for each.
+Use Prometheus metrics and include the metric names and PromQL queries you used.
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/network-traffic.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: query
 description: |
 Tests if the agent can discover network-related metrics and query them to find
@@ -22,3 +23,4 @@ spec:
 prompt:
 inline: |
 Which pods are receiving the most network traffic?
+Use Prometheus metrics and include the metric name and PromQL query you used.
1 change: 1 addition & 0 deletions evals/mcpchecker/tasks/queries/nonexistent-metric.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: error-handling
 description: |
 Tests agent recovery when querying a metric that does not exist.
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: error-handling
 description: |
 Tests agent behavior when querying for resources in a namespace that
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/pending-pods.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: query
 description: |
 Tests if the agent can identify pods stuck in pending state by first discovering
@@ -20,3 +21,4 @@ spec:
 prompt:
 inline: |
 Which pods are stuck in pending state?
+Use Prometheus metrics and include the metric name and PromQL query you used.
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/pods-created.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: range-query
 description: |
 Tests if the agent can use a range query to find recently created pods by
@@ -19,3 +20,4 @@ spec:
 prompt:
 inline: |
 How many pods were created in the last 5 minutes?
+Include the Prometheus metric name and PromQL query you used.
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/prometheus-head-series.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: query
 description: |
 Tests if the agent can query Prometheus internal metrics to report the current
@@ -19,3 +20,4 @@ spec:
 prompt:
 inline: |
 How many head series does Prometheus have?
+Include the Prometheus metric name and PromQL query you used.
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/prometheus-requests.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: query
 description: |
 Tests if the agent can calculate the request rate to Prometheus by discovering
@@ -19,3 +20,4 @@ spec:
 prompt:
 inline: |
 How many requests per second are being made to Prometheus?
+Include the Prometheus metric name and PromQL query you used.
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/prometheus-wal-size.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: query
 description: |
 Tests if the agent can query the current Prometheus WAL storage size using
@@ -19,3 +20,4 @@ spec:
 prompt:
 inline: |
 What is the current storage size of the Prometheus WAL?
+Include the Prometheus metric name and PromQL query you used.
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/time-range-query.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: multi-step
 description: |
 Tests whether the agent correctly uses execute_range_query with
@@ -23,3 +24,4 @@ spec:
 prompt:
 inline: |
 Show me the CPU usage trend for pods in the default namespace over the last 30 minutes.
+Include the Prometheus metric name and PromQL query you used.
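A "last 30 minutes" trend corresponds to a range query with an explicit start, end, and step. The sketch below computes those values using the parameter names of the Prometheus HTTP API `query_range` endpoint (`query`, `start`, `end`, `step`). How obs-mcp's `execute_range_query` tool names or accepts its arguments is not shown in this diff, so the function name and the 60-second default step are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def range_query_params(
    query: str,
    minutes: int = 30,
    step_seconds: int = 60,
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Return query_range-style parameters covering the last `minutes` minutes."""
    # `now` is injectable so the window can be computed deterministically.
    end = now if now is not None else datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        "query": query,
        # Unix timestamps in whole seconds.
        "start": str(int(start.timestamp())),
        "end": str(int(end.timestamp())),
        # Resolution step expressed as a duration string.
        "step": f"{step_seconds}s",
    }


if __name__ == "__main__":
    print(range_query_params("up"))
```

With the defaults, the window spans 1800 seconds at a 60-second step, which gives at most 31 samples per series.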
2 changes: 2 additions & 0 deletions evals/mcpchecker/tasks/queries/visualize-cpu-usage.yaml
@@ -7,6 +7,7 @@ metadata:
 runs: 1
 labels:
 category: queries
+suite: observability
 toolType: visualization
 description: |
 Tests if the agent uses the show_timeseries tool to visualize CPU usage
@@ -20,3 +21,4 @@ spec:
 prompt:
 inline: |
 Visualize the CPU usage for pods in the default namespace over the last 30 minutes.
+Include the Prometheus metric name and PromQL query you used.
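Every task in this PR gains a `suite: observability` label next to `category` and `toolType`, which makes it possible to select the whole set by a single label. The sketch below shows that kind of label matching over already-parsed task documents. It is not mcpchecker's own selection logic: `select_tasks`, the sample task names, and the nesting of `labels` under `metadata` (inferred from the hunk context) are all assumptions.

```python
# Hypothetical, already-parsed task documents; only the fields used here are shown.
TASKS = [
    {"metadata": {"name": "get-alerts",
                  "labels": {"category": "alerts", "suite": "observability"}}},
    {"metadata": {"name": "cpu-usage",
                  "labels": {"category": "queries", "suite": "observability"}}},
    {"metadata": {"name": "unlabelled-example",
                  "labels": {"category": "queries"}}},
]


def select_tasks(tasks, **wanted):
    """Return the tasks whose labels match every requested key/value pair."""
    selected = []
    for task in tasks:
        labels = task.get("metadata", {}).get("labels") or {}
        # A task is kept only if all requested labels are present and equal.
        if all(labels.get(key) == value for key, value in wanted.items()):
            selected.append(task)
    return selected


if __name__ == "__main__":
    for task in select_tasks(TASKS, suite="observability", category="queries"):
        print(task["metadata"]["name"])
```

Because matching is a conjunction, `suite` can be combined with the existing `category` or `toolType` labels to narrow a run without renaming or moving any task files.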