Merged
15 changes: 7 additions & 8 deletions evals/mcpchecker/tasks/alerts/alert-investigation.yaml
@@ -10,19 +10,18 @@ metadata:
toolType: multi-step
description: |
Tests if the agent can perform multi-step alert triage: first retrieving
- alerts from Alertmanager, then investigating related metrics for the most
- critical alert using Prometheus queries.
+ alerts from Alertmanager, then investigating related metrics for a
+ firing alert using queries.
spec:
verify:
- llmJudge:
- contains: "alert"
- reason: "Verify the agent retrieved alerts from Alertmanager"
+ contains: "AlertmanagerReceiversNotConfigured"
+ reason: "Verify the agent retrieved firing alerts and identified AlertmanagerReceiversNotConfigured"
- llmJudge:
- contains: "metric"
- reason: "Verify the agent investigated a related Prometheus metric"
+ contains: "alertmanager"
+ reason: "Verify the agent investigated Alertmanager-related metrics for the alert"
prompt:
inline: |
Check if there are any firing alerts. If there are, investigate
the related metrics for the most critical alert and summarize
- what's happening. If there are no firing alerts, check cluster
- health metrics instead.
+ what's happening.
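The multi-step triage this task describes — list firing alerts, then pivot to metrics — can lean on Prometheus's built-in `ALERTS` meta-metric to bridge the two steps. A minimal sketch (the helper name is illustrative, not part of the task):

```python
# Prometheus exposes every evaluated alert as the ALERTS meta-metric,
# labelled with alertname and alertstate. Once an agent has identified a
# firing alert by name, it can pivot to metric data with a query like this.
def firing_alert_query(alertname: str) -> str:
    """Build a PromQL query selecting the firing samples of one alert."""
    return f'ALERTS{{alertname="{alertname}", alertstate="firing"}}'
```

For the `AlertmanagerReceiversNotConfigured` alert the judge now expects, a follow-up step would be inspecting Alertmanager's own metrics, such as `alertmanager_config_last_reload_successful`.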
9 changes: 3 additions & 6 deletions evals/mcpchecker/tasks/alerts/filtered-alerts.yaml
@@ -16,11 +16,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "alert"
- reason: "Verify the agent retrieved and reported on alerts"
- - llmJudge:
- contains: "critical"
- reason: "Verify the agent filtered or addressed the critical severity"
+ contains: "AlertmanagerReceiversNotConfigured"
+ reason: "Verify the agent filtered for warning-severity alerts and found AlertmanagerReceiversNotConfigured"
prompt:
inline: |
- Are there any firing alerts with severity=critical? Show only active alerts.
+ Show me only the active alerts with severity=warning.
Contributor

Should we have the severity=warning here? 🤔

Member Author

I was trying to be specific here: we know there are two warning alerts that always fire in a default cluster, Watchdog and AlertmanagerReceiversNotConfigured.
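For reference, those two always-firing defaults can be confirmed directly against the Alertmanager v2 API. This sketch only builds the request URL; the in-cluster service address is an assumption:

```python
from urllib.parse import urlencode

# Assumed in-cluster Alertmanager service URL; adjust for your environment.
ALERTMANAGER = "http://alertmanager-main.openshift-monitoring:9093"

def warning_alerts_url() -> str:
    """Alertmanager v2 API request listing active warning-severity alerts."""
    params = urlencode({"active": "true", "filter": 'severity="warning"'})
    return f"{ALERTMANAGER}/api/v2/alerts?{params}"
```

On a default cluster, the response to this request should include both Watchdog and AlertmanagerReceiversNotConfigured.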

4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/alerts/get-alerts.yaml
@@ -14,8 +14,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "alerts"
- reason: "Verify the output reports on the alert state, whether alerts are firing or not"
+ contains: "Watchdog"
+ reason: "Verify the agent retrieved firing alerts and reported the Watchdog alert"
prompt:
inline: |
Check the cluster for any firing alerts and report what you find.
4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/alerts/get-silences.yaml
@@ -14,8 +14,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "silence"
- reason: "Verify the output discusses alert silences, including when no active silences exist"
+ contains: "silences"
+ reason: "Verify the response mentions silences. The agent should report either active silences with their matchers or that no active silences exist — both are valid outcomes"
prompt:
inline: |
Are there any active silences in Alertmanager?
8 changes: 4 additions & 4 deletions evals/mcpchecker/tasks/labels/get-series.yaml
@@ -15,11 +15,11 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "series"
- reason: "Verify the output reports time series information"
+ contains: "namespace"
+ reason: "Verify the agent retrieved actual series data containing label dimensions like namespace"
- llmJudge:
contains: "kube_pod_info"
- reason: "Verify the agent queried the correct metric"
+ reason: "Verify the agent queried the kube_pod_info metric and reported its cardinality"
prompt:
inline: |
- How many time series exist for the kube_pod_info metric? Show the cardinality.
+ How many time series exist for the kube_pod_info metric? Show the count and list the label names present.
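The cardinality and label-name checks above map to simple Prometheus queries; these strings are one plausible set an agent might issue (the labels path is the standard Prometheus v1 endpoint):

```python
# PromQL for the total cardinality of the metric named in the prompt.
CARDINALITY_QUERY = "count(kube_pod_info)"

# PromQL for a per-namespace breakdown of the same series.
PER_NAMESPACE_QUERY = "count by (namespace) (kube_pod_info)"

# Standard Prometheus v1 endpoint listing the label names on a metric.
LABELS_PATH = "/api/v1/labels?match[]=kube_pod_info"
```

An answer built from these queries naturally contains both `kube_pod_info` and label names like `namespace`, which is what the updated judges check for.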
4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/labels/label-values.yaml
@@ -14,8 +14,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "namespace"
- reason: "Verify the output lists actual namespace values from the cluster"
+ contains: "kube-system"
+ reason: "Verify the output lists actual namespace values from the cluster such as kube-system"
prompt:
inline: |
What are the unique namespace values for the kube_pod_info metric?
10 changes: 5 additions & 5 deletions evals/mcpchecker/tasks/labels/series-by-namespace.yaml
@@ -16,12 +16,12 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "series"
- reason: "Verify the agent reported series information"
+ contains: "pod"
+ reason: "Verify the agent retrieved actual series data containing label dimensions like pod"
- llmJudge:
- contains: "monitoring"
- reason: "Verify the agent scoped to the monitoring namespace"
+ contains: "container"
+ reason: "Verify the agent reported series with container label values from the namespace"
prompt:
inline: |
How many time series exist for container_cpu_usage_seconds_total
- in the monitoring namespace?
+ in the openshift-monitoring namespace?
4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/metrics/list-metrics.yaml
@@ -14,8 +14,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "kube"
- reason: "Verify the output lists Kubernetes metrics matching the kube prefix"
+ contains: "kube_pod_info"
+ reason: "Verify the output lists specific kube metrics discovered from Prometheus"
prompt:
inline: |
List all available Prometheus metrics that contain 'kube' in the name.
4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/metrics/list-node-metrics.yaml
@@ -14,8 +14,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "node"
- reason: "Verify the output lists node-related metrics"
+ contains: "node_"
+ reason: "Verify the output lists specific node metrics discovered from Prometheus (e.g. node_cpu_seconds_total)"
prompt:
inline: |
What node-related metrics are available in Prometheus?
4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/queries/backend-reachability.yaml
@@ -15,8 +15,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "up"
- reason: "Verify the agent confirmed the metrics backend is reachable and targets are up"
+ contains: "prometheus-k8s"
+ reason: "Verify the agent queried the up metric and reported on the prometheus-k8s target"
prompt:
inline: |
Is the metrics backend reachable? Query up{job="prometheus-k8s"} and report the status.
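The `up` metric the prompt references is Prometheus's synthetic per-target health series: 1 for a successful scrape, 0 for a failed one. A sketch of how a verifier might interpret instant-query results, with the API's sample structure simplified to (labels, value) pairs:

```python
# Each instant-query sample for up{job="prometheus-k8s"} carries a value
# of 1.0 (target scraped successfully) or 0.0 (scrape failed).
def backend_reachable(samples: list[tuple[dict, float]]) -> bool:
    """True when at least one sample exists and every target reports up."""
    return bool(samples) and all(value == 1.0 for _labels, value in samples)
```

An empty result set is treated as unreachable here, since no target answering for the job is itself a failure signal.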
8 changes: 4 additions & 4 deletions evals/mcpchecker/tasks/queries/diagnose-cluster-health.yaml
@@ -16,11 +16,11 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "cluster"
- reason: "Verify the response provides an overall cluster health assessment"
+ contains: "Watchdog"
+ reason: "Verify the agent checked alerts and reported specific alert names such as Watchdog"
- llmJudge:
- contains: "node"
- reason: "Verify the agent checked node-level health indicators"
+ contains: "kube_"
+ reason: "Verify the agent queried Kubernetes metrics (e.g. kube_node_status_condition) for health assessment"
prompt:
inline: |
Is the cluster healthy? Give me an overview of any issues.
@@ -15,8 +15,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "namespace"
- reason: "Verify the agent suggests scoping the query by namespace or another dimension"
+ contains: "guardrail"
+ reason: "Verify the agent explains the query was rejected by a guardrail and suggests a scoped alternative"
prompt:
inline: |
Show me the CPU, memory, network, and disk usage for every single container
4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/queries/memory-usage.yaml
@@ -14,8 +14,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "container_memory_working_set_bytes"
- reason: "Verify the agent queried the correct memory metric"
+ contains: "container_memory"
+ reason: "Verify the agent queried a container memory metric (working_set_bytes or usage_bytes)"
- llmJudge:
contains: "pod"
reason: "Verify the response identifies specific pods"
8 changes: 4 additions & 4 deletions evals/mcpchecker/tasks/queries/namespace-pod-count.yaml
@@ -16,11 +16,11 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "namespace"
- reason: "Verify the response lists namespaces"
+ contains: "kube-system"
+ reason: "Verify the response lists actual namespace names from the cluster such as kube-system"
- llmJudge:
- contains: "pod"
- reason: "Verify the response includes pod counts"
+ contains: "kube_pod"
+ reason: "Verify the agent used a kube_pod metric to count pods"
prompt:
inline: |
Which namespaces have the most running pods? Show me the top 5.
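A top-5 pod count per namespace reduces to a single `topk` aggregation. One plausible PromQL an agent could issue, filtering to running pods via the kube-state-metrics per-phase indicator (the exact metric choice is an assumption, not mandated by the task):

```python
# kube_pod_status_phase is a 0/1 gauge per pod and phase; counting the
# samples equal to 1 for phase="Running" gives running pods per namespace.
TOP_NAMESPACES_QUERY = (
    "topk(5, count by (namespace) "
    '(kube_pod_status_phase{phase="Running"} == 1))'
)
```

A response derived from this query contains the `kube_pod` metric family name and real namespace values, matching both judge criteria.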
11 changes: 4 additions & 7 deletions evals/mcpchecker/tasks/queries/namespace-resource-usage.yaml
@@ -16,14 +16,11 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "namespace"
- reason: "Verify the response breaks down resource usage by namespace"
+ contains: "container_cpu_usage_seconds_total"
+ reason: "Verify the agent discovered and used the correct CPU metric"
- llmJudge:
- contains: "cpu"
- reason: "Verify the response includes CPU usage data"
- - llmJudge:
- contains: "memory"
- reason: "Verify the response includes memory usage data"
+ contains: "container_memory"
+ reason: "Verify the agent discovered and used a container memory metric (working_set_bytes or usage_bytes)"
prompt:
inline: |
Which namespace is consuming the most CPU and memory?
2 changes: 1 addition & 1 deletion evals/mcpchecker/tasks/queries/pods-created.yaml
@@ -15,7 +15,7 @@ spec:
verify:
- llmJudge:
contains: "kube_pod_created"
- reason: "Verify the agent used the correct metric for pod creation timestamps"
+ reason: "Verify the agent discovered and used the kube_pod_created metric"
prompt:
inline: |
How many pods were created in the last 5 minutes?
8 changes: 4 additions & 4 deletions evals/mcpchecker/tasks/queries/time-range-query.yaml
@@ -15,11 +15,11 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "30 minutes"
- reason: "Verify the agent honoured the requested 30-minute time window"
+ contains: "container_cpu_usage_seconds_total"
+ reason: "Verify the agent discovered and used the correct CPU metric"
- llmJudge:
- contains: "cpu"
- reason: "Verify the response includes CPU usage data"
+ contains: "pod"
+ reason: "Verify the response includes specific pod names from the query results"
prompt:
inline: |
Show me the CPU usage trend for pods in the default namespace over the last 30 minutes.
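A 30-minute trend is a range query against Prometheus's `/api/v1/query_range` endpoint. This sketch only builds the request URL; the base URL and step are assumptions:

```python
import time
from urllib.parse import urlencode

def range_query_url(base: str, promql: str, minutes: int = 30,
                    step: str = "1m") -> str:
    """Prometheus range-query URL covering the last `minutes` minutes."""
    end = int(time.time())
    params = urlencode({
        "query": promql,
        "start": end - minutes * 60,
        "end": end,
        "step": step,
    })
    return f"{base}/api/v1/query_range?{params}"
```

For this task the `promql` argument would be something like `rate(container_cpu_usage_seconds_total{namespace="default"}[5m])`, which is why the judge now checks for the metric name and pod labels rather than the phrase "30 minutes".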
4 changes: 2 additions & 2 deletions evals/mcpchecker/tasks/queries/visualize-cpu-usage.yaml
@@ -15,8 +15,8 @@ metadata:
spec:
verify:
- llmJudge:
- contains: "cpu"
- reason: "Verify the agent queried and visualized CPU usage data"
+ contains: "container_cpu_usage_seconds_total"
+ reason: "Verify the agent discovered and visualized the correct CPU metric"
prompt:
inline: |
Visualize the CPU usage for pods in the default namespace over the last 30 minutes.