evals: replace generic llmJudge contains values with tool-derived checks #71
slashpai merged 4 commits into rhobs:main
Skipping CI for Draft Pull Request. |
| contains: "alert" | ||
| reason: "Verify the agent retrieved alerts from Alertmanager" | ||
| contains: "alertname" | ||
| reason: "Verify the agent retrieved alerts and reported specific alert names from Alertmanager" |
What about something like this? 🤔
| reason: "Verify the agent retrieved alerts and reported specific alert names from Alertmanager" | |
| contains: "alertname" | |
| reason: "Verify the agent called get_alerts and reported the results: either listing specific alert names from Alertmanager, or explicitly confirming that no alerts are currently firing" | |
Force-pushed from f93c8c7 to 26de704.
Generic contains values like "alert", "cpu", "memory", "namespace", and "node" already appear in the prompt, so an agent could pass by echoing the prompt without calling tools. Replace with specific metric names, label values, and data patterns that require actual tool interaction.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
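The echo problem described in this commit can be checked mechanically. The following is a minimal sketch, not part of the PR: `echoable_values` is a hypothetical helper that flags any `contains` value already present in a task prompt, using the original alert prompt from this thread as input.

```python
def echoable_values(prompt: str, contains_values: list[str]) -> list[str]:
    """Return the contains values that already occur in the prompt text.

    A value found here could be matched by an agent that merely repeats
    the prompt, so it does not prove that a tool was called.
    """
    prompt_lower = prompt.lower()
    return [v for v in contains_values if v.lower() in prompt_lower]


prompt = "Are there any firing alerts with severity=critical? Show only active alerts."
# "alert" is a substring of "alerts" in the prompt; the other two are not present.
print(echoable_values(prompt, ["alert", "alertname", "Watchdog"]))  # ['alert']
```

A check like this could run as a lint over the eval task files, assuming the prompt and `contains` values are loaded from them first; how they are loaded is left out because the task file schema is not shown here.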
Use OpenShift-specific alert names (Watchdog, AlertmanagerReceiversNotConfigured) in contains checks since evals run against OpenShift clusters where these alerts reliably fire. Remove the no-alerts branch from alert-investigation to avoid the agent asking for permission instead of acting.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- series-by-namespace: use the openshift-monitoring namespace instead of the non-existent monitoring namespace
- get-series: check for the kube_pod_info metric name instead of the node label, which agents omit when reporting cardinality counts
- diagnose-cluster-health: check for the Watchdog alert name instead of the alertname label key, which agents don't echo literally

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- Accept either container_memory_working_set_bytes or container_memory_usage_bytes in the memory-usage and namespace-resource-usage tasks
- Update the get-series prompt to ask for label names so the namespace contains check has something to match

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
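The "accept either metric" rule in this commit amounts to an any-of match. Here is a rough sketch of that logic only; how the eval framework itself expresses alternatives is not shown in this thread, and `mentions_any` is a hypothetical name used for illustration.

```python
# The two metric names come from the commit message above.
ACCEPTED_MEMORY_METRICS = (
    "container_memory_working_set_bytes",
    "container_memory_usage_bytes",
)


def mentions_any(answer: str, candidates: tuple[str, ...] = ACCEPTED_MEMORY_METRICS) -> bool:
    """Return True if the agent's answer names at least one accepted metric."""
    return any(name in answer for name in candidates)


print(mentions_any("Top pods by container_memory_usage_bytes: ..."))  # True
print(mentions_any("Memory usage looks fine."))  # False
```

Accepting both names keeps the check tied to real query results while tolerating agents that pick either of the two memory metrics.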
|
Tested on ROSA OpenShift 4.21, which has no Thanos TSDB status endpoint: 22 out of 28 tasks passed, a 78.6% pass rate. If you exclude high-cardinality-rejection (expected to fail without guardrails), it's 22 out of 27, or 81.5%. |
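For reference, the two percentages follow directly from the counts above. A small sketch of the arithmetic (`pass_rate` is an illustrative helper, not something from the eval tooling):

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


print(pass_rate(22, 28))  # 78.6 (all 28 tasks)
print(pass_rate(22, 27))  # 81.5 (excluding high-cardinality-rejection)
```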
|
@nader-ziada Could you please take a look at this? This change is to address openshift/openshift-mcp-server#241 (comment) |
|
/hold for review from mcp team as well |
| prompt: | ||
| inline: | | ||
| Are there any firing alerts with severity=critical? Show only active alerts. | ||
| Show me only the active alerts with severity=warning. |
Should we have the severity=warning here? 🤔
I was trying to be specific here, since we know two warning alerts always fire in a default cluster: Watchdog and AlertmanagerReceiversNotConfigured.
|
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iNecas, slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here.
|
/hold cancel |