GIE-507: update observability task prompts to request metric names and PromQL queries#274
Conversation
The observability eval tasks were failing because agents correctly answered questions but summarized results in natural language without restating the Prometheus metric name. The llmJudge `contains` assertions check the agent's response text, not tool-call arguments, so correct answers were marked as failures.

Update 15 task prompts to explicitly ask the agent to include the metric name and PromQL query used. This ensures the metric name appears in the response for the judge to verify, while keeping the `contains` assertions strict and non-overlapping with the prompt text.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
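The fix can be illustrated with a minimal task sketch (field names here are illustrative, not the repository's exact schema): the prompt now demands the metric name, so the judge's `contains` checks can match it in the response text.

```yaml
# Hypothetical eval task sketch -- keys are illustrative, not the exact schema.
prompt: |
  What is the current CPU usage per node?
  Include the Prometheus metric name and the PromQL query you used.
llmJudge:
  # `contains` assertions run against the agent's response text,
  # not tool-call arguments, so the metric name must appear in the answer.
  contains:
    - node_cpu_seconds_total
```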
📝 Walkthrough

Task prompts across 15 observability evaluation files are modified to require agents to explicitly report the Prometheus metric names and PromQL queries used in their responses, alongside answering the primary query objective.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 5 passed
🧹 Nitpick comments (2)
evals/tasks/observability/queries/visualize-cpu-usage.yaml (1)
24-24: Standardize response shape to make judge matching more deterministic.

Good change. To further reduce residual false negatives, require a fixed response format (`Metric`, `PromQL`, `Result`) instead of free-form prose.

Suggested prompt tweak:

```diff
- Include the Prometheus metric name and PromQL query you used.
+ Respond in this format:
+ - Metric: <metric_name>
+ - PromQL: <exact_query>
+ - Result: <brief finding>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml` at line 24: update the task to require a fixed response JSON/YAML shape with three keys "Metric", "PromQL", and "Result", and ensure the prompt explicitly asks the model to return the Prometheus metric name (e.g., node_cpu_seconds_total) under "Metric", the exact PromQL query used under "PromQL", and the numeric or time-series output under "Result"; modify visualize-cpu-usage.yaml to include this schema requirement and an explicit instruction to include the Prometheus metric name and the exact PromQL query string.

evals/tasks/observability/queries/namespace-pod-count.yaml (1)
27-28: Tighten "running pods" semantics to reduce ambiguous query choices.

Given this task is still failing, consider explicitly requiring a `phase="Running"` filter (or equivalent) in the PromQL instruction.

Suggested prompt tweak:

```diff
- Which namespaces have the most running pods? Show me the top 5.
- Use Prometheus metrics and include the metric name and PromQL query you used.
+ Which namespaces have the most running pods? Show me the top 5.
+ Use Prometheus metrics and ensure your query filters for running phase (for example, phase="Running").
+ Include the metric name and exact PromQL query you used.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/namespace-pod-count.yaml` around lines 27 - 28, Update the task prompt that asks "Which namespaces have the most running pods? Show me the top 5." to explicitly require using the pod phase filter (e.g., phase="Running") in the PromQL instruction: ask the model to use the kube_pod_status_phase metric (or equivalent) and show the metric name plus the exact PromQL query including `phase="Running"`, then sort and limit to top 5 namespaces (for example using topk(5, sum by(namespace) (kube_pod_status_phase{phase="Running"} == 1))). Ensure the prompt clearly demands the metric name and the full PromQL expression.
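Concretely, the tightened task might look like the sketch below (keys and exact wording are illustrative; the repository's schema may differ):

```yaml
# Hypothetical sketch -- field names are illustrative, not the exact eval schema.
prompt: |
  Which namespaces have the most running pods? Show me the top 5.
  Use Prometheus metrics and ensure your query filters for running phase
  (for example, phase="Running").
  Include the metric name and exact PromQL query you used.
# A query satisfying this instruction would resemble:
#   topk(5, sum by (namespace) (kube_pod_status_phase{phase="Running"} == 1))
```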
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 32fe6019-6a11-4f5b-9150-c4cbf20806fd
📒 Files selected for processing (15)
- evals/tasks/observability/labels/series-by-namespace.yaml
- evals/tasks/observability/queries/cpu-usage.yaml
- evals/tasks/observability/queries/crashlooping-pods.yaml
- evals/tasks/observability/queries/diagnose-cluster-health.yaml
- evals/tasks/observability/queries/memory-usage.yaml
- evals/tasks/observability/queries/namespace-pod-count.yaml
- evals/tasks/observability/queries/namespace-resource-usage.yaml
- evals/tasks/observability/queries/network-traffic.yaml
- evals/tasks/observability/queries/pending-pods.yaml
- evals/tasks/observability/queries/pods-created.yaml
- evals/tasks/observability/queries/prometheus-head-series.yaml
- evals/tasks/observability/queries/prometheus-requests.yaml
- evals/tasks/observability/queries/prometheus-wal-size.yaml
- evals/tasks/observability/queries/time-range-query.yaml
- evals/tasks/observability/queries/visualize-cpu-usage.yaml
@nader-ziada @Cali0707 Since you reviewed the last PR, can you help with this as well? We saw this gave better results.
/lgtm
@slashpai: This pull request references GIE-507 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cherry-pick release-0.3
@slashpai: once the present PR merges, I will cherry-pick it on top of release-0.3 in a new PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"
Thanks @slashpai !
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Cali0707, slashpai. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing `/approve` in a comment.
@Cali0707: Overrode contexts on behalf of Cali0707: Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request
Merged commit ce503a7 into openshift:main
@slashpai: new pull request created: #276
@slashpai: all tests passed! Full PR test history. Your PR dashboard.
After running the evals with gpt-5.5 I saw the eval pass percentage (60%) was lower than with gpt-5-nano (64%).
A few observability eval tasks were failing even when the agent got the correct answer: the agent gets the right answer but summarizes it in natural language without restating the metric name.
So I updated the prompts to explicitly ask the agent to report the PromQL query/metric name used, which forces the metric name into the response.
I ran with gpt-5-nano and got these results after updating the evals.