
GIE-507: update observability task prompts to request metric names and PromQL queries#274

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from slashpai:update-obs-evals
Apr 27, 2026

Conversation

@slashpai
Member

@slashpai slashpai commented Apr 27, 2026

After running the evals with gpt-5.5 I saw the eval pass percentage (60%) was lower than with gpt-5-nano (64%).

A few observability eval tasks were failing even when the agent got the correct answer: the agent gets the right answer but summarizes it in natural language without restating the metric name.

So I updated the prompts to explicitly ask the agent to report the PromQL query and metric name it used, forcing the metric name into the response.

I ran with gpt-5-nano after updating the evals and got these results:

make run-evals EVAL_LABEL_SELECTOR=suite=observability
cat mcpchecker-openai-agent-kubernetes-test-out.json | jq '{
  total: length,
  passed: [.[] | select(.taskPassed == true)] | length,
  failed: [.[] | select(.taskPassed == false)] | length,
  pass_rate: (([.[] | select(.taskPassed == true)] | length) * 100.0 / length | round),
  failed_tasks: [.[] | select(.taskPassed == false) | .taskName]
}'
{
  "total": 28,
  "passed": 25,
  "failed": 3,
  "pass_rate": 89,
  "failed_tasks": [
    "memory-usage",
    "pods-created",
    "namespace-pod-count"
  ]
}
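
The same summary can be computed without jq; a minimal Python sketch, assuming the results file is a JSON array of objects with `taskPassed` and `taskName` fields as in the output above:

```python
def summarize(results):
    """Summarize eval results the same way as the jq filter above:
    total, passed, failed, rounded pass rate, and failed task names."""
    passed = [r for r in results if r.get("taskPassed") is True]
    failed = [r for r in results if r.get("taskPassed") is False]
    return {
        "total": len(results),
        "passed": len(passed),
        "failed": len(failed),
        # jq's round(): nearest integer
        "pass_rate": round(len(passed) * 100.0 / len(results)),
        "failed_tasks": [r["taskName"] for r in failed],
    }
```

Feeding it `json.load(open("mcpchecker-openai-agent-kubernetes-test-out.json"))` would reproduce the numbers shown above.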

Summary by CodeRabbit

  • Chores
    • Updated observability evaluation tasks to require explicit reporting of Prometheus metrics and PromQL queries. This enhances evaluation transparency and ensures AI responses demonstrate methodology alongside results for cluster monitoring queries (CPU usage, memory usage, network traffic, pod counts, and health diagnostics).

…PromQL queries

The observability eval tasks were failing because agents correctly answered
questions but summarized results in natural language without restating the
Prometheus metric name. The llmJudge "contains" assertions check the agent's
response text, not the tool call arguments, so correct answers were marked
as failures.

Update 15 task prompts to explicitly ask the agent to include the metric
name and PromQL query used. This ensures the metric name appears in the
response for the judge to verify, while keeping the contains assertions
strict and non-overlapping with prompt text.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
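
The failure mode described in the commit message can be illustrated with a toy version of a contains-style judge (a hypothetical simplification of the llmJudge assertions, which live in the eval YAML; the metric name and response texts here are made-up examples):

```python
def contains_check(response: str, expected_substrings: list[str]) -> bool:
    """Toy stand-in for a `contains` assertion: every expected
    substring must appear verbatim in the agent's response text."""
    return all(s in response for s in expected_substrings)

# The judge sees only the response text, not the tool call arguments,
# so a correct answer phrased in natural language still fails:
expected = ["container_memory_working_set_bytes"]
natural = "The top memory consumer is prometheus-k8s-0 at 1.2 GiB."
explicit = ("Using metric container_memory_working_set_bytes, the top "
            "memory consumer is prometheus-k8s-0 at 1.2 GiB.")
```

Asking the agent to restate the metric name makes the `explicit` form of the answer the natural one, which is exactly what the prompt updates do.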
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Walkthrough

Task prompts across 15 observability evaluation files are modified to require agents to explicitly report Prometheus metric names and PromQL queries used in their responses, alongside answering the primary query objective.

Changes

Cohort / File(s) Summary
Pod Metrics Queries
evals/tasks/observability/queries/cpu-usage.yaml, memory-usage.yaml, network-traffic.yaml, crashlooping-pods.yaml, pending-pods.yaml, pods-created.yaml
Prompt updated to require reporting metric name and PromQL query; cpu-usage.yaml and memory-usage.yaml also require top 5 pods result.
Cluster & Namespace Queries
evals/tasks/observability/queries/diagnose-cluster-health.yaml, namespace-pod-count.yaml, namespace-resource-usage.yaml, time-range-query.yaml, visualize-cpu-usage.yaml
Prompt updated to require explicit reporting of Prometheus metric names and PromQL queries used in health diagnostics or resource analysis.
Prometheus Internals Queries
evals/tasks/observability/queries/prometheus-head-series.yaml, prometheus-requests.yaml, prometheus-wal-size.yaml
Prompt updated to require metric name and PromQL query reporting alongside numerical responses about Prometheus state.
Label & Series Query
evals/tasks/observability/labels/series-by-namespace.yaml
Prompt updated to require enumeration of label dimensions (e.g., pod, container) in container_cpu_usage_seconds_total series rather than only count reporting.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 Whiskers twitched with glee,
For prompts now ask with clarity—
"Show your metrics, show your query true!"
Each task now shines, transparent through,
PromQL queries traced for me and you! 📊✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Docstring Coverage (Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check (Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (Passed): Check skipped because no linked issues were found for this pull request.
  • Description Check (Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check (Passed): The title directly and accurately summarizes the main change: updating observability task prompts to request metric names and PromQL queries.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@openshift-ci openshift-ci Bot requested review from Kaustubh-pande and manusa April 27, 2026 15:39

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
evals/tasks/observability/queries/visualize-cpu-usage.yaml (1)

24-24: Standardize response shape to make judge matching more deterministic.

Good change. To further reduce residual false negatives, require a fixed response format (Metric, PromQL, Result) instead of free-form prose.

Suggested prompt tweak
-      Include the Prometheus metric name and PromQL query you used.
+      Respond in this format:
+      - Metric: <metric_name>
+      - PromQL: <exact_query>
+      - Result: <brief finding>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml` at line 24,
Update the task to require a fixed response JSON/YAML shape with three keys
"Metric", "PromQL", and "Result" and ensure the prompt explicitly asks the model
to return the Prometheus metric name (e.g., node_cpu_seconds_total) under
"Metric", the exact PromQL query used under "PromQL", and the numeric or
time-series output under "Result"; modify visualize-cpu-usage.yaml to include
this schema requirement and an explicit instruction to include the Prometheus
metric name and the exact PromQL query string.
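
If the fixed Metric/PromQL/Result format suggested above were adopted, the judge side could parse responses deterministically. A sketch of such a parser (a hypothetical helper, not part of this PR):

```python
def parse_structured_response(text: str) -> dict:
    """Parse '- Key: value' lines of the suggested Metric/PromQL/Result
    format into a dict; lines with other keys are ignored."""
    out = {}
    for line in text.splitlines():
        line = line.strip().lstrip("-").strip()
        if ":" in line:
            # partition on the first colon so values may contain colons
            key, _, value = line.partition(":")
            if key.strip() in ("Metric", "PromQL", "Result"):
                out[key.strip()] = value.strip()
    return out
```

A contains assertion could then match against `out["Metric"]` alone instead of scanning free-form prose.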
evals/tasks/observability/queries/namespace-pod-count.yaml (1)

27-28: Tighten “running pods” semantics to reduce ambiguous query choices.

Given this task is still failing, consider explicitly requiring a phase="Running" filter (or equivalent) in the PromQL instruction.

Suggested prompt tweak
-      Which namespaces have the most running pods? Show me the top 5.
-      Use Prometheus metrics and include the metric name and PromQL query you used.
+      Which namespaces have the most running pods? Show me the top 5.
+      Use Prometheus metrics and ensure your query filters for running phase (for example, phase="Running").
+      Include the metric name and exact PromQL query you used.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/tasks/observability/queries/namespace-pod-count.yaml` around lines 27 -
28, Update the task prompt that asks "Which namespaces have the most running
pods? Show me the top 5." to explicitly require using the pod phase filter
(e.g., phase="Running") in the PromQL instruction: ask the model to use the
kube_pod_status_phase metric (or equivalent) and show the metric name plus the
exact PromQL query including `phase="Running"`, then sort and limit to top 5
namespaces (for example using topk(5, sum by(namespace)
(kube_pod_status_phase{phase="Running"} == 1))). Ensure the prompt clearly
demands the metric name and the full PromQL expression.
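
For reference, the running-phase query the reviewer suggests can be assembled and sanity-checked in a few lines. The query text follows the reviewer's own example; `kube_pod_status_phase` is the kube-state-metrics pod phase metric:

```python
def top_running_pod_namespaces_query(k: int = 5) -> str:
    """Build the reviewer-suggested PromQL: top-k namespaces by count
    of pods whose phase is Running."""
    return (f'topk({k}, sum by(namespace) '
            f'(kube_pod_status_phase{{phase="Running"}} == 1))')
```

Pinning the phase filter in the prompt removes the ambiguity between "all pods" and "running pods" that can make this task flaky.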

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 32fe6019-6a11-4f5b-9150-c4cbf20806fd

📥 Commits

Reviewing files that changed from the base of the PR and between 58c6ebe and 76b8153.

📒 Files selected for processing (15)
  • evals/tasks/observability/labels/series-by-namespace.yaml
  • evals/tasks/observability/queries/cpu-usage.yaml
  • evals/tasks/observability/queries/crashlooping-pods.yaml
  • evals/tasks/observability/queries/diagnose-cluster-health.yaml
  • evals/tasks/observability/queries/memory-usage.yaml
  • evals/tasks/observability/queries/namespace-pod-count.yaml
  • evals/tasks/observability/queries/namespace-resource-usage.yaml
  • evals/tasks/observability/queries/network-traffic.yaml
  • evals/tasks/observability/queries/pending-pods.yaml
  • evals/tasks/observability/queries/pods-created.yaml
  • evals/tasks/observability/queries/prometheus-head-series.yaml
  • evals/tasks/observability/queries/prometheus-requests.yaml
  • evals/tasks/observability/queries/prometheus-wal-size.yaml
  • evals/tasks/observability/queries/time-range-query.yaml
  • evals/tasks/observability/queries/visualize-cpu-usage.yaml

@slashpai
Member Author

slashpai commented Apr 27, 2026

@nader-ziada @Cali0707 Since you reviewed the last PR, can you help with this one as well? We saw this approach gave better results.

@mvinkler

/lgtm

@slashpai slashpai changed the title evals: update observability task prompts to request metric names and PromQL queries GIE-507: update observability task prompts to request metric names and PromQL queries Apr 27, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 27, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 27, 2026

@slashpai: This pull request references GIE-507 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this: (the PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@slashpai
Member Author

/cherry-pick release-0.3

@openshift-cherrypick-robot

@slashpai: once the present PR merges, I will cherry-pick it on top of release-0.3 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-0.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@slashpai
Member Author

cc: @iNecas @tremes

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 27, 2026

@Cali0707 Cali0707 left a comment


/approve

@Cali0707

/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"

@Cali0707

Thanks @slashpai !

@openshift-ci

openshift-ci Bot commented Apr 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Cali0707, slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2026
@openshift-ci

openshift-ci Bot commented Apr 27, 2026

@Cali0707: Overrode contexts on behalf of Cali0707: Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request

Details

In response to this:

/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"


@openshift-merge-bot openshift-merge-bot Bot merged commit ce503a7 into openshift:main Apr 27, 2026
10 of 11 checks passed
@openshift-cherrypick-robot

@slashpai: new pull request created: #276

Details

In response to this:

/cherry-pick release-0.3


@openshift-ci

openshift-ci Bot commented Apr 27, 2026

@slashpai: all tests passed!

Full PR test history. Your PR dashboard.

Details



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
