GIE-507: update observability task prompts to request metric names and PromQL queries#274
Conversation
The observability eval tasks were failing because agents correctly answered questions but summarized results in natural language without restating the Prometheus metric name. The llmJudge `contains` assertions check the agent's response text, not tool-call arguments, so correct answers were marked as failures.

Update 15 task prompts to explicitly ask the agent to include the metric name and PromQL query used. This ensures the metric name appears in the response for the judge to verify, while keeping the `contains` assertions strict and non-overlapping with the prompt text.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
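The fix can be illustrated with a minimal task sketch (field names here are illustrative, not the repository's exact schema): the prompt now demands the metric name, so the judge's `contains` checks can match it in the response text.

```yaml
# Hypothetical eval task sketch -- keys are illustrative, not the exact schema.
prompt: |
  What is the current CPU usage per node?
  Include the Prometheus metric name and the PromQL query you used.
llmJudge:
  # `contains` assertions run against the agent's response text,
  # not tool-call arguments, so the metric name must appear in the answer.
  contains:
    - node_cpu_seconds_total
```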
📝 Walkthrough

Task prompts across 15 observability evaluation files are modified to require agents to explicitly report the Prometheus metric names and PromQL queries used in their responses, alongside answering the primary query objective.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 5 passed
🧹 Nitpick comments (2)
evals/tasks/observability/queries/visualize-cpu-usage.yaml (1)
24-24: Standardize response shape to make judge matching more deterministic.

Good change. To further reduce residual false negatives, require a fixed response format (`Metric`, `PromQL`, `Result`) instead of free-form prose.

Suggested prompt tweak:

```diff
- Include the Prometheus metric name and PromQL query you used.
+ Respond in this format:
+ - Metric: <metric_name>
+ - PromQL: <exact_query>
+ - Result: <brief finding>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/visualize-cpu-usage.yaml` at line 24: update the task to require a fixed response JSON/YAML shape with three keys "Metric", "PromQL", and "Result", and ensure the prompt explicitly asks the model to return the Prometheus metric name (e.g., node_cpu_seconds_total) under "Metric", the exact PromQL query used under "PromQL", and the numeric or time-series output under "Result"; modify visualize-cpu-usage.yaml to include this schema requirement and an explicit instruction to include the Prometheus metric name and the exact PromQL query string.

evals/tasks/observability/queries/namespace-pod-count.yaml (1)
27-28: Tighten "running pods" semantics to reduce ambiguous query choices.

Given this task is still failing, consider explicitly requiring a `phase="Running"` filter (or equivalent) in the PromQL instruction.

Suggested prompt tweak:

```diff
- Which namespaces have the most running pods? Show me the top 5.
- Use Prometheus metrics and include the metric name and PromQL query you used.
+ Which namespaces have the most running pods? Show me the top 5.
+ Use Prometheus metrics and ensure your query filters for running phase (for example, phase="Running").
+ Include the metric name and exact PromQL query you used.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/tasks/observability/queries/namespace-pod-count.yaml` around lines 27 - 28, Update the task prompt that asks "Which namespaces have the most running pods? Show me the top 5." to explicitly require using the pod phase filter (e.g., phase="Running") in the PromQL instruction: ask the model to use the kube_pod_status_phase metric (or equivalent) and show the metric name plus the exact PromQL query including `phase="Running"`, then sort and limit to top 5 namespaces (for example using topk(5, sum by(namespace) (kube_pod_status_phase{phase="Running"} == 1))). Ensure the prompt clearly demands the metric name and the full PromQL expression.
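Concretely, the tightened task might look like the sketch below (keys and exact wording are illustrative; the repository's schema may differ):

```yaml
# Hypothetical sketch -- field names are illustrative, not the exact eval schema.
prompt: |
  Which namespaces have the most running pods? Show me the top 5.
  Use Prometheus metrics and ensure your query filters for running phase
  (for example, phase="Running").
  Include the metric name and exact PromQL query you used.
# A query satisfying this instruction would resemble:
#   topk(5, sum by (namespace) (kube_pod_status_phase{phase="Running"} == 1))
```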
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 32fe6019-6a11-4f5b-9150-c4cbf20806fd
📒 Files selected for processing (15)
- evals/tasks/observability/labels/series-by-namespace.yaml
- evals/tasks/observability/queries/cpu-usage.yaml
- evals/tasks/observability/queries/crashlooping-pods.yaml
- evals/tasks/observability/queries/diagnose-cluster-health.yaml
- evals/tasks/observability/queries/memory-usage.yaml
- evals/tasks/observability/queries/namespace-pod-count.yaml
- evals/tasks/observability/queries/namespace-resource-usage.yaml
- evals/tasks/observability/queries/network-traffic.yaml
- evals/tasks/observability/queries/pending-pods.yaml
- evals/tasks/observability/queries/pods-created.yaml
- evals/tasks/observability/queries/prometheus-head-series.yaml
- evals/tasks/observability/queries/prometheus-requests.yaml
- evals/tasks/observability/queries/prometheus-wal-size.yaml
- evals/tasks/observability/queries/time-range-query.yaml
- evals/tasks/observability/queries/visualize-cpu-usage.yaml
@nader-ziada @Cali0707 Since you reviewed the last PR, can you help with this as well? We saw this gave better results.
/lgtm
@slashpai: This pull request references GIE-507 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cherry-pick release-0.3
@slashpai: once the present PR merges, I will cherry-pick it on top of release-0.3 in a new PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/override "Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request"
Thanks @slashpai !
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Cali0707, slashpai. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing `/approve` in a comment.
@Cali0707: Overrode contexts on behalf of Cali0707: Konflux kflux-prd-rh02 / openshift-mcp-server-on-pull-request
Merged commit ce503a7 into openshift:main
@slashpai: new pull request created: #276
@slashpai: all tests passed! Full PR test history. Your PR dashboard.
After running the evals with gpt-5.5 I saw the eval pass percentage (60%) was lower than with gpt-5-nano (64%).
A few observability eval tasks were failing even when the agent got the correct answer: the agent gets the right answer but summarizes it in natural language without restating the metric name.
So I updated the prompts to explicitly ask the agent to report the PromQL query/metric name used, which forces the metric name into the response.
I ran with gpt-5-nano and got these results after updating the evals.