Skip to content

feat(mcp): classify MCP tool errors by category for alerting#2570

Open
brandon-pereira wants to merge 5 commits into
mainfrom
claude/mcp-error-categorization
Open

feat(mcp): classify MCP tool errors by category for alerting#2570
brandon-pereira wants to merge 5 commits into
mainfrom
claude/mcp-error-categorization

Conversation

@brandon-pereira

@brandon-pereira brandon-pereira commented Jul 1, 2026

Copy link
Copy Markdown
Member

What

Classify MCP tool errors into user vs server categories so we can alert on infrastructure failures without noise from agent input errors.

Changes

Error categorization infrastructure (packages/api/src/mcp/utils/errors.ts):

  • New mcpUserError() / mcpServerError() helpers replace inline error boilerplate across all MCP tools
  • Error category stored via WeakMap side-channel — prevents metadata from leaking through the MCP SDK serialization layer
  • getErrorCategory() retrieves category for tracing/metrics

ClickHouse error classification (packages/api/src/mcp/tools/query/helpers.ts):

  • clickHouseErrorResult() auto-classifies ClickHouse errors: infrastructure types (NETWORK_ERROR, SOCKET_TIMEOUT, etc.) → server; query errors → user
  • isServerError() walks the full .cause chain for both ClickHouse server-side error types and Node.js TCP-level errors (ECONNREFUSED, ENOTFOUND, etc.)
  • isClickHouseError() with constructor-name fallback handles duplicate @clickhouse/client-common packages across the monorepo (root vs common-utils)

Tracing enrichment (packages/api/src/mcp/utils/tracing.ts):

  • error_category attribute added to spans and the hyperdx.mcp.tool.errors counter
  • Defaults to server when category is unset (safe default that surfaces unclassified errors in alerts)

Tool migrations (~25 files):

  • All MCP tools migrated from inline { isError: true, content: [...] } to typed helpers
  • Timeouts in describeMetric, describeSource, listMetricsmcpServerError (infrastructure condition)
  • MongoDB failures in searchDashboardsmcpServerError
  • Pattern mining failures in runEventPatternsmcpServerError
  • Validation errors (assertSourceKindMatchesSelect, validateMetricSelectItems) → mcpUserError

Why

Without error categorization, every isError: true result looks the same in metrics/alerts. Agent typos (wrong source ID, bad query syntax) and real infrastructure outages (ClickHouse down, MongoDB timeout) are indistinguishable. This makes MCP error alerts either too noisy (fire on every user error) or useless (too many false positives to trust).

With this change, alerting rules can filter on error_category = "server" to catch only infrastructure failures.

Testing

Screenshot 2026-07-01 at 3 13 44 PM

Unit tests (query.test.ts):

  • getClickHouseErrorType — ClickHouseError type extraction, cross-package scenarios, duck-typing resilience
  • isServerError — ClickHouse error types, Node.js TCP errors, cause-chain walking
  • clickHouseErrorResult — auto-classification for infrastructure vs query errors, prefix/suffix formatting, hint composition

Tracing tests (tracing.test.ts):

  • Error category propagation through withToolTracing for user, server, and unset categories
  • Wire-format leak prevention (no _errorCategory property on serialized result)

Integration tests (queryTool.test.ts):

  • ClickHouse query errors (syntax, unknown column/table) surface as isError
  • Infrastructure errors (unreachable ClickHouse) surface as isError
  • _errorCategory does not leak on the wire result

- Add error categorization (user vs server) to MCP tool error responses
  using a WeakMap side-channel that can't leak through SDK serialization
- Use instanceof ClickHouseError instead of duck-typing for type detection
- Add isServerError() to detect both ClickHouse server-side error types
  (NETWORK_ERROR, SOCKET_TIMEOUT, etc.) and Node.js TCP errors
  (ECONNREFUSED, ENOTFOUND, etc.) with full cause-chain walking
- Classify errors on OTel spans (mcp.tool.error_category) and metric
  counters so server errors trigger alerts while user errors don't
- Use clickHouseErrorResult in catch blocks across eventDeltas, breakdown,
  and waterfall tool handlers for consistent categorization
- Fall back through e.message → e.cause.message → String(e) for errors
  where common-utils' ClickHouseQueryError wraps with an empty message

Tests:
- Unit: getClickHouseErrorType, isServerError, clickHouseErrorResult
  with getErrorCategory assertions for all error shapes
- Unit: withToolTracing error_category on spans and counters
- Integration: syntax errors, unknown columns/tables across sql, table,
  timeseries tools; unreachable host errors; no _errorCategory leak
- Add isClickHouseError() with constructor-name fallback for duplicate
  @clickhouse/client-common packages across the monorepo
- Migrate assertSourceKindMatchesSelect and validateMetricSelectItems
  from inline error objects to mcpUserError() (fixes WeakMap bypass)
- Reclassify timeouts as server errors in describeMetric, describeSource,
  listMetrics (infrastructure condition, not user input)
- Reclassify getColumns failure to use clickHouseErrorResult for
  auto-classification in describeMetric
- Reclassify MongoDB query failure as server error in searchDashboards
- Reclassify pattern mining failure as server error in runEventPatterns
- Remove dead mcpError() alias, inline mcpUserError in validateObjectId
- Fix TypeScript narrowing in tracing.ts for getErrorCategory call
@changeset-bot

changeset-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: dc7426a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel

vercel Bot commented Jul 1, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment Jul 1, 2026 10:40pm
hyperdx-storybook Ready Ready Preview, Comment Jul 1, 2026 10:40pm

Request Review

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 1162 production lines changed (threshold: 400)

Additional context: agent branch (claude/mcp-error-categorization)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 28
  • Production lines changed: 1162 (+ 532 in test files, excluded from tier calculation)
  • Branch: claude/mcp-error-categorization
  • Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds structured error categorization (user vs server) to all MCP tool error responses, enabling alerting rules to filter on error_category = "server" for genuine infrastructure failures without noise from agent input mistakes. The approach uses a WeakMap side-channel to avoid metadata leaking through MCP SDK serialization.

  • Introduces mcpUserError() / mcpServerError() helpers and migrates ~25 tool files away from inline { isError: true, content: [...] } objects, with the WeakMap design ensuring category metadata can never leak on the wire.
  • Adds isServerError() with full .cause-chain walking and AggregateError support to auto-classify ClickHouse errors (NETWORK_ERROR, SOCKET_TIMEOUT, etc. → server; SYNTAX_ERROR etc. → user), with a constructor-name fallback for duplicate @clickhouse/client-common installations across the monorepo.
  • Enriches spans and the hyperdx.mcp.tool.errors counter with error_category, defaulting to server for any result missing a category (safe default that surfaces unclassified errors in alerts).

Confidence Score: 5/5

Safe to merge; all tool files are fully migrated to typed helpers, the WeakMap side-channel correctly prevents wire leakage, and the classification logic is well-tested across unit, integration, and tracing layers.

The migration is complete with no remaining inline { isError: true } objects outside errors.ts itself. The isServerError chain-walking logic handles circular causes, cross-package instanceof failures, and AggregateError TCP errors. The default-to-server fallback in withToolTracing is intentionally conservative and documented. Tests cover the full classification matrix including edge cases (circular .cause, duck-typed ClickHouseError, ECONNREFUSED nested in .cause).

No files require special attention.

Important Files Changed

Filename Overview
packages/api/src/mcp/utils/errors.ts Core error infrastructure: adds McpErrorCategory type, WeakMap side-channel, and mcpUserError/mcpServerError helpers; removes old mcpError; exports getErrorCategory for tracing
packages/api/src/mcp/tools/query/helpers.ts Key logic: adds isServerError() with full cause-chain walking, isClickHouseError() with constructor-name fallback, and enriches clickHouseErrorResult() with auto-classification and prefix/suffix context
packages/api/src/mcp/utils/tracing.ts Adds error_category attribute to spans and counters; defaults to 'server' for uncategorized errors; correctly handles both isError results and thrown exceptions
packages/api/src/mcp/tests/query.test.ts Comprehensive unit tests for getClickHouseErrorType, isServerError (including circular-cause guard and cross-package fallback), and clickHouseErrorResult classification
packages/api/src/mcp/tests/tracing.test.ts Adds tests for user/server/unset error category propagation through withToolTracing and verifies no _errorCategory wire leak
packages/api/src/mcp/tests/queryTool.test.ts Integration tests for ClickHouse query errors (syntax, unknown column/table) and infrastructure errors (unreachable host); verifies no _errorCategory wire leak
packages/api/src/mcp/tools/alerts/saveAlert.ts Migrates to typed helpers; uses BaseError instanceof check to distinguish user vs server alert validation errors
packages/api/src/mcp/tools/sources/describeMetric.ts Timeout → mcpServerError; source/connection not-found → mcpUserError; column-load failure now uses clickHouseErrorResult for auto-classification
packages/api/src/mcp/tools/query/runEventPatterns.ts Drain algorithm catch block correctly uses mcpServerError (in-memory clustering failure, not a ClickHouse error); user validation errors use mcpUserError

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[MCP Tool Invocation] --> B{Validation / Logic Error?}
    B -->|User input bad| C[mcpUserError]
    B -->|Source/connection not found| C
    B -->|ClickHouse query| D{isServerError?}
    D -->|NETWORK_ERROR / SOCKET_TIMEOUT\nECONNREFUSED / ENOTFOUND etc.| E[mcpServerError]
    D -->|SYNTAX_ERROR / UNKNOWN_TABLE\nquery-level errors| F[mcpUserError via clickHouseErrorResult]
    B -->|Timeout on system query\ndescribeMetric/describeSource/listMetrics| E
    B -->|Drain clustering failure\npatterning algorithm bug| E
    B -->|MongoDB failure\nsearchDashboards| E

    C --> G[WeakMap stores category = 'user']
    E --> H[WeakMap stores category = 'server']
    F --> G

    G --> I[withToolTracing\nreads getErrorCategory]
    H --> I
    I -->|result.isError| J{category set?}
    J -->|Yes| K[Use stored category]
    J -->|No - unclassified inline error| L[Default: 'server']
    K --> M[Span attr: mcp.tool.error_category\nCounter label: error_category]
    L --> M

    N[Thrown exception] --> O[withToolTracing catch block\nHardcoded: 'server']
    O --> M
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[MCP Tool Invocation] --> B{Validation / Logic Error?}
    B -->|User input bad| C[mcpUserError]
    B -->|Source/connection not found| C
    B -->|ClickHouse query| D{isServerError?}
    D -->|NETWORK_ERROR / SOCKET_TIMEOUT\nECONNREFUSED / ENOTFOUND etc.| E[mcpServerError]
    D -->|SYNTAX_ERROR / UNKNOWN_TABLE\nquery-level errors| F[mcpUserError via clickHouseErrorResult]
    B -->|Timeout on system query\ndescribeMetric/describeSource/listMetrics| E
    B -->|Drain clustering failure\npatterning algorithm bug| E
    B -->|MongoDB failure\nsearchDashboards| E

    C --> G[WeakMap stores category = 'user']
    E --> H[WeakMap stores category = 'server']
    F --> G

    G --> I[withToolTracing\nreads getErrorCategory]
    H --> I
    I -->|result.isError| J{category set?}
    J -->|Yes| K[Use stored category]
    J -->|No - unclassified inline error| L[Default: 'server']
    K --> M[Span attr: mcp.tool.error_category\nCounter label: error_category]
    L --> M

    N[Thrown exception] --> O[withToolTracing catch block\nHardcoded: 'server']
    O --> M
Loading

Reviews (4): Last reviewed commit: "Merge branch 'main' into claude/mcp-erro..." | Re-trigger Greptile

Comment thread packages/api/src/mcp/tools/query/helpers.ts
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

E2E Test Results

All tests passed • 222 passed • 3 skipped • 1565s

Status Count
✅ Passed 222
❌ Failed 0
⚠️ Flaky 5
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Deep Review

✅ No critical issues found. The change is well-structured and well-tested: the WeakMap category side-channel preserves object identity on every migrated path (error results are returned by reference and success-only mutators never touch them), the migration is complete (no raw isError literals remain), error-message guidance is preserved byte-for-byte, and no new credential leak or crash is introduced. The findings below are classification-accuracy tuning and test-coverage gaps for the new behavior — none block merge.

🟡 P2 -- recommended

  • packages/api/src/mcp/tools/query/helpers.ts:612 -- SERVER_CH_ERROR_TYPES is a 4-entry allowlist, so genuine cluster-health ClickHouseError types fall through to user and never alert; e.g. errorHint tells the agent SETTING_CONSTRAINT_VIOLATION is "a server-side constraint" while clickHouseErrorResult classifies that same type user.
    • Fix: Treat unrecognized ClickHouseError types as server, or extend the set to cover replica/keeper/overload/constraint types, so admin- and cluster-level failures stay alertable.
    • adversarial, correctness, agent-native-reviewer
  • packages/api/src/mcp/tools/sources/describeMetric.ts:747 -- Wall-clock discovery timeouts (describeMetric and listMetrics.ts:295) are unconditionally mcpServerError, so one popular high-cardinality metric or a wide time range repeatedly trips the app-level timeout and pages on-call as an infrastructure incident.
    • Fix: Default these Promise.race/abort timeouts to user, escalating to server only when the underlying error satisfies isServerError.
    • reliability
  • packages/api/src/mcp/utils/tracing.ts:61 -- The alerting model relies on every tool routing errors through mcpUserError/mcpServerError, but nothing enforces it; a future handler returning a raw { isError: true, content: [...] } literal defaults to server and silently becomes alert noise with no compile or lint signal.
    • Fix: Add a lint rule banning raw isError literals outside errors.ts, or make error results constructible only via the helpers (branded/opaque type).
    • maintainability, api-contract, correctness
  • packages/api/src/mcp/__tests__/queryTool.test.ts:774 -- The integration tests for query errors and unreachable-host infra errors assert only isError: true, never the resulting category, so the core new behavior (user vs server through the real handler pipeline) is unverified end-to-end and a shape mismatch between the real client's errors and isServerError would silently misclassify outages.
    • Fix: Assert the category (via the metrics counter label or getErrorCategory) for a live unreachable-host case (server) and a not-found case (user).
    • testing, correctness, api-contract, reliability
  • packages/api/src/mcp/tools/query/helpers.ts:764 -- The new e.message || e.cause.message || String(e) fallback branch in clickHouseErrorResult (for wrappers like ClickHouseQueryError with an empty own message) has no test.
    • Fix: Add a query.test.ts case with an empty-message Error whose .cause carries the real message, asserting the cause's message surfaces.
    • testing
🔵 P3 nitpicks (8)
  • packages/api/src/mcp/tools/query/helpers.ts:713 -- isServerError's AggregateError branch checks inner errors only for direct Node TCP codes; a code on an inner's .cause, or a ClickHouseError inner with a server type, is missed and classified user.
    • Fix: Recurse isServerError into each inner error instead of calling only hasNodeErrorCode.
    • correctness, adversarial
  • packages/api/src/mcp/tools/alerts/saveAlert.ts:103 -- e instanceof BaseError ? mcpUserError : mcpServerError treats every BaseError subclass (including Api500Error) as user; latent today since validateAlertInput only throws Api400Error, but any future 500 from that path would go unalerted.
    • Fix: Branch on the concrete subclass or statusCode >= 500 rather than instanceof BaseError.
    • reliability, agent-native-reviewer, project-standards
  • packages/api/src/mcp/tools/sources/describeMetric.ts:545 -- The getColumns failure now forwards the raw ClickHouse/connection error to the agent (previously a generic message), disclosing internal host:port; this aligns with pre-existing behavior in the other query tools but widens it.
    • Fix: Keep the generic user-facing text and log the raw error server-side, applied consistently across query tools.
    • security
  • packages/api/src/mcp/utils/tracing.ts:62 -- result as McpErrorResult downcasts the broader ToolResult to a single-element-tuple type the compiler never verified; harmless today because the WeakMap keys on identity.
    • Fix: Type the map as WeakMap<object, McpErrorCategory> and drop the cast.
    • kieran-typescript, api-contract
  • packages/api/src/mcp/tools/query/helpers.ts:833 -- helpers.ts is now 833 lines, well past the 300-line limit in agent_docs/code_style.md; this diff adds ~120 lines of classification logic.
    • Fix: Extract the ClickHouse error-classification block into its own module (e.g. mcp/utils/clickhouseErrors.ts).
    • project-standards
  • packages/api/src/mcp/tools/query/helpers.ts:612 -- SERVER_CH_ERROR_TYPES/SERVER_NODE_ERROR_CODES are hand-maintained closed sets with no link to an upstream source of truth, risking silent drift on client upgrades.
    • Fix: Add a comment linking the client version that enumerates these types, or a test that fails when they diverge.
    • maintainability
  • packages/api/src/mcp/tools/query/helpers.ts:644 -- The constructor?.name === 'ClickHouseError' fallback guards against duplicate @clickhouse/client-common installs, but the lockfile currently resolves a single copy shared by api and common-utils.
    • Fix: Verify the duplicate can actually occur; if not, resolve it at the dependency level rather than carrying the fallback indefinitely.
    • maintainability
  • packages/api/src/mcp/tools/query/helpers.ts:652 -- Redundant (err as Record<string, unknown>) cast; in-narrowing already types err.type without it (see hasNodeErrorCode in the same file).
    • Fix: Drop the cast and write typeof err.type === 'string'.
    • kieran-typescript

Reviewers (11): correctness, reliability, adversarial, security, testing, maintainability, api-contract, kieran-typescript, project-standards, agent-native, learnings-researcher.

Testing gaps:

  • No test pins classification for server-health ClickHouse types outside the 4-entry allowlist (TIMEOUT_EXCEEDED, TOO_MANY_SIMULTANEOUS_QUERIES, SETTING_CONSTRAINT_VIOLATION).
  • 4 of the 8 SERVER_NODE_ERROR_CODES (ENETUNREACH, EHOSTUNREACH, EPIPE, UND_ERR_CONNECT_TIMEOUT) are never exercised, so an accidental removal would go uncaught.
  • No dedicated errors.test.ts unit-testing mcpUserError/mcpServerError/getErrorCategory/validateObjectId in isolation (coverage is only indirect via tracing.test.ts/query.test.ts).

isServerError now checks for ClickHouseError server-side types at
every depth of the cause chain, not just depth 0-1. Previously a
NETWORK_ERROR nested 2+ levels deep would be missed while a Node.js
ECONNREFUSED at the same depth was correctly caught.
- Fix eventDeltas.ts:343 misclassification: use clickHouseErrorResult
  for 'Failed to build sample queries' catch (matches sibling at 381)
- Pin @clickhouse/client-common to 1.23.0-head.fae5998.1 matching
  common-utils, eliminating duplicate package installs
- Add changeset for @hyperdx/api
- Add tests: cross-package constructor-name fallback, deep cause chain
  ClickHouseError detection, AggregateError TCP errors, circular cause
  guard
Comment thread packages/api/src/mcp/tools/query/runEventPatterns.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants