feat(mcp): classify MCP tool errors by category for alerting #2570
brandon-pereira wants to merge 5 commits into
- Add error categorization (user vs server) to MCP tool error responses using a WeakMap side-channel that can't leak through SDK serialization
- Use instanceof ClickHouseError instead of duck-typing for type detection
- Add isServerError() to detect both ClickHouse server-side error types (NETWORK_ERROR, SOCKET_TIMEOUT, etc.) and Node.js TCP errors (ECONNREFUSED, ENOTFOUND, etc.) with full cause-chain walking
- Classify errors on OTel spans (mcp.tool.error_category) and metric counters so server errors trigger alerts while user errors don't
- Use clickHouseErrorResult in catch blocks across eventDeltas, breakdown, and waterfall tool handlers for consistent categorization
- Fall back through e.message → e.cause.message → String(e) for errors where common-utils' ClickHouseQueryError wraps with an empty message

Tests:
- Unit: getClickHouseErrorType, isServerError, clickHouseErrorResult with getErrorCategory assertions for all error shapes
- Unit: withToolTracing error_category on spans and counters
- Integration: syntax errors, unknown columns/tables across sql, table, timeseries tools; unreachable host errors; no _errorCategory leak
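The message fallback in the last bullet above can be sketched roughly as follows. The helper name `describeError` is hypothetical and not taken from the PR; the sketch only assumes that a wrapper error may carry an empty `message` while its `cause` holds the useful text.

```typescript
// Hypothetical helper (name not from the PR): pick the most useful message
// for an error whose wrapper may have been constructed with an empty message.
function describeError(e: unknown): string {
  if (e instanceof Error) {
    // 1. Prefer the error's own message when it is non-empty.
    if (e.message) return e.message;
    // 2. Otherwise look one level down at the wrapped cause.
    const cause: unknown = (e as { cause?: unknown }).cause;
    if (cause instanceof Error && cause.message) return cause.message;
  }
  // 3. Last resort: the generic string form (also covers non-Error throws).
  return String(e);
}
```

With this ordering, a wrapper created with an empty message still surfaces the underlying ClickHouse or network message instead of a blank string.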
- Add isClickHouseError() with constructor-name fallback for duplicate @clickhouse/client-common packages across the monorepo
- Migrate assertSourceKindMatchesSelect and validateMetricSelectItems from inline error objects to mcpUserError() (fixes WeakMap bypass)
- Reclassify timeouts as server errors in describeMetric, describeSource, listMetrics (infrastructure condition, not user input)
- Reclassify getColumns failure to use clickHouseErrorResult for auto-classification in describeMetric
- Reclassify MongoDB query failure as server error in searchDashboards
- Reclassify pattern mining failure as server error in runEventPatterns
- Remove dead mcpError() alias, inline mcpUserError in validateObjectId
- Fix TypeScript narrowing in tracing.ts for getErrorCategory call
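To illustrate why a constructor-name fallback helps when two copies of a package are installed, here is a minimal sketch. Both classes below are stand-ins that simulate the same class loaded from two installs; neither is the real `@clickhouse/client-common` class, and the exact checks in the PR's `isClickHouseError()` may differ.

```typescript
// Stand-in for the copy of the class this module imports.
class ClickHouseError extends Error {
  type: string;
  constructor(message: string, type: string) {
    super(message);
    this.type = type;
  }
}

// Stand-in for a second install of the same package: same class name,
// different identity, so `instanceof ClickHouseError` is false for it.
const OtherCopyClickHouseError = (() => {
  class ClickHouseError extends Error {
    type: string;
    constructor(message: string, type: string) {
      super(message);
      this.type = type;
    }
  }
  return ClickHouseError;
})();

function isClickHouseError(e: unknown): e is Error & { type: string } {
  // Fast path: the error came from the same package copy.
  if (e instanceof ClickHouseError) return true;
  // Fallback: match on constructor name plus the expected field shape.
  return (
    e instanceof Error &&
    e.constructor.name === "ClickHouseError" &&
    typeof (e as { type?: unknown }).type === "string"
  );
}
```

The fallback is deliberately narrower than plain duck-typing: it still requires an `Error` instance whose constructor is named `ClickHouseError`.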
🦋 Changeset detected
Latest commit: dc7426a
The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages
🔴 Tier 4: Critical
Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.
Why this tier:
Additional context: agent branch (
Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
Stats
Greptile Summary
This PR adds structured error categorization (
Confidence Score: 5/5
Safe to merge; all tool files are fully migrated to typed helpers, the WeakMap side-channel correctly prevents wire leakage, and the classification logic is well-tested across unit, integration, and tracing layers.
The migration is complete with no remaining inline
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[MCP Tool Invocation] --> B{Validation / Logic Error?}
B -->|User input bad| C[mcpUserError]
B -->|Source/connection not found| C
B -->|ClickHouse query| D{isServerError?}
D -->|NETWORK_ERROR / SOCKET_TIMEOUT\nECONNREFUSED / ENOTFOUND etc.| E[mcpServerError]
D -->|SYNTAX_ERROR / UNKNOWN_TABLE\nquery-level errors| F[mcpUserError via clickHouseErrorResult]
B -->|Timeout on system query\ndescribeMetric/describeSource/listMetrics| E
B -->|Drain clustering failure\npatterning algorithm bug| E
B -->|MongoDB failure\nsearchDashboards| E
C --> G[WeakMap stores category = 'user']
E --> H[WeakMap stores category = 'server']
F --> G
G --> I[withToolTracing\nreads getErrorCategory]
H --> I
I -->|result.isError| J{category set?}
J -->|Yes| K[Use stored category]
J -->|No - unclassified inline error| L[Default: 'server']
K --> M[Span attr: mcp.tool.error_category\nCounter label: error_category]
L --> M
N[Thrown exception] --> O[withToolTracing catch block\nHardcoded: 'server']
O --> M
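The category resolution shown in the flowchart can be approximated in a few lines. This is a simplified sketch, not the PR's `withToolTracing`: it is synchronous for brevity, the name `traceToolSync` is hypothetical, and a plain in-memory record stands in for the OTel span attribute and counter.

```typescript
type ErrorCategory = "user" | "server";
type ToolResult = { isError?: boolean; content: unknown[] };

// Stand-ins for the real side-channel and the metric counter.
const categories = new WeakMap<object, ErrorCategory>();
const errorCounts: Record<ErrorCategory, number> = { user: 0, server: 0 };

function traceToolSync(handler: () => ToolResult): ToolResult {
  try {
    const result = handler();
    if (result.isError) {
      // Unclassified error results default to "server" so they stay visible in alerts.
      const category = categories.get(result) ?? "server";
      errorCounts[category] += 1;
    }
    return result;
  } catch (e) {
    // Thrown exceptions are always counted as server errors, then rethrown.
    errorCounts.server += 1;
    throw e;
  }
}
```

Defaulting to `server` trades some noise for safety: a forgotten classification shows up in alerts rather than being silently dropped.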
Reviews (4): Last reviewed commit: "Merge branch 'main' into claude/mcp-erro..."
E2E Test Results
✅ All tests passed • 222 passed • 3 skipped • 1565s
Tests ran across 4 shards in parallel.
Deep Review
✅ No critical issues found. The change is well-structured and well-tested: the
🟡 P2 -- recommended
🔵 P3 nitpicks (8)
Reviewers (11): correctness, reliability, adversarial, security, testing, maintainability, api-contract, kieran-typescript, project-standards, agent-native, learnings-researcher.
Testing gaps:
isServerError now checks for ClickHouseError server-side types at every depth of the cause chain, not just depth 0-1. Previously a NETWORK_ERROR nested 2+ levels deep would be missed while a Node.js ECONNREFUSED at the same depth was correctly caught.
- Fix eventDeltas.ts:343 misclassification: use clickHouseErrorResult for 'Failed to build sample queries' catch (matches sibling at 381)
- Pin @clickhouse/client-common to 1.23.0-head.fae5998.1 matching common-utils, eliminating duplicate package installs
- Add changeset for @hyperdx/api
- Add tests: cross-package constructor-name fallback, deep cause chain ClickHouseError detection, AggregateError TCP errors, circular cause guard
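A cause-chain walk with a circular guard, as described in this commit, might look roughly like the sketch below. The two sets are illustrative subsets only (the PR text lists them with "etc."), and the sketch duck-types on `type` for brevity where the PR uses `isClickHouseError()`; the real `isServerError()` may be structured differently.

```typescript
// Illustrative subsets; the full lists are not given in the PR text.
const SERVER_CLICKHOUSE_TYPES = new Set(["NETWORK_ERROR", "SOCKET_TIMEOUT"]);
const SERVER_NODE_CODES = new Set(["ECONNREFUSED", "ENOTFOUND"]);

function isServerError(err: unknown): boolean {
  const seen = new Set<unknown>(); // guards against circular cause chains
  const pending: unknown[] = [err];
  while (pending.length > 0) {
    const current = pending.pop();
    if (typeof current !== "object" || current === null || seen.has(current)) continue;
    seen.add(current);
    const { type, code, cause, errors } = current as {
      type?: unknown;
      code?: unknown;
      cause?: unknown;
      errors?: unknown;
    };
    // Check both kinds of server signal at every depth, not just the top levels.
    if (typeof type === "string" && SERVER_CLICKHOUSE_TYPES.has(type)) return true;
    if (typeof code === "string" && SERVER_NODE_CODES.has(code)) return true;
    if (cause !== undefined) pending.push(cause);
    // AggregateError-style containers expose their members on `errors`.
    if (Array.isArray(errors)) pending.push(...errors);
  }
  return false;
}
```

Because every visited object is recorded in `seen`, a chain where `a.cause === b` and `b.cause === a` terminates instead of looping.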
What
Classify MCP tool errors into user vs server categories so we can alert on infrastructure failures without noise from agent input errors.
Changes
Error categorization infrastructure (packages/api/src/mcp/utils/errors.ts):
- mcpUserError() / mcpServerError() helpers replace inline error boilerplate across all MCP tools
- WeakMap side-channel: prevents metadata from leaking through the MCP SDK serialization layer
- getErrorCategory() retrieves category for tracing/metrics

ClickHouse error classification (packages/api/src/mcp/tools/query/helpers.ts):
- clickHouseErrorResult() auto-classifies ClickHouse errors: infrastructure types (NETWORK_ERROR, SOCKET_TIMEOUT, etc.) → server; query errors → user
- isServerError() walks the full .cause chain for both ClickHouse server-side error types and Node.js TCP-level errors (ECONNREFUSED, ENOTFOUND, etc.)
- isClickHouseError() with constructor-name fallback handles duplicate @clickhouse/client-common packages across the monorepo (root vs common-utils)

Tracing enrichment (packages/api/src/mcp/utils/tracing.ts):
- error_category attribute added to spans and the hyperdx.mcp.tool.errors counter
- Defaults to server when category is unset (safe default that surfaces unclassified errors in alerts)

Tool migrations (~25 files):
- Inline { isError: true, content: [...] } results migrated to typed helpers
- Timeouts in describeMetric, describeSource, listMetrics → mcpServerError (infrastructure condition)
- MongoDB query failure in searchDashboards → mcpServerError
- Pattern mining failure in runEventPatterns → mcpServerError
- Inline error objects (assertSourceKindMatchesSelect, validateMetricSelectItems) → mcpUserError

Why
Without error categorization, every isError: true result looks the same in metrics/alerts. Agent typos (wrong source ID, bad query syntax) and real infrastructure outages (ClickHouse down, MongoDB timeout) are indistinguishable. This makes MCP error alerts either too noisy (fire on every user error) or useless (too many false positives to trust).

With this change, alerting rules can filter on error_category = "server" to catch only infrastructure failures.

Testing
Unit tests (query.test.ts):
- getClickHouseErrorType: ClickHouseError type extraction, cross-package scenarios, duck-typing resilience
- isServerError: ClickHouse error types, Node.js TCP errors, cause-chain walking
- clickHouseErrorResult: auto-classification for infrastructure vs query errors, prefix/suffix formatting, hint composition

Tracing tests (tracing.test.ts):
- error_category on spans and counters from withToolTracing for user, server, and unset categories
- No leak on the wire (no _errorCategory property on serialized result)

Integration tests (queryTool.test.ts):
- Syntax errors and unknown columns/tables across sql, table, timeseries tools return isError
- Unreachable host errors return isError
- _errorCategory does not leak on the wire result