fix: MCP daemon hangs on init when a parse-worker grammar load fails #567
Open
arttttt wants to merge 5 commits into
…o load grammars
The parse worker's grammar-load handshake was awaited with a bare
`once('message')` that only resolved on `grammars-loaded`. If the worker died
while loading grammars (a tree-sitter WASM abort), that message never arrived,
so the await — and the in-process `indexMutex` it runs under — hung forever. In
the shared daemon (one process across all MCP clients), this wedged the
background catch-up sync, so the first tool call and every client that connected
afterwards hung on init with no fallback, regardless of project size.
`ensureWorker` now awaits via the new exported `awaitWorkerGrammarLoad`, which
settles on the first of grammars-loaded / grammars-load-failed / worker error /
worker exit / timeout. On failure the worker is torn down and parsing degrades
to in-process (slower but correct) for the rest of the run. The worker's
`load-grammars` handler also reports JS-level load failures instead of dying
silently.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
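The settle-on-first-outcome handshake described in the commit message above can be sketched roughly as follows. This is an illustration, not the PR's actual code: the function name comes from the commit message, but its signature, the `WorkerLike` type, the result shape, and the message fields (`type`, `error`) are assumptions.

```typescript
import { EventEmitter } from 'node:events';

// Assumed minimal worker surface: a worker_threads Worker is an EventEmitter
// that emits 'message', 'error', and 'exit'.
type WorkerLike = Pick<EventEmitter, 'on' | 'off'>;

// Hypothetical result shape, chosen for this sketch.
type GrammarLoadResult = { ok: true } | { ok: false; reason: string };

// Resolves exactly once, on whichever outcome happens first, and always
// detaches its listeners and clears its timer before resolving.
function awaitWorkerGrammarLoad(
  worker: WorkerLike,
  timeoutMs: number,
): Promise<GrammarLoadResult> {
  return new Promise((resolve) => {
    const settle = (result: GrammarLoadResult): void => {
      clearTimeout(timer);
      worker.off('message', onMessage);
      worker.off('error', onError);
      worker.off('exit', onExit);
      resolve(result);
    };
    const onMessage = (msg: { type?: string; error?: string }): void => {
      if (msg?.type === 'grammars-loaded') {
        settle({ ok: true });
      } else if (msg?.type === 'grammars-load-failed') {
        settle({ ok: false, reason: msg.error ?? 'grammar load failed' });
      }
      // Any other message is ignored; keep waiting.
    };
    const onError = (err: Error): void =>
      settle({ ok: false, reason: `worker error: ${err.message}` });
    const onExit = (code: number): void =>
      settle({ ok: false, reason: `worker exited with code ${code}` });
    const timer = setTimeout(
      () => settle({ ok: false, reason: `timed out after ${timeoutMs}ms` }),
      timeoutMs,
    );
    worker.on('message', onMessage);
    worker.on('error', onError);
    worker.on('exit', onExit);
  });
}
```

Because the promise resolves with a result instead of rejecting, a caller can branch on `ok` and fall back to in-process parsing without a try/catch around the await.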
Asserts `awaitWorkerGrammarLoad` settles (never hangs) on every outcome (grammars-loaded, grammars-load-failed, worker exit, worker error, and timeout) and always removes its listeners. Regression guard for the daemon init-hang.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st call
Defense in depth on top of the parse-worker grammar-load fix. The first tool call blocks on the engine's post-open catch-up sync (the gate) so it never serves rows for files deleted while no server was running. It awaited the gate unbounded: a sync that never settles for any reason would hang the first call and, in the shared daemon, every client that connected after. The wait is now bounded (default 120s, override `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS`); on timeout the call proceeds best-effort over possibly-stale data, the same outcome the existing rejection-swallow already accepts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
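The bounded wait described above can be sketched as a race between the gate promise and a timer. This is a sketch under assumptions, not the code in `mcp/tools.ts`: `gateTimeoutMs` and `waitForGate` are hypothetical names, and only the environment variable and the 120s default come from the commit message.

```typescript
// Hypothetical helper: reads the override from the environment and falls
// back to the 120s default when it is unset or not a positive number.
function gateTimeoutMs(
  env: Record<string, string | undefined> = process.env,
): number {
  const raw = Number(env.CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS);
  return Number.isFinite(raw) && raw > 0 ? raw : 120000;
}

// Waits for the catch-up gate, but never longer than timeoutMs.
// Returns 'settled' if the gate resolved or rejected in time, else 'timeout'.
// In both cases the caller proceeds; on timeout it does so best-effort.
async function waitForGate(
  gate: Promise<unknown>,
  timeoutMs: number = gateTimeoutMs(),
): Promise<'settled' | 'timeout'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  // Swallow a gate rejection: a failed sync is treated like a finished one.
  const settled = gate.then(
    () => 'settled' as const,
    () => 'settled' as const,
  );
  try {
    return await Promise.race([settled, timeout]);
  } finally {
    // Clear the timer so a settled gate does not keep the process alive.
    clearTimeout(timer);
  }
}
```

Returning a tag instead of throwing on timeout keeps the first tool call on a single code path: it logs or ignores the `'timeout'` case and serves whatever the index currently holds.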
Asserts the first tool call still serves (best-effort) when the catch-up gate never settles, instead of hanging, using a tiny `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS` override.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The MCP server can hang on startup / first use and never initialize — on any project, large or small. The shared daemon stops responding and every editor session that connects to it hangs waiting to initialize.
Root cause
`ExtractionOrchestrator` spins up a parse worker and awaits a grammar-load handshake. The bare `once('message')` in that handshake only ever resolves on `grammars-loaded`. The worker's `load-grammars` handler had no try/catch, and `attachWorkerHandlers`' `'error'`/`'exit'` handlers reject only entries in `pendingParses` (empty during the grammar-load handshake). So if the worker dies while loading grammars (a tree-sitter WASM abort), `'grammars-loaded'` never arrives, nothing rejects the promise, and `await ensureWorker()` hangs forever.
`ensureWorker` runs inside `cg.index()`/`cg.sync()`, which run inside `indexMutex.withLock(...)`. A hung await means `withLock`'s `finally` release never runs, so the in-process index mutex is held forever. In the shared daemon (#411: one process across all MCP clients) the background catch-up sync is that hung op, so its gate promise never settles: the first tool call, and every client that connects afterwards, hangs on init, regardless of project size. (The unbounded await predates the daemon; it dates from the move to worker-thread parsing. The shared daemon, however, turned a per-client transient into a permanent, daemon-wide wedge.)
Fix
- `parse-worker.ts`: wrap `load-grammars` in try/catch; report failures as `grammars-load-failed` instead of dying silently.
- `extraction/index.ts`: extract `awaitWorkerGrammarLoad()`, which settles on the first of `grammars-loaded` / `grammars-load-failed` / worker `error` / worker `exit` / timeout (always cleans up its listeners). On failure the worker is torn down and parsing degrades to in-process (slower but correct) for the rest of the run.
- `mcp/tools.ts` (defense in depth): bound the first-call wait on the catch-up gate (default 120s, override `CODEGRAPH_CATCHUP_GATE_TIMEOUT_MS`) so a sync that never settles for any reason can't wedge the first call.
Tests
- `__tests__/parse-worker-grammar-load.test.ts`: `awaitWorkerGrammarLoad` settles (never hangs) on every outcome and removes its listeners. (The worker path isn't exercised under vitest: `parse-worker.js` isn't next to the `.ts`, so `useWorker=false`. That is why this never surfaced in CI, hence the extracted, directly testable helper.)
- `__tests__/mcp-catchup-gate.test.ts`: a gate that never settles no longer hangs the first tool call.
Not included (deliberate follow-up)
A proxy↔daemon JSON-RPC timeout (so a wedged daemon falls back to direct mode) was considered but skipped: the daemon sends `hello` synchronously and answers `initialize` without the gate, so after this fix a daemon can no longer go silent on JSON-RPC from any known cause. The timeout would guard a near-impossible case at the cost of reworking the transparent proxy pipe that every session depends on.
🤖 Generated with Claude Code