fix(node): ensure halt_height triggers clean process shutdown#5495
fix(node): ensure halt_height triggers clean process shutdown#5495aeddi wants to merge 3 commits into
Conversation
🛠 PR Checks SummaryAll Automated Checks passed. ✅ Manual Checks (for Reviewers):
Read More🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers. ✅ Automated Checks (for Contributors):No automated checks match this pull request. ☑️ Contributor Actions:
☑️ Reviewer Actions:
📚 Resources:Debug
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
It's not a bug, it's actually a feature. See https://github.com/cosmos/cosmos-sdk/blob/main/docs/docs/build/building-apps/03-app-upgrade.md#halt-behavior
Another issue I see with exiting the process is that, in some infrastructure setups, any killed process would be restarted automatically, resulting in an infinite restart loop. Hence I think this PR should be closed. |
I didn't realize that was expected behavior. cc @moul.
By returning an exit code 0, the idea is that validator admins can configure their setup to only restart on failure, and not restart if the node exits gracefully.
You're probably right, especially if we want to keep the exact same behavior. I'll wait for moul input. |
|
Discussed with @moul and we've decided to keep the behavior as described in the cosmos-sdk doc. Just following up on our chat, here’s a summary of the diff before vs. after this PR. Before we close this PR, do we want to stick to the original behavior for cold restarts too (exit 2 on unrecovered panic vs exit 1 on clean error)? #5334 behavior (original PR)
#5495 behavior (this PR)
Edit: Cleaned up Claude comparison. |
|
Two valid approaches imo:
Both are better than the current behavior. I don't have a strong preference between the two, but I'd avoid the in-between state where consensus is dead but P2P/RPC are still serving. |
|
OK so if I understand correctly there is a discrepency with the SDK and Gno on cold restart. After the halt height is reached, a cold restart exits the node with code=2 in Gno, while in the SDK it stays "zombie", just like it was before the cold restart. As usual, I tend to replicate the SDK behaviour because it is more battle-tested. So, as @moul calls it, the 'full zombie' approach. But then why do you want to kill the P2P reactor ? The zombie mode was specialy designed because
If you kill P2P you fall into these issues. |
|
TBH, I don't really get the point of the original spec (keeping the node half-alive with RPC and P2P on), and the cosmos-sdk docs don't give much info on the reasoning behind it, so it's hard to judge without more context. Personally, I see even less point in the 'full-zombie' mode (why not just exit the process if the P2P / RPC are down too?), but I do understand the value of sticking to the cosmos-sdk behavior as @tbruyelle suggested. So my take:
I'm fine with either. |
I think the intention here is to avoid the 'waiting for peers' step. This can happen before or after halt_heigth actually:
It's like a graceful relay state, old nodes hold connections, new nodes bootstrap off them. Peer discovery and reconnection takes time, so holding connection allows smoother upgrades. |
fix(node): ensure halt_height triggers clean process shutdown
Problem
PR #5334 added
halt_heightfor coordinated chain upgrades. The intent was for the node to panic deterministically inBeginBlockwhen the halt height is exceeded, so the process exits and no extra blocks can be processed.However, the panic is caught by the consensus
receiveRoutine's blanketrecover()handler (consensus/state.go:612). This handler logsCONSENSUS FAILURE!!!and stops the WAL, but does not shut down the node process. The result is a zombie node: consensus was killed but P2P, RPC, and the main goroutine continue running indefinitely.On cold restart, the behavior differs: the panic occurs during
Handshaker.replayBlockon the main goroutine (norecover()), so the process does crash, but with an unrecovered panic and stack trace rather than a clean error message.Fix
Typed error: Replace the raw
panic(string)inBeginBlockwithpanic(types.HaltHeightReachedError{...}), a new error type defined intm2/pkg/bft/types/errors.go(alongside existing domain errors). This lets both the consensus and node packages distinguish halt-height panics from unexpected failures via direct type assertion on therecover()value.Warm halt (consensus running, halt height reached during block production):
receiveRoutinerecovery handler detectsHaltHeightReachedError.INFO "Halt height reached, shutting down"(not ERROR — this is expected behavior).osm.Kill()which sends SIGTERM to the process, triggering the normal graceful shutdown path (Node.OnStopcleanup, context cancellation, etc.).Cold restart (halt height already exceeded when node starts):
doHandshakeinnode.gocatches theHaltHeightReachedErrorpanic during block replay and converts it to an error.NewNode→execStart, which prints a clear message and exits.Testing
Tested manually with a single-validator
gnoland start --lazy:halt_height = 3, started node at height 3INFO Halt height reached, shutting down, shut down gracefully, exit code 0halt height 3 already reached, remove or increase halt_height in config before restarting, exit code 1halt_height = 0, restarted: chain resumed from height 5, producing blocks normally