fix(node): ensure halt_height triggers clean process shutdown by aeddi · Pull Request #5495 · gnolang/gno

aeddi · 2026-04-14T07:22:39Z

fix(node): ensure halt_height triggers clean process shutdown

Problem

PR #5334 added halt_height for coordinated chain upgrades. The intent was for the node to panic deterministically in BeginBlock when the halt height is exceeded, so the process exits and no extra blocks can be processed.

However, the panic is caught by the consensus receiveRoutine's blanket recover() handler (consensus/state.go:612). This handler logs CONSENSUS FAILURE!!! and stops the WAL, but does not shut down the node process. The result is a zombie node: consensus was killed but P2P, RPC, and the main goroutine continue running indefinitely.
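To make the zombie state concrete, here is a minimal sketch of a goroutine guarded by a blanket recover(). The names `consensusLoop` and `simulate` are hypothetical and this is not the actual receiveRoutine code; it only shows that a recovered panic in one goroutine leaves the rest of the process running.

```go
package main

import "fmt"

// consensusLoop stands in for a long-running routine guarded by a blanket
// recover(). The panic message is copied from the logs quoted in this thread.
func consensusLoop(failures chan<- string) {
	defer func() {
		if r := recover(); r != nil {
			// The handler only reports the failure; nothing stops the process.
			failures <- fmt.Sprintf("CONSENSUS FAILURE!!! %v", r)
		}
	}()
	panic("halt height 10 reached, node shutting down")
}

// simulate runs the loop in its own goroutine and returns the reported
// failure. Returning at all shows the caller outlived the panic.
func simulate() string {
	failures := make(chan string, 1)
	go consensusLoop(failures)
	return <-failures
}

func main() {
	fmt.Println(simulate())
	fmt.Println("main goroutine still running")
}
```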

On cold restart, the behavior differs: the panic occurs during Handshaker.replayBlock on the main goroutine (no recover()), so the process does crash, but with an unrecovered panic and stack trace rather than a clean error message.

Fix

Typed error: Replace the raw panic(string) in BeginBlock with panic(types.HaltHeightReachedError{...}), a new error type defined in tm2/pkg/bft/types/errors.go (alongside existing domain errors). This lets both the consensus and node packages distinguish halt-height panics from unexpected failures via direct type assertion on the recover() value.
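As a rough illustration of the typed-panic idea, the sketch below panics with a struct value and tells it apart from other panics by type assertion on the recover() value. The field name `HaltHeight`, the helper `classify`, and the exact halt condition are assumptions for this example, not the code in tm2/pkg/bft/types/errors.go.

```go
package main

import "fmt"

// HaltHeightReachedError mirrors the idea of the typed error from this PR.
// The field name is an assumption made for this sketch.
type HaltHeightReachedError struct {
	HaltHeight int64
}

func (e HaltHeightReachedError) Error() string {
	return fmt.Sprintf("halt height %d reached", e.HaltHeight)
}

// beginBlock is a hypothetical stand-in for the application's BeginBlock.
// A haltHeight of 0 is treated as "no halt configured".
func beginBlock(height, haltHeight int64) {
	if haltHeight > 0 && height > haltHeight {
		panic(HaltHeightReachedError{HaltHeight: haltHeight})
	}
}

// classify runs step and reports which kind of panic, if any, it raised.
func classify(step func()) (kind string) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		// Direct type assertion on the recovered value.
		if _, ok := r.(HaltHeightReachedError); ok {
			kind = "halt"
			return
		}
		kind = "unexpected"
	}()
	step()
	return "ok"
}

func main() {
	fmt.Println(classify(func() { beginBlock(3, 3) }))
	fmt.Println(classify(func() { beginBlock(4, 3) }))
	fmt.Println(classify(func() { panic("boom") }))
}
```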

Warm halt (consensus running, halt height reached during block production):

1. The receiveRoutine recovery handler detects HaltHeightReachedError.
2. Logs INFO "Halt height reached, shutting down" (not ERROR, since this is expected behavior).
3. Calls osm.Kill(), which sends SIGTERM to the process, triggering the normal graceful shutdown path (Node.OnStop cleanup, context cancellation, etc.).
4. Process exits with code 0.
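The warm-halt branch can be sketched as follows. This is a simplified model under stated assumptions: `runStep` is a hypothetical wrapper, and the `kill` and `logf` callbacks are injected placeholders standing in for osm.Kill() and the node logger so the example stays self-contained.

```go
package main

import "fmt"

// HaltHeightReachedError is redefined here so the sketch is self-contained;
// the field name is an assumption.
type HaltHeightReachedError struct {
	HaltHeight int64
}

func (e HaltHeightReachedError) Error() string {
	return fmt.Sprintf("halt height %d reached", e.HaltHeight)
}

// runStep executes step under a recovery handler that treats a halt-height
// panic as an expected event and everything else as a failure.
func runStep(step func(), kill func(), logf func(level, msg string)) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		if _, ok := r.(HaltHeightReachedError); ok {
			logf("INFO", "Halt height reached, shutting down")
			kill() // placeholder for the call that requests graceful shutdown
			return
		}
		logf("ERROR", fmt.Sprintf("CONSENSUS FAILURE!!! %v", r))
	}()
	step()
}

func main() {
	killed := false
	logf := func(level, msg string) { fmt.Println(level, msg) }
	runStep(
		func() { panic(HaltHeightReachedError{HaltHeight: 3}) },
		func() { killed = true },
		logf,
	)
	fmt.Println("kill requested:", killed)
}
```

Injecting `kill` keeps the handler testable without sending a real signal to the process.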

Cold restart (halt height already exceeded when node starts):

1. doHandshake in node.go catches the HaltHeightReachedError panic during block replay and converts it to an error.
2. The error propagates normally through NewNode → execStart, which prints a clear message and exits.
3. Process exits with code 1.
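The panic-to-error conversion for the cold-restart path might look roughly like this. The `handshake` function and its `replay` parameter are hypothetical stand-ins for doHandshake and the block replay it wraps; only the error message text is taken from the manual test output in this description.

```go
package main

import "fmt"

// HaltHeightReachedError is redefined here so the sketch is self-contained;
// the field name is an assumption.
type HaltHeightReachedError struct {
	HaltHeight int64
}

func (e HaltHeightReachedError) Error() string {
	return fmt.Sprintf(
		"halt height %d already reached, remove or increase halt_height in config before restarting",
		e.HaltHeight,
	)
}

// handshake runs replay and converts a halt-height panic into an ordinary
// error. Any other panic is re-raised unchanged.
func handshake(replay func()) (err error) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		halt, ok := r.(HaltHeightReachedError)
		if !ok {
			panic(r) // not ours: keep the original crash behavior
		}
		err = halt
	}()
	replay()
	return nil
}

func main() {
	err := handshake(func() { panic(HaltHeightReachedError{HaltHeight: 3}) })
	if err != nil {
		// A real entry point would print this and exit with code 1.
		fmt.Println("error:", err)
	}
}
```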

Testing

Tested manually with a single-validator gnoland start --lazy:

1. Set halt_height = 3, started node at height 3
2. Node committed block 3, logged INFO Halt height reached, shutting down, shut down gracefully, exit code 0
3. Restarted with same config: printed halt height 3 already reached, remove or increase halt_height in config before restarting, exit code 1
4. Set halt_height = 0, restarted: chain resumed from height 5, producing blocks normally

Gno2D2 · 2026-04-14T07:23:21Z

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):

IGNORE the bot requirements for this PR (force green CI check)

✅ Automated Checks (for Contributors):

No automated checks match this pull request.

☑️ Contributor Actions:

Fix any issues flagged by automated checks.
Follow the Contributor Checklist to ensure your PR is ready for review.
- Add new tests, or document why they are unnecessary.
- Provide clear examples/screenshots, if necessary.
- Update documentation, if required.
- Ensure no breaking changes, or include BREAKING CHANGE notes.
- Link related issues/PRs, where applicable.

☑️ Reviewer Actions:

Complete manual checks for the PR, including the guidelines and additional checks if applicable.


codecov · 2026-04-14T07:24:29Z

Codecov Report

❌ Patch coverage is 18.75000% with 13 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
tm2/pkg/bft/consensus/state.go	0.00%	5 Missing and 1 partial ⚠️
tm2/pkg/bft/node/node.go	28.57%	4 Missing and 1 partial ⚠️
tm2/pkg/bft/types/errors.go	0.00%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

tbruyelle · 2026-04-14T09:42:44Z

It's not a bug, it's actually a feature. See https://github.com/cosmos/cosmos-sdk/blob/main/docs/docs/build/building-apps/03-app-upgrade.md#halt-behavior

> To perform the actual halt of the blockchain, the upgrade keeper simply panics which prevents the ABCI state machine from proceeding but doesn't actually exit the process. Exiting the process can cause issues for other nodes that start to lose connectivity with the exiting nodes, thus this module prefers to just halt but not exit.

Another issue I see with exiting the process is that, in some infrastructure setups, any killed process would be restarted automatically, resulting in an infinite restart loop.

Hence I think this PR should be closed.

aeddi · 2026-04-14T11:56:15Z

> It's not a bug, it's actually a feature. See cosmos/cosmos-sdk@main/docs/docs/build/building-apps/03-app-upgrade.md#halt-behavior

I didn't realize that was expected behavior. cc @moul.
I’m missing some context, but I’m not sure I see the point of keeping node connections active when the chain is halted.

> Another issue I see with exiting the process is that, in some infrastructure setups, any killed process would be restarted automatically, resulting in an infinite restart loop.

By returning an exit code 0, the idea is that validator admins can configure their setup to only restart on failure, and not restart if the node exits gracefully.
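A restart-on-failure policy of the kind described here can be modeled in a few lines. This is only a sketch of the decision a process supervisor would make; `shouldRestart` is a hypothetical helper, and the exit codes are the ones reported in this thread (0 for the warm halt, 1 for the clean cold-restart error, 2 for an unrecovered Go panic).

```go
package main

import "fmt"

// shouldRestart models an "on-failure" restart policy: only a non-zero exit
// code triggers a restart.
func shouldRestart(exitCode int) bool {
	return exitCode != 0
}

func main() {
	for _, code := range []int{0, 1, 2} {
		fmt.Printf("exit_code=%d restart=%v\n", code, shouldRestart(code))
	}
}
```

Under such a policy the warm halt (exit 0) leaves the node stopped, while a cold restart against an already-reached halt height (exit 1) is still retried until the operator changes halt_height.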

> Hence I think this PR should be closed.

You're probably right, especially if we want to keep the exact same behavior. I'll wait for moul's input.

aeddi · 2026-04-14T16:01:53Z

Discussed with @moul and we've decided to keep the behavior as described in the cosmos-sdk doc. Just following up on our chat, here’s a summary of the diff before vs. after this PR.

Before we close this PR, do we want to stick to the original behavior for cold restarts too (exit 2 on unrecovered panic vs exit 1 on clean error)?

#5334 behavior (original PR)

Step	Result
1. Start	10 nodes, all log `INFO Halt height configured {"height": 10}`
2. Warm halt	All 10 log `ERROR CONSENSUS FAILURE!!!` with full stack trace + `"halt height 10 reached, node shutting down"`
3. Process state	All `status=running`, RPC responds with `height=10` — process stays alive (zombie), consensus dead, P2P/RPC still up.
4. Cold restart	All `status=restarting`, `exit_code=2` (unrecovered panic) — crash via `Handshaker.replayBlock`. Log: `panic: halt height 10 reached, node shutting down`
5. Resume	After setting `halt_height=0`, chain resumes from height 11, advancing. All healthy.

#5495 behavior (this PR)

Step	Result
1. Start	10 nodes, all log `INFO Halt height configured {"height": 10}`
2. Warm halt	All 10 log `INFO Halt height reached, shutting down`
3. Exit status	All `status=exited`, `exit_code=0`: clean shutdown, no zombies
4. Cold restart	All `status=restarting`, `exit_code=1`, error: `halt height 10 already reached, remove or increase halt_height in config before restarting`
5. Resume	After setting `halt_height=0`, chain resumes from height 11, advancing. All healthy.

Edit: Cleaned up Claude comparison.

moul · 2026-04-14T20:27:08Z

Two valid approaches imo:

1. Exit 0 (this PR): clean shutdown, easily distinguishable from failures; any restart policy that only retries on non-zero exit won't loop. Simple and explicit.
2. Full zombie (extended #5334 behavior): keep the process alive but go further and also shut down P2P and RPC. This makes the halt state unambiguous: the node is intentionally frozen, not broken. No connectivity confusion, no false "still healthy" RPC responses.

Both are better than the current behavior. I don't have a strong preference between the two, but I'd avoid the in-between state where consensus is dead but P2P/RPC are still serving.
@tbruyelle @aeddi, what's your take? If we go with option 2 we probably want a follow-up to #5334 rather than merging this as-is.

tbruyelle · 2026-04-14T20:53:59Z

OK so if I understand correctly there is a discrepancy between the SDK and Gno on cold restart. After the halt height is reached, a cold restart exits the node with code=2 in Gno, while in the SDK it stays "zombie", just like it was before the cold restart.

As usual, I tend to replicate the SDK behaviour because it is more battle-tested. So, as @moul calls it, the 'full zombie' approach.

But then why do you want to kill the P2P reactor? The zombie mode was specifically designed because:

> Exiting the process can cause issues for other nodes that start to lose connectivity with the exiting nodes, thus this module prefers to just halt but not exit.

If you kill P2P you fall into these issues.

aeddi · 2026-04-15T10:21:26Z

TBH, I don't really get the point of the original spec (keeping the node half-alive with RPC and P2P on), and the cosmos-sdk docs don't give much info on the reasoning behind it, so it's hard to judge without more context.
But what I do know is that during our test12 runs, nobody intuitively understood why the process stayed alive; it just felt confusing, like a bug, to validator admins.

Personally, I see even less point in the 'full-zombie' mode (why not just exit the process if the P2P / RPC are down too?), but I do understand the value of sticking to the cosmos-sdk behavior as @tbruyelle suggested.

So my take:

- Either we mirror the cosmos-sdk behavior exactly for the sake of consistency and interop (zombie mode with consensus off + RPC/P2P on).
- Or we change the behavior and have the node shut down gracefully with an exit 0, which IMO is the cleanest and most intuitive approach.

I'm fine with either.

tbruyelle · 2026-04-17T12:54:25Z

> TBH, I don't really get the point of the original spec (keeping the node half-alive with RPC and P2P on), and the cosmos-sdk docs don't give much info on the reasoning behind it, so it's hard to judge without more context.

I think the intention here is to avoid the 'waiting for peers' step.

This can happen before or after halt_height actually:

- before halt_height: all peers that had an open TCP connection to a terminated node get a connection reset. Depending on how many validators halt at the same time, this can trigger a wave of peer reconnection attempts. Because of that I assume some validators might have a hard time committing the last block.
- after halt_height: when the new binary starts, nodes may struggle to find peers and stall at "waiting for peers".

It's like a graceful relay state: old nodes hold connections, new nodes bootstrap off them. Peer discovery and reconnection take time, so holding connections allows smoother upgrades.

fix(node): ensure halt_height triggers clean process shutdown

f53862c

github-project-automation Bot added this to 💪 Bounties & Worx Apr 14, 2026

github-project-automation Bot moved this to Triage in 🧙‍♂️Gno.land development Apr 14, 2026

github-project-automation Bot added this to 🧙‍♂️Gno.land development Apr 14, 2026

github-actions Bot assigned aeddi Apr 14, 2026

github-actions Bot added the 📦 🌐 tendermint v2 Issues or PRs tm2 related label Apr 14, 2026

chore: fix lint error (errname)

ba5991f

aeddi requested review from moul and tbruyelle April 14, 2026 07:37

Merge branch 'chain/gnoland1' into fix-halt-height

84664d7

aeddi wants to merge 3 commits into gnolang:chain/gnoland1 from aeddi:fix-halt-height


5 participants
