
fix(node): ensure halt_height triggers clean process shutdown #5495

Open
aeddi wants to merge 3 commits into gnolang:chain/gnoland1 from aeddi:fix-halt-height

Conversation

@aeddi
Contributor

@aeddi aeddi commented Apr 14, 2026

fix(node): ensure halt_height triggers clean process shutdown

Problem

PR #5334 added halt_height for coordinated chain upgrades. The intent was for the node to panic deterministically in BeginBlock when the halt height is exceeded, so the process exits and no extra blocks can be processed.

However, the panic is caught by the consensus receiveRoutine's blanket recover() handler (consensus/state.go:612). This handler logs CONSENSUS FAILURE!!! and stops the WAL, but does not shut down the node process. The result is a zombie node: consensus was killed but P2P, RPC, and the main goroutine continue running indefinitely.

On cold restart, the behavior differs: the panic occurs during Handshaker.replayBlock on the main goroutine (no recover()), so the process does crash, but with an unrecovered panic and stack trace rather than a clean error message.

Fix

Typed error: Replace the raw panic(string) in BeginBlock with panic(types.HaltHeightReachedError{...}), a new error type defined in tm2/pkg/bft/types/errors.go (alongside existing domain errors). This lets both the consensus and node packages distinguish halt-height panics from unexpected failures via direct type assertion on the recover() value.
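A minimal sketch of the typed-panic pattern described above, not the code from this PR. The `Height` field, the `beginBlock` stand-in, the `isHaltPanic` helper, and the exact halt condition (`height > haltHeight`, with `0` meaning disabled) are assumptions made for illustration.

```go
package main

import "fmt"

// HaltHeightReachedError stands in for the typed error named in the PR.
// The Height field is an assumption; the real struct may carry other data.
type HaltHeightReachedError struct {
	Height int64
}

func (e HaltHeightReachedError) Error() string {
	return fmt.Sprintf("halt height %d reached", e.Height)
}

// beginBlock is a hypothetical stand-in for the application's BeginBlock.
// Assumption: haltHeight == 0 disables the check, and the panic fires once
// the block height exceeds haltHeight.
func beginBlock(height, haltHeight int64) {
	if haltHeight > 0 && height > haltHeight {
		panic(HaltHeightReachedError{Height: haltHeight})
	}
}

// isHaltPanic runs fn and reports whether it panicked with a
// HaltHeightReachedError. Any other panic value is re-raised untouched.
func isHaltPanic(fn func()) (halted bool) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		// Direct type assertion on the recovered value, as the PR describes.
		if _, ok := r.(HaltHeightReachedError); ok {
			halted = true
			return
		}
		panic(r)
	}()
	fn()
	return false
}

func main() {
	fmt.Println(isHaltPanic(func() { beginBlock(11, 10) })) // past the halt height
	fmt.Println(isHaltPanic(func() { beginBlock(10, 10) })) // at the halt height
}
```

Panicking with a value type rather than a string is what makes the assertion possible: a `panic(string)` can only be matched by comparing message text.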

Warm halt (consensus running, halt height reached during block production):

  • The receiveRoutine recovery handler detects HaltHeightReachedError.
  • Logs INFO "Halt height reached, shutting down" (not ERROR — this is expected behavior).
  • Calls osm.Kill() which sends SIGTERM to the process, triggering the normal graceful shutdown path (Node.OnStop cleanup, context cancellation, etc.).
  • Process exits with code 0.
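The warm-halt decision can be sketched roughly as follows. This is an illustration under assumptions, not the handler in `consensus/state.go`: `runStep`, the `outcome` type, and the injected `shutdown` callback are hypothetical, and `shutdown` only stands in for the role the PR assigns to `osm.Kill()`.

```go
package main

import "fmt"

// HaltHeightReachedError stands in for the typed error named in the PR.
// The Height field is an assumption made for this sketch.
type HaltHeightReachedError struct{ Height int64 }

func (e HaltHeightReachedError) Error() string {
	return fmt.Sprintf("halt height %d reached", e.Height)
}

// outcome records what the recovery handler decided (hypothetical type).
type outcome int

const (
	outcomeOK               outcome = iota // step returned normally
	outcomeHalt                            // expected halt, shutdown requested
	outcomeConsensusFailure                // unexpected panic, consensus stops
)

// runStep executes one hypothetical consensus step under a recover() handler.
// shutdown is injected for the sketch; the PR calls osm.Kill() at that point.
func runStep(step func(), shutdown func()) (out outcome) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		if _, ok := r.(HaltHeightReachedError); ok {
			// Expected behavior, so log at INFO and request a graceful stop.
			fmt.Println("INFO Halt height reached, shutting down")
			shutdown()
			out = outcomeHalt
			return
		}
		// Anything else keeps the pre-existing failure path.
		fmt.Printf("ERROR CONSENSUS FAILURE!!! err=%v\n", r)
		out = outcomeConsensusFailure
	}()
	step()
	return outcomeOK
}

func main() {
	shutdown := func() { fmt.Println("(graceful shutdown requested)") }
	runStep(func() { panic(HaltHeightReachedError{Height: 10}) }, shutdown)
	runStep(func() { panic("unexpected failure") }, shutdown)
}
```

Injecting `shutdown` keeps the branch testable without signalling the running process.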

Cold restart (halt height already exceeded when node starts):

  • doHandshake in node.go catches the HaltHeightReachedError panic during block replay and converts it to an error.
  • The error propagates normally through NewNode → execStart, which prints a clear message and exits.
  • Process exits with code 1.
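For the cold-restart path, converting the panic into an ordinary error might look like the sketch below. The `handshake` wrapper and the `Height` field are hypothetical, not the actual `doHandshake` signature; the error text is taken from the manual test output quoted in this PR.

```go
package main

import "fmt"

// HaltHeightReachedError stands in for the typed error named in the PR.
// The Height field is an assumption made for this sketch.
type HaltHeightReachedError struct{ Height int64 }

func (e HaltHeightReachedError) Error() string {
	return fmt.Sprintf("halt height %d reached", e.Height)
}

// handshake is a hypothetical wrapper around block replay. It turns a
// halt-height panic into a returned error and re-raises every other panic.
func handshake(replay func()) (err error) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}
		halt, ok := r.(HaltHeightReachedError)
		if !ok {
			panic(r) // unexpected failures still crash with a stack trace
		}
		err = fmt.Errorf(
			"halt height %d already reached, remove or increase halt_height in config before restarting",
			halt.Height,
		)
	}()
	replay()
	return nil
}

func main() {
	err := handshake(func() { panic(HaltHeightReachedError{Height: 3}) })
	if err != nil {
		// A real command would print this and exit with a non-zero status.
		fmt.Println(err)
	}
}
```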

Testing

Tested manually with a single-validator gnoland start --lazy:

  • Set halt_height = 3, started node at height 3
  • Node committed block 3, logged INFO Halt height reached, shutting down, shut down gracefully, exit code 0
  • Restarted with same config: printed halt height 3 already reached, remove or increase halt_height in config before restarting, exit code 1
  • Set halt_height = 0, restarted: chain resumed from height 5, producing blocks normally

@Gno2D2
Collaborator

Gno2D2 commented Apr 14, 2026

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

No automated checks match this pull request.

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 18.75000% with 13 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tm2/pkg/bft/consensus/state.go | 0.00% | 5 Missing and 1 partial ⚠️ |
| tm2/pkg/bft/node/node.go | 28.57% | 4 Missing and 1 partial ⚠️ |
| tm2/pkg/bft/types/errors.go | 0.00% | 2 Missing ⚠️ |


@aeddi aeddi requested review from moul and tbruyelle April 14, 2026 07:37
@tbruyelle
Contributor

It's not a bug, it's actually a feature. See https://github.com/cosmos/cosmos-sdk/blob/main/docs/docs/build/building-apps/03-app-upgrade.md#halt-behavior

> To perform the actual halt of the blockchain, the upgrade keeper simply panics which prevents the ABCI state machine from proceeding but doesn't actually exit the process. Exiting the process can cause issues for other nodes that start to lose connectivity with the exiting nodes, thus this module prefers to just halt but not exit.

Another issue I see with exiting the process is that, in some infrastructure setups, any killed process would be restarted automatically, resulting in an infinite restart loop.

Hence I think this PR should be closed.

@aeddi
Contributor Author

aeddi commented Apr 14, 2026

> It's not a bug, it's actually a feature. See cosmos/cosmos-sdk@main/docs/docs/build/building-apps/03-app-upgrade.md#halt-behavior

I didn't realize that was expected behavior. cc @moul.
I’m missing some context, but I’m not sure I see the point of keeping node connections active when the chain is halted.

> Another issue I see with exiting the process is that, in some infrastructure setups, any killed process would be restarted automatically, resulting in an infinite restart loop.

By returning an exit code 0, the idea is that validator admins can configure their setup to only restart on failure, and not restart if the node exits gracefully.
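As a rough illustration of the "restart only on failure" policy mentioned above, here is a toy supervisor loop. It is a sketch under assumptions: `supervise`, its restart budget, and the simulated exit codes are hypothetical, and real deployments would normally rely on their process manager's own restart policy rather than code like this.

```go
package main

import "fmt"

// supervise starts run repeatedly until it reports exit code 0 or the
// restart budget is spent. It returns how many times run was started and
// the last exit code seen. Both the name and the policy are hypothetical.
func supervise(run func() int, maxRestarts int) (starts int, lastCode int) {
	for {
		starts++
		lastCode = run()
		// Exit code 0 means a deliberate, clean stop: do not restart.
		if lastCode == 0 || starts > maxRestarts {
			return starts, lastCode
		}
	}
}

func main() {
	// Simulated node: crashes twice, then exits cleanly.
	codes := []int{2, 2, 0}
	i := 0
	run := func() int {
		code := codes[i]
		i++
		return code
	}

	starts, last := supervise(run, 5)
	fmt.Printf("starts=%d lastCode=%d\n", starts, last) // prints "starts=3 lastCode=0"
}
```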

> Hence I think this PR should be closed.

You're probably right, especially if we want to keep the exact same behavior. I'll wait for @moul's input.

@aeddi
Contributor Author

aeddi commented Apr 14, 2026

Discussed with @moul and we've decided to keep the behavior as described in the cosmos-sdk doc. Just following up on our chat, here’s a summary of the diff before vs. after this PR.

Before we close this PR, do we want to stick to the original behavior for cold restarts too (exit 2 on unrecovered panic vs exit 1 on clean error)?

#5334 behavior (original PR)

| Step | Result |
|---|---|
| 1. Start | 10 nodes, all log INFO Halt height configured {"height": 10} |
| 2. Warm halt | All 10 log ERROR CONSENSUS FAILURE!!! with full stack trace + "halt height 10 reached, node shutting down" |
| 3. Process state | All status=running, RPC responds with height=10: the process stays alive (zombie), consensus dead, P2P/RPC still up. |
| 4. Cold restart | All status=restarting, exit_code=2 (unrecovered panic): crash via Handshaker.replayBlock. Log: panic: halt height 10 reached, node shutting down |
| 5. Resume | After setting halt_height=0, chain resumes from height 11, advancing. All healthy. |

#5495 behavior (this PR)

| Step | Result |
|---|---|
| 1. Start | 10 nodes, all log INFO Halt height configured {"height": 10} |
| 2. Warm halt | All 10 log INFO Halt height reached, shutting down |
| 3. Exit status | All status=exited, exit_code=0: clean shutdown, no zombies |
| 4. Cold restart | All status=restarting, exit_code=1, error: halt height 10 already reached, remove or increase halt_height in config before restarting |
| 5. Resume | After setting halt_height=0, chain resumes from height 11, advancing. All healthy. |

Edit: Cleaned up Claude comparison.

@moul
Member

moul commented Apr 14, 2026

Two valid approaches imo:

  • Exit 0 (this PR): clean shutdown, easily distinguishable from failures; any restart policy that only retries on non-zero exit won't loop. Simple and explicit.
  • Full zombie (extended #5334 behavior): keep the process alive but go further: also shut down P2P and RPC. This makes the halt state unambiguous: the node is intentionally frozen, not broken. No connectivity confusion, no false "still healthy" RPC responses.

Both are better than the current behavior. I don't have a strong preference between the two, but I'd avoid the in-between state where consensus is dead but P2P/RPC are still serving.
@tbruyelle @aeddi, what's your take? If we go with option 2 we probably want a follow-up to #5334 rather than merging this as-is.

@tbruyelle
Contributor

OK, so if I understand correctly, there is a discrepancy between the SDK and Gno on cold restart. After the halt height is reached, a cold restart exits the node with code=2 in Gno, while in the SDK it stays "zombie", just like it was before the cold restart.

As usual, I tend to replicate the SDK behaviour because it is more battle-tested. So, as @moul calls it, the 'full zombie' approach.

But then why do you want to kill the P2P reactor? The zombie mode was specifically designed because

> Exiting the process can cause issues for other nodes that start to lose connectivity with the exiting nodes, thus this module prefers to just halt but not exit.

If you kill P2P you fall into these issues.

@aeddi
Contributor Author

aeddi commented Apr 15, 2026

TBH, I don't really get the point of the original spec (keeping the node half-alive with RPC and P2P on), and the cosmos-sdk docs don't give much info on the reasoning behind it, so it's hard to judge without more context.
What I do know is that during our test12 runs, nobody intuitively understood why the process stayed alive; it just felt confusing, like a bug, to validator admins.

Personally, I see even less point in the 'full-zombie' mode (why not just exit the process if the P2P / RPC are down too?), but I do understand the value of sticking to the cosmos-sdk behavior as @tbruyelle suggested.

So my take:

  • Either we mirror the cosmos-sdk behavior exactly for the sake of consistency and interop (zombie mode with consensus off + RPC/P2P on).
  • Or we change the behavior and have the node shut down gracefully with an exit 0, which IMO is the cleanest and most intuitive approach.

I'm fine with either.

@tbruyelle
Contributor

tbruyelle commented Apr 17, 2026

> TBH, I don't really get the point of the original spec (keeping the node half-alive with RPC and P2P on), and the cosmos-sdk docs don't give much info on the reasoning behind it, so it's hard to judge without more context.

I think the intention here is to avoid the 'waiting for peers' step.

This can happen before or after halt_height, actually:

  • before halt_height: all peers that had an open TCP connection to a terminated node get a connection reset. Depending on how many validators halt at the same time, this can trigger a wave of peer reconnection attempts. Because of that, I assume some validators might have a hard time committing the last block.
  • after halt_height: when the new binary starts, nodes may struggle to find peers and stall at "waiting for peers".

It's like a graceful relay state: old nodes hold connections, new nodes bootstrap off them. Peer discovery and reconnection take time, so holding connections allows smoother upgrades.


Labels

📦 🌐 tendermint v2 Issues or PRs tm2 related


5 participants