test(hf-glue): hardfork end-to-end testbed (DO NOT MERGE) — integrates #5511 + #5376 (#5486)

Draft
moul wants to merge 128 commits into gnolang:master from moul:moul/hf-glue-experimental

@moul moul commented Apr 13, 2026

⚠️ HIGHLY EXPERIMENTAL — THIS IS A META PR ⚠️

This PR's purpose is integration + proof. The actual code changes are being extracted into smaller, single-concern PRs that land on master or on #5511 directly. As those land, this PR rebases onto them and shrinks automatically.

How to review

Pick whichever level you prefer:

  • Review the small PRs — each one is narrow, has its own tests, and targets the minimum base branch it needs. See table below.
  • Review this meta PR — one place to see the full end-to-end story, with reports, scripts, and a reproducible testbed (misc/hf-glue/). Useful when you want to check that all the small PRs compose into something that actually works against real gnoland1 state.

When a small PR merges, we merge master (or its base) back into this branch; the merged content drops out and the meta PR gets smaller. Easier than one 3000-LOC review.

Extracted PRs

| # | target | scope | lines |
|---|---|---|---|
| #5540 | #5511 | fix(tm2/sdk): InitialHeight > 1 support + feat(gnoland): genesis-mode PastChainIDs[0] + feat(hardfork): --patch-realm | ~200 |
| #5533 | #5511 | contribs/tx-archive: hardfork-replay readiness (metadata, SignerInfo brute-force, progress log, gas-replay report) | ~450 |
| #5535 | master | contribs/tx-archive: register chain stdlib amino types | ~5 |

Meta PR content (this branch)

Once #5540 + #5533 + #5535 land, this PR becomes purely:

  • misc/hf-glue/ — the test harness (Makefile, docker-compose with gnoland + gnoweb, scripts/migrate.sh declarative DSL, scripts/lib/hf.sh helpers, scripts/{replay-log,report-replay,check-state}.sh reports, fixvalidator/)
  • misc/hf-glue/README.md with the DO-NOT-MERGE banner

~1 kLOC of test harness that stays out of the default compiled surface.

What the testbed proves (against rpc.gno.land, halt @ 704052)

Running make fetch && make init && make up:

  • Pulls a 192 MB gnoland1 genesis + 2 637 historical txs (~12 min via contribs/tx-archive)
  • Builds a hardfork genesis with PastChainIDs, per-tx SignerInfo (account_num + brute-forced sequence)
  • Boots a single-validator gnoland-1 node in Docker with 0 / 2715 tx failures on replay
  • Produces block 704053+, survives restart, renders 1:1 against prod gno.land for every sampled realm (r/sys/*, r/gov/dao, r/gnoland/blog, r/gnoland/coins, r/gnoland/wugnot all ✅)
  • manfred's account_num=3096261, sequence=31 match production exactly — proof the SignerInfo brute-force landed correctly
  • Delivers a realm upgrade inside the fork: r/sys/params gains #5368's halt.gno via --patch-realm, without any post-deploy re-addpkg dance
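
The "brute-forced sequence" step above can be pictured with a small sketch. It assumes the signed payload covers the chain ID, account number, and sequence, so an unknown sequence can be recovered by trying candidates until the recorded signature verifies. Everything here is illustrative: `signBytes`, `recoverSequence`, and `demoRecover` are hypothetical names, the payload layout is invented, and ed25519 from the Go standard library stands in for whatever key type the chain actually uses. It is not the tx-archive implementation.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// signBytes is a HYPOTHETICAL stand-in for the chain's canonical sign payload.
// It only captures the idea that chain ID, account number and sequence are signed.
func signBytes(chainID string, accountNum, sequence uint64, msg string) []byte {
	return []byte(fmt.Sprintf("%s|%d|%d|%s", chainID, accountNum, sequence, msg))
}

// recoverSequence tries every sequence in [0, maxSeq] and returns the first
// one under which the recorded signature verifies.
func recoverSequence(pub ed25519.PublicKey, sig []byte, chainID string, accountNum, maxSeq uint64, msg string) (uint64, bool) {
	for seq := uint64(0); seq <= maxSeq; seq++ {
		if ed25519.Verify(pub, signBytes(chainID, accountNum, seq, msg), sig) {
			return seq, true
		}
	}
	return 0, false
}

// demoRecover signs a message at a known sequence, then recovers that
// sequence by search, as if it had never been recorded.
func demoRecover(accountNum, trueSeq, maxSeq uint64) (uint64, bool) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return 0, false
	}
	sig := ed25519.Sign(priv, signBytes("gnoland1", accountNum, trueSeq, "example msg"))
	return recoverSequence(pub, sig, "gnoland1", accountNum, maxSeq, "example msg")
}

func main() {
	// Account number and sequence are the figures quoted in the list above.
	seq, ok := demoRecover(3096261, 31, 1000)
	fmt.Println(seq, ok)
}
```

The search is linear in the sequence bound, which is acceptable for a one-off export but is the reason the recovered value is worth persisting in `SignerInfo` rather than recomputing at replay time.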

Open work (tracked in this PR, not in small PRs yet)

  • Valset swap in r/sys/validators/v2 — consensus works off GenesisDoc.Validators (which fixvalidator rewrites to our single local key), but the realm still lists the original 7 gnoland1 validators. A post-history "migration tx" is the fix; scripts/lib/hf.sh has the hook and doc-comment, misc/hardfork doesn't plumb it yet.
  • Further refactor: move misc/hardfork features into contribs/gnogenesis. gnogenesis grows a workdir representation (dir of decoded sub-files) so every patch mutates one sub-file rather than re-marshalling the whole 192 MB genesis. contribs/tx-archive becomes the only thing that talks to a chain (download / blockstore export); gnogenesis stays purely filesystem. Design notes are in a comment on #5486.

AI disclosure

Built with Claude Code. Reproduce with cd misc/hf-glue && make fetch && make init && make up; reports in out/*.md.

moul and others added 30 commits March 16, 2026 21:20
Signed-off-by: moul <94029+moul@users.noreply.github.com>
Co-authored-by: moul <94029+moul@users.noreply.github.com>
Co-authored-by: aeddi <antoine.e.b@gmail.com>
Co-authored-by: Antoine Eddi <5222525+aeddi@users.noreply.github.com>
Co-authored-by: Morgan <git@howl.moe>
Co-authored-by: Morgan Bazalgette <morgan@morganbaz.com>
Enables GovDAO to propose a coordinated chain halt at a specific block
height without requiring every operator to pass a CLI flag. This is the
governance-driven counterpart to the --halt-height CLI flag.

Changes:
- Add `NewSetHaltHeightRequest(height int64)` to `r/sys/params` realm,
  allowing GovDAO to vote on halting the chain at a target block.
- Add `nodeParamsKeeper` to validate `node:p:halt_height` params.
- Register the "node" module in the params keeper so halt_height can
  be set via governance proposals.
- Extend `EndBlocker` to read `node:p:halt_height` from the params
  store and call `osm.Kill()` when the halt height is reached.

Usage:
  // Create and submit a GovDAO proposal to halt at block 100000
  pr := params.NewSetHaltHeightRequest(100_000)
  id := dao.MustCreateProposal(cross, pr)

  // After approval and execution, all nodes will halt at block 100000

Generated with [Claude Code](https://claude.com/claude-code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Extends the GovDAO halt proposal with a mandatory minimum binary version
field. When set, nodes refuse to restart unless their version satisfies
the requirement, preventing an old binary from accidentally resuming a
chain that was halted for an upgrade.

- `NewSetHaltRequest(height, minVersion)` sets both `node:p:halt_height`
  and `node:p:halt_min_version` atomically in one GovDAO proposal.
- `checkNodeStartupParams` runs at node startup (after state is loaded)
  and compares `version.Version` against the stored `halt_min_version`.
- `meetsMinVersion` / `parseGnolandVersion` handle the "chain/gnolandX.Y"
  version format used for gno.land chain releases, with a string-equality
  fallback for other formats.

Example: setting minVersion="chain/gnoland1.1" will allow 1.1 and newer
to start, but reject 1.0 ("develop" also rejected unless it matches).
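
A rough sketch of the comparison described above, assuming the version string is exactly `chain/gnoland` followed by `X.Y` with integer parts. The names `parseChainVersion` and `meetsMinVersionSketch` are hypothetical and are not the PR's `parseGnolandVersion` / `meetsMinVersion`; treating an empty minimum as "no requirement" is also an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const chainPrefix = "chain/gnoland"

// parseChainVersion extracts (major, minor) from "chain/gnolandX.Y".
// The third result is false when the string does not follow that format.
func parseChainVersion(v string) (int, int, bool) {
	rest, found := strings.CutPrefix(v, chainPrefix)
	if !found {
		return 0, 0, false
	}
	majStr, minStr, found := strings.Cut(rest, ".")
	if !found {
		return 0, 0, false
	}
	major, errMaj := strconv.Atoi(majStr)
	minor, errMin := strconv.Atoi(minStr)
	if errMaj != nil || errMin != nil {
		return 0, 0, false
	}
	return major, minor, true
}

// meetsMinVersionSketch compares numerically when both versions parse and
// falls back to plain string equality otherwise.
func meetsMinVersionSketch(current, minVersion string) bool {
	if minVersion == "" {
		return true // assumption: no stored requirement means no restriction
	}
	curMaj, curMin, okCur := parseChainVersion(current)
	reqMaj, reqMin, okReq := parseChainVersion(minVersion)
	if !okCur || !okReq {
		return current == minVersion
	}
	if curMaj != reqMaj {
		return curMaj > reqMaj
	}
	return curMin >= reqMin
}

func main() {
	for _, v := range []string{"chain/gnoland1.0", "chain/gnoland1.1", "chain/gnoland1.2", "develop"} {
		fmt.Println(v, meetsMinVersionSketch(v, "chain/gnoland1.1"))
	}
}
```

Comparing the parts as integers rather than as strings keeps a later release such as 1.10 ordered after 1.9.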

Generated with [Claude Code](https://claude.com/claude-code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
…g#5400)

TypeCheckMemPackage only writes a package to permCache when it is
reached as a dependency via ImportFrom (canPerm=true). The root package
of each call is never self-stored. This left 22 "leaf" stdlibs (packages
not imported by any other stdlib, e.g. time, regexp, math/rand) absent
from vm.typeCheckCache on every node startup.

On a production cold-start node (LoadStdlib, CacheStdlibLoad=false), the
cache was entirely empty — every stdlib import in a user tx required a
GetMemPackage store read (8 gas/byte). On a restarted node (Initialize),
the 22 leaf stdlibs were still missing. This caused non-deterministic
gas consumption: nodes that had restarted disagreed with genesis-fresh
nodes on tx gas, triggering a consensus halt on gnoland1 at block
352922.

Fix: capture the *types.Package return value from each
TypeCheckMemPackage call in the init loop and store it directly into
opts.Cache. Applied to all three initialization paths: Initialize,
LoadStdlibCached, and LoadStdlib.

The LoadStdlib change additionally routes the cache to vm.typeCheckCache
directly (instead of the per-tx context clone) so the results survive
beyond the initialization transaction.
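
To make the failure mode concrete, here is a toy model of the caching rule described above. It is only a sketch under stated assumptions: the three-package import graph is invented, `typeCheck` and `initCache` are hypothetical names, and a string stands in for the `*types.Package` return value. It shows why packages that nothing else imports stay out of the cache unless the init loop stores the root result itself.

```go
package main

import (
	"fmt"
	"sort"
)

// imports is a toy dependency graph; the edges are illustrative only.
var imports = map[string][]string{
	"errors":  nil,
	"strconv": {"errors"},
	"time":    {"errors"},
}

// typeCheck models the described behaviour: a package reached as a
// dependency is cached, the root of the call is not. The returned name
// stands in for the checked package.
func typeCheck(root string, cache map[string]bool) string {
	for _, dep := range imports[root] {
		if !cache[dep] {
			typeCheck(dep, cache)
			cache[dep] = true // stored because it was reached through an import
		}
	}
	return root
}

// initCache runs the init loop over order. With storeRoot=true it applies
// the fix: the result of each root call is stored as well.
func initCache(order []string, storeRoot bool) []string {
	cache := map[string]bool{}
	for _, pkg := range order {
		checked := typeCheck(pkg, cache)
		if storeRoot {
			cache[checked] = true
		}
	}
	names := make([]string, 0, len(cache))
	for name := range cache {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	order := []string{"errors", "strconv", "time"}
	fmt.Println("without root store:", initCache(order, false))
	fmt.Println("with root store:   ", initCache(order, true))
}
```

In this model `strconv` and `time` are the leaf packages: nothing imports them, so only the `storeRoot` variant ends up with them cached.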

Verified by:
- TestTypeCheckCacheContainsAllStdlibs: asserts all InitOrder() stdlibs
are present in vm.typeCheckCache after both cold and warm
initialization.
- TestAddPkgGasWithTypeCheckCache: asserts identical gas for a
strconv-importing addpkg regardless of typeCheckCache state (was 7M cold
vs 2.1M warm before).
- addpkg_stdlib_typecheckcache.txtar: deploys a time-importing package
with gas_wanted=2700000; succeeds at ~2.3M with fix, OOGs at ~3.2M
without.

This is a hotfix, so it uses chain/gnoland1 as its base. It may change gas
results on the chain, so we still need to figure out:

1. A migration strategy for the existing nodes (to re-run block 352922).
2. The impact on validators that join the network afterwards. This PR
appears to change the gas values of all transactions, including genesis
transactions, so we need to understand whether nodes would still validate
transactions correctly with lower gas values, or whether those
transactions are no longer valid, which would require a chain restart
that re-runs the transactions.
…nolang#5409)

- Add `gnoland version` subcommand mirroring `gno version` and `gnokey
version`
- Add `BuildVersion`/`build_version` field to `ResultStatus` (RPC
/status endpoint), populated from `tm2/pkg/version.Version`
- Inject version via ldflags in Dockerfile, computed from git at build
time; all build stages now read from a shared build_version file written
in setup-gnocore

goreleaser already has injection of the version, so no changes needed
there.

---------

Co-authored-by: moul <94029+moul@users.noreply.github.com>
…ng#5410)

## Summary

Adds `contribs/gnobr` — a block rollback tool for gnoland validators. It
trims the blockstore to a target height, patches the app hash in
state.db, and wipes app state so gnoland replays all blocks locally on
restart. No network access or special binary patches needed.

### Usage

```bash
# Build from the gno repo
cd contribs/gnobr && go build -o gnobr .

# Stop your node, then run:
gnobr --data-dir gnoland-data --drop-after 352921 \
  --app-hash 14BD8BB9FAD9869B86F1BFFD1A16DD3A02C3534323F6E15121025BE5DFDC9C51

# Restart your node — it replays blocks 1..352921 locally from its own blockstore.
```

### What it does

1. **Trims blockstore.db** — removes all blocks after the target height
2. **Patches state.db** — updates the AppHash to the correct value (via
`--app-hash`) so the Handshaker doesn't panic on mismatch
3. **Wipes gnolang.db** — forces the app to replay from genesis
4. **Wipes WAL** — removes stale write-ahead log
5. **Resets priv_validator_state.json** — prevents double-signing

On restart, gnoland's Handshaker sees `appHeight=0, storeHeight=N,
stateHeight=N`, runs InitChain, then replays all N blocks from the local
blockstore. Zero network access needed.

### Flags

| Flag | Description |
|---|---|
| `--data-dir` | Path to gnoland data directory (default: `gnoland-data`) |
| `--drop-after` | Keep blocks up to this height, drop everything after |
| `--app-hash` | Hex-encoded app hash to write into state.db |
| `--dry-run` | Show what would be done without modifying anything |

### Why

During the gnoland1 chain halt at height 352922, validators committed a
block with a divergent app hash. The `chain/gnoland1.1` tag fixes the
root cause, but validators who committed the bad block can't just update
the binary — state.db contains the wrong app hash, causing a panic on
replay. This tool patches it cleanly.

### Tested

Successfully tested on gnoland1 (val1.moul.p2p.team):
- Restored from backup, ran gnobr, restarted with clean
`chain/gnoland1.1` binary (no patches)
- Node replayed all 352921 blocks locally, reached correct app hash
`14BD8BB9...`

<details>
<summary>Contributors' checklist</summary>

- [x] Added new tests, or not needed, or not feasible
- [x] Provided an example (e.g. screenshot) to aid review or the PR is
self-explanatory
- [x] Updated the official documentation or not needed
- [x] No breaking changes were made, or a `BREAKING CHANGE: xxx` message
was included in the description
- [x] Added `benchmarks` label to the PR or not needed
</details>
Aligns with gnolang#5334's approach: GovDAO EndBlocker now sets the halt height
on BaseApp, which panics in BeginBlock of the next block. This is
deterministic (no async signals) and ensures the halted block is fully
committed.
…es (gnolang#5334)

## Summary

Adds a halt height mechanism for coordinated chain upgrades. The node
stops after committing the specified block height.

### How to set it

```bash
gnoland config set halt_height 352922
```

Or edit `config.toml` directly:
```toml
halt_height = 352922
```

### How it works

1. After `finalizeCommit`, consensus checks if `height >= halt_height`
2. If so, calls `osm.Kill()` for a graceful shutdown
3. The check is at the consensus level (not ABCI), following the same
pattern as `WithEarlyStart`
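
The check in step 1 can be sketched as a pure function. This is only an illustration: `shouldHalt` and `runUntilHalt` are hypothetical names, and treating `halt_height = 0` as "disabled" is an assumption, not something stated above. The loop stands in for consensus committing blocks; the real node calls `osm.Kill()` where the sketch breaks out.

```go
package main

import "fmt"

// shouldHalt reports whether the node should stop after committing the
// given height. Assumption: a halt height of 0 means the feature is off.
func shouldHalt(committed, haltHeight int64) bool {
	return haltHeight > 0 && committed >= haltHeight
}

// runUntilHalt simulates committing up to maxBlocks blocks starting at
// start and returns the last height that was fully committed.
func runUntilHalt(start, haltHeight, maxBlocks int64) int64 {
	last := start - 1
	for h := start; h < start+maxBlocks; h++ {
		last = h // the block is committed before the halt check runs
		if shouldHalt(h, haltHeight) {
			break // the real node would shut down gracefully here
		}
	}
	return last
}

func main() {
	fmt.Println(runUntilHalt(352920, 352922, 10))
}
```

Because the check runs after the commit, the halt block itself is always part of the stored chain, which is what lets every validator stop on the same state.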

### Scope and future direction

This is a **temporary coordination tool** for the current chain upgrade.
For the gnoland1 → gnoland-1 hard fork, validators set `halt_height` in
their config, all nodes stop at the same block, then validators swap
binary + config and restart.

After the upgrade, the proper mechanism will be **GovDAO-based halting**
(gnolang#5368), which adds:
- On-chain `halt_height` param set via governance proposal (no manual
config needed)
- `halt_min_version` — prevents old binaries from restarting after halt
- Version guard at startup so validators can't accidentally run the
wrong binary

Once gnolang#5368 is merged and active, `halt_height` in config becomes a
**node operator tool** (e.g., "stop my node at height X for
maintenance") rather than a coordination mechanism. Coordination should
happen through governance.

### No CLI flag — config only

Per @tbruyelle's suggestion, there's no `--halt-height` CLI flag. Config
file is the single source of truth. This avoids the risk of validators
missing the flag in duplicated command setups across their
infrastructure.

### Related

- gnolang#5368 — GovDAO-based halt height + version guard (Phase 2, replaces
this for coordination)
- gnolang#5376 — gnoland-1 chain config
- gnolang#5411 — chain upgrade genesis replay

<details>
<summary>Contributors' checklist</summary>

- [x] Added new tests, or not needed, or not feasible
- [x] Provided an example (e.g. screenshot) to aid review or the PR is
self-explanatory
- [x] Updated the official documentation or not needed
- [x] No breaking changes were made, or a `BREAKING CHANGE: xxx` message
was included in the description
- [x] Added `benchmarks` label to the PR or not needed
</details>
…ght config

Addresses tbruyelle's review feedback:
1. Panic if new binary runs before the chain has halted at halt_height
2. Add skip_upgrade_height config field to bypass the check when the
   validator has already migrated state
Prepares the repository for the gnoland1 → gnoland-1 hard fork:

- Add misc/deployments/gnoland-1/ with:
  - migrate-from-gnoland1.sh: placeholder with a detailed TODO covering
    halt verification, state export, migration transforms (r/sys/params,
    r/gnops/valopers, namereg, gas params), genesis assembly, verification,
    and restart coordination. Exits with an error until implemented.
  - config.toml: copy of gnoland1 config with meter_name=gnoland-1 and
    peer/seed addresses reset (to be filled post-fork).
  - govdao-scripts/: copies of gnoland1 scripts with CHAIN_ID=gnoland-1.
  - README.md: upgrade workflow, what changed, and ⚠️ migration TODO warning.

- Update docs:
  - docs/resources/gnoland-networks.md: Betanet chain ID gnoland1 → gnoland-1
  - docs/resources/gas-fees.md: update --chainid example
  - docs/users/explore-with-gnoweb.md: update Betanet chain ID reference

The migration script is the critical missing piece — the hard fork cannot
happen until it is written and dry-run on test12.

Generated with [Claude Code](https://claude.com/claude-code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
- Revert premature doc references to gnoland-1 chain ID in gas-fees.md
  and explore-with-gnoweb.md (hardfork hasn't happened yet)
- Remove premature "Note" callout from gnoland-networks.md
- Update migrate-from-gnoland1.sh: reflect Scenario A decision (genesis
  tx-replay with InitialHeight), document blockers (gnolang#5411, gnolang#5390,
  Jae's InitialHeight tm2 work), reference issue gnolang#5374 for tracking
- Update gnoland-1/README.md: reflect correct PR merge status, document
  Scenario A approach, list migration blockers explicitly
- Add ChainID field to GnoTxMetadata for tx provenance recording
- Add InitialHeight validation (non-negative) to GenesisDoc.Validate and ValidateAndComplete
- Add test cases: no chain ID override when BlockHeight=0, no override when OriginalChainID unset
- Update ADR: document per-tx vs state-level design choice, mark InitialHeight as implemented end-to-end
…d-1 README

PR gnolang#5373 (valoper fee script) was closed without merging. The valoper
registration fee was already set to 0 via a GovDAO transaction on gnoland1,
so no code change is needed — the state is preserved in genesis replay.
- Fix comment headers: 'gnoland1' → 'gnoland-1' in add-validator.sh and rm-validator.sh
- Fix stale REMOTE default comment: 127.0.0.1:26657 → betanet endpoint
moul added 17 commits April 19, 2026 00:57
The upstream MANIFESTO.md references a PDF in a now-404'd GitHub repo
(github.com/jaekwon/ephesus). Replace with a Wayback Machine wildcard
redirect so the docs linter (which treats remote 4xx as hard-fail)
stops blocking CI on every PR.
web.archive.org redirects + ia*.us.archive.org file hosts both throttle
and 5xx intermittently. Several PRs hit flaky docs/MANIFESTO.md checks
on these exact URLs. Since we use archive.org precisely to point at
links that are already dead upstream, the liveness check adds no
value — skip them.
- Document GasReplayMode field and "source" mode
- Document GasUsed/GasWanted metadata fields
- Document auth.SkipGasMeteringKey context flag
- Document replay report with categorization
- Document RequestInitChain.InitialHeight cross-check (GnoGenesisState.InitialHeight is no longer "informational only")
- Document hardfork tooling: --patch-realm, hardfork test
- Add BaseApp.validateHeight / Info InitialHeight>1 fixes (PR gnolang#5540)
- Add genesis-mode sig verify against PastChainIDs[0] (PR gnolang#5540)
- Mark gas-tolerance and replay-report open items as resolved
- Add docs-linter stability fix note
The hardfork tooling lives alongside other genesis-manipulation tools
under contribs/gnogenesis instead of a standalone misc/hardfork module.

Command mapping:
  misc/hardfork genesis [flags] → gnogenesis fork generate [flags]
  misc/hardfork test [flags]    → gnogenesis fork test [flags]

Changes:
- Move misc/hardfork/*.go → contribs/gnogenesis/internal/fork/
- Rename genesis.go / genesisCfg / execGenesis → generate.go /
  generateCfg / execGenerate (subcommand is now 'generate' under 'fork')
- Register fork.NewForkCmd in contribs/gnogenesis/genesis.go
- Delete misc/hardfork/ (including its separate go.mod — absorbed into
  contribs/gnogenesis module)
- Move PLAN_account_metadata.md → gno.land/adr/pr5511_...
- Update misc/deployments/gnoland-1/generate-genesis.sh to use
  'gnogenesis fork generate' instead of the 'hardfork' binary
- Update ADR pr5489 tooling section

No behavioural changes — same flags, same logic, same tests.
…to moul/hf-glue-experimental

# Conflicts:
#	tm2/pkg/sdk/baseapp.go
- docs/MANIFESTO.md: restore original jaekwon/ephesus URL (upstream is
  back online; wayback redirect no longer needed)
- misc/docs/tools/linter/urls.go: drop archive.org skip — it was a
  workaround for the MANIFESTO URL flake that is now resolved

Leaving gnoland-networks.md etc. untouched (those came in via the
gnolang#5376 / gnolang#5511 merges and are intentional).
- Remove `--overlay-dir` flag and applyOverlay from 'gnogenesis fork
  generate' (the feature was dead code — it errored on any non-empty
  overlay dir and no one used it)
- Remove misc/deployments/gnoland-1/overlay/ (empty feature)
- Simplify misc/deployments/gnoland-1/generate-genesis.sh to SOURCE and
  HALT_HEIGHT env vars only (drop --output/--skip-txs/--debug/positional
  args/EXTRA_ARGS — those are edge-case knobs; users who need them call
  'gnogenesis fork generate' directly)
- Drop TestApplyOverlay_* tests along with the code
Three ADRs (pr5411 superseded, pr5489 mixed concerns, pr5511 PLAN) are
consolidated into two, split by layer:

- tm2/adr/pr5511_initial_height.md
  Focused on tm2 changes: GenesisDoc.InitialHeight,
  RequestInitChain.InitialHeight, BlockchainReactor/state/store/BaseApp
  fixes for InitialHeight > 1, auth.SkipGasMeteringKey.

- gno.land/adr/pr5511_chain_upgrade_genesis_replay.md
  Focused on gno.land app: GnoTxMetadata and GnoGenesisState fields
  (PastChainIDs, GasReplayMode, SignerInfo, gas fields), sequence
  recovery algorithm, genesis replay flow, replay report, gnogenesis
  fork tooling, bugs found, validation.

Cross-linked between the two. Content reflects the final PR state
(overlay mechanism removed, tool absorbed into gnogenesis fork, etc.).
Appends one or more genesis-mode txs at the END of appState.Txs —
i.e. AFTER the replayed historical-tx stream. Repeatable.

FILE is a .jsonl of gnoland.TxWithMetadata (amino JSON per line).
BlockHeight is forced to 0 so each tx is treated as genesis-mode:
chain-id via PastChainIDs[0], sig verify skipped by
--skip-genesis-sig-verification. Blank and # lines are ignored.
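
A minimal sketch of the line filtering described above, assuming that "# lines" means lines whose first non-space character is `#`. `parseMigrationText` is a hypothetical helper; it returns the raw JSON lines and does not attempt the amino decoding into `gnoland.TxWithMetadata` or the `BlockHeight` reset that the real flag performs.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseMigrationText returns the non-blank, non-comment lines of a .jsonl
// document, in order. Each returned line would hold one tx in amino JSON.
func parseMigrationText(text string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(text))
	// Transactions can be long; raise the per-line limit above the 64 KiB default.
	scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)

	var txs []string
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank and comment lines are ignored
		}
		txs = append(txs, line)
	}
	return txs, scanner.Err()
}

func main() {
	input := `# valset reset migration

{"tx":"first"}

{"tx":"second"}
`
	txs, err := parseMigrationText(input)
	fmt.Println(len(txs), err)
}
```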

Use case: chain-specific post-replay migrations that need replayed
state to exist first — e.g. a govDAO proposal that updates
r/sys/validators/v2 so the hardforked chain's in-gno valset matches
the new GenesisDoc.Validators (not just the tm2 config side).

The plumbing lives here; chain configs in misc/deployments/*/ wire it
up with their own migration files.
Adds a post-history migration tx that reconciles r/sys/validators/v2
with the new GenesisDoc.Validators after the gnoland1 → gnoland-1
hardfork. Without this, tm2 consensus would use the new single-
validator set while every in-gno query (valopers, govDAO proposals
touching the valset) would still see the pre-fork gnoland1 set
written by govdao_prop1.

Layout:
  misc/deployments/gnoland-1/migrations/
    01_reset_valset.gno.tmpl   Gno MsgRun body with placeholders for
                               OLD_VALIDATORS_GO + NEW_VALIDATORS_GO.
                               Submits + votes + executes a govDAO
                               proposal via r/sys/validators/v2.NewPropRequest.
    build.sh                   Renders the template with the local
                               priv_validator_key.json (new validator)
                               and the hard-coded initial gnoland1
                               valset (to be removed). Wraps into a
                               signed MsgRun tx and emits migrations.jsonl.

generate-genesis.sh gains a PV_KEY env var; when set it runs build.sh
and passes --migration-tx to `gnogenesis fork generate`.

Sig check passes because migration txs are genesis-mode (BlockHeight=0)
and --skip-genesis-sig-verification is on at replay; the MsgRun executes
as $CALLER (default: manfred, a govDAO T1 member) regardless of which
key signed.
- hf.sh: point at 'gnogenesis fork generate' (was the removed misc/hardfork)
  and plumb hf_migration_tx through new --migration-tx flag
- migrate.sh: call misc/deployments/gnoland-1/migrations/build.sh when
  out/gnoland-home/secrets/priv_validator_key.json exists ("make init"
  output), emit migrations.jsonl, register via hf_migration_tx
- build.sh: fix BSD sed newline bug (use awk), fix gnokey recover
  stdin handshake (\n + mnemonic), derive bech32 gpub1 pubkey via
  'gnoland secrets get' (r/sys/validators/v2 wants bech32 not base64)
- 01_reset_valset.gno.tmpl: rename placeholder tokens in doc comment
  so they don't collide with the substitution
aeddi added a commit to aeddi/gno that referenced this pull request Apr 21, 2026
PR gnolang#5486 commit bd3580d ("refactor: absorb misc/hardfork into 'gnogenesis
fork' subcommand") moved misc/hardfork/ into contribs/gnogenesis/internal/fork/
and rewired the CLI:

  misc/hardfork test  →  gnogenesis fork test

That refactor updated misc/deployments/gnoland-1/generate-genesis.sh but
missed the references in misc/hf-glue/. As a result, scripts/replay-log.sh
(invoked by `make replay-log` and `make reports`) still does:

  cd "$REPO/misc/hardfork"
  go run . test ...

which fails with "No such file or directory" because misc/hardfork no
longer exists on the PR head.

Update the cd target to contribs/gnogenesis and the subcommand to
`fork test` to match the new layout.

Other stale refs in misc/hf-glue/ (Makefile smoketest target,
fetch-from-dir.sh, README, lib/hf.sh comment) are also broken but
addressed in separate commits / left as docs.
moul added 4 commits April 22, 2026 13:51
Adds a 2-node docker cluster variant of the hf-glue testbed to verify
that a hardfork genesis actually drives consensus across connected
validators (not just replays in a single node).

- fixvalidator: --priv-key is now repeatable (names auto-suffixed -N)
- scripts/init-cluster.sh: generate N=$NODES homes under out/cluster/,
  rewrite genesis with all validators, wire per-node config.toml with
  persistent_peers pointing at other nodes over the compose network
- docker-compose.cluster.yml: 2 gnoland services (node0/node1) +
  gnoweb pointing at node0; ports 26656/7 + 36656/7 on host
- migrate.sh + build.sh: PV_KEYS (colon-separated) for cluster-mode
  valset-swap migration (all cluster validators land in r/sys/validators/v2)
- Makefile: cluster-init / cluster-up / cluster-down / cluster-logs /
  cluster-status / cluster-reset / cluster-reset-db

Smoke-tested: init-cluster generates correct peer entries; build.sh
emits migration with both validators; docker compose up starts both
containers; replay verification pending.
README gets a cluster-mode section + new Make targets in the table.

Smoke-test finding: cluster wiring is correct (docker DNS, persistent_peers,
validator-set rewrite, per-node secrets), nodes peer up and cast consensus
votes at the initial height. But ABCI handshake fails on restart with
'block not found for height 1' when genesis has initial_height > 1, because
the consensus WAL replay path expects block 1 to exist. Documented as a
separate upstream issue from the cluster harness itself.
… > 1

The ABCI handshake replay path assumed heights in [1, appBlockHeight+1]
always have a block in the store. For chains that start at InitialHeight > 1
(e.g. a hardfork upgrade that replays historical txs at genesis), heights
below InitialHeight never had a block — LoadBlock returns nil and the
replay errors with "block not found for height 1".

Repro (before this fix):
  - Fresh chain with genesis.initial_height = 2.
  - Node runs InitChainer, saves state.LastBlockHeight = 1.
  - Consensus produces block 2 and stores it; node crashes before app Commit.
  - On restart: appBlockHeight=0, storeBlockHeight=2, stateBlockHeight=1.
  - ReplayBlocks routes to replayBlocks(mutateState=true), which loops
    from appBlockHeight+1 = 1 up to storeBlockHeight-1 = 1. LoadBlock(1)
    returns nil → handshake error, node crash-loops.

Fix: clamp the loop start to max(appBlockHeight+1, InitialHeight). Heights
below InitialHeight are phantom — skip them, let the mutateState
replayBlock apply the real first block (at InitialHeight or above) to the
post-InitChain app.

Regression test TestReplayBlocks_SkipsPhantomHeightsAtInitialHeight: uses
a recording BlockStore that records LoadBlock requests and asserts none
are below InitialHeight. Verified: test fails without the replay.go fix
("LoadBlock(1) was called but 1 < InitialHeight 100") and passes with it.
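
The clamp can be illustrated on its own. The sketch below only computes which heights the loop would load; `replayHeights` is a hypothetical helper, not the function in replay.go, and the final block is left to the separate mutate-state step as described above.

```go
package main

import "fmt"

// replayHeights lists the heights the replay loop would load from the block
// store: from max(appHeight+1, initialHeight) up to storeHeight-1 inclusive.
// Passing clamp=false reproduces the behaviour before the fix.
func replayHeights(appHeight, storeHeight, initialHeight int64, clamp bool) []int64 {
	start := appHeight + 1
	if clamp && initialHeight > start {
		start = initialHeight // heights below InitialHeight never had a block
	}
	var heights []int64
	for h := start; h < storeHeight; h++ {
		heights = append(heights, h)
	}
	return heights
}

func main() {
	// Repro from the commit message: appBlockHeight=0, storeBlockHeight=2,
	// genesis.initial_height=2.
	fmt.Println("before fix:", replayHeights(0, 2, 2, false))
	fmt.Println("after fix: ", replayHeights(0, 2, 2, true))
}
```

For a chain with `InitialHeight = 1` the clamp is a no-op, so the change should not alter replay on existing chains.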
…nconsistency

When baseStore's package-index says 'index N -> path P' but iavlStore has
no MemPackage under path-key(P), GetMemPackage returned nil, IterMemPackage
yielded that nil, and downstream callers (ParseMemPackage et al) SIGSEGV'd
on 'mpkg.Type.(MemPackageType)' with no clue where the nil came from.

Replace the silent nil with a descriptive panic that names the inconsistent
index entry and path. The underlying atomicity issue (how baseStore and
iavlStore can get out of sync across a crash) is a separate investigation;
this patch just surfaces the symptom clearly so the next person reading
the stack trace has a starting point.
@github-actions github-actions Bot added the 📦 🤖 gnovm Issues or PRs gnovm related label Apr 22, 2026
moul added 2 commits April 22, 2026 14:57
…ication

Single-node restart with fresh state + initial_height=2 genesis now works
end-to-end (verified: height=20, restarts=0, no panic) thanks to ca97894.

The earlier 'hardfork handshake replay' description of the cluster issue
was wrong: that bug IS fixed. The remaining cluster crash-loop is a
separate cross-store atomicity bug that only triggers on the 2-validator
consensus path after block 2 (7a88a8a turns the resulting SIGSEGV into
a descriptive panic).
…re atomicity bug

Investigated the cluster crash-loop. Actual root cause is NOT the
InitialHeight>1 replay (that's fixed) and NOT a cluster-specific state
machine bug. It's:

1. Docker Desktop default VM = 7.65 GiB. Two gnoland nodes during
   hardfork replay = ~3.5 GiB each. They fit, barely, until consensus
   and p2p push memory over the edge and the kernel OOM-kills one
   (exitcode=137, no panic, no logs).

2. The OOM SIGKILL lands between two writes in AddMemPackage:
   - baseStore (dbadapter): Set() hits PebbleDB immediately, no buffer
   - iavlStore (iavl): Set() buffers in IAVL tree until CommitMultiStore
   So on-disk has the package index entry but no package data. On
   restart, VMKeeper.Initialize's IterMemPackage yields a nil mpkg
   (indexed path but no iavl entry) and SIGSEGVs.

3. 7a88a8a already replaced the SIGSEGV with a descriptive panic
   identifying the inconsistent index/path — that's the diagnostic
   minimum. Fixing the atomicity properly would move the package
   index under iavlStore (or wrap baseStore in a committable buffer),
   which is a tm2 store-layer change outside this PR's scope.

Single-node works because it rarely crashes mid-commit. Cluster
surfaces the bug because OOM is deterministic.
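
The write ordering in point 2 can be modelled with two maps. This is a toy sketch, not the tm2 store layer: `toyStores` and its methods are hypothetical, the index write is assumed to be durable immediately, and the package data is assumed to stay in memory until commit. The consistency check returns an error where 7a88a8a panics.

```go
package main

import "fmt"

// toyStores models the two stores touched when a package is added.
type toyStores struct {
	index   []string          // models baseStore: every write lands immediately
	data    map[string]string // models iavlStore contents that reached disk
	pending map[string]string // models iavlStore writes buffered until commit
}

func newToyStores() *toyStores {
	return &toyStores{data: map[string]string{}, pending: map[string]string{}}
}

// addPackage records the index entry right away and buffers the package body.
func (s *toyStores) addPackage(path, body string) {
	s.index = append(s.index, path)
	s.pending[path] = body
}

// commit flushes the buffered package bodies.
func (s *toyStores) commit() {
	for path, body := range s.pending {
		s.data[path] = body
	}
	s.pending = map[string]string{}
}

// kill models a SIGKILL: buffered writes are lost, durable ones survive.
func (s *toyStores) kill() {
	s.pending = map[string]string{}
}

// check reports the first index entry that has no package data behind it.
func (s *toyStores) check() error {
	for i, path := range s.index {
		if _, ok := s.data[path]; !ok {
			return fmt.Errorf("package index %d -> %q has no stored package", i, path)
		}
	}
	return nil
}

// simulate adds one package, optionally commits, then kills the process.
func simulate(commitBeforeKill bool) error {
	s := newToyStores()
	s.addPackage("gno.land/r/demo/example", "package example")
	if commitBeforeKill {
		s.commit()
	}
	s.kill()
	return s.check()
}

func main() {
	fmt.Println("kill after commit: ", simulate(true))
	fmt.Println("kill before commit:", simulate(false))
}
```

Either remedy named in point 3 removes the window in this model, because the index entry and the package body would then become durable in the same step.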

- docker-compose.cluster.yml: prominent memory-requirement banner
- README.md: corrected 'known issue' section with actual root cause,
  store-layer explanation, and the workaround (bump Docker VM to >=12G)
moul added a commit that referenced this pull request Apr 27, 2026
These were added with the original 5411 work but aren't needed in this PR.
The hardfork genesis flow now lives in the hf-glue testbed (#5486).
jaekwon pushed a commit that referenced this pull request May 7, 2026
## Overview

Chain hardfork mechanism for gno.land: export all state and historical
transactions from the source chain, replay them during `InitChain` on
the new chain, and start producing blocks at the halted height. Replaces
the original single-`OriginalChainID` design from
[#5411](#5411) with a more flexible
multi-chain model (`PastChainIDs` allowlist + per-tx `ChainID`).

**History:**
- Original work: [#5411](#5411)
- Jae's refinements:
[feat/genesis-replay-upgrade2](https://github.com/gnolang/gno/tree/feat/genesis-replay-upgrade2)
- This PR: builds on top of Jae's work, adds fixes from extensive review
+ end-to-end validation on the full gnoland1 chain via the [hf-glue
testbed](#5486)

## What's in

### tm2 (consensus + SDK)
- **`GenesisDoc.InitialHeight`**: consensus starts block production at
this height after `InitChain`; `Handshaker` sets `state.LastBlockHeight
= InitialHeight - 1`.
- **`BlockchainReactor`, `state`, `store`, validation** all updated to
handle chains where `InitialHeight > 1` (empty block store,
non-contiguous block save, validator set / consensus params persisted at
InitialHeight, etc.)
- **`BaseApp.lastBlockHeight` tracker** (this iteration): real chain
height = `multistoreVersion + initialHeightOffset`, with the offset
persisted under `mainInitialHeightKey` and restored on every restart.
`validateHeight` now enforces strict contiguity against real chain
height; the previous "allow monotonic jump" branch (which permanently
bypassed contiguity for `InitialHeight > 1` chains) is gone.
- **`BaseApp.Info` guard** handle calls before the multistore is loaded.
- **`auth.SkipGasMeteringKey`** context flag that lets `SetGasMeter`
bypass the new VM's gas meter (used for `GasReplayMode="source"`).
- **`RequestInitChain.InitialHeight`** new ABCI field so the app can
cross-check against `GnoGenesisState.InitialHeight`. Amino round-trip
test added.
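
A minimal sketch of the height arithmetic described for the `lastBlockHeight` tracker. The `heightTracker` type is hypothetical, and the sketch assumes the offset equals `InitialHeight - 1` with a multistore version of 0 right after `InitChain`; the real `BaseApp` fields and persistence are not shown.

```go
package main

import "fmt"

// heightTracker is a hypothetical stand-in for the BaseApp fields described above.
type heightTracker struct {
	initialHeightOffset int64 // assumption: InitialHeight - 1
}

// realHeight maps the multistore version to the real chain height.
func (h heightTracker) realHeight(multistoreVersion int64) int64 {
	return multistoreVersion + h.initialHeightOffset
}

// validateHeight accepts only the block that directly follows the real chain height.
func (h heightTracker) validateHeight(multistoreVersion, reqHeight int64) error {
	want := h.realHeight(multistoreVersion) + 1
	if reqHeight != want {
		return fmt.Errorf("invalid height: got %d, expected %d", reqHeight, want)
	}
	return nil
}

func main() {
	h := heightTracker{initialHeightOffset: 99} // InitialHeight = 100

	fmt.Println(h.validateHeight(0, 100)) // first block after InitChain
	fmt.Println(h.validateHeight(1, 101)) // next contiguous block
	fmt.Println(h.validateHeight(1, 105)) // skipped blocks are rejected
}
```

Because the comparison uses the real chain height rather than the raw multistore version, a chain with `InitialHeight > 1` gets the same contiguity guarantee as one starting at height 1.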

### gno.land
- **`GnoGenesisState`** extensions:
  - `PastChainIDs []string`: allowlist of past chain IDs valid for
    signature verification
  - `InitialHeight int64`: cross-checked against `GenesisDoc.InitialHeight`
  - `GasReplayMode string`: `""`/`"strict"` (default, new VM's gas meter)
    or `"source"` (bypass gas meter, preserve source-chain outcomes)
- **`GnoTxMetadata`** extensions:
  - `BlockHeight int64`: original block height
  - `ChainID string`: originating chain ID
  - `Failed bool`: tx had non-zero return code on source chain (skipped
    during replay)
  - `SignerInfo []SignerAccountInfo`: per-signer account metadata (address,
    account number, pre-tx sequence) so signatures verify correctly even if
    earlier txs diverged
  - `GasUsed`, `GasWanted int64`: source-chain gas (populated by
    tx-archive, used by replay report)
- **`auth.NewAccountWithUncheckedNumber`** (this iteration, renamed from
`NewAccountWithNumber`): create accounts with a specific number,
bypassing the auto-increment counter. Doc comment now spells out the
precondition that the caller must enforce uniqueness; the rename forces
every call site to acknowledge it.
- **`validateSignerInfo` preflight** (this iteration): scans every
`SignerInfo` entry across all txs at the start of `loadAppState`.
Rejects the genesis if two different addresses claim the same account
number, or if a `SignerInfo` claims a number reserved by a balance-init
account at a different address. Defense-in-depth against a malformed
genesis silently corrupting state.
- **`InitChainerConfig.StrictReplay`** (this iteration): opt-in
fail-closed boot. Defaults to `false` for backwards compat. Hardfork
operators set it to `true` so any non-skipped tx replay failure aborts
`InitChain` instead of letting the chain boot in a corrupted state.
Skipped txs (`metadata.Failed = true`) do not count.
- **Genesis-mode tx sig verify with PastChainIDs[0]**: genesis-mode txs
(no metadata or `BlockHeight == 0`) use the first `PastChainIDs` entry
for sig verify when a hardfork is in progress (PR #5540). The
genesis-mode chain-ID branch is now gated on `metadata == nil` (this
iteration) so migration txs (`metadata != nil`, `BlockHeight == 0`,
`Timestamp != 0`) keep their metadata-driven `ctxFn` instead of being
silently overwritten.
- **`BaseApp.InitChain` error surfacing** (this iteration): when
`InitChainer` returns `ResponseInitChain.Error`, return cleanly instead
of falling through to the validators-count sanity check, which would
otherwise panic with a misleading `"validators count mismatch"` and mask
the real cause.
- **Replay report**: per-tx categorization emitted via logger after
`InitChain`: `ok` / `ok_gas_differs` / `failed` / `skipped_failed`.
Exposes `Outcomes()` and `FailedCount()` for external tooling.

### Hardfork tooling (`contribs/gnogenesis/internal/fork/`)
- **`gnogenesis fork generate`**: generate a hardfork genesis from a
source chain (RPC URL, local data dir, or exported tarball).
- **`gnogenesis fork test`**: local genesis replay smoke-test.
- **`--patch-realm PKGPATH=SRCDIR`** (repeatable): rewrite a genesis-mode
`addpkg` tx in-place with files from `SRCDIR`. Lets you deliver realm
upgrades as part of the fork (e.g. adding a new `.gno` file to an
existing realm) since you cannot re-addpkg post-deploy (PR #5540).
- **`--migration-tx`**: inject a single migration tx at the end of the
historical replay.
- **`bruteForceSignerSequence`**: resolve signer sequences during export
by trying candidate values against the signature.

## Bugs found and fixed during review

### tm2 consensus (all fixed)
1. **Fast-sync broken with InitialHeight > 1**: `BlockPool` started at
`store.Height()+1 = 1` instead of `state.LastBlockHeight+1 =
InitialHeight`. Nodes trying to fast-sync would request non-existent
blocks.
2. **Validator set / consensus params not saved at InitialHeight**:
`saveState` only saved validators when `nextHeight == 1`. With
InitialHeight > 1, `LoadValidators` failed and `LoadConsensusParams`
panicked at block InitialHeight+1.
3. **`ValidateBasic` bypass via zeroed `LastBlockID`**: any block with
`LastBlockID.IsZero()` could skip commit validation. Fixed: only allow
skip when commit is also nil/empty.
4. **`BaseApp.validateHeight` permanent contiguity bypass**: the previous
"allow monotonic jump" branch compared real block height against the
multistore version. After the first commit, `actual > prevHeight` is
trivially true on every subsequent block, so the contiguity check was
bypassed forever (an attacker or buggy consensus engine that skipped N
blocks would be silently accepted). Fixed by tracking real chain height
in `lastBlockHeight` (this iteration).
5. **`BaseApp.InitChain` masking real error**: when `loadAppState`
returned an error response, the validators-count sanity check fired with
`"validators count mismatch"`, masking the actual cause. Fixed: return
cleanly on error response (this iteration).

### gno.land (all fixed)
6. **`loadAppState` returns nil even on N tx failures**: the chain booted
in a corrupted state when historical-tx replay had failures. Fixed via
opt-in `StrictReplay` in `InitChainerConfig` (this iteration).
7. **Migration-tx `ctxFn` overwrite**: the genesis-mode chain-ID branch
fired on any `metadata.BlockHeight == 0`, stomping the metadata-driven
`Timestamp` override on migration txs. Fixed: tighten predicate to
`metadata == nil` and compose with any prior `ctxFn` (this iteration).
8. **`NewAccountWithNumber` had no SignerInfo collision check**: two
`SignerInfo` entries with the same `AccountNum` but different addresses,
or a `SignerInfo` colliding with a balance-init account, would silently
zero the original account's balance. Fixed: rename to
`NewAccountWithUncheckedNumber` (forcing every call site to acknowledge
the precondition) plus `validateSignerInfo` preflight in `loadAppState`
(this iteration).
9. **Failed-tx `ResponseDeliverTx` was empty (looked like success)**:
Fixed: added an explicit error marker so indexers can distinguish failed
txs from successful ones.
10. **`GnoGenesisState.InitialHeight` wasn't cross-checked against
`GenesisDoc.InitialHeight`**: Fixed: added `InitialHeight` to
`RequestInitChain` and validate it in `loadAppState`.
11. **`RequestInitChain.InitialHeight` had no amino round-trip test**: a
silent registration regression would only surface during a real
hardfork. Fixed: round-trip test added (this iteration).
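
The collision check from item 8 can be sketched in a few lines. This is a hedged illustration, not the `validateSignerInfo` code: `signerInfo`, `checkSignerInfo`, and the `balanceAccounts` map are hypothetical, and addresses are plain strings.

```go
package main

import "fmt"

// signerInfo is a hypothetical, trimmed-down version of the per-signer metadata.
type signerInfo struct {
	Address    string
	AccountNum uint64
}

// checkSignerInfo rejects a genesis in which one account number is claimed by
// two different addresses, including numbers reserved by balance-init accounts.
func checkSignerInfo(balanceAccounts map[uint64]string, txs [][]signerInfo) error {
	owner := make(map[uint64]string, len(balanceAccounts))
	for num, addr := range balanceAccounts {
		owner[num] = addr
	}
	for i, signers := range txs {
		for _, s := range signers {
			if prev, ok := owner[s.AccountNum]; ok && prev != s.Address {
				return fmt.Errorf("tx %d: account number %d claimed by both %s and %s",
					i, s.AccountNum, prev, s.Address)
			}
			owner[s.AccountNum] = s.Address
		}
	}
	return nil
}

func main() {
	balances := map[uint64]string{0: "g1alice"}

	ok := [][]signerInfo{{{Address: "g1alice", AccountNum: 0}}, {{Address: "g1bob", AccountNum: 1}}}
	fmt.Println(checkSignerInfo(balances, ok))

	bad := [][]signerInfo{{{Address: "g1mallory", AccountNum: 0}}}
	fmt.Println(checkSignerInfo(balances, bad))
}
```

Running the scan before any account is created means a malformed genesis is rejected up front rather than after state has already been overwritten.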

### Hardfork tooling (fixed)
12. **`applyOverlay` silent no-op**: listed scripts but didn't execute
them, returned success. Fixed: returns error when scripts found but
execution not implemented.
13. **JSONL serialization used `encoding/json` instead of amino**:
interface types (`std.Msg`) lost on round-trip. Fixed: both writer and
reader now use amino.
14. **`verifyGenesisFile` failure returned success**: tool could produce
invalid genesis and exit 0. Fixed: failure aborts (opt out with
`--no-verify`).
15. **Zero unit tests for `bruteForceSignerSequence`**: Fixed: added 10
table-driven tests.
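
Item 13 is easy to reproduce with the standard library alone. In the sketch below, `Msg`, `MsgSend`, and `Tx` are simplified stand-ins (not the real `std.Msg` or tx types): `encoding/json` writes the concrete value without any type tag, so decoding into the interface-typed field fails.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Msg is a simplified stand-in for an interface-typed message.
type Msg interface{ Route() string }

// MsgSend is one concrete message type.
type MsgSend struct{ To string }

func (MsgSend) Route() string { return "bank" }

// Tx holds messages behind the interface, as a tx envelope would.
type Tx struct{ Msgs []Msg }

// roundTrip encodes and decodes a Tx with encoding/json.
func roundTrip(tx Tx) (Tx, error) {
	bz, err := json.Marshal(tx)
	if err != nil {
		return Tx{}, err
	}
	var out Tx
	err = json.Unmarshal(bz, &out)
	return out, err
}

func main() {
	_, err := roundTrip(Tx{Msgs: []Msg{MsgSend{To: "g1example"}}})
	// The encoded JSON carries no concrete type, so the decoder cannot
	// reconstruct a Msg and reports an error.
	fmt.Println("round-trip error:", err)
}
```

An encoding that records the registered concrete type alongside the value, as amino does, avoids this loss.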

### Docs linter (side fix for green CI)
- Skip `staging.gno.land`, `archive.org`, and add retry/timeout logic so
transient remote-link failures don't block unrelated PRs.

## Still open (design / follow-up)

- **RPC retry/resume**
(`contribs/gnogenesis/internal/fork/source_rpc.go`): a single transient
error during tx fetch aborts everything; needs exponential backoff +
checkpointing. Architectural, follow-up PR.
- **Streaming tx export**: full tx history is held in memory; will OOM on
large chains. Needs streaming writer, follow-up PR.
- **`queryAccountAtHeight` silent nil**: all error paths return nil with
no indication; flaky RPC → wrong sequence metadata.


## Cherry-picked from [#5597](#5597) (this iteration)

Three follow-ups originally staged in the master-based hardfork series,
brought back to where they belong since they modify or extend code
introduced here:

- [`1babfe42a`](1babfe42a)
`fix(consensus): skip phantom heights during replay when InitialHeight >
1` — ABCI handshake replay path used to assume heights `[1,
appBlockHeight+1]` always have a stored block; for chains starting at
`InitialHeight > 1`, heights below `InitialHeight` never had blocks and
replay errored with "block not found for height 1".
- [`5bf2fa53e`](5bf2fa53e)
`fix(gnogenesis): default gas-storage params and gas_replay_mode in
hardfork genesis` — `buildHardforkGenesis` now defaults the post-#5415
`vm.params` gas-storage fields from `vm.DefaultParams()` when the source
has them all at zero, and sets `gas_replay_mode = "source"` when unset.
Operator overrides preserved. 4 unit tests.
- [`e31268467`](e31268467)
`feat(gnogenesis): add --skip-failing-genesis-txs and
--skip-genesis-sig-verification flags to fork test` — `make smoketest`
now matches what production validators actually run.
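
The phantom-height fix in the first cherry-pick amounts to clamping the start of the replay range. The helper below is a hypothetical sketch of that idea, not the handshake code: it assumes replay should cover every height after the app's last block up to the block store height, and never anything below `InitialHeight`.

```go
package main

import "fmt"

// heightsToReplay returns the block heights to replay into the app. Heights
// below initialHeight are skipped because no block was ever stored for them.
func heightsToReplay(initialHeight, appBlockHeight, storeHeight int64) []int64 {
	start := appBlockHeight + 1
	if start < initialHeight {
		start = initialHeight
	}
	var heights []int64
	for h := start; h <= storeHeight; h++ {
		heights = append(heights, h)
	}
	return heights
}

func main() {
	// Chain starting at height 1: replay begins at 1.
	fmt.Println(heightsToReplay(1, 0, 3))
	// Chain starting at height 100: heights 1..99 are never requested.
	fmt.Println(heightsToReplay(100, 0, 102))
}
```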

## End-to-end validation

The hf-glue testbed ([#5486](#5486))
runs `make fetch && make init && make up` against `rpc.gno.land`
halt@704052 and produces a 192 MB hardfork genesis that replays with **0
/ 2715 tx failures** and boots a live `gnoland-1` node.

## Dependencies / related PRs

- **Depends on / pairs with:**
[#5533](#5533) (`contribs/tx-archive`
metadata + `SignerInfo` populator) for replay-ready backups
- **Used in:** [#5486](#5486)
(hf-glue testbed)
- **Also fixed here:** [#5539](#5539)
(docs-linter skip staging preemptive fix, committed here too to keep CI
green)

## AI disclosure

Developed with significant assistance from Claude Code for testing,
review, and iterative fixes.

---------

Co-authored-by: moul <noreply@moul.io>
Co-authored-by: jaekwon <jae@tendermint.com>
assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: aeddi <antoine.e.b@gmail.com>

merging for moul

Labels

📖 documentation Improvements or additions to documentation 🤝 contribs 🐳 devops 🐹 golang Pull requests that update Go code 📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related 📦 🤖 gnovm Issues or PRs gnovm related 🧾 package/realm Tag used for new Realms or Packages.

Projects

Status: 📥 Inbox
Status: No status


7 participants