[new-protocol] new decide event handling, demo, integration test#4258

Draft
imabdulbasit wants to merge 5 commits into main from
ab/new-protocol-integration-clean
Conversation

@imabdulbasit
Contributor

@imabdulbasit imabdulbasit commented May 6, 2026

End-to-end integration of the new protocol in the espresso node.

- Coordinator::maker now seeds seed_genesis / seed_state from the
  HotShotInitializer.
- Stake table API: reject requests for epochs below first_epoch.
- Leaf fetcher: compare QCs by (view, data) instead of by full commitment
  (signature subsets differ across cert1s for the same leaf).
- Decide events: decide_events_from_chain splits chains at
  NEW_PROTOCOL_VERSION, emitting legacy Decide and NewDecide events as needed.
- Add demo file and integration test.

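The version-boundary split described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: `Leaf`, `Event`, and the field layout are simplified stand-ins, and only the names `decide_events_from_chain`, `NEW_PROTOCOL_VERSION`, `Decide`, and `NewDecide` come from the PR description.

```rust
// Hypothetical sketch: split a decided chain at the protocol-version
// boundary, emitting a legacy Decide for pre-upgrade runs and a
// NewDecide for post-upgrade runs. Types are simplified stand-ins.

const NEW_PROTOCOL_VERSION: u16 = 2; // illustrative value

#[derive(Debug, Clone)]
struct Leaf {
    view: u64,
    version: u16,
}

#[derive(Debug, PartialEq)]
enum Event {
    Decide(Vec<u64>),    // legacy decide over pre-upgrade leaves (views)
    NewDecide(Vec<u64>), // new-protocol decide over post-upgrade leaves
}

fn decide_events_from_chain(chain: &[Leaf]) -> Vec<Event> {
    let mut events = Vec::new();
    let mut run: Vec<u64> = Vec::new();
    let mut run_is_new = false;
    for leaf in chain {
        let is_new = leaf.version >= NEW_PROTOCOL_VERSION;
        // Close the current run when the version class flips.
        if !run.is_empty() && is_new != run_is_new {
            events.push(if run_is_new {
                Event::NewDecide(std::mem::take(&mut run))
            } else {
                Event::Decide(std::mem::take(&mut run))
            });
        }
        run_is_new = is_new;
        run.push(leaf.view);
    }
    if !run.is_empty() {
        events.push(if run_is_new {
            Event::NewDecide(run)
        } else {
            Event::Decide(run)
        });
    }
    events
}

fn main() {
    let chain = vec![
        Leaf { view: 10, version: 1 },
        Leaf { view: 11, version: 1 },
        Leaf { view: 12, version: 2 }, // upgrade boundary
        Leaf { view: 13, version: 2 },
    ];
    let events = decide_events_from_chain(&chain);
    assert_eq!(
        events,
        vec![Event::Decide(vec![10, 11]), Event::NewDecide(vec![12, 13])]
    );
    println!("{events:?}");
}
```

The key property is that one decided chain can produce both event kinds: the prefix below the upgrade view is delivered to legacy consumers unchanged, while the suffix goes out as a NewDecide.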

@gemini-code-assist (bot) left a comment


Code Review

This pull request implements comprehensive support for the new protocol version, updating API, persistence, and consensus components to handle NewDecide events and VID shares. Key improvements include seeding the network with genesis peers, refining leaf request authentication in the query service, and adding a mechanism to wait for L1 finalization. Critical feedback points out that garbage collection has been disabled and last_decided_view updates removed, which could lead to memory leaks. Additionally, the use of a fixed sleep in block building is flagged as non-deterministic, and the 100x increase in EXTERNAL_EVENT_CHANNEL_SIZE requires further justification regarding its impact on resource consumption.

Comment on lines 872 to 891
fn gc(&mut self, _view: ViewNumber, _epoch: EpochNumber) {
// GC disabled while debugging proposal fetcher: peers must keep
// signed_proposals/pending_proposal_fetches across decides so
// catching-up nodes can fetch parent proposals.
// self.consensus.gc(view, epoch);
// self.checkpoint_collector.gc(view, epoch);
// let _ = self.network.gc(view);
// self.state_manager.gc(view);
// self.vid_disperser.gc(view);
// self.vid_reconstructor.gc(view);
// self.vote1_collector.gc(view, epoch);
// self.vote2_collector.gc(view, epoch);
// self.timeout_collector.gc(view, epoch);
// self.timeout_one_honest_collector.gc(view, epoch);
// self.epoch_root_collector.gc(view, epoch);
// self.epoch_manager.gc(epoch);
// self.block_builder.gc(view);
// self.pending_proposal_fetches.gc(view);
// self.storage.gc(view);
}

critical

The garbage collection logic has been completely commented out. The comment indicates this is for debugging the proposal fetcher. Disabling GC is a critical issue as it will lead to unbounded memory usage over time. This change must be reverted before this PR is merged.

    fn gc(&mut self, view: ViewNumber, epoch: EpochNumber) {
        self.consensus.gc(view, epoch);
        self.checkpoint_collector.gc(view, epoch);
        let _ = self.network.gc(view); // TODO
        self.state_manager.gc(view);
        self.vid_disperser.gc(view);
        self.vid_reconstructor.gc(view);
        self.vote1_collector.gc(view, epoch);
        self.vote2_collector.gc(view, epoch);
        self.timeout_collector.gc(view, epoch);
        self.timeout_one_honest_collector.gc(view, epoch);
        self.epoch_root_collector.gc(view, epoch);
        self.epoch_manager.gc(epoch);
        self.block_builder.gc(view);
        self.pending_proposal_fetches.gc(view);
        self.storage.gc(view);
    }

let membership = self.membership.clone();

let handle = self.tasks.spawn(async move {
tokio::time::sleep(std::time::Duration::from_millis(100)).await;

high

Adding a fixed sleep is generally not a robust way to handle timing issues and can be a sign of a race condition. It can also slow down tests and production code unnecessarily. Could you please explain the purpose of this sleep? If it's to wait for some condition, it would be better to use a more deterministic synchronization mechanism, like a channel or a conditional variable.

Comment thread crates/hotshot/new-protocol/src/consensus.rs
Comment thread crates/hotshot/types/src/constants.rs Outdated

 /// Default channel size for HotShot -> application communication
-pub const EXTERNAL_EVENT_CHANNEL_SIZE: usize = 1_000;
+pub const EXTERNAL_EVENT_CHANNEL_SIZE: usize = 100_000;

medium

The EXTERNAL_EVENT_CHANNEL_SIZE has been increased by a factor of 100. While this might solve an issue with the channel becoming full, it's a significant increase that will raise memory consumption. Could you add a comment explaining the rationale for this large increase? It's important to understand if this is addressing a symptom of a deeper performance issue that might need a different solution.

…or, fix ff-base test setup

- coordinator: emit genesis DA proposal at start so query service can serve block 0
- coordinator: kick off coord.start() from run_coordinator
- coordinator: persist LightClientStateUpdateCertificateV2 from epoch-root certs
  and from validated proposals carrying a state_cert, via Storage::update_state_cert
- state: expose StateManager::get_leaf
- tests: wait for LightClient proxy code in TestRuntime::initialize before
  test_state() reads finalizedState, which would otherwise panic with
  ZeroData on tests that don't gate on the builder
- tests: scope balance-conservation check to builder runs only
- tests: ff-base requirements derive requires_builder from genesis version
- demo-ff: epoch_height 1000 -> 300