[new-protocol] new decide event handling, demo, integration test#4258
[new-protocol] new decide event handling, demo, integration test#4258imabdulbasit wants to merge 5 commits intomainfrom
Conversation
End-to-end integration of new protocol in espresso node. - Coordinator::maker now seeds seed_genesis / seed_state from the HotShotInitializer. - Stake table API: reject requests for epochs below first_epoch. - Leaf fetcher: compare QC by (view, data) instead of full commit (signature subsets differ across cert1s for the same leaf). - Decide events: decide_events_from_chain splits chains at NEW_PROTOCOL_VERSION, emitting legacy Decide + NewDecide as needed. - Add demo file and integration test.
There was a problem hiding this comment.
Code Review
This pull request implements comprehensive support for the new protocol version, updating API, persistence, and consensus components to handle NewDecide events and VID shares. Key improvements include seeding the network with genesis peers, refining leaf request authentication in the query service, and adding a mechanism to wait for L1 finalization. Critical feedback points out that garbage collection has been disabled and last_decided_view updates removed, which could lead to memory leaks. Additionally, the use of a fixed sleep in block building is flagged as non-deterministic, and the 100x increase in EXTERNAL_EVENT_CHANNEL_SIZE requires further justification regarding its impact on resource consumption.
| fn gc(&mut self, _view: ViewNumber, _epoch: EpochNumber) { | ||
| // GC disabled while debugging proposal fetcher: peers must keep | ||
| // signed_proposals/pending_proposal_fetches across decides so | ||
| // catching-up nodes can fetch parent proposals. | ||
| // self.consensus.gc(view, epoch); | ||
| // self.checkpoint_collector.gc(view, epoch); | ||
| // let _ = self.network.gc(view); | ||
| // self.state_manager.gc(view); | ||
| // self.vid_disperser.gc(view); | ||
| // self.vid_reconstructor.gc(view); | ||
| // self.vote1_collector.gc(view, epoch); | ||
| // self.vote2_collector.gc(view, epoch); | ||
| // self.timeout_collector.gc(view, epoch); | ||
| // self.timeout_one_honest_collector.gc(view, epoch); | ||
| // self.epoch_root_collector.gc(view, epoch); | ||
| // self.epoch_manager.gc(epoch); | ||
| // self.block_builder.gc(view); | ||
| // self.pending_proposal_fetches.gc(view); | ||
| // self.storage.gc(view); | ||
| } |
There was a problem hiding this comment.
The garbage collection logic has been completely commented out. The comment indicates this is for debugging the proposal fetcher. Disabling GC is a critical issue as it will lead to unbounded memory usage over time. This change must be reverted before this PR is merged.
fn gc(&mut self, view: ViewNumber, epoch: EpochNumber) {
self.consensus.gc(view, epoch);
self.checkpoint_collector.gc(view, epoch);
let _ = self.network.gc(view); // TODO
self.state_manager.gc(view);
self.vid_disperser.gc(view);
self.vid_reconstructor.gc(view);
self.vote1_collector.gc(view, epoch);
self.vote2_collector.gc(view, epoch);
self.timeout_collector.gc(view, epoch);
self.timeout_one_honest_collector.gc(view, epoch);
self.epoch_root_collector.gc(view, epoch);
self.epoch_manager.gc(epoch);
self.block_builder.gc(view);
self.pending_proposal_fetches.gc(view);
self.storage.gc(view);
}| let membership = self.membership.clone(); | ||
|
|
||
| let handle = self.tasks.spawn(async move { | ||
| tokio::time::sleep(std::time::Duration::from_millis(100)).await; |
There was a problem hiding this comment.
Adding a fixed sleep is generally not a robust way to handle timing issues and can be a sign of a race condition. It can also slow down tests and production code unnecessarily. Could you please explain the purpose of this sleep? If it's to wait for some condition, it would be better to use a more deterministic synchronization mechanism, like a channel or a conditional variable.
|
|
||
| /// Default channel size for HotShot -> application communication | ||
| pub const EXTERNAL_EVENT_CHANNEL_SIZE: usize = 1_000; | ||
| pub const EXTERNAL_EVENT_CHANNEL_SIZE: usize = 100_000; |
There was a problem hiding this comment.
The EXTERNAL_EVENT_CHANNEL_SIZE has been increased by a factor of 100. While this might solve an issue with the channel becoming full, it's a significant increase that will raise memory consumption. Could you add a comment explaining the rationale for this large increase? It's important to understand if this is addressing a symptom of a deeper performance issue that might need a different solution.
…or, fix ff-base test setup - coordinator: emit genesis DA proposal at start so query service can serve block 0 - coordinator: kick off coord.start() from run_coordinator - coordinator: persist LightClientStateUpdateCertificateV2 from epoch-root certs and from validated proposals carrying a state_cert, via Storage::update_state_cert - state: expose StateManager::get_leaf - tests: wait for LightClient proxy code in TestRuntime::initialize before test_state() reads finalizedState, which would otherwise panic with ZeroData on tests that don't gate on the builder - tests: scope balance-conservation check to builder runs only - tests: ff-base requirements derive requires_builder from genesis version - demo-ff: epoch_height 1000 -> 300
end to end integration of new protocol in espresso node.
Coordinator::makernow seedsseed_genesis/seed_statefrom theHotShotInitializerfirst_epoch.(view, data)instead of full commit(signature subsets differ across cert1s for the same leaf).
decide_events_from_chainsplits chains atNEW_PROTOCOL_VERSION, emitting legacyDecide+NewDecideas needed.