Skip to content

Use internal locking for epoch memberships.#4263

Open
twittner wants to merge 28 commits intomainfrom
tw/internal-locking
Open

Use internal locking for epoch memberships.#4263
twittner wants to merge 28 commits intomainfrom
tw/internal-locking

Conversation

@twittner
Copy link
Copy Markdown
Contributor

@twittner twittner commented May 6, 2026

No description provided.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

Gemini is experiencing higher than usual traffic and was unable to create the review. Please try again in a few hours by commenting /gemini review.

@twittner
Copy link
Copy Markdown
Contributor Author

twittner commented May 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Membership trait and its implementations to use synchronous methods for membership lookups, replacing async_lock::RwLock with parking_lot::RwLock to improve performance and simplify the codebase. The EpochCommittees structure was also reorganized to support these changes. A typo in a panic message was identified and should be corrected.

panic!(
"The decided block header's `next_stake_table_hash` does not match the hash \
of the stake table we have. This is an unrecoverable error likely due to \
issues with the your L1 RPC provider. Decided:\n\n{:?}Actual:\n\n{:?}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Typo in panic message: "the your" should be "your".

@twittner twittner marked this pull request as ready for review May 7, 2026 16:53
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 7, 2026

Claude finished @twittner's task in 5m 8s —— View job


Code Review: PR #4263 — Use internal locking for epoch memberships

  • Read existing review comments and CLAUDE.md
  • Review trait changes (election.rs)
  • Review new committee.rs implementation
  • Review epoch_membership.rs changes
  • Review stake_table.rs changes
  • Review task-impl changes
  • Review test changes
  • Post comprehensive review

Summary

This is a significant refactor (~2650 added / ~2590 removed across 84 files) that converts the Membership trait from async methods returning Arc-wrapped data behind async_lock::RwLock to synchronous methods using parking_lot::RwLock internally. The core idea is sound: the EpochCommittees struct now encapsulates its own locking through Arc<RwLock<Inner>>, and each public method takes a short-lived read or write lock. This simplifies all call sites across the codebase significantly.

Architecture & Design

Locking model is well-considered. The doc-comment on EpochCommittees (lines 44–56 in committee.rs) clearly documents the per-call-snapshot semantics and the invariant that single-epoch mutations happen under one write lock. This is the right design for this use case — callers that need cross-method consistency (like set_first_epoch and add_epoch_root) do so within a single write-lock scope.

async_lock::Mutex is retained where appropriate. The update_fixed_block_reward_lock correctly uses an async mutex because its critical section awaits a network fetch. Good judgment call.

The Membership trait is now fully synchronous (except get_epoch_root, get_epoch_drb, and add_epoch_root which remain async for network operations). The async_lock::RwLock that previously wrapped the entire membership is gone from all callers, which removes a class of potential deadlocks.

Code Quality

The refactor is mechanically clean across the 84 files. The task-impl changes consistently replace membership.read().await.method() with membership.method(), and the EpochMembership wrapper delegates cleanly. The extraction of EpochCommittees into its own committee.rs module is a good organizational move.

Concurrency & Safety

TOCTOU windows are documented and intentional. Methods like add_drb_result have a read-then-write pattern where the read lock is released before the write lock is acquired. The doc-comment explicitly acknowledges this, and the behavior (overwriting with a slightly stale leader set) is benign.

set_first_epoch atomicity is correctly implemented — all mutations to first_epoch, state, and randomized_committees happen under a single write lock (line 1123).

add_epoch_root properly scopes its write lock (lines 949–960) and drops it before async persistence operations. This avoids holding a synchronous lock across await points — a common source of deadlocks.

Testing

The new concurrency tests in committee.rs are thorough and well-designed:

  • concurrent_reads_during_mutations — verifies readers don't panic or deadlock with writers
  • set_first_epoch_is_atomic — verifies snapshot consistency using direct inner access
  • concurrent_add_drb_result_same_epoch — stress tests the read-then-write pattern
  • add_epoch_root_is_atomic — verifies the write lock scope in add_epoch_root

All tests use TEST_DURATION (5s) with multiple rounds, which is appropriate for catching flaky race conditions in CI.

Issues Found

Two minor issues with inline suggestions posted:

  1. Typo in panic message (line 1001): "the your""your" (also flagged by Gemini)
  2. Pre-existing error message (lines 364, 375): "state for found" should be "state not found"

One performance suggestion posted inline:

  • address() (line 369–377) clones the entire address_mapping HashMap just to look up one key. The lookup can be done under the read lock to avoid the clone.

Verdict

This is a well-executed refactor that simplifies the codebase significantly. The locking model is sound, the TOCTOU windows are documented and intentional, and the test coverage for the new concurrency model is strong. The two typo fixes are trivial. Looks good to merge after those are addressed.

Comment thread crates/espresso/types/src/v0/impls/committee.rs Outdated
Comment thread crates/espresso/types/src/v0/impls/committee.rs Outdated
Comment on lines +1118 to +1121
fn set_first_epoch(&self, epoch: EpochNumber, initial_drb_result: DrbResult) {
let rand_comm = self
.randomized_committee(EpochNumber::genesis(), initial_drb_result)
.expect("committee exist at genesis");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor observation: randomized_committee() takes a read lock on inner (line 574), which is released before the write lock on line 1123. This is correct with parking_lot::RwLock, but worth being aware of: between the read and write locks, the genesis committee could theoretically be removed by another writer, causing the .expect() here to panic.

In practice this is safe because state[EpochNumber::genesis()] is populated at construction and never removed. But if that invariant ever changed, this would be a latent issue. The test set_first_epoch_is_atomic validates the current behavior well.

Comment on lines +1108 to +1116
fn add_drb_result(&self, epoch: EpochNumber, drb: DrbResult) {
tracing::info!("Adding DRB result {drb:?} to epoch {epoch}");
if let Some(committee) = self.randomized_committee(epoch, drb) {
self.inner
.write()
.randomized_committees
.insert(epoch, committee);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented TOCTOU window: randomized_committee() acquires and releases a read lock, then add_drb_result acquires a write lock. Between those two, the state entry for epoch could change. The effect is benign (overwrite with slightly stale leader set) and the doc-comment on EpochCommittees explicitly calls out this per-call-snapshot behavior. The concurrent tests cover this well.

No action needed — just confirming this was reviewed and is intentional.

Comment thread crates/espresso/types/src/v0/impls/committee.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant