fix(finality-grandpa): eliminate data races in voter, timer, and test infrastructure #4849

Open
dimartiro wants to merge 4 commits into development from diego/fix-finality-grandpa-races
@dimartiro
Contributor

Changes

Fix all data races detected by go test -race ./pkg/finality-grandpa (137 warnings on development, 0 after this PR):

  • voter.go, wakerChan.waker: changed from *waker to atomic.Pointer[waker] so the goroutine in start() and callers of setWaker() no longer race on pointer publication.
  • voter.go, sharedVoteState.Get(): replaced the ad-hoc svs.mtx (which never synchronized with the writer) with a snapshot pattern: the voter loop publishes a VoterStateReport into an atomic.Pointer and Get() reads it without locking.
  • voter.go, pruneBackgroundRounds: collect finalize notifications under inner.Mutex, release the lock, then invoke env.FinalizeBlock outside it. Holding the mutex across a user-supplied callback was both a deadlock hazard (a slow environment could stall observers and Stop()) and what forced Get() to give up on locking in the first place. A snapshot is also published right after releasing the lock so concurrent readers see fresh state even while we're inside a slow callback.
  • voter.go, processBestRound: fixed a pre-existing missing inner.Unlock() on the error path from bestRound.poll.
  • timer.go: expired is now atomic.Bool (it was written under a mutex but read without one). The one-shot close of wakerChan.in is now guarded by sync.Once, replacing the previous mutex + flag combo.
  • environment_test.go, BroadcastNetwork: protected senders/history with a mutex, and added a stop channel and a separate WaitGroup so forwarder goroutines exit before Stop() closes the receiver (previously this caused send-on-closed-channel races and potential panics).
  • voter_test.go, TestBuffered: the shared run bool is now atomic.Bool.
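
For readers who want the pruneBackgroundRounds change in miniature, here is a rough sketch of the "collect under the lock, call back outside it" shape. All names in it (voter, round, finalize) are hypothetical stand-ins rather than the real types in voter.go, and the callback signature is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// round is a hypothetical stand-in for a background round.
type round struct {
	number    uint64
	finalized bool
}

// voter is a hypothetical, heavily simplified stand-in for the real Voter.
type voter struct {
	mtx      sync.Mutex
	rounds   []round
	finalize func(number uint64) error // user-supplied callback (assumed shape)
}

// pruneBackgroundRounds drops finalized rounds while holding the lock and
// only notifies the callback after the lock has been released.
func (v *voter) pruneBackgroundRounds() error {
	v.mtx.Lock()
	var pending []uint64
	kept := v.rounds[:0]
	for _, r := range v.rounds {
		if r.finalized {
			pending = append(pending, r.number)
			continue
		}
		kept = append(kept, r)
	}
	v.rounds = kept
	v.mtx.Unlock()

	// The lock is not held here, so a slow callback cannot stall other
	// goroutines that need it.
	for _, n := range pending {
		if err := v.finalize(n); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	v := &voter{rounds: []round{
		{number: 1, finalized: true},
		{number: 2, finalized: false},
		{number: 3, finalized: true},
	}}
	v.finalize = func(n uint64) error {
		// The callback may take the same lock without deadlocking.
		v.mtx.Lock()
		defer v.mtx.Unlock()
		fmt.Println("finalized round", n, "rounds still tracked:", len(v.rounds))
		return nil
	}
	if err := v.pruneBackgroundRounds(); err != nil {
		fmt.Println("error:", err)
	}
}
```

The trade-off is that the callback sees state that may already have moved on by the time it runs, which is why the notifications are copied into a local slice first.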

Tests

go test -race ./pkg/finality-grandpa

Should report ok with no WARNING: DATA RACE. Before this PR the same command produced 137 warnings.

Issues

ChainSafe/go-jam#523

@dimartiro dimartiro requested a review from timwu20 as a code owner May 14, 2026 01:26
@dimartiro dimartiro changed the title Diego/fix finality grandpa races fix(finality-grandpa): eliminate data races in voter, timer, and test infrastructure May 14, 2026
Comment thread pkg/finality-grandpa/voter.go Outdated
// publishVoterStateSnapshot rebuilds the public VoterStateReport from the
// current inner state and stores it atomically. Callers must NOT hold
// inner.Mutex when invoking this; the method acquires it.
func (v *Voter[Hash, Number, Signature, ID]) publishVoterStateSnapshot() {
Contributor

Suggest dropping this whole snapshot mechanism — publishVoterStateSnapshot, the defer in poll(), the voterStateSnapshot atomic.Pointer field, the buildVoterStateReport helper, and the lock-on-init in VoterState().

The Get() race is real, but the fix is one-line: have sharedVoteState reference v.inner (the writer's actual mutex — the old svs.mtx was a separate mutex that never synchronized with anything) and take it briefly in Get() while building the report.
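
A rough sketch of that suggestion, with hypothetical stand-in types (innerState, report, sharedState) rather than the real ones in voter.go: the shared handle keeps a pointer to the writer's own mutex-guarded state and locks it only for as long as it takes to copy out a report.

```go
package main

import (
	"fmt"
	"sync"
)

// innerState is a hypothetical stand-in for the voter's mutex-guarded state.
type innerState struct {
	sync.Mutex
	bestRound  uint64
	background []uint64
}

// report is a hypothetical stand-in for VoterStateReport.
type report struct {
	BestRound        uint64
	BackgroundRounds []uint64
}

// sharedState references the writer's state instead of owning its own mutex.
type sharedState struct {
	inner *innerState
}

// Get takes the writer's lock briefly and returns a copy of the state.
func (s *sharedState) Get() report {
	s.inner.Lock()
	defer s.inner.Unlock()
	return report{
		BestRound: s.inner.bestRound,
		// Copy the slice so the caller never aliases writer-owned memory.
		BackgroundRounds: append([]uint64(nil), s.inner.background...),
	}
}

func main() {
	inner := &innerState{bestRound: 1}
	shared := &sharedState{inner: inner}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // writer, standing in for the voter loop
		defer wg.Done()
		for i := 0; i < 100; i++ {
			inner.Lock()
			inner.background = append(inner.background, inner.bestRound)
			inner.bestRound++
			inner.Unlock()
		}
	}()

	for i := 0; i < 100; i++ { // concurrent reader
		_ = shared.Get()
	}
	wg.Wait()
	fmt.Println("best round:", shared.Get().BestRound)
}
```

This only stays cheap if the writer never holds the lock across slow work, which is what moving the finalize callback out of the critical section is meant to ensure.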

Today the only callers of Get() are in voter_test.go. The one external consumer — SharedVoterState.reset(...) in internal/client/consensus/grandpa/grandpa.go — stores the handle but doesn't expose a reader yet (there's a // TODO: telemetry). Per-poll snapshot work + an extra field + ~25 lines of plumbing seems like a lot to carry for just tests.

Contributor

@timwu20 timwu20 left a comment

Nice cleanup overall — the race fixes in timer.go, wakerChan, environment_test.go, and the missing Unlock in processBestRound all look right.

My one piece of feedback is on the VoterState snapshot change — left an inline comment on publishVoterStateSnapshot. tl;dr: the underlying Get() race is worth fixing, but the snapshot/atomic.Pointer pattern feels heavy given that the only current callers of Get() are tests.

Otherwise LGTM.

@dimartiro
Contributor Author

@timwu20 changes applied. Could you please take one more look?
