fix(finality-grandpa): eliminate data races in voter, timer, and test infrastructure#4849
fix(finality-grandpa): eliminate data races in voter, timer, and test infrastructure#4849dimartiro wants to merge 4 commits into
Conversation
| // publishVoterStateSnapshot rebuilds the public VoterStateReport from the | ||
| // current inner state and stores it atomically. Callers must NOT hold | ||
| // inner.Mutex when invoking this; the method acquires it. | ||
| func (v *Voter[Hash, Number, Signature, ID]) publishVoterStateSnapshot() { |
There was a problem hiding this comment.
Suggest dropping this whole snapshot mechanism — publishVoterStateSnapshot, the defer in poll(), the voterStateSnapshot atomic.Pointer field, the buildVoterStateReport helper, and the lock-on-init in VoterState().
The Get() race is real, but the fix is one-line: have sharedVoteState reference v.inner (the writer's actual mutex — the old svs.mtx was a separate mutex that never synchronized with anything) and take it briefly in Get() while building the report.
Today the only callers of Get() are in voter_test.go. The one external consumer — SharedVoterState.reset(...) in internal/client/consensus/grandpa/grandpa.go — stores the handle but doesn't expose a reader yet (there's a // TODO: telemetry). Per-poll snapshot work + an extra field + ~25 lines of plumbing seems like a lot to carry for just tests.
timwu20
left a comment
There was a problem hiding this comment.
Nice cleanup overall — the race fixes in timer.go, wakerChan, environment_test.go, and the missing Unlock in processBestRound all look right.
My one piece of feedback is on the VoterState snapshot change — left an inline comment on publishVoterStateSnapshot. tl;dr: the underlying Get() race is worth fixing, but the snapshot/atomic.Pointer pattern feels heavy given that the only current callers of Get() are tests.
Otherwise LGTM.
|
@timwu20 changes applied. Could you please take one more look? |
Changes
Fix all data races detected by
go test -race ./pkg/finality-grandpa(137 warnings ondevelopment, 0 after this PR):voter.go—wakerChan.waker: changed from*wakertoatomic.Pointer[waker]so the goroutine instart()and callers ofsetWaker()no longer race on pointer publication.voter.go—sharedVoteState.Get(): replaced ad-hocsvs.mtx(which never synchronized with the writer) with a snapshot pattern — the voter loop publishes aVoterStateReportintoatomic.PointerandGet()reads it withoutlocking.
voter.go—pruneBackgroundRounds: collect finalize notifications underinner.Mutex, release the lock, then invokeenv.FinalizeBlockoutside it. Holding the mutex across a user-supplied callback was both a deadlock hazard (a slowenvironment could stall observers and
Stop()) and what forcedGet()to give up on locking in the first place. A snapshot is also published right after releasing the lock so concurrent readers see fresh state even while we're inside a slow callback.voter.go—processBestRound: fixed a pre-existing missinginner.Unlock()on the error path frombestRound.poll.timer.go:expiredis nowatomic.Bool(was written under a mutex, read without). The one-shot close ofwakerChan.inis now guarded bysync.Once, replacing the previous mutex + flag combo.environment_test.go—BroadcastNetwork: protectedsenders/historywith a mutex, added astopchannel and a separate WaitGroup so forwarder goroutines exit beforeStop()closes the receiver (previously caused send-on-closed-channel races and potential panics).voter_test.go—TestBuffered: the sharedrunbool is nowatomic.Bool.Tests
go test -race ./pkg/finality-grandpaShould report ok with no WARNING: DATA RACE. Before this PR the same command produced 137 warnings.
Issues
ChainSafe/go-jam#523