chore(perf): misc improvement on new VID #4223

Open
mrain wants to merge 14 commits into main from cl/vidimprove

Conversation

@mrain
Contributor

@mrain mrain commented Apr 24, 2026

Closes #<ISSUE_NUMBER>

This PR:

VID performance improvements: see #4223 (comment)

  1. Hash swap — Keccak-256 → BLAKE3 (arity 4 Merkle tree).
  2. Recover concat — pre-allocated extend_from_slice instead of flat_map().collect().
  3. Recover intermediate removed — Vec<Option<&[u8]>> refs + direct decoder→output memcpy, no .clone()/.to_vec(), no unwrap().
  4. Disperse padding — single-pass chunk_and_pad, no intermediate padded buffer.
  5. Disperse share assembly — consume owned shares via iterator, no per-recipient .to_vec().
  6. Parallelism — par_iter across namespaces in ns_disperse, recover, and verify_share_with_verified_common; inner par_iter with find_any short-circuit in verify_share.
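Item 2, sketched minimally with std-only code (function names are illustrative, not the crate's actual API):

```rust
// Illustrative sketch of the recover-concat change (item 2). The real
// code lives in vid/src/avidm_gf2.rs; these helpers are stand-ins.

// Before: flatten-and-collect, which cannot pre-size the output.
fn concat_flat_map(shards: &[&[u8]]) -> Vec<u8> {
    shards.iter().flat_map(|s| s.iter().copied()).collect()
}

// After: reserve the exact capacity once, then memcpy each shard in
// with extend_from_slice.
fn concat_preallocated(shards: &[&[u8]]) -> Vec<u8> {
    let total: usize = shards.iter().map(|s| s.len()).sum();
    let mut out = Vec::with_capacity(total);
    for s in shards {
        out.extend_from_slice(s);
    }
    out
}

fn main() {
    let shards = vec![b"abc".as_slice(), b"de", b"fghi"];
    assert_eq!(concat_flat_map(&shards), concat_preallocated(&shards));
    assert_eq!(concat_preallocated(&shards), b"abcdefghi".to_vec());
    println!("ok");
}
```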

This PR does not:

Key places to review:

@claude
Contributor

claude Bot commented Apr 24, 2026

Claude finished @mrain's task in 4m 34s


Review of PR #4223 — VID Performance Improvements

  • Read existing review comments
  • Review vid/src/avidm_gf2.rs (core changes)
  • Review vid/src/avidm_gf2/namespaced.rs (namespace parallelism)
  • Review vid/Cargo.toml (dependency changes)
  • Review benchmark changes
  • Post review feedback

Summary

Solid performance work — the optimizations are well-motivated and the numbers in the PR description are impressive (4.8x disperse, 2.8x verify, 8.7x recover). The code changes are clean and correct. I left inline comments on specific files; here's the high-level:

Issues to address

  1. Missing benchmark files (build-breaking): vid/Cargo.toml declares [[bench]] entries for avidm_gf2_ns and avidm_gf2_breakdown, but neither vid/benches/avidm_gf2_ns.rs nor vid/benches/avidm_gf2_breakdown.rs exists. This will break cargo bench -p vid.

  2. blake3 downgrade to 1.5.5 affects entire workspace — The ~1.5 pin (needed for traits-preview) forces blake3 from 1.8.4 → 1.5.5 across all 8+ crates that depend on it. This loses newer SIMD optimizations and ties the workspace to an explicitly-unstable feature. Consider using blake3's native hash() API instead to avoid the Digest trait dependency entirely.

  3. Serialization compatibility — Switching from Keccak256 (arity 2) to BLAKE3 (arity 4) changes MerkleCommit and MerkleProof formats. Since AvidmGf2Commit appears in VidCommon::V2 which has reference tests (test_vid_common_v2_query_data), please confirm cargo test -p espresso-types reference passes (or update the reference data).

Minor / informational

  • use sha2::Digest on line 11 of avidm_gf2.rs is misleading now that the hasher is blake3 — digest::Digest would be clearer.
  • Benchmark parameters changed significantly (10k→1k weights, 5→10 MB payload), making comparison with prior baselines impossible. Consider keeping the old config as a separate group.

What looks good

  • chunk_and_pad — Clean single-pass implementation, all edge cases handled correctly (boundary-aligned payloads, sub-chunk payloads, empty trailing chunks).
  • find_any short-circuit pattern in verify_share and verify_share_with_verified_common — Correct and efficient. The non-deterministic error variant ordering is acceptable for verification.
  • recover refactor — Using &[u8] references instead of Option<Vec<u8>> clones is a solid improvement. The lifetime management is correct: input_orig borrows from shares (input param), and result.restored_original() borrows from the decoder result — both live long enough for the extend_from_slice loop.
  • Namespace-level parallelism in namespaced.rs: ns_disperse, verify_share_with_verified_common, and recover all correctly parallelize over independent namespaces. Rayon's work-stealing handles the nested par_iter calls from the inner AvidmGf2Scheme methods.
  • Share assembly in disperse — Consuming owned shares via shares.into_iter() + by_ref().take() instead of shares[range].to_vec() avoids redundant heap allocations.
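For reference, a hedged std-only sketch of what a single-pass chunk_and_pad can look like (the function name matches the PR, but the 0x01-then-zeros pad scheme and the exact signature are assumptions inferred from the benchmark's strip logic, not the crate's code):

```rust
// Assumed pad scheme: one 0x01 marker byte after the payload, then
// zeros to the next chunk boundary. Writes chunks directly, with no
// intermediate padded buffer.
fn chunk_and_pad(payload: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    assert!(chunk_size > 0);
    // Smallest multiple of chunk_size >= payload.len() + 1, so there is
    // always room for the marker byte.
    let padded_len = (payload.len() / chunk_size + 1) * chunk_size;
    let mut chunks = Vec::with_capacity(padded_len / chunk_size);
    for start in (0..padded_len).step_by(chunk_size) {
        let mut chunk = Vec::with_capacity(chunk_size);
        let end = (start + chunk_size).min(payload.len());
        if start < payload.len() {
            chunk.extend_from_slice(&payload[start..end]);
        }
        if chunk.len() < chunk_size {
            // The first free byte after the payload is the 0x01 marker...
            if start <= payload.len() && end == payload.len() {
                chunk.push(1);
            }
            // ...then zeros fill the rest of the chunk.
            chunk.resize(chunk_size, 0);
        }
        chunks.push(chunk);
    }
    chunks
}

fn main() {
    // Partial final chunk: marker + zero fill.
    assert_eq!(
        chunk_and_pad(b"hello", 4),
        vec![b"hell".to_vec(), vec![b'o', 1, 0, 0]]
    );
    // Boundary-aligned payload gets one extra all-padding chunk.
    assert_eq!(
        chunk_and_pad(b"abcd", 4),
        vec![b"abcd".to_vec(), vec![1, 0, 0, 0]]
    );
    println!("ok");
}
```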

Comment threads (collapsed):
  • vid/Cargo.toml
  • Cargo.toml (outdated)
  • vid/src/avidm_gf2.rs (outdated)
  • vid/src/avidm_gf2.rs
  • vid/src/avidm_gf2.rs
  • vid/benches/avidm_gf2.rs

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes the avidm_gf2 scheme by switching the hashing algorithm to blake3 and implementing several memory and concurrency improvements, such as avoiding intermediate allocations during padding and parallelizing dispersal, verification, and recovery using rayon. Feedback identifies security risks associated with downgrading blake3 and constant_time_eq dependencies in Cargo.lock. Additionally, there is a suggestion to parallelize the sequential payload extraction in the disperse function to further improve performance.

Comment threads (collapsed):
  • Cargo.lock (outdated)
  • Cargo.lock (outdated)
  • vid/src/avidm_gf2.rs
@EspressoSystems EspressoSystems deleted a comment from github-actions Bot Apr 24, 2026
mrain and others added 4 commits April 28, 2026 09:07
vid/Cargo.toml declared `[[bench]]` entries for both files but they were
not committed, breaking `cargo bench -p vid` and CI.

- avidm_gf2_ns.rs: namespaced disperse/verify/recover sweep across
  num_ns ∈ {1, 10, 50, 100} matching the avidm_gf2 flat parameters.
- avidm_gf2_breakdown.rs: per-phase timing of disperse/recover, hash
  parameterized so a single run reports BLAKE3 vs Keccak side-by-side.
Replace `jf_merkle_tree::hasher::HasherDigest` (which requires
`digest::Digest`) with a local `Blake3DigestAlgorithm` + `Blake3Node`
that call `blake3::hash` / `blake3::Hasher` directly. This unblocks
upgrading `blake3` past 1.5.x — newer releases bumped to `digest 0.11`
which is incompatible with `jf_merkle_tree`'s `digest 0.10` blanket
impl, forcing a workspace-wide pin on `~1.5` and dragging
`constant_time_eq` back to 0.3.x.

After this:

  - Workspace pin lifted from `blake3 = "~1.5"` to `blake3 = "1"`,
    resolving to 1.8.5 (constant_time_eq 0.4.2). Same blake3 algorithm,
    no on-wire change.
  - `vid` no longer enables `blake3/traits-preview`, an explicitly-
    unstable feature whose `digest` version is the source of the pin.
  - Wire format unchanged: 32-byte node values, leaf/internal domain
    separators `b"1"`/`b"0"`, identical to
    `HasherDigestAlgorithm + HasherNode<blake3::Hasher>`. Reference
    test `test_vid_common_v2_query_data` still matches the data added
    in 5cf2687 — confirmed by `cargo test -p espresso-types reference`.
  - `avidm_gf2_breakdown` bench abstracted over a small `MerkleBackend`
    trait so both Keccak (via `HasherDigest`) and BLAKE3 (via
    `Blake3DigestAlgorithm`) plug in without code duplication.
  - `vid::utils` is now `pub` so benches can reach the BLAKE3 helpers.

Addresses Cargo.lock blake3/constant_time_eq downgrade comments and
the `traits-preview` instability concern from the PR review.
Co-authored-by: Copilot <copilot@github.com>
@mrain
Contributor Author

mrain commented Apr 28, 2026

VID — original vs current, ST + MT, 1/10/100 ns, local + AWS

Snapshot of the cumulative impact of this PR. Workload: arity 4 (current) vs arity 3 (original), 10 MB payload, total_weights = 1000, recovery_threshold = 340. Original = Keccak baseline; current = BLAKE3 + all landed optimizations (per-namespace par_iter, recover refactor with Vec<Option<&[u8]>>, batch proof generation via collect_leaves_with_proof, drain proofs into per-recipient groups, upstream jf_merkle_tree::collect_all_with_proof allocation refactor).

Local (Apple Silicon)

| op | mode | 1 ns orig | 1 ns curr | 10 ns orig | 10 ns curr | 100 ns orig | 100 ns curr |
|---|---|---|---|---|---|---|---|
| Disperse | ST | 64.1 ms | 31.4 ms | 77.7 ms | 58.5 ms | 193.7 ms | 109.4 ms |
| Disperse | MT | 20.6 ms | 15.6 ms | 34.5 ms | 13.8 ms | 139.0 ms | 35.3 ms |
| Verify | ST | 52.4 µs | 20.9 µs | 78.1 µs | 66.5 µs | 331.3 µs | 241.4 µs |
| Verify | MT | 52.4 µs | 20.7 µs | 78.1 µs | 51.5 µs | 331.3 µs | 118.3 µs |
| Recover | ST | 61.9 ms | 42.1 ms | 64.5 ms | 41.1 ms | 101.2 ms | 85.5 ms |
| Recover | MT | 61.9 ms | 39.9 ms | 64.5 ms | 10.1 ms | 101.2 ms | 11.3 ms |

AWS Graviton (aarch64, 1 vCPU — MT ≈ ST)

| op | 1 ns orig | 1 ns curr | 10 ns orig | 10 ns curr | 100 ns orig | 100 ns curr |
|---|---|---|---|---|---|---|
| Disperse | 95.4 ms | 51.5 ms | 113.7 ms | 80.6 ms | 265.3 ms | 191.3 ms |
| Verify | 67.5 µs | 25.6 µs | 92.5 µs | 68.4 µs | 348.5 µs | 286.1 µs |
| Recover | 125.9 ms | 106.8 ms | 107.3 ms | 79.7 ms | 155.7 ms | 106.9 ms |

Notes

  • AWS box is 1 vCPU, so its "MT" is the same as ST — none of the per-namespace par_iter wins materialize there. The AWS column is the single-thread story end-to-end.
  • Local MT scaling is dramatic at high num_ns: dispersal 3.9×, recover 9.0×. The original code had no per-namespace parallelism in either path; landing par_iter over ns_table exposed all of it.
  • Verify ST regresses ~7 % at 10/100 ns: par_iter().find_any() adds rayon-dispatch overhead under RAYON_NUM_THREADS=1. Small enough to land vs the 2.8× MT win.
  • Original "MT" numbers are from the workspace-default rayon pool before any per-namespace par_iter landed; rayon was already used internally by AvidmGf2Scheme::disperse for leaf hashing and share assembly, hence the non-trivial pre-PR MT speedups vs ST. The wins above are additional to that.

mrain and others added 2 commits April 28, 2026 16:05
Co-authored-by: Copilot <copilot@github.com>
@mrain
Contributor Author

mrain commented Apr 30, 2026

@claude review

@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @mrain's task in 3m 25s


Review of PR #4223 — VID Performance Improvements

  • Read previous review comments and understand resolved items
  • Review vid/src/utils/blake3.rs (new file — BLAKE3 native API integration)
  • Review vid/src/avidm_gf2.rs (core optimizations)
  • Review vid/src/avidm_gf2/namespaced.rs (namespace parallelism)
  • Review vid/src/lib.rs and vid/src/utils.rs (module wiring)
  • Review vid/Cargo.toml and Cargo.toml (dependency changes)
  • Review benchmark files
  • Review reference test data updates
  • Post review feedback

Summary

The PR is in significantly better shape than the first review. The blake3 version pin issue is resolved (now at 1.8.5 with "1" semver range), and all the core optimizations are clean. The benchmarks show impressive gains — especially the 3.9x MT disperse and 9.0x MT recover at 100 namespaces.

Resolved from prior review

  • blake3 version pin: Now blake3 = "1" in workspace, resolving to 1.8.5 in the lockfile. No more traits-preview dependency or workspace-wide downgrade.
  • Missing benchmark files: avidm_gf2_ns.rs, avidm_gf2_breakdown.rs, and big_mt.rs are all present.
  • Serialization compatibility: Reference test data in data/v2/ has been updated with new BLAKE3-based commitments.

Items to address

1. jf-merkle-tree pinned to an unpublished branch

Cargo.toml:240 pins jf-merkle-tree to branch = "cl/mtiter" instead of a tagged release. This presumably provides the collect_leaves_with_proof() method used in avidm_gf2.rs:241. This should be tagged and released before merging to main — branch references can break if the branch is rebased or deleted.

2. Domain separator fragility in blake3.rs

vid/src/utils/blake3.rs:24-25 copies domain separators (b"1" / b"0") from jf_merkle_tree's pub(crate) constants. These are correct today and the hashing order matches, but if jellyfish ever changes them the VID will silently produce incompatible commitments. Consider adding a compile-time or test-time assertion that validates these against a tree built through the standard HasherDigestAlgorithm path — the existing deterministic_commitment test only checks self-consistency.
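To make the suggested cross-check concrete, here is a std-only sketch of its shape. std's DefaultHasher stands in for blake3 here; the real test would compare roots built through Blake3DigestAlgorithm against the HasherDigestAlgorithm path:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Domain separators as described in the review; if jf_merkle_tree ever
// changes its pub(crate) constants, this check should be pointed at the
// canonical hashing path instead of local copies.
const LEAF_SEP: &[u8] = b"1";
const INTERNAL_SEP: &[u8] = b"0";

// Stand-in for a 32-byte blake3 digest, used only to show the shape of
// the cross-check: separator first, then children, in a fixed order.
fn node_hash(sep: &[u8], children: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    sep.hash(&mut h);
    for c in children {
        c.hash(&mut h);
    }
    h.finish()
}

fn main() {
    // A reference recomputation with the separators written out
    // explicitly must match the constant-based helper exactly.
    assert_eq!(node_hash(b"1", &[b"payload"]), node_hash(LEAF_SEP, &[b"payload"]));
    assert_eq!(
        node_hash(b"0", &[b"left", b"right"]),
        node_hash(INTERNAL_SEP, &[b"left", b"right"])
    );
    // Leaf and internal domains must not collide on identical input.
    assert_ne!(node_hash(LEAF_SEP, &[b"x"]), node_hash(INTERNAL_SEP, &[b"x"]));
    println!("ok");
}
```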

Minor observations

3. Arity change from 2 to 4 is not documented in the version table

The Merkle tree arity changed from 2 (via HasherMerkleTree) to 4 (explicit JfMerkleTree<..., 4, ...>). The PR description mentions "arity 4 Merkle tree" but the code type alias at avidm_gf2.rs:27 is the only documentation. Since this affects proof sizes and commitment formats for V2 (VID2), it would be good to note the arity in the protocol version table or a code comment near the type alias.

4. avidm_gf2_breakdown.rs strips padding with an if-let chain

vid/benches/avidm_gf2_breakdown.rs:340-344:

```rust
if let Some(idx) = recovered.iter().rposition(|&b| b != 0)
    && recovered[idx] == 1u8
{
    recovered.truncate(idx);
}
```

This silently succeeds on malformed payloads (no pad byte found), unlike the production recover() which returns Err. Acceptable for a benchmark, but worth noting if this code is ever copied.
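For anyone who does copy it, a production-style variant that fails loudly might look like this (strip_pad and the error type are illustrative, not the crate's actual recover() code):

```rust
// Strip the 0x01-then-zeros padding, failing on malformed payloads
// instead of silently returning them unchanged.
fn strip_pad(mut recovered: Vec<u8>) -> Result<Vec<u8>, &'static str> {
    match recovered.iter().rposition(|&b| b != 0) {
        Some(idx) if recovered[idx] == 1 => {
            recovered.truncate(idx);
            Ok(recovered)
        }
        // Last nonzero byte is not the pad marker, or the buffer is
        // all zeros / empty: the padding is malformed.
        _ => Err("malformed padding"),
    }
}

fn main() {
    assert_eq!(strip_pad(vec![b'h', b'i', 1, 0, 0]), Ok(b"hi".to_vec()));
    assert!(strip_pad(vec![b'h', b'i', 2, 0]).is_err());
    assert!(strip_pad(vec![0, 0, 0]).is_err());
    println!("ok");
}
```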

5. sha2 and sha3 are still in vid's dependency list

vid/Cargo.toml still lists sha2 and sha3 as dependencies. sha3 is still needed for the KeccakBackend comparison in the avidm_gf2_breakdown.rs benchmark, but if nothing else in the vid crate uses sha2, it could be removed.

What looks good

  • Blake3DigestAlgorithm + Blake3Node (utils/blake3.rs): Clean implementation. The HasherWriter adapter for CanonicalSerialize is the right approach. Tests cover determinism, proof round-trips, tamper detection, and serialization. The #[repr(transparent)] on Blake3Node and raw 32-byte CanonicalSerialize are correct.

  • chunk_and_pad (avidm_gf2.rs:128-158): Single-pass, no intermediate buffer. All edge cases handled correctly — boundary-aligned payloads, sub-chunk, and empty trailing chunks. The guard padded_len < payload.len() + 1 is correct.

  • disperse share assembly (avidm_gf2.rs:235-261): The shares.into_iter() + by_ref().take() pattern correctly avoids per-recipient heap clones. Using collect_leaves_with_proof() for batch proof generation (single tree traversal) instead of range.len() individual mt.lookup() calls is a significant improvement.

  • recover refactor (avidm_gf2.rs:297-376): Vec<Option<&[u8]>> instead of Vec<Option<Vec<u8>>> eliminates all per-shard clones. The lifetime management is correct — input_orig borrows from shares (the input parameter), and result.restored_original() borrows from the decoder result, both of which live through the extend_from_slice loop. The early capacity reservation Vec::with_capacity(original_count * shard_bytes) is also good.

  • Namespace-level parallelism (namespaced.rs): ns_disperse, verify_share_with_verified_common, and recover all correctly parallelize over independent namespaces with par_iter. The find_any short-circuit pattern is correct and efficient for verification.

  • verify_share par_iter + find_any (avidm_gf2.rs:277-294): Clean parallel verification with short-circuit on first failure. The match arms correctly propagate Ok(Err(())) (proof mismatch) vs Err(e) (internal error).
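The borrowing pattern described above can be sketched in isolation; decode_missing here is a stand-in for the real GF(2) decoder, and the shapes are deliberately simplified:

```rust
// Stand-in decoder: pretends each missing shard decodes to zeros.
fn decode_missing(present: &[Option<&[u8]>], shard_bytes: usize) -> Vec<Vec<u8>> {
    present
        .iter()
        .filter(|s| s.is_none())
        .map(|_| vec![0u8; shard_bytes])
        .collect()
}

// Shard slots hold Option<&[u8]> borrowed either from the caller's
// shares or from the decoder output, so concatenation never clones.
fn recover(shares: &[Option<Vec<u8>>], shard_bytes: usize) -> Vec<u8> {
    // Borrow what we already have; no .clone()/.to_vec().
    let input_orig: Vec<Option<&[u8]>> =
        shares.iter().map(|s| s.as_deref()).collect();
    let restored = decode_missing(&input_orig, shard_bytes);
    let mut restored_iter = restored.iter();

    // Single pre-allocated output; every append is a memcpy.
    let mut out = Vec::with_capacity(shares.len() * shard_bytes);
    for slot in &input_orig {
        let shard: &[u8] = match slot {
            Some(s) => s,
            None => restored_iter.next().expect("decoder output").as_slice(),
        };
        out.extend_from_slice(shard);
    }
    out
}

fn main() {
    let shares = vec![Some(vec![1u8, 2]), None, Some(vec![5u8, 6])];
    assert_eq!(recover(&shares, 2), vec![1, 2, 0, 0, 5, 6]);
    println!("ok");
}
```

Both input_orig and restored outlive the extend_from_slice loop, which is the lifetime argument the review verifies for the real code.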

Verdict

The core optimizations are correct and well-tested. The main concern is the unpublished jf-merkle-tree branch dependency — that should be tagged before merge. The domain separator fragility is a minor long-term risk, not a blocker.



@github-actions
Contributor

github-actions Bot commented Apr 30, 2026

Nextest failures (2) in this run

| Test | Attempts | Time (s) | Main history |
|---|---|---|---|
| hotshot-testing::test_epochs_combined_network::test_combined_network_reup_with_epochs::testtypes_::combinedimpl_::test_combined_network_reup_with_epochs | 1 | 45.38 | passing |
| hotshot-testing::tests_2::tests_2::catchup::test_all_restart_one_da::testtypes_::cliquenetimpl_::test_all_restart_one_da | 1 | 11.28 | passing |

See the step summary for flaky tests and slowest tests.

@github-actions
Contributor

github-actions Bot commented May 6, 2026

Nextest failures (1) in this run

| Test | Attempts | Time (s) | Main history |
|---|---|---|---|
| espresso-node::api::test::test_aggregator_namespace_endpoints | 3 | 29.56 | flaky 5/5 |

See the step summary for flaky tests and slowest tests.
