chore(perf): misc improvement on new VID #4223

Open
mrain wants to merge 14 commits into main from cl/vidimprove

Conversation

@mrain
Contributor

@mrain mrain commented Apr 24, 2026

Closes #<ISSUE_NUMBER>

This PR:

VID performance improvements: see #4223 (comment)

  1. Hash swap — Keccak-256 → BLAKE3 (arity 4 Merkle tree).
  2. Recover concat — pre-allocated extend_from_slice instead of flat_map().collect().
  3. Recover intermediate removed — Vec<Option<&[u8]>> refs + direct decoder→output memcpy, no .clone()/.to_vec(), no unwrap().
  4. Disperse padding — single-pass chunk_and_pad, no intermediate padded buffer.
  5. Disperse share assembly — consume owned shares via iterator, no per-recipient .to_vec().
  6. Parallelism — par_iter across namespaces in ns_disperse, recover, and verify_share_with_verified_common; inner par_iter with find_any short-circuit in verify_share.
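Item 2, sketched minimally with std-only code (function names are illustrative, not the crate's actual API):

```rust
// Illustrative sketch of the recover-concat change (item 2). The real
// code lives in vid/src/avidm_gf2.rs; these helpers are stand-ins.

// Before: flatten-and-collect, which cannot pre-size the output.
fn concat_flat_map(shards: &[&[u8]]) -> Vec<u8> {
    shards.iter().flat_map(|s| s.iter().copied()).collect()
}

// After: reserve the exact capacity once, then memcpy each shard in
// with extend_from_slice.
fn concat_preallocated(shards: &[&[u8]]) -> Vec<u8> {
    let total: usize = shards.iter().map(|s| s.len()).sum();
    let mut out = Vec::with_capacity(total);
    for s in shards {
        out.extend_from_slice(s);
    }
    out
}

fn main() {
    let shards = vec![b"abc".as_slice(), b"de", b"fghi"];
    assert_eq!(concat_flat_map(&shards), concat_preallocated(&shards));
    assert_eq!(concat_preallocated(&shards), b"abcdefghi".to_vec());
    println!("ok");
}
```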

This PR does not:

Key places to review:

@claude
Contributor

claude Bot commented Apr 24, 2026

Claude finished @mrain's task in 4m 34s


Review of PR #4223 — VID Performance Improvements

  • Read existing review comments
  • Review vid/src/avidm_gf2.rs (core changes)
  • Review vid/src/avidm_gf2/namespaced.rs (namespace parallelism)
  • Review vid/Cargo.toml (dependency changes)
  • Review benchmark changes
  • Post review feedback

Summary

Solid performance work — the optimizations are well-motivated and the numbers in the PR description are impressive (4.8x disperse, 2.8x verify, 8.7x recover). The code changes are clean and correct. I left inline comments on specific files; here's the high-level:

Issues to address

  1. Missing benchmark files (build-breaking): vid/Cargo.toml declares [[bench]] entries for avidm_gf2_ns and avidm_gf2_breakdown, but neither vid/benches/avidm_gf2_ns.rs nor vid/benches/avidm_gf2_breakdown.rs exists. This will break cargo bench -p vid.

  2. blake3 downgrade to 1.5.5 affects entire workspace — The ~1.5 pin (needed for traits-preview) forces blake3 from 1.8.4 → 1.5.5 across all 8+ crates that depend on it. This loses newer SIMD optimizations and ties the workspace to an explicitly-unstable feature. Consider using blake3's native hash() API instead to avoid the Digest trait dependency entirely.

  3. Serialization compatibility — Switching from Keccak256 (arity 2) to BLAKE3 (arity 4) changes MerkleCommit and MerkleProof formats. Since AvidmGf2Commit appears in VidCommon::V2 which has reference tests (test_vid_common_v2_query_data), please confirm cargo test -p espresso-types reference passes (or update the reference data).

Minor / informational

  • use sha2::Digest on line 11 of avidm_gf2.rs is misleading now that the hasher is blake3 — digest::Digest would be clearer.
  • Benchmark parameters changed significantly (10k→1k weights, 5→10 MB payload), making comparison with prior baselines impossible. Consider keeping the old config as a separate group.

What looks good

  • chunk_and_pad — Clean single-pass implementation, all edge cases handled correctly (boundary-aligned payloads, sub-chunk payloads, empty trailing chunks).
  • find_any short-circuit pattern in verify_share and verify_share_with_verified_common — Correct and efficient. The non-deterministic error variant ordering is acceptable for verification.
  • recover refactor — Using &[u8] references instead of Option<Vec<u8>> clones is a solid improvement. The lifetime management is correct: input_orig borrows from shares (input param), and result.restored_original() borrows from the decoder result — both live long enough for the extend_from_slice loop.
  • Namespace-level parallelism in namespaced.rs: ns_disperse, verify_share_with_verified_common, and recover all correctly parallelize over independent namespaces. Rayon's work-stealing handles the nested par_iter calls from the inner AvidmGf2Scheme methods.
  • Share assembly in disperse — Consuming owned shares via shares.into_iter() + by_ref().take() instead of shares[range].to_vec() avoids redundant heap allocations.
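For reference, a hedged std-only sketch of what a single-pass chunk_and_pad can look like (the function name matches the PR, but the 0x01-then-zeros pad scheme and the exact signature are assumptions inferred from the benchmark's strip logic, not the crate's code):

```rust
// Assumed pad scheme: one 0x01 marker byte after the payload, then
// zeros to the next chunk boundary. Writes chunks directly, with no
// intermediate padded buffer.
fn chunk_and_pad(payload: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    assert!(chunk_size > 0);
    // Smallest multiple of chunk_size >= payload.len() + 1, so there is
    // always room for the marker byte.
    let padded_len = (payload.len() / chunk_size + 1) * chunk_size;
    let mut chunks = Vec::with_capacity(padded_len / chunk_size);
    for start in (0..padded_len).step_by(chunk_size) {
        let mut chunk = Vec::with_capacity(chunk_size);
        let end = (start + chunk_size).min(payload.len());
        if start < payload.len() {
            chunk.extend_from_slice(&payload[start..end]);
        }
        if chunk.len() < chunk_size {
            // The first free byte after the payload is the 0x01 marker...
            if start <= payload.len() && end == payload.len() {
                chunk.push(1);
            }
            // ...then zeros fill the rest of the chunk.
            chunk.resize(chunk_size, 0);
        }
        chunks.push(chunk);
    }
    chunks
}

fn main() {
    // Partial final chunk: marker + zero fill.
    assert_eq!(
        chunk_and_pad(b"hello", 4),
        vec![b"hell".to_vec(), vec![b'o', 1, 0, 0]]
    );
    // Boundary-aligned payload gets one extra all-padding chunk.
    assert_eq!(
        chunk_and_pad(b"abcd", 4),
        vec![b"abcd".to_vec(), vec![1, 0, 0, 0]]
    );
    println!("ok");
}
```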

Comment threads (collapsed):
  • vid/Cargo.toml
  • Cargo.toml (outdated)
  • vid/src/avidm_gf2.rs (outdated)
  • vid/src/avidm_gf2.rs
  • vid/src/avidm_gf2.rs
  • vid/benches/avidm_gf2.rs

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes the avidm_gf2 scheme by switching the hashing algorithm to blake3 and implementing several memory and concurrency improvements, such as avoiding intermediate allocations during padding and parallelizing dispersal, verification, and recovery using rayon. Feedback identifies security risks associated with downgrading blake3 and constant_time_eq dependencies in Cargo.lock. Additionally, there is a suggestion to parallelize the sequential payload extraction in the disperse function to further improve performance.

Comment threads (collapsed):
  • Cargo.lock (outdated)
  • Cargo.lock (outdated)
  • vid/src/avidm_gf2.rs
@EspressoSystems EspressoSystems deleted a comment from github-actions Bot Apr 24, 2026
mrain and others added 4 commits April 28, 2026 09:07
vid/Cargo.toml declared `[[bench]]` entries for both files but they were
not committed, breaking `cargo bench -p vid` and CI.

- avidm_gf2_ns.rs: namespaced disperse/verify/recover sweep across
  num_ns ∈ {1, 10, 50, 100} matching the avidm_gf2 flat parameters.
- avidm_gf2_breakdown.rs: per-phase timing of disperse/recover, hash
  parameterized so a single run reports BLAKE3 vs Keccak side-by-side.
Replace `jf_merkle_tree::hasher::HasherDigest` (which requires
`digest::Digest`) with a local `Blake3DigestAlgorithm` + `Blake3Node`
that call `blake3::hash` / `blake3::Hasher` directly. This unblocks
upgrading `blake3` past 1.5.x — newer releases bumped to `digest 0.11`
which is incompatible with `jf_merkle_tree`'s `digest 0.10` blanket
impl, forcing a workspace-wide pin on `~1.5` and dragging
`constant_time_eq` back to 0.3.x.

After this:

  - Workspace pin lifted from `blake3 = "~1.5"` to `blake3 = "1"`,
    resolving to 1.8.5 (constant_time_eq 0.4.2). Same blake3 algorithm,
    no on-wire change.
  - `vid` no longer enables `blake3/traits-preview`, an explicitly-
    unstable feature whose `digest` version is the source of the pin.
  - Wire format unchanged: 32-byte node values, leaf/internal domain
    separators `b"1"`/`b"0"`, identical to
    `HasherDigestAlgorithm + HasherNode<blake3::Hasher>`. Reference
    test `test_vid_common_v2_query_data` still matches the data added
    in 5cf2687 — confirmed by `cargo test -p espresso-types reference`.
  - `avidm_gf2_breakdown` bench abstracted over a small `MerkleBackend`
    trait so both Keccak (via `HasherDigest`) and BLAKE3 (via
    `Blake3DigestAlgorithm`) plug in without code duplication.
  - `vid::utils` is now `pub` so benches can reach the BLAKE3 helpers.

Addresses Cargo.lock blake3/constant_time_eq downgrade comments and
the `traits-preview` instability concern from the PR review.
Co-authored-by: Copilot <copilot@github.com>
@mrain
Contributor Author

mrain commented Apr 28, 2026

VID — original vs current, ST + MT, 1/10/100 ns, local + AWS

Snapshot of the cumulative impact of this PR. Workload: arity 4 (current) vs arity 3 (original), 10 MB payload, total_weights = 1000, recovery_threshold = 340. Original = Keccak baseline; current = BLAKE3 + all landed optimizations (per-namespace par_iter, recover refactor with Vec<Option<&[u8]>>, batch proof generation via collect_leaves_with_proof, drain proofs into per-recipient groups, upstream jf_merkle_tree::collect_all_with_proof allocation refactor).

Local (Apple Silicon)

| op | mode | 1 ns orig | 1 ns curr | 10 ns orig | 10 ns curr | 100 ns orig | 100 ns curr |
|---|---|---|---|---|---|---|---|
| Disperse | ST | 64.1 ms | 31.4 ms | 77.7 ms | 58.5 ms | 193.7 ms | 109.4 ms |
| Disperse | MT | 20.6 ms | 15.6 ms | 34.5 ms | 13.8 ms | 139.0 ms | 35.3 ms |
| Verify | ST | 52.4 µs | 20.9 µs | 78.1 µs | 66.5 µs | 331.3 µs | 241.4 µs |
| Verify | MT | 52.4 µs | 20.7 µs | 78.1 µs | 51.5 µs | 331.3 µs | 118.3 µs |
| Recover | ST | 61.9 ms | 42.1 ms | 64.5 ms | 41.1 ms | 101.2 ms | 85.5 ms |
| Recover | MT | 61.9 ms | 39.9 ms | 64.5 ms | 10.1 ms | 101.2 ms | 11.3 ms |

AWS Graviton (aarch64, 1 vCPU — MT ≈ ST)

| op | 1 ns orig | 1 ns curr | 10 ns orig | 10 ns curr | 100 ns orig | 100 ns curr |
|---|---|---|---|---|---|---|
| Disperse | 95.4 ms | 51.5 ms | 113.7 ms | 80.6 ms | 265.3 ms | 191.3 ms |
| Verify | 67.5 µs | 25.6 µs | 92.5 µs | 68.4 µs | 348.5 µs | 286.1 µs |
| Recover | 125.9 ms | 106.8 ms | 107.3 ms | 79.7 ms | 155.7 ms | 106.9 ms |

Notes

  • AWS box is 1 vCPU, so its "MT" is the same as ST — none of the per-namespace par_iter wins materialize there. The AWS column is the single-thread story end-to-end.
  • Local MT scaling is dramatic at high num_ns: dispersal 3.9×, recover 9.0×. The original code had no per-namespace parallelism in either path; landing par_iter over ns_table exposed all of it.
  • Verify ST regresses ~7 % at 10/100 ns: par_iter().find_any() adds rayon-dispatch overhead under RAYON_NUM_THREADS=1. Small enough to land vs the 2.8× MT win.
  • Original "MT" numbers are from the workspace-default rayon pool before any per-namespace par_iter landed; rayon was already used internally by AvidmGf2Scheme::disperse for leaf hashing and share assembly, hence the non-trivial pre-PR MT speedups vs ST. The wins above are additional to that.

mrain and others added 2 commits April 28, 2026 16:05
Co-authored-by: Copilot <copilot@github.com>
@mrain
Contributor Author

mrain commented Apr 30, 2026

@claude review

@claude
Contributor

claude Bot commented Apr 30, 2026

Claude finished @mrain's task in 3m 25s


Review of PR #4223 — VID Performance Improvements

  • Read previous review comments and understand resolved items
  • Review vid/src/utils/blake3.rs (new file — BLAKE3 native API integration)
  • Review vid/src/avidm_gf2.rs (core optimizations)
  • Review vid/src/avidm_gf2/namespaced.rs (namespace parallelism)
  • Review vid/src/lib.rs and vid/src/utils.rs (module wiring)
  • Review vid/Cargo.toml and Cargo.toml (dependency changes)
  • Review benchmark files
  • Review reference test data updates
  • Post review feedback

Summary

The PR is in significantly better shape than the first review. The blake3 version pin issue is resolved (now at 1.8.5 with "1" semver range), and all the core optimizations are clean. The benchmarks show impressive gains — especially the 3.9x MT disperse and 9.0x MT recover at 100 namespaces.

Resolved from prior review

  • blake3 version pin: Now blake3 = "1" in workspace, resolving to 1.8.5 in the lockfile. No more traits-preview dependency or workspace-wide downgrade.
  • Missing benchmark files: avidm_gf2_ns.rs, avidm_gf2_breakdown.rs, and big_mt.rs are all present.
  • Serialization compatibility: Reference test data in data/v2/ has been updated with new BLAKE3-based commitments.

Items to address

1. jf-merkle-tree pinned to an unpublished branch

Cargo.toml:240 pins jf-merkle-tree to branch = "cl/mtiter" instead of a tagged release. This presumably provides the collect_leaves_with_proof() method used in avidm_gf2.rs:241. This should be tagged and released before merging to main — branch references can break if the branch is rebased or deleted.

2. Domain separator fragility in blake3.rs

vid/src/utils/blake3.rs:24-25 copies domain separators (b"1" / b"0") from jf_merkle_tree's pub(crate) constants. These are correct today and the hashing order matches, but if jellyfish ever changes them the VID will silently produce incompatible commitments. Consider adding a compile-time or test-time assertion that validates these against a tree built through the standard HasherDigestAlgorithm path — the existing deterministic_commitment test only checks self-consistency.
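To make the suggested cross-check concrete, here is a std-only sketch of its shape. std's DefaultHasher stands in for blake3 here; the real test would compare roots built through Blake3DigestAlgorithm against the HasherDigestAlgorithm path:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Domain separators as described in the review; if jf_merkle_tree ever
// changes its pub(crate) constants, this check should be pointed at the
// canonical hashing path instead of local copies.
const LEAF_SEP: &[u8] = b"1";
const INTERNAL_SEP: &[u8] = b"0";

// Stand-in for a 32-byte blake3 digest, used only to show the shape of
// the cross-check: separator first, then children, in a fixed order.
fn node_hash(sep: &[u8], children: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    sep.hash(&mut h);
    for c in children {
        c.hash(&mut h);
    }
    h.finish()
}

fn main() {
    // A reference recomputation with the separators written out
    // explicitly must match the constant-based helper exactly.
    assert_eq!(node_hash(b"1", &[b"payload"]), node_hash(LEAF_SEP, &[b"payload"]));
    assert_eq!(
        node_hash(b"0", &[b"left", b"right"]),
        node_hash(INTERNAL_SEP, &[b"left", b"right"])
    );
    // Leaf and internal domains must not collide on identical input.
    assert_ne!(node_hash(LEAF_SEP, &[b"x"]), node_hash(INTERNAL_SEP, &[b"x"]));
    println!("ok");
}
```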

Minor observations

3. Arity change from 2 to 4 is not documented in the version table

The Merkle tree arity changed from 2 (via HasherMerkleTree) to 4 (explicit JfMerkleTree<..., 4, ...>). The PR description mentions "arity 4 Merkle tree" but the code type alias at avidm_gf2.rs:27 is the only documentation. Since this affects proof sizes and commitment formats for V2 (VID2), it would be good to note the arity in the protocol version table or a code comment near the type alias.

4. avidm_gf2_breakdown.rs strips padding with an if-let chain

vid/benches/avidm_gf2_breakdown.rs:340-344:

```rust
if let Some(idx) = recovered.iter().rposition(|&b| b != 0)
    && recovered[idx] == 1u8
{
    recovered.truncate(idx);
}
```

This silently succeeds on malformed payloads (no pad byte found), unlike the production recover() which returns Err. Acceptable for a benchmark, but worth noting if this code is ever copied.
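For anyone who does copy it, a production-style variant that fails loudly might look like this (strip_pad and the error type are illustrative, not the crate's actual recover() code):

```rust
// Strip the 0x01-then-zeros padding, failing on malformed payloads
// instead of silently returning them unchanged.
fn strip_pad(mut recovered: Vec<u8>) -> Result<Vec<u8>, &'static str> {
    match recovered.iter().rposition(|&b| b != 0) {
        Some(idx) if recovered[idx] == 1 => {
            recovered.truncate(idx);
            Ok(recovered)
        }
        // Last nonzero byte is not the pad marker, or the buffer is
        // all zeros / empty: the padding is malformed.
        _ => Err("malformed padding"),
    }
}

fn main() {
    assert_eq!(strip_pad(vec![b'h', b'i', 1, 0, 0]), Ok(b"hi".to_vec()));
    assert!(strip_pad(vec![b'h', b'i', 2, 0]).is_err());
    assert!(strip_pad(vec![0, 0, 0]).is_err());
    println!("ok");
}
```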

5. sha2 and sha3 are still in vid's dependency list

vid/Cargo.toml still lists sha2 and sha3 as dependencies. sha3 is still needed for the KeccakBackend comparison in the avidm_gf2_breakdown.rs benchmark, but if nothing else in the vid crate uses sha2, it could be removed.

What looks good

  • Blake3DigestAlgorithm + Blake3Node (utils/blake3.rs): Clean implementation. The HasherWriter adapter for CanonicalSerialize is the right approach. Tests cover determinism, proof round-trips, tamper detection, and serialization. The #[repr(transparent)] on Blake3Node and raw 32-byte CanonicalSerialize are correct.

  • chunk_and_pad (avidm_gf2.rs:128-158): Single-pass, no intermediate buffer. All edge cases handled correctly — boundary-aligned payloads, sub-chunk, and empty trailing chunks. The guard padded_len < payload.len() + 1 is correct.

  • disperse share assembly (avidm_gf2.rs:235-261): The shares.into_iter() + by_ref().take() pattern correctly avoids per-recipient heap clones. Using collect_leaves_with_proof() for batch proof generation (single tree traversal) instead of range.len() individual mt.lookup() calls is a significant improvement.

  • recover refactor (avidm_gf2.rs:297-376): Vec<Option<&[u8]>> instead of Vec<Option<Vec<u8>>> eliminates all per-shard clones. The lifetime management is correct — input_orig borrows from shares (the input parameter), and result.restored_original() borrows from the decoder result, both of which live through the extend_from_slice loop. The early capacity reservation Vec::with_capacity(original_count * shard_bytes) is also good.

  • Namespace-level parallelism (namespaced.rs): ns_disperse, verify_share_with_verified_common, and recover all correctly parallelize over independent namespaces with par_iter. The find_any short-circuit pattern is correct and efficient for verification.

  • verify_share par_iter + find_any (avidm_gf2.rs:277-294): Clean parallel verification with short-circuit on first failure. The match arms correctly propagate Ok(Err(())) (proof mismatch) vs Err(e) (internal error).
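The borrowing pattern described above can be sketched in isolation; decode_missing here is a stand-in for the real GF(2) decoder, and the shapes are deliberately simplified:

```rust
// Stand-in decoder: pretends each missing shard decodes to zeros.
fn decode_missing(present: &[Option<&[u8]>], shard_bytes: usize) -> Vec<Vec<u8>> {
    present
        .iter()
        .filter(|s| s.is_none())
        .map(|_| vec![0u8; shard_bytes])
        .collect()
}

// Shard slots hold Option<&[u8]> borrowed either from the caller's
// shares or from the decoder output, so concatenation never clones.
fn recover(shares: &[Option<Vec<u8>>], shard_bytes: usize) -> Vec<u8> {
    // Borrow what we already have; no .clone()/.to_vec().
    let input_orig: Vec<Option<&[u8]>> =
        shares.iter().map(|s| s.as_deref()).collect();
    let restored = decode_missing(&input_orig, shard_bytes);
    let mut restored_iter = restored.iter();

    // Single pre-allocated output; every append is a memcpy.
    let mut out = Vec::with_capacity(shares.len() * shard_bytes);
    for slot in &input_orig {
        let shard: &[u8] = match slot {
            Some(s) => s,
            None => restored_iter.next().expect("decoder output").as_slice(),
        };
        out.extend_from_slice(shard);
    }
    out
}

fn main() {
    let shares = vec![Some(vec![1u8, 2]), None, Some(vec![5u8, 6])];
    assert_eq!(recover(&shares, 2), vec![1, 2, 0, 0, 5, 6]);
    println!("ok");
}
```

Both input_orig and restored outlive the extend_from_slice loop, which is the lifetime argument the review verifies for the real code.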

Verdict

The core optimizations are correct and well-tested. The main concern is the unpublished jf-merkle-tree branch dependency — that should be tagged before merge. The domain separator fragility is a minor long-term risk, not a blocker.



@github-actions
Contributor

github-actions Bot commented Apr 30, 2026

Nextest failures (2) in this run

| Test | Attempts | Time (s) | Main history |
|---|---|---|---|
| hotshot-testing::test_epochs_combined_network::test_combined_network_reup_with_epochs::testtypes_::combinedimpl_::test_combined_network_reup_with_epochs | 1 | 45.38 | passing |
| hotshot-testing::tests_2::tests_2::catchup::test_all_restart_one_da::testtypes_::cliquenetimpl_::test_all_restart_one_da | 1 | 11.28 | passing |

See the step summary for flaky tests and slowest tests.

@github-actions
Contributor

github-actions Bot commented May 6, 2026

Nextest failures (1) in this run

| Test | Attempts | Time (s) | Main history |
|---|---|---|---|
| espresso-node::api::test::test_aggregator_namespace_endpoints | 3 | 29.56 | flaky 5/5 |

See the step summary for flaky tests and slowest tests.
