fix(soundness): bind ShardRam y-sign to is_global_write #1344
Pull request overview
Findings (sorted by severity)
- Blocker | `ceno_zkvm/src/tables/shard_ram.rs` (new `assert_byte`/`lookup_ltu_byte` constraints): LK multiplicity aggregation/order looks inconsistent with new lookups.
  This PR introduces new LK interactions in `ShardRamConfig::configure` (`assert_byte` + `lookup_ltu_byte`). However, the global LK multiplicities used to assign the DynamicRange/LTU table circuits are finalized and those table circuits are assigned before `ShardRamCircuit` is assigned in the shard pipeline (see `ceno_zkvm/src/e2e.rs:1500-1589`). As written, ShardRam’s new lookup usage does not appear to contribute to `combined_lk_mlt` prior to table-circuit assignment, which is expected to break the logup multiset check (or otherwise leave these lookups unaccounted).
  Suggested fix: update the witness/LK aggregation flow so ShardRam’s byte/LTU lookups contribute to the global multiplicity before `Rv32imConfig::assign_table_circuit` runs (e.g., collect a per-chip multiplicity for ShardRam and include it in `ZKVMWitnesses.lk_mlts` prior to `finalize_lk_multiplicities`, or reorder assignment so ShardRam is assigned before lookup-table circuits).
- Major | `ShardRamRecord::to_ec_point`: half-of-field boundary is off by one vs the new convention.
  `is_y_in_2nd_half` currently uses `y6 >= prime/2`. For odd primes, `prime/2 == (p-1)/2`, so the boundary value `y6 == (p-1)/2` is classified as “second half”, causing the new convention (“read => `[1,(p-1)/2]`”) to be violated and potentially making otherwise-valid witnesses fail the new in-circuit banding.
  Suggested fix: compare against `(prime + 1)/2` (or use a strict `y6 > prime/2`) to match the stated ranges.
- Major | BabyBear-only guard uses `debug_assert_eq!` in circuit configuration.
  The constraint relies on BabyBear’s `(p-1)/2 = 60·2^24`, but `debug_assert_eq!` is compiled out in release. If instantiated over a different field, the circuit would silently become incorrect.
  Suggested fix: enforce at runtime (e.g., `assert_eq!` or `return Err(CircuitBuilderError::CircuitError(..))`).
- Minor | Comment accuracy in `to_ec_point`.
  The “2-torsion case where `(x,y)==(x,-y)`” phrasing is misleading: `y6 == 0` doesn’t imply the full y-coordinate is zero; it only means that limb is fixed under negation, which is what makes the chosen encoding ambiguous/unsatisfiable.
  Suggested fix: reword the comment to reflect the actual reason for rejection.
- Minor (testing) | New test does not assert constraint/prover rejection.
  `test_shard_ram_y_sign_circuit_rejects_negation` currently checks derived limb properties (`b3 < 60` vs `>= 60`) but doesn’t actually run a constraint satisfiability check / mock prover / proof attempt that must fail for the tampered row. This can pass even if the lookup constraint is missing or if LK-table population is broken.
  Suggested fix: make it a true regression by asserting the tampered witness fails constraint satisfaction (e.g., via `MockProver` with the necessary public inputs / table chips, or by attempting proof generation and asserting it errors).
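The off-by-one in the boundary finding above is easy to check numerically. The following is a minimal sketch, assuming the BabyBear modulus `0x78000001` quoted elsewhere in this PR; the two function names are hypothetical and are not the repository's `is_y_in_2nd_half`.

```rust
/// BabyBear modulus as quoted in this PR (p = 0x78000001).
const P: u32 = 0x7800_0001;

/// Strict comparison: the boundary (p-1)/2 stays in the lower (read) half.
/// Hypothetical name, for illustration only.
fn in_second_half_strict(y6: u32) -> bool {
    y6 > P / 2
}

/// Non-strict comparison flagged by the review: the boundary (p-1)/2
/// is classified as "second half". Hypothetical name.
fn in_second_half_non_strict(y6: u32) -> bool {
    y6 >= P / 2
}
```

Because `P` is odd, integer division gives `P / 2 == (P - 1) / 2`, so the two variants disagree only on that single boundary value.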
Open questions / assumptions
- Is ShardRam always assigned after lookup-table circuits in all proving entrypoints (CPU + GPU + recursion pipelines)? If yes, the LK multiplicity/order blocker needs a design-level fix (not just a local change).
- Is it acceptable to hard-fail (non-debug) when `BaseField != BabyBear`, or is there a preferred feature-gate pattern for BabyBear-only chips in this repo?
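For the second question, one possible shape of a non-debug guard is sketched below. The error type and function name are hypothetical stand-ins (not Ceno's `CircuitBuilderError`), and the sketch only shows the idea of returning an error in every build profile rather than relying on `debug_assert_eq!`.

```rust
/// BabyBear modulus as quoted in this PR.
const BABYBEAR_MODULUS: u64 = 0x7800_0001;

/// Hypothetical error type standing in for the repository's own error enum.
#[derive(Debug, PartialEq)]
enum ConfigError {
    UnsupportedField { modulus: u64 },
}

/// Rejects any field other than BabyBear. Unlike `debug_assert_eq!`,
/// this check is also present in release builds.
fn check_field_supported(modulus: u64) -> Result<(), ConfigError> {
    if modulus != BABYBEAR_MODULUS {
        return Err(ConfigError::UnsupportedField { modulus });
    }
    Ok(())
}
```

A caller would propagate the error with `?` from the configuration routine, so an unsupported field fails loudly instead of producing a circuit with a wrong bound.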
Changes:
- Host-side `to_ec_point` now rejects `y6 == 0` and documents the read/write y-half convention.
- Circuit-side: adds byte-decomposition + lookup constraints and a conditional equality binding `y6` to `is_global_write`.
- Adds a targeted unit test around the y-sign binding logic for honest vs sign-flipped points.
## Problem

`ShardRamCircuit` differentiates a global *read* from a global *write* by writing one of (x, y) or (x, -y) into the witness. Before this fix nothing constrained which y was chosen, so an attacker could flip is_global_write and migrate a record between the read/write sets without changing anything else in the witness. The y-sign was the entire signal: a soundness break.

## Design Rationale

Bind the sign of `y6 = y[SEPTIC_EXTENSION_DEGREE - 1]` to is_global_write via a half-of-field convention:

- read (is_global_write = 0): y6 in [1, (p-1)/2]
- write (is_global_write = 1): y6 in [(p+1)/2, p-1]

For BabyBear `(p-1)/2 = 60 * 2^24` exactly, so a witnessed `y6_lo in [0, (p-1)/2)` decomposes into four bytes with top byte `b3 < 60`. Three U8 `assert_byte` queries plus one `lookup_ltu_byte(b3, 60, 1)` bound y6_lo, then a single `condition_require_equal` ties y6 to either `y6_lo + 1` (read) or `(p-1) - y6_lo` (write) under the is_global_write selector. y6 = 0 is the unique fixed point not covered by either branch; `to_ec_point` skips it so the prover doesn't generate an unprovable record.

Mirror the partition on the prover side: `to_ec_point` uses `y6 > prime / 2` (strict; `(p-1)/2` belongs to the read region) to decide whether to negate the natural sqrt, and bumps the nonce when y6 = 0.

## Change Highlights

### `ceno_zkvm/src/tables/shard_ram.rs`: chip-level y-sign binding

- `ShardRamConfig`: add `y6_lo_bytes: [WitIn; 4]`; in `configure` emit 3 x `assert_byte` + 1 x `lookup_ltu_byte(_, 60, 1)` and one `condition_require_equal` tying y6 to is_global_write under the is_global_write selector.
- `to_ec_point`: skip the `y6 = 0` case; classify `y6 > prime / 2` (strict, so the boundary `(p-1)/2` stays read) to decide whether to negate the natural sqrt.
- `assign_instance`: write the four `y6_lo` byte limbs via the new `y6_lo_value` helper. mlt is surfaced via the new `assign_instances_with_lk_multiplicities` entry below; no per-row push is left dangling.

### Lookup-multiplicity plumbing for ShardRam

ShardRam's per-row y6_lo byte / LTU lookups must reach `combined_lk_mlt` so the U8 / LTU table `mlt` columns balance. ShardRam runs after opcode + dummy circuits, before `finalize_lk_multiplicities`. To surface mlt without burdening every other table circuit:

- `ceno_zkvm/src/tables/mod.rs`: `TableCircuit` trait gains a second default-unimplemented method `assign_instances_with_lk_multiplicities` alongside the existing `assign_instances`. ShardRam overrides the former; every other table keeps overriding the latter.
- `ceno_zkvm/src/structs.rs`: `ZKVMWitnesses::assign_shared_circuit` threads a `LkMultiplicity::default()` through ShardRam's parallel-chunk witgen and inserts `lk_multiplicity.into_finalize_result()` into `lk_mlts["ShardRamCircuit"]` before finalize. Asserts swap from `combined_lk_mlt.is_some()` to `is_none()` to lock the ordering. `assign_table_circuit` tolerates `combined_lk_mlt = None` by passing an empty multiplicity slice, so LocalFinalCircuit (which ignores the argument anyway) can also run before finalize.
- `ceno_zkvm/src/e2e.rs`: move `MmuConfig::assign_continuation_circuit` (LocalFinal + ShardRam) to just before `finalize_lk_multiplicities`. Mirror the move inside the GPU debug-compare block so `combined_lk_mlt` diff stays meaningful.
- `ceno_zkvm/src/instructions/riscv/rv32im/mmu.rs`: docstring updated to describe the new ordering invariant.

### Device-resident GPU shortcut for ShardRam (mlt mirror)

`ZKVMWitnesses::try_assign_shared_circuit_gpu` dispatches into `instructions::gpu::chips::shard_ram::try_gpu_assign_shared_circuit` to keep the continuation EC computation device-resident (`gpu_batch_continuation_ec_on_device` + `merge_and_partition_records`) when `is_gpu_witgen_enabled()`.
The GPU kernels never enter the CPU `assign_instance` per-row push, so the y6_lo lookup multiplicity is derived host-side:

- After step 6 of `try_gpu_assign_shared_circuit` (merge+partition), D2H `partitioned_buf` once to `Vec<u32>` and walk it with stride `record_u32s = 26` (`GpuShardRamRecord` `#[repr(C)]` layout). Per record extract `is_to_write_set` (u32 offset 10) and `point_y[6]` (u32 offset 25), compute `y6_lo`, push the same 4 lookup queries the CPU path emits per row, then `into_finalize_result()` and return alongside the chunked `Vec<ChipInput<E>>`. `debug_assert_eq!(record_u32s, 26)` guards against `ceno_gpu` layout drift.
- `try_assign_shared_circuit_gpu` inserts both `ChipInput` and the derived multiplicity into `self.witnesses` / `self.lk_mlts["ShardRamCircuit"]` so finalize folds the GPU-path contribution into `combined_lk_mlt` the same way the CPU shortcut does.

### Verifier: account for `has_ecc_ops` row doubling

`ShardRamCircuit::has_ecc_ops()` adds an extra hypercube variable; the chip matrix has `2 * next_pow2(num_instance)` rows where the back half is EC-tree internal nodes with `selector_zero = 0`. Before this fix the chip had `num_lks = 0`, so the verifier's `dummy_table_item_multiplicity` correction never had to consider it. With the new byte/LTU queries the correction under-counted dummy lookups by a factor of 2 and shard verification failed with `logup_sum != 0`.

- `ceno_zkvm/src/scheme/verifier.rs`: multiply `next_pow2_instance` by 2 when `circuit_vk.get_cs().has_ecc_ops()`.
- `ceno_recursion/src/zkvm_verifier/verifier.rs`: mirror the same adjustment in the recursive verifier (lockstep per CLAUDE.md).

### Tests

- `tables::shard_ram::tests::test_shard_ram_y_sign_circuit_rejects_negation` drives `assign_instances_with_lk_multiplicities` + `MockProver`. The honest row satisfies every constraint; the tampered row (same record, negated EC point) trips `lookup_Ltu` on the wrong-sign b3.
  A concrete challenge is supplied so the no-challenge `run` path doesn't drop `structural_witin`.
- `test_shard_ram_circuit` updated to call `assign_instances_with_lk_multiplicities`.

## Testing

```
cargo fmt --all --check
cargo make clippy  # -D warnings, dev profile
cargo clippy --workspace --all-targets --release
cargo test --workspace --lib --release
cargo run --release -p ceno_zkvm --features sanity-check --bin e2e -- \
    --platform=ceno --max-cycle-per-shard=20000 \
    --hints=10 --public-io=4191 \
    examples/target/riscv32im-ceno-zkvm-elf/release/examples/fibonacci
```

End-to-end fibonacci across 6 shards verifies `ShardRamCircuit` and `LocalRAMTableFinal` on every shard with `exit code 0. Success.` GPU shortcut (`--features gpu` + `CENO_GPU_ENABLE_WITGEN=1`) needs a CUDA host to verify at runtime; static structure mirrors the CPU shortcut and CPU path remains identical.

## Risks and Rollout

- Soundness boundary moved: the chip now constrains the EC y-sign that was previously unconstrained. Mirrored on native and recursive verifiers; protocol/transcript order is unchanged so the two stay in lockstep.
- The `has_ecc_ops` row-factor verifier fix only manifests once any `has_ecc_ops` chip has `num_lks > 0`. ShardRam is the only such chip today; lookup balance failures elsewhere would be unrelated.
- GPU mlt offsets are read from `shard_ram_record_to_gpu` (offsets 10 and 25 in 26 u32s). `debug_assert_eq!(record_u32s, 26)` trips if `ceno_gpu` reshuffles `GpuShardRamRecord` so silent drift is caught.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
            _multiplicity: &[FxHashMap<u64, usize>],
            _input: &Self::WitnessInput<'_>,
        ) -> Result<RMMCollections<E::BaseField>, ZKVMError> {
            unimplemented!("assign_instances is not implemented for this table circuit")
        }

     let witness = TC::assign_instances(
         config,
         cs.zkvm_v1_css.num_witin as usize,
         cs.zkvm_v1_css.num_structural_witin as usize,
    -    self.combined_lk_mlt.as_ref().unwrap(),
    +    self.combined_lk_mlt.as_ref().unwrap_or(&empty_mlt),
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
    // `lookup_ltu_byte(a, b, 1)` asserts `a, b` are bytes and `a < b`.
    cb.lookup_ltu_byte(
        y6_lo_bytes[3].expr(),
        E::BaseField::from_canonical_u64(60).expr(),
@copilot please make this 60 a constant so that we can refer to it here and in witness generation.
Done in d2c12a4. I introduced `Y6_LO_TOP_BYTE_LT_BOUND` and replaced the duplicated `60` in the circuit-side check plus both CPU/GPU witness-generation lookup multiplicity paths so they all reference the same constant.
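A shared constant of this kind can also carry a compile-time link to the modulus. The sketch below is illustrative only: the constant name comes from the comment above, but its type, the `MODULUS_U64` value placement, and the two helper functions are assumptions, not the code in d2c12a4.

```rust
/// BabyBear modulus as quoted in this PR (p = 0x78000001).
const MODULUS_U64: u64 = 0x7800_0001;

/// Exclusive upper bound on the top byte of `y6_lo`.
/// The name is from the review thread; the type is an assumption.
const Y6_LO_TOP_BYTE_LT_BOUND: u64 = 60;

// Compile-time check that (p-1)/2 == 60 * 2^24, so the bound cannot
// silently drift away from the modulus.
const _: () = assert!((MODULUS_U64 - 1) / 2 == Y6_LO_TOP_BYTE_LT_BOUND << 24);

/// Circuit-side view (hypothetical helper): is the top byte in range?
fn top_byte_in_range(b3: u8) -> bool {
    (b3 as u64) < Y6_LO_TOP_BYTE_LT_BOUND
}

/// Witness-side view (hypothetical helper): the (a, b) pair that an
/// `a < b` byte lookup would be queried with.
fn ltu_query(b3: u8) -> (u64, u64) {
    (b3 as u64, Y6_LO_TOP_BYTE_LT_BOUND)
}
```

Both views read the same constant, which is the property the review comment asks for.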
Agent-Logs-Url: https://github.com/scroll-tech/ceno/sessions/401f145a-d351-4dd2-82b7-e882d683a012 Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>
Problem
Issue #1338 reproduces a soundness break on `master`. For the same RISC-V execution, the base verifier and the recursion verifier both accept two distinct proof batches whose public per-shard `shard_rw_sum` values differ on all 17 shards. The attacker takes an honest witness, replaces every cross-shard EC accumulator leaf `(x, y)` with its inverse `(x, -y)`, updates `shard_rw_sum`, and reproves.

Root cause: `ceno_zkvm/src/tables/shard_ram.rs:276-281` was a TODO. The host code in `ShardRamRecord::to_ec_point` encodes read vs write in the sign of `y[6]`, but the circuit only constrained the curve equation and the EC sum, never tying `y[6]`'s half-of-field to `is_global_write`. Both `(x, y)` and `(x, -y)` satisfied every existing check, so the public summary of cross-shard RAM flow was unbound.

The defect survives recursion (the reporter's PoC verifies through the
recursion verifier program).
Design Rationale
Approach borrows the idea from SP1's `crates/core/machine/src/operations/global_interaction.rs:210-236`, not its column layout. Three pieces:

1. Express `y[6]` in terms of a fresh witness `y6_lo` so that `y[6] = 0` is never valid in either branch (it is invariant under negation, which makes it impossible to distinguish read from write). Constrain `y6_lo` to `[0, (p-1)/2)`. For the rare exception `y[6] = 0` (probability `~1/p ≈ 2^-31` per record) the host rejects and retries with a new nonce.
2. `y6_lo` is decomposed into four byte limbs `b0..b3` (`assert_byte` for `b0..b2`, `lookup_ltu_byte(b3, 60, 1)` for `b3`). For BabyBear, `(p-1)/2 = 60·2^24` exactly, so `b3 < 60` gives the tightest no-overlap band.
3. In-circuit branch equality via `condition_require_equal`:
   - read (`is_global_write = 0`): `y[6] = y6_lo + 1` ⇒ `y[6] ∈ [1, (p-1)/2]`
   - write (`is_global_write = 1`): `y[6] = p - 1 - y6_lo` ⇒ `y[6] ∈ [(p+1)/2, p-1]`

Union covers `[1, p-1]` with no overlap; `y[6] = 0` is excluded.

Why not a single `AssertLtConfig(y6_lo, (p-1)/2, max_bits=30)`? On BabyBear (`p = 0x78000001`, 31-bit) the AssertLt gadget only constrains `lhs - rhs ≡ diff - 2^max_bits (mod p)` with `diff ∈ [0, 2^30)`; it does not pre-bound `lhs` to be canonical-small. A malicious `y6_lo ∈ [0x74000001, p-1]` (≈ 2^26 values) produces a field-wrap diff that still fits in 30 bits, so the constraint accepts upper-half values and the exploit survives. Byte-decomposing first kills the wrap. Ceno's `DynamicRangeTableCircuit<E, 18>` also does not carry 30-bit lookup entries, so a direct `assert_const_range(_, 30)` is not available anyway.

Why M = 60 (vs SP1's 63). SP1 targets KoalaBear; its `(p-1)/2 = 0x3f800000`, so 63 leaves a small safety band. For BabyBear, `(p-1)/2 = 60·2^24` exactly, so 63 would let `y[6]` straddle `p/2` and reintroduce the ambiguity.
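The field-wrap argument from the `AssertLtConfig` paragraph above can be checked with plain integer arithmetic. This is a simplified model, assuming the gadget enforces exactly the relation quoted in the text (`lhs - rhs ≡ diff - 2^max_bits (mod p)` with a 30-bit `diff`); it does not call Ceno's actual gadget, and all function names are hypothetical.

```rust
/// BabyBear modulus as quoted in this PR.
const P: u64 = 0x7800_0001;
/// rhs of the comparison: (p-1)/2 = 0x3C000000.
const HALF: u64 = (P - 1) / 2;
const MAX_BITS: u32 = 30;

/// The `diff` a prover must supply so that
/// lhs - rhs ≡ diff - 2^MAX_BITS (mod p) holds. Expects lhs < P.
fn required_diff(lhs: u64) -> u64 {
    (lhs + P - HALF + (1u64 << MAX_BITS)) % P
}

/// Modeled gadget: accepts whenever the required diff fits in 30 bits.
fn modeled_assert_lt_accepts(lhs: u64) -> bool {
    required_diff(lhs) < (1u64 << MAX_BITS)
}

/// Byte-decomposition bound: value fits in four bytes, top byte < 60.
fn byte_bound_accepts(lhs: u64) -> bool {
    lhs <= u32::MAX as u64 && (lhs >> 24) < 60
}
```

Under this model, `modeled_assert_lt_accepts` holds for the honest range `[0, HALF)` but also for every value in `[0x74000001, P - 1]`, whereas `byte_bound_accepts` rejects that upper window because its top byte is at least `0x74`.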
Also corrects the stale comment that previously had the convention
reversed (claimed write ⇒ lower half, opposite of what the host code
does).
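The read/write encoding from the design rationale can be sketched on plain integers. The helper below shares its name with the `y6_lo_value` helper mentioned in this PR, but its signature and `Option` return are assumptions made for illustration; the real helper may differ.

```rust
/// BabyBear modulus as quoted in this PR.
const P: u32 = 0x7800_0001;

/// Hypothetical stand-in for `y6_lo_value`: recovers y6_lo from y6 and
/// the write flag. Returns None when y6 == 0, y6 is not canonical, or
/// y6 lies in the wrong half for the given flag.
fn y6_lo_value(y6: u32, is_write: bool) -> Option<u32> {
    if y6 == 0 || y6 >= P {
        return None;
    }
    let lo = if is_write { P - 1 - y6 } else { y6 - 1 };
    if lo < (P - 1) / 2 { Some(lo) } else { None }
}

/// Little-endian byte limbs b0..b3 of y6_lo.
fn y6_lo_bytes(lo: u32) -> [u8; 4] {
    lo.to_le_bytes()
}

/// Recomputes y6 from the limbs as the branch equality describes:
/// read => y6_lo + 1, write => p - 1 - y6_lo.
fn y6_from_bytes(bytes: [u8; 4], is_write: bool) -> u32 {
    let lo = u32::from_le_bytes(bytes);
    if is_write { P - 1 - lo } else { lo + 1 }
}
```

For a write record with `y6 = P - 5`, the helper yields `y6_lo = 4`; presenting the same `y6` with the read flag yields `None`, which mirrors why a sign-flipped row has no valid byte decomposition.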
Change Highlights
`ceno_zkvm/src/tables/shard_ram.rs`: chip-level y-sign binding

- `ShardRamRecord::to_ec_point`: reject `y6 == 0` and try the next nonce. Classify with strict `y6 > prime / 2` so the boundary `(p-1)/2` correctly stays in the read region (a previous draft used `>=`, which misclassified that single boundary value and would have produced an out-of-range `y6_lo` for both branches).
- `ShardRamConfig`: new field `y6_lo_bytes: [WitIn; 4]`.
- `ShardRamConfig::configure`: replace the TODO with the byte decomposition, byte-range / LTU lookups, and the `condition_require_equal` branch equality.
- `ShardRamCircuit::assign_instance`: compute `y6_lo` from `y[6]` and `is_to_write_set` via a small `y6_lo_value` helper, assign byte limbs, register byte and LTU multiplicities.
- `test_shard_ram_y_sign_circuit_rejects_negation` drives `assign_instances_with_lk_multiplicities` + `MockProver` over one honest row and one sign-flipped row, asserting `lookup_Ltu` rejects the tampered witness. A concrete challenge is supplied so the no-challenge `run` path doesn't drop `structural_witin`.

Lookup-multiplicity plumbing for ShardRam
ShardRam's per-row y6_lo byte / LTU lookups must reach `combined_lk_mlt` so the U8 / LTU table `mlt` columns balance. ShardRam runs after opcode + dummy circuits, before `finalize_lk_multiplicities`. To surface mlt without burdening every other table circuit:

- `ceno_zkvm/src/tables/mod.rs`: `TableCircuit` trait gains a second default-unimplemented method `assign_instances_with_lk_multiplicities` alongside the existing `assign_instances`. ShardRam overrides the former; every other table keeps overriding the latter.
- `ceno_zkvm/src/structs.rs`: `ZKVMWitnesses::assign_shared_circuit` threads a `LkMultiplicity::default()` through ShardRam's parallel-chunk witgen and inserts `lk_multiplicity.into_finalize_result()` into `lk_mlts["ShardRamCircuit"]` before finalize. Asserts swap from `combined_lk_mlt.is_some()` to `is_none()` to lock the ordering. `assign_table_circuit` tolerates `combined_lk_mlt = None` by passing an empty multiplicity slice, so `LocalFinalCircuit` (which ignores the argument anyway) can also run before finalize.
- `ceno_zkvm/src/e2e.rs`: move `MmuConfig::assign_continuation_circuit` (LocalFinal + ShardRam) to just before `finalize_lk_multiplicities`. Mirror the move inside the GPU debug-compare block so `combined_lk_mlt` diff stays meaningful.
- `ceno_zkvm/src/instructions/riscv/rv32im/mmu.rs`: docstring updated to describe the new ordering invariant.
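The ordering invariant described in the list above (every chip that performs lookups must register its counts before finalization) can be illustrated with a toy model. The struct and method names below echo the PR's vocabulary but are hypothetical simplifications: the real `ZKVMWitnesses` keeps a slice of maps per table type, while this sketch uses a single map.

```rust
use std::collections::HashMap;

/// Toy multiplicity map: lookup key -> number of queries.
type Mlt = HashMap<u64, usize>;

/// Simplified stand-in for the witness container (assumption, not Ceno's type).
#[derive(Default)]
struct Witnesses {
    lk_mlts: HashMap<String, Mlt>,
    combined_lk_mlt: Option<Mlt>,
}

impl Witnesses {
    /// Records one chip's lookup counts; only legal before finalization.
    fn insert_chip_mlt(&mut self, chip: &str, mlt: Mlt) {
        assert!(self.combined_lk_mlt.is_none(), "chip assigned after finalize");
        self.lk_mlts.insert(chip.to_string(), mlt);
    }

    /// Sums all per-chip counts into the combined multiplicity.
    fn finalize_lk_multiplicities(&mut self) {
        let mut combined = Mlt::new();
        for mlt in self.lk_mlts.values() {
            for (key, count) in mlt {
                *combined.entry(*key).or_insert(0) += count;
            }
        }
        self.combined_lk_mlt = Some(combined);
    }
}
```

The `is_none()` assertion plays the role the PR describes for the swapped asserts: a chip that tries to contribute after finalization fails immediately instead of leaving its lookups uncounted.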
Device-resident GPU shortcut for ShardRam (mlt mirror)
`ZKVMWitnesses::try_assign_shared_circuit_gpu` dispatches into `instructions::gpu::chips::shard_ram::try_gpu_assign_shared_circuit` to keep the continuation EC computation device-resident (`gpu_batch_continuation_ec_on_device` + `merge_and_partition_records`) when `is_gpu_witgen_enabled()`. The GPU kernels never enter the CPU `assign_instance` per-row push, so the y6_lo lookup multiplicity is derived host-side:

- After step 6 of `try_gpu_assign_shared_circuit` (merge+partition), D2H `partitioned_buf` once to `Vec<u32>` and walk it with stride `record_u32s = 26` (`GpuShardRamRecord` `#[repr(C)]` layout). Per record extract `is_to_write_set` (u32 offset 10) and `point_y[6]` (u32 offset 25), compute `y6_lo`, push the same 4 lookup queries the CPU path emits per row, then `into_finalize_result()` and return alongside the chunked `Vec<ChipInput<E>>`. `debug_assert_eq!(record_u32s, 26)` guards against `ceno_gpu` layout drift.
- `try_assign_shared_circuit_gpu` inserts both `ChipInput` and the derived multiplicity into `self.witnesses` / `self.lk_mlts["ShardRamCircuit"]` so finalize folds the GPU-path contribution into `combined_lk_mlt` the same way the CPU shortcut does.
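The host-side walk over the copied-back buffer can be sketched as follows. The stride and offsets (26, 10, 25) are the ones stated above; everything else is an assumption for illustration, in particular that `point_y[6]` is stored as a canonical (non-Montgomery) nonzero value below `p`, and the function name is hypothetical.

```rust
/// BabyBear modulus as quoted in this PR.
const P: u32 = 0x7800_0001;
/// Record stride and field offsets, in u32 units, as stated in the PR.
const RECORD_U32S: usize = 26;
const IS_TO_WRITE_SET_OFFSET: usize = 10;
const POINT_Y6_OFFSET: usize = 25;

/// Walks a flat u32 buffer record by record and returns the four
/// little-endian byte limbs of y6_lo for each record.
/// Assumes each y6 is canonical and in [1, p-1] (the host rejects y6 == 0).
fn y6_lo_limbs_from_buffer(buf: &[u32]) -> Vec<[u8; 4]> {
    assert_eq!(buf.len() % RECORD_U32S, 0, "buffer is not a whole number of records");
    buf.chunks_exact(RECORD_U32S)
        .map(|rec| {
            let is_write = rec[IS_TO_WRITE_SET_OFFSET] != 0;
            let y6 = rec[POINT_Y6_OFFSET];
            assert!(y6 != 0 && y6 < P, "y6 must be canonical and nonzero");
            let lo = if is_write { P - 1 - y6 } else { y6 - 1 };
            lo.to_le_bytes()
        })
        .collect()
}
```

Each returned limb array corresponds to the three byte-range queries (`b0..b2`) and the one top-byte comparison (`b3`) that the CPU path registers per row.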
Verifier: account for `has_ecc_ops` row doubling

`ShardRamCircuit::has_ecc_ops()` adds an extra hypercube variable; the chip matrix has `2 * next_pow2(num_instance)` rows where the back half is EC-tree internal nodes with `selector_zero = 0`. Before this fix the chip had `num_lks = 0`, so the verifier's `dummy_table_item_multiplicity` correction never had to consider it. With the new byte/LTU queries the correction under-counted dummy lookups by a factor of 2 and shard verification failed with `logup_sum != 0`.

- `ceno_zkvm/src/scheme/verifier.rs`: multiply `next_pow2_instance` by 2 when `circuit_vk.get_cs().has_ecc_ops()`.
- `ceno_recursion/src/zkvm_verifier/verifier.rs`: mirror the same adjustment in the recursive verifier (lockstep per CLAUDE.md).
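The row-count adjustment can be made concrete with a small calculation. `padded_rows` follows the `2 * next_pow2(num_instance)` statement above; `dummy_rows` rests on the assumption that every row beyond the real instances counts as padding for the lookup correction, which is an illustration rather than the verifier's exact formula. Both function names are hypothetical.

```rust
/// Number of rows in the padded chip matrix: next_pow2(n), doubled
/// when the chip has EC ops (the extra hypercube variable).
fn padded_rows(num_instance: usize, has_ecc_ops: bool) -> usize {
    let base = num_instance.next_power_of_two();
    if has_ecc_ops { 2 * base } else { base }
}

/// Assumption: rows beyond the real instances are treated as dummy rows
/// whose lookups the verifier has to correct for.
fn dummy_rows(num_instance: usize, has_ecc_ops: bool) -> usize {
    padded_rows(num_instance, has_ecc_ops) - num_instance
}
```

With 5 instances, the padded matrix has 8 rows without EC ops and 16 with them, so a correction that ignores the doubling accounts for too few padded rows.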
Benchmark / Performance Impact
Per ShardRam row this PR adds 4 byte WitIn columns plus 3 byte-range
and 1 LTU lookup multiplicities. ShardRam rows scale with cross-shard
RAM events, not with cycles, so the absolute cost is sub-percent on the
prover. No full prover bench was rerun (no hot-loop arithmetic changed).
Existing `test_shard_ram_circuit` (170k reads + 1420 writes, full chip proof) runtime is unchanged within noise:
Testing
    cargo fmt --all --check
    cargo check --workspace --all-targets
    cargo check --workspace --all-targets --release
    cargo make clippy
    cargo clippy --workspace --all-targets --release -- -D warnings
    RUST_MIN_STACK=33554432 cargo test --workspace --lib --release
    cargo run --release --package ceno_zkvm --features sanity-check --bin e2e -- \
        --platform=ceno --max-cycle-per-shard=20000 --hints=10 --public-io=4191 \
        examples/target/riscv32im-ceno-zkvm-elf/release/examples/fibonacci

All pass locally on BabyBear. `test_shard_ram_circuit` and `test_shard_ram_y_sign_circuit_rejects_negation` are green. End-to-end multi-shard fibonacci verifies `ShardRamCircuit` and `LocalRAMTableFinal` on every shard with `exit code 0. Success.`

`cargo make tests` / `cargo make tests_goldilock` should be re-run by CI; the change is gated to BabyBear via a `debug_assert_eq!` on `MODULUS_U64` and goldilocks does not exercise shard_ram (per `integration.yml` commented-out lines and CLAUDE.md).

Risks and Rollout
- […] `shard_rw_sum` #1338. The new constraint only adds local byte arithmetic and existing lookups: no change to transcript, sumcheck, PCS, or EC accumulation. Recursive and native verifiers move in lockstep (the `has_ecc_ops` row-factor fix lands in both).
- […] lookup multiplicity host-side from the merged partitioned device buffer (single D2H of ~26 u32 × records). Layout assumption is guarded by `debug_assert_eq!(record_u32s, 26)` against `ceno_gpu::GpuShardRamRecord`. CPU + GPU paths converge on the same `combined_lk_mlt` contribution; runtime verification with `CENO_GPU_ENABLE_WITGEN=1 --features gpu` on a CUDA host is recommended before tag.
- […] `has_ecc_ops` × 2 row adjustment; no separate constraint-system change is needed for the y-sign binding itself.
- […] `M = 60`. A `debug_assert_eq!(MODULUS_U64, 0x78000001, ...)` guards against accidental use on a different field; shard_ram is BabyBear-only today per CLAUDE.md.
Follow-ups
[…] `shard_ram.rs` #1340 TODOs (`local read ⇄ global write` pairing on `shard_ram.rs:235-236`, `shard == shard_id` binding on line 244) are intentionally out of scope here.
Fixes #1338.
Partially addresses #1340.
Copilot Reviewer Directive (keep this section)
When Copilot reviews this PR, apply `.github/copilot-instructions.md` strictly.