Skip to content

Revisit the case for native columnar-to-row? #4440

@mbutrovich

Description

@mbutrovich

Reading through PR #3221, PR #3266, PR #3649, and epic #3268, I want to surface some questions about whether the rationale for CometNativeColumnarToRowExec still holds. I might be missing context, so this is a question rather than a proposal.

Background

PR #3221 (landing) was upfront that there was no significant throughput win:

Although this native implementation does not provide significant speed improvements, the main benefit is reduced GC pressure by avoiding Java allocations.

The benchmark table in that PR showed native winning 1 of 8 scenarios, losing 3, tying 4.

PR #3649 (enable by default) retracted its ~10% TPC-H claim as misattributed to native scan, leaving "no regression + GC pressure" as the standing rationale.

I could not find a GC measurement in any of the PR threads, in #3268, in the operator's SQLMetrics, or in spark/benchmarks/. The GC pressure benefit appears to be theoretical, not measured.

A quick re-run on main (Spark 4.1, 1M rows, fixedWidthOnly)

Spark (ColumnarToRowExec)                     68 ms
Comet JVM (CometColumnarToRowExec)            65 ms
Comet Native (CometNativeColumnarToRowExec)   64 ms

All within ~5%, roughly consistent with the numbers in PR #3221.

A JFR pass at allocations

Running the same benchmark with -XX:StartFlightRecording=settings=profile, I expected to see per-row allocations from unsafeRow.copy() in NativeColumnarToRowConverter.scala:142. Out of 2,809 byte[] allocation samples, zero appeared in any C2R operator's stack. All came from the parquet scan and broadcast/serializer paths shared across implementations.

The most likely explanation is that under the .noop() sink the benchmark uses, HotSpot escape analysis elides Native's per-row .copy() (and any equivalent allocations on the JVM path). If that is right, this benchmark setup cannot validate the GC pressure claim either way, since the rows never escape the iteration.

Questions at address

  1. Is there a measurement of native C2R's GC benefit that I am overlooking? A pointer to a concrete workload with JFR or jdk.GarbageCollection evidence would help me understand the current value.

  2. Would adding a retain-mode benchmark variant to CometColumnarToRowBenchmark (e.g., collect, hash-agg, broadcast build, anything that defeats EA) be worth it, so the GC claim becomes testable?

  3. Structurally, NativeRowIterator.next calls unsafeRow.copy() per row, which allocates byte[sizeInBytes] + UnsafeRow per row. The JVM path allocates the same shape of garbage when the consumer retains. Is there a retention pattern where the two allocation profiles differ in a way that matters?

  4. If throughput parity and unmeasured GC turn out to be confirmed, would it be worth discussing whether the ~3,000 LOC of Rust + JNI + glue is paying for itself, and whether the operator should stay enabled by default? Not proposing removal, just asking what the cost/benefit looks like.

I was a reviewer on PR #3221, so this is partly a question I am putting back to myself: I do not think I challenged the GC framing at the time, and I want to make sure it actually holds up. Apologies if any of this has already been discussed somewhere I missed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    priority:lowMinor issues, test failures, tooling, cosmetic

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions