Revisit the case for native columnar-to-row?

Reading through PR #3221, PR #3266, PR #3649, and epic #3268, I want to surface some questions about whether the rationale for `CometNativeColumnarToRowExec` still holds. I might be missing context, so this is a question rather than a proposal.

### Background

PR #3221 (landing) was upfront that there was no significant throughput win:

> Although this native implementation does not provide significant speed improvements, the main benefit is reduced GC pressure by avoiding Java allocations.

The benchmark table in that PR showed native winning 1 of 8 scenarios, losing 3, tying 4.

PR #3649 (enable by default) retracted its ~10% TPC-H claim as misattributed to native scan, leaving "no regression + GC pressure" as the standing rationale.

I could not find a GC measurement in any of the PR threads, in #3268, in the operator's SQLMetrics, or in `spark/benchmarks/`. The GC pressure benefit appears to be theoretical, not measured.

### A quick re-run on main (Spark 4.1, 1M rows, fixedWidthOnly)

```
Spark (ColumnarToRowExec)                     68 ms
Comet JVM (CometColumnarToRowExec)            65 ms
Comet Native (CometNativeColumnarToRowExec)   64 ms
```

All within ~5%, roughly consistent with the numbers in PR #3221.

### A JFR pass at allocations

Running the same benchmark with `-XX:StartFlightRecording=settings=profile`, I expected to see per-row allocations from `unsafeRow.copy()` in `NativeColumnarToRowConverter.scala:142`. Out of 2,809 `byte[]` allocation samples, zero appeared in any C2R operator's stack. All came from the parquet scan and broadcast/serializer paths shared across implementations.

The most likely explanation is that under the `.noop()` sink the benchmark uses, HotSpot escape analysis elides Native's per-row `.copy()` (and any equivalent allocations on the JVM path). If that is right, this benchmark setup cannot validate the GC pressure claim either way, since the rows never escape the iteration.

### Questions at address

1. Is there a measurement of native C2R's GC benefit that I am overlooking? A pointer to a concrete workload with JFR or `jdk.GarbageCollection` evidence would help me understand the current value.

2. Would adding a retain-mode benchmark variant to `CometColumnarToRowBenchmark` (e.g., `collect`, hash-agg, broadcast build, anything that defeats EA) be worth it, so the GC claim becomes testable?

3. Structurally, `NativeRowIterator.next` calls `unsafeRow.copy()` per row, which allocates `byte[sizeInBytes]` + `UnsafeRow` per row. The JVM path allocates the same shape of garbage when the consumer retains. Is there a retention pattern where the two allocation profiles differ in a way that matters?

4. If throughput parity and unmeasured GC turn out to be confirmed, would it be worth discussing whether the ~3,000 LOC of Rust + JNI + glue is paying for itself, and whether the operator should stay enabled by default? Not proposing removal, just asking what the cost/benefit looks like.

I was a reviewer on PR #3221, so this is partly a question I am putting back to myself: I do not think I challenged the GC framing at the time, and I want to make sure it actually holds up. Apologies if any of this has already been discussed somewhere I missed.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit the case for native columnar-to-row? #4440

Background

A quick re-run on main (Spark 4.1, 1M rows, fixedWidthOnly)

A JFR pass at allocations

Questions at address

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Revisit the case for native columnar-to-row? #4440

Description

Background

A quick re-run on main (Spark 4.1, 1M rows, fixedWidthOnly)

A JFR pass at allocations

Questions at address

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions