
docs: adopt Arrow-native nomenclature across user and contributor guides #4428

Draft
andygrove wants to merge 4 commits into apache:main from andygrove:docs-nomenclature-4419

Which issue does this PR close?

Part of #4419 (the documentation-only phase).

Rationale for this change

Comet's documentation conflated several distinct ideas under the word "native": the implementation language (Rust vs JVM), pipeline membership (handled by Comet vs falls back to Spark), and the data format (Arrow columnar vs Spark rows). The same kind of ambiguity appears in the shuffle naming, where both implementations are columnar and both use Arrow IPC but only one operator name says "Columnar." Issue #4419 lays out a clearer vocabulary so the docs stop overloading "native" and can scale to a roadmap where some JVM code paths (today Scala UDF codegen; soon Arrow UDFs and hybrid impls) also live inside the Comet pipeline.

What changes are included in this PR?

Documentation prose only. No code changes, no operator renames, no plan-stability golden updates, and no new style-guide page (that comes later). The vocabulary applied here matches the rules in #4419:

  • Arrow-native is now the term for the data-format property that unifies the pipeline (operators, expressions, shuffle, and broadcast all consume and produce Arrow batches).
  • Comet pipeline replaces "the native Comet path" / "on the native Comet path" / "accelerated by Comet" for membership.
  • Rust-implemented / native Rust / Rust code is used for the implementation-language axis.
  • Compound forms that fix their meaning are kept: native shuffle, native scan (paired with CometBatchScan), and Arrow-native.
  • Bare "native execution" / "runs natively" / "the native path" as vague adjectives are removed and replaced with the specific axis they referred to.

The biggest single rewrite is in docs/source/user-guide/latest/understanding-comet-plans.md, where the "three kinds of nodes" framing becomes four:

| Category | Example |
| --- | --- |
| Arrow-native Rust operators | CometProject, CometHashAggregate, CometSort |
| Arrow-native JVM expressions | Scala UDF codegen (today); Arrow UDFs and hybrid impls (future) |
| Arrow-native JVM plumbing | CometUnion, CometCoalesce, CometBroadcastExchange |
| Spark fallback | Project, HashAggregate, plain Exchange |
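
As a rough illustration of how the table reads when you scan a plan, here is a minimal Python sketch that looks up the example operator names listed above. The `EXAMPLES` mapping and the `categorize` helper are hypothetical and not part of Comet. They cover only the operators named in the table. The "Arrow-native JVM expressions" row is left out because it describes expressions rather than plan-node names.

```python
from typing import Optional

# Hypothetical lookup built only from the example operators in the table.
# The category labels come from this PR; the helper itself is illustrative.
EXAMPLES = {
    "Arrow-native Rust operators": {"CometProject", "CometHashAggregate", "CometSort"},
    "Arrow-native JVM plumbing": {"CometUnion", "CometCoalesce", "CometBroadcastExchange"},
    "Spark fallback": {"Project", "HashAggregate", "Exchange"},
}


def categorize(node_name: str) -> Optional[str]:
    """Return the table category for a known example node, else None."""
    for category, names in EXAMPLES.items():
        if node_name in names:
            return category
    return None
```

An operator that the table does not list returns `None` rather than a guess, since the full operator inventory lives in the docs and not in this description.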

Other targeted files: docs/source/index.md (value prop refreshed to lead with the Arrow-native framing), docs/source/user-guide/latest/scala_java_udfs.md (the "native Comet path" wording the issue specifically calls out), docs/source/contributor-guide/plugin_overview.md, docs/source/contributor-guide/native_shuffle.md, docs/source/contributor-guide/jvm_shuffle.md, and docs/source/about/gluten_comparison.md (now notes the JVM-on-Arrow path as a Comet differentiator). A sweep then cleans up the remaining bare-"native-execution" cases across the contributor guide and user guide.

Operator renames (CometExchange -> CometNativeShuffleExchange, etc.) are explicitly out of scope here. They land in a follow-on PR with deprecation aliases and plan-stability goldens, as proposed in #4419's migration plan.

How are these changes tested?

Documentation only. Verified locally that sphinx-build produces no new warnings, that the rewritten plan-node section of understanding-comet-plans.md renders with its new subsection headings (Arrow-Native Rust Operators / Arrow-Native JVM Plumbing / Arrow-Native JVM Expressions / Shuffle Operators / Columnar/Row Transitions), and that a final grep across docs/source/ for banned phrases (runs natively, native execution, the native path, etc.) returns zero hits outside the historical changelog files.
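
For anyone who wants to repeat the banned-phrase check, the following is a minimal Python sketch of the same idea as the grep. It is not the command that was actually run. The phrase list contains only the three phrases named above, and the `find_banned` helper and the `changelog` directory name used for the exclusion are assumptions for illustration.

```python
from pathlib import Path

# Only the phrases named in the PR description; the real list may be longer.
BANNED = ("runs natively", "native execution", "the native path")


def find_banned(root, phrases=BANNED, skip_dirs=("changelog",)):
    """Return (relative path, line number, phrase) for each banned-phrase hit.

    skip_dirs is an assumed way of excluding historical changelog files.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        rel = path.relative_to(root)
        if any(part in skip_dirs for part in rel.parts):
            continue  # leave historical files untouched
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), 1):
            lowered = line.lower()  # case-insensitive substring match
            for phrase in phrases:
                if phrase in lowered:
                    hits.append((str(rel), lineno, phrase))
    return hits
```

Calling `find_banned("docs/source")` from the repository root should return an empty list if the sweep is complete, under the assumptions stated above.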

andygrove added 3 commits May 25, 2026 11:11
Comet's documentation conflated several distinct ideas under the word
'native': implementation language (Rust vs JVM), pipeline membership
(handled by Comet vs falls back to Spark), and data format (Arrow
columnar vs Spark rows). The same problem appears in the shuffle
naming, where both implementations are columnar and both use Arrow IPC
but only one operator name says 'Columnar'.

This change updates the documentation to use the vocabulary spelled
out in apache#4419:

- 'Arrow-native' for the data format property that unifies the
  pipeline.
- 'Comet pipeline' for membership, replacing 'native Comet path' / 'on
  the native Comet path' / 'accelerated by Comet'.
- 'Rust-implemented' / 'native Rust' / 'Rust code' for the
  implementation language axis.
- Compound forms 'native shuffle', 'native scan' (paired with
  CometBatchScan), and 'Arrow-native' stay; bare 'native execution',
  'runs natively', 'the native path' as vague adjectives are
  replaced.
- The 'three kinds of nodes' framing in understanding-comet-plans.md
  becomes four: Arrow-native Rust operators, Arrow-native JVM
  expressions, Arrow-native JVM plumbing, and Spark fallback.

This is the documentation-only phase. Operator renames
(CometExchange -> CometNativeShuffleExchange and friends) are tracked
separately and will land with deprecation aliases and plan-stability
golden updates.

Part of apache#4419
andygrove marked this pull request as draft May 25, 2026 17:29
…ing to README

Several spots in the previous nomenclature pass replaced 'native execution' /
'native side' / 'native block' with 'Rust-side ...' even where the original
compound was already clear from surrounding context (e.g. the very next
sentence said 'calls into the native engine', or the section was already
titled 'Native -> JVM Data Flow'). Revert those to the original wording.
The genuinely improved replacements (Comet pipeline framing, the
four-category plan-node taxonomy, the gluten_comparison value prop, the
shuffle prose) stay.

Also pull the Arrow-native framing into the top-level README so the value
prop matches docs/source/index.md.