docs: adopt Arrow-native nomenclature across user and contributor guides #4428
Draft
andygrove wants to merge 4 commits into
Comet's documentation conflated several distinct ideas under the word 'native': implementation language (Rust vs JVM), pipeline membership (handled by Comet vs falls back to Spark), and data format (Arrow columnar vs Spark rows). The same problem appears in the shuffle naming, where both implementations are columnar and both use Arrow IPC but only one operator name says 'Columnar'.

This change updates the documentation to use the vocabulary spelled out in apache#4419:

- 'Arrow-native' for the data format property that unifies the pipeline.
- 'Comet pipeline' for membership, replacing 'native Comet path' / 'on the native Comet path' / 'accelerated by Comet'.
- 'Rust-implemented' / 'native Rust' / 'Rust code' for the implementation language axis.
- Compound forms 'native shuffle', 'native scan' (paired with CometBatchScan), and 'Arrow-native' stay; bare 'native execution', 'runs natively', 'the native path' as vague adjectives are replaced.
- The 'three kinds of nodes' framing in understanding-comet-plans.md becomes four: Arrow-native Rust operators, Arrow-native JVM expressions, Arrow-native JVM plumbing, and Spark fallback.

This is the documentation-only phase. Operator renames (CometExchange -> CometNativeShuffleExchange and friends) are tracked separately and will land with deprecation aliases and plan-stability golden updates.

Part of apache#4419
…ing to README

Several spots in the previous nomenclature pass replaced 'native execution' / 'native side' / 'native block' with 'Rust-side ...' even where the original compound was already clear from surrounding context (e.g. the very next sentence said 'calls into the native engine', or the section was already titled 'Native -> JVM Data Flow'). Revert those to the original wording. The genuinely improved replacements (Comet pipeline framing, the four-category plan-node taxonomy, the gluten_comparison value prop, the shuffle prose) stay.

Also pull the Arrow-native framing into the top-level README so the value prop matches docs/source/index.md.
Which issue does this PR close?
Part of #4419 (the documentation-only phase).
Rationale for this change
Comet's documentation conflated several distinct ideas under the word "native": the implementation language (Rust vs JVM), pipeline membership (handled by Comet vs falls back to Spark), and the data format (Arrow columnar vs Spark rows). The same kind of ambiguity appears in the shuffle naming, where both implementations are columnar and both use Arrow IPC but only one operator name says "Columnar." Issue #4419 lays out a clearer vocabulary so the docs stop overloading "native" and can scale to a roadmap where some JVM code paths (today Scala UDF codegen; soon Arrow UDFs and hybrid impls) also live inside the Comet pipeline.
What changes are included in this PR?
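Documentation prose only. No code changes, no operator renames, no plan-stability golden updates, and no new style-guide page (that comes later). The vocabulary applied here matches the rules in #4419:

- "Arrow-native" for the data format property that unifies the pipeline.
- "Comet pipeline" for membership, replacing "native Comet path" / "on the native Comet path" / "accelerated by Comet".
- "Rust-implemented" / "native Rust" / "Rust code" for the implementation language axis.
- Compound forms `native shuffle`, `native scan` (paired with `CometBatchScan`), and `Arrow-native` stay; bare "native execution", "runs natively", and "the native path" as vague adjectives are replaced.

The biggest single rewrite is in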
`docs/source/user-guide/latest/understanding-comet-plans.md`, where the "three kinds of nodes" framing becomes four:

- Arrow-native Rust operators, such as `CometProject`, `CometHashAggregate`, `CometSort`
- Arrow-native JVM expressions
- Arrow-native JVM plumbing, such as `CometUnion`, `CometCoalesce`, `CometBroadcastExchange`
- Spark fallback, such as `Project`, `HashAggregate`, plain `Exchange`

Other targeted files:

`docs/source/index.md` (value prop refreshed to lead with the Arrow-native framing), `docs/source/user-guide/latest/scala_java_udfs.md` (the "native Comet path" wording the issue specifically calls out), `docs/source/contributor-guide/plugin_overview.md`, `docs/source/contributor-guide/native_shuffle.md`, `docs/source/contributor-guide/jvm_shuffle.md`, and `docs/source/about/gluten_comparison.md` (now notes the JVM-on-Arrow path as a Comet differentiator). A sweep then cleans up the remaining bare-"native-execution" cases across the contributor guide and user guide.

Operator renames (`CometExchange` → `CometNativeShuffleExchange`, etc.) are explicitly out of scope here. They land in a follow-on PR with deprecation aliases and plan-stability goldens, as proposed in #4419's migration plan.

How are these changes tested?
Documentation only. Verified locally that `sphinx-build` produces no new warnings, that the rewritten plan-node section of `understanding-comet-plans.md` renders with the new subsection headings (Arrow-Native Rust Operators / Arrow-Native JVM Plumbing / Arrow-Native JVM Expressions / Shuffle Operators / Columnar/Row Transitions), and that a final grep across `docs/source/` for banned phrases (`runs natively`, `native execution`, `the native path`, etc.) returns zero hits outside the historical changelog files.
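
For reviewers who want to repeat the banned-phrase check, a minimal Python sketch of an equivalent scan follows. It is not the command used for this PR. The function name `find_banned_phrases`, the restriction to `.md` files, and the rule that any path containing `changelog` counts as a historical changelog file are all assumptions made for illustration, and the phrase list covers only the three phrases named above rather than the full list behind "etc."

```python
from pathlib import Path

# Only the phrases named in the PR description; the rest of the banned list
# is not reproduced here.
BANNED_PHRASES = ("runs natively", "native execution", "the native path")


def find_banned_phrases(root, phrases=BANNED_PHRASES, skip_marker="changelog"):
    """Return (relative_path, line_number, phrase) for each banned phrase under root.

    skip_marker is an assumption: any file whose path (relative to root)
    contains it is treated as a historical changelog file and skipped.
    """
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*.md")):
        relative = path.relative_to(root).as_posix()
        if skip_marker in relative.lower():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()  # case-insensitive match
            for phrase in phrases:
                if phrase in lowered:
                    hits.append((relative, line_number, phrase))
    return hits
```

Calling `find_banned_phrases("docs/source")` from the repository root and getting an empty list would correspond to the "zero hits" result described above, under the stated assumptions.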