[SPARK-38101][CORE] Retry INTERNAL_ERROR_BROADCAST when fetching map statuses #55390

Open
EnricoMi wants to merge 1 commit into apache:master from G-Research-Forks:retry-internal-error-broadcast

EnricoMi commented Apr 17, 2026

What changes were proposed in this pull request?

Retry immediately when fetching the broadcast variable fails with INTERNAL_ERROR_BROADCAST. Each retry makes the driver create a new broadcast variable, so the fetch should eventually succeed.
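
The retry idea can be sketched in a few lines. This is a toy illustration and not the code in this patch: `BroadcastFetchError`, `fetch_with_retry`, `flaky_fetch`, and the `max_attempts` limit are all hypothetical names and assumptions, since the description does not state how many retries are attempted.

```python
class BroadcastFetchError(Exception):
    """Hypothetical stand-in for a failure tagged INTERNAL_ERROR_BROADCAST."""


def fetch_with_retry(fetch, max_attempts=3):
    """Call fetch() until it succeeds, retrying immediately on BroadcastFetchError.

    max_attempts is an assumption of this sketch, not a value from the PR.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch()
        except BroadcastFetchError as err:
            last_error = err  # no sleep: the retry is immediate
    raise last_error


# Toy fetch that fails once, as if the first broadcast was invalidated mid-fetch.
attempts = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) == 1:
        raise BroadcastFetchError("Failed to get broadcast_3_piece0 of broadcast_3")
    return {"shuffle": 1, "statuses": ["mapStatus0", "mapStatus1"]}


result = fetch_with_retry(flaky_fetch)
```

In this sketch only the broadcast-specific failure is retried; any other exception propagates on the first attempt.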

Fixes #54723. Supersedes #54987.

Why are the changes needed?

Executors may fail when fetching map statuses while those map statuses are being modified:

Unable to deserialize broadcasted map statuses for shuffle 1: java.io.IOException: org.apache.spark.SparkException:
[INTERNAL_ERROR_BROADCAST] Failed to get broadcast_3_piece0 of broadcast_3 SQLSTATE: XX000

The issue occurs when:

  1. an executor fetches map statuses via getStatuses (to learn where to read shuffle data from)
  2. those map statuses are too large, so the driver wraps them in a broadcast variable
  3. the executor deserializes the broadcast variable (which works) and then fetches its value (the map statuses) from the driver
  4. in the meantime, an updateMapOutput occurs on the driver, which invalidates the cached broadcast variable and deletes its value
  5. the executor can no longer fetch the broadcast variable value and throws an exception
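
The sequence above can be reproduced with a small single-threaded model. This is a sketch under stated assumptions: `ToyDriver` and its methods are hypothetical and only mimic the behaviour described in the steps; they are not Spark's classes or APIs.

```python
class ToyDriver:
    """Hypothetical model of the driver side described in the steps above."""

    def __init__(self):
        self.epoch = 0
        self.broadcast_values = {}  # broadcast id -> map statuses payload
        self.cached_id = None

    def get_statuses(self):
        # Steps 1-2: hand out a handle to a cached broadcast of the map statuses.
        if self.cached_id is None:
            self.cached_id = f"broadcast_{self.epoch}"
            self.broadcast_values[self.cached_id] = f"statuses@epoch{self.epoch}"
        return self.cached_id

    def update_map_output(self):
        # Step 4: invalidate the cached broadcast and delete its value.
        if self.cached_id is not None:
            del self.broadcast_values[self.cached_id]
            self.cached_id = None
        self.epoch += 1

    def fetch_value(self, broadcast_id):
        # Steps 3 and 5: fetching the value fails once it has been deleted.
        if broadcast_id not in self.broadcast_values:
            raise LookupError(f"Failed to get {broadcast_id}")
        return self.broadcast_values[broadcast_id]


driver = ToyDriver()
handle = driver.get_statuses()  # executor obtains the broadcast handle
driver.update_map_output()      # update lands before the value is fetched
try:
    driver.fetch_value(handle)
    outcome = "fetched"
except LookupError:
    outcome = "failed"

# Asking again yields a fresh broadcast whose value is still present.
fresh = driver.fetch_value(driver.get_statuses())
```

The stale handle fails while a second request succeeds, which is why an immediate retry is enough in this model.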

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

Development

Successfully merging this pull request may close these issues.

[SPARK-38101] execuors fail fetching map statuses with INTERNAL_ERROR_BROADCAST
