Prevent `activation_checkpoint_cpu_offload` from silently no-op'ing on transformers >=5.0 #373
Open
xylian86 wants to merge 1 commit into
What

When `activation_checkpoint_cpu_offload` is enabled, force HF's `gradient_checkpointing_enable` to use `use_reentrant=True`.

Why
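A minimal sketch of the guard, assuming a helper that wraps HF's `gradient_checkpointing_enable` (the function and config names below are illustrative, not the actual ArcticTraining API):

```python
# Illustrative sketch: `enable_checkpointing` and the config attribute
# access pattern are hypothetical, not ArcticTraining's real API.
def enable_checkpointing(model, config):
    kwargs = {}
    if getattr(config, "activation_checkpoint_cpu_offload", False):
        # The CPU-offload monkey patch only intercepts the reentrant
        # CheckpointFunction path, so pin use_reentrant=True explicitly
        # rather than relying on the transformers default (which flips
        # to False in v5.0.0).
        kwargs["gradient_checkpointing_kwargs"] = {"use_reentrant": True}
    model.gradient_checkpointing_enable(**kwargs)
```

The point is that the kwarg is always passed explicitly when offloading is on, so the behavior no longer depends on which `transformers` version picked the default.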
The CPU-offload activation checkpointing in `arctic_training.monkey_patches` replaces `torch.utils.checkpoint.CheckpointFunction`, which is only used by the reentrant code path of `torch.utils.checkpoint.checkpoint`. The non-reentrant path goes through `_checkpoint_without_reentrant_generator` + `saved_tensors_hooks` and never touches `CheckpointFunction`, so the monkey patch becomes a silent no-op there.

In current `transformers` (4.x), `model.gradient_checkpointing_enable()` with no kwargs defaults to `use_reentrant=True`, so this happens to work. Starting with `transformers` v5.0.0 ([huggingface/transformers#43203], merged Jan 2026), the default flips to `use_reentrant=False`. Once we bump the upper pin past `<5.0.0`, users setting `activation_checkpoint_cpu_offload: true` would get no offloading at all; long-sequence runs would OOM with no error or warning to indicate why.

Note: this is a tactical fix. The proper long-term solution is to replace the
`CheckpointFunction` monkey patch with a `saved_tensors_hooks`-based implementation (see the torchtune reference in `monkey_patches.py`), which would work regardless of `use_reentrant`. That is out of scope here.
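For reference, a minimal sketch of what a `saved_tensors_hooks`-based approach could look like (modeled loosely on torchtune's activation offloading; all names below are illustrative, not an actual implementation in this repo):

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

def pack_to_cpu(t: torch.Tensor):
    # Called when autograd saves a tensor for backward: remember its
    # device and park the data on the CPU.
    return t.device, t.to("cpu")

def unpack_from_cpu(packed):
    # Called when backward needs the tensor again: move it back.
    device, t = packed
    return t.to(device)

def cpu_offload_hooks() -> saved_tensors_hooks:
    # These hooks fire for every tensor autograd saves for backward,
    # regardless of which checkpoint code path (reentrant or not) is
    # active, so they do not depend on use_reentrant at all.
    return saved_tensors_hooks(pack_to_cpu, unpack_from_cpu)
```

Under such an approach, `activation_checkpoint_cpu_offload` would wrap the forward pass in `with cpu_offload_hooks():` instead of patching `CheckpointFunction`.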