
Prevent activation_checkpoint_cpu_offload from silently no-op'ing on transformers >=5.0 #373

Open
xylian86 wants to merge 1 commit into snowflakedb:main from xylian86:xlian/fix-cpu-offload-reentrant

Conversation

@xylian86

What

When activation_checkpoint_cpu_offload is enabled, force HF's gradient_checkpointing_enable to use use_reentrant=True.
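
A minimal sketch of the intended behavior; the `config.activation_checkpoint_cpu_offload` attribute access and the call site are illustrative, while `gradient_checkpointing_kwargs` is the real transformers API for forwarding kwargs to `torch.utils.checkpoint`:

```python
# Sketch only: config attribute and call site are illustrative,
# not the exact ArcticTraining code.
if config.activation_checkpoint_cpu_offload:
    # The CPU-offload monkey patch only intercepts the reentrant path, so
    # pin HF checkpointing to it instead of relying on the transformers
    # default (which flips to use_reentrant=False in v5.0).
    model.gradient_checkpointing_enable(
        gradient_checkpointing_kwargs={"use_reentrant": True}
    )
else:
    model.gradient_checkpointing_enable()
```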

Why

The CPU-offload activation checkpointing in arctic_training.monkey_patches replaces torch.utils.checkpoint.CheckpointFunction, which is only used by the reentrant code path of torch.utils.checkpoint.checkpoint. The non-reentrant path goes through _checkpoint_without_reentrant_generator + saved_tensors_hooks and never touches CheckpointFunction, so the monkey patch becomes a silent no-op there.
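
A self-contained way to see this; the tracing shim below is hypothetical, but it uses the same module-level `CheckpointFunction` substitution the monkey patch relies on:

```python
import torch
import torch.utils.checkpoint as cp

hits = {"CheckpointFunction": 0}
_Orig = cp.CheckpointFunction

class _TracingShim:
    """Hypothetical stand-in for the monkey patch: counts dispatches."""

    @staticmethod
    def apply(run_function, preserve_rng_state, *args):
        hits["CheckpointFunction"] += 1
        return _Orig.apply(run_function, preserve_rng_state, *args)

cp.CheckpointFunction = _TracingShim  # same substitution style as the patch

x = torch.randn(4, 4, requires_grad=True)
cp.checkpoint(torch.sin, x, use_reentrant=True).sum().backward()
cp.checkpoint(torch.sin, x, use_reentrant=False).sum().backward()
print(hits["CheckpointFunction"])  # 1: only the reentrant path dispatched
cp.CheckpointFunction = _Orig  # restore the original
```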

In current transformers (4.x), model.gradient_checkpointing_enable() with no kwargs defaults to use_reentrant=True, so this happens to work. Starting with transformers v5.0.0 (huggingface/transformers#43203, merged Jan 2026), the default flips to use_reentrant=False. Once we relax the transformers<5.0.0 upper pin, users setting activation_checkpoint_cpu_offload: true would get no offloading at all: long-sequence runs would OOM with no error or warning to indicate why.

Note: this is a tactical fix. The proper long-term solution is to replace the CheckpointFunction monkey patch with a saved_tensors_hooks-based implementation (see the torchtune reference in monkey_patches.py), which would work regardless of use_reentrant. Out of scope here.
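
For reference, torch already ships the core mechanism as torch.autograd.graph.save_on_cpu. A sketch of how a hooks-based offload avoids the problem entirely (illustrative, not the planned implementation):

```python
import torch
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 64, device=device, requires_grad=True)

# Every tensor autograd saves for backward inside this context is packed
# to (optionally pinned) CPU memory and unpacked on demand in backward.
# There is no dependence on CheckpointFunction, so it works under
# use_reentrant=True or False alike; a production version would add
# async copy streams the way the torchtune implementation referenced
# in monkey_patches.py does.
with save_on_cpu(pin_memory=torch.cuda.is_available()):
    y = (x @ x).relu().sum()
y.backward()
```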

@xylian86 xylian86 requested a review from sfc-gh-jrasley as a code owner May 11, 2026 02:20