
Fix SmolVLM video processor resize using wrong interpolation after backend refactor#45258

Merged
ydshieh merged 2 commits into main from fix_smolvlm
Apr 6, 2026

Conversation

@ydshieh
Collaborator

@ydshieh ydshieh commented Apr 6, 2026

Summary

PR #43514 refactored _preprocess to pass resample=resample to resize, but the resize method in SmolVLMVideoProcessor still had interpolation as its parameter name. The resample kwarg was silently swallowed by **kwargs, causing interpolation to always default to BILINEAR instead of the intended LANCZOS→BICUBIC path.

Without this fix, SmolVLMForConditionalGenerationIntegrationTest::test_integration_test_video produces inputs["pixel_values"] that differ by ~0.36 before vs. after #43514, which in turn changes the model output values.
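For context, a drift like the reported ~0.36 is typically measured as the largest element-wise difference between the two pixel_values tensors. A minimal sketch (the values below are made up for illustration; the real test compares full processor outputs):

```python
# Illustrative only: quantifying a drift between two runs of the processor.
def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

before_fix = [0.10, 0.52, -0.33]  # hypothetical pixel values
after_fix = [0.12, 0.16, -0.30]   # hypothetical pixel values
print(round(max_abs_diff(before_fix, after_fix), 2))  # prints 0.36
```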

Fix

  • Rename the interpolation parameter to resample in SmolVLMVideoProcessor.resize
  • Convert PIL resample integers to torchvision InterpolationMode via pil_torch_interpolation_mapping, matching the pattern used in TorchvisionBackend.resize in image_processing_backends.py
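The two bullets above can be sketched as follows. This is a standalone simplification: the `InterpolationMode` enum and `pil_torch_interpolation_mapping` dict here are stand-ins for torchvision's enum and the real mapping in transformers; the actual implementation lives in `SmolVLMVideoProcessor.resize`.

```python
from enum import Enum

class InterpolationMode(Enum):  # stand-in for tvF.InterpolationMode
    NEAREST = "nearest"
    BILINEAR = "bilinear"
    BICUBIC = "bicubic"
    LANCZOS = "lanczos"

# PIL's integer resampling codes: NEAREST=0, LANCZOS=1, BILINEAR=2, BICUBIC=3
pil_torch_interpolation_mapping = {
    0: InterpolationMode.NEAREST,
    1: InterpolationMode.LANCZOS,
    2: InterpolationMode.BILINEAR,
    3: InterpolationMode.BICUBIC,
}

def resize(image, size, resample=None, **kwargs):
    # Parameter is now named `resample`, so the caller's resample= kwarg
    # no longer falls into **kwargs.
    if resample is None:
        interpolation = InterpolationMode.BILINEAR
    elif isinstance(resample, int):
        interpolation = pil_torch_interpolation_mapping[resample]
    else:
        interpolation = resample
    if interpolation is InterpolationMode.LANCZOS:
        # torchvision cannot apply LANCZOS to torch.Tensor inputs
        interpolation = InterpolationMode.BICUBIC
    return interpolation  # the real method would call tvF.resize(...) here
```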

Before #43514 (correct path)

resize(interpolation=InterpolationMode.LANCZOS) → interpolation == tvF.InterpolationMode.LANCZOS → BICUBIC fallback

After #43514 without this fix (broken path)

resize(resample=1) → resample swallowed by **kwargs, interpolation=None → defaults to BILINEAR

After #43514 with this fix (correct path restored)

resize(resample=1) → pil_torch_interpolation_mapping[1] → InterpolationMode.LANCZOS → BICUBIC fallback
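The broken path above boils down to Python's **kwargs silently absorbing a misnamed keyword. A toy reproduction (function name and return values are illustrative):

```python
def resize_old(image, interpolation=None, **kwargs):
    # Old signature: there is no `resample` parameter, so a resample=
    # keyword lands in **kwargs and `interpolation` stays None.
    return interpolation if interpolation is not None else "bilinear (default)"

# The refactored caller passes resample=1 (PIL's LANCZOS code)...
result = resize_old("img", resample=1)
print(result)  # prints: bilinear (default) -- the kwarg was silently swallowed
```

Because **kwargs accepts any keyword without error, no exception or warning is raised, which is why the regression only showed up as changed pixel values.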

…age processor backend refactor

The PR #43514 refactored _preprocess to pass resample=resample to resize,
but resize still accepted interpolation as its parameter. The resample kwarg
was silently swallowed by **kwargs, causing interpolation to default to BILINEAR
instead of the intended LANCZOS->BICUBIC path, producing ~0.36 difference in pixel_values.

Fix by renaming the parameter to resample and converting PIL resample integers to
torchvision InterpolationMode via pil_torch_interpolation_mapping, matching the
pattern used in TorchvisionBackend.resize.
@github-actions
Contributor

github-actions bot commented Apr 6, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: smolvlm

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ydshieh
Collaborator Author

ydshieh commented Apr 6, 2026

run-slow: smolvlm

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/smolvlm"]
quantizations: []

@ydshieh ydshieh changed the title to Fix SmolVLM video processor resize using wrong interpolation after backend refactor on Apr 6, 2026
(None, None): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video depicts a large language model architecture, specifically a language model with a "quick brown" feature',
("cuda", (8, 0)): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video showcases a large language model architecture, specifically a "Quick Brown" model, which is designed',
("cuda", (8, 6)): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video showcases a large language model, specifically a neural network model, which is designed to learn and',
("cuda", (8, 6)): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video depicts a large language model architecture, specifically a language model with a "quick brown" feature',
Collaborator Author

This value should have been updated a long time ago. PR #43514 further changed the actual outputs, but with the fix in this PR the actual output goes back to the value that had remained stable for several months, which is the new value I provide here.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN a1cab4b5 workflow commit (merge commit)
PR 908e786e branch commit (from PR)
main 374d44d5 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45258&sha=908e78

Member

@yonigozlan yonigozlan left a comment

Thanks for catching this @ydshieh ! I'll check if this is present in any other places. We can simplify this a bit as explained below.

Comment on lines +151 to 164

if resample is not None:
    if isinstance(resample, (PILImageResampling, int)):
        interpolation = pil_torch_interpolation_mapping[resample]
    else:
        interpolation = resample
else:
    interpolation = tvF.InterpolationMode.BILINEAR
if interpolation == tvF.InterpolationMode.LANCZOS:
    logger.warning_once(
        "You have used fast image processor with LANCZOS resample which is not yet supported for torch.Tensor. "
        "BICUBIC resample will be used as an alternative. Please fall back to the image processor if you "
        "want full consistency with the original model."
    )
    interpolation = tvF.InterpolationMode.BICUBIC
Member

As long as we use super().resize with the resample arg where tvF.resize is used below, we shouldn't need this logic (it's already in TorchvisionBackend)
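A sketch of the delegation being suggested here. Both class bodies are hypothetical simplifications (strings stand in for torchvision's InterpolationMode); the real backend logic lives in image_processing_backends.py.

```python
class TorchvisionBackend:
    def resize(self, image, size, resample=None, **kwargs):
        # In the real backend, PIL integer codes are converted via
        # pil_torch_interpolation_mapping and the LANCZOS->BICUBIC
        # fallback is applied before calling tvF.resize.
        interpolation = {1: "lanczos", 2: "bilinear", 3: "bicubic"}.get(resample, "bilinear")
        if interpolation == "lanczos":
            interpolation = "bicubic"  # LANCZOS unsupported for torch.Tensor
        return interpolation

class SmolVLMVideoProcessor(TorchvisionBackend):
    def resize(self, image, size, resample=None, **kwargs):
        # Model-specific sizing logic would go here; interpolation
        # handling is delegated to the shared backend.
        return super().resize(image, size, resample=resample, **kwargs)

print(SmolVLMVideoProcessor().resize(None, (224, 224), resample=1))  # prints: bicubic
```

The design benefit is that the mapping and fallback logic exist in exactly one place, so a future rename or refactor cannot silently diverge per model.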

Collaborator Author

OK, thanks! That doesn't seem to be the case for now. I can add a comment like "TODO (yoni): try to use super().resize".

I will wait a bit to see if you are ok with such a comment, then merge.

Member

I'll just make the modifications in this PR if that's ok

Collaborator Author

I really prefer to have a clean CI that everyone can trust and build on. Leaving tests failing for longer means that if other PRs introduce new breakages (of different kinds) on the already-failing tests, we won't detect them. The accumulation makes debugging and fixing more difficult and time-consuming.

(You can see several examples like #45268 or #45252)

(And I'd prefer to let you find the time to work on this; you're much better at it than I am. I'm not asking you to do it today or even this week.)

Collaborator Author

Going to merge! Hope my arguments above convince you 🙏

Member

No problem! I just opened a PR for this if you could approve :) #45272

@ydshieh ydshieh merged commit 182f20c into main Apr 6, 2026
23 of 25 checks passed
@ydshieh ydshieh deleted the fix_smolvlm branch April 6, 2026 19:41
louzongzhi pushed a commit to louzongzhi/transformers that referenced this pull request Apr 7, 2026
…r backend refactor (huggingface#45258)

* Fix SmolVLM video processor resize using wrong interpolation after image processor backend refactor

The PR huggingface#43514 refactored _preprocess to pass resample=resample to resize,
but resize still accepted interpolation as its parameter. The resample kwarg
was silently swallowed by **kwargs, causing interpolation to default to BILINEAR
instead of the intended LANCZOS->BICUBIC path, producing ~0.36 difference in pixel_values.

Fix by renaming the parameter to resample and converting PIL resample integers to
torchvision InterpolationMode via pil_torch_interpolation_mapping, matching the
pattern used in TorchvisionBackend.resize.

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>