Fix SmolVLM video processor resize using wrong interpolation after backend refactor#45258
Fix SmolVLM video processor resize using wrong interpolation after backend refactor#45258
SmolVLM video processor resize using wrong interpolation after backend refactor#45258Conversation
…age processor backend refactor The PR #43514 refactored _preprocess to pass resample=resample to resize, but resize still accepted interpolation as its parameter. The resample kwarg was silently swallowed by **kwargs, causing interpolation to default to BILINEAR instead of the intended LANCZOS->BICUBIC path, producing ~0.36 difference in pixel_values. Fix by renaming the parameter to resample and converting PIL resample integers to torchvision InterpolationMode via pil_torch_interpolation_mapping, matching the pattern used in TorchvisionBackend.resize.
|
[For maintainers] Suggested jobs to run (before merge) run-slow: smolvlm |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
run-slow: smolvlm |
|
This comment contains models: ["models/smolvlm"] |
SmolVLM video processor resize using wrong interpolation after backend refactor
| (None, None): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video depicts a large language model architecture, specifically a language model with a "quick brown" feature', | ||
| ("cuda", (8, 0)): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video showcases a large language model architecture, specifically a "Quick Brown" model, which is designed', | ||
| ("cuda", (8, 6)): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video showcases a large language model, specifically a neural network model, which is designed to learn and', | ||
| ("cuda", (8, 6)): 'User: You are provided the following series of nine frames from a 0:00:09 [H:MM:SS] video.\n\nFrame from 00:00:\nFrame from 00:01:\nFrame from 00:02:\nFrame from 00:03:\nFrame from 00:04:\nFrame from 00:05:\nFrame from 00:06:\nFrame from 00:08:\nFrame from 00:09:\n\nDescribe this video in detail\nAssistant: The video depicts a large language model architecture, specifically a language model with a "quick brown" feature', |
There was a problem hiding this comment.
This value should have been updated long time ago. The PR #43514 further changed the actual outputs, but with the fix of this PR, it brings the actual output back to the one that remain the same for several months, which is the new value I provide here.
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45258&sha=908e78 |
yonigozlan
left a comment
There was a problem hiding this comment.
Thanks for catching this @ydshieh ! I'll check if this is present in any other places. We can simplify this a bit as explained below.
| if resample is not None: | ||
| if isinstance(resample, (PILImageResampling, int)): | ||
| interpolation = pil_torch_interpolation_mapping[resample] | ||
| else: | ||
| interpolation = resample | ||
| else: | ||
| interpolation = tvF.InterpolationMode.BILINEAR | ||
| if interpolation == tvF.InterpolationMode.LANCZOS: | ||
| logger.warning_once( | ||
| "You have used fast image processor with LANCZOS resample which not yet supported for torch.Tensor. " | ||
| "BICUBIC resample will be used as an alternative. Please fall back to image processor if you " | ||
| "want full consistency with the original model." | ||
| ) | ||
| interpolation = tvF.InterpolationMode.BICUBIC |
There was a problem hiding this comment.
As long as we use super().resize with resample arg where tvf.resize is used below, we shouldn't need this logic (it's already in TorchvisionBackend)
There was a problem hiding this comment.
OK, thanks! It seems not the case for now. I can add a comment like "TODO (yoni): try to use super().resize".
I will wait a bit to see if you are ok with such a comment, then merge.
There was a problem hiding this comment.
I'll just make the modifications in this PR if that's ok
There was a problem hiding this comment.
I really prefer to have a clean CI that everyone could trust the most and work on top of it. Having failing failing tests more time means if there are other PRs introducing new breaks (of different types) on the already failing tests, we won't detect it. Accumulation of such make the debug and the fix more difficult and time consuming.
(You can see several examples like #45268 or #45252)
(And I prefer to let you to find the time to work on this - you are much better than me. And I don't want to ask you to do it like today or this week)
There was a problem hiding this comment.
Going to merge! Hope my arguments above convince you 🙏
There was a problem hiding this comment.
No problem! I just opened a PR for this if you could approve :) #45272
…r backend refactor (huggingface#45258) * Fix SmolVLM video processor resize using wrong interpolation after image processor backend refactor The PR huggingface#43514 refactored _preprocess to pass resample=resample to resize, but resize still accepted interpolation as its parameter. The resample kwarg was silently swallowed by **kwargs, causing interpolation to default to BILINEAR instead of the intended LANCZOS->BICUBIC path, producing ~0.36 difference in pixel_values. Fix by renaming the parameter to resample and converting PIL resample integers to torchvision InterpolationMode via pil_torch_interpolation_mapping, matching the pattern used in TorchvisionBackend.resize. * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Summary
PR #43514 refactored
_preprocessto passresample=resampletoresize, but theresizemethod inSmolVLMVideoProcessorstill hadinterpolationas its parameter name. Theresamplekwarg was silently swallowed by**kwargs, causinginterpolationto always default toBILINEARinstead of the intendedLANCZOS→BICUBICpath.Without this fix,
SmolVLMForConditionalGenerationIntegrationTest::test_integration_test_videohasinputs["pixel_values"]with a larger difference (~0.36) before and after #43514, which also changes the model output values.Fix
interpolationparameter toresampleinSmolVLMVideoProcessor.resizeInterpolationModeviapil_torch_interpolation_mapping, matching the pattern used inTorchvisionBackend.resizeinimage_processing_backends.pyBefore #43514 (correct path)
resize(interpolation=InterpolationMode.LANCZOS)→interpolation == tvF.InterpolationMode.LANCZOS→ BICUBIC fallbackAfter #43514 without this fix (broken path)
resize(resample=1)→resampleswallowed by**kwargs,interpolation=None→ defaults to BILINEARAfter #43514 with this fix (correct path restored)
resize(resample=1)→pil_torch_interpolation_mapping[1]→InterpolationMode.LANCZOS→ BICUBIC fallback