Adding a vision language model (VLM) requires two image processor classes on top of the standard modular approach.
> [!NOTE]
> For the modeling and config steps, follow the modular guide first.
- torchvision backend is the default and supports GPU acceleration.
- PIL backend is a fallback when no GPU is available.
Both classes share the same preprocessing logic but have different backends. Their constructor signatures and default values must be identical. [AutoImageProcessor.from_pretrained()] selects the backend at load time and falls back to PIL when torchvision isn't available. Mismatched signatures cause the same saved config to behave differently across environments.
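The fallback behavior can be sketched as follows. This is a minimal illustration of the idea, not the actual `AutoImageProcessor` implementation; the helper names here are hypothetical.

```python
# Hedged sketch of load-time backend selection: prefer the torchvision-backed
# class and fall back to the PIL-backed one when torchvision cannot be imported.
def torchvision_available():
    try:
        import torchvision  # noqa: F401
        return True
    except ImportError:
        return False


def select_backend(torchvision_cls, pil_cls):
    """Pick the processor class the way a from_pretrained() call might."""
    return torchvision_cls if torchvision_available() else pil_cls
```

Because the same saved config is fed to whichever class gets picked, the two constructors must accept the same parameters with the same defaults.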
Create `image_processing_<model_name>.py` with a class that inherits from [TorchvisionBackend]. Define a kwargs class first if your processor needs custom parameters beyond the standard [ImagesKwargs].
```python
from ...image_processing_backends import TorchvisionBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import ImagesKwargs, Unpack
from ...utils import auto_docstring


class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
    tile_size: int  # any model-specific kwargs


@auto_docstring
class MyModelImageProcessor(TorchvisionBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
```

Create `image_processing_pil_<model_name>.py` with a class that inherits from [PilBackend]. Import the kwargs class from the torchvision file, but don't redefine it. Sharing the same class keeps both backends' kwargs in sync. For processors with no custom parameters, use [ImagesKwargs] directly.
```python
from ...image_processing_backends import PilBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import Unpack
from ...utils import auto_docstring

from .image_processing_<model_name> import MyModelImageProcessorKwargs


@auto_docstring
class MyModelImageProcessorPil(PilBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
```

> [!TIP]
> See [CLIPImageProcessor]/[CLIPImageProcessorPil] and [LlavaOnevisionImageProcessor]/[LlavaOnevisionImageProcessorPil] for reference.
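Since mismatched defaults only surface when the two backends diverge at runtime, a cheap guard is to compare the class-level defaults directly. The sketch below uses hypothetical stand-in classes; in a real repo you would import `MyModelImageProcessor` and `MyModelImageProcessorPil` instead.

```python
# Hedged sketch: a quick parity check over the class-level defaults both
# backends are expected to share. The two classes are illustrative stand-ins.
SHARED_DEFAULTS = [
    "resample", "image_mean", "image_std", "size",
    "do_resize", "do_rescale", "do_normalize", "do_convert_rgb",
]


class FakeTorchvisionProcessor:
    resample = "bicubic"
    size = {"shortest_edge": 224}
    do_resize = True


class FakePilProcessor:
    resample = "bicubic"
    size = {"shortest_edge": 224}
    do_resize = True


def defaults_match(cls_a, cls_b, names):
    """True when both classes define identical values for every listed name."""
    return all(getattr(cls_a, n, None) == getattr(cls_b, n, None) for n in names)
```

Dropping a check like this into the model's test file catches a drifted default before it reaches users on a different backend.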
- Read the Auto-generating docstrings guide to auto-generate consistent docstrings with `@auto_docstring`.
- Read the Writing model tests guide to write integration tests for your model.