Adding a vision language model (VLM) requires two image processor classes on top of the standard modular approach.
> [!NOTE]
> For the modeling and config steps, follow the modular guide first.
- torchvision backend is the default and supports GPU acceleration.
- PIL backend is a fallback when no GPU is available.
Both classes share the same preprocessing logic but have different backends. Their constructor signatures and default values must be identical. [AutoImageProcessor.from_pretrained()] selects the backend at load time and falls back to PIL when torchvision isn't available. Mismatched signatures cause the same saved config to behave differently across environments.
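The fallback behavior can be sketched as follows. This is a minimal illustration of the idea, not the actual `AutoImageProcessor` implementation; the helper names here are hypothetical.

```python
# Hedged sketch of load-time backend selection: prefer the torchvision-backed
# class and fall back to the PIL-backed one when torchvision cannot be imported.
def torchvision_available():
    try:
        import torchvision  # noqa: F401
        return True
    except ImportError:
        return False


def select_backend(torchvision_cls, pil_cls):
    """Pick the processor class the way a from_pretrained() call might."""
    return torchvision_cls if torchvision_available() else pil_cls
```

Because the same saved config is fed to whichever class gets picked, the two constructors must accept the same parameters with the same defaults.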
Create `image_processing_<model_name>.py` with a class that inherits from [TorchvisionBackend]. Define a kwargs class first if your processor needs custom parameters beyond the standard [ImagesKwargs].
```python
from ...image_processing_backends import TorchvisionBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import ImagesKwargs, Unpack
from ...utils import auto_docstring


class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
    tile_size: int  # any model-specific kwargs


@auto_docstring
class MyModelImageProcessor(TorchvisionBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
```

Create `image_processing_pil_<model_name>.py` with a class that inherits from [PilBackend]. Import the kwargs class from the torchvision file, but don't redefine it. Sharing the same class keeps both backends' kwargs in sync. For processors with no custom parameters, use [ImagesKwargs] directly.
```python
from ...image_processing_backends import PilBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import Unpack
from ...utils import auto_docstring

from .image_processing_<model_name> import MyModelImageProcessorKwargs


@auto_docstring
class MyModelImageProcessorPil(PilBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
```

> [!TIP]
> See [CLIPImageProcessor]/[CLIPImageProcessorPil] and [LlavaOnevisionImageProcessor]/[LlavaOnevisionImageProcessorPil] for reference.
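Since mismatched defaults only surface when the two backends diverge at runtime, a cheap guard is to compare the class-level defaults directly. The sketch below uses hypothetical stand-in classes; in a real repo you would import `MyModelImageProcessor` and `MyModelImageProcessorPil` instead.

```python
# Hedged sketch: a quick parity check over the class-level defaults both
# backends are expected to share. The two classes are illustrative stand-ins.
SHARED_DEFAULTS = [
    "resample", "image_mean", "image_std", "size",
    "do_resize", "do_rescale", "do_normalize", "do_convert_rgb",
]


class FakeTorchvisionProcessor:
    resample = "bicubic"
    size = {"shortest_edge": 224}
    do_resize = True


class FakePilProcessor:
    resample = "bicubic"
    size = {"shortest_edge": 224}
    do_resize = True


def defaults_match(cls_a, cls_b, names):
    """True when both classes define identical values for every listed name."""
    return all(getattr(cls_a, n, None) == getattr(cls_b, n, None) for n in names)
```

Dropping a check like this into the model's test file catches a drifted default before it reaches users on a different backend.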
- Read the Auto-generating docstrings guide to auto-generate consistent docstrings with `@auto_docstring`.
- Read the Writing model tests guide to write integration tests for your model.