
Add vision processing components

Adding a vision language model (VLM) requires two image processor classes on top of the standard modular approach.

Note

For the modeling and config steps, follow the modular guide first.

  • torchvision backend is the default and supports GPU acceleration.
  • PIL backend is a fallback when no GPU is available.

Both classes share the same preprocessing logic but have different backends. Their constructor signatures and default values must be identical. [AutoImageProcessor.from_pretrained()] selects the backend at load time and falls back to PIL when torchvision isn't available. Mismatched signatures cause the same saved config to behave differently across environments.
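The load-time choice can be pictured as a plain availability check. A minimal sketch using only the standard library; `select_backend` is a hypothetical helper, not the actual [AutoImageProcessor] internals:

```python
import importlib.util

def select_backend() -> str:
    """Mimic the load-time backend choice: prefer torchvision
    when it is importable, otherwise fall back to PIL."""
    if importlib.util.find_spec("torchvision") is not None:
        return "torchvision"
    return "pil"
```

Because both classes accept the same kwargs with the same defaults, whichever branch this check takes, the saved config produces identical preprocessing.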

torchvision

Create image_processing_<model_name>.py with a class that inherits from [TorchvisionBackend]. Define a kwargs class first if your processor needs custom parameters beyond the standard [ImagesKwargs].

from ...image_processing_backends import TorchvisionBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import ImagesKwargs, Unpack
from ...utils import auto_docstring

class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
    tile_size: int  # any model-specific kwargs

@auto_docstring
class MyModelImageProcessor(TorchvisionBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
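The kwargs class is a `TypedDict` with `total=False`, so every declared key is optional, and `Unpack[MyModelImageProcessorKwargs]` in the constructor signature exposes those keys to type checkers. A self-contained sketch of the same pattern, with `BaseKwargs` standing in for [ImagesKwargs] and a hypothetical `preprocess` helper:

```python
from typing import TypedDict

class BaseKwargs(TypedDict, total=False):  # stands in for ImagesKwargs
    do_resize: bool
    size: dict

class MyKwargs(BaseKwargs, total=False):
    tile_size: int  # model-specific extra key

def preprocess(**kwargs) -> dict:
    # total=False means callers may pass any subset of the declared keys;
    # in the real signature, **kwargs: Unpack[MyKwargs] surfaces them to
    # type checkers while the runtime behavior stays plain **kwargs.
    defaults: MyKwargs = {"do_resize": True, "tile_size": 14}
    defaults.update(kwargs)
    return dict(defaults)
```

Inheriting from the base kwargs class is what lets the processor accept both the standard keys and the model-specific ones through a single signature.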

PIL

Create image_processing_pil_<model_name>.py with a class that inherits from [PilBackend]. Import the kwargs class from the torchvision file, but don't redefine it. Sharing the same class keeps both backends' kwargs in sync. For processors with no custom parameters, use [ImagesKwargs] directly.

from ...image_processing_backends import PilBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import Unpack
from ...utils import auto_docstring
from .image_processing_<model_name> import MyModelImageProcessorKwargs

@auto_docstring
class MyModelImageProcessorPil(PilBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
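Since a saved config must behave identically under either class, a quick parity check over the shared class attributes can catch drift between the two files. A sketch with toy stand-ins for the two processors; `defaults_match` and the `FIELDS` tuple are illustrative, not part of the library:

```python
FIELDS = ("resample", "image_mean", "image_std", "size",
          "do_resize", "do_rescale", "do_normalize", "do_convert_rgb")

def defaults_match(a: type, b: type) -> bool:
    """True when both classes declare identical preprocessing defaults."""
    return all(getattr(a, f, None) == getattr(b, f, None) for f in FIELDS)

class TorchProc:  # toy stand-in for MyModelImageProcessor
    size = {"shortest_edge": 224}
    do_resize = True

class PilProc:  # toy stand-in for MyModelImageProcessorPil
    size = {"shortest_edge": 224}
    do_resize = True

class DriftedProc:  # a mismatch the check should flag
    size = {"shortest_edge": 256}
    do_resize = True
```

A check like this fits naturally in the model's test file, pairing each torchvision processor with its PIL counterpart.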

Tip

See [CLIPImageProcessor]/[CLIPImageProcessorPil] and [LlavaOnevisionImageProcessor]/[LlavaOnevisionImageProcessorPil] for reference.

Next steps