[docs] vlm addition #45271

Open
stevhliu wants to merge 3 commits into huggingface:main from stevhliu:new-vlm

stevhliu (Member) commented Apr 6, 2026

Adds a separate VLM contribution doc for more visibility, instead of leaving it hidden in the Contribute to Transformers doc. Integration tests are covered in #45152.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu stevhliu requested a review from zucchini-nlp April 6, 2026 19:07
Comment on lines +29 to +30
- local: new_vlm
  title: Add a vision language model
Member

I remember another PR of yours that reorders these sections. I guess the VLM addition should be merged after that?

Member Author

Yeah, once #45130 is in we can merge this next :)

Comment on lines +24 to +27
- [torchvision](https://docs.pytorch.org/vision/stable/index.html) backend is the default and supports GPU acceleration.
- [PIL](https://pillow.readthedocs.io/en/stable/index.html) backend is a fallback when no GPU is available.

Both classes share the same preprocessing logic but have different backends. Their constructor signatures and default values must be identical. [`AutoImageProcessor.from_pretrained()`] selects the backend at load time and falls back to PIL when torchvision isn't available. Mismatched signatures cause the same saved config to behave differently across environments.
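The identical-signature requirement above can be checked mechanically. The following is a minimal sketch, not something the PR or Transformers provides: the three `Toy...` classes and the `signature_mismatches` helper are hypothetical stand-ins invented for illustration, and only the standard-library `inspect` module is used.

```python
import inspect


class ToyImageProcessorTorchvision:
    """Hypothetical torchvision-backed processor (illustration only)."""

    def __init__(self, do_resize=True, size=224, do_normalize=True):
        self.do_resize = do_resize
        self.size = size
        self.do_normalize = do_normalize


class ToyImageProcessorPil:
    """Hypothetical PIL-backed processor with a matching constructor."""

    def __init__(self, do_resize=True, size=224, do_normalize=True):
        self.do_resize = do_resize
        self.size = size
        self.do_normalize = do_normalize


class ToyImageProcessorPilDrifted:
    """Hypothetical PIL-backed processor whose default for `size` has drifted."""

    def __init__(self, do_resize=True, size=256, do_normalize=True):
        self.do_resize = do_resize
        self.size = size
        self.do_normalize = do_normalize


def signature_mismatches(cls_a, cls_b):
    """Describe differences between the __init__ signatures of two classes."""
    params_a = inspect.signature(cls_a.__init__).parameters
    params_b = inspect.signature(cls_b.__init__).parameters
    problems = []
    # Parameter names must match, in the same order.
    if list(params_a) != list(params_b):
        problems.append(
            f"parameter names differ: {list(params_a)} vs {list(params_b)}"
        )
    # For parameters present in both, the defaults must match too.
    for name in sorted(set(params_a) & set(params_b)):
        default_a = params_a[name].default
        default_b = params_b[name].default
        if default_a != default_b:
            problems.append(
                f"default for {name!r} differs: {default_a!r} vs {default_b!r}"
            )
    return problems
```

With these toy classes, comparing the torchvision stand-in against the matching PIL stand-in reports no problems, while comparing it against the drifted one reports the `size` default. A check along these lines could sit in a unit test, so a saved config cannot silently resolve to different defaults depending on which backend is installed.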
Member

Tbh this is not VLM-specific, so we can probably name it differently, like "adding vision components" or "vision-processing components".

Then we can add another doc for audio/video components if needed.

Member Author

good idea, i like the flexibility!

@stevhliu stevhliu mentioned this pull request Apr 8, 2026
3 tasks