I’ve been using Sentence Transformers for years now—it’s one of those libraries that just works without getting in your way. The v5.4 update adds something I’ve been waiting for: proper multimodal support. You can now encode and compare texts, images, audio, and videos using the same API you already know. No separate pipelines, no hacky workarounds.
What’s Actually New?
Traditional embedding models turn text into vectors. That’s fine for most search and RAG setups, but the world isn’t just text. Multimodal embedding models map inputs from different modalities into a shared embedding space. So you can compare a text query against image documents, find video clips matching a description, or build RAG pipelines that work across modalities.
Similarly, reranker (Cross Encoder) models now handle mixed-modality pairs. You can score the relevance of a text query against an image, or a combined text-image document against another. This opens up visual document retrieval, cross-modal search, and multimodal RAG pipelines without duct-taping separate models together.
Getting Started
Installation is straightforward, but you need to pick the extra dependencies for the modalities you plan to use:
pip install -U "sentence-transformers[image]"
pip install -U "sentence-transformers"
pip install -U "sentence-transformers[image,video,train]"
One thing to watch out for: VLM-based models like Qwen3-VL-2B need roughly 8 GB of VRAM, and the 8B variants need around 20 GB. If you don’t have a local GPU, cloud GPU services or Google Colab are your best bet. On CPU, these models are painfully slow, so stick to text-only or CLIP models for CPU inference.
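To get a feel for where numbers like these come from, here is a back-of-the-envelope sketch that estimates the memory taken by the weights alone. The parameter counts are nominal values read off the "2B" and "8B" model names, not measured figures, and the helper name weight_memory_gb is my own.

```python
# Rough estimate of weight memory only. Activations, image/video tokens,
# and framework overhead are NOT included.
# Parameter counts are nominal (taken from the "2B"/"8B" names), not measured.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("2B", 2e9), ("8B", 8e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in float16")
```

In half precision the weights alone come to about 4 GB and 16 GB. The rest of the quoted VRAM budget presumably goes to activations, visual tokens, and runtime overhead, which grow with image resolution and batch size.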
Multimodal Embedding Models
Loading a multimodal embedding model feels exactly like loading a text-only model. That’s the point:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
The model auto-detects which modalities it supports. No extra configuration. If you need to control image resolution or model precision, you can pass kwargs, but for most cases, it just works.
Encoding Images
model.encode() now accepts images alongside text. Images can be URLs, local file paths, or PIL Image objects:
img_embeddings = model.encode([
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)
Cross-Modal Similarity
Since the model maps both text and images into the same embedding space, you can compute similarities between them directly:
text_embeddings = model.encode([
"A green car parked in front of a yellow building",
"A red car driving on a highway",
"A bee on a pink flower",
"A wasp on a wooden table",
])
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
The results make sense: “A green car parked in front of a yellow building” matches the car image at 0.51, and “A bee on a pink flower” matches the bee image at 0.67. The hard negatives get lower scores as expected.
You’ll notice those scores aren’t close to 1.0. That’s the modality gap—embeddings from different modalities tend to cluster in separate regions. Cross-modal similarities are typically lower than within-modal ones (text-to-text), but the relative ordering is preserved, so retrieval still works fine.
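If you want to see why low absolute scores don't hurt retrieval, here is a small NumPy sketch. The 3-d vectors are made up for illustration and are not real model output: the third component plays the role of a modality offset that pushes text and image embeddings apart, while the first two carry the "content".

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


# Toy embeddings (hypothetical values). Texts sit at +0.5 on the last axis,
# images at -0.5, mimicking a modality gap.
text_vecs = np.array([
    [1.0, 0.2, 0.5],   # stands in for a car caption
    [0.1, 1.0, 0.5],   # stands in for a bee caption
])
image_vecs = np.array([
    [0.9, 0.1, -0.5],  # stands in for the car image
    [0.2, 0.9, -0.5],  # stands in for the bee image
])

sims = cosine_similarity_matrix(text_vecs, image_vecs)
best_match = sims.argmax(axis=1)  # best image index for each text

print(sims.round(2))
print(best_match)
```

The matching pairs score well below 1.0 because of the offset, yet the argmax for each caption still lands on the right image. That is the sense in which the relative ordering survives the gap.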
Encoding Queries and Documents
For retrieval tasks, use encode_query() and encode_document(). Many retrieval models prepend different instruction prompts depending on whether the input is a query or a document. These methods handle that automatically:
query_embedding = model.encode_query("A green car")
doc_embeddings = model.encode_document([
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
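To make the prompt idea concrete, here is a toy sketch of what "prepending a role-specific instruction" means for text inputs. The prompt strings and the apply_prompt helper are hypothetical; they are not taken from any particular model or from the library. Real models ship their own prompts, and encode_query() and encode_document() apply them for you.

```python
# Hypothetical prompts for illustration only; real models define their own.
PROMPTS = {
    "query": "Instruct: Retrieve content relevant to the query.\nQuery: ",
    "document": "",  # many setups leave documents without an instruction
}


def apply_prompt(text: str, role: str) -> str:
    """Prepend the instruction for the given role ('query' or 'document')."""
    return PROMPTS[role] + text


print(repr(apply_prompt("A green car", "query")))
print(repr(apply_prompt("A green car", "document")))
```

The same string ends up as two different model inputs depending on its role, which is why mixing up the two methods can quietly degrade retrieval quality.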
Multimodal Reranker Models
Reranker models score the relevance of pairs. With multimodal rerankers, you can score pairs where one or both elements are images, combined text-image documents, or other modalities.
from sentence_transformers import CrossEncoder
model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
scores = model.predict([
("A green car", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"),
("A bee", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"),
])
print(scores)
This is useful for re-ranking initial retrieval results. You can first retrieve candidates using a lightweight embedding model, then re-rank them with a more expensive but more accurate reranker.
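Here is a minimal sketch of that two-stage flow. To keep it runnable without downloading anything, the embeddings are toy vectors and the reranker is a stand-in lookup table (toy_rerank and TOY_SCORES are hypothetical). In a real pipeline, the vectors would come from an embedding model and a reranker's predict method would fill the rerank_fn role.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def retrieve_then_rerank(
    query: str,
    query_vec: np.ndarray,
    docs: List[str],
    doc_vecs: np.ndarray,
    rerank_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap dot-product scoring over every document.
    first_stage = doc_vecs @ query_vec
    candidates = np.argsort(-first_stage)[:top_k]

    # Stage 2: the expensive scorer only sees the top_k candidates.
    pairs = [(query, docs[i]) for i in candidates]
    scores = rerank_fn(pairs)

    ranked = sorted(zip((d for _, d in pairs), scores), key=lambda x: x[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked]


# Toy data (hypothetical): three image "documents" and made-up reranker scores.
docs = ["car.jpg", "bee.jpg", "truck.jpg"]
doc_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]])
query_vec = np.array([1.0, 0.0])
TOY_SCORES = {"car.jpg": 0.2, "bee.jpg": 0.0, "truck.jpg": 0.9}


def toy_rerank(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in for a real reranker: looks up a fixed score per document."""
    return [TOY_SCORES[doc] for _, doc in pairs]


results = retrieve_then_rerank("A delivery truck", query_vec, docs, doc_vecs, toy_rerank)
print(results)
```

The first stage puts car.jpg ahead of truck.jpg, and the second stage flips them. The expensive scorer never sees bee.jpg at all, which is where the cost savings come from.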
Input Formats and Configuration
The library supports a wide range of input types: text strings, image URLs, local file paths, PIL Image objects, audio files, video files, and numpy arrays. The model automatically detects the modality based on the input type.
You can check which modalities a model supports:
print(model.supported_modalities)
If you need to control image resolution or model precision, pass kwargs to the model constructor. This example loads the model in half precision through model_kwargs:
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", trust_remote_code=True, model_kwargs={"torch_dtype": "float16"})
Supported Models
At launch, the main supported models are from the Qwen3-VL family, but more are on the way. The integration pull requests for some models are still pending, so you might need to specify a revision argument for now. Once merged, you can load them without that.
My Take
This is a solid update. Sentence Transformers has always been about making embedding models accessible without the usual friction. Adding multimodal support in the same API is exactly the right move.
That said, the modality gap is worth keeping in mind. Cross-modal similarity scores are lower than what you’re used to with text-only models. The relative ordering is preserved, so retrieval still works, but don’t expect cosine similarities of 0.9+ between a text and an image. You’ll need to adjust your thresholds.
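One simple way to pick a new threshold, sketched below, is to score a small set of known matching and non-matching pairs and take the midpoint between the two averages. The scores here are made-up illustrative values, not measurements, and calibrate_threshold is a hypothetical helper rather than anything the library provides.

```python
import numpy as np


def calibrate_threshold(pos_scores, neg_scores) -> float:
    """Midpoint between the mean matching and mean non-matching score."""
    return float((np.mean(pos_scores) + np.mean(neg_scores)) / 2)


# Hypothetical cross-modal scores from a small labeled validation set.
matching = [0.50, 0.60, 0.70]
non_matching = [0.10, 0.20, 0.30]

threshold = calibrate_threshold(matching, non_matching)
print(f"threshold = {threshold:.2f}")

# A text-only habit like "anything under 0.8 is irrelevant" would reject
# every matching pair in this toy set.
print(all(score < 0.8 for score in matching))
```

With enough labeled pairs you could swap the midpoint for something more careful, such as the cutoff that maximizes F1, but the idea stays the same: calibrate on cross-modal scores instead of reusing text-to-text cutoffs.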
Also, the GPU requirements for VLM-based models are no joke. If you’re running on a laptop without a dedicated GPU, stick to CLIP-based models for now. The Qwen3-VL models are powerful, but they’re not lightweight.
For multimodal RAG, this is a step forward. Being able to retrieve images, audio, and video alongside text without separate pipelines is a big quality-of-life improvement. I’m looking forward to seeing what the community builds with this.
If you want to train your own multimodal models, check out the companion blog post on training and finetuning. But for most use cases, the pre-trained models should be more than enough.