Finetuning Multimodal Embedding Models with Sentence Transformers: A Walkthrough

I’ve been using Sentence Transformers for years, and the library just keeps getting better. Last time, Tom Aarsen showed us how to use the new multimodal capabilities—embedding and reranker models that handle text, images, audio, and video. Now he’s back with something more practical: how to train or finetune these models on your own data.

Let’s be real: general-purpose models are fine for demos, but they rarely shine on specific tasks. The example here is Visual Document Retrieval (VDR)—given a query like “What was the company’s Q3 revenue?”, you need to find the right document screenshot from thousands. That’s a very different skill from matching product photos to descriptions. General models try to do everything, so they end up mediocre at any one thing.

Why Finetune? Because General Models Are Jacks of All Trades

The base model here is Qwen/Qwen3-VL-Embedding-2B. It’s trained on diverse data—image-text matching, visual QA, document understanding, you name it. But when I finetuned it on VDR data, the NDCG@10 jumped from 0.888 to 0.947. That’s not just a minor bump; it beat every recent multimodal model I tested, including ones up to 4x larger. Size isn’t everything if the model doesn’t know your domain.

The finetuned model is now on Hugging Face as tomaarsen/Qwen3-VL-Embedding-2B-vdr. It’s a solid example of what targeted finetuning can do.

The Training Pipeline: Same Components, Different Data

If you’ve trained text-only Sentence Transformer models before, you’ll feel right at home. The pipeline uses the same SentenceTransformerTrainer. The only real difference is that your dataset now includes images (or audio, or video) alongside text, and the model’s processor handles preprocessing automatically.

Here’s what you need:

Model: Either finetune an existing multimodal embedding model or start from a Vision-Language Model (VLM) checkpoint.
Dataset: Your data, formatted properly.
Loss Function: The objective that guides optimization.
Training Arguments: Optional, but they affect training performance and help with tracking and debugging.
Evaluator: Optional, for checking progress.
Trainer: Ties everything together.

Let’s walk through each one, using VDR as the running example.

Model: Start Smart, Not from Scratch

The easiest path is to finetune an existing multimodal embedding model. You pass processor_kwargs and model_kwargs to control preprocessing and model loading. For example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)

Notice the max_pixels parameter. Higher resolution means better quality but more memory. You’ll want to tune this based on your hardware and the nature of your images. Document screenshots need more detail than product thumbnails.
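To get a feel for what a pixel budget means, here is a rough sketch of how a page might be downscaled to fit max_pixels while keeping its aspect ratio. The helper name fit_to_pixel_budget and the 1700 x 2200 page size are my own illustration, not part of the library, and the real processor also snaps dimensions to its patch size, so treat the numbers as approximate.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels):
    """Shrink (width, height) so that width * height <= max_pixels.

    Simplified illustration: keeps the aspect ratio and ignores the
    patch-size rounding a real image processor would also apply.
    """
    if width * height <= max_pixels:
        return width, height  # already within budget, leave untouched
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# Hypothetical letter-size page scanned at 200 DPI.
page_size = fit_to_pixel_budget(1700, 2200, max_pixels=600 * 600)
thumbnail_size = fit_to_pixel_budget(300, 300, max_pixels=600 * 600)
print(page_size, thumbnail_size)
```

A full page loses roughly two thirds of its linear resolution under this budget, which is why small print in document screenshots is the first thing to suffer when max_pixels is set too low.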

You can also start from a fresh VLM checkpoint that hasn’t been trained for embeddings yet. Sentence Transformers will try to detect the architecture, infer supported modalities from the processor, and set up the right forward method and pooling. If automatic detection fails, you can edit the saved sentence_bert_config.json to adjust modality settings.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-2B")

Either way, the Transformer module inspects the processor to determine which modalities are available, and Pooling is added automatically if needed. You can check:

print(model.modalities)
print(model.supports("image"))

There’s also an alternative approach using a Router to compose separate encoders for different modalities instead of a single VLM backbone. I haven’t tried that yet, but it’s worth exploring if you need more flexibility.

Dataset: Visual Document Retrieval Format

For VDR, your dataset pairs text queries with document screenshots. The format is straightforward: each example has a query (text) and a positive document (image). You can also include hard negatives—documents that look similar but aren’t the right answer—to make training more effective.

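The dataset should be in a format that Sentence Transformers can iterate over, typically a Hugging Face Dataset. Each example needs a text column for the query and an image column for the positive document (a PIL Image, or a path that the dataset can decode into one). Note that the trainer matches columns to loss inputs by order, not by name, so “query” and “positive” are just readable labels. What matters is that the query column comes first, followed by the positive and then any hard negatives.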
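As a sketch of that shape, here are two rows with a query, a positive page, and one hard negative. The file paths, queries, and helper names (check_rows, to_hf_dataset) are hypothetical. The conversion helper assumes the datasets library is installed and imports it lazily, so the rest of the snippet runs without it.

```python
# Hypothetical queries and screenshot paths, purely for illustration.
rows = [
    {
        "query": "What was the company's Q3 revenue?",
        "positive": "pages/annual_report_p12.png",
        "negative": "pages/annual_report_p13.png",
    },
    {
        "query": "How many employees were hired in 2023?",
        "positive": "pages/hr_summary_p04.png",
        "negative": "pages/hr_summary_p05.png",
    },
]


def check_rows(rows, columns=("query", "positive", "negative")):
    """Check that every row has the expected columns in the expected order."""
    for index, row in enumerate(rows):
        if tuple(row.keys()) != columns:
            raise ValueError(f"row {index} has columns {tuple(row.keys())}")
    return len(rows)


def to_hf_dataset(rows):
    """Wrap the rows in a Hugging Face Dataset and decode paths as images."""
    from datasets import Dataset, Image  # lazy import: only needed here

    dataset = Dataset.from_list(rows)
    for column in ("positive", "negative"):
        dataset = dataset.cast_column(column, Image())
    return dataset


print(check_rows(rows), "rows look consistent")
```

Calling to_hf_dataset(rows) only makes sense once the image files actually exist on disk, since the images are decoded when rows are read.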

Loss Function: CachedMultipleNegativesRankingLoss and MatryoshkaLoss

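The loss function is where the magic happens. For multimodal retrieval, CachedMultipleNegativesRankingLoss is the go-to. It’s a variant of the standard MultipleNegativesRankingLoss, which treats every other document in the batch as a negative for a given query, and it is much more memory-efficient. Instead of pushing the whole batch through the model at once, it embeds the batch in small mini-batches, computes the loss over the full set of cached embeddings, and then backpropagates one mini-batch at a time (the GradCache technique). You get the benefit of a large batch of in-batch negatives without holding every activation in memory, which is critical when you’re dealing with images.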
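To make the objective concrete, here is a minimal pure-Python sketch of the in-batch negatives idea behind MultipleNegativesRankingLoss: for each query, the matching document is the "correct class" and every other document in the batch is a negative. This is not the library's implementation. The function names are mine, and the scale of 20 is an assumption about a typical default. The cached variant optimizes the same quantity and only changes how the embeddings and gradients are computed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where document i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.2f}")
```

The loss is near zero when each query sits next to its own document and large when it sits next to someone else's, which is exactly the pressure that pulls queries toward the right screenshot.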

You’ll also want MatryoshkaLoss if you’re training a Matryoshka embedding model—one that can produce embeddings of varying dimensions. This lets you trade off between accuracy and speed at inference time. For example, you might use 256-dimensional embeddings for quick retrieval and 1024-dimensional for final ranking.
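At inference time, using a smaller Matryoshka dimension comes down to keeping the first k values of the embedding and renormalizing. The sketch below shows that step in plain Python. The helper name and the toy vector are my own, and a real model would produce the full embedding for you.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]


# Toy 4-dimensional "embedding" standing in for a much larger one.
full_embedding = [3.0, 4.0, 12.0, 0.5]
small_embedding = truncate_and_normalize(full_embedding, dim=2)
print(small_embedding)
```

As far as I know, you rarely need to do this by hand: SentenceTransformer accepts a truncate_dim argument that applies the truncation when encoding.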

Training Arguments: Tune for Your Hardware

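Training arguments control batch size, learning rate, number of epochs, and so on. For multimodal models, batch size is often the bottleneck because images take up more memory than text. Start with a smaller batch size and increase it until you hit your GPU limit. With CachedMultipleNegativesRankingLoss, peak memory is governed by the loss’s mini_batch_size rather than the overall batch size, so you can keep the batch large (more in-batch negatives) and lower mini_batch_size if you run out of memory. Flash Attention 2 helps here too: it reduces memory usage and speeds up training.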

Other arguments worth tweaking:

warmup_steps: Typically 10% of total steps.
learning_rate: Start with 2e-5 and adjust based on validation loss.
eval_strategy: Set to “steps” so you can monitor NDCG@10 during training. (Older transformers releases called this evaluation_strategy.)
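Here is how those settings might look written down, including the arithmetic for a 10% warmup. The dataset size, batch size, and output directory are placeholders I made up. The dictionary is meant to be unpacked into SentenceTransformerTrainingArguments, which I have left out so the snippet runs without the library installed.

```python
import math

# Hypothetical numbers: adjust to your dataset and hardware.
num_examples = 10_000
batch_size = 32
num_epochs = 1

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.1 * total_steps)  # roughly 10% of training

training_kwargs = {
    "output_dir": "models/vdr-finetune",  # placeholder path
    "num_train_epochs": num_epochs,
    "per_device_train_batch_size": batch_size,
    "learning_rate": 2e-5,
    "warmup_steps": warmup_steps,
    "bf16": True,  # assumes a GPU with bfloat16 support
    # Named evaluation_strategy in older transformers releases.
    "eval_strategy": "steps",
    "eval_steps": 50,
    "logging_steps": 10,
}

# Intended use: SentenceTransformerTrainingArguments(**training_kwargs)
print(total_steps, warmup_steps)
```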

Evaluator: Track Real Progress

The evaluator runs during training to check if the model is actually improving. For VDR, you want NDCG@10 or Recall@k. Sentence Transformers has built-in evaluators like InformationRetrievalEvaluator that work with multimodal data. You give it three things: the queries, the corpus of documents, and a mapping from each query ID to the IDs of its relevant documents. It then reports retrieval metrics such as NDCG@10 each time evaluation runs.
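If NDCG@10 is new to you, this small sketch shows what the number measures for a single query with binary relevance: a relevant document ranked first scores 1.0, and the score decays logarithmically the further down it appears. The function names are mine, and this is a textbook version of the metric rather than the evaluator's actual code.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """NDCG@k for one query, with binary relevance labels."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg(ideal_gains, k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical document IDs; "d1" is the only relevant page.
first_place = ndcg_at_k(["d1", "d2", "d3"], {"d1"})
second_place = ndcg_at_k(["d3", "d1", "d2"], {"d1"})
missing = ndcg_at_k(["d2", "d3", "d4"], {"d1"})
print(first_place, round(second_place, 4), missing)
```

The reported NDCG@10 is the average of this per-query value over the whole evaluation set.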

Trainer: The Glue

The SentenceTransformerTrainer brings everything together. It’s basically the same as the text-only trainer, but it handles multimodal data automatically. You pass the model, dataset, loss function, training arguments, and evaluator, and it handles the rest.
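Put together, the wiring might look roughly like the function below. Treat it as a sketch under assumptions: the Matryoshka dimensions, mini_batch_size, batch size, and output directory are values I picked for illustration, and the heavy imports live inside the function so that merely defining it doesn't require a GPU or a model download.

```python
def build_trainer(train_dataset, evaluator=None, output_dir="models/vdr-finetune"):
    """Assemble model, loss, arguments, and trainer for VDR finetuning."""
    # Imported lazily because these pull in the full training stack.
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        MatryoshkaLoss,
    )

    model = SentenceTransformer(
        "Qwen/Qwen3-VL-Embedding-2B",
        model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
        processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
    )

    # Small mini-batches keep memory in check; the value is an assumption.
    base_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)
    # Illustrative dimensions; they must not exceed the model's output size.
    loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[1024, 512, 256])

    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=64,
        learning_rate=2e-5,
        bf16=True,
    )

    return SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=loss,
        evaluator=evaluator,
    )
```

With a prepared dataset, you would call build_trainer(train_dataset, evaluator) and then trainer.train() on the result.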

Results: What You Get

The finetuned model achieved an NDCG@10 of 0.947, up from 0.888 for the base model. That’s a 6.6% relative improvement (0.059 in absolute terms), which is huge for retrieval tasks where the baseline is already strong. It also outperformed models up to 4x its size. The Matryoshka dimensions showed that even at 256 dimensions, the finetuned model matched the base model’s performance at 1024 dimensions. That’s the power of domain-specific finetuning.
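For anyone checking the improvement figure, the arithmetic from the two scores above is short:

```python
base_ndcg = 0.888
finetuned_ndcg = 0.947

absolute_gain = finetuned_ndcg - base_ndcg
relative_gain = absolute_gain / base_ndcg

print(f"absolute: {absolute_gain:.3f}, relative: {relative_gain:.1%}")
```

The 6.6% is relative to the base score. Another way to read the same numbers is that the gap to a perfect score shrinks from 0.112 to 0.053, so roughly half of the remaining errors disappear.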

Training Multimodal Reranker Models

Reranker models work differently from embedding models. Instead of producing a vector for each document, they take a query-document pair and output a relevance score. This is more expensive but often more accurate for the final ranking step.
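The usual way to combine the two is a retrieve-then-rerank pipeline: a cheap embedding score narrows the corpus down to a handful of candidates, and the more expensive pairwise score reorders just those. The sketch below uses toy word-overlap scorers and short strings as stand-ins for real models and document screenshots. Every name in it is hypothetical.

```python
def shared_words(query, doc):
    """Toy stand-in for an embedding similarity: count shared words."""
    return len(set(query.split()) & set(doc.split()))


def toy_pair_score(query, doc):
    """Toy stand-in for a reranker: overlap plus a bonus for tables."""
    return shared_words(query, doc) + (0.5 if doc.endswith("table") else 0.0)


def retrieve_then_rerank(query, corpus, embed_score, pair_score, top_k=3):
    """Stage 1 keeps the top_k candidates; stage 2 reorders only those."""
    candidates = sorted(corpus, key=lambda doc: embed_score(query, doc), reverse=True)[:top_k]
    return sorted(candidates, key=lambda doc: pair_score(query, doc), reverse=True)


corpus = ["q3 revenue table", "q3 hiring plan", "office lunch menu", "q2 revenue table"]
ranking = retrieve_then_rerank("q3 revenue", corpus, shared_words, toy_pair_score)
print(ranking)
```

The pairwise scorer is called only top_k times per query instead of once per document in the corpus, which is what keeps the more expensive model affordable.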

Sentence Transformers supports multimodal rerankers too. The training process is similar, but you work with the CrossEncoder classes (CrossEncoderTrainer instead of SentenceTransformerTrainer) and use a reranker loss such as BinaryCrossEntropyLoss rather than an embedding loss, and the model outputs a score instead of an embedding. The dataset format changes as well: each example has a query, a document, and a label (relevance score).

I haven’t finetuned a multimodal reranker myself yet, but the principles are the same. Start with a pretrained multimodal model, add a classification head, and train on paired data.

Additional Resources

If you want to dive deeper, check out:

The original blog post on multimodal models in Sentence Transformers
Training examples in the Sentence Transformers repository
The official documentation for SentenceTransformerTrainer

For text-only training, there are plenty of earlier blog posts and examples. The multimodal stuff is newer, but the library handles it well.

Final Takeaways

Finetuning multimodal embedding models is more accessible than ever with Sentence Transformers. The pipeline is essentially the same as text-only training, just with images in your dataset. The results speak for themselves: a 2B parameter model finetuned on domain data can beat 8B parameter general models.

If you’re working on a retrieval task that involves images, documents, or any multimodal data, give it a shot. The hardest part is curating the dataset—the training itself is straightforward.