Introduction
Vision-Language Models (VLMs) — models that understand both images and text — have gone from research curiosity to production workhorses in under two years. GPT-4V, Claude 3, Gemini, LLaVA, and Qwen-VL all use fundamentally similar architectures, and understanding how they work explains both their impressive capabilities and their surprising failure modes.
This post covers the architecture of modern VLMs, how they're trained, and the engineering considerations for deploying them at scale.
The Core Challenge: Bridging Vision and Language
Language models process tokens — discrete integers representing pieces of text. Images are continuous, high-dimensional signals. The fundamental challenge of VLMs is how to represent visual information in a form that language models can reason over.
The solution that's converged across the industry: use a pretrained visual encoder to convert images to a sequence of visual tokens, then feed those tokens into a language model alongside text tokens.
Image → [Visual Encoder] → visual feature vectors → [Projection] → visual tokens
↓
Text tokens + visual tokens → [LLM] → response
Building Block 1: The Visual Encoder
The visual encoder converts a raw image into a sequence of feature vectors. Two dominant approaches:
CLIP-Based Encoders
CLIP (Radford et al., OpenAI, 2021) is trained via contrastive learning on 400M (image, caption) pairs. The encoder learns visual representations that are semantically aligned with text — similar-concept images and text end up close in embedding space.
CLIP training objective:
Given batch of (image_i, text_i) pairs:
image_features = image_encoder(image_i)
text_features = text_encoder(text_i)
Contrastive loss: maximize cosine similarity for matching pairs,
minimize for non-matching pairs
Result: image and text representations in the same semantic space
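The objective above can be sketched in a few lines of PyTorch, a minimal version assuming the features are already L2-normalized:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of matched pairs.

    Both inputs: (batch, dim), assumed L2-normalized so the dot product
    is cosine similarity. Matching pairs sit on the diagonal.
    """
    logits = image_features @ text_features.T / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0))                   # diagonal = matches
    loss_i2t = F.cross_entropy(logits, targets)              # image -> text
    loss_t2i = F.cross_entropy(logits.T, targets)            # text -> image
    return (loss_i2t + loss_t2i) / 2
```

When matching pairs are most similar, the loss is low; shuffling the pairing raises it, which is exactly the pressure that aligns the two embedding spaces.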
Most open-source VLMs (LLaVA, InstructBLIP, Idefics) use ViT-L/14 or ViT-bigG CLIP encoders.
SigLIP Encoders
SigLIP (Zhai et al., Google, 2023) replaces CLIP's softmax contrastive loss with a sigmoid loss, enabling training without a global batch view. It scales better and produces better visual representations at high resolution. Qwen2-VL and recent PaliGemma models use SigLIP encoders.
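The sigmoid objective treats every (image, text) pairing in the batch as an independent binary classification, so no batch-wide softmax normalization is required. A rough sketch (in the real model the temperature t and bias b are learned scalars; they are fixed here for illustration):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each (i, j) pair is an independent binary
    classification with label +1 on the diagonal (matches), -1 elsewhere."""
    logits = image_features @ text_features.T * t + b
    n = logits.size(0)
    labels = 2 * torch.eye(n) - 1          # +1 diagonal, -1 off-diagonal
    # -log sigmoid(label * logit), averaged over all n*n pairs
    return -F.logsigmoid(labels * logits).mean()
```

Because each pair's loss term is independent, the loss can be computed over device-local chunks of the batch, which is what makes it easier to scale than the softmax version.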
Vision Transformer (ViT) Output Format
Both CLIP and SigLIP use Vision Transformers. A ViT divides an image into patches and processes them with transformer self-attention:
Image (224×224) → 14×14 grid of 16×16 patches → 196 patch tokens
+ 1 [CLS] token
= 197 visual tokens
Each token: a 1024-dimensional feature vector (for ViT-L)
For high-resolution inputs (e.g., 1344×1344):
→ 7056 patch tokens → expensive but necessary for fine-grained tasks
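The token counts above follow directly from the patch grid; a quick helper makes the arithmetic explicit:

```python
def vit_token_count(image_size, patch_size=16, cls_token=True):
    """Number of visual tokens a ViT produces for a square image."""
    grid = image_size // patch_size          # patches per side
    return grid * grid + (1 if cls_token else 0)

print(vit_token_count(224))                     # 14*14 + [CLS] = 197
print(vit_token_count(1344, cls_token=False))   # 84*84 = 7056 patch tokens
```

Note the quadratic growth: a 6× increase in resolution per side yields a 36× increase in patch tokens.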
Building Block 2: The Projection Layer
Visual encoder features are not in the same space as text token embeddings. A projection layer bridges this gap:
import torch.nn as nn

class VisionProjection(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA uses a 2-layer MLP projector
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # Input: (N_patches, vision_dim); output: (N_patches, llm_dim), ready for the LLM
        return self.proj(vision_features)
Different architectures use different projectors:
- Linear projection (BLIP-2): simplest, trained with a frozen LLM
- MLP projector (LLaVA): 2-layer MLP, works well with fine-tuned LLMs
- Q-Former (InstructBLIP): transformer that compresses visual features to a fixed number of query tokens
- Resampler / Perceiver (Flamingo, Idefics): cross-attention to compress variable-length visual features
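The resampler idea can be sketched as a single cross-attention layer (illustrative only, not Flamingo's exact module, which stacks several such layers with MLPs): a fixed set of learned queries attends to however many visual features the encoder produced, yielding a constant-length output.

```python
import torch
import torch.nn as nn

class SimpleResampler(nn.Module):
    """Compress a variable-length sequence of visual features into a
    fixed number of tokens via cross-attention."""
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_features):  # (batch, n_patches, dim)
        q = self.queries.unsqueeze(0).expand(vision_features.size(0), -1, -1)
        out, _ = self.attn(q, vision_features, vision_features)
        return out  # (batch, num_queries, dim), regardless of n_patches
```

The payoff is a fixed visual token budget: whether the encoder emits 196 or 7,056 patch features, the LLM always sees the same number of visual tokens.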
Building Block 3: The Language Model Backbone
The LLM backbone processes the concatenated sequence of visual tokens and text tokens:
Full input sequence to LLM:
[System prompt tokens] [Visual tokens (196)] [User text tokens] → LLM → [Response tokens]
Since visual tokens occupy 196+ positions in the context, they consume a significant chunk of the context window. High-resolution images (which generate thousands of patches) can overwhelm the LLM's context window, which is why modern VLMs use dynamic resolution strategies and tile-based compression.
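Concretely, the projected visual tokens are spliced into the text embedding sequence before the LLM sees anything. A shape-level sketch with made-up sequence lengths:

```python
import torch

llm_dim = 4096
system_emb = torch.randn(1, 12, llm_dim)   # embedded system prompt tokens
visual_tok = torch.randn(1, 196, llm_dim)  # projected visual tokens
user_emb = torch.randn(1, 20, llm_dim)     # embedded user question

# The LLM consumes one flat sequence; visual tokens are ordinary positions
llm_input = torch.cat([system_emb, visual_tok, user_emb], dim=1)
print(llm_input.shape)  # torch.Size([1, 228, 4096])
```

From the LLM's perspective there is nothing special about the visual positions; they are just embeddings it learned to interpret during training.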
LLaVA: The Influential Open-Source Architecture
LLaVA (Liu et al., 2023) demonstrated that a surprisingly simple VLM architecture — CLIP encoder + MLP projector + fine-tuned LLM — could achieve strong multimodal capabilities.
LLaVA Training: Two Stages
Stage 1 — Alignment Pre-training
Freeze both the visual encoder and the LLM backbone. Train only the projection layer on 595K image-caption pairs from CC3M. Goal: learn to project visual features into the LLM's embedding space.
Frozen: ViT-L/14 | MLP [trainable] | Frozen: LLaMA-7B
Data: (image, caption) pairs
Objective: predict the caption given the image
Duration: 1 epoch, a few hours on 8 A100s
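In PyTorch terms, stage 1 amounts to turning off gradients for everything except the projector. A sketch (the function and argument names are placeholders, not LLaVA's actual attribute names):

```python
import torch.nn as nn

def freeze_for_stage1(vision_encoder: nn.Module,
                      projector: nn.Module,
                      llm: nn.Module):
    """Stage 1: only the projection layer receives gradients."""
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in llm.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True   # the only trainable piece

# Stage 2 would additionally re-enable the LLM:
# for p in llm.parameters(): p.requires_grad = True
```

This is why stage 1 is cheap: the optimizer state and gradients cover only the few million projector parameters, not the 7B-parameter backbone.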
Stage 2 — Instruction Fine-Tuning
Unfreeze the LLM backbone. Train on 665K visual instruction-following examples generated by GPT-4 (descriptions, question-answering, complex reasoning over images).
Frozen: ViT-L/14 | MLP [trainable] | LLM [trainable]
Data: visual instruction following pairs
Objective: follow multimodal instructions
Duration: 1 epoch on 8 A100s, about a full day
The GPT-4-generated instruction data is key: it's diverse, high-quality, and teaches the model to describe images, answer questions, and reason visually. LLaVA-1.5 improved this with higher resolution (336×336 vs 224×224) and a better MLP projector.
High-Resolution Understanding: The Modern Challenge
Early VLMs processed images at 224×224 — barely enough to read text in an image or identify fine-grained details. Modern applications need much higher resolution.
Tile-Based High Resolution (LLaVA-NeXT, InternVL)
Divide the high-res image into tiles, encode each tile with the ViT, then concatenate tile features:
1344×1344 image → 6×6 grid of 224×224 tiles
Each tile → 196 visual tokens
Total: 6×6×196 = 7,056 visual tokens + 1 global low-res thumbnail
Trade-off: extremely detailed visual understanding, but very long sequences
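The tile math above, as a small bookkeeping helper (assuming the image is resized so each side is a multiple of the tile size, and that the global thumbnail is itself one 224×224 tile):

```python
def tile_token_count(width, height, tile=224, tokens_per_tile=196,
                     global_thumbnail=True):
    """Visual tokens for a tiled high-resolution image."""
    cols = width // tile
    rows = height // tile
    tokens = cols * rows * tokens_per_tile
    if global_thumbnail:
        tokens += tokens_per_tile   # one extra low-res overview tile
    return tokens

print(tile_token_count(1344, 1344))   # 6*6 tiles + thumbnail = 7252 tokens
```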
Dynamic Resolution (Qwen2-VL)
Adapt the number of tiles based on image aspect ratio and content complexity. A simple diagram uses fewer tiles than a dense spreadsheet. This keeps context length manageable while allocating resolution budget where it's needed.
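One illustrative way to pick a tile grid under a budget (a simplification for this post, not any model's actual policy, which would also weigh native resolution and token limits):

```python
def choose_tile_grid(width, height, max_tiles=12):
    """Pick the tile grid that best matches the image's aspect ratio
    while staying within a total tile budget. Illustrative heuristic."""
    best = (1, 1)
    best_err = float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            # distance between the grid's aspect ratio and the image's
            err = abs(cols / rows - width / height)
            # prefer closer aspect ratio; break ties toward more tiles
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best

print(choose_tile_grid(1792, 896))   # wide 2:1 image -> (4, 2) grid
```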
How GPT-4o's Vision Works
GPT-4o uses a native multimodal architecture — rather than encoding images separately and projecting into the LLM, the architecture was designed from the ground up to process interleaved image and text tokens end-to-end.
The specific details are not public, but the key properties are:
- Images are tokenized into discrete visual tokens (similar to how text is tokenized)
- The model is pretrained on interleaved image-text data, not fine-tuned from a text-only LLM
- This enables tighter integration between visual and textual reasoning
This architectural choice explains GPT-4o's stronger performance on tasks requiring close visual-textual reasoning compared to models that use the projection approach.
VLM Failure Modes
Hallucination: VLMs frequently describe objects that aren't in the image, especially when the image contains familiar scenes. The LLM backbone "fills in" expected objects from its training distribution.
Spatial reasoning failures: Answering "what's to the left of X?" is surprisingly hard. The ViT's token representation doesn't directly encode spatial relationships; the LLM must infer them from patch positions.
Fine-grained OCR: While high-resolution VLMs can read text in images, they struggle with handwriting, complex layouts, and heavily stylized fonts.
Counting: VLMs consistently fail at counting objects beyond ~5. The patch-based representation makes counting hard to learn reliably.
Production Deployment Considerations
Context length management: At high resolution, visual tokens can consume 7K+ context positions. Monitor and cap image resolution based on your context window budget.
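Assuming 196 tokens per 224×224 tile, one image's share of the context window is easy to estimate, and worth wiring into your request path:

```python
def image_context_fraction(width, height, context_window=8192,
                           tile=224, tokens_per_tile=196):
    """Fraction of the LLM context window consumed by one tiled image."""
    tiles = (width // tile) * (height // tile)
    return tiles * tokens_per_tile / context_window

print(f"{image_context_fraction(1344, 1344):.0%}")   # 7056/8192 = 86%
```

At full resolution a single image leaves almost no room for the conversation, which is why capping resolution per request matters.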
Prefill vs. decode: Image encoding (prefill) is parallelizable and fast; generating the response (decode) is the latency bottleneck. Caching visual token representations for repeated images can dramatically reduce time-to-first-token.
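A sketch of caching visual tokens keyed by a hash of the raw image bytes (a plain dict here; a production system would use an LRU or an external cache, and `encode_fn` stands in for your encoder-plus-projector pipeline):

```python
import hashlib

class VisualTokenCache:
    """Cache projected visual tokens by image-content hash, so repeated
    images skip the vision encoder entirely."""
    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # maps image bytes -> visual tokens
        self._cache = {}
        self.hits = 0

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = self.encode_fn(image_bytes)
        return self._cache[key]
```

Hashing the bytes (rather than a URL or filename) means re-uploads of the same image hit the cache regardless of where they came from.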
Batch inference: Visual encoding is highly parallelizable. vLLM and SGLang both support efficient batched VLM inference.
Conclusion
Vision-Language Models work by connecting a pretrained visual encoder (typically CLIP or SigLIP) to a language model via a learned projection layer, then fine-tuning the combined system on instruction-following data. The key variables are encoder resolution, projection architecture, and instruction-tuning data quality. Understanding this architecture explains both the impressive capabilities — rich visual reasoning, OCR, spatial understanding — and the predictable failure modes like hallucination and counting errors. The field is moving toward higher-resolution, natively multimodal architectures that tighten the integration between visual and linguistic processing.
Want to understand how these models are efficiently served in production? Read our guide on LLM Inference Optimization.