Foundation Model Engineering

11.1 Vision-Language Bridges

The transition from purely text-based Large Language Models (LLMs) to Multimodal Foundation Models requires bridging the gap between different modalities. Specifically, we need to connect visual understanding (images/videos) with text generation. This page explores the core architectures that serve as “bridges” between vision and language: CLIP, Flamingo, and the ubiquitous Projection Layer. We will dive deep into their mathematical formulations, architectural choices, and the engineering trade-offs involved.

The Challenge: Aligning Modalities

Visual data (pixels) and text data (tokens) live in completely different vector spaces. A model cannot directly understand an image by treating pixels as text tokens. We need a mechanism to translate visual features into a representation that a language model can process. This is the role of the Vision-Language Bridge. The goal is to align these spaces so that the model can understand that the visual concept of a “cat” mapped from pixels corresponds to the linguistic concept of “cat” mapped from tokens.

Bridge design is not a cosmetic implementation detail. It determines what kind of multimodal behavior the system can support. A weak bridge may be enough for retrieval or coarse captioning, but not for grounded dialogue about multiple images. A very expressive bridge may improve reasoning, but at the cost of more visual tokens, more cross-attention layers, and a much harder optimization problem. In practice, many product decisions in multimodal systems reduce to this question: how much visual information should the language model actually see, and in what form?

1. CLIP: The Contrastive Bridge (Representation Alignment)

Contrastive Language-Image Pre-training (CLIP), introduced by OpenAI in 2021 [1], is the foundational bridge for many multimodal models. It doesn’t generate text; instead, it learns to align image and text representations in a shared latent space.

The Architecture: Dual Encoders

CLIP uses a dual-encoder architecture:

  1. Image Encoder ($E_i$): Typically a Vision Transformer (ViT) or a ResNet that maps an image $I$ to a $d$-dimensional visual embedding $v = E_i(I)$.
  2. Text Encoder ($E_t$): A Transformer that maps a text description $T$ to a $d$-dimensional text embedding $t = E_t(T)$.

Both embeddings are normalized to unit length ($\|v\|_2 = 1$, $\|t\|_2 = 1$).

The Symmetric Contrastive Objective

The core innovation of CLIP is its training objective. Given a batch of $N$ image-text pairs, the model computes an $N \times N$ similarity matrix $S$, where $S_{i,j} = v_i^\top t_j$ is the cosine similarity between image $i$ and text $j$.

CLIP applies a learnable temperature parameter $\tau$ to scale the logits: $S_{i,j} = S_{i,j} / \tau$. The loss is the average of two cross-entropy losses: one for image-to-text retrieval and one for text-to-image retrieval.

Image-to-Text Loss ($\mathcal{L}_{i \to t}$): For each image, find the correct text.

$$\mathcal{L}_{i \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(S_{i,i})}{\sum_{j=1}^{N} \exp(S_{i,j})}$$

Text-to-Image Loss ($\mathcal{L}_{t \to i}$): For each text, find the correct image.

$$\mathcal{L}_{t \to i} = -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(S_{j,j})}{\sum_{i=1}^{N} \exp(S_{i,j})}$$

Total Loss:

$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2} \left( \mathcal{L}_{i \to t} + \mathcal{L}_{t \to i} \right)$$

This objective forces the model to maximize the similarity of the $N$ correct pairs (diagonal) while minimizing the similarity of the $N^2 - N$ incorrect pairs (off-diagonal).

PyTorch Implementation of CLIP Loss

Here is a clean PyTorch implementation of the symmetric contrastive loss used in CLIP.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPLoss(nn.Module):
    def __init__(self, init_temperature=0.07):
        super().__init__()
        # Learn the log-temperature so the effective temperature stays positive
        self.log_temperature = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, image_embeddings, text_embeddings):
        """
        Compute the symmetric contrastive loss.

        Args:
            image_embeddings: Tensor of shape (batch_size, embed_dim)
            text_embeddings: Tensor of shape (batch_size, embed_dim)
        """
        # Normalize embeddings to unit length
        image_embeddings = F.normalize(image_embeddings, p=2, dim=-1)
        text_embeddings = F.normalize(text_embeddings, p=2, dim=-1)

        batch_size = image_embeddings.size(0)

        # Scaled cosine-similarity matrix: logits[i, j] = v_i . t_j / tau
        logits = torch.matmul(image_embeddings, text_embeddings.t()) / torch.exp(self.log_temperature)

        # Correct pairs lie on the diagonal, so the target index for row i is i
        labels = torch.arange(batch_size, device=image_embeddings.device)

        # Symmetric cross-entropy: image-to-text and text-to-image
        loss_i = F.cross_entropy(logits, labels)
        loss_t = F.cross_entropy(logits.t(), labels)

        return (loss_i + loss_t) / 2

# Example Usage
batch_size = 32
embed_dim = 512
img_emb = torch.randn(batch_size, embed_dim)
text_emb = torch.randn(batch_size, embed_dim)

criterion = CLIPLoss()
loss = criterion(img_emb, text_emb)
print(f"CLIP Loss: {loss.item():.4f}")

2. Flamingo: The Cross-Attention Bridge (Generative Alignment)

While CLIP aligns representations, it does not allow for deep interaction between vision and language during generation. DeepMind’s Flamingo (2022) [2] introduced a breakthrough architecture for few-shot multimodal learning by injecting visual information directly into a frozen LLM.

Key Components

Flamingo connects a frozen visual encoder and a frozen LLM using two novel components:

A. Perceiver Resampler

Visual encoders (like ViT) output a variable number of features depending on the image resolution or video frames. Passing hundreds of visual tokens directly to the LLM would explode the sequence length and compute cost ($O(N^2)$ in self-attention).

The Perceiver Resampler solves this by mapping a variable number of spatio-temporal visual features to a fixed number of visual tokens (e.g., 64 tokens).

  • It uses a set of $K$ learned latent queries.
  • These queries attend to the visual features using cross-attention.
  • The output is always $K$ tokens, regardless of input size, which are then fed to the LLM.
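The resampling step above can be sketched as a single cross-attention block in PyTorch. This is a simplified illustration: the actual Flamingo module stacks several such blocks and also concatenates the latents to the keys and values. The class name and all dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: K learned latent queries cross-attend to visual features."""
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual_features):
        # visual_features: (batch, n_features, dim) with variable n_features
        batch_size = visual_features.size(0)
        queries = self.latents.unsqueeze(0).expand(batch_size, -1, -1)  # (B, K, dim)
        attn_out, _ = self.cross_attn(queries, visual_features, visual_features)
        x = self.norm(queries + attn_out)
        return x + self.ffn(x)  # always (B, K, dim), regardless of n_features

# The output is 64 tokens whether the input has 256 or 1,000 features
resampler = PerceiverResampler()
out = resampler(torch.randn(2, 256, 1024))
print(out.shape)  # torch.Size([2, 64, 1024])
```

Because the latent queries, not the visual features, determine the output length, the LLM's visual token budget is fixed at construction time.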

B. Gated Cross-Attention (Gated xAttn)

To allow the LLM to understand these visual tokens, Flamingo inserts new Gated Cross-Attention layers between the existing layers of the frozen LLM.

To prevent the new layers from destroying the LLM’s pre-trained knowledge at the start of training, Flamingo uses a tanh gating mechanism:

$$y = x + \tanh(\alpha) \cdot \text{CrossAttend}(x, V)$$

Where:

  • $x$ is the text feature from the previous LLM layer.
  • $V$ are the visual tokens from the Perceiver Resampler.
  • $\alpha$ is a learnable scalar initialized to 0.

Because $\tanh(0) = 0$, the model starts by behaving exactly like the original frozen LLM. As training progresses, $\alpha$ deviates from zero, and the model gradually learns to incorporate visual information.
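The gating equation can be sketched as a single PyTorch layer. This is a simplified illustration of the idea, not Flamingo's exact block, which also applies a second tanh gate to a feed-forward sublayer and masks attention for interleaved images.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Sketch of tanh-gated cross-attention: y = x + tanh(alpha) * CrossAttend(x, V)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: layer starts as identity

    def forward(self, text_hidden, visual_tokens):
        # Text hidden states query the visual tokens (keys and values)
        attn_out, _ = self.cross_attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.alpha) * attn_out

x = torch.randn(2, 16, 512)   # text hidden states from a frozen LLM layer
v = torch.randn(2, 64, 512)   # visual tokens from the resampler
layer = GatedCrossAttention()
y = layer(x, v)
print(torch.allclose(x, y))   # True: with alpha = 0 the layer is a no-op
```

Since the residual path is untouched, gradients still flow through the new layer at initialization, so $\alpha$ can learn to open the gate.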

3. The Projection Layer: The Simple Bridge (LMMs)

Modern Vision-Language models like LLaVA [3] use a much simpler but highly effective approach: a direct projection layer.

How it Works

Instead of complex cross-attention mechanisms inserted throughout the LLM, these models use:

  1. A pre-trained visual encoder (often CLIP’s ViT) to extract features from an image.
  2. A Projection Layer (typically a simple Linear Layer or a Multi-Layer Perceptron, MLP) to map these features directly into the text embedding space of the LLM.

The LLM treats these projected visual features as if they were special “visual tokens” and simply concatenates them with the normal text tokens in the input sequence.

$$X_v = \text{MLP}(E_i(I))$$

Input sequence to the LLM: $[X_v, X_{\text{text}}]$
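The two equations above fit in a few lines of PyTorch. The sketch below uses a two-layer MLP projector (the original LLaVA used a single linear layer; later variants use an MLP); all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps visual encoder features into the LLM's text embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features):
        return self.mlp(visual_features)

# Prepend projected visual tokens to the text token embeddings
vis_features = torch.randn(1, 576, 1024)    # e.g. patch features from a ViT encoder
text_embeds = torch.randn(1, 20, 4096)      # text token embeddings from the LLM
x_v = VisualProjector()(vis_features)       # (1, 576, 4096)
inputs = torch.cat([x_v, text_embeds], dim=1)
print(inputs.shape)  # torch.Size([1, 596, 4096]) fed to the LLM as one sequence
```

The LLM itself needs no architectural change: from its perspective, the projected visual features are just more embedding vectors at the front of the sequence.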

This approach is extremely simple to implement and allows for reusing existing LLM training infrastructure with minimal changes. It has become the dominant paradigm for open-source Large Multimodal Models (LMMs).

4. Engineering Trade-offs in Real Systems

Once we move from papers to deployed models, bridge choice becomes a systems trade-off rather than just an architectural preference.

Token Budget vs. Visual Fidelity

Every visual token competes with text tokens for context window budget and attention compute. If the bridge emits too many tokens, the model becomes expensive and may bury important text instructions. If it emits too few, it compresses away details such as small text in an image, spatial relations, or subtle frame-to-frame changes in video.
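To make the budget pressure concrete, here is some back-of-the-envelope arithmetic, assuming a ViT-style encoder with 14-pixel patches and a hypothetical 4,096-token context window:

```python
def visual_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder emits for a square image."""
    return (image_size // patch_size) ** 2

per_image = visual_tokens(336, 14)          # 576 tokens per 336x336 image
context_window = 4096                       # hypothetical LLM context length
images_that_fit = context_window // per_image
print(per_image, images_that_fit)           # 576 7
```

Without any resampling or pooling, seven images exhaust the entire window, leaving no room for the prompt, which is exactly why bridges that compress the visual stream matter.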

Frozen Backbones vs. End-to-End Adaptation

Many practical models freeze the vision encoder, freeze most of the LLM, and train only the bridge or projector. This dramatically lowers training cost and preserves the language model’s general capabilities. The downside is that the bridge must do a lot of representational work by itself. If the target domain is far from the pretraining data, the bridge may become the bottleneck.

Interleaving and Multi-Image Reasoning

Simple projection-layer models work best when the input pattern is straightforward: one image, then a prompt. Tasks such as visual dialogue, multi-image comparison, or long video understanding put more pressure on the bridge because the model must preserve temporal order, cross-reference visual evidence, and decide when to attend back to earlier frames. This is why deeper fusion mechanisms remain attractive even when projection layers dominate open-source practice.

Summary of Trade-offs

| Feature      | CLIP                      | Flamingo                      | Projection Layer (LLaVA)       |
|--------------|---------------------------|-------------------------------|--------------------------------|
| Primary Goal | Representation Alignment  | Generative Few-Shot           | Generative Instruction Tuning  |
| Fusion Type  | Late Fusion (Dot Product) | Deep Fusion (Cross-Attention) | Early Fusion (Concatenation)   |
| Compute Cost | Low                       | High (requires new layers)    | Low (simple MLP)               |
| Flexibility  | Retrieval & Zero-shot     | Interleaved Text/Image        | Fixed Image Input              |

Quizzes

Quiz 1: Why does CLIP use a symmetric loss ($\mathcal{L}_{i \to t}$ and $\mathcal{L}_{t \to i}$) instead of just one direction?

Using only one direction (e.g., matching text to images) would create an asymmetric space: multiple different texts could map close to the same image, while that image might not map close to those texts. The symmetric loss ensures that the representation space is aligned from both perspectives, leading to better generalization in both image-to-text (captioning/retrieval) and text-to-image (generation/retrieval) tasks.

Quiz 2: What is the main advantage of Flamingo's Perceiver Resampler over simply concatenating all visual features to the text input?

Concatenating all visual features (which can be hundreds or thousands of tokens for high-resolution images or videos) would drastically increase the sequence length. Since Transformer self-attention complexity is $O(N^2)$, this would make training and inference extremely expensive. The Perceiver Resampler reduces the visual features to a small, fixed number of tokens (e.g., 64), bounding the compute cost regardless of input resolution or video length.

Quiz 3: Why is the tanh gating initialized to zero in Flamingo's Gated Cross-Attention?

Initializing the gating parameter $\alpha$ to zero ensures that $\tanh(\alpha) = 0$. At the start of training, the visual information is therefore completely ignored, and the model behaves exactly like the pre-trained LLM. This prevents catastrophic forgetting and ensures stable training by slowly introducing visual information as the model learns.


References

  1. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.
  2. Alayrac, J. B., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198.
  3. Liu, H., et al. (2023). Visual Instruction Tuning. arXiv:2304.08485.