4.5 Various LLM Architectures
By the time many readers reach this section, it is tempting to think the architecture debate is over: decoder-only won, everyone copied GPT, and the rest is just scaling. In practice, that is where the real engineering begins. The moment you ask for longer context windows, lower serving cost, better throughput, or stronger tool use, models start to diverge again.
A useful analogy is transportation. Cargo ships, racing cars, and commuter trains all move people or goods, but each is optimized for a different bottleneck. Modern LLMs are similar. They still generate text autoregressively, yet they make different architectural trade-offs depending on whether the main pain point is KV-cache growth, attention complexity, routing efficiency, or training stability.
This section looks at those design pressures first, then the architectural ideas they produced.
Why Architectures Keep Diverging
Chapter 3 gave us the canonical Transformer. Real systems, however, run into recurring pressure points:
- Long contexts make full attention expensive.
- KV caches become costly during autoregressive serving.
- Dense feed-forward layers scale poorly as models get larger.
- Training objectives that work for next-token prediction are not always optimal for throughput or reasoning.
Most modern architectural changes can be understood as responses to one or more of these bottlenecks.
1. Hybrid and Linear Attention
Standard self-attention has quadratic cost in sequence length, so it becomes expensive when context grows. One increasingly common response is a hybrid attention stack.
Instead of using full attention in every layer, hybrid models mix expensive but expressive attention layers with cheaper mechanisms such as linear attention, local attention, or state-space style updates. The intuition is simple: not every layer needs to perform global token-to-token lookup with the same fidelity.
This is a recurring pattern in modern model design. Keep some layers that are excellent at exact retrieval and composition, then use cheaper layers elsewhere to carry longer-range state at lower cost.
Recent open-weight releases make this trend concrete. MiniMax-Text-01 scaled a hybrid stack where seven Lightning Attention layers are followed by one Softmax attention layer, combined with MoE, to support very long contexts [4]. Kimi Linear then pushed the idea further with Kimi Delta Attention (KDA), a refined gated-delta linear attention module, interleaved with global MLA layers in a 3:1 ratio; its model card reports up to 75% KV-cache reduction and up to 6x decoding throughput at 1M context [5]. Qwen3.6-35B-A3B shows the same direction in a smaller open-weight package: 35B total parameters, 3B active, a repeated 3 x Gated DeltaNet + 1 x Gated Attention layout, MoE after each attention block, multi-token prediction, and native 262K context extendable to about 1M tokens [6]. Ling-2.6-flash, released as an open model in April 2026, is another recent example: it upgrades a GQA design into a 1:7 MLA + Lightning Linear hybrid with highly sparse MoE, targeting faster agent-style inference rather than simply longer chain-of-thought outputs [7].
The important update is not that softmax attention disappeared. It has not. The newer pattern is selective full attention: keep a small number of global attention layers as high-fidelity retrieval/composition checkpoints, and let linear-attention layers carry most of the long-context state. This makes hybrid linear attention look less like an exotic research branch and more like a practical production architecture for agentic and long-context workloads.
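As a concrete sketch of the hybrid idea, the toy stack below interleaves three linear-attention layers with one exact softmax-attention layer, echoing the 3:1 ratios mentioned above. The feature map, the non-causal formulation, and all dimensions are illustrative choices for clarity, not any production model's actual kernels.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # Feature map phi(x) = elu(x) + 1 keeps scores positive, a common
    # choice in linear-attention work. (Non-causal here for brevity.)
    q, k = F.elu(q) + 1, F.elu(k) + 1
    # Associativity trick: (Q K^T) V == Q (K^T V).
    # K^T V is (d x d), so cost is O(n * d^2) instead of O(n^2 * d).
    kv = torch.einsum('bnd,bne->bde', k, v)                    # (B, d, d)
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + 1e-6)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

def full_attention(q, k, v):
    # Exact softmax attention: O(n^2 * d), kept for high-fidelity retrieval.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def hybrid_stack(x, n_blocks=2):
    # Three cheap linear layers for every one exact softmax layer.
    for _ in range(n_blocks):
        for _ in range(3):
            x = x + linear_attention(x, x, x)
        x = x + full_attention(x, x, x)
    return x
```

Real hybrid models replace these toy layers with gated variants (Lightning Attention, Gated DeltaNet, KDA) and add projections, normalization, and causal masking, but the cost structure is the same: most layers are linear in sequence length, and only a few pay the quadratic price.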
2. KV-Cache Compression and Memory-Aware Attention
For autoregressive models, serving cost is often dominated not just by compute, but by memory movement. Every generated token extends the KV cache, and that cache must be read repeatedly during inference.
This is why designs such as Multi-head Latent Attention (MLA) matter. Instead of storing large key and value tensors directly, the model stores a compressed latent representation and reconstructs what it needs for attention computation [1]. Some systems then combine cache compression with local or sparse attention patterns to reduce both memory footprint and attention work.
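A back-of-envelope calculation shows why this matters. The configuration below (32 layers, 32 KV heads, head dimension 128, a 512-dimensional latent) is hypothetical, chosen only to illustrate the scale of the saving:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_val=2):
    # Keys and values (factor 2), per layer, per head, per position.
    # bytes_per_val=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_val

# Hypothetical 7B-class config at a 128K-token context.
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=128_000)

# Same model storing only a 512-dim compressed latent per layer per token.
latent = 32 * 512 * 128_000 * 2

print(f"full KV cache:   {full / 1e9:.1f} GB")   # ~67 GB
print(f"latent cache:    {latent / 1e9:.1f} GB") # ~4 GB
```

With these illustrative numbers the latent cache is 16x smaller, which is the difference between a cache that fits on one accelerator and one that does not.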
3. Mixture of Experts in the Feed-Forward Stack
Another major lever is the feed-forward block. Dense FFNs scale cost with the full parameter count, while Mixture of Experts (MoE) activates only a subset of experts for each token.
MoE is attractive because it increases total model capacity without paying the full inference cost of a dense model of the same size. The trade-off is engineering complexity: routing, load balancing, expert parallelism, and communication overhead become first-class concerns. In other words, MoE often trades arithmetic cost for systems complexity.
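The routing mechanics can be sketched in a few lines. The block below is a minimal top-k routed MoE feed-forward layer with illustrative sizes; real systems add auxiliary load-balancing losses, capacity limits, and expert parallelism, which is exactly the systems complexity described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed MoE feed-forward block (illustrative only)."""
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens sending this slot to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Each token touches only `top_k` experts, so the arithmetic per token stays roughly constant while total parameter count grows with `n_experts`.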
4. Multi-Token Prediction (MTP)
Most language models are trained to predict exactly one next token. Multi-token prediction (MTP) changes the objective so that the model predicts several future positions at once.
This can improve training efficiency and create a cleaner path to faster inference schemes such as speculative decoding. Conceptually, it is a reminder that architecture is not only about layers and blocks. Training objectives also shape which serving optimizations become natural later.
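One way to sketch the objective, assuming one extra linear head per future position (one of several possible MTP designs, not the only one):

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Sketch: predict the next n_future tokens from each hidden state."""
    def __init__(self, d_model, vocab_size, n_future=4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, hidden):                   # hidden: (B, S, d_model)
        # Stack per-depth logits: (B, S, n_future, vocab_size)
        return torch.stack([h(hidden) for h in self.heads], dim=2)

def mtp_loss(logits, tokens):
    # logits: (B, S, n_future, V); tokens: (B, S + n_future) target ids.
    B, S, n_future, V = logits.shape
    loss = 0.0
    for d in range(n_future):
        # Head d at position t predicts token t + 1 + d.
        target = tokens[:, 1 + d : 1 + d + S]
        loss = loss + nn.functional.cross_entropy(
            logits[:, :, d].reshape(-1, V), target.reshape(-1)
        )
    return loss / n_future
```

The extra heads are usually dropped or repurposed at serving time; their drafts of tokens t+2, t+3, ... are what makes speculative decoding a natural fit.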
5. Optical Context Compression
The previous examples change how the model processes tokens once they are already inside the Transformer. A newer line of work asks a stranger question: what if a long text context should not enter the model as a long 1D token sequence at all?
DeepSeek-OCR is an example of this idea. Despite the name, its most interesting contribution is not simply “better OCR.” It investigates “contexts optical compression” (the paper’s own term): mapping text-heavy documents into a 2D visual representation, encoding that page image into a small number of vision tokens, and then using a language decoder to recover or reason over the content [3]. A page full of text can sometimes be cheaper to pass through a vision encoder than to feed as thousands of text tokens.
The intuition is familiar if you have ever screenshotted a long paragraph. A screenshot preserves line breaks, tables, mathematical notation, and spatial layout in one compact object. DeepSeek-OCR turns that intuition into a model pipeline:
- Render or receive the document as an image-like 2D signal.
- Compress the page with a vision encoder, DeepEncoder, into a bounded number of vision tokens.
- Decode the compressed representation with a DeepSeek3B-MoE-A570M language decoder.
- Produce OCR text, markdown, layout-aware outputs, or document-grounded responses.
This does not mean the system has abandoned text. The output is still text, and the decoder is still a language model. The architectural shift is at the input boundary: text is treated as something that can be optically compressed before it becomes model context.
For practitioners, the key quantity is the compression ratio: the number of text tokens a document would normally require divided by the number of vision tokens used to represent it.
If a document would normally consume 2,000 text tokens but can be represented by 200 vision tokens, the compression ratio is 10x. The DeepSeek-OCR paper reports that when this ratio stays below about 10x, OCR precision remains around 97%; at roughly 20x, accuracy falls to about 60% [3]. That trade-off is the whole story: optical compression can buy context length, but it is not a lossless replacement for text tokenization.
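The arithmetic is simple enough to state as code. The regime labels below are purely illustrative, derived from the reported numbers above; they are not thresholds published as a decision rule:

```python
def compression_ratio(n_text_tokens, n_vision_tokens):
    """Text tokens saved per vision token spent."""
    return n_text_tokens / n_vision_tokens

def expected_regime(ratio):
    # Rough regimes from the reported figures: ~97% precision below ~10x,
    # falling to ~60% around 20x. Illustrative cutoffs only.
    if ratio <= 10:
        return "high-fidelity (~97% precision reported)"
    if ratio < 20:
        return "degrading"
    return "lossy (~60% accuracy reported)"

ratio = compression_ratio(2000, 200)
print(ratio, expected_regime(ratio))
```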
Technically, this moves some of the burden from sequence modeling into visual representation learning. The encoder must preserve fine-grained glyph identity, reading order, table structure, and layout geometry while producing far fewer tokens than a conventional OCR or document parsing pipeline. The decoder then has to turn those compressed visual tokens back into exact symbols or structured text. This is very different from ordinary image captioning: a one-character error in code, math, legal text, or a table cell may be unacceptable.
The deeper implication is that “context length” is no longer only an attention problem. Long-context systems can be attacked at several levels:
- make attention cheaper over long token sequences with hybrid linear attention;
- make the KV cache cheaper during generation;
- route computation sparsely with MoE;
- predict or verify multiple tokens at once;
- or compress the input representation before it becomes a token sequence.
Optical context compression belongs to the last category. It is promising for scanned documents, PDFs, tables, receipts, historical archives, and layout-rich corpora. It is riskier for workloads where exact byte-level fidelity matters, such as source code patches, formal proofs, contracts, and safety-critical numeric tables. A robust production system would often pair it with confidence scoring, selective fallback to text OCR, or verification against the original document image.
A Better Way to Compare Modern Architectures
Rather than memorizing product names, it is more useful to compare architectures by the bottleneck they target.
| Design Pressure | Common Architectural Response | What You Gain | What You Give Up |
|---|---|---|---|
| Long contexts | Hybrid linear attention, local attention, sparse attention | Lower asymptotic attention cost | Less uniform behavior across layers and more custom kernels |
| KV-cache growth | GQA, MQA, MLA, sliding windows | Lower memory traffic during serving | More specialized implementation details |
| Massive parameter counts | MoE layers | Higher capacity per unit inference compute | Routing and load-balancing complexity |
| Throughput during decoding | MTP, speculative-friendly objectives | Faster generation paths | More training and serving coordination |
| Input context token growth | Optical 2D mapping, vision-token compression | Large documents fit into fewer context tokens | OCR errors, layout dependence, and verification burden |
| Long-context stability | QK-Norm, RoPE variants, windowing | Better optimization at scale | Extra design tuning and validation |
This lens also explains why there is no single “best” modern architecture. A model optimized for short, frequent chat requests may choose very different internals from one optimized for long-context document work or large-scale coding assistance.
A Short Story About the Bottleneck Shift
Early Transformer discussions focused mostly on parameter count and benchmark quality. Production systems taught a harsher lesson: many deployment failures do not come from insufficient model intelligence, but from bandwidth, latency, and memory pressure. Once models started serving long prompts at scale, “How smart is the model?” was joined by “What is the cheapest way to keep it responsive?”
That is why modern architecture work often looks less like a search for a universal winner and more like a search for the right compromise.
PyTorch Implementation: Simplified MLA
Despite the rise of hybrid attention, Multi-head Latent Attention (MLA) is still a useful example because it makes the serving problem concrete. The simplified implementation below shows the central idea: compress what you store, then reconstruct what you need for attention computation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleLatentAttention(nn.Module):
        def __init__(self, d_model, n_heads, d_head, d_latent):
            super().__init__()
            self.d_model = d_model
            self.n_heads = n_heads
            self.d_head = d_head
            self.d_latent = d_latent  # Dimension of compressed KV

            # Standard Q projection
            self.q_proj = nn.Linear(d_model, n_heads * d_head)
            # KV compression: projects input to a smaller latent space
            self.kv_compress = nn.Linear(d_model, d_latent)
            # KV decompression: reconstructs K and V from the latent space
            self.kv_decompress = nn.Linear(d_latent, n_heads * d_head * 2)
            self.o_proj = nn.Linear(n_heads * d_head, d_model)

        def forward(self, x):
            batch_size, seq_len, _ = x.size()

            # 1. Project query
            q = self.q_proj(x).view(batch_size, seq_len, self.n_heads, self.d_head)
            q = q.transpose(1, 2)  # (B, H, S, d_head)

            # 2. Compress KV
            # This 'latent' vector is what we store in the KV cache!
            latent = self.kv_compress(x)  # (B, S, d_latent)

            # 3. Decompress KV for attention computation
            kv = self.kv_decompress(latent).view(
                batch_size, seq_len, self.n_heads, self.d_head, 2
            )
            k = kv[..., 0].transpose(1, 2)  # (B, H, S, d_head)
            v = kv[..., 1].transpose(1, 2)  # (B, H, S, d_head)

            # 4. Scaled dot-product attention with a causal mask (simplified)
            scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_head ** 0.5)
            mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
            scores = scores.masked_fill(mask == 0, float('-inf'))
            attn = F.softmax(scores, dim=-1)

            context = torch.matmul(attn, v)  # (B, H, S, d_head)
            context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
            return self.o_proj(context)
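Note what would actually be cached during generation: only `latent`, never the reconstructed `k` and `v`. With illustrative dimensions (not tied to any released model), the per-token, per-layer saving is easy to compute:

```python
# Per-token, per-layer cache entries (illustrative dimensions):
n_heads, d_head, d_latent = 16, 64, 256

standard_kv = n_heads * d_head * 2   # keys + values stored per token
latent_kv = d_latent                 # compressed latent stored per token
print(standard_kv, latent_kv, standard_kv / latent_kv)  # 2048 256 8.0
```

Production MLA (as in DeepSeek-V3 [1]) adds decoupled rotary-position components and absorbs the decompression matrices into the query and output projections, but the cached object is the same: a small latent, not full keys and values.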
Interactive Visualization: Attention Masking
To better understand how these different architectures control information flow, it helps to look at their attention masks. The mask is what truly defines whether a model behaves as an encoder (bidirectional), a decoder (causal), or a prefix-LM.
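The three canonical patterns can be generated in a few lines; `prefix_len` marks the boundary of the bidirectional prefix in the prefix-LM case:

```python
import torch

def make_mask(seq_len, kind, prefix_len=0):
    """Return a (seq_len, seq_len) 0/1 attention mask. 1 = may attend."""
    if kind == "encoder":        # bidirectional: every token sees every token
        return torch.ones(seq_len, seq_len)
    if kind == "decoder":        # causal: lower-triangular
        return torch.tril(torch.ones(seq_len, seq_len))
    if kind == "prefix_lm":      # bidirectional over the prefix, causal after it
        mask = torch.tril(torch.ones(seq_len, seq_len))
        mask[:, :prefix_len] = 1.0
        return mask
    raise ValueError(f"unknown mask kind: {kind}")

for kind in ("encoder", "decoder", "prefix_lm"):
    print(kind)
    for row in make_mask(6, kind, prefix_len=3).int().tolist():
        print(' '.join(str(v) for v in row))
```

Printing the masks makes the difference visible at a glance: the encoder mask is all ones, the decoder mask is a staircase, and the prefix-LM mask is a staircase with a solid block of ones in its first `prefix_len` columns.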
Quizzes
Quiz 1: Why would a model designer choose a hybrid attention stack instead of using full self-attention in every layer?
Because full self-attention gives strong token-to-token interaction but becomes expensive at long sequence lengths. A hybrid stack keeps some layers that are good at exact global retrieval while replacing other layers with cheaper mechanisms such as local, linear, or state-space style updates. The goal is to preserve enough expressiveness while reducing cost.
Quiz 2: What is the advantage of combining KV-cache compression with sparse or local attention patterns?
These techniques attack different costs. KV-cache compression reduces memory footprint and bandwidth pressure, while sparse or local attention reduces the amount of attention work performed over long sequences. Used together, they can improve long-context serving more than either change alone.
Quiz 3: Why do long-context models often introduce stabilizers such as QK-Norm or attention windowing?
As sequence length grows, attention scores and memory use can become harder to control. Stabilizers such as QK-Norm help keep attention values numerically well-behaved during training, while windowing reduces how much context each layer must process at once. Together they make long-context optimization more practical.
Quiz 4: Why is optical context compression not equivalent to simply increasing a model’s text context window?
Increasing the text context window preserves the input as discrete text tokens and asks attention and KV-cache systems to handle more positions. Optical compression changes the representation before the model sees it: text and layout are encoded as vision tokens. This can reduce token count, but it introduces a lossy reconstruction problem, layout sensitivity, and the need to verify exact symbols when precision matters.
References
1. DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
2. Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
3. Wei, H., Sun, Y., & Li, Y. (2025). DeepSeek-OCR: Contexts Optical Compression. arXiv:2510.18234.
4. MiniMaxAI. (2025). MiniMax-Text-01 model card. Hugging Face.
5. Moonshot AI. (2025). Kimi Linear: An Expressive, Efficient Attention Architecture. Hugging Face.
6. Qwen. (2026). Qwen3.6-35B-A3B model card. Hugging Face.
7. Inclusion AI. (2026). Ling-2.6-flash model card. Hugging Face.