Foundation Model Engineering

7.3 Model & Pipeline Parallelism

In the previous section, we explored how ZeRO solves the Memory Wall by dynamically partitioning model states across a cluster. However, ZeRO has a fundamental limitation: it operates at the layer level. During a forward or backward pass, ZeRO must reconstruct the entire weight matrix of a specific layer on a single GPU.

As Foundation Models scale beyond 100 billion parameters, the tensors involved in individual operations become enormous. For a sufficiently large model or an extremely long context window, the activations and weights of a single layer can exceed the 80GB of VRAM on an H100 GPU. When a single matrix multiplication cannot fit into memory, Data Parallelism (even with ZeRO) fails. We must physically shard the model itself.
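A back-of-envelope calculation makes the problem concrete. The sketch below (a hypothetical helper, with illustrative shape parameters) sizes the naive attention score matrix, which grows quadratically with sequence length; production systems mitigate this with tiling tricks like FlashAttention, but the underlying tensors remain enormous.

```python
def attn_score_bytes(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> int:
    """Memory for the naive [batch, heads, seq, seq] attention score matrix (fp16 by default)."""
    return batch * heads * seq_len * seq_len * bytes_per_el

# A single attention layer at a 128k context with 64 heads, fp16:
tib = attn_score_bytes(batch=1, heads=64, seq_len=128 * 1024) / 2**40
print(f"{tib:.1f} TiB")  # 2.0 TiB -- far beyond one GPU's 80 GB
```

Even a batch of one at long context dwarfs a single accelerator's VRAM, which is why sharding the model itself becomes unavoidable.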

This brings us to the two pillars of model sharding: Pipeline Parallelism (Inter-layer) and Tensor Parallelism (Intra-layer).


1. Pipeline Parallelism (PP)

Pipeline Parallelism treats the neural network like a factory assembly line. Instead of placing the entire model on one GPU, we slice the model by depth. For a 40-layer Transformer on 4 GPUs, GPU 0 handles layers 1-10, GPU 1 handles layers 11-20, and so on.

The Bubble Problem and Micro-batching

The naive approach to PP is disastrous for hardware utilization. If GPU 0 processes a batch and sends it to GPU 1, GPU 0 sits idle until the backward pass returns. This idle time is known as the Pipeline Bubble.

To mitigate this, Google introduced GPipe [1]. GPipe divides a large mini-batch into smaller micro-batches. GPU 0 processes micro-batch 1 (F1) and immediately passes it to GPU 1. While GPU 1 works on F1, GPU 0 can start processing F2. This keeps the pipeline “full” and drastically shrinks the relative size of the bubble.
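The effect of micro-batching can be quantified. Assuming P pipeline stages and M micro-batches of equal duration, the idle fraction of the pipeline is (P-1)/(M+P-1); the hypothetical helper below sketches the arithmetic.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline: (P - 1) / (M + P - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# 4 stages: going from 1 micro-batch (naive pipelining) to 16 shrinks the bubble sharply.
print(bubble_fraction(4, 1))   # 0.75 -- the naive pipeline is idle 75% of the time
print(bubble_fraction(4, 16))  # ~0.158
```

As the number of micro-batches grows relative to the number of stages, the bubble fraction approaches zero.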

1F1B (One-Forward-One-Backward)

While GPipe solves the utilization problem, it creates a severe memory problem. GPipe executes all forward micro-batches before executing any backward micro-batches. This means GPU 0 must store the activations for every single micro-batch in memory until the backward pass eventually arrives.

Modern frameworks (like Megatron-LM) replace GPipe with the 1F1B (One-Forward-One-Backward) schedule. In 1F1B, as soon as the last GPU finishes a forward pass for a micro-batch, it immediately executes the backward pass. This backward pass propagates up the pipeline, interleaving with forward passes.

By running backward passes early, 1F1B frees the activation memory immediately. While the pipeline bubble size remains mathematically identical to GPipe, the peak memory footprint is drastically reduced, bounded only by the number of pipeline stages rather than the number of micro-batches.
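The memory claim can be checked with a toy model of the schedules. Assuming stage 0 must hold one activation set per micro-batch whose forward has run but whose backward has not yet returned, its peak count under GPipe is M, while under 1F1B it is capped at min(P, M). This is a hypothetical sketch of the bound, not a full scheduler.

```python
def peak_activations_stage0(num_stages: int, num_microbatches: int, schedule: str) -> int:
    """Peak number of micro-batch activation sets held by the first pipeline stage."""
    p, m = num_stages, num_microbatches
    if schedule == "gpipe":
        # All forwards run before any backward: stage 0 holds every micro-batch at once.
        return m
    if schedule == "1f1b":
        # The warm-up phase admits at most P micro-batches before backwards start draining memory.
        return min(p, m)
    raise ValueError(f"unknown schedule: {schedule}")

print(peak_activations_stage0(4, 32, "gpipe"))  # 32 -- grows with the micro-batch count
print(peak_activations_stage0(4, 32, "1f1b"))   # 4  -- bounded by the stage count
```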

Visualizing GPipe vs 1F1B

When comparing the GPipe and 1F1B schedules, pay close attention to peak memory: the number of micro-batch activations GPU 0 must hold at any given time.

[Figure: Pipeline execution timeline for 4 GPUs and 4 micro-batches. Each GPU row executes forward passes F1-F4 followed by backward passes B1-B4, each stage offset by one step from the stage above. A companion chart tracks GPU 0's stored activations, which climb from 1 to a peak of 4 as the forward passes complete and drain back down as the backward passes return.]

1F1B interleaves forward and backward passes, immediately freeing activation memory after each backward pass completes.


2. Tensor Parallelism (TP)

While Pipeline Parallelism splits the model between layers, Tensor Parallelism (TP) splits the mathematical operations inside a single layer.

Spearheaded by NVIDIA’s Megatron-LM [2], TP is the engine that makes training massive models possible. It physically slices the weight matrices of the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) blocks across multiple GPUs.

[Figure: megatron_tensor_parallelism. Source: Generated by Gemini; inspired by Shoeybi et al., 2019.]

The Megatron-LM MLP Split

A standard Transformer MLP consists of two linear transformations: Y = GeLU(XA) and Z = YB. Megatron-LM uses a brilliant combination of Column Parallelism and Row Parallelism to distribute this computation while minimizing communication.

  1. Column Parallelism on A: The first weight matrix A is split vertically into A1 and A2. GPU 0 computes Y1 = X A1 and GPU 1 computes Y2 = X A2. Because the non-linear GeLU activation is element-wise, it can be applied independently on each GPU. Zero communication is required so far.
  2. Row Parallelism on B: The second weight matrix B is split horizontally into B1 and B2. GPU 0 computes Z1 = Y1 B1 and GPU 1 computes Z2 = Y2 B2.
  3. Synchronization: The final output requires summing the partial results: Z = Z1 + Z2. This is achieved via a single All-Reduce operation.
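The algebra behind these three steps can be verified on a single device by emulating two tensor-parallel ranks; this is a hedged sketch of the math, not a distributed implementation. Splitting A by columns and B by rows, then summing the partial outputs (the role the All-Reduce plays), must reproduce the unsharded result up to floating-point tolerance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 8, 16
X = torch.randn(2, d_model)
A = torch.randn(d_model, d_ff)
B = torch.randn(d_ff, d_model)

# Reference: the unsharded MLP, Z = GeLU(XA) B
Z_ref = F.gelu(X @ A) @ B

# Column-parallel A (split output dim), row-parallel B (split input dim)
A1, A2 = A.chunk(2, dim=1)
B1, B2 = B.chunk(2, dim=0)

# Each emulated "GPU" computes independently; GeLU is element-wise, so no comms yet.
Z1 = F.gelu(X @ A1) @ B1
Z2 = F.gelu(X @ A2) @ B2

# The All-Reduce is mathematically just a sum of the partial results.
Z = Z1 + Z2
assert torch.allclose(Z, Z_ref, atol=1e-5)
```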

Below is a realistic PyTorch implementation of the Megatron-style MLP block. Notice how standard single-GPU operations are replaced with distributed primitives.

import torch
import torch.nn as nn
import torch.distributed as dist

class MegatronMLP(nn.Module):
    def __init__(self, d_model, d_ff, world_size):
        super().__init__()
        # Column Parallelism: split the output dimension of the first projection
        self.w1 = nn.Linear(d_model, d_ff // world_size)
        
        # Row Parallelism: split the input dimension of the second projection.
        # bias=False is essential: a per-rank bias would be summed world_size
        # times by the All-Reduce below. In practice, a single shared bias is
        # added after the All-Reduce instead.
        self.w2 = nn.Linear(d_ff // world_size, d_model, bias=False)
        self.gelu = nn.GELU()
        
    def forward(self, x):
        # 1. Column Parallel Forward (independent computation, no comms)
        # Input x: [batch, seq, d_model]
        # Output y_local: [batch, seq, d_ff // world_size]
        y_local = self.gelu(self.w1(x))
        
        # 2. Row Parallel Forward (compute partial sums)
        # Output z_local: [batch, seq, d_model]
        z_local = self.w2(y_local)
        
        # 3. All-Reduce to sum the partial results across all TP GPUs
        dist.all_reduce(z_local, op=dist.ReduceOp.SUM)
        
        return z_local

The Communication Bottleneck

Tensor Parallelism is incredibly powerful, but it comes with a severe hardware constraint. As seen in the code, an All-Reduce operation is required inside every single Transformer block. This high-frequency, blocking communication demands massive bandwidth. Consequently, TP is almost always confined to GPUs within a single physical node (e.g., 8 GPUs connected via NVLink). Running TP across nodes over standard Ethernet, or even InfiniBand, would cripple training speed.
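To see why bandwidth matters so much, consider the payload of each All-Reduce: the reduced tensor has shape [batch, seq, d_model]. The hypothetical estimator below assumes the common pattern of two all-reduces per layer in the forward pass (one after attention, one after the MLP) and two more in the backward pass; the shape numbers in the example are illustrative.

```python
def tp_allreduce_gb_per_layer(batch: int, seq: int, d_model: int,
                              bytes_per_el: int = 2, reduces_per_layer: int = 4) -> float:
    """Approximate GB all-reduced per Transformer layer under tensor parallelism."""
    payload = batch * seq * d_model * bytes_per_el  # one [batch, seq, d_model] fp16 tensor
    return reduces_per_layer * payload / 1e9

# e.g. batch 8, 8k context, d_model 8192, fp16:
print(f"{tp_allreduce_gb_per_layer(8, 8192, 8192):.1f} GB per layer")  # 4.3 GB per layer
```

Multiplied across dozens of layers per step, this traffic saturates anything slower than an NVLink-class intra-node fabric.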


3. Sequence Parallelism (SP)

As the AI industry shifted towards massive context windows (e.g., 128k to 1M tokens), engineers hit a new Memory Wall. In standard TP, the weight matrices are sharded, but every GPU still holds the full sequence length of the activations. Since attention memory scales quadratically with sequence length, long contexts cause Out-Of-Memory (OOM) errors regardless of TP.

Sequence Parallelism (SP) [3] solves this by sharding the data along the sequence dimension itself. If the sequence length is 8,000 and we have 8 GPUs, each GPU only stores and processes 1,000 tokens.

  • In the MLP layer, SP operates seamlessly because MLPs process tokens independently.
  • In the Attention layer, SP requires complex communication algorithms (like Ring Attention or DeepSpeed-Ulysses) to allow tokens on GPU 0 to attend to tokens on GPU 7 without materializing the full sequence on any single device.
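The MLP half of this claim is easy to verify on a single device: because an MLP treats each token independently, splitting the sequence dimension into shards, processing them separately, and concatenating reproduces the full-sequence result. A hedged single-process sketch (the shapes are illustrative):

```python
import torch

torch.manual_seed(0)
mlp = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.GELU(), torch.nn.Linear(64, 16)
)
x = torch.randn(2, 8, 16)  # [batch, seq=8, d_model]

# Full-sequence forward vs. two emulated "GPUs" each holding 4 tokens
full = mlp(x)
shards = [mlp(chunk) for chunk in x.chunk(2, dim=1)]
assert torch.allclose(torch.cat(shards, dim=1), full, atol=1e-6)
# Attention could NOT be sharded this naively: tokens must attend across shards,
# which is exactly the communication that Ring Attention / DeepSpeed-Ulysses handle.
```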

4. 3D Parallelism & Automated Sharding

To train a state-of-the-art model like Llama 3 or GPT-4, engineers do not choose between these techniques; they combine them into 3D Parallelism.

[Figure: 3d_parallelism_topology. Source: Generated by Gemini.]

  1. Tensor Parallelism (Intra-node): Shards the math across the 8 GPUs inside a single server using high-speed NVLink.
  2. Pipeline Parallelism (Inter-node): Shards the layers across multiple servers using InfiniBand.
  3. Data Parallelism (Cluster-wide): Replicates this entire TP+PP setup across thousands of servers to process massive datasets concurrently.
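A common convention maps each global rank to (dp, pp, tp) coordinates, with the TP index varying fastest so that every TP group stays inside one NVLink-connected node. Frameworks differ in the exact layout; this is a hypothetical sketch of the idea.

```python
def rank_to_coords(rank: int, tp_size: int, pp_size: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp) coordinates, with tp varying fastest."""
    tp = rank % tp_size                    # position within the intra-node TP group
    pp = (rank // tp_size) % pp_size       # pipeline stage (inter-node)
    dp = rank // (tp_size * pp_size)       # data-parallel replica index
    return dp, pp, tp

# 64 GPUs: TP=8 (intra-node), PP=4 (inter-node), DP=2
print(rank_to_coords(0, 8, 4))   # (0, 0, 0)
print(rank_to_coords(13, 8, 4))  # (0, 1, 5) -- second pipeline stage, TP slot 5
print(rank_to_coords(63, 8, 4))  # (1, 3, 7) -- last rank of the second replica
```

Ranks 0-7 share a node and form one TP group, which is precisely why the highest-bandwidth dimension is assigned the fastest-varying index.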

The Rise of Auto-Parallelism

Manually mapping a neural network to a 3D hardware topology is an agonizing engineering task. The current frontier of systems research is Automated Parallelism. Compilers like Alpa [4] treat parallelism as an optimization problem: by analyzing the model's computational graph and the cluster's network bandwidth, Alpa automatically derives a near-optimal combination of DP, TP, and PP without requiring engineers to write custom distributed code.


Quizzes

Quiz 1: Why does Megatron-LM specifically pair Column Parallelism for the first linear layer with Row Parallelism for the second linear layer in the MLP block? By executing Column Parallelism first, the intermediate activation tensor is physically partitioned across the GPUs, requiring zero communication. The subsequent Row Parallelism computes a partial sum. This design ensures that the entire two-layer MLP block requires only a single All-Reduce synchronization at the very end, effectively halving the communication overhead compared to naive sharding.

Quiz 2: In Pipeline Parallelism, why does the 1F1B (One-Forward-One-Backward) schedule drastically reduce peak memory compared to GPipe, even though both have the exact same pipeline bubble size? GPipe processes all forward micro-batches before starting any backward passes, forcing the early pipeline stages to store the activations for every single micro-batch in memory simultaneously. 1F1B interleaves the passes. As soon as a micro-batch completes its backward pass, its activations are freed. Therefore, 1F1B’s peak memory is bounded by the number of pipeline stages, independent of the total number of micro-batches.

Quiz 3: Why is Tensor Parallelism (TP) typically restricted to GPUs within the same physical server, while Pipeline Parallelism (PP) is often deployed across different servers? TP requires a blocking All-Reduce synchronization inside every single Transformer block (multiple times per layer). This high-frequency communication requires the massive bandwidth and low latency of intra-node connections like NVLink. PP only communicates boundary activations between layers, which happens less frequently and requires far less bandwidth, making it suitable for inter-node connections like InfiniBand.

Quiz 4: How does Sequence Parallelism (SP) solve the memory bottleneck of extremely long context windows that standard Tensor Parallelism cannot resolve? In standard Tensor Parallelism, the model weights are sharded, but every GPU still holds the full sequence length of the activations. For very long contexts, activation memory scales quadratically and causes OOM errors. SP partitions the sequence dimension itself, meaning each GPU only stores and processes a fraction of the tokens, drastically reducing the activation memory footprint.

Quiz 5: In Pipeline Parallelism with GPipe or 1F1B, let P be the number of pipeline stages and M be the number of micro-batches. Formulate the equation for the pipeline bubble size as a fraction of the total compute time. The pipeline bubble consists of P - 1 idle steps at the start of the forward pass and P - 1 at the end of the backward pass, for 2(P - 1) unutilized steps per stage. The ideal computational workload spans 2M steps per stage (M forward plus M backward). The bubble fraction is therefore F_bubble = 2(P - 1) / (2M + 2(P - 1)) = (P - 1) / (M + P - 1). As M increases relative to P, the bubble fraction approaches zero, maximizing cluster utilization.


References

  1. Huang, Y., et al. (2019). GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism. NeurIPS. arXiv:1811.06965.
  2. Shoeybi, M., et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.
  3. Li, S., et al. (2021). Sequence Parallelism: Making 4D Parallelism Possible. arXiv:2105.13120.
  4. Zheng, L., et al. (2022). Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. OSDI. arXiv:2201.12023.