Foundation Model Engineering

17.3 Elo Rating & Leaderboards

Evaluating Large Language Models (LLMs) is one of the hardest problems in AI. Academic benchmarks like MMLU or GSM8K are static and prone to data contamination. To truly understand a model’s capabilities, the AI community has turned to human preference evaluation, formalized through Elo Rating systems and live Leaderboards.

The most prominent example of this is the LMSYS Chatbot Arena, where users chat with two anonymous models side-by-side and vote on which response is better. This crowdsourced, blind A/B testing creates a dynamic, hard-to-game leaderboard.

The Mechanism: Crowdsourced Duels

The process of building an LLM leaderboard using Elo ratings involves:

  1. Blind Duels: A user enters a prompt. Two anonymous models (e.g., Model A and Model B) generate responses.
  2. Human Judgment: The user votes for the better response (or declares a tie).
  3. Rating Update: The system updates the Elo ratings of both models based on the outcome and their pre-existing ratings.
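The raw material for these steps is a log of votes. The sketch below shows one way such a log might be represented and turned into numeric scores (1 for a win, 0.5 for a tie, 0 for a loss, the convention used in the rest of this section). The `Duel` class, the vote labels, and the model names are hypothetical and chosen only for illustration; they are not the schema of any real platform.

```python
from dataclasses import dataclass

# Hypothetical vote labels; a real platform may use different names.
SCORE_FOR_A = {"model_a": 1.0, "tie": 0.5, "model_b": 0.0}


@dataclass(frozen=True)
class Duel:
    prompt: str
    model_a: str
    model_b: str
    vote: str  # one of the keys of SCORE_FOR_A

    def score_a(self) -> float:
        """Score credited to model_a: 1 for a win, 0.5 for a tie, 0 for a loss."""
        return SCORE_FOR_A[self.vote]

    def score_b(self) -> float:
        """Score credited to model_b; the two scores always sum to 1."""
        return 1.0 - self.score_a()


def average_scores(duels):
    """Average score per model over all duels it took part in."""
    totals, counts = {}, {}
    for d in duels:
        for name, score in ((d.model_a, d.score_a()), (d.model_b, d.score_b())):
            totals[name] = totals.get(name, 0.0) + score
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


duels = [
    Duel("Explain recursion.", "alpha", "beta", "model_a"),
    Duel("Write a haiku.", "beta", "alpha", "tie"),
    Duel("Sum 1 to 100.", "alpha", "beta", "model_b"),
]
print(average_scores(duels))
```

A plain average like this ignores how strong each opponent was, which is the gap the rating update in step 3 is meant to close.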

Figure: Elo Rating System for LLMs. Source: Generated by Gemini.

The Mathematics of Elo

Originally designed for chess, the Elo rating system calculates the relative skill levels of players. In the context of LLMs, the “players” are the models.

1. Expected Score

Given two models with ratings $R_A$ and $R_B$, the expected score (probability of winning) for Model A is calculated using a logistic curve:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

Similarly, for Model B:

$$E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}}$$

Note that $E_A + E_B = 1$. If Model A has a rating $400$ points higher than Model B, $E_A \approx 0.91$, meaning Model A is expected to win about $91\%$ of the time.
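The expected-score formula is easy to check numerically. The following is a minimal sketch in plain Python; the function name `expected_score` and the baseline rating of 1500 are assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Print both expected scores for a few rating gaps in favour of A.
for gap in (0, 100, 200, 400):
    e_a = expected_score(1500 + gap, 1500)
    e_b = expected_score(1500, 1500 + gap)
    print(f"gap={gap:>3}  E_A={e_a:.3f}  E_B={e_b:.3f}  sum={e_a + e_b:.3f}")
```

Equal ratings give an expected score of 0.5, and a 400-point gap gives 10/11, which is roughly 0.909.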

2. Rating Update

After a duel, the ratings are updated based on the actual score $S_A$ and the expected score $E_A$:

$$R'_A = R_A + K(S_A - E_A)$$

Where:

  • $R'_A$ is the new rating.
  • $K$ is a constant (K-factor) that determines how much a single game affects the rating (e.g., $K = 32$).
  • $S_A$ is the actual score: $1$ for a win, $0.5$ for a tie, and $0$ for a loss.

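If a highly rated model loses to a low-rated model, the rating change is large because the magnitude of $(S_A - E_A)$ is large: the expected score was close to $1$ but the actual score was $0$. Model B's rating moves by the same amount in the opposite direction, so the total number of rating points is conserved.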
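As a worked example, assume illustrative ratings $R_A = 1600$ and $R_B = 1400$ with $K = 32$. The expected score for Model A is

$$E_A = \frac{1}{1 + 10^{(1400 - 1600)/400}} = \frac{1}{1 + 10^{-0.5}} \approx 0.760$$

If Model A wins ($S_A = 1$), the expected result earns it only a small gain:

$$R'_A = 1600 + 32\,(1 - 0.760) \approx 1607.7$$

If Model A loses ($S_A = 0$), the upset costs it roughly three times as much:

$$R'_A = 1600 + 32\,(0 - 0.760) \approx 1575.7$$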

Beyond Elo: The Bradley-Terry Model

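While standard Elo updates ratings sequentially, platforms like LMSYS often use the Bradley-Terry model for offline analysis. The Bradley-Terry model fits a single probability model to the entire dataset of duels at once using maximum likelihood estimation, so the resulting ratings do not depend on the order in which votes arrived. This makes it more robust than sequential Elo updates when many models are compared repeatedly. The basic model has no tie outcome; a common workaround is to count a tie as half a win for each side.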
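To make the idea concrete, here is a minimal sketch of a Bradley-Terry fit using plain gradient ascent on the log-likelihood. It is an illustration under stated assumptions, not the procedure any particular leaderboard uses: the function names, learning rate, step count, and toy data are hypothetical, and ties are treated as half a win for each side.

```python
import math


def fit_bradley_terry(duels, models, lr=1.0, steps=500):
    """Fit latent skills by gradient ascent on the log-likelihood.

    duels: list of (model_i, model_j, score_i), score_i in {1.0, 0.5, 0.0}.
    A tie (0.5) counts as half a win for each side (assumption of this sketch).
    """
    s = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for i, j, score_i in duels:
            p_i = 1.0 / (1.0 + math.exp(s[j] - s[i]))  # P(i beats j)
            grad[i] += score_i - p_i
            grad[j] -= score_i - p_i
        for m in models:
            s[m] += lr * grad[m] / len(duels)
        # Skills are only identified up to a constant shift, so centre them.
        mean = sum(s.values()) / len(s)
        for m in models:
            s[m] -= mean
    return s


def to_elo_scale(s, base=1000.0):
    """Map natural-log skills onto the base-10, 400-point Elo scale."""
    return {m: base + 400.0 / math.log(10.0) * v for m, v in s.items()}


# Toy data: a beats b in 3 of 4 duels, b beats c in 3 of 4 duels.
duels = ([("a", "b", 1.0)] * 3 + [("a", "b", 0.0)]
         + [("b", "c", 1.0)] * 3 + [("b", "c", 0.0)])
skills = fit_bradley_terry(duels, ["a", "b", "c"])
print(to_elo_scale(skills))
```

Because every duel enters the same objective, fitting the reversed list of duels gives the same skills up to floating-point error. Multiplying by 400 / ln(10) makes the fitted skill differences directly comparable to Elo rating differences.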

Implementing Batch Elo Updates in PyTorch

For developers running internal model evaluations (e.g., comparing multiple fine-tuned versions), a batch Elo update in PyTorch scores many duels in one vectorized call. The function below computes expected scores and rating updates for a batch of duels, treating each row as an independent duel evaluated from the ratings passed in.

import torch

def compute_batch_elo_update(ratings_a, ratings_b, outcomes, k_factor=32.0):
    """
    Compute Elo rating updates for a batch of duels.
    
    Args:
        ratings_a: Tensor of current ratings for Model A. Shape (batch_size,)
        ratings_b: Tensor of current ratings for Model B. Shape (batch_size,)
        outcomes: Tensor of outcomes for Model A (1.0 for win, 0.5 for tie, 0.0 for loss).
                  Shape (batch_size,)
        k_factor: The K-factor hyperparameter.
        
    Returns:
        new_ratings_a: Updated ratings for Model A.
        new_ratings_b: Updated ratings for Model B.
    """
    # Compute expected scores
    # torch.pow(10.0, x) raises 10 to each element of x
    # (equivalent to torch.exp(x * math.log(10)))
    exponent = (ratings_b - ratings_a) / 400.0
    expected_a = 1.0 / (1.0 + torch.pow(10.0, exponent))
    expected_b = 1.0 - expected_a
    
    # Compute updates
    # outcomes are scores for A. For B, score is (1.0 - outcomes)
    update_a = k_factor * (outcomes - expected_a)
    update_b = k_factor * ((1.0 - outcomes) - expected_b)
    
    new_ratings_a = ratings_a + update_a
    new_ratings_b = ratings_b + update_b
    
    return new_ratings_a, new_ratings_b

# Example Usage
# Initial ratings for a batch of four independent duels
r_a = torch.tensor([1500.0, 1600.0, 1400.0, 1500.0])
r_b = torch.tensor([1400.0, 1700.0, 1400.0, 1500.0])

# Outcomes for A: Win, Loss, Tie, Win
outcomes = torch.tensor([1.0, 0.0, 0.5, 1.0])

new_a, new_b = compute_batch_elo_update(r_a, r_b, outcomes)

print("Original A:", r_a)
print("Updated A:", new_a)
print("Original B:", r_b)
print("Updated B:", new_b)
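The batch function above updates every row from the ratings it is given, so it does not carry a model's rating forward when the same model appears in several rows. When ratings are carried forward one duel at a time, the final result depends on the processing order. The following plain-Python sketch illustrates this; the `replay` helper, the model names, and the three-vote log are hypothetical.

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def replay(duels, k=32.0, initial=1500.0):
    """Apply Elo updates one duel at a time, carrying ratings forward.

    duels: list of (model_a, model_b, score_a) tuples.
    """
    ratings = {}
    for a, b, score_a in duels:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        delta = k * (score_a - expected_score(r_a, r_b))
        ratings[a] = r_a + delta
        ratings[b] = r_b - delta  # zero-sum: B's change mirrors A's
    return ratings


# x beats y twice, then y beats x once.
log = [("x", "y", 1.0), ("x", "y", 1.0), ("y", "x", 1.0)]
forward = replay(log)
backward = replay(list(reversed(log)))

print("forward: ", forward)
print("backward:", backward)
```

Both runs see exactly the same three votes, yet model x ends with a different rating in each. This is one reason to shuffle and average over many replays, or to fit all votes jointly, when the vote log is static.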

Quizzes

Quiz 1: Why is a blind A/B test (like Chatbot Arena) considered superior to standard benchmarks for evaluating chat models? Standard benchmarks are static and can be contaminated (included in training data), leading to inflated scores. A blind A/B test with human judges evaluates models on diverse, real-world prompts that are hard to predict or game. It measures actual user preference, which is the ultimate goal for chat assistants.

Quiz 2: What happens to the ratings if a model with 2000 Elo wins against a model with 1000 Elo? What if it loses? If the 2000 Elo model wins, the rating change will be very small because the expected score was already close to 1.0 (about 0.997, so with K = 32 it gains roughly 0.1 points). If it loses, the rating change will be very large (close to the full K-factor, roughly 31.9 points with K = 32) because it was a highly unexpected result, causing the winner to gain many points and the loser to lose the same number.

Quiz 3: What is the main limitation of the sequential Elo update system compared to the Bradley-Terry model in evaluation platforms? Sequential Elo updates are order-dependent; the final rating depends on the order in which games were processed. The Bradley-Terry model is a full-dataset optimization approach that considers all games simultaneously, making it more robust to the order of data and better at handling sparse data where not all models have played against each other.

Quiz 4: Derive the gradient-descent update used to fit the Bradley-Terry model by Maximum Likelihood Estimation (MLE) for offline leaderboard updates. The Bradley-Terry model defines the win probability as $P(i > j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}}$, where $s_i$ is the continuous latent skill parameter of model $i$. Offline fitting minimizes the aggregate negative log-likelihood $L = -\sum \ln P(i > j)$, summed over all recorded duels in which $i$ beat $j$. The gradient with respect to the skill parameter $s_i$ is $\frac{\partial L}{\partial s_i} = -\sum_j \left( \mathbb{I}(i \text{ beats } j) - P(i > j) \right)$, where the sum runs over every duel that model $i$ played against an opponent $j$. Gradient descent repeats $s_i \leftarrow s_i - \eta \frac{\partial L}{\partial s_i}$ for all models, with learning rate $\eta$, until the likelihood stops improving, which avoids the order-dependent noise inherent in incremental Elo updates.


References

  1. Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco.
  2. Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.