Foundation Model Engineering

2.4 CNNs for NLP

While Recurrent Neural Networks (RNNs) were considered the natural choice for sequential data, Convolutional Neural Networks (CNNs), famous for their success in computer vision, were also adapted for Natural Language Processing (NLP).


Motivation: The Need for Speed

The primary motivation for using CNNs in NLP was computational efficiency.

  • The RNN Bottleneck: RNNs are inherently sequential. You cannot calculate the state of word 100 without first calculating words 1 through 99. This prevents full parallelization on GPUs.
  • The CNN Advantage: Convolutions apply the same operation to different parts of the input simultaneously. This makes them blazingly fast to train.

Before Transformers showed how to achieve both global context and parallelization, CNNs were the go-to choice for fast text processing.


The Metaphor: The Reading Glass

Imagine you are reading a long line of text through a narrow reading glass that only lets you see 3 words at a time.

  • You start at the beginning and slide the glass down the line.
  • At each stop, you look at the 3 words and recognize a pattern (e.g., “not very good”).
  • You write down a score for how strongly that pattern matched what you were looking for.
  • By the time you reach the end, you have a list of scores for the whole sentence.
  • Max Pooling is like looking at that list and picking the single highest score—the most important pattern you found in the whole sentence.

This is exactly how a 1D convolution works on text.
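The reading-glass metaphor can be sketched in a few lines of plain Python. The pattern scorer below is a toy stand-in for a learned filter (the target phrase and scores are illustrative, not from a trained model):

```python
# A "reading glass" that sees 3 words at a time, scores each window,
# and max pooling keeps the single best score.
sentence = "this movie was not very good at all".split()
window = 3

def score(words):
    # Toy scorer: how strongly the window matches "not very good".
    target = ["not", "very", "good"]
    return sum(w == t for w, t in zip(words, target)) / len(target)

# Slide the window over the sentence, one score per position
scores = [score(sentence[i:i + window])
          for i in range(len(sentence) - window + 1)]
best = max(scores)  # max pooling: the strongest match in the sentence
print(scores)       # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(best)         # 1.0 -- the window "not very good"
```

A real convolutional filter replaces the hand-written scorer with a learned dot product over word embeddings, but the sliding-and-pooling structure is the same.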


How 1D Convolutions Work on Text

In computer vision, CNNs use 2D convolutions over pixels. In NLP, we use 1D Convolutions over sequences of word embeddings.

[Figure: TextCNN Architecture. Source: Generated by AI (inspired by Kim, 2014)]
(placeholder anchor)

The Mechanism

Imagine a sentence represented as a matrix where each row is a word embedding (e.g., d = 300).

  1. Filter: A filter matrix of size k × d (where k is the window size, e.g., 3 words) slides down the sentence.
  2. Convolution: At each step, it performs an element-wise multiplication and sums the result to produce a single number (a feature).
  3. Feature Map: As the filter slides down, it produces a vector of features (a feature map).
  4. Pooling: Typically, Max-over-time Pooling is applied, taking the maximum value from the feature map to identify the most important feature in the entire sentence.

This allows the network to capture local patterns (n-grams), such as “not good” or “very fast”.
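The four steps above can be traced numerically with a tiny toy example. The dimensions here (6 words, d = 4, k = 3) are made up for illustration; random tensors stand in for real embeddings and a learned filter:

```python
import torch

torch.manual_seed(0)

seq_len, d, k = 6, 4, 3                  # 6 words, embedding dim 4, window of 3
sentence = torch.randn(seq_len, d)       # step 0: one row per word embedding
filt = torch.randn(k, d)                 # step 1: a single k x d filter

# Steps 2-3: slide the filter down the sentence; each stop is an
# element-wise multiply followed by a sum, giving one feature per window.
features = torch.stack([
    (sentence[i:i + k] * filt).sum()
    for i in range(seq_len - k + 1)
])
print("feature map:", features)          # length seq_len - k + 1 = 4

# Step 4: max-over-time pooling keeps the strongest activation
print("pooled:", features.max())
```

This hand-rolled loop computes exactly what `nn.Conv1d` with a single output channel does, just without the batching and padding machinery.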


Breakthrough Architectures

  • TextCNN (Kim, 2014) [1]: A simple but effective architecture that uses multiple filter sizes to capture different n-gram lengths, followed by max-pooling. It showed excellent performance on text classification tasks.

Interesting Fact: When Yoon Kim’s TextCNN paper was published, it surprised many researchers. At the time, the dominant view was that “text is a sequence, so we must use RNNs.” However, CNNs were able to capture key word combinations (n-grams) brilliantly, just like they capture local features in images, and most importantly, they were much faster than RNNs.


PyTorch TextCNN

Here is a simplified version of the famous TextCNN architecture (Kim, 2014) [1] in PyTorch, using 1D convolutions for text classification.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        # 3 different filter sizes: 3, 4, and 5 words
        self.conv1 = nn.Conv1d(in_channels=embed_dim, out_channels=100, kernel_size=3)
        self.conv2 = nn.Conv1d(in_channels=embed_dim, out_channels=100, kernel_size=4)
        self.conv3 = nn.Conv1d(in_channels=embed_dim, out_channels=100, kernel_size=5)
        
        self.fc = nn.Linear(300, num_classes)
        
    def forward(self, x):
        # x shape: (batch_size, seq_len)
        embedded = self.embedding(x) # (batch, seq_len, embed_dim)
        
        # Conv1d expects (batch, channels, seq_len)
        embedded = embedded.transpose(1, 2)
        
        # Apply convolutions and ReLU
        x1 = F.relu(self.conv1(embedded))
        x2 = F.relu(self.conv2(embedded))
        x3 = F.relu(self.conv3(embedded))
        
        # Max-over-time pooling
        x1 = F.max_pool1d(x1, x1.shape[2]).squeeze(2)
        x2 = F.max_pool1d(x2, x2.shape[2]).squeeze(2)
        x3 = F.max_pool1d(x3, x3.shape[2]).squeeze(2)
        
        # Concatenate pool results
        combined = torch.cat((x1, x2, x3), dim=1)
        
        # Fully connected layer
        logits = self.fc(combined)
        return logits

# Example usage
vocab_size = 1000
embed_dim = 50
num_classes = 2
model = TextCNN(vocab_size, embed_dim, num_classes)

# Random input sequence of 10 words
x = torch.randint(0, vocab_size, (1, 10))
logits = model(x)
print("Output Logits Shape:", logits.shape)

Example: Sliding Window Convolution

Visualize how a filter of size 3 slides over a sentence and produces a feature map. Max-pooling picks the highest value.

Sentence: "This movie was not good at all."
Feature map output: [0.1, 0.2, 0.3, 0.9, 0.4, 0.2]
Max pooled value: 0.9

Looking Ahead: The Need for Global Context

We have seen that CNNs offer speed through parallelization, but each filter sees only a small local window; capturing long-range dependencies requires stacking many layers, whereas an RNN can, at least in principle, carry information across the entire sequence.

  • Is it possible to achieve both full parallelization and global context?
  • Can we model relationships between all words in a sentence directly, without sequential processing or local windows?
  • How does the concept of “Self-Attention” eliminate the need for recurrence and convolution?

These questions lead us to Chapter 3: Transformers, where we will explore the architecture that revolutionized AI by relying entirely on attention.


Quizzes

Quiz 1: What is the main advantage of CNNs over RNNs in NLP? The main advantage is parallelization. CNNs can process all parts of a sequence simultaneously using convolutions, making them much faster to train on GPUs than RNNs, which must process steps sequentially.

Quiz 2: How do CNNs capture long-range dependencies in text? CNNs capture long-range dependencies by stacking multiple convolutional layers on top of each other or by using dilated convolutions. As layers get deeper, the receptive field of each neuron grows, allowing it to see more of the input sequence.
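The growth of the receptive field mentioned in Quiz 2 can be observed empirically: push a single impulse through stacked convolutions (with all-ones weights, chosen only so every reachable position becomes nonzero) and count how many output positions it touches. This is a sketch, not a trained model:

```python
import torch
import torch.nn as nn

# Empirical receptive-field check: an impulse through stacked kernel-3
# convolutions; nonzero outputs mark the positions the input can influence.
def make_conv(dilation=1):
    c = nn.Conv1d(1, 1, kernel_size=3, padding=dilation,
                  dilation=dilation, bias=False)
    nn.init.ones_(c.weight)   # all-ones weights so reachable positions are nonzero
    return c

x = torch.zeros(1, 1, 31)
x[0, 0, 15] = 1.0             # single impulse in the middle

plain = nn.Sequential(make_conv(), make_conv(), make_conv())
dilated = nn.Sequential(make_conv(1), make_conv(2), make_conv(4))

with torch.no_grad():
    print(int((plain(x) != 0).sum()))    # 7  = 3 * (3 - 1) + 1
    print(int((dilated(x) != 0).sum()))  # 15: dilations 1, 2, 4 reach further
```

Three plain layers reach 7 tokens, while the same three layers with dilations 1, 2, 4 reach 15, which is why dilated convolutions were a popular way to grow the receptive field cheaply.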

Quiz 3: Why are Transformers preferred over CNNs for large language models today? Transformers use self-attention, which allows every position to attend to all other positions in a single operation, regardless of distance. This captures long-range dependencies much more effectively than stacking many convolutional layers, while still being highly parallelizable.

Quiz 4: What is the purpose of Max-over-time Pooling in TextCNN? Max-over-time pooling takes the maximum value across the entire sequence length for each feature map. This identifies the strongest activation of a specific pattern anywhere in the sentence, making the model invariant to the exact position of the pattern.

Quiz 5: What is the effect of using multiple filter sizes in TextCNN? Using multiple filter sizes (e.g., 3, 4, 5) allows the model to capture n-grams of different lengths simultaneously. This is analogous to looking for patterns of different sizes (e.g., trigrams, 4-grams, 5-grams) in the text, providing a richer representation of the input.

Quiz 6: Derive the formula for the effective receptive field R_l of the l-th layer in a stacked 1D CNN, assuming kernel size k_l and stride s_l for each layer. The receptive field R_l of layer l can be calculated recursively from the receptive field of the previous layer, R_{l-1}. The base case is R_0 = 1 (the input token itself). For a layer with kernel size k_l and stride s_l, the receptive field grows as: R_l = R_{l-1} + (k_l - 1) × (s_1 × s_2 × … × s_{l-1}). This shows that while depth increases the receptive field linearly with respect to the kernel size, strides greater than 1 cause it to grow exponentially with depth. If all layers have stride s = 1 and kernel size k, the formula simplifies to R_l = l × (k - 1) + 1.
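The recursion from Quiz 6 is easy to check numerically. The helper below is a toy illustration (not from the text); `jump` tracks the product of strides so far:

```python
# Numeric check of the recursive receptive-field formula:
# R_l = R_{l-1} + (k_l - 1) * (s_1 * s_2 * ... * s_{l-1}),  R_0 = 1
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    r, jump = 1, 1            # jump = product of strides of earlier layers
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Stride-1 case matches the closed form R_l = l * (k - 1) + 1
print(receptive_field([(3, 1)] * 4))   # 4 * 2 + 1 = 9

# With stride 2 the receptive field grows much faster with depth
print(receptive_field([(3, 2)] * 4))   # 1 + 2 * (1 + 2 + 4 + 8) = 31
```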


References

  1. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv:1408.5882.
  2. Recommended Video: 3Blue1Brown: But what is a convolution? A fantastic visual explanation of the mathematical concept of convolution.