Natural Language Processing · 12 min read

Transformers

The architecture behind modern AI — attention is all you need

Scope: Core Concept · Difficulty: Intermediate

The Paper That Changed Everything

In June 2017, a team of eight researchers at Google published a paper with an audacious title: "Attention Is All You Need." It introduced the Transformer architecture, and within a few years, it would reshape the entire field of AI.

Before transformers, the dominant models for language were RNNs (Recurrent Neural Networks) and their LSTM (Long Short-Term Memory) variants. They read text one token at a time, in order, like reading a book one word at a time through a keyhole. They were slow to train, because each step has to wait for the previous one and so cannot be parallelized, and they struggled with long-range dependencies.

The transformer's big idea: throw away the recurrence entirely. Instead, process all words simultaneously using self-attention. Every word can directly attend to every other word. No more keyhole — you see the whole page at once.
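
To make the contrast concrete, here is a minimal sketch of the two styles of computation. It is a simplification, not the paper's model: the recurrent weights `W_h` and `W_x` are made-up random matrices, and the attention half skips the learned query/key/value projections and scores the raw embeddings against each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 5, 4
X = rng.normal(size=(n_tokens, d))        # one embedding per token

# Recurrent style: step t needs the hidden state from step t-1,
# so this loop cannot be parallelized across positions.
W_h = rng.normal(size=(d, d)) * 0.1       # hypothetical recurrent weights
W_x = rng.normal(size=(d, d)) * 0.1       # hypothetical input weights
h = np.zeros(d)
rnn_states = []
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention style: one matrix product scores every pair of tokens at once.
scores = X @ X.T / np.sqrt(d)             # (n_tokens, n_tokens)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ X                    # each row mixes all token embeddings

print(rnn_states.shape, weights.shape, attended.shape)
```

The loop has to run five times in order; the attention half is a handful of array operations with no dependency between positions, which is what lets a GPU work on all tokens at once.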

The Architecture: Encoder and Decoder

The original transformer has two halves:

Encoder — reads the input and builds a rich representation of it. "I love cats" becomes a set of context-aware vectors.
Decoder — generates the output one token at a time, attending to both the encoder's output and previously generated tokens.

For translation: the encoder reads English, the decoder writes French. But the real magic was that each half could also be used on its own, giving rise to encoder-only and decoder-only model families alongside the full encoder-decoder design.
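
The way the decoder "attends to the encoder's output" is usually called cross-attention. The sketch below shows one plausible single-head version under simplifying assumptions: the function name `cross_attention`, the random weights, and the toy shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_output, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder."""
    Q = dec_states @ Wq
    K = enc_output @ Wk
    V = enc_output @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_target, n_source)
    weights = softmax(scores)                 # each target token's focus over the source
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 4
enc_output = rng.normal(size=(3, d_model))    # e.g. "I love cats" as 3 vectors
dec_states = rng.normal(size=(2, d_model))    # 2 target tokens generated so far
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

context, weights = cross_attention(dec_states, enc_output, Wq, Wk, Wv)
print(context.shape)   # (2, 4): one context vector per target token
print(weights.shape)   # (2, 3): one weight per (target, source) pair
```

The only difference from self-attention is where the inputs come from: the queries belong to the output side, the keys and values to the input side.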

The Transformer Family Tree

The original transformer split into three lineages:

Encoder-Only: BERT (2018)

Google's BERT uses only the encoder. It reads text bidirectionally — it sees both the left and right context of every word. This makes it brilliant at understanding text: sentiment analysis, question answering, named entity recognition.

Decoder-Only: GPT (2018–today)

OpenAI's GPT uses only the decoder. It reads text left-to-right and predicts the next token. It's a generation machine — trained to predict what comes next, it learns to write essays, code, poetry, and more. GPT-4, Claude, and most modern chatbots are decoder-only transformers.
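
In attention terms, the difference between "bidirectional" and "left-to-right" comes down to a mask on the score matrix. The following is a minimal sketch, assuming equal raw scores so the effect of the mask is easy to see; `attention_weights` is an illustrative helper, not a library function.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over scores; with causal=True, position i ignores positions > i."""
    scores = scores.copy()
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
        scores[future] = -np.inf                            # exp(-inf) = 0 after softmax
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))   # 4 tokens, all pairs equally similar
print(np.round(attention_weights(scores), 2))               # encoder style: every row sees all 4 tokens
print(np.round(attention_weights(scores, causal=True), 2))  # decoder style: row i sees tokens 0..i only
```

With the causal mask, the first token can only attend to itself, the second splits its attention over the first two, and so on. That restriction is what makes next-token prediction a fair training task: the model never sees the answer it is asked to predict.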

Encoder-Decoder: T5 (2019)

Google's T5 uses the full architecture. It frames every NLP task as text-to-text: input "translate English to French: I love cats" and output "J'adore les chats." Great for translation, summarization, and any task that transforms one text into another.
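
The text-to-text idea is mostly a data-formatting convention, which a few lines can illustrate. The helper `to_text_to_text` is hypothetical and only builds the input strings; it does not call any model.

```python
def to_text_to_text(task_prefix, text):
    """Format an input as one prefixed string, in the text-to-text style."""
    return f"{task_prefix}: {text}"

examples = [
    ("translate English to French", "I love cats"),
    ("summarize", "Transformers replaced recurrence with self-attention."),
]
for prefix, text in examples:
    print(to_text_to_text(prefix, text))
```

Because every task looks the same from the outside (a string in, a string out), one model and one training objective can cover all of them.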

What Happens Inside a Transformer Layer

Each transformer layer does four things:

Multi-Head Self-Attention — every token attends to every other token (through multiple "heads" that each learn different patterns)
Add & Normalize — a residual connection (add the input back) plus layer normalization for stability
Feed-Forward Network — two linear layers with a nonlinearity, applied to each token independently (this is where a lot of the "knowledge" is stored)
Add & Normalize again

Stack 12 to 96+ of these layers, and you have a modern language model.

A Simplified Transformer Block

import numpy as np

def softmax(x, axis=-1):
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

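def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention (no masking, no output projection)."""
    Q = X @ Wq                        # queries: what each token is looking for
    K = X @ Wk                        # keys: what each token offers
    V = X @ Wv                        # values: the content that gets mixed
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_tokens, n_tokens) scaled similarities
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # weighted average of values per token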

def feed_forward(X, W1, b1, W2, b2):
    """Simple feed-forward network with ReLU."""
    hidden = np.maximum(0, X @ W1 + b1)  # ReLU
    return hidden @ W2 + b2

def layer_norm(X, eps=1e-5):
    """Layer normalization over the feature axis (learnable scale and shift omitted)."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)  # eps guards against division by zero

def transformer_block(X, params):
    """One transformer layer: attention → add+norm → FFN → add+norm."""
    # Self-attention
    attn_out = self_attention(X, params['Wq'], params['Wk'], params['Wv'])
    X = layer_norm(X + attn_out)  # Residual + norm
    
    # Feed-forward
    ff_out = feed_forward(X, params['W1'], params['b1'], params['W2'], params['b2'])
    X = layer_norm(X + ff_out)    # Residual + norm
    
    return X

# Demo: 3 tokens, embedding dimension 4
np.random.seed(42)
d_model, d_ff = 4, 8

params = {
    'Wq': np.random.randn(d_model, d_model) * 0.1,
    'Wk': np.random.randn(d_model, d_model) * 0.1,
    'Wv': np.random.randn(d_model, d_model) * 0.1,
    'W1': np.random.randn(d_model, d_ff) * 0.1,
    'b1': np.zeros(d_ff),
    'W2': np.random.randn(d_ff, d_model) * 0.1,
    'b2': np.zeros(d_model),
}

# Input: 3 token embeddings
X = np.array([
    [1.0, 0.5, 0.2, 0.1],  # Token 1
    [0.3, 0.8, 0.1, 0.9],  # Token 2
    [0.1, 0.2, 0.9, 0.4],  # Token 3
])

print("Input shape:", X.shape)
output = transformer_block(X, params)
print("Output shape:", output.shape)
print("Output (context-aware embeddings):")
print(np.round(output, 3))

Output

Input shape: (3, 4)
Output shape: (3, 4)
Output (context-aware embeddings):
[[ 1.402 -0.632 -0.354 -0.416]
 [-0.42   1.535 -0.691 -0.424]
 [-0.805 -0.413  1.606 -0.388]]
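
The block above uses a single attention head, while the layer description mentions several. Here is a minimal sketch of how multiple heads could be added, assuming the common approach of splitting the model dimension evenly across heads and mixing the results with an output matrix `Wo` (a parameter the simplified block does not have).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads            # assumes d_model is divisible by n_heads

    def split_heads(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n_tokens, n_tokens)
    weights = softmax(scores)                             # separate attention pattern per head
    heads = weights @ V                                   # (n_heads, n_tokens, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ Wo                                    # mix information across heads

rng = np.random.default_rng(0)
d_model, n_heads = 4, 2
X = rng.normal(size=(3, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (3, 4)
```

Each head works in a smaller subspace and computes its own attention pattern, so one head can track, say, nearby words while another tracks a distant subject. The total cost stays close to that of a single full-width head.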

Note: Scale is the secret sauce. The original 2017 transformer (base configuration) had about 65 million parameters. GPT-2 (2019) had 1.5 billion. GPT-3 (2020) had 175 billion. OpenAI has not disclosed GPT-4's size; unofficial estimates put it above 1 trillion. The architecture barely changed; what changed was scale: more parameters, more data, more compute. And at larger scales, models showed capabilities, such as learning a task from a few examples in the prompt, that smaller versions lacked and that few had anticipated.
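
Those parameter counts follow fairly directly from the layer structure. The sketch below is a back-of-the-envelope estimate, not an exact accounting: it counts only the attention and feed-forward weight matrices, assumes the feed-forward width is four times the model width, and ignores embeddings, biases, and layer-norm parameters. The layer counts and widths are the published configurations of the largest GPT-2 and GPT-3 models.

```python
def approx_block_params(n_layers, d_model, d_ff=None):
    """Rough weight count for the stacked layers only.

    Per layer: 4 * d_model^2 for the Q, K, V and output projections,
    plus 2 * d_model * d_ff for the two feed-forward matrices.
    """
    if d_ff is None:
        d_ff = 4 * d_model          # common convention, assumed here
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff
    return n_layers * per_layer

for name, n_layers, d_model in [("GPT-2", 48, 1600), ("GPT-3", 96, 12288)]:
    print(f"{name}: ~{approx_block_params(n_layers, d_model) / 1e9:.2f}B")
# GPT-2: ~1.47B
# GPT-3: ~173.95B
```

The estimates land close to the headline figures of 1.5 billion and 175 billion; most of the remaining gap is the token embedding table. Roughly two thirds of each layer's weights sit in the feed-forward network.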

Why Transformers Won

Transformers didn't just improve on RNNs — they replaced them. Here's why:

Parallelism: RNNs process tokens sequentially. During training, transformers process all tokens in a sequence at once, so they can keep modern GPUs fully busy and train far faster on the same hardware.
Long-range dependencies: in an RNN, information about an early token has to survive every intermediate step, so it degrades over distance. In a transformer, any token can attend to any other token directly, regardless of distance.
Scalability: transformers scale gracefully. As parameters, data, and compute grow together, performance improves along a smooth, predictable curve. This predictable scaling is why companies invest billions in training larger models.

Today, transformers and attention-based components power not just language models, but also image generation (DALL-E, Stable Diffusion), protein structure prediction (AlphaFold), music generation, code completion, and more. The architecture is remarkably general.

Quick check

What was the key innovation of the Transformer over RNNs?
