Transformers
The Paper That Changed Everything
In June 2017, a team of eight researchers, most of them at Google, published a paper with an audacious title: "Attention Is All You Need." It introduced the Transformer architecture, which within a few years would reshape the entire field of AI.
Before transformers, the dominant models for language were recurrent neural networks (RNNs), including the LSTM variant. They read text one token at a time, like reading a book one word at a time through a keyhole. Because each step depends on the result of the previous one, training can't be parallelized across the sequence, so they were slow to train. They also struggled with long-range dependencies.
The transformer's big idea: throw away the recurrence entirely. Instead, process all words simultaneously using self-attention. Every word can directly attend to every other word. No more keyhole: you see the whole page at once.
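To make "every word attends to every other word" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The names (`self_attention`, `w_q`, `w_k`, `w_v`), the sizes, and the random inputs are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token vectors.
    w_q, w_k, w_v: (d_model, d_k) projection matrices (hypothetical names).
    Returns the attended output and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # One score per (token, token) pair: every token is compared with every other.
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))       # stand-in for 4 embedded tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

Row `i` of `weights` says how much token `i` draws from each token in the sequence, which is the "whole page at once" view described above.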
The Architecture: Encoder and Decoder
The original transformer has two halves:
- Encoder: reads the input and builds a rich representation of it. "I love cats" becomes a set of context-aware vectors.
- Decoder: generates the output one token at a time, attending to both the encoder's output and previously generated tokens.
For translation: the encoder reads English, the decoder writes French. But the real magic was that each half could also be used on its own, which gave rise to encoder-only and decoder-only models alongside the full encoder-decoder design.
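The way the decoder "attends to the encoder's output" is usually called cross-attention: the queries come from the decoder, while the keys and values come from the encoder. The sketch below is a simplified illustration under stated assumptions. It omits the learned projections, and the function name and toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Queries from the decoder, keys and values from the encoder.

    Learned projection matrices are left out to keep the sketch short.
    """
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)  # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 8))  # e.g. 3 source tokens ("I love cats")
decoder_states = rng.normal(size=(2, 8))  # 2 target tokens generated so far

context, cross_weights = cross_attention(decoder_states, encoder_states)
print(context.shape, cross_weights.shape)  # (2, 8) (2, 3)
```

Each row of `cross_weights` shows how strongly one target position looks at each source token.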
The Transformer Family Tree
The original transformer split into three lineages:
Encoder-Only: BERT (2018)
Google's BERT uses only the encoder. It reads text bidirectionally, meaning it sees both the left and right context of every word. This makes it brilliant at understanding text: sentiment analysis, question answering, named entity recognition.
Decoder-Only: GPT (2018–today)
OpenAI's GPT uses only the decoder. It reads text left-to-right and predicts the next token. It's a generation machine: trained to predict what comes next, it learns to write essays, code, poetry, and more. GPT-4, Claude, and most modern chatbots are generally described as decoder-only transformers.
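The difference between bidirectional (BERT-style) and left-to-right (GPT-style) reading comes down to an attention mask. The following sketch, with a hypothetical `attention_weights` helper and random stand-in vectors, blocks each token from seeing later positions when `causal=True`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(x, causal):
    """Attention weights over x of shape (seq_len, d), without projections.

    causal=False: every token sees every token (encoder-style).
    causal=True: token i sees only tokens 0..i (decoder-style).
    """
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # True above the diagonal marks "future" positions to hide.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # stand-in for 4 embedded tokens

bidirectional_w = attention_weights(x, causal=False)
causal_w = attention_weights(x, causal=True)
print(np.round(causal_w, 2))  # upper triangle is all zeros
```

With the causal mask, the first token can attend only to itself, which is what lets the model be trained to predict each next token without peeking at it.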
Encoder-Decoder: T5 (2019)
Google's T5 uses the full architecture. It frames every NLP task as text-to-text: input "translate English to French: I love cats" and output "J'adore les chats." Great for translation, summarization, and any task that transforms one text into another.
What Happens Inside a Transformer Layer
Each transformer layer does four things:
- Multi-Head Self-Attention: every token attends to every other token (through multiple "heads" that each learn different patterns)
- Add & Normalize: a residual connection (add the input back) plus layer normalization for stability
- Feed-Forward Network: two linear layers with a nonlinearity, applied to each token independently (this is where a lot of the "knowledge" is thought to be stored)
- Add & Normalize again
Stack 12 to 96+ of these layers, and you have a modern language model.
A Simplified Transformer Block
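Here is a minimal NumPy sketch of one block, following the four steps listed above in the order of the original paper (normalization after each residual add). It is an illustration under stated assumptions rather than a reference implementation: the parameter names, the `init_params` helper, and the sizes are made up for the example, and dropout, masking, attention biases, and the learned scale and shift of layer normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, n_heads):
    """Multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    heads = softmax(scores, axis=-1) @ v                 # (n_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["w_o"]

def feed_forward(x, p):
    """Two linear layers with a ReLU, applied to each token independently."""
    hidden = np.maximum(0.0, x @ p["w_1"] + p["b_1"])
    return hidden @ p["w_2"] + p["b_2"]

def transformer_block(x, p, n_heads):
    x = layer_norm(x + multi_head_attention(x, p, n_heads))  # attention, then add & normalize
    x = layer_norm(x + feed_forward(x, p))                   # feed-forward, then add & normalize
    return x

def init_params(rng, d_model, d_ff, scale=0.02):
    """Random weights for the sketch (hypothetical helper, not a trained model)."""
    def w(rows, cols):
        return rng.normal(scale=scale, size=(rows, cols))
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "w_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, n_heads, seq_len = 16, 64, 4, 5
params = init_params(rng, d_model, d_ff)
x = rng.normal(size=(seq_len, d_model))  # stand-in for 5 embedded tokens

y = transformer_block(x, params, n_heads)
print(y.shape)  # (5, 16)
```

Because the output has the same shape as the input, blocks can be stacked by feeding `y` into another `transformer_block` with its own parameters.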
Why Transformers Won
Transformers didn't just improve on RNNs; they replaced them. Here's why:
- Parallelism: RNNs process tokens sequentially. Transformers process all tokens of a training sequence simultaneously. This means transformers can make far better use of modern GPUs, which makes training much faster on the same hardware.
- Long-range dependencies: in an RNN, information degrades over distance because it has to pass through every intermediate step. In a transformer, any token can attend to any other token directly, regardless of distance.
- Scalability: transformers scale gracefully. Double the parameters, double the data, and performance improves in a largely predictable way. This predictable scaling is why companies invest billions in training larger models.
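The first two points can be seen side by side in a toy comparison. In this sketch (random weights, made-up sizes, no learned attention projections), the RNN-style update has to loop over time steps because each hidden state depends on the previous one, while the attention-style update gets all pairwise interactions from a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.normal(size=(seq_len, d))  # stand-in for 6 embedded tokens

# RNN-style: step t needs the hidden state from step t-1,
# so the loop over time cannot be run in parallel.
w_h = rng.normal(scale=0.5, size=(d, d))
w_x = rng.normal(scale=0.5, size=(d, d))
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(h @ w_h + x[t] @ w_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Attention-style: all token pairs are scored in one matrix product,
# with no dependence between positions.
scores = x @ x.T / np.sqrt(d)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ x

print(rnn_states.shape, attn_out.shape)  # (6, 4) (6, 4)
```

In the attention version, `weights[0, -1]` is a direct link between the first and last token. In the RNN version, anything the last state knows about the first token has passed through every update in between.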
Today, transformers and their attention layers power not just language models, but also image generation (DALL-E, Stable Diffusion), protein structure prediction (AlphaFold), music generation, code completion, and more. The architecture is remarkably general.