Attention Mechanism
Highlighting a Textbook
When you study a textbook, you don't treat every word equally. You highlight the important parts. Your eyes skip over "the" and "of" and zero in on key concepts. When answering a question about a paragraph, you mentally focus on the relevant sentences and ignore the rest.
That's essentially what attention does for AI. It lets a model dynamically decide which parts of the input to focus on when producing each piece of the output.
The Problem Attention Solves
Before attention, language models had a bottleneck problem. Imagine translating a long English sentence into French. The old approach was:
- Read the entire English sentence and compress it into a single fixed-size vector (the "context")
- Generate the French translation from that one vector
But cramming a 50-word sentence into a single vector is like trying to summarize a whole book in one sentence: you lose information. Translation quality dropped sharply as sentences got longer.
Attention fixes this by letting the model look back at all input words while generating each output word, and decide how much to focus on each one.
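To make the contrast concrete, here is a minimal sketch in plain Python. The encoder states, the decoder states, and the dot-product scoring rule are all made up for illustration; a real translation model learns these from data.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder outputs: one small vector per input word.
encoder_states = [
    [1.0, 0.0, 0.0],  # word 1
    [0.0, 1.0, 0.0],  # word 2
    [0.0, 0.0, 1.0],  # word 3
]

def bottleneck_context(states):
    # Old approach: only the final state is handed to the decoder,
    # so every output word is generated from the same vector.
    return states[-1]

def attention_context(decoder_state, states):
    # Attention: score every input position against the current decoder
    # state (dot product is an assumed scoring rule), then blend all states.
    scores = [sum(a * b for a, b in zip(decoder_state, s)) for s in states]
    weights = softmax(scores)
    size = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(size)]

# Hypothetical decoder states for two output steps.
decoder_steps = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
for step in decoder_steps:
    fixed = bottleneck_context(encoder_states)
    attended = [round(x, 2) for x in attention_context(step, encoder_states)]
    print("fixed:", fixed, "attended:", attended)
```

The fixed context is identical at both steps, while the attended context shifts toward word 1 at the first step and toward word 2 at the second.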
How Attention Works: Query, Key, Value
Think of attention like a library search:
- Query (Q), your question: "What information do I need right now?"
- Key (K), the label on each book: "What information do I contain?"
- Value (V), the actual content of each book
The process:
- Compare your query against every key to get a relevance score
- Turn those scores into weights (using softmax, so they are all positive and sum to 1)
- Take a weighted sum of the values
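The three steps above can be sketched for a single query in plain Python. The vectors are toy numbers chosen for illustration, and the helper names (`dot`, `softmax`, `attend`) are hypothetical. The dot product is used as the relevance score, and no scaling is applied in this version.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Step 1: compare the query against every key to get a relevance score.
    scores = [dot(query, k) for k in keys]
    # Step 2: turn the scores into weights that sum to 1.
    weights = softmax(scores)
    # Step 3: take a weighted sum of the values.
    size = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(size)]
    return weights, output

# Toy "library": three books, each with a key (label) and a value (content).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

The first book's key matches the query best, so it receives the largest weight and the output leans toward its value. The other two books still contribute a little, because softmax never assigns exactly zero.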
In math: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) * V
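The sqrt(d) is a scaling factor, where d is the dimension of the key vectors. Without it, dot products grow with d and push softmax toward near one-hot weights, which makes training unstable. That's the entire attention formula. Simple, but it changed everything.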
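As a rough illustration of the formula and of what the scaling does, the sketch below implements it for matrices stored as lists of rows, using only the standard library. The `scale` flag, the random inputs, and the sizes `d = 64` and `n = 5` are assumptions added for the demonstration.

```python
import math
import random

def softmax(row):
    """Turn one row of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, scale=True):
    """Compute softmax(Q K^T / sqrt(d)) V; matrices are lists of rows."""
    d = len(K[0])
    divisor = math.sqrt(d) if scale else 1.0
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: dot product of this query with each key.
        scores = [sum(a * b for a, b in zip(q, k)) / divisor for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        all_weights.append(weights)
        outputs.append(out)
    return outputs, all_weights

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, n = 64, 5  # illustrative sizes
Q, K, V = random_matrix(n, d), random_matrix(n, d), random_matrix(n, d)

_, scaled = attention(Q, K, V, scale=True)
_, unscaled = attention(Q, K, V, scale=False)
print("largest weight per query, scaled:  ", [round(max(w), 3) for w in scaled])
print("largest weight per query, unscaled:", [round(max(w), 3) for w in unscaled])
```

Dividing the scores by sqrt(d) flattens the softmax, so the largest weight in each row is never higher with scaling than without it. With d = 64 the unscaled weights tend to collapse onto a single key.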
Self-Attention: Words Talking to Each Other
In self-attention, a sentence attends to itself. Every word asks: "Which other words in this sentence are relevant to understanding me?"
Consider: "The cat sat on the mat because it was tired."
What does "it" refer to? The cat or the mat? A trained self-attention layer helps the model figure this out by assigning a high attention weight from "it" to "cat", having learned from data that cats get tired and mats don't.
Simplified Attention Mechanism
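A minimal self-attention sketch in plain Python follows, under some loud assumptions. The word vectors and the projection matrices `Wq`, `Wk`, `Wv` are random stand-ins for learned parameters, and `d_model = 8` is arbitrary. Because nothing is trained, the printed weights are meaningless as linguistics; the point is only the mechanics of every word attending to every word in the same sentence.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # Queries, keys, and values are all projections of the same input X.
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(K[0])
    K_T = [list(col) for col in zip(*K)]  # transpose of K
    scores = matmul(Q, K_T)               # one row of scores per word
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

tokens = "The cat sat on the mat because it was tired".split()

random.seed(0)
d_model = 8                               # arbitrary embedding size
X = random_matrix(len(tokens), d_model)   # random stand-in for word embeddings
Wq = random_matrix(d_model, d_model)      # untrained projections
Wk = random_matrix(d_model, d_model)
Wv = random_matrix(d_model, d_model)

output, weights = self_attention(X, Wq, Wk, Wv)

# How much "it" attends to each word in the sentence (random, untrained).
it_row = weights[tokens.index("it")]
for token, w in zip(tokens, it_row):
    print(f"{token:>8}  {w:.3f}")
```

Each row of `weights` is one word's distribution over all ten words, and each row of `output` is that word's new representation. In a trained model, the row for "it" would be expected to put much of its weight on "cat".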
Why Attention Changed Everything
Before attention was introduced in 2014, sequence models processed text one word at a time, left to right, carrying everything in a single hidden state. Long-range dependencies were very hard to capture: by the time the model reached word 50, it had mostly forgotten word 1.
Attention solved this by allowing direct connections between any two positions in a sequence. Word 50 can directly attend to word 1, no matter how far apart they are. This single idea enabled:
- Machine translation that actually works on long sentences
- Text summarization that captures the key points
- Question answering that finds relevant passages in long documents
- And ultimately, the Transformer architecture that powers nearly all modern LLMs