Natural Language Processing | 11 min read

Attention Mechanism

Focus on the words that matter most
Scope: Core Concept | Difficulty: Intermediate

Highlighting a Textbook

When you study a textbook, you don't treat every word equally. You highlight the important parts. Your eyes skip over "the" and "of" and zero in on key concepts. When answering a question about a paragraph, you mentally focus on the relevant sentences and ignore the rest.

That's exactly what attention does for AI. It lets a model dynamically decide which parts of the input to focus on when producing each piece of the output.

The Problem Attention Solves

Before attention, sequence-to-sequence models had a bottleneck problem. Imagine translating a long English sentence into French. The old encoder-decoder approach was:

  1. Read the entire English sentence and compress it into a single fixed-size vector (the "context")
  2. Generate the French translation from that one vector

But cramming a 50-word sentence into a single vector is like trying to summarize a whole book in one sentence: you lose information. Translation quality dropped sharply on long sentences.

Attention fixes this by letting the model look back at all input words while generating each output word, and decide how much to focus on each one.
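
A minimal sketch of the difference, assuming a toy setup in which random vectors stand in for the encoder's per-word states (the names `encoder_states` and `decoder_queries` and all sizes are illustrative, not taken from any particular model). The fixed-vector approach hands the decoder the same context at every step, while attention builds a fresh weighted mix of all input states for each output word.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 6 input words, 3 output steps, 4-dim states.
encoder_states = rng.normal(size=(6, 4))   # one vector per input word
decoder_queries = rng.normal(size=(3, 4))  # one vector per output step

# Old approach: a single fixed context (here, the last encoder state),
# reused unchanged for every output word.
fixed_context = encoder_states[-1]

# Attention: score every input word for each output step, normalize
# the scores with softmax, and mix the encoder states accordingly.
scores = decoder_queries @ encoder_states.T       # shape (3, 6)
scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
attention_contexts = weights @ encoder_states     # shape (3, 4)

print("fixed context shape:     ", fixed_context.shape)
print("attention contexts shape:", attention_contexts.shape)
```

Each row of `attention_contexts` is a different blend of the input words, so the decoder is no longer limited to one summary of the whole sentence.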

How Attention Works: Query, Key, Value

Think of attention like a library search:

  • Query (Q): your question, "What information do I need right now?"
  • Key (K): the label on each book, "What information do I contain?"
  • Value (V): the actual content of each book

The process:

  1. Compare your query against every key to get a relevance score
  2. Turn those scores into weights (using softmax, so they are non-negative and sum to 1)
  3. Take a weighted sum of the values

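In math: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Here d_k is the dimension of the key vectors. Dividing by sqrt(d_k) keeps the dot products from growing with the vector size, which would otherwise push softmax toward putting nearly all of its weight on a single key. That's the entire attention formula. Simple, but it changed everything.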
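
To see what the scaling factor does in practice, here is a small sketch. It assumes random unit-variance vectors as stand-ins for real queries and keys, and the dimension of 512 is an arbitrary choice for illustration. With such vectors the raw dot products have a standard deviation of roughly sqrt(d_k), so softmax without scaling tends to collapse onto one key.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512                       # illustrative key dimension (assumption)
q = rng.normal(size=d_k)        # one query
K = rng.normal(size=(8, d_k))   # eight keys

raw_scores = K @ q                          # spread grows with d_k
scaled_scores = raw_scores / np.sqrt(d_k)   # spread brought back to about 1

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("largest weight without scaling:", unscaled_weights.max())
print("largest weight with scaling:   ", scaled_weights.max())
```

Dividing the scores by a constant larger than 1 can only flatten the softmax distribution, so the scaled weights are always at least as evenly spread as the unscaled ones.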

Self-Attention: Words Talking to Each Other

In self-attention, a sentence attends to itself. Every word asks: "Which other words in this sentence are relevant to understanding me?"

Consider: "The cat sat on the mat because it was tired."

What does "it" refer to? The cat or the mat? A trained model can learn to resolve this by assigning a high attention weight from "it" to "cat" (because cats get tired, mats don't).

Simplified Attention Mechanism

import numpy as np


def softmax(x):
    """Convert scores to probabilities along the last axis."""
    # Subtract the per-row max before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention."""
    d_k = K.shape[-1]
    # Step 1: Compute relevance scores
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 2: Convert to weights (probabilities)
    weights = softmax(scores)
    # Step 3: Weighted sum of values
    output = weights @ V
    return output, weights


# Simulate 4 words, each with 3-dimensional embeddings
words = ["The", "cat", "is", "sleeping"]

# In real models, Q/K/V come from learned projections
# Here we use simple hand-picked embeddings
embeddings = np.array([
    [0.1, 0.0, 0.2],  # The
    [0.8, 0.9, 0.1],  # cat
    [0.2, 0.1, 0.3],  # is
    [0.7, 0.8, 0.2],  # sleeping
])

Q = K = V = embeddings  # Self-attention

output, weights = attention(Q, K, V)

print("Attention weights (who focuses on whom):")
for i, word in enumerate(words):
    focus = [(words[j], f"{weights[i][j]:.2f}") for j in range(len(words))]
    print(f"{word:>10} → {focus}")

print(f"\n'sleeping' focuses most on '{words[np.argmax(weights[3])]}' (makes sense!)")
Output
Attention weights (who focuses on whom):
       The → [('The', '0.24'), ('cat', '0.25'), ('is', '0.25'), ('sleeping', '0.25')]
       cat → [('The', '0.16'), ('cat', '0.35'), ('is', '0.18'), ('sleeping', '0.32')]
        is → [('The', '0.23'), ('cat', '0.26'), ('is', '0.24'), ('sleeping', '0.26')]
  sleeping → [('The', '0.17'), ('cat', '0.33'), ('is', '0.19'), ('sleeping', '0.31')]

'sleeping' focuses most on 'cat' (makes sense!)
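
The example above sets Q = K = V to the raw embeddings. As its comment notes, real models derive the three matrices from learned projections. A rough sketch of that step follows, with random matrices standing in for trained weights (the names `W_q`, `W_k`, `W_v` and the sizes are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 3, 2   # illustrative sizes

embeddings = np.array([
    [0.1, 0.0, 0.2],  # The
    [0.8, 0.9, 0.1],  # cat
    [0.2, 0.1, 0.3],  # is
    [0.7, 0.8, 0.2],  # sleeping
])

# Random stand-ins for weights a real model would learn during training.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = embeddings @ W_q   # what each word is looking for
K = embeddings @ W_k   # what each word offers
V = embeddings @ W_v   # the content that gets mixed

weights = softmax(Q @ K.T / np.sqrt(d_k))
output = weights @ V

print(weights.round(2))   # one row of attention weights per word
print(output.shape)       # (4, 2)
```

Because the projections are separate, a word's query can differ from its key, which lets the model learn asymmetric relationships. With random weights the numbers carry no meaning; only the shapes and the flow of data are the point here.
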
Note: Multi-Head Attention: Real transformers don't use just one attention mechanism. They run multiple "heads" in parallel, each free to learn a different type of relationship. One head might focus on syntax (subject-verb connections), another on coreference (what 'it' refers to), another on semantic similarity. The heads' outputs are then concatenated and combined. GPT-3 (175B), for example, uses 96 attention heads per layer.
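
A compact sketch of the multi-head idea, assuming random matrices in place of trained weights and toy sizes. The function name `multi_head_attention` and the parameters `W_q`, `W_k`, `W_v`, `W_o` are illustrative choices, not a specific library's API. Each head works on its own slice of the projected vectors, and an output projection mixes the heads back together.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Self-attention over X using n_heads parallel heads."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head runs scaled dot-product attention independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores)
    heads = weights @ V                                   # (n_heads, seq, d_head)

    # Concatenate the heads and mix them with an output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2   # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(output.shape)    # (4, 8)
print(weights.shape)   # (2, 4, 4): one attention map per head
```

Splitting one large projection into per-head slices keeps the total cost close to that of a single full-width head while giving each head its own attention map.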

Why Attention Changed Everything

Before attention (introduced in 2014), recurrent sequence models processed text one word at a time, left to right, maintaining a single hidden state. Long-range dependencies were hard to capture: by the time the model reached word 50, it had mostly forgotten word 1.
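
A toy illustration of that forgetting, assuming a deliberately simplified linear recurrence in which each step keeps a fixed fraction of the previous state. The `decay` value of 0.9 is an arbitrary assumption, and real recurrent networks use learned, nonlinear updates, so treat this as intuition rather than a model of any actual network.

```python
import numpy as np

decay = 0.9     # hypothetical fraction of the state kept per step
seq_len = 50

# Feed a signal of 1.0 at word 1 and zeros afterwards, then track how
# much of it is still present in the hidden state at each later word.
inputs = np.zeros(seq_len)
inputs[0] = 1.0

hidden = 0.0
trace = []
for x in inputs:
    hidden = decay * hidden + x   # h_t = decay * h_{t-1} + x_t
    trace.append(hidden)

print(f"share of word 1 left at word 2:  {trace[1]:.4f}")
print(f"share of word 1 left at word 50: {trace[-1]:.4f}")
```

In this toy setting, less than 1% of the first word's signal survives to position 50, because it has been multiplied by the decay factor 49 times.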

Attention solved this by allowing direct connections between any two positions in a sequence. Word 50 can directly attend to word 1, no matter how far apart they are. This single idea enabled:

  • Machine translation that actually works on long sentences
  • Text summarization that captures the key points
  • Question answering that finds relevant passages in long documents
  • And ultimately, the Transformer architecture that powers nearly all modern LLMs

Quick check

What problem did attention solve in sequence-to-sequence models?