How LLMs Work
The Word-Guessing Game
Imagine you're playing a game with your friends. Someone reads a sentence out loud and stops. Everyone has to guess the next word.
"The cat sat on the ___"
You'd probably guess "mat" or "couch" or "floor." You wouldn't guess "democracy" or "spaceship." Why? Because you've heard enough English to know what fits.
That's essentially how an LLM works. But instead of drawing on the sentences you've heard over a lifetime, it has read billions of pages of text. And instead of guessing with gut feeling, it calculates a probability for every possible next word.
Step 1: Training — Reading the Internet
Before an LLM can predict anything, it needs to learn. The training process goes like this:
- Gather data — Collect trillions of words from books, websites, code repositories, scientific papers, social media, and more.
- Mask and predict — Show the model a sentence with the last word hidden. Ask it to predict that word. Check if it got it right.
- Adjust the dials — If it guessed wrong, tweak its internal parameters (dials) slightly so it's more likely to get it right next time.
- Repeat billions of times — Do this for every sentence in the training data, over and over, until the model gets very good at prediction.
This process takes weeks to months on thousands of specialized chips (GPUs or TPUs) and can cost tens of millions of dollars.
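The training loop described above can be sketched at toy scale. This is a minimal illustration, not how a production LLM is built: the "model" here is just a table of scores for word pairs rather than a neural network, and the corpus, the names `dials`, `train`, and `predict`, the step count, and the learning rate are all assumptions made up for this example.

```python
import math

# Toy corpus; real training data runs to trillions of words.
corpus = "the cat sat on the mat . the dog sat on the floor .".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

# The "dials": one adjustable score for every (previous word, next word) pair.
dials = [[0.0] * len(vocab) for _ in vocab]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=200, learning_rate=0.5):
    """Repeatedly predict each hidden next word and nudge the dials."""
    for _ in range(steps):
        for prev, target in zip(corpus, corpus[1:]):
            row = dials[index[prev]]
            probs = softmax(row)  # the model's current guess
            for j in range(len(vocab)):
                wanted = 1.0 if j == index[target] else 0.0
                # Raise the score of the right word, lower the others.
                row[j] -= learning_rate * (probs[j] - wanted)

def predict(prev):
    """Return the learned next-word probabilities after `prev`."""
    probs = softmax(dials[index[prev]])
    return {word: round(p, 2) for word, p in zip(vocab, probs)}

train()
print(predict("sat"))  # "on" should dominate
print(predict("the"))  # probability is shared among cat, mat, dog, floor
```

Even at this scale, the same pattern holds: predict, compare with the real next word, adjust slightly, repeat.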
Step 2: Neural Networks — Layers of Pattern Finders
Inside an LLM is a neural network — a structure inspired by (but very different from) the human brain. Think of it as a factory with many floors:
- Ground floor — Spots simple patterns. "After 'the,' a noun usually follows." "Sentences often end with a period."
- Middle floors — Spots grammar and meaning. "This sentence is a question." "The subject is 'dog' and the action is 'runs.'"
- Top floors — Spots complex ideas. "This paragraph is sarcastic." "This code has a bug on line 3." "This legal clause contradicts the previous one."
Modern LLMs use a specific architecture called a Transformer (introduced in the famous 2017 paper "Attention Is All You Need"). The key innovation? Attention: the ability for each word to look at the other words in the text and figure out which ones matter most. (In chat-style LLMs, each word looks only at the words that came before it.)
For example, in "The dog that chased the cat was tired," the word "tired" needs to pay attention to "dog" (not "cat") to understand who's tired. The attention mechanism figures this out automatically.
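To make the "tired" example concrete, here is a small sketch of how attention weights can be computed with scaled dot-product attention. The vectors below are hand-picked for illustration (a real model learns them and uses far more dimensions), and the names `attention_weights`, `keys`, and `query_for_tired` are assumptions for this example. It shows only the weighting step; a full attention layer also uses those weights to blend information from the attended words.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Score each key against the query (scaled dot product), then normalize."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Made-up 2-dimensional vectors, chosen by hand so the example is readable.
words = ["dog", "cat", "was"]
keys = [
    [2.0, 1.0],  # "dog": subject of the main clause
    [0.5, 0.0],  # "cat": object inside the relative clause
    [0.0, 0.5],  # "was"
]
query_for_tired = [2.0, 1.0]  # what "tired" is looking for

weights = attention_weights(query_for_tired, keys)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")  # "dog" receives the largest weight
```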
Step 3: Next-Token Prediction — One Word at a Time
When you ask an LLM a question, here's what happens behind the scenes:
- Your text gets converted into tokens (word pieces — more on this in the Tokens article).
- The tokens flow through all the layers of the neural network.
- The network outputs a probability for every possible next token. Maybe "Paris" has a 45% chance, "the" has 12%, "a" has 8%, and so on.
- One token is selected (how it's selected depends on the temperature setting).
- That token is added to the text, and the whole process repeats for the next token.
This is why LLM responses appear to stream in word by word: they really are generated one token at a time.
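The loop above can be sketched in a few lines. The lookup table `TOY_MODEL` is a hypothetical stand-in for the neural network (a real LLM considers the whole context, not just the latest token), and this sketch always takes the most likely token, which is only one of several possible selection rules.

```python
# Hypothetical stand-in for a neural network: maps the latest token to
# probabilities for the next token. All numbers are invented.
TOY_MODEL = {
    "The": {"capital": 0.6, "city": 0.4},
    "capital": {"of": 0.9, "is": 0.1},
    "of": {"France": 0.7, "Spain": 0.3},
    "France": {"is": 0.8, ".": 0.2},
    "is": {"Paris": 0.45, "Madrid": 0.35, "the": 0.12, "a": 0.08},
    "Paris": {".": 1.0},
}

def next_token_probs(tokens):
    """Return a probability for each candidate next token."""
    # Unknown tokens simply end the sentence in this toy version.
    return TOY_MODEL.get(tokens[-1], {".": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    """Generate one token at a time until a period or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # run the "network"
        token = max(probs, key=probs.get)  # select one token (greedy)
        tokens.append(token)               # add it to the text and repeat
        if token == ".":
            break
    return tokens

print(" ".join(generate(["The", "capital", "of"])))
# prints: The capital of France is Paris .
```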
Step 4: Temperature — The Creativity Dial
Temperature controls how an LLM picks from its predicted probabilities:
- Temperature 0 — Always pick the highest probability word. Predictable, safe, repetitive. Great for factual questions.
- Temperature 0.7 — Usually pick high-probability words, but sometimes take a chance on less likely ones. Good balance for most tasks.
- Temperature 1.0+ — Embrace randomness. More creative, surprising, but also more likely to go off the rails. Good for brainstorming and fiction.
Think of it like a chef. At temperature 0, they always make the same reliable dish. At temperature 1, they start improvising — sometimes brilliantly, sometimes... interestingly.
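One common way to implement this dial is to divide the model's raw scores by the temperature before converting them to probabilities. The sketch below assumes that approach; the scores and the names `apply_temperature` and `pick_token` are invented for illustration, and real systems often combine temperature with other sampling settings.

```python
import math
import random

def apply_temperature(scores, temperature):
    """Divide raw scores by the temperature, then convert to probabilities."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(tokens, scores, temperature, rng=random):
    """Select one token according to the temperature setting."""
    if temperature == 0:
        # Treated as "always take the top score" (also avoids dividing by zero).
        return tokens[scores.index(max(scores))]
    probs = apply_temperature(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["Paris", "the", "a"]
scores = [2.0, 0.7, 0.3]  # made-up raw scores for three candidate tokens

for temperature in (0.2, 0.7, 1.5):
    probs = apply_temperature(scores, temperature)
    print(temperature, [round(p, 2) for p in probs])

print(pick_token(tokens, scores, temperature=0))  # always "Paris"
```

Low temperatures concentrate almost all the probability on the top token, while higher temperatures flatten the distribution so less likely tokens get picked more often.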
The Training Pipeline: From Raw Text to Chatbot
Creating a chatbot like ChatGPT or Claude involves three stages:
- Stage 1: Pre-training — Read trillions of words from the internet. Learn to predict the next word. This creates a "base model" that's good at completing text but not at following instructions. Cost: $10M–$100M+.
- Stage 2: Fine-tuning (SFT) — Show the model examples of good conversations: a human asks a question, and a helpful answer follows. The model learns to respond helpfully instead of just completing text. Cost: $100K–$1M.
- Stage 3: RLHF / RLAIF — Human raters (or AI raters) rank different model responses from best to worst. The model is trained to prefer responses that humans prefer. This is where models learn to be helpful, harmless, and honest. Cost: $100K–$1M.
Each stage builds on the last. Pre-training gives the model knowledge. Fine-tuning teaches it manners. RLHF teaches it values.
Why Does This Simple Idea Work So Well?
The surprising thing about LLMs is that next-word prediction — a seemingly trivial task — leads to genuine capabilities:
- To predict words in math textbooks, the model has to learn math.
- To predict words in code, the model has to learn programming.
- To predict words in arguments, the model has to learn logic.
- To predict words in stories, the model has to learn narrative structure.
As the saying goes: "Prediction is compression is intelligence." By learning to predict text really well, the model is forced to build an internal model of the world that generated that text.
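The "prediction is compression" half of that saying has a simple numerical reading: under an ideal coding scheme, a token the model predicted with probability p costs about -log2(p) bits to store, so better predictions mean shorter encodings. The sketch below illustrates this with invented probabilities; `bits_needed` is a name made up for this example.

```python
import math

def bits_needed(probabilities):
    """Ideal code length in bits, given the probability the model
    assigned to each token that actually appeared."""
    return sum(-math.log2(p) for p in probabilities)

# Invented probabilities that two models assigned to the same four tokens.
good_model = [0.9, 0.8, 0.95, 0.7]
poor_model = [0.1, 0.05, 0.2, 0.1]

print(f"good model: {bits_needed(good_model):.1f} bits")
print(f"poor model: {bits_needed(poor_model):.1f} bits")
```

The model that predicts the text more confidently needs far fewer bits to encode it.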