How LLMs Work

The cat sat on the ___. How a machine guesses 'mat' using math, layers, and billions of examples

Scope: Foundational | Difficulty: Beginner

The Word-Guessing Game

Imagine you're playing a game with your friends. Someone reads a sentence out loud and stops. Everyone has to guess the next word.

"The cat sat on the ___"

You'd probably guess "mat" or "couch" or "floor." You wouldn't guess "democracy" or "spaceship." Why? Because you've heard enough English to know what fits.

That's essentially how an LLM works. But where you draw on the sentences you've heard over a lifetime, it has read billions of pages of text. And where you guess by gut feeling, it calculates a probability for every possible next word.

Step 1: Training — Reading the Internet

Before an LLM can predict anything, it needs to learn. The training process goes like this:

Gather data — Collect trillions of words from books, websites, code repositories, scientific papers, social media, and more.
Hide and predict — Show the model a passage with the next word hidden. Ask it to predict that word, then compare its guess with the word that actually came next.
Adjust the dials — Tweak its internal parameters (dials) slightly so the correct word becomes more likely next time. The further off the guess, the bigger the adjustment.
Repeat billions of times — Do this for every sentence in the training data, over and over, until the model gets very good at prediction.

This process takes weeks to months on thousands of specialized chips (GPUs or TPUs) and can cost tens of millions of dollars.
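The predict, check, adjust loop described above can be sketched in a few lines of Python. This is a toy under heavy assumptions, not real training code: the ten "observed" words are made up, the model has only four dials (one score per candidate word), and names such as `observed_next_words`, `dials`, and `learning_rate` are chosen purely for illustration.

```python
import math

# Made-up training data: the word that followed "the cat sat on the"
# in ten imaginary sentences.
observed_next_words = ["mat"] * 5 + ["couch"] * 2 + ["floor"] * 2 + ["bed"]

vocab = ["mat", "couch", "floor", "bed"]
# The "dials": one adjustable score per candidate word, all starting at 0.
dials = {word: 0.0 for word in vocab}

def predict(dials):
    """Turn the raw scores into probabilities (softmax)."""
    exps = {w: math.exp(s) for w, s in dials.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

learning_rate = 0.02
for epoch in range(1000):                  # repeat many times
    for target in observed_next_words:     # one hidden word at a time
        probs = predict(dials)             # predict
        for w in vocab:
            # Check: how far was the predicted probability from the truth
            # (1 for the word that really came next, 0 for the others)?
            error = probs[w] - (1.0 if w == target else 0.0)
            # Adjust: nudge the dial in the direction that shrinks the error.
            dials[w] -= learning_rate * error

final_probs = predict(dials)
print({w: round(p, 2) for w, p in final_probs.items()})
```

After training, the printed probabilities should land close to how often each word appeared in the made-up data (about half for "mat"). A real LLM does the same kind of nudging, but across billions of dials and trillions of words.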

Step 2: Neural Networks — Layers of Pattern Finders

Inside an LLM is a neural network — a structure inspired by (but very different from) the human brain. Think of it as a factory with many floors:

Ground floor — Spots simple patterns. "After 'the,' a noun usually follows." "Sentences often end with a period."
Middle floors — Spots grammar and meaning. "This sentence is a question." "The subject is 'dog' and the action is 'runs.'"
Top floors — Spots complex ideas. "This paragraph is sarcastic." "This code has a bug on line 3." "This legal clause contradicts the previous one."

Modern LLMs use a specific architecture called a Transformer (introduced in the famous 2017 paper "Attention Is All You Need"). The key innovation? Attention — the ability for each word to look at every other word in the text and figure out which ones matter most.

For example, in "The dog that chased the cat was tired," the word "tired" needs to pay attention to "dog" (not "cat") to understand who's tired. The attention mechanism figures this out automatically.
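To make this concrete, here is a minimal sketch of how attention weights can be computed for the word "tired." It uses scaled dot-product scoring followed by a softmax, which is the formulation from the Transformer paper. The two-number vectors are hand-picked for illustration; in a real model they are learned during training and have hundreds or thousands of dimensions.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

words = ["The", "dog", "that", "chased", "the", "cat", "was", "tired"]

# Hypothetical "key" vectors, one per word (hand-picked, not learned).
keys = [
    [0.0, 0.0],  # The
    [3.0, 1.0],  # dog
    [0.0, 0.5],  # that
    [0.5, 0.0],  # chased
    [0.0, 0.0],  # the
    [1.0, 3.0],  # cat
    [0.0, 0.0],  # was
    [0.5, 0.5],  # tired
]
# Hypothetical "query" vector for "tired": what this word is looking for.
query_for_tired = [2.0, 0.0]

weights = attention_weights(query_for_tired, keys)
for word, weight in zip(words, weights):
    print(f"{word:>7}: {weight:.2f}")
```

With these made-up vectors, "dog" receives by far the largest weight, so "tired" draws most of its information from "dog" rather than "cat."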

Step 3: Next-Token Prediction — One Word at a Time

When you ask an LLM a question, here's what happens behind the scenes:

Your text gets converted into tokens (word pieces — more on this in the Tokens article).
The tokens flow through all the layers of the neural network.
The network outputs a probability for every possible next token. Maybe "Paris" has a 45% chance, "the" has 12%, "a" has 8%, and so on.
One token is selected (how it's selected depends on the temperature setting).
That token is added to the text, and the whole process repeats for the next token.

This is why LLM responses appear to stream in piece by piece: they really are generated one token at a time.
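The steps above can be sketched as a short loop. Everything here is a stand-in: `next_token_probs` is a hypothetical hand-written lookup playing the role of the neural network, and splitting on spaces is a crude substitute for a real tokenizer.

```python
import random

def next_token_probs(tokens):
    """Hypothetical stand-in for the network: probabilities for the next token."""
    last = tokens[-1]
    if last == "the":
        # Context matters: after "on the" a place is likely, otherwise an animal.
        if len(tokens) >= 2 and tokens[-2] == "on":
            return {"mat": 0.5, "couch": 0.2, "floor": 0.2, "bed": 0.1}
        return {"cat": 0.7, "dog": 0.3}
    table = {
        "cat": {"sat": 1.0},
        "dog": {"sat": 1.0},
        "sat": {"on": 1.0},
        "on": {"the": 1.0},
    }
    # Anything else ends the sentence.
    return table.get(last, {"<end>": 1.0})

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = prompt.lower().split()            # 1. text -> tokens
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)       # 2-3. run the model, get probabilities
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights)[0]  # 4. select one token
        if token == "<end>":
            break
        tokens.append(token)                   # 5. append and repeat
    return " ".join(tokens)

print(generate("The"))
```

The output is a six-word sentence such as "the cat sat on the mat", with the animal and the place depending on the random seed. Note that the loop never plans a whole sentence in advance: each pass only chooses the next token.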

Step 4: Temperature — The Creativity Dial

Temperature controls how an LLM picks from its predicted probabilities:

Temperature 0 — Always pick the highest probability word. Predictable, safe, repetitive. Great for factual questions.
Temperature 0.7 — Usually pick high-probability words, but sometimes take a chance on less likely ones. Good balance for most tasks.
Temperature 1.0+ — Embrace randomness. More creative, surprising, but also more likely to go off the rails. Good for brainstorming and fiction.

Think of it like a chef. At temperature 0, they always make the same reliable dish. At temperature 1, they start improvising — sometimes brilliantly, sometimes... interestingly.
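Under the hood, temperature is usually applied by dividing the model's raw scores (often called logits) by the temperature before turning them into probabilities. The sketch below shows how that reshapes the distribution. The four scores are invented for illustration, and `softmax_with_temperature` is a name chosen here, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each raw score by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

words = ["mat", "couch", "floor", "bed"]
logits = [2.0, 1.0, 1.0, 0.3]  # hypothetical raw scores from a model

for temperature in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    summary = ", ".join(f"{w}={p:.2f}" for w, p in zip(words, probs))
    print(f"T={temperature}: {summary}")
```

Low temperatures pile almost all of the probability onto "mat," while high temperatures spread it more evenly across the options. Dividing by exactly 0 is not possible, so implementations typically treat temperature 0 as "always pick the top-scoring token."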

Next-Word Prediction (Simplified Concept)

import random

# Simplified: a tiny "language model" based on word frequencies
# Real LLMs use neural networks, but the core idea is the same!

# What word typically follows these phrases?
patterns = {
    "the cat sat on the": {"mat": 0.5, "couch": 0.2, "floor": 0.2, "bed": 0.1},
    "once upon a":        {"time": 0.95, "day": 0.03, "hill": 0.02},
    "to be or not to":    {"be": 0.99, "exist": 0.01},
}

def predict_next_word(prompt, temperature=0.0):
    """Predict the next word given a prompt."""
    probs = patterns.get(prompt.lower(), {"...": 1.0})

    if temperature == 0:
        # Always pick the most likely word
        return max(probs, key=probs.get)

    # Raise each probability to the power 1/temperature:
    # below 1 sharpens the distribution, above 1 flattens it,
    # and exactly 1 leaves it unchanged.
    words = list(probs.keys())
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    # random.choices accepts unnormalized weights
    return random.choices(words, weights=weights)[0]

# Temperature 0: always picks the top prediction
print("Temp 0:", predict_next_word("the cat sat on the", temperature=0))
print("Temp 0:", predict_next_word("once upon a", temperature=0))

# Temperature 1: sometimes picks less likely words
print("Temp 1:", predict_next_word("the cat sat on the", temperature=1))

Output

Temp 0: mat
Temp 0: time
Temp 1: couch  (varies each run!)

Note: The Attention Mechanism, in plain English: Imagine reading the sentence "The bank by the river was steep." When you see "steep," your brain connects it to "bank" and uses "river" to settle that this is a riverbank, not a financial bank. Attention works in a similar way: it lets each word look at all the other words and decide which ones are most relevant. This is why the Transformer paper is titled "Attention Is All You Need": attention alone replaced the recurrent and convolutional layers that earlier sequence models relied on.

The Training Pipeline: From Raw Text to Chatbot

Creating a chatbot like ChatGPT or Claude involves three stages:

Stage 1: Pre-training — Read trillions of words from the internet. Learn to predict the next word. This creates a "base model" that's good at completing text but not at following instructions. Cost: $10M–$100M+.
Stage 2: Fine-tuning (SFT) — Show the model examples of good conversations: a human asks a question, and a helpful answer follows. The model learns to respond helpfully instead of just completing text. Cost: $100K–$1M.
Stage 3: RLHF / RLAIF — Human raters (or AI raters) rank different model responses from best to worst. The model is trained to prefer responses that humans prefer. This is where models learn to be helpful, harmless, and honest. Cost: $100K–$1M.

Each stage builds on the last. Pre-training gives the model knowledge. Fine-tuning teaches it manners. RLHF teaches it values.
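To give a feel for how a ranking becomes a training signal in Stage 3, here is a sketch of a pairwise preference loss. This is one common formulation (often used to train a separate "reward model" that scores responses), and it is an assumption here: the article does not say which objective any particular chatbot uses. The scores passed in are invented numbers.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise loss: small when the human-preferred response scores higher."""
    margin = score_preferred - score_rejected
    probability_correct_order = 1.0 / (1.0 + math.exp(-margin))  # sigmoid
    return -math.log(probability_correct_order)

# The scorer agrees with the human rater: low loss, little to fix.
print(round(preference_loss(2.0, -1.0), 3))   # 0.049

# The scorer disagrees with the human rater: high loss, big adjustment.
print(round(preference_loss(-1.0, 2.0), 3))   # 3.049
```

Just as in pre-training, the loss tells the training process how strongly to adjust the dials. The difference is that the target is now "what people preferred" rather than "what word came next."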

Why Does This Simple Idea Work So Well?

The surprising thing about LLMs is that next-word prediction — a seemingly trivial task — leads to genuine capabilities:

To predict words in math textbooks, the model has to learn math.
To predict words in code, the model has to learn programming.
To predict words in arguments, the model has to learn logic.
To predict words in stories, the model has to learn narrative structure.

As the saying goes: "Prediction is compression is intelligence." By learning to predict text really well, the model is forced to build an internal model of the world that generated that text.
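The link between prediction and compression can be made concrete with a small calculation. In information theory, an event that a model assigned probability p can ideally be encoded in -log2(p) bits, so better predictions mean shorter encodings. The probabilities below are illustrative numbers, and `bits_to_encode` is a helper name chosen for this sketch.

```python
import math

def bits_to_encode(probability):
    """Ideal code length, in bits, for a word the model gave this probability."""
    return -math.log2(probability)

# (context, actual next word, probability the model assigned to it)
examples = [
    ("once upon a", "time", 0.95),
    ("the cat sat on the", "mat", 0.5),
    ("the cat sat on the", "bed", 0.1),
]

for context, word, probability in examples:
    bits = bits_to_encode(probability)
    print(f"{context} -> {word}: p={probability}, {bits:.2f} bits")
```

A confident, correct prediction costs a fraction of a bit, while a surprising word costs several. Training an LLM to predict well is therefore the same as training it to describe its text in as few bits as possible.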

Challenge

Quick check

What is the core task an LLM is trained to do?
