Temperature & Parameters
The LLM Control Panel
Imagine you're driving a car. You don't just have a steering wheel: you also have a gas pedal, brake, mirrors, turn signals, and more. Each control changes how the car behaves.
LLMs work the same way. When you send a prompt to an LLM, you're not just sending text; you're also sending a set of parameters that control how the model generates its response. The most important one is called temperature.
Temperature: The Creativity Dial
Temperature is a number, usually between 0 and 2, that controls how "creative" or "random" the model's output is.
Think of it like a music DJ's mixing board:
- Temperature 0: The DJ plays only the #1 hit song every time. Predictable, reliable, maybe boring. The model always picks the single most likely next word.
- Temperature 0.7: The DJ plays popular songs but occasionally throws in a deep cut. The model usually picks high-probability words but sometimes surprises you.
- Temperature 1.0: The DJ plays from the full catalog, weighting by popularity. More variety, more surprises.
- Temperature 2.0: The DJ randomly grabs records from a dumpster. Maximum chaos. The model picks words almost at random, and the output often makes no sense.
How Temperature Works Under the Hood
When an LLM predicts the next word, it calculates a probability for every entry in its vocabulary (often 50,000-100,000 entries or more; strictly speaking these are tokens, which can be whole words or word fragments). For example, after "The capital of France is," it might say:
- "Paris": 92%
- "a": 3%
- "known": 2%
- everything else: 3% combined
Temperature reshapes these probabilities:
- At temperature 0, the 92% becomes effectively 100%. "Paris" wins every time.
- At temperature 1, the probabilities stay as-is. "Paris" wins most of the time, but occasionally you get "a" or "known."
- At temperature 2, the probabilities flatten out. "Paris" might drop to 40%, and random words get a real shot. You might get "The capital of France is spaghetti."
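Mathematically, temperature divides the "logits" (raw scores) before a softmax function converts them to probabilities. Lower temperature = sharper distribution = more deterministic. Higher temperature = flatter distribution = more random. Because dividing by zero is undefined, temperature 0 is typically handled as a special case: the model simply picks the highest-scoring word.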
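To make the "divide the logits" step concrete, here is a minimal sketch using only the Python standard library. The function name and the three logit values are made up for illustration; they are not taken from any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide raw scores by temperature, then convert them to probabilities."""
    if temperature <= 0:
        # Temperature 0 cannot be divided by; callers should pick the top score instead.
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    # Subtracting the largest value keeps exp() numerically stable.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate words (illustrative values only).
logits = [4.0, 1.0, 0.5]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Running it should show the top word's share shrinking as the temperature rises, which is the "flattening" described above.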
Top-p (Nucleus Sampling)
Top-p is another way to control randomness, and it's often used alongside temperature.
Instead of looking at all possible next words, top-p cuts off the tail. It says: "Only consider the smallest set of most likely words whose combined probability reaches p."
For example, with top-p = 0.9:
- Sort all words by probability: Paris (92%), a (3%), known (2%), the (1%), located (0.5%), ...
- Add up probabilities until you reach 90%: Just "Paris" (92%) already exceeds 90%!
- Only sample from that set.
But for a more uncertain prediction where the top word is only 20%, top-p = 0.9 might include dozens of words.
Think of top-p as a smart filter. Temperature changes how you pick from all words. Top-p changes which words you even consider.
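The filtering step can be sketched in a few lines of Python. This is a simplified illustration, not any particular library's implementation: the function name is hypothetical, the first distribution reuses the "Paris" numbers from above with the leftover probability lumped into a placeholder entry, and the second distribution is invented to show an uncertain prediction.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose cumulative
    probability reaches p, then renormalize so the kept words sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Confident prediction: "Paris" alone already covers more than 90%.
confident = {"Paris": 0.92, "a": 0.03, "known": 0.02, "the": 0.01,
             "located": 0.005, "<other>": 0.015}
print(top_p_filter(confident, 0.9))

# Uncertain prediction (invented numbers): several words are needed to reach 90%.
uncertain = {"red": 0.25, "blue": 0.21, "green": 0.18, "gold": 0.16,
             "gray": 0.14, "pink": 0.06}
print(top_p_filter(uncertain, 0.9))
```

The same p value keeps one word in the first case and five in the second, which is why top-p adapts to how sure the model is.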
Max Tokens: The Word Limit
Max tokens sets the maximum length of the model's response. If you set max_tokens to 100, the model will stop generating after 100 tokens (roughly 75 words), even if it's mid-sentence.
Why would you limit it?
- Cost control: You pay per token. Setting a limit prevents runaway costs.
- Speed: Shorter responses generate faster.
- Format control: If you want a one-word answer, set max_tokens to a small number.
Common values:
- 50-100 tokens: Short answers, classifications, yes/no
- 500-1000 tokens: Paragraphs, explanations
- 4000+ tokens: Long essays, full code files
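The "stops even if it's mid-sentence" behavior is easy to picture with a toy loop. This is only a sketch: the function name is hypothetical, and whole words stand in for tokens, whereas real tokenizers often split words into smaller pieces.

```python
def generate_with_limit(token_stream, max_tokens):
    """Collect tokens until the stream ends or max_tokens is reached.

    Returns (tokens, truncated), where truncated is True if the limit
    cut the response off before the stream finished on its own.
    """
    output = []
    for token in token_stream:
        if len(output) >= max_tokens:
            return output, True
        output.append(token)
    return output, False

# Whole words stand in for tokens here (an assumption made for readability).
reply = "The capital of France is Paris , a city on the Seine .".split()

tokens, truncated = generate_with_limit(reply, max_tokens=5)
print(" ".join(tokens), "| truncated:", truncated)
```

With a limit of 5, the reply is cut off before it ever reaches "Paris", so a limit that is too tight can remove the answer itself.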
Frequency & Presence Penalties
Ever noticed an LLM repeating itself? "The cat is a cat that does cat things in a catlike way." That's because once the word "cat" has high probability, it keeps getting picked.
Frequency penalty (typically 0 to 2) reduces the probability of words in proportion to how many times they've already appeared. The more a word repeats, the harder it becomes to use again.
Presence penalty (typically 0 to 2) reduces the probability of any word that has appeared at least once, regardless of how many times. It's a flat penalty: if you've used a word, it's harder to use again.
- Frequency penalty = "Don't keep saying the same word over and over"
- Presence penalty = "Try to use new words you haven't used yet"
For most use cases, the defaults (0 for both) work fine. Increase them if you're getting repetitive output.
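Here is a rough sketch of how the two penalties differ, assuming the common formulation in which each penalty is subtracted from a word's logit before sampling. Providers may implement this differently, and the function name, scores, and history below are invented for illustration.

```python
from collections import Counter

def apply_repetition_penalties(logits, generated_words,
                               frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of words that already appear in generated_words."""
    counts = Counter(generated_words)
    adjusted = {}
    for word, score in logits.items():
        count = counts.get(word, 0)
        seen = 1.0 if count > 0 else 0.0
        adjusted[word] = (
            score
            - count * frequency_penalty   # grows with every repeat
            - seen * presence_penalty     # flat, applied once the word has appeared
        )
    return adjusted

logits = {"cat": 3.0, "dog": 2.5, "bird": 2.0}   # hypothetical raw scores
history = ["cat", "cat", "cat", "dog"]           # words generated so far

print(apply_repetition_penalties(logits, history, frequency_penalty=0.5))
print(apply_repetition_penalties(logits, history, presence_penalty=0.5))
```

With the frequency penalty, "cat" (used three times) loses three times as much as "dog" (used once). With the presence penalty, both lose the same amount, and the unused "bird" is untouched either way.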
Playing with Temperature in Python
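One way to experiment without an API key is to simulate the sampling step locally. The sketch below does not call a real model: the scores for the words after "The capital of France is" are hypothetical, and temperature 0 is treated as "always pick the top score," which is an assumption about how greedy decoding is usually handled.

```python
import math
import random
from collections import Counter

def sample_next_word(logits, temperature, rng):
    """Sample one word from a dict of word -> raw score."""
    words = list(logits)
    if temperature == 0:
        # Greedy choice: no randomness at all.
        return max(words, key=lambda w: logits[w])
    scaled = [logits[w] / temperature for w in words]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical scores (illustrative only, not from a real model).
logits = {"Paris": 5.0, "a": 1.6, "known": 1.2, "spaghetti": -1.0}

rng = random.Random(0)  # fixed seed so reruns give the same tallies
for temperature in (0, 0.7, 1.0, 2.0):
    picks = Counter(sample_next_word(logits, temperature, rng) for _ in range(1000))
    print(f"temperature={temperature}: {picks.most_common()}")
```

Try changing the scores or the list of temperatures and watch how quickly the unlikely words start showing up in the tallies.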
Putting It All Together: A Parameter Recipe Book
Here's how experienced AI engineers combine these parameters for different tasks:
- Code generation: temperature=0, top_p=1, max_tokens=2000. You want correct, deterministic code. No creativity needed.
- Customer support chatbot: temperature=0.3, top_p=0.9, max_tokens=500, presence_penalty=0.1. Friendly but accurate. Slight variety so it doesn't sound robotic.
- Creative story writing: temperature=1.0, top_p=0.95, max_tokens=4000, frequency_penalty=0.5. Let the model explore unusual word choices and avoid repetitive language.
- Data extraction: temperature=0, top_p=1, max_tokens=200. Extract exactly what's there. No embellishment.
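If you reuse these recipes often, one option is to keep them as named presets. The sketch below simply stores the values from the list above in a dictionary; the preset names and the helper function are hypothetical, and the parameter names your provider accepts may differ.

```python
# Presets copied from the recipes above; the keys are illustrative names.
PRESETS = {
    "code_generation": {"temperature": 0, "top_p": 1, "max_tokens": 2000},
    "customer_support": {"temperature": 0.3, "top_p": 0.9, "max_tokens": 500,
                         "presence_penalty": 0.1},
    "creative_writing": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 4000,
                         "frequency_penalty": 0.5},
    "data_extraction": {"temperature": 0, "top_p": 1, "max_tokens": 200},
}

def request_params(task, **overrides):
    """Return a copy of the preset for a task, with any overrides applied."""
    params = dict(PRESETS[task])  # copy so the shared preset is never mutated
    params.update(overrides)
    return params

print(request_params("customer_support"))
print(request_params("creative_writing", max_tokens=1000))
```

Treat the presets as starting points and override individual values per request as you learn what works for your task.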
The key insight: there is no universal best setting. Every task has its own sweet spot. The art of working with LLMs is knowing which dials to turn, and how far.