
Temperature & Parameters

Every LLM has a control panel — learn which dials to turn and when

Scope: Foundational | Difficulty: Beginner

The LLM Control Panel

Imagine you're driving a car. You don't just have a steering wheel — you have a gas pedal, brake, mirrors, turn signals, and more. Each control changes how the car behaves.

LLMs work the same way. When you send a prompt to an LLM, you're not just sending text — you're also sending a set of parameters that control how the model generates its response. The most important one is called temperature.

Temperature: The Creativity Dial

Temperature is a number, usually between 0 and 2, that controls how "creative" or "random" the model's output is.

Think of it like a music DJ's mixing board:

Temperature 0 — The DJ plays only the #1 hit song every time. Predictable, reliable, maybe boring. The model always picks the single most likely next word.
Temperature 0.7 — The DJ plays popular songs but occasionally throws in a deep cut. The model usually picks high-probability words but sometimes surprises you.
Temperature 1.0 — The DJ plays from the full catalog, weighting by popularity. More variety, more surprises.
Temperature 2.0 — The DJ randomly grabs records from a dumpster. Maximum chaos. The model picks words almost at random, and the output often makes no sense.

How Temperature Works Under the Hood

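When an LLM predicts the next word, it calculates a probability for every entry in its vocabulary. Strictly speaking these entries are tokens (whole words or pieces of words), and a vocabulary typically holds tens of thousands to a few hundred thousand of them. For example, after "The capital of France is," it might say: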

"Paris" — 92%
"a" — 3%
"known" — 2%
everything else — 3% combined

Temperature reshapes these probabilities:

At temperature 0, the 92% becomes effectively 100%. "Paris" wins every time.
At temperature 1, the probabilities stay as-is. "Paris" wins most of the time, but occasionally you get "a" or "known."
At temperature 2, the probabilities flatten out. "Paris" might drop to 40%, and random words get a real shot. You might get "The capital of France is spaghetti."

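Mathematically, temperature divides the "logits" (raw scores) before they are converted to probabilities with the softmax function. Lower temperature = sharper distribution = more deterministic. Higher temperature = flatter distribution = more random. Because dividing by zero is undefined, temperature 0 is handled as a special case: the model simply picks the highest-scoring word.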
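
To make the division concrete, here is a minimal sketch in plain Python. The logits are made-up numbers for three candidate words, not values from a real model, and the function name softmax_with_temperature is only illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide raw scores by the temperature, then turn them into probabilities."""
    if temperature <= 0:
        # Temperature 0 cannot be divided by; treat it as "always pick the top score".
        raise ValueError("temperature must be greater than 0")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for three candidate next words (assumed values).
words = ["Paris", "a", "known"]
logits = [5.0, 1.6, 1.2]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}:", dict(zip(words, (round(p, 3) for p in probs))))
```

Running it should show the share for "Paris" shrinking as the temperature rises, while the two long shots gain ground.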

Top-p (Nucleus Sampling)

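Top-p is another way to control randomness. It can be combined with temperature, although OpenAI's API reference generally recommends changing one or the other rather than both.

Instead of looking at all possible next words, top-p cuts off the tail. It says: "Only consider the smallest set of most likely words whose combined probability reaches p."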

For example, with top-p = 0.9:

Sort all words by probability: Paris (92%), a (3%), known (2%), the (1%), located (0.5%), ...
Add up probabilities until you reach 90%: Just "Paris" (92%) already exceeds 90%!
Only sample from that set.

But for a more uncertain prediction where the top word is only 20%, top-p = 0.9 might include dozens of words.

Think of top-p as a smart filter. Temperature changes how you pick from all words. Top-p changes which words you even consider.
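
The filtering step can be sketched in a few lines of Python. This is a simplified illustration, not any library's actual implementation: it works on a small dictionary of word probabilities, and the "uncertain" distribution is invented for the example.

```python
def top_p_filter(probs, p):
    """Keep the most likely words until their combined probability reaches p,
    then rescale the survivors so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# The confident prediction from the text: "Paris" alone already passes 0.9.
confident = {"Paris": 0.92, "a": 0.03, "known": 0.02, "the": 0.01, "located": 0.005}
print(top_p_filter(confident, 0.9))

# A made-up uncertain prediction: many words are needed to reach 0.9.
uncertain = {"sunny": 0.20, "rainy": 0.18, "cold": 0.17, "warm": 0.15,
             "windy": 0.12, "mild": 0.10, "odd": 0.08}
print(sorted(top_p_filter(uncertain, 0.9)))
```

With the confident distribution only "Paris" survives. With the uncertain one, six of the seven words stay in play and only the least likely is cut.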

Max Tokens: The Word Limit

Max tokens sets the maximum length of the model's response. If you set max_tokens to 100, the model will stop generating after 100 tokens (roughly 75 words), even if it's mid-sentence.

Why would you limit it?

Cost control — You pay per token. Setting a limit prevents runaway costs.
Speed — Shorter responses generate faster.
Format control — If you want a one-word answer, set max_tokens to a small number.

Common values:

50-100 tokens — Short answers, classifications, yes/no
500-1000 tokens — Paragraphs, explanations
4000+ tokens — Long essays, full code files
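
If you think in words rather than tokens, the rule of thumb above (100 tokens is roughly 75 words) gives a quick way to pick a limit. The helper below is a rough sketch under that assumption. Real token counts depend on the model's tokenizer and on the language of the text, and the function name is hypothetical.

```python
import math

def tokens_for_words(word_count):
    """Estimate a max_tokens value from a target word count,
    assuming roughly 75 words per 100 tokens (a rule of thumb, not an exact ratio)."""
    return math.ceil(word_count * 100 / 75)

for target_words in (75, 300, 3000):
    print(target_words, "words is roughly", tokens_for_words(target_words), "tokens")
```

In the OpenAI Chat Completions API, a response that was cut off by the limit reports a finish_reason of "length", which is a handy signal that the budget was too small.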

Frequency & Presence Penalties

Ever noticed an LLM repeating itself? "The cat is a cat that does cat things in a catlike way." That's because once the word "cat" has high probability, it keeps getting picked.

Frequency penalty (-2 to 2 in the OpenAI API; positive values discourage repetition) lowers a word's score in proportion to how many times it has already appeared. The more a word repeats, the harder it becomes to use again.

Presence penalty (same range) lowers the score of any word that has appeared at least once, regardless of how many times. It's a flat penalty: if you've used a word, it's harder to use again.

Frequency penalty = "Don't keep saying the same word over and over"
Presence penalty = "Try to use new words you haven't used yet"

For most use cases, the defaults (0 for both) work fine. Increase them if you're getting repetitive output.
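
Here is a small sketch of how the two penalties differ, assuming they are subtracted from each word's raw score before sampling (OpenAI's documentation describes the adjustment this way). The scores and the function name apply_penalties are invented for illustration.

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the score of words that already appear in the generated text."""
    counts = Counter(generated)
    adjusted = {}
    for word, score in logits.items():
        times_used = counts[word]
        # Frequency penalty grows with every repeat; presence penalty is applied once.
        flat = presence_penalty if times_used > 0 else 0.0
        adjusted[word] = score - times_used * frequency_penalty - flat
    return adjusted

# Hypothetical scores for the next word, after "cat" has been used twice.
logits = {"cat": 4.0, "dog": 3.5, "bird": 3.0}
so_far = ["the", "cat", "is", "a", "cat"]

print(apply_penalties(logits, so_far, frequency_penalty=0.5))
print(apply_penalties(logits, so_far, presence_penalty=0.5))
```

With the frequency penalty, "cat" loses 0.5 for each of its two appearances. With the presence penalty it loses 0.5 once, no matter how often it appeared. Words that have not been used keep their original scores in both cases.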

Playing with Temperature in Python

from openai import OpenAI
client = OpenAI()

prompt = "Write a one-sentence story about a robot."

# Low temperature: predictable, safe
response_low = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,    # Always pick the most likely word
    max_tokens=50,      # Keep it short
)
print("Temp 0.0:", response_low.choices[0].message.content)

# Medium temperature: balanced
response_mid = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,    # Some creativity
    max_tokens=50,
    top_p=0.9,          # Also using top-p
)
print("Temp 0.7:", response_mid.choices[0].message.content)

# High temperature: wild and creative
response_high = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.5,    # High creativity (the API allows up to 2.0)
    max_tokens=50,
    frequency_penalty=0.5,  # Avoid repetition
)
print("Temp 1.5:", response_high.choices[0].message.content)

Output (an example; your results will vary from run to run)

Temp 0.0: A robot named Unit-7 discovered a forgotten garden behind the factory and spent every night watering the flowers.
Temp 0.7: In a city of chrome spires, a delivery robot named Pip began secretly composing symphonies from the sounds of traffic.
Temp 1.5: The last custodial android wove tapestries from recycled satellite signals, dreaming in frequencies no human ear could parse.

Pro tip: the cheat sheet

Use temperature 0 for: math, code generation, factual Q&A, classification, data extraction.
Use temperature 0.5-0.8 for: conversational chat, writing emails, explanations, summaries.
Use temperature 1.0+ for: creative writing, brainstorming, poetry, coming up with names or ideas.

When in doubt, start at 0.7. It's the sweet spot for most tasks.

Putting It All Together: A Parameter Recipe Book

Here are some reasonable starting points for combining these parameters on different tasks:

Code generation: temperature=0, top_p=1, max_tokens=2000. You want correct, deterministic code. No creativity needed.
Customer support chatbot: temperature=0.3, top_p=0.9, max_tokens=500, presence_penalty=0.1. Friendly but accurate. Slight variety so it doesn't sound robotic.
Creative story writing: temperature=1.0, top_p=0.95, max_tokens=4000, frequency_penalty=0.5. Let the model explore unusual word choices and avoid repetitive language.
Data extraction: temperature=0, top_p=1, max_tokens=200. Extract exactly what's there. No embellishment.
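
One way to keep recipes like these handy is a small lookup table in code. The sketch below copies the values from the list above. The preset names and the build_request helper are hypothetical, and the final API call is left commented out because it needs an API key.

```python
# Parameter presets taken from the recipes above (the key names are illustrative).
PRESETS = {
    "code_generation": {"temperature": 0, "top_p": 1, "max_tokens": 2000},
    "support_chatbot": {"temperature": 0.3, "top_p": 0.9, "max_tokens": 500,
                        "presence_penalty": 0.1},
    "creative_story": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 4000,
                       "frequency_penalty": 0.5},
    "data_extraction": {"temperature": 0, "top_p": 1, "max_tokens": 200},
}

def build_request(task, prompt, model="gpt-4o"):
    """Combine a preset with a prompt into keyword arguments for a chat completion call."""
    params = dict(PRESETS[task])  # copy so the preset itself is never modified
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            **params}

request = build_request("support_chatbot", "Where is my order?")
print(request)

# With the OpenAI client from the earlier example, this could be sent as:
# response = client.chat.completions.create(**request)
```

Treat the presets as defaults to adjust for your own task, not as fixed rules.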

The key insight: there is no universal best setting. Every task has its own sweet spot. The art of working with LLMs is knowing which dials to turn — and how far.

Challenge

Quick check

What does setting temperature to 0 do?
