What are LLMs?10 min read

Tokens & Context Windows

Words get chopped into puzzle pieces, and the AI can only hold so many at once
scope:Foundationaldifficulty:Beginner

The Puzzle Pieces of Language

When you type a message to an LLM, the AI doesn't read your words the way you do. It doesn't see "Hello, how are you?" as five words. Instead, it chops your text into tokens β€” small pieces that might be words, parts of words, or even single characters.

Think of it like a jigsaw puzzle. The sentence "I love programming" might become three puzzle pieces: [I] [love] [programming]. Simple enough.

But what about "unforgettable"? That might become: [un] [forget] [table] β€” three pieces! The tokenizer breaks it into familiar chunks it recognizes.

And a really unusual word like "defenestration" (throwing someone out a window β€” yes, there's a word for that)? It might become: [def] [en] [est] [ration] β€” four pieces.

Why Not Just Use Whole Words?

Good question! There are a few reasons tokenizers chop words into subpieces:

  • Vocabulary size β€” There are millions of possible words (including names, technical terms, and words from every language). You can't have a slot for every one. But with about 50,000-100,000 token pieces, you can represent any word by combining pieces.
  • Handling new words β€” The model has never seen "ChatGPT" during early training. But it knows the pieces [Chat] [G] [PT], so it can still work with it.
  • Multilingual support β€” The same tokenizer can handle English, Chinese, Arabic, and code by having pieces for each.

The Token Economy

Here's a handy rule of thumb for English text:

  • 1 token β‰ˆ 4 characters (including spaces)
  • 1 token β‰ˆ 0.75 words
  • 100 tokens β‰ˆ 75 words

This means:

  • A tweet (280 characters) β‰ˆ 70 tokens
  • A page of text (500 words) β‰ˆ 670 tokens
  • A short novel (50,000 words) β‰ˆ 67,000 tokens
  • The entire Harry Potter series (1,084,000 words) β‰ˆ 1,450,000 tokens

Why does this matter? Because you pay per token when using LLM APIs. And the model can only handle a limited number of tokens at once β€” that's the context window.

Tokenization Surprises

Some things that trip people up:

  • Spaces count β€” " hello" (with a leading space) and "hello" are different tokens!
  • Numbers are weird β€” "123" might be one token, but "12345" might be two tokens: [123] [45]. This is partly why LLMs can struggle with math.
  • Non-English gets more tokens β€” Chinese, Japanese, and Korean characters often take 2-3 tokens each, because the tokenizer was trained mostly on English text. This means non-English users effectively get a smaller context window.
  • Emojis are expensive β€” A single emoji can cost 2-4 tokens!

Tokenization in Action

# Using OpenAI's tiktoken library to see tokens
import tiktoken
# GPT-4's tokenizer
enc = tiktoken.encoding_for_model("gpt-4")
# Tokenize some text
text = "Hello, world! I love programming."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)} tokens")
print()
# See what each token looks like
for t in tokens:
print(f" Token {t} -> '{enc.decode([t])}'")
print()
# Compare: short vs long words
for word in ["cat", "programming", "Pneumonoultramicroscopicsilicovolcanoconiosis"]:
toks = enc.encode(word)
print(f"'{word}' -> {len(toks)} token(s)")
Output
Text: Hello, world! I love programming.
Tokens: [9906, 11, 1917, 0, 358, 3021, 15840, 13]
Count: 8 tokens

  Token 9906 -> 'Hello'
  Token 11 -> ','
  Token 1917 -> ' world'
  Token 0 -> '!'
  Token 358 -> ' I'
  Token 3021 -> ' love'
  Token 15840 -> ' programming'
  Token 13 -> '.'

'cat' -> 1 token(s)
'programming' -> 1 token(s)
'Pneumonoultramicroscopicsilicovolcanoconiosis' -> 11 token(s)
Note: Why LLMs are bad at counting letters: When you ask "How many r's in 'strawberry'?" the LLM doesn't see individual letters. It sees tokens like [str] [aw] [berry]. The letters are hidden inside the tokens! This is why LLMs infamously struggle with letter-counting, spelling tasks, and character-level operations. They literally can't see the individual characters.

Context Windows: The AI's Short-Term Memory

Imagine you're reading a very long book, but you can only remember the last 100 pages at any time. As you read page 101, you forget page 1. By page 200, everything before page 100 is gone.

That's a context window. It's the maximum amount of text an LLM can "see" at one time β€” including both what you've written and what it has generated.

Context Windows Through the Years

  • GPT-3 (2020) β€” 4,096 tokens (~3,000 words, ~6 pages)
  • GPT-3.5 (2022) β€” 4,096 tokens, later upgraded to 16,384
  • GPT-4 (2023) β€” 8,192 or 32,768 tokens (~25,000 words, ~50 pages)
  • GPT-4 Turbo (2023) β€” 128,000 tokens (~96,000 words, ~1 novel)
  • Claude 3 (2024) β€” 200,000 tokens (~150,000 words, ~1.5 novels)
  • Gemini 1.5 Pro (2024) β€” 1,000,000 tokens (~750,000 words, ~10 novels!)

Context windows have grown 250x in just four years. This is one of the fastest-improving aspects of LLM technology.

What Happens When the Window Fills Up?

When your conversation exceeds the context window, the earliest content gets dropped. The model literally can't see it anymore. This is called truncation.

Some systems try to work around this:

  • Summarization β€” Summarize old messages before dropping them.
  • RAG (Retrieval-Augmented Generation) β€” Store old content in a database and fetch relevant parts when needed.
  • Sliding window β€” Keep the system prompt and the most recent messages, drop the middle.

But there's no free lunch. Every workaround involves some information loss.

Why Context Windows Matter for You

Understanding context windows changes how you use LLMs:

  • Long documents β€” Need to analyze a 100-page contract? Make sure your model's context window is big enough, or break it into chunks.
  • Conversation length β€” In a long chat, the AI might "forget" things you said earlier. It's not being rude β€” it literally can't see those messages anymore.
  • Cost β€” Bigger context = more tokens processed = higher cost. A 100K-token conversation costs more than a 1K-token one.
  • Speed β€” Processing more tokens takes longer. The first response after pasting a long document will be slower.

The best strategy: be concise. Put the most important information upfront. Don't waste tokens on fluff. Every token counts β€” literally.

Challenge

Quick check

Why does the word 'unforgettable' become multiple tokens?

Continue reading