Tokenization
Cutting the Pizza
Before a language model can read a sentence, it needs to chop it into pieces. This is tokenization, and the way you cut matters a lot.
Think of a pizza. You could cut it into:
- Whole slices (word tokenization): each word is a token. "I love pizza" becomes ["I", "love", "pizza"]. Clean and intuitive, but what about "don't"? Is it one token or two? And what about words the model has never seen, like "GPT-4o"?
- Bite-sized pieces (subword tokenization): break words into meaningful chunks. "unhappiness" becomes ["un", "happiness"]. "playing" becomes ["play", "ing"]. Now the model can handle words it has never seen before by combining familiar pieces.
- Individual toppings (character tokenization): every single letter is a token. "cat" becomes ["c", "a", "t"]. Maximum flexibility, but the model has to figure out that "c-a-t" means a furry animal. That's a lot of work.
Modern language models overwhelmingly use subword tokenization, the sweet spot between flexibility and meaning.
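To make the three cuts concrete, here is a minimal sketch in Python. The word and character versions are one-liners. The subword version uses a greedy longest-match lookup against a tiny hand-picked vocabulary (TOY_VOCAB is hypothetical, chosen only to reproduce the examples above); real tokenizers learn their vocabulary from data rather than looking pieces up in a hand-written set.

```python
def word_tokens(text):
    # Whitespace split: the simplest form of word tokenization.
    return text.split()


def char_tokens(text):
    # Every character becomes its own token.
    return list(text)


def subword_tokens(word, vocab):
    # Greedy longest-match: at each position take the longest vocabulary
    # entry that fits, falling back to a single character if none does.
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical hand-picked vocabulary, for illustration only.
TOY_VOCAB = {"un", "happiness", "play", "ing"}

print(word_tokens("I love pizza"))               # ['I', 'love', 'pizza']
print(subword_tokens("unhappiness", TOY_VOCAB))  # ['un', 'happiness']
print(subword_tokens("playing", TOY_VOCAB))      # ['play', 'ing']
print(char_tokens("cat"))                        # ['c', 'a', 't']
```

Notice that a word with no matching pieces, such as "cats", falls all the way back to single characters. That fallback is what lets subword schemes handle text they have never seen.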
BPE: How Modern Tokenizers Work
Byte-Pair Encoding (BPE) is the most popular tokenization algorithm. It was originally a data compression technique, repurposed for NLP by researchers at the University of Edinburgh in 2015.
Here's how BPE builds its vocabulary:
- Start with individual characters as your initial tokens
- Count which pairs of adjacent tokens appear most frequently in the training data
- Merge the most frequent pair into a new token
- Repeat until you've reached your desired vocabulary size
For example, if "th" appears 1000 times and "he" appears 800 times, "th" becomes a token first. Then maybe "the" gets merged. Common words like "the" become single tokens. Rare words get split into subwords.
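The loop described above fits in a few lines of Python. This is a toy sketch, not a production tokenizer: the corpus and its word counts are made up, the function names are our own, and real implementations add details such as pre-tokenization and tie-breaking rules that are left out here.

```python
from collections import Counter


def count_pairs(words):
    # `words` maps a tuple of tokens to how often that word occurs.
    pairs = Counter()
    for tokens, freq in words.items():
        for left, right in zip(tokens, tokens[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged token.
    merged = {}
    for tokens, freq in words.items():
        out = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    # Start with every word split into individual characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Made-up word counts, for illustration only.
toy_corpus = {"the": 10, "this": 5, "then": 4}
merges, words = train_bpe(toy_corpus, num_merges=2)

print(merges)  # [('t', 'h'), ('th', 'e')]
print(words)   # {('the',): 10, ('th', 'i', 's'): 5, ('the', 'n'): 4}
```

In this toy corpus "t" + "h" is the most frequent pair (19 occurrences), so it merges first. Then "th" + "e" wins the second round, and "the" becomes a single token while "this" and "then" stay split.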
GPT-4 uses a vocabulary of about 100,000 tokens. Common English words are single tokens. Rare words, technical jargon, and other languages get split into pieces.
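Once the merges are learned, tokenizing new text means replaying them in the order they were learned. The sketch below assumes a short, hypothetical merge list (MERGES is invented for this example, not taken from any real tokenizer) and shows how a common word collapses into one token while an unfamiliar one stays in pieces.

```python
def apply_merges(word, merges):
    # Start from characters, then replay the learned merges in order.
    tokens = list(word)
    for left, right in merges:
        out = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


# Hypothetical merge list, in the order training would have produced it.
MERGES = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

print(apply_merges("the", MERGES))    # ['the']
print(apply_merges("thing", MERGES))  # ['th', 'ing']
print(apply_merges("zebra", MERGES))  # ['z', 'e', 'b', 'r', 'a']
```

With only four merges, "zebra" is never merged at all. A real vocabulary of around 100,000 tokens has far more merges, which is why common words end up as single tokens and only rare ones get split.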
Why Tokenization Matters More Than You Think
Tokenization affects everything about a language model:
- Cost: APIs charge per token. Efficient tokenization = cheaper inference.
- Context length: models have a token limit (e.g., 128K tokens). The more tokens your text is split into, the less of it fits.
- Multilingual ability: English-centric tokenizers are wasteful for other languages. A Hindi sentence might need 3x more tokens than the equivalent English sentence.
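One way to see where the multilingual gap can come from: many tokenizers in practice run BPE over UTF-8 bytes rather than characters. Assuming such a byte-level tokenizer (an assumption for this sketch, not something stated above), text in Devanagari script starts at a disadvantage, because each of its characters takes three bytes before any merging happens. The helper name utf8_units is our own.

```python
def utf8_units(text):
    # Number of bytes a byte-level tokenizer would start from before merging.
    return len(text.encode("utf-8"))


english = "hello"
# The Hindi greeting "namaste", written with escapes so the example
# survives copy-paste through systems that mangle non-ASCII text.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

for label, text in [("English", english), ("Hindi", hindi)]:
    print(label, "code points:", len(text), "UTF-8 bytes:", utf8_units(text))
# English code points: 5 UTF-8 bytes: 5
# Hindi code points: 6 UTF-8 bytes: 18
```

Byte counts are not token counts. A tokenizer trained on plenty of Hindi would merge most of those bytes back together; the penalty shows up when the training data was mostly English and few of those merges were learned.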
Tokenization in Practice
Tokenization Gotchas
Tokenization creates some surprising behaviors in language models:
- Counting letters: Ask GPT "how many r's in strawberry?" and it sometimes gets it wrong. Why? Because "strawberry" reaches the model as one or a few multi-character tokens, so the model never sees the individual letters.
- Math: "12345" might be tokenized as ["123", "45"], making arithmetic harder for the model.
- Code: Indentation in Python can become many whitespace tokens, depending on the tokenizer, eating up your context window.
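The first two gotchas are easy to mimic in code. The sketch below is illustrative only: the split of "strawberry" and the token IDs in TOY_IDS are hypothetical, and chunk_digits is a stand-in for how a tokenizer might group digits, not the rule any specific model uses.

```python
def chunk_digits(number, size=3):
    # Split a digit string left to right into groups of at most `size`.
    return [number[i:i + size] for i in range(0, len(number), size)]


# Hypothetical token IDs: the model receives these integers, not the letters.
TOY_IDS = {"str": 101, "aw": 102, "berry": 103}

pieces = ["str", "aw", "berry"]
ids = [TOY_IDS[p] for p in pieces]

print(chunk_digits("12345"))       # ['123', '45']
print(ids)                         # [101, 102, 103]
print("".join(pieces).count("r"))  # 3, but only with access to the characters
```

Counting the r's is trivial once you have the characters. The model only has the list of integers, so it has to have learned the spelling of each token from its training data.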
Understanding tokenization helps you understand why language models behave the way they do. Many of their quirks aren't bugs in the model; they're consequences of how text was chopped up before the model ever saw it.