Natural Language Processing · 9 min read

Tokenization

Breaking sentences into bite-sized pieces

Scope: Core Concept · Difficulty: Beginner

Cutting the Pizza

Before a language model can read a sentence, it needs to chop it into pieces. This is tokenization — and the way you cut matters a lot.

Think of a pizza. You could cut it into:

Whole slices (word tokenization) — each word is a token. "I love pizza" becomes ["I", "love", "pizza"]. Clean and intuitive, but what about "don't"? Is it one token or two? And what about words the model has never seen, like "GPT-4o"?
Bite-sized pieces (subword tokenization) — break words into meaningful chunks. "unhappiness" becomes ["un", "happiness"]. "playing" becomes ["play", "ing"]. Now the model can handle words it has never seen before by combining familiar pieces.
Individual toppings (character tokenization) — every single letter is a token. "cat" becomes ["c", "a", "t"]. Maximum flexibility, but the model has to figure out that "c-a-t" means a furry animal. That's a lot of work.

Modern AI overwhelmingly uses subword tokenization — the sweet spot between flexibility and meaning.
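
To make the three cuts concrete, here is a minimal sketch. The vocabulary `TOY_VOCAB` and the greedy longest-match rule in `subword_tokens` are assumptions made for illustration; real subword tokenizers learn their vocabulary from data and use their own splitting rules.

```python
import re

# Hypothetical toy vocabulary; real tokenizers learn theirs from training data.
TOY_VOCAB = {"un", "happiness", "play", "ing"}

def word_tokens(text):
    """Whole slices: runs of word characters, plus single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokens(word, vocab=TOY_VOCAB):
    """Bite-sized pieces: greedy longest match, falling back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def char_tokens(text):
    """Individual toppings: one token per character, spaces dropped."""
    return list(text.replace(" ", ""))

print(word_tokens("I love pizza"))    # ['I', 'love', 'pizza']
print(word_tokens("don't"))           # ['don', "'", 't']
print(subword_tokens("unhappiness"))  # ['un', 'happiness']
print(subword_tokens("playing"))      # ['play', 'ing']
print(char_tokens("cat"))             # ['c', 'a', 't']
```

Note how the word-level splitter turns "don't" into three pieces, which is exactly the kind of ambiguity mentioned above.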

BPE: How Modern Tokenizers Work

Byte-Pair Encoding (BPE) is one of the most widely used tokenization algorithms. It began as a data compression technique (Philip Gage, 1994) and was adapted for NLP by researchers at the University of Edinburgh in 2015.

Here's how BPE builds its vocabulary:

1. Start with individual characters as your initial tokens
2. Count which pairs of adjacent tokens appear most frequently in the training data
3. Merge the most frequent pair into a new token
4. Repeat until you've reached your desired vocabulary size

For example, if "th" appears 1000 times and "he" appears 800 times, "th" becomes a token first. Then maybe "the" gets merged. Common words like "the" become single tokens. Rare words get split into subwords.
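
Once the merges are learned, a new word is tokenized by replaying them in the order they were learned. The sketch below assumes a tiny, hypothetical merge list (`merges`); the function name `apply_merges` is illustrative and not taken from any library.

```python
def apply_merges(word, merges):
    """Replay learned merges, in order, on the characters of one word."""
    tokens = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Fuse the pair wherever it occurs; copy everything else unchanged.
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge list, in the order it was learned: "th" first, then "the".
merges = [("t", "h"), ("th", "e")]

print(apply_merges("the", merges))   # ['the']
print(apply_merges("then", merges))  # ['the', 'n']
print(apply_merges("thaw", merges))  # ['th', 'a', 'w']
```

The common word collapses into one token, while the less familiar words fall back to smaller pieces.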

GPT-4 uses a vocabulary of about 100,000 tokens. Common English words are single tokens. Rare words, technical jargon, and other languages get split into pieces.

Why Tokenization Matters More Than You Think

Tokenization affects everything about a language model:

Cost: APIs charge per token, so a tokenizer that needs fewer tokens for the same text makes inference cheaper.
Context length: models have a token limit (e.g., 128K tokens). The more tokens your text needs, the less of it fits.
Multilingual ability: English-centric tokenizers are wasteful for other languages. A Hindi sentence might need 3x more tokens than the equivalent English sentence.
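
A quick back-of-the-envelope sketch shows how token counts feed into cost and context limits. The price and the English token count are made-up numbers for illustration; only the 128K limit and the 3x multiplier come from the text above.

```python
def estimate_cost(num_tokens, price_per_million):
    """Dollar cost of num_tokens at a given price per million tokens."""
    return num_tokens / 1_000_000 * price_per_million

def fits_in_context(num_tokens, context_limit=128_000):
    """True if the text fits within the model's token limit."""
    return num_tokens <= context_limit

PRICE = 2.00                       # hypothetical: dollars per million tokens
english_tokens = 50_000            # hypothetical document size in tokens
hindi_tokens = english_tokens * 3  # the "3x more tokens" case from above

for label, n in [("English", english_tokens), ("Hindi", hindi_tokens)]:
    print(f"{label}: {n} tokens, ${estimate_cost(n, PRICE):.2f}, fits: {fits_in_context(n)}")
# English: 50000 tokens, $0.10, fits: True
# Hindi: 150000 tokens, $0.30, fits: False
```

Under these assumptions, the same document costs three times as much and no longer fits in a single request.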

Tokenization in Practice

# Simple word and subword tokenization
import re

def word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_bpe_demo(text, num_merges=5):
    """Demonstrate BPE-style merging on a single string.

    Toy version: spaces become "_" and merges may cross word boundaries.
    """
    # Start with characters
    tokens = list(text.replace(" ", "_"))
    print(f"Initial: {tokens}")

    for step in range(num_merges):
        # Count adjacent pairs
        pairs = {}
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i+1])
            pairs[pair] = pairs.get(pair, 0) + 1

        if not pairs:
            break

        # Merge the most frequent pair (ties go to the pair seen first)
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]

        # Rebuild the token list, fusing every occurrence of the best pair
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens)-1 and (tokens[i], tokens[i+1]) == best:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
        print(f"Merge {step+1}: '{best[0]}'+'{best[1]}' → '{merged}'  →  {tokens}")

    return tokens

text = "the cat and the car"
print("Word tokenization:")
print(f"  {word_tokenize(text)}\n")

print("BPE-style merging:")
simple_bpe_demo(text, num_merges=4)

Output

Word tokenization:
  ['the', 'cat', 'and', 'the', 'car']

BPE-style merging:
Initial: ['t', 'h', 'e', '_', 'c', 'a', 't', '_', 'a', 'n', 'd', '_', 't', 'h', 'e', '_', 'c', 'a', 'r']
Merge 1: 't'+'h' → 'th'  →  ['th', 'e', '_', 'c', 'a', 't', '_', 'a', 'n', 'd', '_', 'th', 'e', '_', 'c', 'a', 'r']
Merge 2: 'th'+'e' → 'the'  →  ['the', '_', 'c', 'a', 't', '_', 'a', 'n', 'd', '_', 'the', '_', 'c', 'a', 'r']
Merge 3: 'the'+'_' → 'the_'  →  ['the_', 'c', 'a', 't', '_', 'a', 'n', 'd', '_', 'the_', 'c', 'a', 'r']
Merge 4: 'the_'+'c' → 'the_c'  →  ['the_c', 'a', 't', '_', 'a', 'n', 'd', '_', 'the_c', 'a', 'r']

Note: The token tax on non-English languages. Most tokenizers are trained primarily on English text, so common English words are usually encoded as single tokens. Languages like Chinese, Japanese, Arabic, and Hindi often get broken into many more tokens for the same meaning. As a result, non-English users pay more per API call and hit context limits sooner, which is a real equity issue in AI.

Tokenization Gotchas

Tokenization creates some surprising behaviors in language models:

Counting letters: Ask GPT "how many r's in strawberry?" and it sometimes gets it wrong. Why? Because "strawberry" reaches the model as one or a few multi-letter tokens, not as individual letters, so the model never sees the letters directly.
Math: "12345" might be tokenized as ["123", "45"], making arithmetic harder for the model.
Code: Indentation in Python can take up many whitespace tokens, eating into your context window.
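
The sketch below imitates what the model actually receives in each case. The split of "strawberry", the token IDs, and the three-digit grouping rule are assumptions chosen for illustration; the exact behavior depends on the tokenizer.

```python
import re

# Hypothetical split and IDs, for illustration only.
pieces = ["str", "aw", "berry"]
toy_ids = {"str": 101, "aw": 102, "berry": 103}

# The model receives IDs, not letters.
model_input = [toy_ids[p] for p in pieces]
# Counting r's requires knowing how each piece is spelled.
r_count = sum(p.count("r") for p in pieces)

def chunk_digits(number_text):
    """Split digits left to right into groups of up to three (assumed rule)."""
    return re.findall(r"\d{1,3}", number_text)

# Leading spaces on a nested line of Python; each may cost tokens.
line = "        return x"
leading_spaces = len(line) - len(line.lstrip(" "))

print(model_input)            # [101, 102, 103]
print(r_count)                # 3
print(chunk_digits("12345"))  # ['123', '45']
print(leading_spaces)         # 8
```

Nothing in `[101, 102, 103]` says "three r's"; the model has to have learned the spelling behind each ID.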

Understanding tokenization helps you understand why language models behave the way they do. Many of their quirks aren't bugs in the model — they're consequences of how text was chopped up before the model ever saw it.

Quick check

Why is subword tokenization preferred over word-level tokenization?
