Natural Language Processing10 min read

Text Representation

Turning words into numbers a computer can read
scope:Core Conceptdifficulty:Beginner

Words Need Addresses

Computers are number machines. They add, multiply, and compare numbers with blinding speed. But they can't do math on the word "cat." So how do you teach a computer to understand language?

You give every word a number. Or better yet, a list of numbers.

Think of it like a city. Every word lives at an address. Similar words live on the same street. "King" and "queen" are neighbors. "Cat" and "dog" live in the same neighborhood. "Pizza" and "democracy" live on opposite sides of town.

This is text representation β€” the art of converting words into numbers that capture their meaning.

The Simplest Approach: One-Hot Encoding

Imagine your vocabulary has 4 words: [cat, dog, fish, bird]. You represent each as:

  • cat = [1, 0, 0, 0]
  • dog = [0, 1, 0, 0]
  • fish = [0, 0, 1, 0]
  • bird = [0, 0, 0, 1]

Simple, but useless for meaning. Every word is equally far from every other word. "Cat" is just as different from "dog" as it is from "democracy." We need something smarter.

Bag of Words: Count and Forget

A slightly better approach: represent a document by counting how many times each word appears. "The cat sat on the mat" becomes a vector of word counts. Simple, and surprisingly effective for basic classification tasks.

But it throws away word order. "Dog bites man" and "Man bites dog" have identical bag-of-words representations, even though they mean very different things.

TF-IDF: What Makes a Word Special?

TF-IDF (Term Frequency - Inverse Document Frequency) improves on bag of words. Common words like "the" get low scores (they appear everywhere). Rare, distinctive words like "mitochondria" get high scores (they're unique to specific documents).

It's like highlighting: common words are invisible, distinctive words glow bright.

Word Embeddings: The Breakthrough

In 2013, a team at Google published Word2Vec, and everything changed. Instead of sparse, meaningless vectors, each word becomes a dense vector of 100-300 numbers, learned by training on massive amounts of text.

The magic: words with similar meanings end up with similar vectors. Even more remarkably, you can do arithmetic with meaning:

King - Man + Woman β‰ˆ Queen

The vector for "king" minus "man" plus "woman" lands you right next to "queen." The model has learned that royalty and gender are separate dimensions of meaning.

Text Representation: From Bag of Words to TF-IDF

from collections import Counter
import math
def bag_of_words(doc, vocab):
"""Count word occurrences."""
words = doc.lower().split()
counts = Counter(words)
return [counts.get(w, 0) for w in vocab]
def tf_idf(docs, vocab):
"""Compute TF-IDF for each document."""
n_docs = len(docs)
results = []
# Document frequency: how many docs contain each word
df = {}
for w in vocab:
df[w] = sum(1 for d in docs if w in d.lower())
for doc in docs:
words = doc.lower().split()
n_words = len(words)
counts = Counter(words)
tfidf = []
for w in vocab:
tf = counts.get(w, 0) / n_words
idf = math.log(n_docs / (1 + df[w]))
tfidf.append(round(tf * idf, 3))
results.append(tfidf)
return results
# Example documents
docs = [
"the cat sat on the mat",
"the dog chased the cat",
"the bird flew over the mat"
]
vocab = ["cat", "dog", "bird", "mat", "the"]
print("Bag of Words:")
for i, doc in enumerate(docs):
print(f" Doc {i+1}: {bag_of_words(doc, vocab)}")
print("\nTF-IDF (notice 'the' has low score):")
for i, vec in enumerate(tf_idf(docs, vocab)):
print(f" Doc {i+1}: {vec}")
Output
Bag of Words:
  Doc 1: [1, 0, 0, 1, 2]
  Doc 2: [1, 1, 0, 0, 2]
  Doc 3: [0, 0, 1, 1, 2]

TF-IDF (notice 'the' has low score):
  Doc 1: [0.034, 0.0, 0.0, 0.034, -0.234]
  Doc 2: [0.041, 0.22, 0.0, 0.0, -0.281]
  Doc 3: [0.0, 0.0, 0.183, 0.034, -0.234]
Note: King - Man + Woman = Queen: This famous example from Word2Vec showed that word embeddings capture deep semantic relationships. The model was never told about gender or royalty β€” it discovered these concepts by observing patterns in billions of words. This was a pivotal moment in NLP, proving that meaning could emerge from statistics.

Modern Text Representation

Today, text representation has evolved even further:

  • Word2Vec / GloVe β€” one fixed vector per word. "Bank" has the same vector whether it means a riverbank or a financial bank.
  • ELMo β€” context-dependent embeddings. "Bank" gets different vectors in "river bank" vs "bank account."
  • BERT / GPT β€” transformer-based models that create embeddings for entire sentences, capturing context, nuance, and meaning far better than any previous approach.

The trend is clear: from simple word counts to context-aware, meaning-rich representations. Every improvement has unlocked new capabilities β€” better search, better translation, better chatbots, and ultimately, the large language models we use today.

Quick check

What is the main problem with one-hot encoding for words?
Challenge

Continue reading