Text Representation
Words Need Addresses
Computers are number machines. They add, multiply, and compare numbers with blinding speed. But they can't do math on the word "cat." So how do you teach a computer to understand language?
You give every word a number. Or better yet, a list of numbers.
Think of it like a city. Every word lives at an address. Similar words live on the same street. "King" and "queen" are neighbors. "Cat" and "dog" live in the same neighborhood. "Pizza" and "democracy" live on opposite sides of town.
This is text representation: the art of converting words into numbers that capture their meaning.
The Simplest Approach: One-Hot Encoding
Imagine your vocabulary has 4 words: [cat, dog, fish, bird]. You represent each as:
- cat = [1, 0, 0, 0]
- dog = [0, 1, 0, 0]
- fish = [0, 0, 1, 0]
- bird = [0, 0, 0, 1]
Simple, but useless for meaning. Every word is equally far from every other word. Add "democracy" to the vocabulary, and "cat" is just as different from "dog" as it is from "democracy." We need something smarter.
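To make the "equally far" point concrete, here is a minimal sketch in plain Python. The helper names one_hot and euclidean are made up for this illustration; they are not from any library.

```python
import math

vocab = ["cat", "dog", "fish", "bird"]

def one_hot(word, vocab):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vectors = {w: one_hot(w, vocab) for w in vocab}

# Distance between every distinct pair of words.
distances = {
    (a, b): euclidean(vectors[a], vectors[b])
    for a in vocab for b in vocab if a < b
}

for pair, d in distances.items():
    print(pair, round(d, 3))  # every pair prints 1.414
```

Every pair comes out at the square root of 2, so the encoding tells you which word you have but nothing about how words relate.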
Bag of Words: Count and Forget
A slightly better approach: represent a document by counting how many times each word appears. "The cat sat on the mat" becomes a vector of word counts. Simple, and surprisingly effective for basic classification tasks.
But it throws away word order. "Dog bites man" and "Man bites dog" have identical bag-of-words representations, even though they mean very different things.
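A rough sketch of the idea, using only the standard library. It assumes the simplest possible tokenizer (lowercase, then split on whitespace); real pipelines usually also handle punctuation. The function name bag_of_words is made up for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

doc = bag_of_words("The cat sat on the mat")
print(doc["the"], doc["cat"])  # 2 1

# Word order is discarded, so these two sentences look identical.
a = bag_of_words("Dog bites man")
b = bag_of_words("Man bites dog")
print(a == b)  # True
```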
TF-IDF: What Makes a Word Special?
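TF-IDF (Term Frequency - Inverse Document Frequency) improves on bag of words by weighting each word's count by how rare that word is across the whole collection of documents. Common words like "the" get low scores because they appear in almost every document. Rare, distinctive words like "mitochondria" get high scores because they appear in only a few documents.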
It's like highlighting: common words are invisible, distinctive words glow bright.
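Here is a small sketch of one common textbook variant: term frequency (count divided by document length) times the natural log of (number of documents / number of documents containing the word). Libraries often use smoothed or normalized versions, so treat the exact formula as an assumption. The three documents and the function name tf_idf are made up for this illustration.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the mitochondria is the powerhouse of the cell",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tf_idf(tokens):
    """Score each word in one document: (count / length) * log(N / df)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

scores = tf_idf(tokenized[2])
print(scores["the"])                      # 0.0
print(round(scores["mitochondria"], 3))   # 0.137
```

"The" appears in all three documents, so its inverse document frequency is log(1) = 0 and it vanishes, while "mitochondria" appears in only one and stands out.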
Word Embeddings: The Breakthrough
In 2013, a team at Google published Word2Vec, and everything changed. Instead of a sparse vector that says nothing about similarity, each word becomes a dense vector of typically 100 to 300 numbers, learned by training on massive amounts of text.
The magic: words with similar meanings end up with similar vectors. Even more remarkably, you can do arithmetic with meaning:
King - Man + Woman ≈ Queen
The vector for "king" minus "man" plus "woman" lands closer to "queen" than to any other word (once the three input words themselves are set aside). The model has learned that royalty and gender behave like roughly separate directions in the vector space.
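The arithmetic can be sketched with a handful of hand-made vectors. These numbers are invented for illustration, not taken from a trained Word2Vec model: the first coordinate loosely stands for "royalty," the second for "gender," and the third for "food." Real embeddings have hundreds of dimensions with no such tidy labels.

```python
import math

# Hypothetical toy embeddings: [royalty, gender, food]
toy = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "pizza": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed coordinate by coordinate
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Search the remaining words, leaving out the three inputs.
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

Cosine similarity is used here because it compares direction rather than length, which is a common choice for comparing embeddings.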
Modern Text Representation
Today, text representation has evolved even further:
- Word2Vec / GloVe: one fixed vector per word. "Bank" has the same vector whether it means a riverbank or a financial bank.
- ELMo: context-dependent embeddings. "Bank" gets different vectors in "river bank" vs "bank account."
- BERT / GPT: transformer-based models that give every token a vector shaped by the whole sentence around it. Those vectors can be combined into a representation of the entire sentence, capturing context and nuance far better than the earlier approaches.
The trend is clear: from simple word counts to context-aware, meaning-rich representations. Every improvement has unlocked new capabilities: better search, better translation, better chatbots, and ultimately, the large language models we use today.