Embeddings & Vector Search
Words as Arrows in Space
Here's a mind-bending idea: what if we could turn words into arrows (vectors) that point in specific directions in space? And what if similar words pointed in similar directions?
That's exactly what embeddings are. An embedding is a list of numbers (a vector) that represents the meaning of a word, sentence, or document. The key property: similar meanings produce similar vectors.
For example:
- "happy" → [0.8, 0.3, -0.1, 0.5, ...] (hundreds of numbers)
- "joyful" → [0.79, 0.31, -0.08, 0.48, ...] (very similar numbers!)
- "sad" → [-0.7, 0.2, 0.6, -0.4, ...] (very different numbers)
The numbers don't mean anything individually. But their pattern encodes meaning. Words with similar meanings have similar patterns.
The King - Man + Woman = Queen Miracle
In 2013, researchers at Google introduced Word2Vec, a method for learning word vectors, and noticed something remarkable: you could do arithmetic with word meanings, and the result would land closest to the word you'd expect:
- king - man + woman = queen
- Paris - France + Italy = Rome
- walked - walking + swimming = swam
This blew the AI world's mind. The model had learned, purely from reading text, that "king" is to "man" as "queen" is to "woman." Nobody programmed this relationship; the model discovered it from patterns in billions of words.
It's like the model built an internal map of concepts, where directions have meaning: the direction from "man" to "woman" represents gender, the direction from "walking" to "walked" represents tense, and so on.
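To make the arithmetic concrete, here is a minimal sketch using hand-made two-dimensional vectors. The values, the `toy_vectors` table, and the `analogy` helper are all invented for illustration and are not real Word2Vec output; real word vectors have hundreds of dimensions and the match is only approximate.

```python
import math

# Hand-made toy vectors (NOT real Word2Vec output).
# Dimension 0 loosely stands for "royalty", dimension 1 for "gender".
toy_vectors = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
    "apple":  [-1.0, 0.1],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three input words are excluded from the candidates, which is the
    usual convention when evaluating analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # prints: queen
```

Subtracting "man" removes the gender component, adding "woman" puts the opposite one back, and the royalty component is left untouched. That is why the result points at "queen".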
From Words to Sentences to Documents
Early embeddings (Word2Vec, GloVe) only worked on individual words. But a single word is ambiguous: "bank" could mean a riverbank or a financial institution.
Modern embedding models can embed entire sentences or paragraphs:
- Sentence embeddings: "How do I return a product?" becomes a single vector that captures the full meaning
- Document embeddings: An entire paragraph or page becomes one vector
This is crucial for real applications. When you're building a retrieval-augmented generation (RAG) system, you need to compare the meaning of a question against the meaning of document chunks, and both are multi-word texts.
Popular embedding models:
- OpenAI text-embedding-3-small: 1,536 dimensions. Fast, cheap, good for most use cases.
- OpenAI text-embedding-3-large: 3,072 dimensions. More accurate, but several times the per-token price of the small model.
- Cohere embed-v3: Excellent multilingual support.
- all-MiniLM-L6-v2: Free, open-source, runs locally. Great for prototyping.
- BGE-large: Open-source, with strong accuracy on retrieval benchmarks.
What Do the Numbers Mean?
A typical embedding might be a list of 1,536 numbers. What do they represent?
Honestly? We don't fully know. Unlike hand-crafted features ("word length," "is it a noun?"), embedding dimensions are learned automatically. Each dimension captures some abstract aspect of meaning. Some might relate to sentiment, some to formality, some to topic area, but they're not individually interpretable.
What we do know: the relationships between vectors are meaningful. If two vectors are close together (pointing in similar directions), the texts they represent have similar meanings. That's the only thing that matters.
Cosine Similarity: Measuring "Closeness"
How do you measure whether two vectors are "close" or "far"? The most common method is cosine similarity.
Cosine similarity measures the angle between two vectors:
- 1.0: Vectors point in the exact same direction. Maximum similarity.
- 0.0: Vectors are perpendicular. No relationship.
- -1.0: Vectors point in opposite directions. Opposite meanings.
In practice, the cosine similarity between most pairs of text embeddings falls between 0 and 1. The exact ranges vary from model to model, but typical scores look roughly like this:
- 0.9+: Near-identical meaning ("happy" vs "joyful")
- 0.7-0.9: Closely related ("dog" vs "puppy")
- 0.4-0.7: Somewhat related ("dog" vs "pet")
- 0.0-0.4: Unrelated ("dog" vs "algebra")
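The calculation itself is short: divide the dot product of the two vectors by the product of their lengths. Here is a minimal sketch in plain Python, applied to the first four numbers of the illustrative "happy", "joyful", and "sad" vectors from earlier. Those numbers are made up, so treat the scores as a demonstration of the formula rather than real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# First four numbers of the illustrative vectors from earlier in this article.
happy = [0.8, 0.3, -0.1, 0.5]
joyful = [0.79, 0.31, -0.08, 0.48]
sad = [-0.7, 0.2, 0.6, -0.4]

print(round(cosine_similarity(happy, joyful), 3))  # very close to 1.0
print(round(cosine_similarity(happy, sad), 3))     # negative
```

Because the result depends only on the angle, scaling a vector up or down does not change its similarity to anything else. The negative score for "happy" vs "sad" is an artifact of these invented numbers.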
Vector Databases: The Search Engine for Meaning
Regular databases search by exact values: WHERE name = 'John'. But embedding search needs to find the closest vectors out of millions. This is a fundamentally different problem.
Vector databases are specialized databases built for this kind of search. They use approximate nearest neighbor algorithms (like HNSW or IVF) to find the closest vectors without comparing against every single one, which lets them search millions of vectors in milliseconds.
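For contrast, the exhaustive approach that these indexes avoid can be sketched in a few lines: score every stored vector against the query and keep the best few. The `brute_force_search` helper and the sample vectors below are made up for illustration. The point is that the work grows linearly with the number of stored vectors, which is fine for thousands but slow for millions.

```python
import heapq
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k=3):
    """Exact top-k search: compare the query against every stored vector."""
    scored = ((cosine(query, vec), key) for key, vec in vectors.items())
    # nlargest returns (score, key) pairs sorted from best to worst.
    return heapq.nlargest(k, scored)

# Hypothetical stored vectors; a real collection would hold embeddings.
stored = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}

top = brute_force_search([1.0, 0.2, 0.0], stored, k=2)
print([key for _, key in top])  # prints: ['doc_a', 'doc_c']
```

An approximate index trades a small chance of missing the true nearest neighbor for a large reduction in the number of comparisons.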
Popular vector databases:
- Pinecone: Fully managed cloud service. Widely used for production apps. No infrastructure to manage.
- Chroma: Open source, easy to use locally. Perfect for learning and prototyping.
- Weaviate: Open source, supports hybrid search (combining vector search with keyword search).
- pgvector: PostgreSQL extension. Great if you're already using Postgres.
- Qdrant: Open source, built in Rust for speed.
- Milvus: Open source, designed for massive scale (billions of vectors).
Creating and Searching Embeddings in Python
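The overall flow is always the same: embed your documents once, store the vectors, then embed each query and return the stored vectors closest to it. Below is a minimal sketch of that flow that runs with no dependencies. One loud assumption: `toy_embed` is a word-count stand-in invented for this example, and it only matches texts that share words. It does not capture meaning the way a real embedding model does. In a real project you would swap it for one of the models listed earlier and replace the in-memory list with a vector database.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(texts):
    """Give every distinct word in the corpus its own vector dimension."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}

def toy_embed(text, vocab):
    """Stand-in embedding: normalized word counts (NOT a real model)."""
    vec = [0.0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def search(query, index, vocab, k=2):
    """Return the k (score, document) pairs most similar to the query."""
    q = toy_embed(query, vocab)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(reverse=True)
    return scored[:k]

documents = [
    "Tire puncture repair guide",
    "How to return a product for a refund",
    "Resetting your account password",
]

# Step 1: embed every document once and keep the vectors.
vocab = build_vocab(documents)
index = [(doc, toy_embed(doc, vocab)) for doc in documents]

# Step 2: embed the query and rank the stored vectors against it.
results = search("How do I return a product?", index, vocab)
print(results[0][1])  # prints: How to return a product for a refund
```

With a real embedding model, a query like "How to fix a flat tire" would also land near the tire repair document even though it shares few words with it. This toy version cannot do that reliably, which is exactly the gap that learned embeddings fill.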
Real-World Applications of Embeddings
Embeddings power far more than just search:
- Semantic search: Find documents by meaning, not keywords. "How to fix a flat tire" matches "tire puncture repair guide" even though they share few words.
- RAG systems: The retrieval step in RAG is entirely powered by embeddings. Questions and documents are compared in vector space.
- Recommendation engines: Netflix, Spotify, and Amazon use embeddings to recommend content. If you liked Movie A and Movie B has a similar embedding... you'll probably like it.
- Duplicate detection: Find near-duplicate support tickets, bug reports, or forum posts even when worded differently.
- Clustering: Group similar documents, customer feedback, or social media posts into topics automatically.
- Anomaly detection: Find outliers in data by spotting vectors that are far from everything else.
Embeddings are one of the most versatile tools in modern AI. Understanding them unlocks a huge range of practical applications.