Embeddings & Vector Search

Words as arrows in space: how AI understands that 'king' minus 'man' plus 'woman' equals 'queen'

Words as Arrows in Space

Here's a mind-bending idea: what if we could turn words into arrows (vectors) that point in specific directions in space? And what if similar words pointed in similar directions?

That's exactly what embeddings are. An embedding is a list of numbers (a vector) that represents the meaning of a word, sentence, or document. The key property: similar meanings produce similar vectors.

For example:

  • "happy" → [0.8, 0.3, -0.1, 0.5, ...] (hundreds of numbers)
  • "joyful" → [0.79, 0.31, -0.08, 0.48, ...] (very similar numbers!)
  • "sad" → [-0.7, 0.2, 0.6, -0.4, ...] (very different numbers)

The numbers don't mean anything individually. But their pattern encodes meaning. Words with similar meanings have similar patterns.

The King - Man + Woman = Queen Miracle

In 2013, researchers at Google published Word2Vec, a method for learning word vectors from text, and noticed something remarkable: you could do arithmetic with word meanings, and the result landed close to the vector you'd expect:

  • king - man + woman ≈ queen
  • Paris - France + Italy ≈ Rome
  • walked - walking + swimming ≈ swam

This blew the AI world's mind. The model had learned, purely from reading text, that "king" is to "man" as "queen" is to "woman." Nobody programmed this relationship; the model discovered it from patterns in billions of words.

It's like the model built an internal map of concepts, where directions have meaning: the direction from "man" to "woman" represents gender, the direction from "walking" to "walked" represents tense, and so on.
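
To make the arithmetic concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The numbers, the axis labels in the comments, and the helper name `analogy` are all assumptions made up for illustration; real Word2Vec vectors have hundreds of learned dimensions that don't come with labels.

```python
import math

# Hypothetical toy vectors, hand-picked for illustration only.
# Axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ "food".
toy_vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))

result = analogy("king", "man", "woman", toy_vectors)
print(f"king - man + woman is closest to: {result}")
```

Two details are worth noticing. The result of the arithmetic is a new point in space, so the answer is the nearest known word, not an exact match. And the input words are excluded from the candidates, because otherwise "king" itself would often be the closest vector.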

From Words to Sentences to Documents

Early embeddings (Word2Vec, GloVe) only worked on individual words. But a single word is ambiguous: "bank" could mean a riverbank or a financial institution.

Modern embedding models can embed entire sentences or paragraphs:

  • Sentence embeddings: "How do I return a product?" becomes a single vector that captures the full meaning
  • Document embeddings: An entire paragraph or page becomes one vector

This is crucial for real applications. When you're building a RAG system, you need to compare the meaning of a question against the meaning of document chunks, and both are multi-word texts.

Popular embedding models:

  • OpenAI text-embedding-3-small: 1,536 dimensions. Fast, cheap, good for most use cases.
  • OpenAI text-embedding-3-large: 3,072 dimensions. More accurate, but several times the price per token.
  • Cohere embed-v3: Excellent multilingual support.
  • all-MiniLM-L6-v2: Free, open-source, runs locally. Great for prototyping.
  • BGE-large: Open-source, with strong accuracy on retrieval benchmarks.

What Do the Numbers Mean?

A typical embedding might be a list of 1,536 numbers. What do they represent?

Honestly? We don't fully know. Unlike hand-crafted features ("word length," "is it a noun?"), embedding dimensions are learned automatically. Each dimension captures some abstract aspect of meaning. Some might relate to sentiment, some to formality, some to topic area, but they're not individually interpretable.

What we do know: the relationships between vectors are meaningful. If two vectors are close together (pointing in similar directions), the texts they represent have similar meanings. That's the only thing that matters.

Cosine Similarity: Measuring "Closeness"

How do you measure whether two vectors are "close" or "far"? The most common method is cosine similarity.

Cosine similarity is the cosine of the angle between two vectors. It depends only on their direction, not their length:

  • 1.0: Vectors point in exactly the same direction. Maximum similarity.
  • 0.0: Vectors are perpendicular. No relationship.
  • -1.0: Vectors point in opposite directions. Maximally dissimilar.

In practice, cosine similarity between text embeddings usually falls between 0 and 1. As a rough guide (exact ranges vary from model to model):

  • 0.9+: Near-identical meaning ("happy" vs "joyful")
  • 0.7-0.9: Closely related ("dog" vs "puppy")
  • 0.4-0.7: Somewhat related ("dog" vs "pet")
  • 0.0-0.4: Unrelated ("dog" vs "algebra")
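
Here is a small sketch of the calculation itself, using only the standard library. The formula is the dot product of the two vectors divided by the product of their lengths. The 2-d vectors are made up to hit the three landmark values, and the 4-d "word" vectors are just the first four numbers from the toy examples earlier, so the scores are illustrative and not what a real model would produce.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|): the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The three landmark cases: same direction, perpendicular, opposite.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])           # same direction, different length
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # 90 degrees apart
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])     # 180 degrees apart

# Truncated toy vectors from the "happy" / "joyful" / "sad" example.
happy = [0.8, 0.3, -0.1, 0.5]
joyful = [0.79, 0.31, -0.08, 0.48]
sad = [-0.7, 0.2, 0.6, -0.4]

print(f"same direction:   {same:.3f}")
print(f"perpendicular:    {perpendicular:.3f}")
print(f"opposite:         {opposite:.3f}")
print(f"happy vs joyful:  {cosine_similarity(happy, joyful):.3f}")
print(f"happy vs sad:     {cosine_similarity(happy, sad):.3f}")
```

Notice that `[1.0, 2.0]` and `[2.0, 4.0]` score as identical even though one is twice as long. Cosine similarity ignores length, which is why it works well for comparing meaning.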

Vector Databases: The Search Engine for Meaning

Regular databases search by exact values: WHERE name = 'John'. But embedding search needs to find the closest vectors out of millions. This is a fundamentally different problem.

Vector databases are specialized databases built for this kind of search. They use approximate nearest-neighbor index structures (like HNSW or IVF) to find the closest vectors without comparing against every single one, which lets them search millions of vectors in milliseconds.
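
To give a feel for how an index avoids comparing against everything, here is a heavily simplified sketch of the idea behind IVF (inverted file) indexing: group vectors into buckets around centroids, then search only the bucket nearest the query. The 2-d vectors, the hand-picked centroids, and the function names are assumptions for illustration. Real implementations learn centroids with k-means and usually probe several buckets.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def nearest(query, items):
    """Index of the vector in `items` most similar to `query`."""
    return max(range(len(items)), key=lambda i: cosine_similarity(query, items[i]))

# Hypothetical 2-d embeddings, chosen by hand for illustration.
vectors = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "dog": [0.7, 0.3],
    "stocks": [0.1, 0.9],
    "bonds": [0.2, 0.8],
}
# Assumed centroids; a real index would compute these with k-means.
centroids = [[1.0, 0.0], [0.0, 1.0]]

# Build step: place each vector in the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for name, vec in vectors.items():
    buckets[nearest(vec, centroids)].append(name)

def brute_force_search(query):
    """Compare the query against every stored vector."""
    return max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))

def ivf_search(query):
    """Compare the query only against vectors in the nearest bucket."""
    bucket = buckets[nearest(query, centroids)]
    return max(bucket, key=lambda name: cosine_similarity(query, vectors[name]))

query = [0.88, 0.12]
print("buckets:", buckets)
print("brute force:", brute_force_search(query))
print("ivf-style:  ", ivf_search(query))
```

Here the bucketed search checks three vectors instead of five. The tradeoff is that the search is approximate: if the true nearest neighbor sits just across a bucket boundary, a single-bucket search misses it, which is why real indexes let you tune how many buckets to probe.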

Popular vector databases:

  • Pinecone: Fully managed cloud service. Widely used for production apps. No infrastructure to manage.
  • Chroma: Open source, easy to use locally. Perfect for learning and prototyping.
  • Weaviate: Open source, supports hybrid search (combining vector search with keyword search).
  • pgvector: PostgreSQL extension. Great if you're already using Postgres.
  • Qdrant: Open source, built in Rust for speed.
  • Milvus: Open source, designed for massive scale (billions of vectors).

Creating and Searching Embeddings in Python

from openai import OpenAI
import numpy as np

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Step 1: Create embeddings for some texts
texts = [
    "The cat sat on the mat",
    "A kitten rested on the rug",
    "The stock market crashed today",
    "Dogs are loyal companions",
]

# Get embeddings from OpenAI
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

# Extract the vectors
vectors = [item.embedding for item in response.data]

# Step 2: Calculate cosine similarity between texts
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare all pairs
print("Similarity scores:")
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        sim = cosine_similarity(vectors[i], vectors[j])
        print(f"  '{texts[i][:30]}...' vs '{texts[j][:30]}...'")
        print(f"  → Similarity: {sim:.3f}")
        print()

# Step 3: Search for the text most similar to a query
query = "Where is the kitty?"
query_vec = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query]
).data[0].embedding

print(f"Query: '{query}'")
print("Most similar texts:")
similarities = [(text, cosine_similarity(query_vec, vec))
                for text, vec in zip(texts, vectors)]
for text, sim in sorted(similarities, key=lambda x: -x[1]):
    print(f"  {sim:.3f} - {text}")

Output (abridged; exact scores will vary):

Similarity scores:
  'The cat sat on the mat...' vs 'A kitten rested on the rug...'
  → Similarity: 0.847

  'The cat sat on the mat...' vs 'The stock market crashed today...'
  → Similarity: 0.112

  'The cat sat on the mat...' vs 'Dogs are loyal companions...'
  → Similarity: 0.523

  'A kitten rested on the rug...' vs 'The stock market crashed today...'
  → Similarity: 0.089

Query: 'Where is the kitty?'
Most similar texts:
  0.831 - The cat sat on the mat
  0.814 - A kitten rested on the rug
  0.492 - Dogs are loyal companions
  0.097 - The stock market crashed today

Note: Why 1,536 numbers? The dimension count is a design choice. More dimensions can capture more nuanced meanings but use more memory and compute. OpenAI's small model uses 1,536 dimensions; the large model uses 3,072. The open-source MiniLM model uses 384, which is much smaller but still effective for many tasks. It's a tradeoff between accuracy and efficiency.
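
The memory side of that tradeoff is easy to estimate. The sketch below assumes each number is stored as a 32-bit float (4 bytes) and counts raw vector storage only, ignoring any index overhead. The function name `raw_storage_gb` is made up for this example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

def raw_storage_gb(num_vectors, dimensions):
    """Raw vector storage in gigabytes (10^9 bytes), excluding index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9

models = [
    ("all-MiniLM-L6-v2", 384),
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
]

for name, dims in models:
    size = raw_storage_gb(1_000_000, dims)
    print(f"{name}: {dims} dims, about {size:.2f} GB per million vectors")
```

Doubling the dimension count doubles the storage, and similarity calculations get proportionally more expensive too, since every comparison touches every dimension.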

Real-World Applications of Embeddings

Embeddings power far more than just search:

  • Semantic search: Find documents by meaning, not keywords. "How to fix a flat tire" matches "tire puncture repair guide" even though they share few words.
  • RAG systems: The retrieval step in RAG is entirely powered by embeddings. Questions and documents are compared in vector space.
  • Recommendation engines: Netflix, Spotify, and Amazon use embeddings to recommend content. If you liked Movie A, and Movie B has a similar embedding, you'll probably like Movie B too.
  • Duplicate detection: Find near-duplicate support tickets, bug reports, or forum posts even when worded differently.
  • Clustering: Group similar documents, customer feedback, or social media posts into topics automatically.
  • Anomaly detection: Find outliers in data by spotting vectors that are far from everything else.
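
Most of these applications boil down to the same move: compare vectors and act on the score. As one example, duplicate detection can be sketched as "flag every pair above a similarity threshold." The ticket embeddings here are hypothetical 3-d vectors and the 0.95 cutoff is an assumption; with a real model you would tune the threshold on your own data.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embeddings for three support tickets (3-d for readability).
tickets = {
    "Cannot log in to my account": [0.9, 0.1, 0.2],
    "Login fails with my password": [0.88, 0.15, 0.18],
    "Where is my refund?": [0.1, 0.9, 0.3],
}

THRESHOLD = 0.95  # assumed cutoff; tune on real data

def find_near_duplicates(embeddings, threshold):
    """Return every pair of texts whose vectors score at or above the threshold."""
    pairs = []
    for (text_a, vec_a), (text_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((text_a, text_b))
    return pairs

duplicates = find_near_duplicates(tickets, THRESHOLD)
for text_a, text_b in duplicates:
    print(f"Possible duplicate: '{text_a}' and '{text_b}'")
```

Flip the comparison around and you get a crude anomaly detector: a vector whose best similarity to everything else falls below some threshold is an outlier.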

Embeddings are one of the most versatile tools in modern AI. Understanding them unlocks a huge range of practical applications.
