
RAG (Retrieval-Augmented Generation)

Giving AI a library card, so it can look things up instead of making things up
Scope: Intermediate | Difficulty: Intermediate

The Library Card for AI

Imagine you're a student taking an exam. There are two kinds of exams:

  • Closed-book exam: You can only use what's in your memory. If you didn't study it, you're stuck. If you remember it wrong, you write wrong answers confidently.
  • Open-book exam: You can look things up in your textbook. You don't need to memorize everything. And when you look something up, your answer is way more likely to be correct.

A regular LLM is like a student taking a closed-book exam. It can only use what it learned during training. If that information is outdated, missing, or was learned imprecisely, it may make things up. This is called hallucination.

RAG (Retrieval-Augmented Generation) gives the LLM an open book. Before answering your question, it retrieves relevant documents from a knowledge base, then generates its answer using those documents as reference. The result: more accurate, more current, and more trustworthy answers.

The Hallucination Problem

LLMs have a fundamental weakness: they don't know things; they predict words that sound right. This means they can:

  • Invent fake citations: "According to Smith et al. (2023)..." for a paper that doesn't exist
  • State outdated facts: "The CEO of OpenAI is Sam Altman" might be wrong if leadership changed after training
  • Confuse similar concepts: Mixing up details between similar topics
  • Fabricate statistics: "Studies show that 73% of..." with a number pulled from nowhere

RAG dramatically reduces hallucinations by giving the LLM actual source material to reference. Instead of predicting what the refund policy probably says, the LLM reads the actual policy and quotes from it.

The RAG Pipeline: Step by Step

A RAG system has two phases: indexing (preparation) and querying (answering questions).

Phase 1: Indexing (Done Once)

  1. Collect documents: Gather all the documents you want the AI to know about: PDFs, web pages, database records, internal docs, etc.
  2. Chunk the documents: Split large documents into smaller pieces ("chunks") of 200-500 words each. Why? Because LLMs have limited context windows, and smaller chunks are more precise for retrieval.
  3. Embed the chunks: Convert each text chunk into a vector (a list of numbers) using an embedding model. Similar text produces similar vectors. "How to return a product" and "What's the refund process" would get similar vectors.
  4. Store in a vector database: Save all the vectors and their original text in a vector database like Pinecone, Chroma, Weaviate, or pgvector.
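
To make the indexing steps concrete, here is a minimal sketch of what an index holds: one record per chunk, pairing a vector with the original text. Everything in it is an assumption for illustration. The names `tokenize`, `Record`, and `build_index` are hypothetical, and the "embedding" is a plain word-count vector standing in for a real embedding model. Word counts only capture shared words, not meaning, so this shows the data layout and not the quality of real embeddings.

```python
import re
from dataclasses import dataclass


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass
class Record:
    chunk_id: str
    vector: list[float]  # stand-in for a real embedding
    text: str            # original chunk, returned to the LLM at query time


def build_index(chunks: dict[str, str]) -> tuple[list[str], list[Record]]:
    """Return (vocabulary, records); the vocabulary fixes the vector dimensions."""
    vocabulary = sorted({word for text in chunks.values() for word in tokenize(text)})
    records = []
    for chunk_id, text in chunks.items():
        words = tokenize(text)
        # Toy embedding (assumption): one word count per vocabulary term.
        vector = [float(words.count(term)) for term in vocabulary]
        records.append(Record(chunk_id, vector, text))
    return vocabulary, records


chunks = {
    "refund": "Returns are accepted within 30 days of purchase.",
    "shipping": "Standard shipping takes 3-5 business days.",
}
vocabulary, index = build_index(chunks)
for record in index:
    print(record.chunk_id, len(record.vector), record.text)
```

A real pipeline would swap the word-count step for a call to an embedding model and hand the records to a vector database instead of keeping them in a Python list.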

Phase 2: Querying (Every Time a User Asks)

  1. Embed the question: Convert the user's question into a vector using the same embedding model.
  2. Search for similar chunks: Find the chunks whose vectors are closest to the question's vector. This is called semantic search: it finds content by meaning, not just keywords.
  3. Build the prompt: Take the top 3-5 most relevant chunks and insert them into the LLM's prompt: "Based on the following documents, answer the user's question: [documents] [question]"
  4. Generate the answer: The LLM reads the retrieved documents and generates an answer grounded in real source material.
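
The prompt-building step is mostly string assembly, and a small sketch can make it tangible. The function name `build_prompt`, the `[1]`, `[2]` numbering of chunks, and the extra instruction to admit when the documents lack an answer are all illustrative assumptions, not a prescribed format. The sketch also assumes the chunks arrive already sorted from most to least relevant.

```python
def build_prompt(question: str, ranked_chunks: list[str], top_k: int = 3) -> str:
    """Insert the top_k most relevant chunks into a prompt for the LLM."""
    selected = ranked_chunks[:top_k]
    # Numbering the chunks is an assumption; it makes it easier to ask for citations.
    documents = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(selected, start=1)
    )
    return (
        "Based on the following documents, answer the user's question. "
        "If the documents do not contain the answer, say so.\n\n"
        f"Documents:\n{documents}\n\n"
        f"Question: {question}"
    )


ranked_chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Items must be unused and in original packaging.",
    "Shipping takes 3-5 business days for standard delivery.",
]
prompt = build_prompt("Can I return an opened item?", ranked_chunks, top_k=2)
print(prompt)
```

Keeping `top_k` small is a trade-off: more chunks give the LLM more material, but they also use more of the context window and can pull in irrelevant text.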

Chunking: The Art of Cutting Documents

How you split documents into chunks matters more than you'd think:

  • Too large (2000+ words): Wastes context window space. May include irrelevant information that confuses the LLM.
  • Too small (50 words): Loses context. A sentence fragment about refunds doesn't make sense without the surrounding policy.
  • Just right (200-500 words): Contains enough context to be self-explanatory but focused enough to be relevant.

Common chunking strategies:

  • Fixed-size: Split every 300 words. Simple but may cut mid-sentence.
  • Sentence-based: Split at sentence boundaries. Preserves complete thoughts.
  • Paragraph-based: Split at paragraph breaks. Natural topic boundaries.
  • Recursive: Try paragraph splits first, then sentence, then word. A popular default.
  • Semantic: Use an AI to detect topic shifts and split there. Most sophisticated.
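
Two of these strategies are easy to sketch in plain Python. The version below is a simplified illustration under stated assumptions, not how any particular library implements them: the function names are hypothetical, sentence boundaries are guessed with a simple regular expression, and merged pieces are joined with a single space, so paragraph breaks inside a chunk are lost.

```python
import re

# Separators tried in order: blank line (paragraph), sentence end, whitespace (word).
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def word_count(text: str) -> int:
    return len(text.split())


def fixed_size_chunks(text: str, max_words: int) -> list[str]:
    """Fixed-size strategy: cut every max_words words, even mid-sentence."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def recursive_chunks(text: str, max_words: int, level: int = 0) -> list[str]:
    """Recursive strategy: split on the coarsest separator, recursing on oversized pieces."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    text = text.strip()
    if not text:
        return []
    if word_count(text) <= max_words or level >= len(SEPARATORS):
        return [text]
    pieces = [p.strip() for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if word_count(candidate) <= max_words:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
        if word_count(piece) > max_words:
            # The piece alone is too big: retry it with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_words, level + 1))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "Refunds are accepted within 30 days. Items must be unused.\n\n"
    "Shipping takes 3-5 business days. Express shipping costs extra."
)
for chunk in recursive_chunks(text, max_words=8):
    print(repr(chunk))
```

With a limit of 8 words, neither paragraph fits, so the splitter falls back to sentences. Raising the limit to 10 keeps each paragraph whole, which is the behaviour that makes the recursive approach attractive.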

Vector Databases

Regular databases are great at finding exact matches: "Show me all users named John." But RAG needs semantic search: "Show me documents about return policies", even if the documents never use the word "return."

Vector databases store vectors (lists of numbers) and can quickly find the closest matches using similarity metrics like cosine similarity. Popular options:

  • Pinecone: Fully managed, scales automatically. A common choice for production.
  • Chroma: Open source, easy to set up. Great for prototyping and learning.
  • Weaviate: Open source, supports hybrid search (vectors + keywords).
  • pgvector: PostgreSQL extension. Great if you already use Postgres.
  • Qdrant: Open source, written in Rust, with a focus on performance.
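
To see what "closest match" means, here is a small sketch of cosine similarity with a brute-force search over a handful of vectors. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `cosine_similarity` and `top_k` are hypothetical helper names. Production vector databases typically avoid scanning every vector by using approximate nearest-neighbor indexes.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force search: score every stored vector and keep the k best ids."""
    ranked = sorted(
        stored,
        key=lambda name: cosine_similarity(query, stored[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors (assumption): each id maps to a tiny hand-written "embedding".
stored = {
    "refund": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "support": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this is the embedded user question

print(top_k(query, stored, k=2))
```

The query vector points mostly along the same direction as the "refund" vector, so that id ranks first, followed by "shipping".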

Building a Simple RAG Pipeline in Python

from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()

# Step 1: Create a collection (like a database table for vectors)
collection = chroma.create_collection("company_docs")

# Step 2: Add documents (Chroma auto-embeds them!)
documents = [
    "Our refund policy allows returns within 30 days of purchase. "
    "Items must be unused and in original packaging.",
    "Shipping takes 3-5 business days for standard delivery. "
    "Express shipping (1-2 days) is available for $15 extra.",
    "Our support team is available Monday-Friday, 9am-5pm EST. "
    "Email [email protected] or call 1-800-555-0123.",
]
collection.add(
    documents=documents,
    ids=["refund", "shipping", "support"],
)

# Step 3: User asks a question, so retrieve relevant docs
question = "Can I return something I bought last week?"
results = collection.query(
    query_texts=[question],
    n_results=2,  # Get the 2 most relevant chunks
)

# Step 4: Build prompt with retrieved context
context = "\n\n".join(results["documents"][0])
prompt = f"""Based on the following company documents, answer the
customer's question. Only use information from the documents.
Documents:
{context}
Question: {question}"""

# Step 5: Generate answer using the LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # Low temperature for accuracy
)
print(response.choices[0].message.content)
Output
Yes! Since you bought it last week, you're well within our
30-day return window. Just make sure the item is unused and
in its original packaging. You can reach our support team at
[email protected] or 1-800-555-0123 (Monday-Friday,
9am-5pm EST) to start the return process.
Note: When should you use RAG? Use RAG when: your data changes frequently (news, product catalogs), you have private/proprietary data the LLM was never trained on (internal docs, customer records), accuracy matters (medical, legal, financial), or you need citations and source attribution. Don't use RAG for: general knowledge questions, creative writing, brainstorming, or tasks where the LLM's training data is sufficient.

RAG vs. Fine-Tuning: Which Should You Choose?

Two approaches exist for teaching an LLM about your data:

  • RAG: Keep the model as-is, but give it documents to reference at query time. Like an open-book exam.
  • Fine-tuning: Retrain the model on your data so it "memorizes" it. Like studying for a closed-book exam.

How to choose:

  • RAG wins when: data changes often, you need citations, you have a lot of documents, or you want to avoid retraining costs.
  • Fine-tuning wins when: you need the model to adopt a specific style/tone, you want faster inference (no retrieval step), or you need the model to deeply understand complex domain concepts.
  • Both together: The best production systems often combine fine-tuning (for style and domain understanding) with RAG (for current, accurate information).
Challenge

Quick check

What does RAG stand for, and what is the core idea?
