RAG (Retrieval-Augmented Generation)
The Library Card for AI
Imagine you're a student taking an exam. There are two kinds of exams:
- Closed-book exam: You can only use what's in your memory. If you didn't study it, you're stuck. If you remember it wrong, you write wrong answers confidently.
- Open-book exam: You can look things up in your textbook. You don't need to memorize everything. And when you look something up, your answer is way more likely to be correct.
A regular LLM is like a student taking a closed-book exam. It can only use what it learned during training. If the information is outdated, missing, or was learned imprecisely, it may make things up. This is called hallucination.
RAG (Retrieval-Augmented Generation) gives the LLM an open book. Before answering your question, it retrieves relevant documents from a knowledge base, then generates its answer using those documents as reference. The result: more accurate, more current, and more trustworthy answers.
The Hallucination Problem
LLMs have a fundamental weakness: they don't know things; they predict words that sound right. This means they can:
- Invent fake citations: "According to Smith et al. (2023)..." for a paper that doesn't exist
- State outdated facts: "The CEO of OpenAI is Sam Altman" might be wrong if leadership changed after training
- Confuse similar concepts: Mixing up details between similar topics
- Fabricate statistics: "Studies show that 73% of..." with a number pulled from nowhere
RAG can substantially reduce hallucinations by giving the LLM actual source material to reference. Instead of predicting what the refund policy probably says, the LLM reads the actual policy and quotes from it.
The RAG Pipeline: Step by Step
A RAG system has two phases: indexing (preparation) and querying (answering questions).
Phase 1: Indexing (Done Once)
- Collect documents: Gather all the documents you want the AI to know about: PDFs, web pages, database records, internal docs, etc.
- Chunk the documents: Split large documents into smaller pieces ("chunks") of 200-500 words each. Why? Because LLMs have limited context windows, and smaller chunks are more precise for retrieval.
- Embed the chunks: Convert each text chunk into a vector (a list of numbers) using an embedding model. Similar text produces similar vectors. "How to return a product" and "What's the refund process" would get similar vectors.
- Store in a vector database: Save all the vectors and their original text in a vector database like Pinecone, Chroma, Weaviate, or pgvector.
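The four indexing steps can be sketched in a few lines of Python. This is a minimal illustration, not a production setup: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into buckets, so it captures word overlap rather than meaning), and a plain list stands in for the vector database. The function names, the sample document, and the chunk size are all assumptions made for the example.

```python
import hashlib
import math

def chunk_words(text, size=300):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(documents, size=300):
    """Chunk and embed every document; a list of records plays the role of the vector store."""
    index = []
    for source, text in documents.items():
        for chunk in chunk_words(text, size):
            index.append({"source": source, "text": chunk, "vector": toy_embed(chunk)})
    return index

# Sample data invented for the example: 350 words split into 100-word chunks.
docs = {"refunds.txt": "Items can be returned within 30 days. " * 50}
index = build_index(docs, size=100)
print(len(index), "chunks indexed")
```

Keeping the original text next to each vector matters: the vector is only used for search, while the text is what eventually gets placed in the prompt.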
Phase 2: Querying (Every Time a User Asks)
- Embed the question: Convert the user's question into a vector using the same embedding model.
- Search for similar chunks: Find the chunks whose vectors are closest to the question's vector. This is called semantic search; it finds content by meaning, not just keywords.
- Build the prompt: Take the top 3-5 most relevant chunks and insert them into the LLM's prompt: "Based on the following documents, answer the user's question: [documents] [question]"
- Generate the answer: The LLM reads the retrieved documents and generates an answer grounded in real source material.
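The "build the prompt" step is mostly string assembly. The sketch below shows one possible template, assuming the chunks arrive already ranked from most to least relevant. The function name `build_prompt`, the numbering of chunks, and the exact wording of the instructions are illustrative choices, not a fixed standard.

```python
def build_prompt(question, ranked_chunks, max_chunks=5):
    """Insert the top-ranked chunks into a prompt ahead of the user's question."""
    selected = ranked_chunks[:max_chunks]
    # Number each chunk so the model (and the reader) can refer back to a source.
    documents = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(selected, start=1))
    return (
        "Based on the following documents, answer the user's question. "
        "If the documents do not contain the answer, say so.\n\n"
        f"Documents:\n{documents}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Sample chunks invented for the example.
chunks = [
    "Items can be returned within 30 days of delivery.",
    "Refunds are issued to the original payment method.",
]
print(build_prompt("How long do I have to return an item?", chunks))
```

Asking the model to say when the documents do not contain the answer is a common way to discourage it from falling back on guesses.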
Chunking: The Art of Cutting Documents
How you split documents into chunks matters more than you'd think:
- Too large (2000+ words): Wastes context window space. May include irrelevant information that confuses the LLM.
- Too small (50 words): Loses context. A sentence fragment about refunds doesn't make sense without the surrounding policy.
- Just right (200-500 words): Contains enough context to be self-explanatory but focused enough to be relevant.
Common chunking strategies:
- Fixed-size: Split every 300 words. Simple but may cut mid-sentence.
- Sentence-based: Split at sentence boundaries. Preserves complete thoughts.
- Paragraph-based: Split at paragraph breaks. Natural topic boundaries.
- Recursive: Try paragraph splits first, then sentence, then word. A widely used default.
- Semantic: Use a model (for example, an embedding model) to detect topic shifts and split there. The most sophisticated option.
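To make the difference between fixed-size and recursive chunking concrete, here is a small sketch of both. It is a simplified illustration written for this article, not the algorithm of any particular library: the separator patterns, the word-based size limit, and the function names are assumptions.

```python
import re

def fixed_size_chunks(text, size=300):
    """Split every `size` words, regardless of sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# (regex to split on, string used to rejoin pieces), from coarsest to finest.
SEPARATORS = (
    (r"\n\s*\n", "\n\n"),       # paragraph breaks
    (r"(?<=[.!?])\s+", " "),    # sentence boundaries
    (r"\s+", " "),              # individual words
)

def recursive_chunks(text, max_words=300, separators=SEPARATORS):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:
        return fixed_size_chunks(text, max_words)
    (pattern, joiner), rest = separators[0], separators[1:]
    pieces = [p for p in re.split(pattern, text) if p.strip()]
    if len(pieces) <= 1:
        return recursive_chunks(text, max_words, rest)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if len(candidate.split()) <= max_words:
            current = candidate   # keep merging neighbours while they fit
        else:
            if current:
                chunks.extend(recursive_chunks(current, max_words, rest))
            current = piece
    if current:
        chunks.extend(recursive_chunks(current, max_words, rest))
    return chunks

sample = (
    "one two three four five.\n\n"
    "six seven eight nine ten.\n\n"
    "A b c d e f. G h i j k l."
)
for chunk in recursive_chunks(sample, max_words=10):
    print(repr(chunk))
```

In the sample, the two short paragraphs fit together in one chunk, while the third paragraph is too long and gets split at its sentence boundary instead of mid-sentence.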
Vector Databases
Regular databases are great at finding exact matches: "Show me all users named John." But RAG needs semantic search: "Show me documents about return policies," even if the documents never use the word "return."
Vector databases store vectors (lists of numbers) and can quickly find the closest matches using similarity metrics like cosine similarity. Popular options:
- Pinecone: Fully managed, scales automatically. A common choice for production.
- Chroma: Open source, easy to set up. Great for prototyping and learning.
- Weaviate: Open source, supports hybrid search (vectors + keywords).
- pgvector: PostgreSQL extension. Great if you already use Postgres.
- Qdrant: Open source, written in Rust, with a focus on search performance.
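Under the hood, the core operation all of these databases provide is "find the stored vectors closest to this query vector." The sketch below does that by brute force with cosine similarity (the dot product of two vectors divided by the product of their lengths). The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions, and real vector databases use approximate indexes rather than scoring every record.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, records, k=3):
    """Brute-force search: score every (text, vector) record and return the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hand-written vectors for illustration only, not output from a real embedding model.
records = [
    ("How to return a product", [0.9, 0.1, 0.0]),
    ("What's the refund process", [0.8, 0.2, 0.1]),
    ("Office holiday schedule", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], records, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two refund-related entries rank first because their vectors point in nearly the same direction as the query, even though their wording differs.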
Building a Simple RAG Pipeline in Python
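One way such a pipeline might look is sketched below, using only the Python standard library so it runs anywhere. Everything here is a placeholder chosen for illustration: `toy_embed` counts words instead of calling a real embedding model (so it matches on shared words, not meaning), `echo_llm` returns the prompt instead of calling an LLM, and the `SimpleRAG` class with its in-memory list stands in for a vector database. In a real system, each of those three pieces would be swapped for an actual model or service.

```python
import math
from collections import Counter

def toy_embed(text):
    """Placeholder embedding: a sparse word-count vector. Swap in a real embedding model."""
    words = (w.strip(".,!?;:\"'()") for w in text.lower().split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def echo_llm(prompt):
    """Placeholder for an LLM call: returns the prompt so the flow can be inspected."""
    return prompt

class SimpleRAG:
    def __init__(self, embed=toy_embed, generate=echo_llm, chunk_size=300):
        self.embed = embed
        self.generate = generate
        self.chunk_size = chunk_size
        self.store = []  # list of (chunk_text, vector) pairs

    def index(self, documents):
        """Phase 1: chunk each document, embed the chunks, and store them."""
        for doc in documents:
            words = doc.split()
            for i in range(0, len(words), self.chunk_size):
                chunk = " ".join(words[i:i + self.chunk_size])
                self.store.append((chunk, self.embed(chunk)))

    def retrieve(self, question, k=3):
        """Phase 2a: embed the question and return the k most similar chunks."""
        q = self.embed(question)
        ranked = sorted(self.store, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

    def ask(self, question, k=3):
        """Phase 2b: build the prompt from retrieved chunks and generate an answer."""
        context = "\n\n".join(self.retrieve(question, k))
        prompt = (
            "Based on the following documents, answer the user's question.\n\n"
            f"Documents:\n{context}\n\nQuestion: {question}"
        )
        return self.generate(prompt)

# Sample documents invented for the example.
rag = SimpleRAG(chunk_size=50)
rag.index([
    "Refunds are issued within 30 days of purchase.",
    "The office is closed on public holidays.",
])
print(rag.ask("When are refunds issued?", k=1))
```

Passing `embed` and `generate` in as functions keeps the pipeline logic separate from any particular model provider, which makes it easy to test the retrieval flow before wiring in real services.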
RAG vs. Fine-Tuning: Which Should You Choose?
Two approaches exist for teaching an LLM about your data:
- RAG: Keep the model as-is, but give it documents to reference at query time. Like an open-book exam.
- Fine-tuning: Retrain the model on your data so it "memorizes" it. Like studying for a closed-book exam.
How to choose:
- RAG wins when: data changes often, you need citations, you have a lot of documents, or you want to avoid retraining costs.
- Fine-tuning wins when: you need the model to adopt a specific style/tone, you want faster inference (no retrieval step), or you need the model to deeply understand complex domain concepts.
- Both together: The best production systems often combine fine-tuning (for style and domain understanding) with RAG (for current, accurate information).