Snippet: RAG (Retrieval-Augmented Generation)
Domain Context
Building systems that retrieve relevant context to ground LLM generation. Retrieval quality sets the ceiling for generation quality, so invest heavily in retrieval.
Chunking Strategy
- Chunk size matters enormously — experiment with 256, 512, 1024 tokens systematically
- Prefer semantic chunking (by paragraph/section) over fixed-size windowing
- Always include overlap between chunks (10-20% of chunk size)
- Preserve metadata with each chunk: source document, page, section title, timestamp
- Never split code blocks, tables, or lists mid-element
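A minimal sketch of paragraph-based chunking with overlap, under loud assumptions: tokens are approximated by whitespace-separated words, paragraphs are separated by blank lines, and the `Chunk` type and `chunk_paragraphs` name are hypothetical. Paragraphs are never split, so an oversized paragraph (or the carried-over overlap plus one paragraph) can exceed `max_tokens`; detection of code blocks and tables is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # metadata kept with every chunk
    index: int   # position of the chunk within its source


def chunk_paragraphs(text, source, max_tokens=512, overlap_ratio=0.15):
    """Greedily pack whole paragraphs into chunks, carrying a word-level overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    overlap = int(max_tokens * overlap_ratio)
    chunks = []
    current = []  # words of the chunk being built
    for para in paragraphs:
        words = para.split()
        # Close the current chunk if adding this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(Chunk(" ".join(current), source, len(chunks)))
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap > 0 else []
        current.extend(words)
    if current:
        chunks.append(Chunk(" ".join(current), source, len(chunks)))
    return chunks
```

Running this at several `max_tokens` values (256, 512, 1024) against the same retrieval test set is one way to make the chunk-size experiment systematic.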
Embedding & Indexing
- Evaluate multiple embedding models on YOUR data before committing to one
- Benchmark embedding models on domain-specific retrieval tasks, not just MTEB leaderboard
- Normalize embeddings before storing if using cosine similarity
- Index type selection, as a rule of thumb: HNSW for <10M vectors; IVF or ScaNN at larger scale, where HNSW memory cost becomes a concern
- Always version your embedding model — re-index when the model changes
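A small sketch of two points above: L2-normalizing vectors so that a plain dot product equals cosine similarity, and storing the embedding model identifier next to the index so a model change triggers a re-index. The function names and the `embedding_model` metadata key are hypothetical, not part of any particular vector store.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def needs_reindex(index_meta, current_model_id):
    """True when the index was built with a different embedding model."""
    return index_meta.get("embedding_model") != current_model_id


# Usage: vectors are normalized once at write time, the query at read time.
stored = l2_normalize([3.0, 4.0])
query = l2_normalize([6.0, 8.0])
similarity = dot(stored, query)  # same direction, so cosine similarity is 1.0
```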
Retrieval Pipeline
- Hybrid search (dense + sparse/BM25) usually outperforms dense-only retrieval; use it by default
- Reranking stage (cross-encoder) after initial retrieval significantly improves precision
- Retrieve more than you need (top-k=20), then rerank and select (top-n=5)
- Implement and log retrieval metrics (Recall@K, MRR, NDCG) before optimizing generation
- Handle "no relevant results" explicitly — don't force the model to answer with irrelevant context
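The pipeline above could be sketched as follows. Reciprocal Rank Fusion is used here as one common way to merge dense and sparse rankings; the text does not prescribe a fusion method. `rerank_score` is a hypothetical callable standing in for a cross-encoder, and `min_score` is an assumed relevance threshold for the explicit "no relevant results" case.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list adds 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(dense_ranking, sparse_ranking, rerank_score,
             top_k=20, top_n=5, min_score=0.0):
    """Fuse, keep top_k candidates, rerank, drop weak hits, return top_n."""
    candidates = reciprocal_rank_fusion([dense_ranking, sparse_ranking])[:top_k]
    scored = [(doc_id, rerank_score(doc_id)) for doc_id in candidates]
    scored = [(d, s) for d, s in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]  # an empty list means "no relevant results"
```

An empty return value gives the caller a clear signal to skip generation or answer "I don't know" rather than pass irrelevant context to the model.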
Generation
- Always include retrieved sources in the prompt with clear delimiters
- Instruct the model to cite sources and say "I don't know" when context is insufficient
- System prompt must specify: answer based on provided context only
- Context window budget: keep the prompt under about 80% of the window, leaving 20% headroom for output tokens
- Test with adversarial queries: irrelevant questions, contradictory context, empty retrieval
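A minimal prompt-assembly sketch for the rules above. The `<source>` delimiter format, the prompt wording, and the 8192-token window are illustrative assumptions, and `count_tokens` is a crude whitespace count standing in for the real tokenizer of the target model.

```python
SYSTEM_PROMPT = (
    "Answer using only the provided sources. "
    "Cite sources by id, e.g. [1]. "
    "If the sources do not contain the answer, say \"I don't know\"."
)


def count_tokens(text):
    # Assumption: whitespace words approximate tokens; swap in a real tokenizer.
    return len(text.split())


def build_prompt(question, chunks, context_window=8192, headroom=0.20):
    """Add delimited sources until the prompt budget is used up.

    Returns None when no source fits or none was retrieved, so the caller
    can handle empty retrieval explicitly.
    """
    budget = int(context_window * (1 - headroom))
    used = count_tokens(SYSTEM_PROMPT) + count_tokens(question)
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        block = f'<source id="{i}">\n{chunk}\n</source>'
        cost = count_tokens(block)
        if used + cost > budget:
            break  # chunks are assumed to arrive best-first
        blocks.append(block)
        used += cost
    if not blocks:
        return None
    context = "\n".join(blocks)
    return f"{SYSTEM_PROMPT}\n\n{context}\n\nQuestion: {question}"
```

The adversarial cases listed above map directly onto inputs here: an empty `chunks` list, chunks that contradict each other, and a question unrelated to every chunk.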
Evaluation
- Evaluate retrieval and generation separately — don't conflate pipeline errors
- Retrieval eval: relevance of retrieved chunks (human-labeled ground truth)
- Generation eval: faithfulness (no hallucination), relevance, completeness
- End-to-end eval: answer correctness against gold-standard QA pairs
- Track hallucination rate as a first-class metric; it is among the most common user complaints
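For the retrieval side of evaluation, the standard metrics can be sketched in a few lines. This version assumes binary relevance labels (a chunk id is either in the human-labeled relevant set or not) and no duplicate ids in the retrieved list; MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary gains: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```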
Common Pitfalls
- Chunk size too large → retrieves irrelevant content; too small → loses context
- An embedding model trained mainly on English often performs poorly on other languages; verify explicitly
- Stale index: documents updated but embeddings not re-indexed
- Context stuffing: more retrieved chunks ≠ better answers (often worse)
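One way to catch the stale-index pitfall is to store a content hash per document at index time and compare it against the current documents. This is a sketch under assumptions: `documents` and `indexed_hashes` are hypothetical in-memory dicts standing in for the document store and the index metadata.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents, indexed_hashes):
    """Compare current documents against hashes recorded at index time.

    documents: {doc_id: current text}
    indexed_hashes: {doc_id: hash stored when the doc was embedded}
    Returns (ids to re-embed, ids to delete from the index).
    """
    to_reembed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return sorted(to_reembed), sorted(to_delete)
```

New documents show up in the first list as well, since they have no recorded hash.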