Snippet: RAG (Retrieval-Augmented Generation)

Domain Context

Building systems that retrieve relevant context to ground LLM generation. Retrieval quality sets the ceiling for generation quality, so invest heavily in retrieval.

Chunking Strategy

  • Chunk size matters enormously — experiment with 256, 512, 1024 tokens systematically
  • Prefer semantic chunking (by paragraph/section) over fixed-size windowing
  • Always include overlap between chunks (10-20% of chunk size)
  • Preserve metadata with each chunk: source document, page, section title, timestamp
  • Never split code blocks, tables, or lists mid-element
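
A minimal sketch of the chunking points above: it packs whole paragraphs into chunks, carries trailing paragraphs forward as overlap, and attaches metadata to each chunk. The names `Chunk` and `chunk_paragraphs` are hypothetical, and token counts are approximated by whitespace-separated words, which is an assumption; a real pipeline would count tokens with the tokenizer of its embedding model.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_paragraphs(text, source, max_tokens=512, overlap_ratio=0.15):
    """Pack whole paragraphs into chunks of roughly max_tokens each."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    overlap_budget = int(max_tokens * overlap_ratio)
    chunks, current = [], []

    def n_tokens(paras):
        # Assumption: whitespace-separated words approximate tokens.
        return sum(len(p.split()) for p in paras)

    def flush():
        chunks.append(Chunk(
            text="\n\n".join(current),
            metadata={"source": source, "chunk_index": len(chunks)},
        ))

    for para in paragraphs:
        if current and n_tokens(current + [para]) > max_tokens:
            flush()
            # Carry trailing paragraphs forward as overlap, within the budget.
            carried = []
            for prev in reversed(current):
                if n_tokens(carried + [prev]) > overlap_budget:
                    break
                carried.insert(0, prev)
            current = carried
        # A paragraph is never split, so an oversized one yields an oversized chunk.
        current.append(para)
    if current:
        flush()
    return chunks
```

Because paragraphs are the smallest unit, a chunk can exceed `max_tokens` when a single paragraph is very long or when carried overlap is added. Page, section title, and timestamp would go into the same `metadata` dict when the parser provides them.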

Embedding & Indexing

  • Evaluate multiple embedding models on YOUR data before committing to one
  • Benchmark embedding models on domain-specific retrieval tasks, not just the MTEB leaderboard
  • Normalize embeddings before storing if using cosine similarity
  • Index type selection: HNSW for <10M vectors; IVF or ScaNN for larger scale
  • Always version your embedding model — re-index when the model changes
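
To illustrate the normalization point above: once vectors are scaled to unit length, a plain dot product equals cosine similarity, so the index can use the cheaper inner-product metric. This is a pure-Python sketch with hypothetical helper names; a real system would typically do the same with a numeric library.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)
```

Normalizing once at write time (and normalizing the query vector at read time) means `dot(l2_normalize(a), l2_normalize(b))` gives the same score as `cosine_similarity(a, b)`, up to floating-point error.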

Retrieval Pipeline

  • Hybrid search (dense + sparse/BM25) consistently outperforms dense-only — use it by default
  • Reranking stage (cross-encoder) after initial retrieval significantly improves precision
  • Retrieve more than you need (top-k=20), then rerank and select (top-n=5)
  • Implement and log retrieval metrics: Recall@K, MRR, NDCG before optimizing generation
  • Handle "no relevant results" explicitly — don't force the model to answer with irrelevant context
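
The retrieve-wide, rerank, then select flow above can be sketched as follows. The source does not say how dense and sparse results should be combined; reciprocal rank fusion (RRF) is used here as one common choice, which is an assumption. `rerank_score` is a hypothetical callable standing in for a cross-encoder, and `min_score` is a hypothetical threshold for the "no relevant results" case.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve_then_rerank(query, dense_ids, sparse_ids, rerank_score,
                         top_k=20, top_n=5, min_score=None):
    """Fuse dense and sparse hits, rerank the top_k, keep the best top_n.

    Returns an empty list when nothing clears min_score, so the caller
    can handle "no relevant results" instead of passing weak context on.
    """
    candidates = reciprocal_rank_fusion([dense_ids, sparse_ids])[:top_k]
    scored = [(rerank_score(query, doc_id), doc_id) for doc_id in candidates]
    if min_score is not None:
        scored = [(s, d) for s, d in scored if s >= min_score]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]
```

RRF only needs ranks, not scores, which sidesteps the problem that BM25 scores and embedding similarities live on different scales. Any threshold for `min_score` depends on the reranker and should be tuned on labeled data.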

Generation

  • Always include retrieved sources in the prompt with clear delimiters
  • Instruct the model to cite sources and say "I don't know" when context is insufficient
  • System prompt must specify: answer based on provided context only
  • Context window budget: keep the full prompt (instructions, retrieved context, question) within about 80% of the window, leaving roughly 20% headroom for output tokens
  • Test with adversarial queries: irrelevant questions, contradictory context, empty retrieval
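
A minimal sketch of prompt assembly under the rules above: delimited sources, a context-only instruction, a token budget with 20% headroom, and an explicit path for empty retrieval. The tag format, the `build_prompt` name, and the default window size are hypothetical, and token counts are approximated by whitespace-separated words, which is an assumption.

```python
SYSTEM_PROMPT = (
    "Answer using only the provided context. Cite sources by their id, "
    "for example [1]. If the context is insufficient, say \"I don't know\"."
)


def build_prompt(question, chunks, context_window=8192, headroom=0.20):
    """Return a prompt string, or None when there is nothing to ground on.

    chunks: list of (source_name, text) tuples, already ordered best first.
    """
    if not chunks:
        return None  # caller should return a fallback, not call the model

    def n_tokens(s):
        # Assumption: whitespace-separated words approximate tokens.
        return len(s.split())

    budget = int(context_window * (1.0 - headroom))
    used = n_tokens(SYSTEM_PROMPT) + n_tokens(question)
    blocks = []
    for i, (source, text) in enumerate(chunks, start=1):
        block = f'<source id="{i}" name="{source}">\n{text}\n</source>'
        cost = n_tokens(block)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        blocks.append(block)
        used += cost
    if not blocks:
        return None
    context = "\n".join(blocks)
    return f"{SYSTEM_PROMPT}\n\n{context}\n\nQuestion: {question}"
```

Returning `None` rather than an empty-context prompt makes the empty-retrieval case visible to the caller, which is also the case the adversarial tests above should exercise.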

Evaluation

  • Evaluate retrieval and generation separately — don't conflate pipeline errors
  • Retrieval eval: relevance of retrieved chunks (human-labeled ground truth)
  • Generation eval: faithfulness (no hallucination), relevance, completeness
  • End-to-end eval: answer correctness against gold-standard QA pairs
  • Track hallucination rate as a first-class metric — it's the #1 user complaint
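
For the retrieval side of evaluation, the metrics named in this snippet (Recall@K, MRR, NDCG) can be computed from ranked doc ids and human-labeled relevance. This is a sketch with hypothetical function names; the NDCG here uses linear gain, and some implementations use an exponential gain (2^rel - 1) instead.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, gains, k):
    """gains: dict of doc_id -> graded relevance; missing ids count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(d, 0.0) for d in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Logging these per query, not only as averages, makes it easier to tell a retrieval miss from a generation error when an answer is wrong.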

Common Pitfalls

  • Chunk size too large → retrieves irrelevant content; too small → loses context
  • An embedding model trained mainly on English often performs poorly on other languages; verify explicitly
  • Stale index: documents updated but embeddings not re-indexed
  • Context stuffing: more retrieved chunks ≠ better answers (often worse)
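
One way to catch the stale-index pitfall above is to store a content hash and the embedding model version next to each indexed document, then compare them against the current corpus. This is a sketch under assumptions: the record layout, the `find_stale` name, and the model version strings are all hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents, index_records, current_model):
    """Return ids of documents that need to be embedded again.

    documents: dict of doc_id -> current text
    index_records: dict of doc_id -> {"hash": str, "model": str}
    """
    stale = []
    for doc_id, text in documents.items():
        record = index_records.get(doc_id)
        if (record is None                               # never indexed
                or record["hash"] != content_hash(text)  # text changed
                or record["model"] != current_model):    # model changed
            stale.append(doc_id)
    return sorted(stale)
```

The same check covers both a changed document and a changed embedding model, so a model upgrade marks the whole corpus for re-indexing.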