
Snippet: Evaluation Framework

Domain Context

Systematic evaluation across the full ML lifecycle: offline metrics, online experiments, and monitoring. If you can't measure it rigorously, you can't improve it reliably.

Evaluation Design Principles

  • Define metrics and success criteria before building the model — never after seeing results
  • Every model must be compared against at least one baseline (heuristic, previous model, or random)
  • Separate evaluation into layers: component-level → pipeline-level → end-to-end → user-facing
  • Evaluation code is production code — it must be tested, reviewed, and versioned

Offline Evaluation

  • Holdout test set: must be created ONCE and never used during development (only final evaluation)
  • Development eval: use a separate validation set for iterative improvement
  • Report metrics with confidence intervals: bootstrap with ≥1000 resamples
  • Segment analysis is mandatory: break metrics by key dimensions (user type, data source, difficulty)
  • Statistical significance: compare models with paired tests on the same examples (McNemar for paired correct/incorrect outcomes, Wilcoxon signed-rank for paired scores), not just point estimates
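
The confidence-interval and paired-test bullets above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the function names `bootstrap_ci` and `mcnemar_exact` are hypothetical, the interval is a simple percentile bootstrap of the mean, and the McNemar variant shown is the exact binomial form, which assumes per-example correct/incorrect flags for both models on the same test items.

```python
import math
import random


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Returns (mean, lower, upper). `values` holds one score per eval example.
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute the metric.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / n, lower, upper


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correctness flags."""
    # Only discordant pairs (one model right, the other wrong) carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For continuous paired scores, a Wilcoxon signed-rank test from a statistics library would take the place of `mcnemar_exact`; the bootstrap part stays the same.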

Online Evaluation (A/B Testing)

  • Define primary metric, guardrail metrics, and minimum detectable effect (MDE) before launch
  • Sample size calculation: run power analysis to determine required traffic and duration
  • Minimum experiment duration: 1 full business cycle (usually 1-2 weeks) to capture temporal effects
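  • Check for Sample Ratio Mismatch (SRM): if the observed traffic split differs significantly from the configured split (e.g. by a chi-square test), assignment is likely broken and the results are invalid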
  • Guardrail metrics: latency p99, error rate, revenue — must not degrade beyond threshold
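
As a rough sketch of the power-analysis and SRM bullets above, the following assumes a conversion-style primary metric (a proportion), an absolute MDE, and a two-arm test. The function names and default values are hypothetical; the sample-size formula is the usual normal approximation for comparing two proportions, and the SRM check is a one-degree-of-freedom chi-square test.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of `mde_abs`."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square (1 dof) p-value for the observed vs. configured traffic split."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of chi-square with 1 dof, written via erfc.
    return math.erfc(math.sqrt(chi2 / 2))
```

Dividing the per-arm sample size by expected daily traffic per arm gives a duration estimate, which should still be rounded up to at least one full business cycle. A very small SRM p-value (the cutoff is a team decision) suggests the assignment mechanism should be investigated before reading any metric.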

Offline → Online Alignment

  • Track and document the gap between offline metrics and online outcomes
  • Build a mapping table: which offline metrics best predict online wins
  • If offline and online metrics consistently disagree, the offline eval is broken — fix it
  • Replay analysis: use logged production data to backtest new models before A/B deployment
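
One simple way to start the mapping table described above is to record, for each past experiment, the offline delta and the online delta, then rank offline metrics by how often the two moved in the same direction. The sketch below assumes that record format; the function name `agreement_table` is hypothetical, and directional agreement is only one possible measure (a rank correlation would be another).

```python
def agreement_table(history):
    """Rank offline metrics by directional agreement with online outcomes.

    history: {offline_metric_name: [(offline_delta, online_delta), ...]}
    Returns {metric_name: agreement_rate}, best predictor first.
    """
    table = {}
    for metric, pairs in history.items():
        if not pairs:
            continue  # no past experiments recorded for this metric
        # A zero delta is treated as "not an improvement".
        agree = sum(1 for off, on in pairs if (off > 0) == (on > 0))
        table[metric] = agree / len(pairs)
    return dict(sorted(table.items(), key=lambda kv: kv[1], reverse=True))
```

A metric whose agreement rate stays near 0.5 over many experiments is predicting online outcomes no better than a coin flip, which is the signal that the offline eval needs fixing.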

Evaluation Datasets

  • Minimum size: 500+ examples for reliable metric estimation; 1000+ for segment analysis
  • Update eval sets periodically — data drift makes static eval sets misleading over time
  • Include adversarial/hard examples: 10-20% of eval set should be known difficult cases
  • Never use eval data for any form of training, fine-tuning, or prompt optimization
  • Version and hash eval datasets: any change must be documented, and baseline metrics recomputed on the new version
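
Hashing an eval set can be as simple as a digest over a canonical serialization of its examples. The sketch below assumes each example is a JSON-serializable dict; the function name `dataset_fingerprint` is hypothetical, and sorting makes the hash independent of example order, which may or may not be what a given team wants.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """SHA-256 hex digest of an eval set, insensitive to example order."""
    # Canonicalize each example so key order does not change the hash.
    lines = sorted(
        json.dumps(example, sort_keys=True, ensure_ascii=False)
        for example in examples
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Storing this fingerprint next to every reported metric makes it obvious when two numbers were computed on different versions of the eval set.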

Human Evaluation

  • Required for: any generative/creative output, safety-critical applications, subjective quality
  • Use structured rubrics with 3-5 point scales — free-form feedback is supplementary, not primary
  • Minimum 3 evaluators per item; report inter-annotator agreement (IAA), e.g. Fleiss' kappa or Krippendorff's alpha
  • Blind evaluation: evaluators should not know which model produced which output
  • LLM-as-judge: acceptable for development iteration, but calibrate against human ratings first (target ≥80% agreement and report the measured value)
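
For three or more evaluators per item, Fleiss' kappa is one common agreement statistic; the source does not prescribe a specific one, so treat this as an illustrative choice. The sketch assumes every item has the same number of ratings. `judge_agreement` is a hypothetical helper that computes plain percent agreement between an LLM judge and a human reference label (for example, the majority vote).

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings` is a list of per-item lists, one label per evaluator."""
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(ratings)
    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of evaluator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item_agreement.append(agreeing / (n_raters * (n_raters - 1)))
    observed = sum(per_item_agreement) / n_items
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        return 1.0  # only one category was ever used; kappa is undefined here
    return (observed - expected) / (1 - expected)


def judge_agreement(human_labels, judge_labels):
    """Fraction of items where the LLM judge matches the human reference label."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(1 for h, j in zip(human_labels, judge_labels) if h == j)
    return matches / len(human_labels)
```

Percent agreement does not correct for chance, so reporting a chance-corrected statistic alongside it gives a fuller picture when the label distribution is skewed.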

Reporting Standards

  • Always report: metric name, value, confidence interval, sample size, evaluation date
  • Format: metric: value ± CI (n=sample_size)
  • Include both primary metric AND cost/efficiency metrics (latency, token usage, compute cost)
  • Negative results are results — document what was tried and why it didn't work
  • Evaluation dashboard: automated, always up-to-date, accessible to the full team
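
A small formatter keeps reports in the `metric: value ± CI (n=sample_size)` shape given above. This sketch assumes the CI is symmetric and passed as a half-width; the function name `format_metric`, the three-decimal precision, and the optional date suffix are illustrative choices, not part of the stated format.

```python
from datetime import date


def format_metric(name, value, ci_half_width, n, eval_date=None):
    """Render one metric line as 'name: value ± CI (n=sample_size)'."""
    line = f"{name}: {value:.3f} ± {ci_half_width:.3f} (n={n})"
    if eval_date is not None:
        # Evaluation date is a required field in reports; appended when known.
        line += f", evaluated {eval_date.isoformat()}"
    return line


# Example with made-up numbers:
# format_metric("accuracy", 0.873, 0.021, 1200)
# returns "accuracy: 0.873 ± 0.021 (n=1200)"
```

An asymmetric interval, such as one from a percentile bootstrap, would be better reported as explicit lower and upper bounds.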

Common Pitfalls

  • Overfitting to the eval set: if you look at eval results and iterate, you're implicitly training on it
  • Metric gaming: optimizing a proxy metric that diverges from actual user value
  • Cherry-picking: showing only the metrics that look good — report all pre-defined metrics
  • Ignoring variance: single-run results on small eval sets are unreliable
  • Stale eval sets: production data evolves but eval set stays frozen — revisit quarterly