Snippet: Evaluation Framework¶
Domain Context¶
Systematic evaluation across the full ML lifecycle: offline metrics, online experiments, and monitoring. If you can't measure it rigorously, you can't improve it reliably.
Evaluation Design Principles¶
- Define metrics and success criteria before building the model — never after seeing results
- Every model must be compared against at least one baseline (heuristic, previous model, or random)
- Separate evaluation into layers: component-level → pipeline-level → end-to-end → user-facing
- Evaluation code is production code — it must be tested, reviewed, and versioned
Offline Evaluation¶
- Holdout test set: must be created ONCE and never used during development (only final evaluation)
- Development eval: use a separate validation set for iterative improvement
- Report metrics with confidence intervals: bootstrap with ≥1000 resamples
- Segment analysis is mandatory: break metrics by key dimensions (user type, data source, difficulty)
- Statistical significance: compare models with paired tests on the same examples (McNemar for binary correct/incorrect outcomes, Wilcoxon signed-rank for per-example scores), not just point estimates
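The confidence-interval and paired-test bullets above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the function names `bootstrap_ci` and `mcnemar_exact` are hypothetical, the interval is a plain percentile bootstrap, and the McNemar variant shown is the exact binomial form (one of several common variants).

```python
import random
from math import comb


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n examples with replacement and record the mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return sum(scores) / n, means[lo_idx], means[hi_idx]


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired binary outcomes.

    Only discordant pairs (one model right, the other wrong) carry information.
    """
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Both functions expect per-example results scored on the same examples in the same order, which is what makes the comparison paired.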
Online Evaluation (A/B Testing)¶
- Define primary metric, guardrail metrics, and minimum detectable effect (MDE) before launch
- Sample size calculation: run power analysis to determine required traffic and duration
- Minimum experiment duration: 1 full business cycle (usually 1-2 weeks) to capture temporal effects
- Check for Sample Ratio Mismatch (SRM): if the observed traffic split differs significantly from the configured split, treat the results as invalid until the cause is found
- Guardrail metrics: latency p99, error rate, revenue — must not degrade beyond threshold
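The power-analysis and SRM bullets above can be approximated as follows. This is a sketch under stated assumptions: the primary metric is a conversion-style rate, the test is a two-sided two-proportion z-test with equal arms, and the names `sample_size_per_arm` and `srm_p_value` are hypothetical. The SRM check is a chi-squared goodness-of-fit test with one degree of freedom.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """P-value for the observed split against the configured split."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # Survival function of chi-squared with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))
```

Dividing the per-arm sample size by expected daily traffic per arm gives a rough duration, which should still be rounded up to at least one full business cycle. A very small SRM p-value suggests the assignment or logging is broken; the exact cutoff is a team convention and is not specified here.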
Offline → Online Alignment¶
- Track and document the gap between offline metrics and online outcomes
- Build a mapping table: which offline metrics best predict online wins
- If offline and online metrics consistently disagree, the offline eval is broken — fix it
- Replay analysis: use logged production data to backtest new models before A/B deployment
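One simple way to start the mapping table described above is to measure, for each offline metric, how often its change pointed in the same direction as the online result across past experiments. The sketch below assumes a hypothetical record layout (`offline_delta`, `online_delta`) and made-up metric names and values, purely for illustration.

```python
def directional_agreement(experiments, offline_metric):
    """Share of past experiments where the offline and online deltas share a sign."""
    usable = [e for e in experiments if offline_metric in e["offline_delta"]]
    if not usable:
        return None  # this metric was never recorded
    hits = sum(
        1 for e in usable
        if (e["offline_delta"][offline_metric] > 0) == (e["online_delta"] > 0)
    )
    return hits / len(usable)


# Illustrative history only; real entries would come from experiment logs.
history = [
    {"offline_delta": {"ndcg": 0.02, "auc": 0.01}, "online_delta": 0.5},
    {"offline_delta": {"ndcg": -0.01, "auc": 0.02}, "online_delta": -0.3},
    {"offline_delta": {"ndcg": 0.03, "auc": -0.01}, "online_delta": 0.2},
    {"offline_delta": {"ndcg": 0.01, "auc": 0.01}, "online_delta": -0.1},
]

mapping_table = {m: directional_agreement(history, m) for m in ("ndcg", "auc")}
```

A metric whose agreement stays near 0.5 over many experiments is no better than a coin flip at predicting online wins, which is a signal that the offline eval needs fixing. With only a handful of experiments the rates are noisy and should be read cautiously.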
Evaluation Datasets¶
- Minimum size: 500+ examples for reliable metric estimation; 1000+ for segment analysis
- Update eval sets periodically — data drift makes static eval sets misleading over time
- Include adversarial/hard examples: 10-20% of eval set should be known difficult cases
- Never use eval data for any form of training, fine-tuning, or prompt optimization
- Version and hash eval datasets — any change must be documented and metrics rebased
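Hashing an eval set can be as simple as a SHA-256 digest over a canonical serialization. The sketch below assumes examples are JSON-serializable dictionaries and that example order should not affect the hash; `dataset_fingerprint` is a hypothetical helper name.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """SHA-256 over canonical JSON lines, independent of example and key order."""
    lines = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        for ex in examples
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # delimiter so adjacent examples cannot merge
    return digest.hexdigest()
```

Storing the fingerprint next to every reported metric makes it easy to tell whether two results were computed on the same eval set version.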
Human Evaluation¶
- Required for: any generative/creative output, safety-critical applications, subjective quality
- Use structured rubrics with 3-5 point scales — free-form feedback is supplementary, not primary
- Minimum 3 evaluators per item; report IAA (Inter-Annotator Agreement), e.g. Fleiss' kappa or Krippendorff's alpha for three or more raters
- Blind evaluation: evaluators should not know which model produced which output
- LLM-as-judge: acceptable for development iteration, but calibrate against human ratings first (target ≥80% agreement, and report the measured figure)
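The agreement figures mentioned above can be computed in a few lines. This sketch uses Fleiss' kappa as the IAA statistic, which is one reasonable choice for three or more raters and is an assumption here, since the guideline does not name a statistic. It also assumes every item has the same number of ratings. The function names are hypothetical.

```python
def fleiss_kappa(ratings, categories):
    """ratings: one list of labels per item, same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(s * s for s in shares)
    if p_expected == 1.0:
        return 1.0  # every rating fell in a single category
    return (p_observed - p_expected) / (1 - p_expected)


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the LLM judge matches the human label exactly."""
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(human_labels)
```

For the LLM-as-judge check, `human_labels` could be the majority human rating per item. Raw agreement does not correct for chance, so it can look high when one label dominates.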
Reporting Standards¶
- Always report: metric name, value, confidence interval, sample size, evaluation date
- Format: `metric: value ± CI (n=sample_size)`
- Include both primary metric AND cost/efficiency metrics (latency, token usage, compute cost)
- Negative results are results — document what was tried and why it didn't work
- Evaluation dashboard: automated, always up-to-date, accessible to the full team
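A small record type can keep the required reporting fields together and render the line format consistently. This is a sketch: `MetricReport` and its field names are hypothetical, the interval is assumed symmetric and stored as a half-width, and the three-decimal precision is an arbitrary choice. The example values are illustrative, not real results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetricReport:
    name: str
    value: float
    ci_half_width: float  # assumes a symmetric interval
    n: int
    eval_date: date

    def line(self) -> str:
        """Render as 'metric: value ± CI (n=sample_size)'."""
        return f"{self.name}: {self.value:.3f} ± {self.ci_half_width:.3f} (n={self.n})"


# Illustrative values only.
report = MetricReport("accuracy", 0.873, 0.021, 1200, date(2024, 1, 15))
summary = report.line()
```

Asymmetric intervals, such as percentile bootstrap intervals, do not fit the ± form exactly; in that case reporting the lower and upper bounds explicitly is the safer option.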
Common Pitfalls¶
- Overfitting to the eval set: if you look at eval results and iterate, you're implicitly training on it
- Metric gaming: optimizing a proxy metric that diverges from actual user value
- Cherry-picking: showing only the metrics that look good — report all pre-defined metrics
- Ignoring variance: single-run results on small eval sets are unreliable
- Stale eval sets: production data evolves but eval set stays frozen — revisit quarterly