Snippet: CTR Prediction / Recommendation Systems¶
Domain Context¶
Click-through rate prediction on advertising platforms. Data is large-scale, sparse, and highly temporal — treat accordingly.
Critical Rules¶
- Feature leakage check is mandatory before any model evaluation — this is non-negotiable
- Temporal split only — never use random split; future data must not appear in training
- Sparse categorical features must use embedding layers, not one-hot at scale
- Log feature importance for every model — helps debug both model and data issues
Evaluation Metrics¶
- Always report both: AUC-ROC and Log Loss — never just one
- Online metric proxy: compare predicted CTR distribution vs. actual CTR distribution
- Calibration check: plot reliability diagram for every new model before deployment
- Offline → online metric gap must be documented and monitored
Data Pipeline¶
- User/item features: distinguish between static features and real-time features clearly
- Handle cold-start items explicitly — don't let them silently degrade model performance
- Sample negative examples carefully — random negative sampling may not reflect real distribution
Scale Considerations¶
- Assume dataset doesn't fit in RAM — use chunked processing or Spark
- Feature store lookups must be batched — never row-by-row in production
- Embedding tables dominate model size — monitor and cap embedding dimensions (start with dim=16)
Common Pitfalls¶
- Using future impressions/clicks in training features — temporal leakage is the #1 silent killer
- Ignoring position bias: items shown in position 1 always get more clicks regardless of relevance
- Evaluating on uniformly random traffic but deploying on biased traffic — offline/online gap is inevitable
- One-hot encoding high-cardinality features (>10K categories) — use hashing or embeddings instead
- Not monitoring prediction distribution shift after deployment — models degrade within days as user behavior changes