Snippet: CTR Prediction / Recommendation Systems

Domain Context

Click-through rate prediction on advertising platforms. Data is large-scale, sparse, and highly temporal, so the rules below on splitting, encoding, and infrastructure all follow from those three properties.

Feature leakage check is mandatory before any model evaluation — this is non-negotiable
Temporal split only — never use random split; future data must not appear in training
Sparse categorical features must use embedding layers, not one-hot at scale
Log feature importance for every model — helps debug both model and data issues
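The temporal-split and leakage rules above can be sketched in a few lines. This is a minimal, dependency-free illustration, not a prescribed implementation: the row layout and the field names `ts` (impression time) and `feature_ts` (time the feature value was computed) are assumptions made for the example, as are the function names.

```python
from datetime import datetime


def temporal_split(rows, cutoff, ts_key="ts"):
    """Split rows by time: train is strictly before cutoff, test is at or after it."""
    train = [r for r in rows if r[ts_key] < cutoff]
    test = [r for r in rows if r[ts_key] >= cutoff]
    return train, test


def find_leaky_rows(rows, ts_key="ts", feature_ts_key="feature_ts"):
    """Return rows whose features were computed after the impression they describe."""
    return [r for r in rows if r[feature_ts_key] > r[ts_key]]


# Hypothetical impression log; field names are assumptions for this sketch.
rows = [
    {"ts": datetime(2024, 1, 1), "feature_ts": datetime(2023, 12, 31), "clicked": 0},
    {"ts": datetime(2024, 1, 2), "feature_ts": datetime(2024, 1, 1), "clicked": 1},
    {"ts": datetime(2024, 1, 3), "feature_ts": datetime(2024, 1, 4), "clicked": 0},  # leaky
    {"ts": datetime(2024, 1, 4), "feature_ts": datetime(2024, 1, 3), "clicked": 1},
]

train, test = temporal_split(rows, cutoff=datetime(2024, 1, 3))
leaky = find_leaky_rows(rows)
```

Running the leakage check before any evaluation, and failing the pipeline when `leaky` is non-empty, is one way to make the check mandatory rather than advisory.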

Always report both: AUC-ROC and Log Loss — never just one
Online metric proxy: compare predicted CTR distribution vs. actual CTR distribution
Calibration check: plot reliability diagram for every new model before deployment
Offline → online metric gap must be documented and monitored
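As a rough illustration of the metrics listed above, the sketch below computes AUC-ROC, Log Loss, and the binned data behind a reliability diagram using only the standard library. The function names are made up for this example, and the pairwise AUC is quadratic in the number of examples, so at real scale a library implementation would be the practical choice.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def auc_roc(y_true, y_pred):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_pred) if y == 1]
    neg = [p for y, p in zip(y_true, y_pred) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def reliability_bins(y_true, y_pred, n_bins=10):
    """Return (mean predicted CTR, observed CTR, count) for each non-empty equal-width bin."""
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_pred):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1
    return [(s / c, k / c, c) for s, k, c in bins if c > 0]


y_true = [0, 0, 1, 1]
y_pred = [0.1, 0.4, 0.35, 0.8]
report = {"auc_roc": auc_roc(y_true, y_pred), "log_loss": log_loss(y_true, y_pred)}
diagram = reliability_bins(y_true, y_pred, n_bins=5)
```

Plotting the first two values of each `diagram` tuple against the diagonal gives the reliability diagram; a well-calibrated model keeps the points close to that diagonal.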

User/item features: distinguish between static features and real-time features clearly
Handle cold-start items explicitly — don't let them silently degrade model performance
Sample negative examples carefully — random negative sampling may not reflect real distribution
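One common case where sampling distorts the distribution is downsampling observed negatives to shrink the training set. The sketch below assumes that specific setup (negatives kept with probability `keep_rate`, positives always kept) and shows the standard odds correction that maps predictions back to the unsampled scale. The function and field names are illustrative, and other sampling schemes need their own correction.

```python
import random


def downsample_negatives(rows, keep_rate, seed=0, label_key="clicked"):
    """Keep every positive row and each negative row with probability keep_rate."""
    rng = random.Random(seed)
    return [r for r in rows if r[label_key] == 1 or rng.random() < keep_rate]


def recalibrate(p, keep_rate):
    """Map a prediction made on downsampled data back to the original click rate scale.

    Downsampling negatives by keep_rate inflates the odds by 1 / keep_rate,
    so the corrected probability is p / (p + (1 - p) / keep_rate).
    """
    return p / (p + (1.0 - p) / keep_rate)


# Hypothetical log with a 1% click rate.
rows = [{"clicked": 1}] * 10 + [{"clicked": 0}] * 990
sampled = downsample_negatives(rows, keep_rate=0.1)
corrected = recalibrate(0.5, keep_rate=0.1)
```

Skipping this correction leaves AUC-ROC unchanged, because ranking is preserved, but it inflates predicted CTR and therefore hurts Log Loss and calibration.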

Assume dataset doesn't fit in RAM — use chunked processing or Spark
Feature store lookups must be batched — never row-by-row in production
Embedding tables dominate model size — monitor and cap embedding dimensions (start with dim=16)
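The scale rules above lend themselves to a small sketch: a generic batching helper, a batched lookup against a feature store, and a back-of-the-envelope embedding size estimate. `InMemoryFeatureStore` and its `get_many` method are hypothetical stand-ins, not the API of any real feature store, and the 4 bytes per value assumes float32 weights.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without materializing the whole input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


class InMemoryFeatureStore:
    """Hypothetical stand-in for a feature store client; counts round trips."""

    def __init__(self, data):
        self._data = data
        self.calls = 0

    def get_many(self, keys):
        self.calls += 1
        return {k: self._data.get(k) for k in keys}


def fetch_features(store, keys, batch_size=1000):
    """Look up features in batches rather than one key per request."""
    out = {}
    for batch in batched(keys, batch_size):
        out.update(store.get_many(batch))
    return out


def embedding_table_bytes(num_rows, dim=16, bytes_per_value=4):
    """Approximate size of one embedding table (float32 by default)."""
    return num_rows * dim * bytes_per_value


store = InMemoryFeatureStore({i: {"ctr_7d": i / 100} for i in range(2500)})
features = fetch_features(store, range(2500), batch_size=1000)
size_bytes = embedding_table_bytes(10_000_000, dim=16)
```

With these numbers, 2,500 keys cost three round trips instead of 2,500, and a 10-million-row table at dim=16 comes to 640 MB, which is why the embedding dimension is the first knob to cap.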

Using future impressions/clicks in training features — temporal leakage is the #1 silent killer
Ignoring position bias: items shown in position 1 tend to get more clicks regardless of relevance, so raw clicks overstate the relevance of top slots
Evaluating on uniformly random traffic but deploying on biased traffic — offline/online gap is inevitable
One-hot encoding high-cardinality features (>10K categories) — use hashing or embeddings instead
Not monitoring prediction distribution shift after deployment: models can degrade within days as user behavior changes
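Two of the pitfalls above have simple mechanical counterparts, sketched here under stated assumptions. `hash_bucket` shows the hashing trick for high-cardinality categoricals, using `hashlib` because Python's built-in `hash()` for strings is salted per process and so is not stable across runs. `psi` computes a population stability index between a reference and a live prediction distribution, which is one possible shift metric among several. The bucket count, bin count, and function names are illustrative choices.

```python
import hashlib
import math


def hash_bucket(value, num_buckets=2**20):
    """Map a categorical value to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between two sets of predictions in [0, 1].

    Zero means identical binned distributions; larger values mean more shift.
    """

    def proportions(preds):
        counts = [0] * n_bins
        for p in preds:
            counts[min(int(p * n_bins), n_bins - 1)] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(preds), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical predictions: a reference window and two live windows.
reference = [0.02, 0.03, 0.05, 0.12, 0.31, 0.04, 0.08, 0.02]
stable = list(reference)
drifted = [min(p + 0.4, 1.0) for p in reference]

bucket = hash_bucket("ad_id=123456")
psi_stable = psi(reference, stable)
psi_drifted = psi(reference, drifted)
```

Computing the index on a schedule and alerting when it crosses a threshold chosen from historical data is one way to catch degradation before it shows up in online metrics.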