Snippet: Tabular ML / Traditional Machine Learning

Domain Context

Structured/tabular data modeling: classification, regression, ranking on feature-engineered datasets. Feature engineering and data quality drive most of the performance gains — not model complexity.

Feature Engineering

  • Start with simple features; add complexity only once a baseline is established
  • Encode categorical variables properly: ordinal (label) encoding for tree models, one-hot or target encoding for linear models; fit target encoders inside each training fold to avoid leakage
  • Handle missing values explicitly: document the imputation strategy and why
  • Create interaction features only when domain knowledge supports them
  • Time-based features: always use point-in-time correctness — no future leakage
  • Log all feature transformations for reproducibility (use sklearn Pipeline or equivalent)
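As a rough illustration of the points above, the following sketch wraps imputation, scaling, and encoding in a single sklearn Pipeline so that every transformation is recorded and replayed identically at prediction time. The toy DataFrame and its column names (age, income, plan, churned) are hypothetical, and median imputation is just one possible strategy that would still need to be justified for a real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, 38, 29],
    "income": [40_000, 52_000, np.nan, 88_000, 61_000, 45_000],
    "plan": ["basic", "pro", "basic", "pro", "pro", "basic"],
    "churned": [0, 1, 0, 1, 1, 0],
})
numeric_cols = ["age", "income"]
categorical_cols = ["plan"]
features = df[numeric_cols + categorical_cols]

# Numeric columns: explicit median imputation (assumed strategy), then scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# One-hot encoding because the downstream model here is linear.
preprocess = ColumnTransformer([
    ("num", numeric, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# The fitted pipeline is the reproducible record of all transformations.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(features, df["churned"])
print(model.predict(features))
```

Because the imputer and encoder are fitted inside the pipeline, cross-validating `model` refits them on each training fold, which avoids leaking validation statistics into preprocessing.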

Model Selection

  • Always start with a solid baseline: logistic regression (classification) or linear regression (regression), then move to gradient boosting
  • Tree-based models (XGBoost, LightGBM, CatBoost) are default for tabular — justify using anything else
  • Neural networks on tabular data: only when >100K rows AND non-linear interactions are proven
  • Ensemble only if marginal gain justifies the complexity — document the improvement

Cross-Validation

  • Use stratified K-fold (K=5) for classification; standard K-fold for regression
  • For time-dependent data: time-series split only — never random shuffle
  • For grouped data (e.g., per-user): group K-fold — same group never in both train and val
  • Report mean ± std across folds — a single fold result is not reliable
  • Nested CV for hyperparameter tuning: inner loop tunes, outer loop evaluates
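The splitting rules above map onto sklearn splitters roughly as follows. This is a sketch on synthetic data: the `groups` array (60 hypothetical users with 10 rows each) is invented for illustration, and TimeSeriesSplit assumes the rows are already sorted by time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    GroupKFold,
    StratifiedKFold,
    TimeSeriesSplit,
    cross_val_score,
)

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
groups = np.repeat(np.arange(60), 10)  # hypothetical user id per row
model = LogisticRegression(max_iter=1000)

splitters = {
    "stratified": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "time_series": TimeSeriesSplit(n_splits=5),  # rows assumed time-ordered
    "grouped": GroupKFold(n_splits=5),
}

# Report mean and std across folds for each splitting scheme.
summary = {}
for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv, groups=groups, scoring="roc_auc")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")

# Sanity check: count groups shared between train and validation per fold.
group_leaks = [
    len(np.intersect1d(groups[train_idx], groups[val_idx]))
    for train_idx, val_idx in splitters["grouped"].split(X, y, groups)
]

# Nested CV: the inner loop tunes C, the outer loop estimates performance.
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="roc_auc",
)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
nested_scores = cross_val_score(inner, X, y, cv=outer, scoring="roc_auc")
print(f"nested: AUC {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```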

Hyperparameter Tuning

  • Use Optuna or similar Bayesian optimization — avoid grid search on large spaces
  • Define the search space based on domain knowledge, not arbitrary ranges
  • Budget: 50-100 trials for tree models; fewer for expensive models
  • Always compare tuned model against default hyperparameters — report the delta

Explainability

  • SHAP values: mandatory for any model going to production or stakeholder review
  • Feature importance: compute and log for every trained model
  • Partial dependence plots for top-5 features — sanity check against domain knowledge
  • If top features don't make domain sense, investigate data issues before celebrating metrics

Evaluation

  • Classification: report precision, recall, F1, AUC-ROC; confusion matrix for multi-class
  • Regression: report RMSE, MAE, R²; plot predicted vs. actual
  • Always evaluate on holdout test set after all tuning is done (never peek during tuning)
  • Segment-level evaluation: check performance across key demographic/business segments
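A minimal sketch of holdout and segment-level reporting for a binary classifier. The `segment` labels (A, B, C) are randomly assigned and purely hypothetical; in practice they would be a real demographic or business attribute, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
segment = np.random.default_rng(0).choice(["A", "B", "C"], size=len(y))

# The test split is touched once, after all modeling decisions are made.
X_tr, X_te, y_tr, y_te, seg_tr, seg_te = train_test_split(
    X, y, segment, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)  # assumed threshold

report = {
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "roc_auc": roc_auc_score(y_te, proba),  # uses scores, not hard labels
}
cm = confusion_matrix(y_te, pred)
print(report)
print(cm)

# Same metric per segment, to expose groups where the model underperforms.
per_segment = {}
for name in np.unique(seg_te):
    mask = seg_te == name
    per_segment[name] = f1_score(y_te[mask], pred[mask])
    print(f"segment {name}: n={mask.sum()}, F1={per_segment[name]:.3f}")
```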

Common Pitfalls

  • Target leakage from features computed using the target variable
  • Class imbalance: raw accuracy is misleading because always predicting the majority class already scores high; use precision, recall, F1, or balanced accuracy instead
  • High-cardinality categoricals: naive one-hot encoding causes memory explosion
  • Train/test distribution mismatch: validate feature distributions are consistent across splits
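Two of these pitfalls are easy to demonstrate on toy data. The sketch below first shows how a majority-class predictor looks good on accuracy alone, then uses a two-sample Kolmogorov-Smirnov test as one possible check for a shifted feature. The 95/5 class ratio, the 0.5 shift, and the 0.01 p-value cutoff are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Pitfall 1: class imbalance. 950 negatives, 50 positives.
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # features are irrelevant to a majority-class model

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)

acc = accuracy_score(y, pred)            # 0.95, yet the model is useless
rec = recall_score(y, pred)              # 0.0, no positive is ever found
bal = balanced_accuracy_score(y, pred)   # 0.5, same as chance
print(f"accuracy={acc:.2f} recall={rec:.2f} balanced_accuracy={bal:.2f}")

# Pitfall 2: train/test distribution mismatch on a single numeric feature.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
test_feature = rng.normal(loc=0.5, scale=1.0, size=1000)  # deliberately shifted

statistic, p_value = ks_2samp(train_feature, test_feature)
shifted = p_value < 0.01  # assumed cutoff for flagging a feature
print(f"KS statistic={statistic:.3f}, flagged as shifted: {shifted}")
```

With many features, running this test per column produces some false alarms by chance, so flagged features are a prompt for inspection rather than proof of a problem.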