Data Science / ML Project Overlay¶

Appended on top of core.md for DS/ML projects.

Project Mindset¶

This is a research-engineering hybrid. Reproducibility > code elegance. Every experiment must be traceable, comparable, and reversible. Treat each run as a scientific trial — not a software deployment.

Experiment Management¶

Set ALL random seeds at the start: torch, numpy, random, transformers
All hyperparameters live in YAML config — never hardcoded in training scripts
Every training run must log to an experiment tracker (W&B or MLflow)
Run naming convention: {model}_{dataset}_{key_param}_{YYYYMMDD_HHMM}
Save the config file alongside every checkpoint

Data Pipeline Rules¶

data/raw/ is read-only — never overwrite source data
Always write outputs to data/processed/ or data/features/
Validate schema at every pipeline boundary: shape, dtype, value ranges, null %
Log dataset statistics before any training: class distribution, size, sample previews
Large files (>50MB) managed by DVC, not Git

Code Organization¶

Notebooks are for exploration only — no production logic inside .ipynb
Refactor notebook code to src/ modules before marking any task complete
Model classes must implement: train(), evaluate(), save(), load(), predict()
No business logic in if __name__ == "__main__" blocks

Evaluation Standards¶

Never evaluate on training data — always maintain strict train/val/test split
For time-series data: temporal split only, never random split
Always include at least one baseline comparison
Report format: metric: mean ± std (run with ≥3 seeds)
Always report alongside primary metric: inference latency (ms), model size (MB)
Feature leakage check is mandatory before any model evaluation

Training Checklist (remind me if I skip any)¶

[ ] Seeds set
[ ] Config logged to experiment tracker
[ ] Data validation passed
[ ] Baseline exists for comparison
[ ] Early stopping configured
[ ] Checkpoint saving enabled