Snippet: MLOps & Model Deployment

Domain Context

Taking models from experimentation to production with reliability, monitoring, and automation. The model is only as good as the system around it.

Model training pipelines must be reproducible from a single command or config
Every model artifact must be traceable: commit hash, data version, config, training logs
Automated tests for data pipelines: schema validation, distribution drift, missing values
Automated tests for model quality: eval metrics must pass a minimum threshold gate
Separate pipelines: data prep → training → evaluation → packaging → deployment
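
The threshold gate on eval metrics could look roughly like the sketch below. The names `check_quality_gate` and `GateResult`, and the metric names in the usage note, are hypothetical and chosen for illustration; the real thresholds would come from the project's own config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: list[str]


def check_quality_gate(
    metrics: dict[str, float], minimums: dict[str, float]
) -> GateResult:
    """Compare eval metrics against minimum thresholds (hypothetical helper).

    A metric that is required but missing counts as a failure, so an
    evaluation step that silently skipped a metric cannot pass the gate.
    """
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.4f} < {minimum:.4f}")
    return GateResult(passed=not failures, failures=failures)
```

In a pipeline, the evaluation stage would call something like `check_quality_gate({"accuracy": 0.93, "recall": 0.70}, {"accuracy": 0.90, "recall": 0.80})` and exit non-zero when `passed` is false, which stops packaging and deployment from running.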

Every production model must be registered with: version, metrics, training config, data hash
Promote models through stages: dev → staging → production — never skip staging
Keep at least 2 previous production model versions for instant rollback
Model metadata must include: input/output schema, expected latency, resource requirements
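
As a rough illustration of the registry rules above, the in-memory sketch below promotes a model one stage at a time (so staging cannot be skipped) and keeps a production history for rollback. `ModelRegistry`, `ModelVersion`, and their methods are hypothetical names, not the API of any particular registry product, and demoting the previously live version is left out for brevity.

```python
from dataclasses import dataclass

STAGES = ("dev", "staging", "production")


@dataclass
class ModelVersion:
    version: str
    metrics: dict[str, float]
    training_config: dict[str, object]
    data_hash: str
    stage: str = "dev"


class ModelRegistry:
    """Toy in-memory registry; a real one would persist this state."""

    def __init__(self) -> None:
        self._versions: dict[str, ModelVersion] = {}
        self._production_history: list[str] = []  # oldest first, newest last

    def register(self, model: ModelVersion) -> None:
        if model.version in self._versions:
            raise ValueError(f"version {model.version} already registered")
        self._versions[model.version] = model

    def promote(self, version: str) -> str:
        """Move a model exactly one stage forward and return the new stage."""
        model = self._versions[version]
        index = STAGES.index(model.stage)
        if index == len(STAGES) - 1:
            raise ValueError(f"version {version} is already in production")
        model.stage = STAGES[index + 1]
        if model.stage == "production":
            self._production_history.append(version)
        return model.stage

    def rollback_candidates(self, keep: int = 2) -> list[str]:
        """Previous production versions, most recent first, excluding the current one."""
        previous = self._production_history[:-1]
        return previous[::-1][:keep]
```

Because `promote` only ever advances by one stage, reaching production always takes two calls, and the caller can put the staging checks between them.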

Define SLA upfront: latency p99, throughput, availability target
Health checks must verify model is loaded and producing valid outputs, not just HTTP 200
Implement graceful degradation: fallback model or cached responses when primary fails
Batch inference: prefer offline batch processing for non-real-time use cases (cheaper, simpler)
A/B testing infrastructure: route traffic by percentage, log predictions for both models
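
The health-check and graceful-degradation rules above might be sketched as follows. This assumes, purely for illustration, that a model is a callable returning a probability in [0, 1]; `deep_health_check`, `predict_with_fallback`, and `is_valid_output` are hypothetical names, and the validity rule would differ for other output types.

```python
import math
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def is_valid_output(value: object) -> bool:
    """Assumed validity rule: a finite float probability in [0, 1]."""
    return isinstance(value, float) and math.isfinite(value) and 0.0 <= value <= 1.0


def deep_health_check(model: Model, canary_input: Sequence[float]) -> bool:
    """Run a real prediction on a known input instead of only reporting liveness."""
    try:
        return is_valid_output(model(canary_input))
    except Exception:
        return False


def predict_with_fallback(
    primary: Model, fallback: Model, features: Sequence[float]
) -> tuple[float, str]:
    """Return (prediction, source); use the fallback if the primary raises
    or produces an invalid output."""
    try:
        result = primary(features)
        if is_valid_output(result):
            return result, "primary"
    except Exception:
        pass
    return fallback(features), "fallback"
```

Returning the source alongside the prediction lets the service log how often the fallback path is taken, which is itself a useful alerting signal.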

Data drift detection is mandatory — monitor input feature distributions daily
Model performance monitoring: track prediction distribution shift, not just system metrics
Alert on: prediction latency spike, error rate increase, confidence score distribution shift
Log all predictions with timestamps — enables retroactive analysis when issues are discovered
Dashboard must show: request volume, latency percentiles, error rate, model version, drift score
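
One common way to turn "monitor input feature distributions" into a single drift score is the Population Stability Index (PSI). The source does not prescribe a metric, so PSI is an assumption here, as are the function name, the equal-width binning, and the `eps` floor that avoids taking the log of zero.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample (e.g. training data) and a current sample.

    Bin edges are equal-width over the reference range; current values
    outside that range are clamped into the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            index = int((v - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor each proportion at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A daily job could compute this per feature against the training sample and alert above a chosen threshold. A frequently quoted rule of thumb treats values above roughly 0.25 as significant drift, but the threshold should be tuned per feature rather than taken as given.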

Containerize everything: model serving, data pipelines, evaluation jobs
GPU resource management: right-size instances, use spot/preemptible for training
Model artifacts stored in versioned object storage (S3/GCS), not local filesystem
Secrets and credentials: use a vault/secret manager, never bake env files into container images
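
For versioned artifact storage, one possible approach is to derive the object key from the model name, version, and a content hash, and to store the traceability fields next to it. The key layout, the field names, and the `artifact_manifest` helper below are all hypothetical; the upload itself is omitted because it depends on the storage client in use.

```python
import hashlib


def artifact_manifest(
    model_bytes: bytes,
    *,
    model_name: str,
    version: str,
    commit_hash: str,
    data_version: str,
) -> dict[str, str]:
    """Build a manifest tying an artifact to its code and data versions.

    The SHA-256 digest identifies the exact bytes, so two uploads with the
    same version label but different contents get different object keys.
    """
    digest = hashlib.sha256(model_bytes).hexdigest()
    return {
        "model_name": model_name,
        "version": version,
        "commit_hash": commit_hash,
        "data_version": data_version,
        "sha256": digest,
        # Assumed layout: models/<name>/<version>/<short digest>/model.bin
        "object_key": f"models/{model_name}/{version}/{digest[:12]}/model.bin",
    }
```

Writing the manifest as a JSON file beside the artifact keeps the commit hash and data version discoverable from the storage bucket alone.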

Training-serving skew: feature engineering differs between training and inference
Silent model degradation: model still returns predictions but quality drops over weeks
Missing monitoring: team discovers issues from user complaints, not alerts
Over-engineering: not every model needs Kubernetes — start simple, scale when needed
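
A simple guard against training-serving skew is to route both paths through one feature function, so there is only a single implementation to drift. The sketch below is illustrative: `build_features`, the `amount` and `country` fields, and the transforms are invented for the example.

```python
import math
from typing import Callable, Iterable, Mapping


def build_features(raw: Mapping[str, object]) -> list[float]:
    """Single source of truth for feature engineering (hypothetical fields)."""
    amount = float(raw["amount"])
    return [
        math.log1p(max(amount, 0.0)),  # log-scaled amount, negatives floored at 0
        1.0 if raw.get("country") == "US" else 0.0,  # binary country flag
    ]


def make_training_rows(records: Iterable[Mapping[str, object]]) -> list[list[float]]:
    """Training path: the same transform applied to every historical record."""
    return [build_features(record) for record in records]


def serve(model: Callable[[list[float]], float], request: Mapping[str, object]) -> float:
    """Inference path: the identical transform applied to the live request."""
    return model(build_features(request))
```

A regression test that feeds the same raw record through both paths and asserts identical feature vectors catches skew before it reaches production.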