Skip to content

Snippet: Streaming ML & Online Learning

Domain Context

Real-time feature engineering, online model updates, and streaming inference pipelines. Latency and correctness under time pressure are the primary constraints.

Streaming Architecture

  • Clearly separate: event ingestion → feature computation → model serving → output action
  • Each stage must be independently scalable and monitorable
  • Use message queues (Kafka, Pulsar) between stages — never direct function calls in production
  • Exactly-once processing: understand your framework's guarantees and document them
  • Define and enforce SLA per stage: max latency, max backlog, error rate threshold

Real-Time Feature Engineering

  • Point-in-time correctness is non-negotiable — a feature must never use data from the future
  • Windowed aggregations: clearly define window size, slide interval, and late-arrival policy
  • Feature consistency: same feature definition must be used in training (offline) and serving (online)
  • Feature store integration: compute features once, serve everywhere — avoid training/serving skew
  • Late data handling: define a watermark policy — how long to wait for late events before closing the window

Online Learning

  • Model update frequency: tune based on concept drift rate, not arbitrary schedule
  • Learning rate for online updates: lower than batch training (typically 1/10th to 1/100th)
  • Catastrophic forgetting: monitor performance on historical data after each update
  • Replay buffer: maintain a sample of historical data to mix with new data during updates
  • Rollback capability: always maintain the previous model version for instant revert

Concept Drift Detection

  • Monitor input feature distributions in real-time — use PSI (Population Stability Index) or KS test
  • Prediction distribution monitoring: alert when output distribution shifts beyond threshold
  • Supervised drift: when labels arrive (even delayed), compare live accuracy vs. training accuracy
  • Drift response policy: retrain on fresh data, expand training window, or trigger human review
  • Log drift metrics alongside model predictions for retrospective analysis

Inference Pipeline

  • Warm-up: pre-load models and caches before routing traffic — cold-start latency is unacceptable
  • Timeout policy: if inference exceeds SLA, return fallback prediction (cached, rule-based, or default)
  • Feature missing handling: define default values for each feature — never fail silently
  • Batch micro-batching: accumulate requests for short window (10-50ms) for GPU efficiency
  • Shadow mode: run new model alongside production model, log both predictions, compare before switching

Evaluation for Streaming Systems

  • Prequential evaluation: evaluate on each sample BEFORE using it for training (test-then-train)
  • Sliding window metrics: report accuracy on recent window (e.g., last 1 hour, last 1 day)
  • Throughput: events processed per second — must exceed peak ingestion rate
  • End-to-end latency: from event creation to prediction output — measure p50, p95, p99
  • Backpressure metrics: monitor queue depth to detect when processing can't keep up with input

Common Pitfalls

  • Training/serving skew: feature computation differs between offline pipeline and streaming pipeline
  • Clock skew between systems causing temporal feature corruption
  • Unbounded state: streaming aggregations that grow indefinitely → OOM
  • Testing only with steady-state load — also test burst traffic, recovery after failures, and data gaps
  • Assuming events arrive in order — they don't in distributed systems; handle out-of-order explicitly