Snippet: Edge / Mobile LLM Inference¶

Domain Context¶

Deploying LLMs on mobile or edge devices with constrained compute, memory, and battery. Every optimization decision has real-world tradeoffs — document them explicitly.

Optimization Priority Order¶

Quantization (INT8/INT4) — biggest wins with acceptable quality loss
Pruning — effective for attention heads; structured pruning preferred (2:4 sparsity on Ampere+)
Knowledge distillation — best quality retention, highest engineering cost
Architecture changes (e.g., replace MHA with GQA/MQA) — last resort, requires retraining

Benchmarking Standards¶

Always measure on target hardware, not development machine — results are non-transferable
Report: latency (p50, p95, p99), throughput (tokens/sec), memory footprint (peak RSS), model file size
Test across battery levels (100%, 50%, 20%) — thermal throttling reduces perf 30–50% on mobile
Include cold-start latency separately from steady-state latency
Run ≥100 inference passes for each measurement — first 10 are warm-up, discard them
Profile power consumption (mW) using platform tools: Xcode Energy Gauge, Android Battery Historian

Quality vs. Efficiency Tradeoff¶

Always report quality degradation from FP32 baseline alongside efficiency gains
Define acceptable quality loss threshold before starting optimization (e.g., "≤2% accuracy drop")
Run human evaluation for subjective quality checks — perplexity alone is insufficient for user-facing apps
Measure on ≥3 task-specific benchmarks — a single metric can hide regressions in specific domains

Mobile-Specific Rules¶

Test on both iOS (Apple Silicon) and Android (Qualcomm/MediaTek) if cross-platform
Memory budget: assume 2GB max for model + runtime on mid-range devices (4GB on flagship)
Prefer ONNX Runtime Mobile, CoreML, or TFLite for deployment — document which backend and version
Use platform-specific NPU/GPU delegates when available: CoreML ANE, Qualcomm QNN, MediaTek NeuroPilot
Bundle model with app or download on first launch — document the tradeoff and model size budget
Background inference: yield CPU cores to foreground tasks — never pin all cores to model inference
Handle device rotation and app lifecycle: save/restore KV-cache state across onPause/viewDidDisappear

Quantization Guidelines¶

Start with post-training quantization (PTQ) INT8 — cheapest and usually sufficient
If quality drops >2% on key metric, use quantization-aware training (QAT)
Mixed-precision: keep first/last layers and attention logits in FP16, quantize feed-forward to INT8/INT4
Always compare: FP32 → FP16 → INT8 → INT4 — report each step's quality/speed tradeoff in a table
For LLMs: GPTQ or AWQ quantization to 4-bit is the current sweet spot for on-device inference
Calibration dataset: use ≥500 representative samples from production-like distribution for PTQ
Validate quantized model on edge cases: long sequences, rare tokens, multilingual input — failures cluster there

Model Packaging & Updates¶

Version every model artifact with a SHA-256 checksum — the app must verify integrity at load time
OTA model updates: implement automatic rollback if inference error rate > baseline + 5%
Model size budget: <100MB for bundled models, <500MB for on-demand downloads — enforce in CI
Split large models into shards (≤50MB each) for incremental download — users on slow networks will abandon large downloads
Store model metadata alongside binary: input/output shapes, quantization config, min framework version

CI/CD Integration¶

Run inference benchmarks in CI on target device simulators (Xcode Instruments, Android Emulator with HW profile)
Gate deployments on latency regression: p95 latency must not regress >10% vs. previous release
Automated model size checks: fail CI if model exceeds size budget
Include a smoke test that loads the model and runs ≥5 inferences on each target platform
Track latency trend over releases — alert if 3 consecutive releases show >5% regression trend

Common Pitfalls¶

Benchmarking on desktop GPU instead of target device — numbers are meaningless without target hardware
Ignoring thermal throttling: sustained inference perf is 30–50% lower than burst perf on mobile
Assuming ONNX export "just works" — dynamic shapes, custom ops, and control flow frequently break
Model compiles fine but OOMs at runtime — profile peak RSS + 20% headroom, not just model file size
Skipping battery impact testing — a model that drains 10% battery/hour will get your app uninstalled
Quantization without calibration data leads to outlier activations destroying quality — always calibrate
Testing only on high-end devices (iPhone 16 Pro, Pixel 9 Pro) — mid-range devices are 2–3× slower
Ignoring tokenizer overhead: on-device tokenizer can add 50–100ms per request — profile it separately