Snippet: Computer Vision

Domain Context

Covers image and video understanding tasks: classification, detection, segmentation, and generation. Visual data has its own pitfalls around augmentation, resolution, and compute cost.

Data Pipeline

Store images in original resolution; resize on-the-fly during data loading
Validate images on ingest: check for corruption, zero-byte files, unusual aspect ratios
Maintain a class distribution summary — log before every training run
For detection/segmentation: validate annotation format (COCO, VOC, YOLO) before training starts
Dataset versioning is mandatory — hash the image list + annotations together
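
The versioning and class-summary items above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed tool: the function names `dataset_fingerprint` and `class_distribution`, the toy paths, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
from collections import Counter


def dataset_fingerprint(image_digests, annotations):
    """Return one SHA-256 hex digest covering the image list and annotations.

    image_digests: dict mapping relative image path -> content hash (hex str).
    annotations:   any JSON-serializable object (e.g. a parsed annotation file).
    """
    h = hashlib.sha256()
    # Sort paths so the fingerprint does not depend on directory listing order.
    for path in sorted(image_digests):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(image_digests[path].encode("ascii"))
        h.update(b"\n")
    # sort_keys makes the serialized annotations independent of dict key order.
    serialized = json.dumps(annotations, sort_keys=True, separators=(",", ":"))
    h.update(serialized.encode("utf-8"))
    return h.hexdigest()


def class_distribution(labels):
    """Count samples per class, for logging before a training run."""
    return dict(Counter(labels))


# Hypothetical toy data: two images (represented by fake bytes) and their labels.
images = {
    "train/cat_001.jpg": hashlib.sha256(b"fake cat bytes").hexdigest(),
    "train/dog_001.jpg": hashlib.sha256(b"fake dog bytes").hexdigest(),
}
annotations = {"train/cat_001.jpg": "cat", "train/dog_001.jpg": "dog"}

fingerprint = dataset_fingerprint(images, annotations)
print("dataset version:", fingerprint[:12])
print("class counts:", class_distribution(annotations.values()))
```

Because images and annotations feed a single digest, changing either one produces a new version string, which can then be logged next to each training run.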

Augmentation Strategy

Start with standard augmentations: random crop, horizontal flip, color jitter; then normalize (a deterministic preprocessing step, applied to every split)
Test-time augmentation (TTA): use for final evaluation, not during development iteration
Domain-specific augmentations: medical (elastic deform), satellite (rotation invariance), etc.
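Never apply random training augmentations to the validation/test set; use only deterministic preprocessing there (resize, center crop, normalize). Test-time augmentation is the one deliberate exception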
Document the augmentation pipeline in config — random augmentations must be reproducible via seed
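
One way to make random augmentations reproducible from a seed is to draw every augmentation parameter from a dedicated, seeded generator. The sketch below only samples the parameters (crop offsets, flip, brightness factor) and does not touch pixels; the function names, image sizes, and jitter range are illustrative assumptions.

```python
import random


def sample_augmentation(rng, image_size=256, crop_size=224):
    """Draw parameters for one random crop + horizontal flip + brightness jitter."""
    max_offset = image_size - crop_size
    return {
        "crop_left": rng.randint(0, max_offset),
        "crop_top": rng.randint(0, max_offset),
        "flip": rng.random() < 0.5,
        "brightness": rng.uniform(0.8, 1.2),
    }


def augmentation_plan(seed, num_samples):
    # A dedicated Random instance keeps augmentation draws isolated from
    # any other code that uses the global random module.
    rng = random.Random(seed)
    return [sample_augmentation(rng) for _ in range(num_samples)]


plan_a = augmentation_plan(seed=1234, num_samples=3)
plan_b = augmentation_plan(seed=1234, num_samples=3)

print(plan_a == plan_b)  # True: the same seed reproduces the same draws
```

Storing the seed in the run config alongside the augmentation settings is then enough to replay the exact sequence of draws, provided the order of samples is also fixed.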

Model Selection

Use pretrained backbones by default — training from scratch requires strong justification
Architecture heuristics:
Classification: start with EfficientNet/ConvNeXt; consider a pretrained ViT once you have more than roughly 10K training images
Detection: YOLO family for speed; DETR family for accuracy
Segmentation: U-Net for medical; Mask2Former for general purpose
Always benchmark against the previous SOTA on your specific dataset, not just ImageNet numbers

Training Practices

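Learning rate: use a cosine schedule with warmup; run an LR finder to pick the initial value
Freeze the backbone initially (2-5 epochs), then unfreeze it with a learning rate about 10x smaller than the head's
Mixed precision (bf16/fp16) by default on hardware that supports it; fp16 needs loss scaling to avoid gradient underflow, bf16 usually does not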
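Monitor both loss and metric curves: if validation loss rises while the metric holds or improves (or the reverse), check for overfitting, miscalibrated confidences, or a bug in the metric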
Save best checkpoint by validation metric, not by lowest loss
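
The cosine-with-warmup schedule mentioned above can be written as a plain function of the step index. This is a minimal sketch assuming linear warmup and a single cosine decay with no restarts; the name `lr_at_step` and its parameters are chosen for illustration and do not correspond to any particular framework's API.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Example values (assumed): 100 steps, 10 of them warmup, peak lr 1e-3.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at_step(step, total_steps=100, base_lr=1e-3, warmup_steps=10))
```

For the freeze-then-unfreeze practice, the same function could be evaluated twice per step, once with the head's `base_lr` and once with a value 10x smaller for the backbone.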

Evaluation

Report: accuracy, precision, recall, F1 (per-class for imbalanced datasets)
For detection: mAP@0.5 and mAP@0.5:0.95; report per-class AP for important classes
Visualize predictions on failure cases; do not rely on aggregate metrics alone
Test on different conditions: lighting, occlusion, scale variation if applicable
Inference benchmarks: FPS and latency on target hardware
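
To show why per-class reporting matters on imbalanced data, here is a small pure-Python sketch that computes precision, recall, and F1 per class from two label lists. The function name `per_class_metrics` and the toy "ok"/"defect" labels are assumptions for illustration; in practice a metrics library would normally be used instead.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cls] + fp[cls]
        actual = tp[cls] + fn[cls]
        precision = tp[cls] / predicted if predicted else 0.0
        recall = tp[cls] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = (precision, recall, f1)
    return metrics


# Hypothetical imbalanced toy labels: 8 "ok" samples and 2 "defect" samples.
y_true = ["ok"] * 8 + ["defect"] * 2
y_pred = ["ok"] * 8 + ["ok", "defect"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
print("accuracy:", accuracy)
for cls, (precision, recall, f1) in metrics.items():
    print(cls, round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case overall accuracy is 0.9, yet recall on the rare "defect" class is only 0.5, which the aggregate number hides.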

Common Pitfalls

Data leakage: same object/scene in both train and test splits (especially video frames)
Resolution mismatch: training at 224px but deploying on 1080p input without matching the resize, crop, or tiling used during training
Augmentation too aggressive → model underfits; too weak → model overfits
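Batch normalization (and dropout) behave differently at train vs. eval: call model.eval() before validation or inference, and model.train() again before resuming training
Color space issues: OpenCV loads BGR, PIL loads RGB; mixing them causes silent bugs, so convert explicitly (e.g. cv2.cvtColor(img, cv2.COLOR_BGR2RGB))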
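
The data-leakage pitfall for video frames can be avoided by splitting on the video (or scene) identifier rather than on individual frames. The sketch below assigns each group to a split by hashing its ID, so every frame of a video lands in the same split; the names `assign_split` and `split_frames`, the 20% test fraction, and the path layout are assumptions for illustration.

```python
import hashlib


def assign_split(group_id, test_fraction=0.2):
    """Deterministically map a group (e.g. a video ID) to 'train' or 'test'."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_frames(frames):
    """frames: iterable of (frame_path, video_id) pairs."""
    splits = {"train": [], "test": []}
    for frame_path, video_id in frames:
        splits[assign_split(video_id)].append(frame_path)
    return splits


# Hypothetical frame list: 5 videos with 4 frames each.
frames = [
    (f"video_{v:02d}/frame_{i:03d}.jpg", f"video_{v:02d}")
    for v in range(5)
    for i in range(4)
]
splits = split_frames(frames)

train_videos = {path.split("/")[0] for path in splits["train"]}
test_videos = {path.split("/")[0] for path in splits["test"]}
print(train_videos & test_videos)  # set(): no video appears in both splits
```

With only a handful of groups the realized test fraction can differ noticeably from the target; the guarantee here is the absence of overlap, not an exact ratio.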