Snippet: Vision-Language Models (VLM)¶

Domain Context¶

This project involves VLM training, fine-tuning, or dataset construction. Images and text are jointly processed — treat multimodal alignment as a first-class concern.

Data Quality Rules¶

Caption quality gate: CLIPScore > 0.28 before adding to training set; reject below 0.20 outright
Image resolution minimum: 224×224; log and skip corrupted files rather than crashing
Aspect ratio: preserve original aspect ratios — pad or resize, never distort
De-duplicate images by perceptual hash (pHash, hamming distance ≤ 8) — near-duplicates bias contrastive learning
For multilingual captions, verify alignment per language pair (not just source language quality)
Validate image-text pairing: run a CLIP retrieval check — if the correct caption is not in top-10 for 20%+ of images, re-clean the dataset
Log dataset health: images per size bucket, caption length distribution (median, p5, p95), language mix

Multi-Stage Pipeline¶

Each pipeline stage output must be checkpointed separately (never recompute from scratch)
Stage naming: 00_raw → 01_filtered → 02_captioned → 03_quality_checked → 04_final
Log per-stage statistics: count, rejection rate, avg quality score
Set max retry = 3 for captioning API calls; log failures separately — do not silently skip images

Training & Fine-Tuning¶

Contrastive loss is sensitive to batch size — minimum effective batch size is 256 for CLIP-style; document actual batch size in every run
Use gradient checkpointing for models >1B params — memory savings outweigh speed cost
Freeze vision encoder first, fine-tune text adapter, then optionally unfreeze with LR = base_lr / 10
Mixed-precision: BF16 default on Ampere+; use FP32 for loss computation to avoid NaN instability
Learning rate schedule: warmup 5–10% of steps, cosine decay to 1e-6 — log the schedule
Image augmentation: use RandomResizedCrop + horizontal flip during training; disable augmentation during eval
Monitor vision–text loss components separately — an imbalanced loss indicates alignment drift

Model Selection Heuristics¶

Image understanding: start with LLaVA-style architecture for general VLM tasks
For OCR / document understanding: prefer models with high-res support (>768px) — InternVL, Qwen-VL, GOT-OCR
For video: decide frame sampling strategy before choosing architecture (uniform vs. keyframe, max 32 frames typical)
Small models (<3B): MobileVLM, PaliGemma — target edge/mobile deployment
Region-level tasks (grounding, referring): use models with coordinate output (Kosmos-2, Qwen-VL, Ferret)
Document the tokenizer's image token budget: e.g., LLaVA uses 576 tokens per image — this affects max context

Inference & Deployment¶

Image preprocessing must match training config exactly — resolution, normalization, padding strategy
KV-cache management: vision tokens consume significant cache; profile memory before setting batch size
For multi-image inputs, document the max image count the model supports — silent truncation is common
Streaming output: verify that partial token generation doesn't produce hallucinated image descriptions
Latency budget: profile encoder vs. LLM decode time separately — encoder is often 40–60% of total latency
Quantization: INT8 for vision encoder is usually safe; INT4 for LLM backbone — verify with ≥200 test images

Evaluation¶

Report both: image→text retrieval and text→image retrieval (R@1, R@5, R@10)
For generative VLMs: use GPT-4o or human evaluation for open-ended quality — BLEU < 0.3 correlation with human judgment on VLM tasks
Benchmark on ≥2 standard sets: MMMU, MMBench, SEEDBench, RealWorldQA — report version and split
Hallucination detection: run object presence checks — does the model describe objects not in the image? (use POPE benchmark, target F1 > 0.85)
Human evaluation required for cultural/locale-specific accuracy — CLIPScore alone is insufficient
Report inference cost: tokens/image, average latency per image, peak GPU memory

Common Pitfalls¶

Caption hallucinations on low-resolution inputs — the model invents details it cannot see
CLIP embeddings are poor for non-English text — verify with dedicated multilingual CLIP (e.g., OpenCLIP xlm-roberta)
Batch size affects contrastive loss significantly — note batch size in every experiment log
Training on image-text pairs with misaligned captions silently degrades quality — verify alignment before training
Forgetting to normalize inputs to the model's expected range ([0,1] vs. [-1,1] vs. ImageNet mean/std)
Resizing images with wrong interpolation (nearest vs. bilinear vs. bicubic) — match the original training config
Ignoring EXIF orientation: rotated images get fed in wrong orientation — use ImageOps.exif_transpose() before processing