Snippet: Vision-Language Models (VLM)

Domain Context

This project involves VLM training, fine-tuning, or dataset construction. Images and text are jointly processed — treat multimodal alignment as a first-class concern.

Data Quality Rules

  • Caption quality gate: CLIPScore > 0.28 before adding to training set; reject below 0.20 outright
  • Image resolution minimum: 224×224; log and skip corrupted files rather than crashing
  • Aspect ratio: preserve original aspect ratios — pad or resize, never distort
  • De-duplicate images by perceptual hash (pHash, Hamming distance ≤ 8); near-duplicates bias contrastive learning
  • For multilingual captions, verify alignment per language pair (not just source language quality)
  • Validate image-text pairing: run a CLIP retrieval check; if the correct caption falls outside the top 10 for 20% or more of images, re-clean the dataset
  • Log dataset health: images per size bucket, caption length distribution (median, p5, p95), language mix
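
A minimal sketch of three of the gates above, assuming CLIPScores, 64-bit perceptual hashes, and retrieval ranks have already been computed elsewhere. The function names are illustrative, and the "review" outcome for scores between 0.20 and 0.28 is an assumption, since the rules do not say how to treat that band.

```python
from typing import Dict, List, Sequence

ACCEPT_ABOVE = 0.28   # CLIPScore needed to enter the training set
REJECT_BELOW = 0.20   # CLIPScore below which a pair is dropped outright


def gate_caption(clip_score: float) -> str:
    """Classify one image-caption pair by its precomputed CLIPScore."""
    if clip_score > ACCEPT_ABOVE:
        return "accept"
    if clip_score < REJECT_BELOW:
        return "reject"
    # Assumption: the middle band is held for review or re-captioning.
    return "review"


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dedupe_by_phash(hashes: Dict[str, int], max_distance: int = 8) -> List[str]:
    """Keep the first image of every near-duplicate group (quadratic sketch)."""
    kept: List[str] = []
    for image_id, value in hashes.items():
        if all(hamming_distance(value, hashes[k]) > max_distance for k in kept):
            kept.append(image_id)
    return kept


def needs_recleaning(ranks: Sequence[int], k: int = 10,
                     max_miss_rate: float = 0.20) -> bool:
    """ranks[i] is the 1-based rank of image i's own caption in retrieval."""
    misses = sum(1 for rank in ranks if rank > k)
    return misses / len(ranks) >= max_miss_rate
```

The pairwise comparison in `dedupe_by_phash` is quadratic in the number of kept images, so a large dataset would likely need an index (for example bucketing by hash prefix) instead.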

Multi-Stage Pipeline

  • Each pipeline stage output must be checkpointed separately (never recompute from scratch)
  • Stage naming: 00_raw → 01_filtered → 02_captioned → 03_quality_checked → 04_final
  • Log per-stage statistics: count, rejection rate, avg quality score
  • Set max retry = 3 for captioning API calls; log failures separately — do not silently skip images
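
The retry and logging rules above could look roughly like this. It is a sketch under two assumptions: "max retry = 3" is read as up to three retries after the first attempt, and `caption_fn` stands in for whatever captioning client the project uses. All names here are hypothetical.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("captioning")

MAX_RETRIES = 3  # retries allowed after the first attempt (interpretation)


def caption_with_retry(
    image_id: str,
    caption_fn: Callable[[str], str],
    failures: List[Tuple[str, str]],
    max_retries: int = MAX_RETRIES,
) -> Optional[str]:
    """Call caption_fn, retrying on errors; record the image if all attempts fail."""
    last_error = ""
    for attempt in range(1 + max_retries):
        try:
            return caption_fn(image_id)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = repr(exc)
            logger.warning("caption attempt %d failed for %s: %s",
                           attempt + 1, image_id, last_error)
    # Never skip silently: the failure goes into its own list for later review.
    failures.append((image_id, last_error))
    return None


def stage_stats(n_in: int, n_out: int, scores: List[float]) -> dict:
    """Per-stage statistics: count, rejection rate, average quality score."""
    return {
        "count": n_out,
        "rejection_rate": (n_in - n_out) / n_in if n_in else 0.0,
        "avg_quality": sum(scores) / len(scores) if scores else None,
    }
```

Writing the `failures` list to its own file next to the stage checkpoint (for example under `02_captioned`) keeps failed images visible without blocking the rest of the stage.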

Training & Fine-Tuning

  • Contrastive loss is sensitive to batch size: the minimum effective batch size for CLIP-style training is 256; document the actual batch size in every run
  • Use gradient checkpointing for models >1B params — memory savings outweigh speed cost
  • Freeze the vision encoder first and fine-tune the text adapter; then optionally unfreeze the vision encoder with LR = base_lr / 10
  • Mixed-precision: BF16 default on Ampere+; use FP32 for loss computation to avoid NaN instability
  • Learning rate schedule: warmup 5–10% of steps, cosine decay to 1e-6 — log the schedule
  • Image augmentation: use RandomResizedCrop + horizontal flip during training; disable augmentation during eval
  • Monitor vision–text loss components separately — an imbalanced loss indicates alignment drift
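
The warmup-plus-cosine schedule above can be sketched in plain Python. This assumes linear warmup (the rule does not specify the warmup shape) and a 5% warmup fraction by default. `lr_at_step` and its parameters are illustrative names, not part of any framework.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float,
               warmup_frac: float = 0.05, min_lr: float = 1e-6) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Assumption: warmup is linear and reaches base_lr on its last step.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Log the schedule at a few points, as the rule asks.
    for s in (0, 49, 50, 500, 1000):
        print(s, lr_at_step(s, total_steps=1000, base_lr=1e-4))
```

For the optional unfreezing step, the same function can be reused for the vision encoder's parameter group with `base_lr / 10` as its peak.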

Model Selection Heuristics

  • Image understanding: start with LLaVA-style architecture for general VLM tasks
  • For OCR / document understanding: prefer models with high-res support (>768px) — InternVL, Qwen-VL, GOT-OCR
  • For video: decide frame sampling strategy before choosing architecture (uniform vs. keyframe, max 32 frames typical)
  • Small models (≤3B): MobileVLM, PaliGemma; target edge/mobile deployment
  • Region-level tasks (grounding, referring): use models with coordinate output (Kosmos-2, Qwen-VL, Ferret)
  • Document the model's image token budget: e.g., LLaVA-1.5 uses 576 tokens per image (a 24×24 patch grid at 336px); this affects max context
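
The image token budget can be worked out with simple arithmetic. The sketch below assumes a ViT-style encoder that emits one token per square patch, with no class token, pooling, or tiling; models that resample or tile images will differ. The function names and the 4096-token context in the example are illustrative.

```python
def image_tokens(image_size: int, patch_size: int) -> int:
    """Tokens for one square image, assuming one token per patch."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def remaining_text_tokens(context_length: int, n_images: int,
                          tokens_per_image: int) -> int:
    """Context left for text after the image tokens are placed."""
    used = n_images * tokens_per_image
    if used > context_length:
        # Fail loudly instead of letting the model truncate silently.
        raise ValueError(f"{n_images} images need {used} tokens, "
                         f"context is {context_length}")
    return context_length - used


if __name__ == "__main__":
    per_image = image_tokens(336, 14)  # 24 x 24 patches
    print(per_image, remaining_text_tokens(4096, 2, per_image))
```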

Inference & Deployment

  • Image preprocessing must match training config exactly — resolution, normalization, padding strategy
  • KV-cache management: vision tokens consume significant cache; profile memory before setting batch size
  • For multi-image inputs, document the max image count the model supports — silent truncation is common
  • Streaming output: verify that partial token generation doesn't produce hallucinated image descriptions
  • Latency budget: profile encoder vs. LLM decode time separately — encoder is often 40–60% of total latency
  • Quantization: INT8 for the vision encoder is usually safe; before using INT4 for the LLM backbone, verify output quality on ≥200 test images
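
One way to profile encoder and decode time separately is a small accumulating timer. This is a sketch with hypothetical names (`StageTimer`, `track`, `share`). For GPU work the device must be synchronized before the clock is read (in PyTorch, `torch.cuda.synchronize()`), otherwise asynchronous kernels make the split meaningless.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = {}

    @contextmanager
    def track(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def share(self, name: str) -> float:
        """Fraction of all tracked time spent in the named stage."""
        total = sum(self.totals.values())
        return self.totals.get(name, 0.0) / total if total else 0.0


if __name__ == "__main__":
    timer = StageTimer()
    with timer.track("encoder"):
        time.sleep(0.02)   # stand-in for the vision encoder forward pass
    with timer.track("decode"):
        time.sleep(0.03)   # stand-in for LLM token generation
    print({name: round(timer.share(name), 2) for name in timer.totals})
```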

Evaluation

  • Report both: image→text retrieval and text→image retrieval (R@1, R@5, R@10)
  • For generative VLMs: use GPT-4o or human evaluation for open-ended quality; BLEU correlates weakly (< 0.3) with human judgment on VLM tasks
  • Benchmark on ≥2 standard sets: MMMU, MMBench, SEEDBench, RealWorldQA — report version and split
  • Hallucination detection: run object presence checks — does the model describe objects not in the image? (use POPE benchmark, target F1 > 0.85)
  • Human evaluation required for cultural/locale-specific accuracy — CLIPScore alone is insufficient
  • Report inference cost: tokens/image, average latency per image, peak GPU memory
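
Both retrieval directions can be computed from a single similarity matrix. The sketch below assumes exactly one ground-truth caption per image, stored at the same index (`sim[i][j]` is the similarity of image i and caption j). Datasets with several captions per image need a different hit test. Function names are illustrative.

```python
from typing import Dict, List, Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of rows whose matching column index ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)


def retrieval_report(sim: Sequence[Sequence[float]],
                     ks: Sequence[int] = (1, 5, 10)) -> Dict[str, Dict[int, float]]:
    """Recall@K for image-to-text (rows) and text-to-image (columns)."""
    transposed: List[List[float]] = [list(col) for col in zip(*sim)]
    return {
        "image_to_text": {k: recall_at_k(sim, k) for k in ks},
        "text_to_image": {k: recall_at_k(transposed, k) for k in ks},
    }


if __name__ == "__main__":
    toy = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.1],   # image 1 prefers caption 0: a miss at K=1
        [0.1, 0.2, 0.7],
    ]
    print(retrieval_report(toy, ks=(1, 2)))
```

The toy matrix shows why both directions are reported: image 1 retrieves the wrong caption at K=1, while every caption still retrieves its own image.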

Common Pitfalls

  • Caption hallucinations on low-resolution inputs — the model invents details it cannot see
  • CLIP embeddings are poor for non-English text; verify with a dedicated multilingual CLIP (e.g., an OpenCLIP model with an XLM-RoBERTa text encoder)
  • Batch size affects contrastive loss significantly — note batch size in every experiment log
  • Training on image-text pairs with misaligned captions silently degrades quality — verify alignment before training
  • Forgetting to normalize inputs to the model's expected range ([0,1] vs. [-1,1] vs. ImageNet mean/std)
  • Resizing images with wrong interpolation (nearest vs. bilinear vs. bicubic) — match the original training config
  • Ignoring EXIF orientation: rotated images get fed in the wrong orientation; apply Pillow's ImageOps.exif_transpose() before processing
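
The three input ranges named in the normalization pitfall differ enough that a mismatch is easy to demonstrate on a single pixel. The sketch below uses plain Python on one 8-bit RGB value; the scheme names ("unit", "symmetric", "imagenet") are labels made up for this example, while the mean and std constants are the standard ImageNet values.

```python
from typing import Tuple

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

Pixel = Tuple[float, float, float]


def normalize_pixel(rgb: Tuple[int, int, int], scheme: str) -> Pixel:
    """Map one 8-bit RGB pixel into the range a model expects."""
    unit = tuple(channel / 255.0 for channel in rgb)
    if scheme == "unit":        # [0, 1]
        return unit
    if scheme == "symmetric":   # [-1, 1]
        return tuple(2.0 * c - 1.0 for c in unit)
    if scheme == "imagenet":    # per-channel (x - mean) / std
        return tuple((c - m) / s
                     for c, m, s in zip(unit, IMAGENET_MEAN, IMAGENET_STD))
    raise ValueError(f"unknown normalization scheme: {scheme}")


if __name__ == "__main__":
    for scheme in ("unit", "symmetric", "imagenet"):
        print(scheme, normalize_pixel((255, 0, 128), scheme))
```

Recording the scheme name alongside resolution and interpolation in the training config makes it possible to assert at inference time that the same one is in use.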