Snippet: Data Labeling & Annotation Quality
Domain Context
Managing the data labeling lifecycle: guideline creation, annotator coordination, and quality assurance. Label quality sets the ceiling on model quality; a model trained on bad labels cannot fix them.
Labeling Guidelines
- Write annotation guidelines BEFORE any labeling begins, and fix problems by revising the guidelines rather than issuing ad-hoc corrections
- Guidelines must include: task definition, label taxonomy, edge case examples, and "when in doubt" rules
- Include at least 5 positive and 5 negative examples per label category
- Version control the guidelines — annotators must always use the latest version
- When guidelines change, re-label a sample of previously labeled data for consistency
Quality Metrics
- Inter-Annotator Agreement (IAA): compute before using any labels for training
- Cohen's kappa ≥ 0.7 for two annotators; Fleiss' kappa ≥ 0.6 for three or more annotators
- If IAA is low, the task definition is probably ambiguous: fix the guidelines before blaming annotators
- Gold standard sets: maintain a curated set of 100+ expert-labeled examples for calibration
- Every annotator batch must include 10-15% hidden gold standard items for ongoing quality monitoring
- Track per-annotator accuracy — remove or retrain consistently underperforming annotators
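The agreement and gold-standard checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `cohens_kappa` and `gold_accuracy` and the `(annotator_id, item_id, label)` tuple layout are assumptions made for the example, and only the two-annotator case (Cohen's kappa) is covered.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both annotators used one identical label
        # throughout; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def gold_accuracy(annotations, gold):
    """Per-annotator accuracy on hidden gold items.

    annotations: iterable of (annotator_id, item_id, label) tuples (assumed layout)
    gold: dict mapping item_id to the expert label
    """
    hits, totals = Counter(), Counter()
    for annotator_id, item_id, label in annotations:
        if item_id in gold:  # only hidden gold items count toward accuracy
            totals[annotator_id] += 1
            hits[annotator_id] += int(label == gold[item_id])
    return {annotator: hits[annotator] / totals[annotator] for annotator in totals}
```

A kappa below the chosen threshold would send the team back to the guidelines, while the per-annotator dictionary can feed whatever retraining or removal policy is in place.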
Labeling Workflow
- Pilot round: label 50-100 items, compute IAA, refine guidelines, repeat until threshold met
- Production round: assign 2-3 annotators per item for critical tasks; resolve non-critical items by majority vote (with 2 annotators, any disagreement is a tie and needs adjudication)
- Adjudication: disagreements on critical items must be resolved by expert review, not majority vote
- Log annotation metadata: annotator ID, timestamp, time spent per item, tool version
- Implement annotator calibration sessions: regular alignment meetings on ambiguous cases
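One way to combine majority voting with expert adjudication is sketched below. The function name `aggregate_labels`, the `critical` flag, and the `(label, needs_adjudication)` return shape are hypothetical choices for illustration; the routing rule (ties and any disagreement on critical items go to an expert) is one reading of the workflow above.

```python
from collections import Counter


def aggregate_labels(votes, critical=False):
    """Return (label, needs_adjudication) for a single item.

    votes: list of labels, one per annotator.
    critical: if True, any disagreement is routed to expert review.
    """
    if not votes:
        raise ValueError("an item needs at least one vote")
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    if len(counts) == 1:
        # All annotators agree, so no review is needed.
        return top_label, False
    tie = counts[1][1] == top_count
    if critical or tie:
        # Leave the label unset until an expert resolves the disagreement.
        return None, True
    return top_label, False
```

For example, `aggregate_labels(["spam", "spam", "ham"])` would return the majority label, while the same votes with `critical=True` would be flagged for adjudication instead.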
Active Learning Integration
- Start with random sampling, then switch to uncertainty sampling once an initial model is trained
- Query strategies: uncertainty, diversity, or committee disagreement — document the choice
- Human review budget: define how many items can be labeled per iteration (fixed budget)
- Re-evaluate model after each active learning cycle — track diminishing returns
- Stopping criterion: stop when a labeling cycle improves the evaluation metric by less than 1%
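A rough sketch of one active learning cycle's selection and stopping logic follows. It assumes entropy as the uncertainty measure and treats the 1% threshold as an absolute gain on a 0-1 metric; both are assumptions, as are the names `select_for_labeling` and `should_stop`.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` most uncertain items.

    predictions: dict mapping item_id to a list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


def should_stop(metric_history, min_gain=0.01):
    """True when the latest cycle improved the metric by less than `min_gain`.

    Assumes an absolute gain on a 0-1 metric; a relative threshold is an
    equally plausible reading of "1%".
    """
    if len(metric_history) < 2:
        return False  # not enough cycles to measure a gain
    return metric_history[-1] - metric_history[-2] < min_gain
```

In practice the stopping check is often smoothed over several cycles so that one noisy evaluation does not end labeling early.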
Data Pipeline Integration
- Raw annotations → cleaned annotations → final labels: maintain all three versions
- Label format: use standard formats (COCO for vision, IOB2 for NER, etc.)
- Track label distribution throughout the pipeline — flag unexpected class imbalance shifts
- Deduplication: ensure the same item is not labeled multiple times by accident (deliberate overlap for IAA computation or multi-annotator voting is fine)
- PII review: check labeled data for inadvertent personal information before use
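Tracking label distribution between pipeline stages can be as simple as comparing class proportions. The sketch below flags any class whose share moves by more than a threshold; the function names and the 0.05 default for `max_abs_shift` are illustrative assumptions, not recommended values.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its share of the total."""
    if not labels:
        raise ValueError("cannot compute a distribution over zero labels")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def flag_shifts(reference_labels, current_labels, max_abs_shift=0.05):
    """Return {label: change in share} for classes that shifted beyond the threshold.

    reference_labels: labels from an earlier stage (for example raw annotations).
    current_labels: labels from a later stage (for example final labels).
    """
    ref = label_distribution(reference_labels)
    cur = label_distribution(current_labels)
    flagged = {}
    for label in sorted(set(ref) | set(cur)):
        # A class missing from one side counts as a share of 0.0 there.
        delta = cur.get(label, 0.0) - ref.get(label, 0.0)
        if abs(delta) > max_abs_shift:
            flagged[label] = delta
    return flagged
```

Running this between the raw, cleaned, and final label versions gives an early signal when a cleaning step drops or inflates one class disproportionately.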
Common Pitfalls
- Labeler fatigue: quality degrades after 2-3 hours of continuous labeling — enforce breaks
- Majority class bias: annotators default to the most common label when uncertain
- Guideline drift: annotators gradually reinterpret guidelines without explicit correction
- Sampling bias: items sent for labeling don't represent the production data distribution
- Ignoring disagreement: items with low annotator agreement carry noisy labels that confuse models, so handle them explicitly (for example, by adjudication or exclusion)