Snippet: General NLP
Domain Context
Text classification, named entity recognition, sentiment analysis, text summarization, and other core NLP tasks. Pre-LLM techniques remain highly relevant for production systems where cost, latency, and interpretability matter.
Task Selection Guide
- Classification: start with TF-IDF + logistic regression baseline before using transformers
- NER: use token classification with a pretrained transformer; spaCy for quick prototyping
- Summarization: extractive first (cheaper, more faithful), abstractive only when extractive is insufficient
- Sentiment: few-shot LLM for prototyping, fine-tuned classifier for production (cost + latency)
- Document the rationale for the choice; it depends on data volume, latency budget, accuracy requirements, and interpretability needs
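As an illustration of the "extractive first" advice, here is a minimal sketch of a word-frequency extractive summarizer using only the standard library. The scoring rule (mean frequency of content words), the tiny stopword list, and the names `split_sentences` and `extractive_summary` are assumptions made for this example, not a method prescribed by this guide. Because sentences are copied verbatim, the output cannot introduce facts absent from the input, which is the faithfulness property the guide refers to.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "on", "for", "was"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text: str, max_sentences: int = 2) -> list[str]:
    """Return the highest-scoring sentences verbatim, in their original order."""
    sentences = split_sentences(text)
    tokenized = [
        [w for w in re.findall(r"[a-z0-9']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    # Score each sentence by the mean document frequency of its content words.
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:max_sentences])]
```

A baseline like this is cheap enough to run everywhere, which makes it a useful reference point before paying for an abstractive model.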
Text Preprocessing
- Document the preprocessing pipeline explicitly: lowercase, stopwords, stemming/lemmatization choices
- Never silently truncate text — log truncation events and their frequency
- Handle encoding issues upfront: decode as UTF-8, apply a consistent Unicode normalization form (e.g. NFC), remove control characters, fix mojibake
- Language detection: run on all inputs; route non-target-language text to an appropriate handler
- Tokenization: verify tokenizer handles your domain vocabulary (medical, legal, code, etc.)
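Two of the points above, Unicode cleanup and logged truncation, can be sketched with the standard library alone. The names `clean_text` and `Truncator` are hypothetical, and the choice of NFC and of keeping newlines and tabs are assumptions for this example. Mojibake repair and language detection are out of scope here and would typically need a dedicated library.

```python
import logging
import unicodedata

logger = logging.getLogger("preprocess")


def clean_text(text: str) -> str:
    """NFC-normalize and drop control characters, keeping newlines and tabs."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


class Truncator:
    """Truncates token lists and counts events so the rate can be reported."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.seen = 0
        self.truncated = 0

    def __call__(self, tokens: list[str]) -> list[str]:
        self.seen += 1
        if len(tokens) > self.max_tokens:
            self.truncated += 1
            logger.warning(
                "truncated input from %d to %d tokens", len(tokens), self.max_tokens
            )
            return tokens[: self.max_tokens]
        return tokens

    @property
    def truncation_rate(self) -> float:
        """Fraction of inputs seen so far that were truncated."""
        return self.truncated / self.seen if self.seen else 0.0
```

Reporting `truncation_rate` alongside model metrics makes it visible when a meaningful share of inputs loses content before the model sees it.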
Model Selection
- Baseline progression: rule-based → TF-IDF + classical ML → fine-tuned transformer → LLM
- For labeled data <1K examples: few-shot LLM or SetFit (no full fine-tune)
- For labeled data 1K-50K: fine-tuned BERT-family model (domain-specific if available)
- For labeled data >50K: fine-tuned transformer with hyperparameter optimization
- Always document model size, inference latency, and cost alongside accuracy
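The data-size thresholds above can be encoded as a small helper, and latency can be measured rather than guessed. This is a sketch under stated assumptions: `suggest_approach` and `measure_latency_ms` are hypothetical names, `predict` stands in for any callable that runs one inference, and the percentile computation is a simple index-based approximation.

```python
import statistics
import time
from typing import Callable, Sequence


def suggest_approach(n_labeled: int) -> str:
    """Map labeled-example count to the starting point listed in this guide."""
    if n_labeled < 0:
        raise ValueError("n_labeled must be non-negative")
    if n_labeled < 1_000:
        return "few-shot LLM or SetFit"
    if n_labeled <= 50_000:
        return "fine-tuned BERT-family model"
    return "fine-tuned transformer with hyperparameter optimization"


def measure_latency_ms(
    predict: Callable[[str], object],
    inputs: Sequence[str],
    warmup: int = 3,
) -> tuple[float, float]:
    """Return (p50, p95) per-call latency in milliseconds; inputs must be non-empty."""
    for x in inputs[:warmup]:
        predict(x)  # warm caches before timing
    timings = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p50 = statistics.median(timings)
    p95 = timings[min(len(timings) - 1, int(0.95 * len(timings)))]
    return p50, p95
```

Recording the p50 and p95 values next to accuracy for each candidate keeps the comparison honest when a larger model wins on quality but loses on serving cost.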
Evaluation
- Classification: precision, recall, F1 per class + macro/weighted average; confusion matrix
- NER: entity-level F1 (not token-level); decide explicitly whether partial span matches count as correct, and document that policy
- Summarization: ROUGE-L + human evaluation for faithfulness
- Report metrics on each label/entity type separately — aggregate metrics hide class-level failures
- Error analysis: manually review ≥50 misclassified examples to identify patterns
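To make the per-class and entity-level metrics concrete, here is a dependency-free sketch. The function names are hypothetical, and the NER scorer assumes one particular partial-match policy: an entity counts only if its `(start, end, type)` triple matches exactly. Other policies are possible and should be stated wherever results are reported.

```python
from collections import Counter


def per_class_report(y_true: list[str], y_pred: list[str]) -> dict[str, dict]:
    """Precision, recall, F1 and support for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    support = Counter(y_true)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pred_n = tp[label] + fp[label]
        true_n = tp[label] + fn[label]
        precision = tp[label] / pred_n if pred_n else 0.0
        recall = tp[label] / true_n if true_n else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision, "recall": recall,
            "f1": f1, "support": support[label],
        }
    return report


def macro_f1(report: dict[str, dict]) -> float:
    """Unweighted mean of per-class F1."""
    return sum(r["f1"] for r in report.values()) / len(report)


def weighted_f1(report: dict[str, dict]) -> float:
    """Per-class F1 weighted by the number of true examples of each class."""
    total = sum(r["support"] for r in report.values())
    return sum(r["f1"] * r["support"] for r in report.values()) / total


def entity_f1(gold: set[tuple[int, int, str]], pred: set[tuple[int, int, str]]) -> float:
    """Entity-level micro F1 under exact (start, end, type) matching."""
    tp = len(gold & pred)
    return 2 * tp / (len(gold) + len(pred)) if tp else 0.0
```

On imbalanced data the macro and weighted averages can differ noticeably, which is one reason to report both together with the per-class rows.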
Production Considerations
- Input length distribution matters: profile it and set max_length accordingly
- Batch inference: group by similar length to minimize padding waste
- Model distillation: if transformer is accurate but slow, distill to smaller model for serving
- Caching: cache predictions for identical or near-identical inputs
- Graceful degradation: define fallback behavior when model confidence is below threshold
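Length-grouped batching and a confidence fallback can both be sketched in a few lines. All names here (`length_bucketed_batches`, `padding_waste`, `predict_with_fallback`) are hypothetical, and the threshold of 0.7 and the `"needs_review"` label are placeholder assumptions to be tuned per application.

```python
def length_bucketed_batches(token_lists: list[list[str]], batch_size: int) -> list[list[int]]:
    """Sort inputs by length, then slice into batches of input indices."""
    order = sorted(range(len(token_lists)), key=lambda i: len(token_lists[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_waste(token_lists: list[list[str]], batches: list[list[int]]) -> float:
    """Fraction of slots holding padding when each batch is padded to its longest item."""
    total = real = 0
    for batch in batches:
        longest = max(len(token_lists[i]) for i in batch)
        total += longest * len(batch)
        real += sum(len(token_lists[i]) for i in batch)
    return 1 - real / total if total else 0.0


def predict_with_fallback(
    probs: dict[str, float],
    threshold: float = 0.7,
    fallback: str = "needs_review",
) -> str:
    """Return the top label, or the fallback when its probability is below threshold."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else fallback
```

Because the batches hold indices rather than the inputs themselves, predictions can be written back to their original positions after inference.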
Common Pitfalls
- Class imbalance: accuracy is misleading — use F1 or AUC, apply sampling strategies
- Label noise: crowdsourced labels often have 5-15% error rate — measure and account for it
- Domain shift: a model trained on news text often performs poorly on social media text; validate on the target domain
- Tokenizer mismatch: using a different tokenizer at inference vs. training causes silent degradation
- Evaluation contamination: the test set overlaps with training data (a particular risk with web-scraped corpora); check for overlap before reporting results
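A first-pass check for evaluation contamination is to compare normalized fingerprints of test and training texts. The sketch below assumes that lowercasing and whitespace collapsing are an adequate normalization; `fingerprint` and `find_contaminated` are hypothetical names. It only catches duplicates that are identical after normalization, so near-duplicates would need a fuzzier technique such as n-gram overlap.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the lowercased, whitespace-collapsed text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_contaminated(train_texts: list[str], test_texts: list[str]) -> list[int]:
    """Indices of test examples whose fingerprint also appears in the training set."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts) if fingerprint(t) in train_hashes]
```

Any indices returned are candidates for removal from the test set, or at least for separate reporting.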