Snippet: Audio & Speech Processing

Domain Context

Covers automatic speech recognition (ASR), text-to-speech (TTS), speaker identification, and audio classification. Audio poses distinct challenges: variable-length inputs, sensitivity to noise, and real-time latency requirements.

Data Pipeline

  • Audio format standardization: convert all inputs to a consistent sample rate (16 kHz for ASR, 22.05 kHz or higher for TTS)
  • Validate audio on ingest: check for corruption, silence-only files, anomalous duration (too short/too long)
  • Noise profiling: categorize training data by SNR level; log noise distribution before training
  • Always store raw audio — preprocessing should be reproducible from source
  • For speech data: check language, accent, and speaker distribution — imbalance here causes production failures
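
The ingest checks above can be sketched as a single validation function. This is a minimal illustration that assumes the clip has already been decoded to floats in [-1, 1]; the name validate_audio, the problem labels, and every threshold are hypothetical defaults, not values prescribed by this snippet.

```python
import math
from dataclasses import dataclass, field


@dataclass
class AudioCheck:
    ok: bool
    problems: list = field(default_factory=list)


def validate_audio(samples, sample_rate, expected_rate=16000,
                   min_seconds=0.5, max_seconds=30.0, silence_rms=1e-3):
    """Check one decoded clip; `samples` is a sequence of floats in [-1, 1].

    All thresholds are illustrative assumptions and should be tuned per dataset.
    """
    if sample_rate <= 0 or len(samples) == 0:
        return AudioCheck(ok=False, problems=["corrupt"])

    problems = []
    # NaN or infinite samples usually indicate a decoding failure.
    if any(not math.isfinite(s) for s in samples):
        problems.append("corrupt")
    if sample_rate != expected_rate:
        problems.append("sample_rate")

    duration = len(samples) / sample_rate
    if duration < min_seconds:
        problems.append("too_short")
    elif duration > max_seconds:
        problems.append("too_long")

    # RMS energy is only meaningful when every sample is finite.
    if "corrupt" not in problems:
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        if rms < silence_rms:
            problems.append("silence_only")

    return AudioCheck(ok=not problems, problems=problems)


# One second of a 440 Hz tone at 16 kHz, followed by one second of silence.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
print(validate_audio(tone, 16000))
print(validate_audio([0.0] * 16000, 16000))
```

Returning a list of problem labels rather than a single boolean makes it easy to log why each file was rejected and to report the rejection distribution alongside the noise profile.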

ASR (Speech-to-Text)

  • Baseline: use Whisper (large-v3) zero-shot before fine-tuning anything
  • Metric: Word Error Rate (WER) as primary; Character Error Rate (CER) for languages without clear word boundaries
  • Evaluate WER by: noise level, speaker accent, audio length, domain vocabulary
  • Decoder configuration: document beam size, language model weight, and any post-processing rules
  • Punctuation and casing: handle as post-processing unless the model does it natively — document the approach
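
As a rough sketch of the metrics above: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and CER is the same calculation over characters. The function names and the wer_by_slice helper are hypothetical. No text normalization is applied here, so casing and punctuation would need to be handled before calling these.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def _rate(ref_tokens, hyp_tokens):
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    return _rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    # Spaces are dropped so CER also works for unsegmented scripts.
    return _rate(list(reference.replace(" ", "")),
                 list(hypothesis.replace(" ", "")))


def wer_by_slice(rows):
    """rows: iterable of (slice_name, reference, hypothesis).

    Errors and words are pooled per slice (corpus-level WER) instead of
    averaging per-utterance WER, so long utterances count proportionally.
    """
    errors, words = defaultdict(int), defaultdict(int)
    for name, ref, hyp in rows:
        ref_tokens = ref.split()
        errors[name] += edit_distance(ref_tokens, hyp.split())
        words[name] += len(ref_tokens)
    return {name: errors[name] / words[name] for name in words if words[name]}


print(wer("the cat sat on the mat", "the cat sat on mat"))
print(wer_by_slice([("clean", "a b c d", "a b c d"),
                    ("noisy", "a b c d", "a x c")]))
```

The slice name can encode any of the breakdown dimensions listed above (noise level, accent, audio length, domain), so the same helper covers all of them.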

TTS (Text-to-Speech)

  • Evaluation requires human listening tests: MOS (Mean Opinion Score) on a 1-5 scale
  • Supplement human eval with: speaker similarity (for voice cloning), intelligibility, naturalness
  • Prosody control: test on edge cases — numbers, abbreviations, foreign words, emotional text
  • Latency: measure time-to-first-audio for streaming TTS; total generation time for batch
  • Always test on long-form text (>500 words) — many models degrade on longer inputs
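
When reporting MOS, a confidence interval makes comparisons between systems easier to interpret than a bare mean. The sketch below assumes a flat list of 1-5 ratings and uses a normal approximation (z = 1.96 for roughly 95%). The function name mos_summary is hypothetical, and small listening panels may call for a t-distribution or a bootstrap instead.

```python
import math
import statistics


def mos_summary(ratings, z=1.96):
    """Return (mean, lower, upper) for a list of 1-5 opinion scores.

    Uses a normal-approximation interval, which is an assumption of this
    sketch and is only reasonable for a sufficiently large number of ratings.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - half_width, mean + half_width


mean, low, high = mos_summary([4, 4, 5, 3, 4])
print(f"MOS {mean:.2f} (95% CI {low:.2f} to {high:.2f})")
```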

Audio Classification

  • Feature extraction: log-mel spectrograms as default input representation
  • Augmentation: time stretch, pitch shift, noise injection, SpecAugment — document the pipeline
  • Class balance is critical: audio datasets are often heavily skewed — stratified splits required
  • Window size and hop length: document and keep consistent between training and inference
  • Real-time classification: profile latency per audio chunk, not per full clip
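
One way to keep window size and hop length consistent is to store the feature settings next to the model and compare them at load time. The sketch below is illustrative: FeatureConfig, its default values (25 ms window, 10 ms hop, 80 mel bands at 16 kHz), and assert_same_config are assumptions, not settings taken from this snippet.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureConfig:
    sample_rate: int = 16000
    win_length: int = 400   # 25 ms at 16 kHz (assumed default)
    hop_length: int = 160   # 10 ms at 16 kHz (assumed default)
    n_mels: int = 80

    def num_frames(self, n_samples):
        """Number of full frames when no padding is applied."""
        if n_samples < self.win_length:
            return 0
        return 1 + (n_samples - self.win_length) // self.hop_length

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)


def assert_same_config(train_json, inference_cfg):
    """Fail loudly if inference features differ from the training setup."""
    trained = FeatureConfig(**json.loads(train_json))
    if trained != inference_cfg:
        raise ValueError(f"feature mismatch: trained={trained}, "
                         f"inference={inference_cfg}")


cfg = FeatureConfig()
saved = cfg.to_json()              # written alongside the model checkpoint
assert_same_config(saved, cfg)     # passes: identical settings
print(cfg.num_frames(16000))       # frames in one second of audio
```

The same frame count also gives the per-chunk workload for real-time classification: a chunk of n samples yields num_frames(n) feature frames to profile.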

Production Deployment

  • Streaming ASR: measure Real-Time Factor (RTF), i.e. processing time divided by audio duration; it must stay below 1.0 for real-time applications
  • VAD (Voice Activity Detection): implement before ASR to avoid processing silence/noise
  • Endpointing: detect when the speaker has finished — crucial for conversational AI latency
  • Fallback: define behavior for unsupported languages, excessive noise, or very long audio
  • Privacy: implement PII detection in transcriptions; offer on-device processing for sensitive audio
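
To make RTF and endpointing concrete, here is a minimal sketch. It uses a plain energy threshold as a stand-in for a real VAD and declares the utterance finished after a fixed run of trailing silence. The function names, the 20 ms frame size, the 500 ms silence window, and the 0.01 RMS threshold are all illustrative assumptions; production systems typically use a trained VAD model.

```python
import math


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 keeps up with real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def frame_is_speech(frame, threshold=0.01):
    """Crude energy-based VAD decision for one frame of floats in [-1, 1]."""
    if not frame:
        return False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms >= threshold


def find_endpoint(samples, sample_rate=16000, frame_ms=20,
                  trailing_silence_ms=500, threshold=0.01):
    """Return the sample index at which the utterance is considered finished.

    The endpoint fires once speech has been heard and is followed by
    `trailing_silence_ms` of consecutive non-speech frames. Returns None
    if that never happens.
    """
    frame_len = sample_rate * frame_ms // 1000
    frames_needed = trailing_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if frame_is_speech(samples[start:start + frame_len], threshold):
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return start + frame_len
    return None


# One second of tone standing in for speech, then one second of silence.
speech = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
audio = speech + [0.0] * 16000
print(find_endpoint(audio))          # 24000, i.e. 0.5 s after the tone stops
print(real_time_factor(2.0, 10.0))   # 0.2
```

The trailing-silence window is the main latency lever for conversational systems: a shorter window responds faster but risks cutting off speakers who pause mid-sentence.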

Common Pitfalls

  • Sample rate mismatch between training and inference: accuracy degrades without any error being raised
  • Microphone variability: models trained on studio audio fail on phone/laptop microphones
  • Background music in speech data: ASR models struggle without explicit noise robustness training
  • Evaluation only on clean audio: real-world WER can be 2-5x worse than on clean benchmarks
  • Ignoring speaker diversity: models can be biased toward specific accents, genders, or age groups