Snippet: Audio & Speech Processing¶

Domain Context¶

Automatic Speech Recognition (ASR), Text-to-Speech (TTS), speaker identification, audio classification. Audio has unique challenges: variable length, noise sensitivity, real-time requirements.

Data Pipeline¶

Audio format standardization: convert all inputs to consistent sample rate (16kHz for ASR, 22kHz+ for TTS)
Validate audio on ingest: check for corruption, silence-only files, anomalous duration (too short/too long)
Noise profiling: categorize training data by SNR level; log noise distribution before training
Always store raw audio — preprocessing should be reproducible from source
For speech data: check language, accent, and speaker distribution — imbalance here causes production failures

ASR (Speech-to-Text)¶

Baseline: use Whisper (large-v3) zero-shot before fine-tuning anything
Metric: Word Error Rate (WER) as primary; Character Error Rate (CER) for languages without clear word boundaries
Evaluate WER by: noise level, speaker accent, audio length, domain vocabulary
Decoder configuration: document beam size, language model weight, and any post-processing rules
Punctuation and casing: handle as post-processing unless the model does it natively — document the approach

TTS (Text-to-Speech)¶

Evaluation requires human listening tests — MOS (Mean Opinion Score) on 1-5 scale
Supplement human eval with: speaker similarity (for voice cloning), intelligibility, naturalness
Prosody control: test on edge cases — numbers, abbreviations, foreign words, emotional text
Latency: measure time-to-first-audio for streaming TTS; total generation time for batch
Always test on long-form text (>500 words) — many models degrade on longer inputs

Audio Classification¶

Feature extraction: log-mel spectrograms as default input representation
Augmentation: time stretch, pitch shift, noise injection, SpecAugment — document the pipeline
Class balance is critical: audio datasets are often heavily skewed — stratified splits required
Window size and hop length: document and keep consistent between training and inference
Real-time classification: profile latency per audio chunk, not per full clip

Production Deployment¶

Streaming ASR: measure Real-Time Factor (RTF) — must be <1.0 for real-time applications
VAD (Voice Activity Detection): implement before ASR to avoid processing silence/noise
Endpointing: detect when the speaker has finished — crucial for conversational AI latency
Fallback: define behavior for unsupported languages, excessive noise, or very long audio
Privacy: implement PII detection in transcriptions; offer on-device processing for sensitive audio

Common Pitfalls¶

Sample rate mismatch between training and inference — causes silent degradation
Microphone variability: models trained on studio audio fail on phone/laptop microphones
Background music in speech data: ASR models struggle without explicit noise robustness training
Evaluation only on clean audio: real-world WER is typically 2-5x worse than clean benchmarks
Ignoring speaker diversity: models can be biased toward specific accents, genders, or age groups