Stop Recomputing Mel Spectrograms: Preprocessing Your Data Before Whisper Fine-Tuning
This is Part 1 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess → Train → Evaluate. In this post, we build the data preprocessing pipeline. Parts 2 and 3 will cover the training loop and evaluation/benchmarking, respectively.

When I first started fine-tuning OpenAI's Whisper for Korean speech-to-text, I noticed something frustrating. Every single time I kicked off a training run — whether I was tweaking the learning rate, adjusting the batch size, or experimenting with a new scheduler — the framework would spend hours churning through raw audio files before a single gradient was computed. The preprocessing step was identical each time: load WAV files, resample, compute mel spectrograms, tokenize transcriptions. Nothing about the data had changed, yet I was paying the full cost of data preparation on every attempt. ...
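To see why that repeated work adds up, it helps to look at what the most expensive step — the mel spectrogram — actually involves. The sketch below is an illustrative NumPy implementation, not Whisper's actual code; the settings (16 kHz audio, 80 mel bins, a 400-sample FFT, a 160-sample hop) mirror Whisper's published defaults, but the filterbank construction here is a simplified textbook version.

```python
import numpy as np

def hz_to_mel(hz):
    # Standard mel-scale conversion.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            if center > left:
                fb[i, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the signal, window each frame, FFT, project onto mel
    # filters, then take the log — one pass over every audio file.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log10(np.maximum(mel, 1e-10)).T  # (n_mels, n_frames)

# One second of a 440 Hz tone at 16 kHz, as a stand-in for real audio.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
spec = log_mel_spectrogram(audio)
print(spec.shape)  # 80 mel bins by ~100 frames per second of audio
```

Every FFT, filterbank projection, and log here is a deterministic function of the input audio — which is exactly why recomputing it on every training run is wasted work, and why caching the result once is the obvious fix.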