Did Fine-Tuning Actually Help? Evaluating and Benchmarking Whisper for Korean STT
This is Part 3 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess → Train → Evaluate. Part 1 covered preprocessing; Part 2 covered training. Here we measure whether the fine-tuned model actually improved, and by how much.

A trained model without evaluation is just a checkpoint on disk. You can stare at the training loss curve and hope it went down, but until you run the model on held-out data and measure something concrete (CER, WER, per-category breakdowns), you don't know whether the fine-tuning worked, whether it regressed on certain domains, or how it compares to the baseline you started from. ...