Did Fine-Tuning Actually Help? Evaluating and Benchmarking Whisper for Korean STT
This is Part 3 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess → Train → Evaluate. Part 1 covered preprocessing; Part 2 covered training. Here we measure whether the fine-tuned model actually improved, and by how much.

A trained model without evaluation is just a checkpoint on disk. You can stare at the training loss curve and hope it went down, but until you run the model on held-out data and measure something concrete (CER, WER, per-category breakdowns), you don't know whether the fine-tuning worked, whether it regressed on certain domains, or how it compares to the baseline you started from. ...