Training Whisper on Precomputed Features: Full Fine-Tune for Korean STT
This is Part 2 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess → Train → Evaluate. Here we load the preprocessed dataset and run the training loop. Part 1 covered preprocessing; Part 3 will cover evaluation and benchmarking.

With precomputed mel spectrograms and tokenized labels on disk, the next step is to plug them into a training loop and optimize the model. That sounds straightforward until you start making choices: full fine-tuning or LoRA? What learning rate and batch size? How do you pad variable-length sequences correctly for an encoder-decoder, and how do you avoid wasting GPU memory or blowing up training? This post walks through the training setup I use for Whisper large-v3 on Korean telephonic audio, and the engineering trade-offs behind each decision. ...
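To make the padding question concrete: Whisper's mel features are already a fixed size (the preprocessor pads or truncates every clip to 30 seconds), so in a batch only the tokenized labels vary in length. A minimal sketch of the label-padding step, assuming the usual convention of masking pad positions with `-100` so cross-entropy loss ignores them (function and variable names here are illustrative, not from this post's actual training code):

```python
# Hedged sketch: right-pad variable-length label sequences to a common
# length, using -100 at pad positions so the loss function skips them.
# In a real collator this would return tensors; plain lists keep the
# idea self-contained.

def pad_labels(label_seqs, ignore_index=-100):
    """Right-pad tokenized label sequences to the batch's max length.

    Padding positions get ignore_index, the value PyTorch's
    cross-entropy treats as "do not compute loss here".
    """
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]

# Two label sequences of different lengths become one rectangular batch.
batch = pad_labels([[50258, 7, 8], [50258, 9]])
# batch[0] -> [50258, 7, 8]
# batch[1] -> [50258, 9, -100]
```

The same idea underlies the speech seq2seq data collators commonly used with Hugging Face `Trainer`; those additionally strip a leading BOS token from the labels when the model re-adds it during decoding.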