This is Part 2 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess → Train → Evaluate. Here we load the preprocessed dataset and run the training loop. Part 1 covered preprocessing; Part 3 will cover evaluation and benchmarking.


With precomputed mel spectrograms and tokenized labels on disk, the next step is to plug them into a training loop and optimize the model. That sounds straightforward until you start making choices: full fine-tuning or LoRA? What learning rate and batch size? How do you pad variable-length sequences correctly for an encoder-decoder, and how do you avoid wasting GPU memory or blowing up training? This post walks through the training setup I use for Whisper large-v3 on Korean telephonic audio — and the engineering trade-offs behind each decision.

The full training script is available on GitHub: train_whisper.py (full fine-tune on 8× H200, BF16, Seq2SeqTrainer).

From Preprocessed Data to the Trainer

Part 1 left us with a dataset stored as multiple Hugging Face Dataset shards on disk: each sample has input_features (precomputed mel spectrograms) and labels (token IDs for the transcriptions). The training script does not touch raw audio or the feature extractor during training; it only loads these shards, concatenates them, and feeds them to the model via a data collator that handles padding and decoder setup.

Loading is done in chunks: the script discovers all chunk_* directories under train/ and val/, loads each with load_from_disk, and concatenates the result. Using a thread pool for I/O-bound loading keeps startup time reasonable even with many shards. Because the data is already tokenized and length-filtered in preprocessing, the “prepare” step is just a sanity check that required columns exist — no on-the-fly tokenization.

Why Full Fine-Tuning (and When to Consider LoRA)

I use full fine-tuning of the encoder and decoder rather than parameter-efficient methods like LoRA or QLoRA. For Korean STT, the goal is to adapt the model to a specific domain (e.g. telephonic, accented, or noisy speech) and to the language’s orthography and disfluencies. Full fine-tuning updates every layer and has more capacity to shift the representation; in practice, I saw a clear CER/WER gain over the base model and over a LoRA run with comparable compute.

The trade-off is cost: full fine-tuning of Whisper large-v3 requires significant GPU memory and multi-GPU setup. Alternatives and when they make sense:

  • LoRA / QLoRA: Fewer trainable parameters and lower memory, so you can run on a single consumer GPU or smaller cloud instances. Useful for quick experiments, small datasets, or when you only need light adaptation. You give up some ceiling on accuracy.
  • Adapter layers or prefix tuning: Similar idea — fewer parameters, faster and cheaper, but typically less capacity than full fine-tuning for a large domain shift.
  • Full fine-tune: Best when you have enough data and GPU budget to justify it, and when you care about squeezing out the best CER/WER. The script is tuned for 8× H200 (143GB each); you can scale down batch size and use gradient accumulation on smaller rigs.

So the engineering choice here is: full fine-tune for maximum accuracy, given that we already invested in preprocessing and have the hardware. If you are resource-constrained, the same data pipeline works with a LoRA/QLoRA training script; you only swap the model wrapper and optimizer setup.

Precision, Checkpointing, and Optimizer

Training uses BF16 (bfloat16) everywhere: torch_dtype=torch.bfloat16 for the model, bf16=True in the training args, and the data collator casts input_features to bfloat16 before the forward pass. On H200 (and other recent NVIDIA GPUs), BF16 is the preferred choice over FP16 for stability and speed; FP16 can require loss scaling and is more prone to overflow in deep transformers. If you are on older GPUs without BF16, FP16 with dynamic loss scaling is the fallback.

Gradient checkpointing is enabled so that activations are recomputed during the backward pass instead of stored. That trades compute for memory and allows a much larger per-device batch size (e.g. 256 per GPU in the reference setup). Without it, Whisper large-v3 would not fit at that batch size on 80GB-class GPUs.

The optimizer is AdamW with the fused implementation (adamw_torch_fused). On modern GPUs the fused kernel reduces memory traffic and is faster than the default AdamW. Weight decay is set to 0.01 and gradient norm is clipped to 1.0 for stability.

Learning Rate, Schedule, and Batch Size

  • Learning rate: 5e-5. For full fine-tuning of a large pretrained model, this is a conservative value that avoids catastrophic forgetting while still making steady progress. Rather than ramping up to a higher peak, a short warmup (about 5% of steps) into this value was enough.
  • Scheduler: Cosine decay over the training run. LR decreases smoothly to a small value by the end. Linear decay is another common option; cosine tends to give a slightly better tail in my runs.
  • Batch size: Per-device train batch size 256, no gradient accumulation in the reference config, so effective batch size is 256 × 8 = 2048. Large batch sizes help stability with BF16 and reduce step variance. If you have fewer or smaller GPUs, reduce per-device batch size and increase gradient accumulation to keep effective batch size in a similar range (e.g. 1024–2048) so the learning dynamics stay comparable.

Data Collator: Padding and Decoder Setup

Whisper is encoder-decoder. The encoder gets input_features (mel spectrograms); the decoder gets decoder_input_ids and produces logits that are compared to labels. The data collator must:

  1. Pad variable-length sequences in the batch to the same length (e.g. “longest” in the batch) so they can be stacked into tensors.
  2. Truncate labels that exceed Whisper’s maximum length (448 tokens in the script) to avoid shape errors.
  3. Build decoder_input_ids by shifting the label sequence right and prepending the decoder start token; use the tokenizer’s pad_token_id for padding in decoder_input_ids, and use -100 for padding in labels so the loss ignores those positions.
  4. Cast input_features to bfloat16 after padding so the model receives the right dtype.

The reference collator does all of this in one place. Getting the padding and -100 masking wrong is a common source of bugs (e.g. the model learning to predict padding or the loss being computed on padding). So the engineering choice is to centralize this logic in a single collator and keep the dataset itself as “raw” features and label IDs only.
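The four steps above can be sketched as a small, self-contained collator. It assumes the mel spectrograms were already padded to a fixed frame length in preprocessing (as in Part 1), so only the labels need dynamic padding; the class and field names are illustrative:

```python
from dataclasses import dataclass

import torch


@dataclass
class WhisperDataCollator:
    pad_token_id: int             # tokenizer.pad_token_id
    decoder_start_token_id: int   # model.config.decoder_start_token_id
    max_label_len: int = 448      # Whisper's decoder limit

    def __call__(self, features):
        # (1)/(4) Spectrograms are fixed-length after preprocessing, so they
        # stack directly; cast to bfloat16 for the forward pass.
        input_features = torch.stack(
            [torch.as_tensor(f["input_features"]) for f in features]
        ).to(torch.bfloat16)

        # (2) Truncate labels to the decoder's maximum length.
        labels = [f["labels"][: self.max_label_len] for f in features]
        max_len = max(len(seq) for seq in labels)

        # (3) Pad labels with -100 (ignored by the loss); build
        # decoder_input_ids by shifting right: drop the last token,
        # prepend the decoder start token, replace -100 with pad_token_id.
        label_rows, decoder_rows = [], []
        for seq in labels:
            padded = seq + [-100] * (max_len - len(seq))
            label_rows.append(padded)
            shifted = [self.decoder_start_token_id] + padded[:-1]
            decoder_rows.append(
                [t if t != -100 else self.pad_token_id for t in shifted]
            )
        return {
            "input_features": input_features,
            "labels": torch.tensor(label_rows),
            "decoder_input_ids": torch.tensor(decoder_rows),
        }
```

Note the asymmetry: labels are padded with -100 so the loss skips those positions, while decoder_input_ids use the real pad token, since the model must receive valid token IDs as input.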

Evaluation and Early Stopping

Validation runs every 500 steps (configurable). The trainer uses predict_with_generate: it runs the full decoder with greedy decoding (single beam) to produce transcriptions, then compares them to the references to compute CER (character error rate) and WER (word error rate). CER is the primary metric for Korean; WER is reported for reference. Evaluation is run on a subset of the validation set (e.g. 4000 samples) to keep eval time manageable.
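Both metrics reduce to a normalized edit distance: CER over characters, WER over whitespace-split words. A dependency-free version for reference — the actual script likely computes these via a library such as evaluate or jiwer, so treat this as a sketch of the definition, not the script's code:

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (one-row DP)."""
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                           # deletion
                dp[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution
            )
            prev = cur
    return dp[n]


def cer(refs, hyps) -> float:
    """Character error rate: edits / reference characters."""
    errors = sum(edit_distance(list(r), list(h)) for r, h in zip(refs, hyps))
    return errors / max(sum(len(r) for r in refs), 1)


def wer(refs, hyps) -> float:
    """Word error rate: edits / reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / max(sum(len(r.split()) for r in refs), 1)
```

One decoding detail matters in practice: labels carry -100 padding, so before computing metrics the -100s must be replaced with the pad token ID and decoded with skip_special_tokens=True, or the references will be corrupted.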

Early stopping is enabled with a patience of 10 evaluations and a small improvement threshold. The trainer is configured to load the best model at the end using the best CER (lower is better). So the final saved model is the checkpoint with the lowest validation CER, not the last step — which matters when the curve flattens or starts overfitting.
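In Trainer terms this combination looks roughly like the following; the 1e-3 threshold is illustrative (the text only specifies "a small threshold"):

```python
from transformers import EarlyStoppingCallback

# Stop if eval CER fails to improve by more than the threshold for 10
# consecutive evaluations. Pair this with load_best_model_at_end=True,
# metric_for_best_model="cer", greater_is_better=False in the training
# args so the checkpoint with the lowest validation CER is kept.
early_stop = EarlyStoppingCallback(
    early_stopping_patience=10,
    early_stopping_threshold=1e-3,  # illustrative value
)
```

The callback is passed via the trainer's callbacks list; it reads the metric named by metric_for_best_model after every evaluation.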

Checkpointing and Resume

Checkpoints are saved every 500 steps, with save_total_limit=5 so only the last five are kept. The script checks for existing checkpoints in the output directory and, if found, resumes from the latest checkpoint. That way, long runs can be restarted after a crash or a preemption without losing progress.
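The resume check can be as small as this; `resume_target` is an illustrative wrapper around get_last_checkpoint, the stock transformers utility:

```python
import os

from transformers.trainer_utils import get_last_checkpoint


def resume_target(output_dir: str):
    """Return the newest checkpoint-* directory under output_dir, or None."""
    if not os.path.isdir(output_dir):
        return None  # fresh run, nothing to resume
    return get_last_checkpoint(output_dir)


# trainer.train(resume_from_checkpoint=resume_target(output_dir))
```

Passing None to resume_from_checkpoint simply starts a fresh run, so the same call works for both the first launch and every restart.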

Summary of Choices vs Alternatives

Area             | Choice                      | Alternative
Parameter update | Full fine-tune              | LoRA/QLoRA for less compute, lower ceiling
Precision        | BF16                        | FP16 on older GPUs (with loss scaling)
Optimizer        | AdamW fused                 | Default AdamW
LR schedule      | Cosine + short warmup       | Linear, or longer warmup
Batch size       | Large (e.g. 2048 effective) | Smaller + more grad accumulation
Eval metric      | CER (primary), WER          | WER-only for some languages
Best model       | Load best by CER at end     | Keep last checkpoint only

What’s Next

With training done, the best checkpoint is saved (e.g. under .../final). In Part 3, we cover evaluation: how to run the model on held-out Korean test sets, compute CER/WER in a reproducible way, benchmark multiple checkpoints across data categories, and compare against baselines.