This is Part 3 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess → Train → Evaluate. Here we measure whether the fine-tuned model actually improved, and by how much. Part 1 covered preprocessing; Part 2 covered training.
A trained model without evaluation is just a checkpoint on disk. You can stare at the training loss curve and hope it went down, but until you run the model on held-out data and measure something concrete — CER, WER, per-category breakdowns — you don’t know if the fine-tuning worked, whether it regressed on certain domains, or how it compares to the baseline you started from.
This post covers two scripts: one for evaluating a single model on a Korean telephonic validation set, and one for benchmarking multiple models side by side with per-category metrics. The engineering focus is on doing this efficiently at scale — thousands of audio samples, multiple GPUs, multiple model checkpoints — without drowning in I/O bottlenecks or memory pressure.
The evaluation and benchmark scripts are available on GitHub: eval.py and benchmark.py.
What We Measure: CER and WER
For Korean speech-to-text, the primary metric is CER (Character Error Rate). Korean is an agglutinative language where spacing conventions are inconsistent, making word-level metrics noisy. CER computes the edit distance at the character level — insertions, deletions, and substitutions — normalized by the length of the reference transcription. It gives a more stable signal than WER for Korean.
WER (Word Error Rate) is still reported for completeness and for comparison with English-language benchmarks. Both metrics are computed using the Hugging Face evaluate library, which provides standardized implementations. Empty references are filtered out before computation to avoid division-by-zero artifacts.
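For intuition, the corpus-level CER the scripts report (via `evaluate`) reduces to a character-level edit-distance computation. Here is a dependency-free sketch of that computation, including the empty-reference filter; the function names are illustrative, not the ones in the actual scripts:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, start=1):
        curr = [i]
        for j, hc in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (rc != hc),   # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def corpus_cer(references: list[str], predictions: list[str]) -> float:
    """Total edits divided by total reference characters, over the corpus."""
    # Drop empty references first to avoid division-by-zero artifacts.
    pairs = [(r, p) for r, p in zip(references, predictions) if r]
    total_edits = sum(edit_distance(r, p) for r, p in pairs)
    total_chars = sum(len(r) for r, _ in pairs)
    return total_edits / total_chars

print(corpus_cer(["학교에 갔다"], ["학교에 갓다"]))  # 1 substitution / 6 reference chars
```

Note that this is a micro-average: edits and reference lengths are summed across the corpus before dividing, so long utterances carry proportionally more weight than short ones.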
Single-Model Evaluation
The evaluation script (eval.py) answers a simple question: how well does this one model perform on the validation set?
Pipeline
The pipeline has three stages:
1. Data discovery: Scan the validation directory for WAV files and pair each with its corresponding TXT transcription file. The directory convention follows the Korean NIA dataset structure, where audio lives under `원천데이터` ("source data") and labels under `라벨링데이터` ("labeling data"). The pairing is deterministic (same seed, same shuffle, same subset every time) so results are reproducible.
2. Parallel audio preprocessing: Load and resample audio to 16 kHz using 200 CPU workers via multiprocessing. Each worker uses `librosa.load` with `kaiser_fast` resampling. The results — raw numpy arrays and their transcriptions — are collected in memory for the inference stage.
3. Multi-GPU inference: The model is loaded in FP16 with `device_map="auto"`, which distributes the model across all available GPUs. Batches of audio are passed through the `WhisperProcessor` to compute mel spectrograms on the fly, then fed to `model.generate` with greedy decoding (`num_beams=1`). The decoded predictions are paired with references for metric computation.
Greedy decoding is a deliberate choice for evaluation speed. Beam search with 4–5 beams would improve accuracy by a small margin (typically 0.1–0.3% CER) but multiplies inference time proportionally. For iterative evaluation during development — where you might run this script dozens of times as you compare checkpoints — greedy decoding is the better trade-off.
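The decoding trade-off comes down to a single argument on the `generate` call. A minimal sketch of the two settings, using argument names from the Hugging Face `transformers` generation API:

```python
# Decoding settings for Whisper's generate() call.
greedy = dict(num_beams=1, do_sample=False)  # one decoder pass per token: fastest
beam = dict(num_beams=5, do_sample=False)    # ~5x decoder compute for a small CER gain

# During development iteration:
#   predicted_ids = model.generate(input_features, **greedy)
# For a final reported number:
#   predicted_ids = model.generate(input_features, **beam)
```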
Output
The script prints a summary: total CER, total WER, samples evaluated, wall-clock time, and throughput (samples/sec). It also prints a random sample of 10 predictions alongside their references for manual inspection — which is often more informative than aggregate metrics. You can spot systematic errors (e.g. the model hallucinating filler words, dropping sentence endings, or confusing similar-sounding syllables) that a single CER number would hide.
Multi-Model Benchmarking
The benchmark script (benchmark.py) is designed for a different workflow: comparing multiple models on the same data, with per-category breakdowns.
Category-Level Evaluation
The Korean telephonic dataset is organized into categories (D01, D02, D03, D04), each representing a different type of telephonic conversation. A model might perform well on clean, scripted speech (D01) but poorly on noisy, spontaneous dialogue (D04). Reporting only aggregate CER would mask this. The benchmark script evaluates each category separately and reports a table:
| Model       | D01  | D02  | D03  | D04  | TOTAL |
|-------------|------|------|------|------|-------|
| medium-ft-2 | 4.21 | 5.87 | 6.34 | 8.12 | 6.14  |
This per-category view is essential for diagnosing where a model struggles and for deciding whether you need more data from a specific domain, or whether a different architecture or training strategy is warranted.
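The per-category table falls out of a simple aggregation: accumulate edit-distance errors and reference lengths per category, then divide once at the end. A sketch with hypothetical per-sample results (the tuple layout is illustrative):

```python
from collections import defaultdict

# Hypothetical per-sample results: (category, char_errors, ref_char_count).
results = [
    ("D01", 2, 50), ("D01", 0, 40),
    ("D04", 9, 60), ("D04", 7, 55),
]

errors, chars = defaultdict(int), defaultdict(int)
for cat, e, n in results:
    errors[cat] += e
    chars[cat] += n

# CER as a percentage, per category plus the overall total.
row = {cat: 100.0 * errors[cat] / chars[cat] for cat in sorted(errors)}
row["TOTAL"] = 100.0 * sum(errors.values()) / sum(chars.values())
print({k: round(v, 2) for k, v in row.items()})
# {'D01': 2.22, 'D04': 13.91, 'TOTAL': 8.78}
```

Summing errors and lengths before dividing (rather than averaging per-sample CERs) means long utterances weigh more, which matches how corpus-level CER is conventionally reported.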
Architecture: Manifest-Based, No Audio in RAM
The evaluation script loads all audio into memory via multiprocessing, which works fine for 5,000 samples but does not scale to tens of thousands across multiple models. Loading 20,000 audio files into RAM and then running four models sequentially would require either enormous memory or repeated I/O.
The benchmark script takes a different approach. It builds a manifest — a lightweight list of (wav_path, transcription, category) tuples — once, using parallel workers to read only the text transcription files. No audio is loaded into memory at this stage. The manifest is shared across all model evaluations, so the I/O cost of discovering and pairing files is paid exactly once.
Audio is loaded on-demand inside each GPU worker, one batch at a time. After the batch is processed, the audio arrays are discarded. This keeps peak memory proportional to batch_size × audio_length per GPU, regardless of total dataset size.
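A sketch of the manifest-building step under the assumptions above (directory layout and function name are illustrative, not the exact ones in `benchmark.py`). Only the small TXT files are read; the WAV paths are recorded but never opened:

```python
from pathlib import Path

def build_manifest(audio_root: str, label_root: str) -> list[tuple[str, str, str]]:
    """Pair each WAV with its transcription and category.
    No audio is loaded here -- only paths and small text files are touched."""
    manifest = []
    for wav in sorted(Path(audio_root).rglob("*.wav")):
        rel = wav.relative_to(audio_root)
        txt = Path(label_root) / rel.with_suffix(".txt")
        if not txt.exists():
            continue  # skip unpaired audio
        category = rel.parts[0]  # assumes e.g. D01/... as the top-level directory
        manifest.append((str(wav), txt.read_text(encoding="utf-8").strip(), category))
    return manifest
```

The returned list of `(wav_path, transcription, category)` tuples is what gets sharded across the GPU workers; each worker opens its WAV files lazily, one batch at a time.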
Architecture: One Process Per GPU
Rather than using device_map="auto" to shard a single model across GPUs (which serializes inference), the benchmark script spawns one process per GPU. Each process:
- Loads the full model onto its assigned `cuda:{rank}` device in FP16
- Receives a round-robin shard of the manifest (for load balancing)
- Loads audio on-demand with `torchaudio` (faster than `librosa` for simple load-and-resample)
- Runs batched inference with greedy decoding
- Streams results back to the main process via a `multiprocessing.Queue`
The main process collects results from all GPUs with a progress bar and computes metrics once all workers are done.
This one-process-per-GPU design gives near-linear scaling. With 8 GPUs, you get approximately 8× throughput compared to single-GPU evaluation — which matters when you are benchmarking several model checkpoints on tens of thousands of samples.
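The round-robin sharding itself is one line of slicing. A sketch of the shard-and-worker structure (the worker body is reduced to comments, and the names are illustrative):

```python
import multiprocessing as mp

def shard_round_robin(manifest: list, num_gpus: int) -> list[list]:
    """Interleaved split: shard sizes differ by at most one, and each shard
    samples the whole manifest (categories, durations) roughly evenly."""
    return [manifest[rank::num_gpus] for rank in range(num_gpus)]

def gpu_worker(rank: int, shard: list, queue: "mp.Queue") -> None:
    # In the real script: load the model onto cuda:{rank} in FP16, load audio
    # on demand with torchaudio, run batched greedy decoding, and stream
    # (prediction, reference, category) tuples back through the queue.
    for item in shard:
        queue.put((rank, item))
    queue.put((rank, None))  # sentinel: this worker is done

shards = shard_round_robin(list(range(10)), num_gpus=4)
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The sentinel value lets the main process keep draining the queue until it has seen one `None` per worker, at which point all results are in and metrics can be computed.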
Engineering Details
A few things that matter in practice:
- Thread contention: Each GPU process pins `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and `torch.set_num_threads` to 1. With 8 GPU processes each running their own CUDA kernels, internal CPU threading would cause contention and slow everything down. The same lesson from the preprocessing pipeline (Part 1) applies here.
- Flash Attention 2: The script attempts to load the model with `attn_implementation="flash_attention_2"` and falls back gracefully if the environment does not support it. Flash Attention reduces memory usage and speeds up the attention computation, which helps when running many samples through a large model.
- TF32 and cuDNN benchmark: `torch.backends.cuda.matmul.allow_tf32 = True` and `torch.backends.cudnn.benchmark = True` are set for additional inference speed on Ampere+ GPUs.
- `torchaudio` over `librosa`: For the benchmark script, audio loading uses `torchaudio.load` + `torchaudio.functional.resample` rather than `librosa`. Torchaudio avoids pulling in the full scipy/librosa stack in each spawned process and handles the load-resample path with less overhead. For preprocessing (Part 1) where we needed more control, `librosa` was fine; for high-throughput benchmark inference, `torchaudio` is leaner.
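The per-process setup amounts to a few lines at the top of each GPU worker. One detail worth making explicit: the environment variables must be set before numpy or torch are imported in that process, or OpenMP will have already sized its thread pool. A sketch (the torch and model-loading lines are shown as comments so the snippet stays dependency-free; treat the fallback exception types as an assumption):

```python
import os

# Must run before numpy/torch are imported in the worker process.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# After `import torch` inside the worker:
#   torch.set_num_threads(1)
#   torch.backends.cuda.matmul.allow_tf32 = True
#   torch.backends.cudnn.benchmark = True
#
# Model load with a graceful Flash Attention 2 fallback:
#   try:
#       model = WhisperForConditionalGeneration.from_pretrained(
#           model_path, torch_dtype=torch.float16,
#           attn_implementation="flash_attention_2")
#   except (ImportError, ValueError):
#       model = WhisperForConditionalGeneration.from_pretrained(
#           model_path, torch_dtype=torch.float16)
```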
What the Numbers Tell You (and What They Don’t)
CER and WER give you a single number that summarizes transcription accuracy. That’s useful for comparing models, tracking progress, and deciding when to stop training. But a few caveats:
- CER doesn’t capture semantic errors. A model that outputs “학교에 갔다” instead of “학교에 갓다” has a low edit distance but a different meaning. Character-level metrics treat all errors equally.
- Aggregate metrics hide distribution. A model with 5% CER might have 2% on clean audio and 15% on noisy audio. The per-category breakdown in the benchmark script helps, but even within a category the variance can be large.
- Greedy vs. beam search. The numbers reported with greedy decoding are a lower bound on the model’s capability. If you need the absolute best CER for a final report, run beam search with 4–5 beams. For development iteration, greedy is the right call.
The most informative step is often the simplest: read the sample predictions. Look at what the model gets wrong. Are the errors phonetically plausible? Are they spacing issues (which CER penalizes but a human wouldn’t notice)? Are there hallucinations — the model generating fluent text that was never spoken? These qualitative observations guide the next round of data collection or training adjustments more than any single metric.
Closing the Loop
This post completes the three-part series. The full pipeline looks like this:
- Preprocess (Part 1): Compute mel spectrograms and tokenize transcriptions once. Save to disk as memory-mappable shards.
- Train (Part 2): Load precomputed features, run full fine-tuning with BF16, large batches, cosine schedule, and early stopping on CER.
- Evaluate (Part 3): Run the fine-tuned model on held-out data. Measure CER/WER. Benchmark against baselines and across data categories. Inspect predictions manually.
Each stage is decoupled. You can re-run evaluation without re-training, swap in a different model checkpoint without re-preprocessing, or add a new data category to the benchmark without touching the training pipeline. That modularity is the point — it turns a monolithic “train and hope” workflow into an iterative, measurable process.
The code for the entire pipeline is on GitHub. If you are fine-tuning Whisper for any language, the same structure applies: preprocess once, train with discipline, and measure everything.