This is Part 3 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess (Part 1) → Train (Part 2) → Evaluate (this post). Here we measure whether the fine-tuned model actually improved, and by how much.


A trained model without evaluation is just a checkpoint on disk. You can stare at the training loss curve and hope it went down, but until you run the model on held-out data and measure something concrete — CER, WER, per-category breakdowns — you don’t know if the fine-tuning worked, whether it regressed on certain domains, or how it compares to the baseline you started from.

This post covers two scripts: one for evaluating a single model on a Korean telephonic validation set, and one for benchmarking multiple models side by side with per-category metrics. The engineering focus is on doing this efficiently at scale — thousands of audio samples, multiple GPUs, multiple model checkpoints — without drowning in I/O bottlenecks or memory pressure.

The evaluation and benchmark scripts are available on GitHub: eval.py and benchmark.py.

What We Measure: CER and WER

For Korean speech-to-text, the primary metric is CER (Character Error Rate). Korean is an agglutinative language where spacing conventions are inconsistent, making word-level metrics noisy. CER computes the edit distance at the character level — insertions, deletions, and substitutions — normalized by the length of the reference transcription. It gives a more stable signal than WER for Korean.

WER (Word Error Rate) is still reported for completeness and for comparison with English-language benchmarks. Both metrics are computed using the Hugging Face evaluate library, which provides standardized implementations. Empty references are filtered out before computation to avoid division-by-zero artifacts.
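The metric itself is compact. The production scripts call the evaluate library, but as a sketch of what the CER computation does, here is a plain-Python version with the empty-reference filtering mentioned above (function names are mine, not the library's; the evaluate/jiwer implementation also applies extra normalization):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def corpus_cer(refs, hyps):
    """Micro-averaged CER: total edit distance / total reference length.
    Empty references are filtered out first to avoid division by zero."""
    pairs = [(r, h) for r, h in zip(refs, hyps) if r.strip()]
    total_edits = sum(edit_distance(r, h) for r, h in pairs)
    total_chars = sum(len(r) for r, _ in pairs)
    return total_edits / total_chars
```

WER is the same computation with token lists (split on whitespace) instead of character strings.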

Single-Model Evaluation

The evaluation script (eval.py) answers a simple question: how well does this one model perform on the validation set?

Pipeline

The pipeline has three stages:

  1. Data discovery: Scan the validation directory for WAV files and pair each with its corresponding TXT transcription file. The directory convention follows the Korean NIA dataset structure, where audio lives under 원천데이터 and labels under 라벨링데이터. The pairing is deterministic (same seed, same shuffle, same subset every time) so results are reproducible.

  2. Parallel audio preprocessing: Load and resample audio to 16 kHz using 200 CPU workers via multiprocessing. Each worker uses librosa.load with kaiser_fast resampling. The results — raw numpy arrays and their transcriptions — are collected in memory for the inference stage.

  3. Multi-GPU inference: The model is loaded in FP16 with device_map="auto", which distributes the model across all available GPUs. Batches of audio are passed through the WhisperProcessor to compute mel spectrograms on the fly, then fed to model.generate with greedy decoding (num_beams=1). The decoded predictions are paired with references for metric computation.
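Stage 1 can be sketched in a few lines. The mirrored directory layout below (matching relative paths under 원천데이터 and 라벨링데이터) is an assumption for illustration, as is the function name; the point is the seeded shuffle that makes the subset reproducible run to run:

```python
import random
from pathlib import Path

def build_pairs(root, subset=None, seed=42):
    """Pair each WAV under the audio tree with its transcription under the
    label tree. Assumes the two trees mirror each other except for the
    top-level folder name and the file extension."""
    root = Path(root)
    pairs = []
    for wav in sorted((root / "원천데이터").rglob("*.wav")):
        rel = wav.relative_to(root / "원천데이터")
        txt = (root / "라벨링데이터" / rel).with_suffix(".txt")
        if txt.exists():
            pairs.append((wav, txt.read_text(encoding="utf-8").strip()))
    # Deterministic shuffle + subset: the same seed yields the same
    # evaluation split every time the script runs.
    random.Random(seed).shuffle(pairs)
    return pairs[:subset] if subset else pairs
```

Sorting before shuffling matters: `rglob` order is filesystem-dependent, so the shuffle is only reproducible if its input order is fixed first.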

Greedy decoding is a deliberate choice for evaluation speed. Beam search with 4–5 beams would improve accuracy by a small margin (typically 0.1–0.3 CER points) but multiplies inference time roughly in proportion to the beam count. For iterative evaluation during development — where you might run this script dozens of times as you compare checkpoints — greedy decoding is the better trade-off.

Output

The script prints a summary: total CER, total WER, samples evaluated, wall-clock time, and throughput (samples/sec). It also prints a random sample of 10 predictions alongside their references for manual inspection — which is often more informative than aggregate metrics. You can spot systematic errors (e.g. the model hallucinating filler words, dropping sentence endings, or confusing similar-sounding syllables) that a single CER number would hide.

Multi-Model Benchmarking

The benchmark script (benchmark.py) is designed for a different workflow: comparing multiple models on the same data, with per-category breakdowns.

Category-Level Evaluation

The Korean telephonic dataset is organized into categories (D01, D02, D03, D04), each representing a different type of telephonic conversation. A model might perform well on clean, scripted speech (D01) but poorly on noisy, spontaneous dialogue (D04). Reporting only aggregate CER would mask this. The benchmark script evaluates each category separately and reports a table:

Model            D01     D02     D03     D04    TOTAL
medium-ft-2     4.21    5.87    6.34    8.12     6.14

(values are CER in percent; lower is better)

This per-category view is essential for diagnosing where a model struggles and for deciding whether you need more data from a specific domain, or whether a different architecture or training strategy is warranted.
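The grouping behind that table is simple. A sketch with hypothetical names, parameterized on whichever corpus-level CER function you use (e.g. the evaluate metric wrapped in a callable):

```python
from collections import defaultdict

def category_table(results, cer_fn):
    """results: list of (category, reference, prediction) tuples.
    cer_fn: corpus-level metric over (refs, hyps) lists.
    Returns {category: score, ..., "TOTAL": score}."""
    by_cat = defaultdict(lambda: ([], []))
    for cat, ref, hyp in results:
        by_cat[cat][0].append(ref)
        by_cat[cat][1].append(hyp)
    table = {cat: cer_fn(refs, hyps)
             for cat, (refs, hyps) in sorted(by_cat.items())}
    # TOTAL pools every pair, so categories with more samples weigh more.
    table["TOTAL"] = cer_fn([r for _, r, _ in results],
                            [h for _, _, h in results])
    return table
```

Note that TOTAL here is the pooled (micro-averaged) score, not the mean of the per-category columns; the two differ when category sizes are unbalanced.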

Architecture: Manifest-Based, No Audio in RAM

The evaluation script loads all audio into memory via multiprocessing, which works fine for 5,000 samples but does not scale to tens of thousands across multiple models. Loading 20,000 audio files into RAM and then running four models sequentially would require either enormous memory or repeated I/O.

The benchmark script takes a different approach. It builds a manifest — a lightweight list of (wav_path, transcription, category) tuples — once, using parallel workers to read only the text transcription files. No audio is loaded into memory at this stage. The manifest is shared across all model evaluations, so the I/O cost of discovering and pairing files is paid exactly once.

Audio is loaded on-demand inside each GPU worker, one batch at a time. After the batch is processed, the audio arrays are discarded. This keeps peak memory proportional to batch_size × audio_length per GPU, regardless of total dataset size.
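A sketch of that on-demand loop, with load_audio and infer as stand-ins for the torchaudio loading and model.generate calls (hypothetical signatures; only the memory pattern is the point):

```python
def batched(manifest, batch_size):
    """Yield successive slices of the manifest; no audio is touched here."""
    for i in range(0, len(manifest), batch_size):
        yield manifest[i:i + batch_size]

def run_shard(manifest, batch_size, load_audio, infer):
    """Peak memory stays at one batch of audio: arrays are created inside
    the loop, consumed by infer(), and freed when the loop advances."""
    results = []
    for batch in batched(manifest, batch_size):
        audio = [load_audio(path) for path, _, _ in batch]  # on-demand I/O
        hyps = infer(audio)
        results.extend(
            (cat, ref, hyp) for (_, ref, cat), hyp in zip(batch, hyps)
        )
    return results
```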

Architecture: One Process Per GPU

Rather than using device_map="auto" to shard a single model across GPUs (which serializes inference), the benchmark script spawns one process per GPU. Each process:

  1. Loads the full model onto its assigned cuda:{rank} device in FP16
  2. Receives a round-robin shard of the manifest (for load balancing)
  3. Loads audio on-demand with torchaudio (faster than librosa for simple load-and-resample)
  4. Runs batched inference with greedy decoding
  5. Streams results back to the main process via a multiprocessing.Queue

The main process collects results from all GPUs with a progress bar and computes metrics once all workers are done.
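The sharding and queue wiring reduce to something like this (a simplified sketch: the real worker also loads the model onto its cuda:{rank} device and runs its shard in batches, and infer here is a stand-in for that):

```python
def shard(manifest, rank, world_size):
    """Round-robin shard: interleaving spreads long and short files across
    GPUs more evenly than contiguous chunks would."""
    return manifest[rank::world_size]

def gpu_worker(rank, world_size, manifest, queue, infer):
    """Runs in its own process in the real script (one per GPU)."""
    for item in shard(manifest, rank, world_size):
        queue.put((rank, item, infer(item)))
    queue.put((rank, None, None))  # sentinel: this worker is done
```

The main process reads the queue, updating its progress bar per result, until it has seen one sentinel per worker; metrics are computed only after that.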

This 1-process-per-GPU design gives near-linear scaling. With 8 GPUs, you get approximately 8× throughput compared to single-GPU evaluation — which matters when you are benchmarking several model checkpoints on tens of thousands of samples.

Engineering Details

A few things that matter in practice:

  • Thread contention: Each GPU process pins OMP_NUM_THREADS, MKL_NUM_THREADS, and torch.set_num_threads to 1. With 8 GPU processes each running their own CUDA kernels, internal CPU threading would cause contention and slow everything down. The same lesson from the preprocessing pipeline (Part 1) applies here.

  • Flash Attention 2: The script attempts to load the model with attn_implementation="flash_attention_2" and falls back gracefully if the environment does not support it. Flash Attention reduces memory usage and speeds up the attention computation, which helps when running many samples through a large model.

  • TF32 and cuDNN benchmark: torch.backends.cuda.matmul.allow_tf32 = True and torch.backends.cudnn.benchmark = True are set for additional inference speed on Ampere+ GPUs.

  • torchaudio over librosa: For the benchmark script, audio loading uses torchaudio.load + torchaudio.functional.resample rather than librosa. Torchaudio avoids pulling in the full scipy/librosa stack in each spawned process and handles the load-resample path with less overhead. For preprocessing (Part 1) where we needed more control, librosa was fine; for high-throughput benchmark inference, torchaudio is leaner.
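The thread-pinning and TF32 bullets look roughly like this at the top of each worker process (a sketch; the torch import is guarded only so the snippet stands alone outside a GPU environment):

```python
import os

def pin_worker_threads():
    """Called first thing in each GPU worker, before torch pulls in
    OpenMP/MKL thread pools. One CPU thread per process: with 8 worker
    processes, internal CPU threading only causes contention."""
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    try:  # guarded so the sketch runs even where torch is absent
        import torch
        torch.set_num_threads(1)
        torch.backends.cuda.matmul.allow_tf32 = True  # faster matmul on Ampere+
        torch.backends.cudnn.benchmark = True         # autotune kernel selection
    except ImportError:
        pass
```

The Flash Attention 2 fallback follows the same try/except pattern: attempt from_pretrained with attn_implementation="flash_attention_2", and retry without the argument if the environment rejects it.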

What the Numbers Tell You (and What They Don’t)

CER and WER give you a single number that summarizes transcription accuracy. That’s useful for comparing models, tracking progress, and deciding when to stop training. But a few caveats:

  • CER doesn’t capture semantic weight. Dropping 못 turns “학교에 못 갔다” (“couldn’t go to school”) into “학교에 갔다” (“went to school”): a tiny edit distance with an inverted meaning. Meanwhile “갓다” for “갔다” is an error of the same size with no meaning change at all. Character-level metrics treat both errors equally.
  • Aggregate metrics hide distribution. A model with 5% CER might have 2% on clean audio and 15% on noisy audio. The per-category breakdown in the benchmark script helps, but even within a category the variance can be large.
  • Greedy vs. beam search. The numbers reported with greedy decoding are a lower bound on the model’s capability. If you need the absolute best CER for a final report, run beam search with 4–5 beams. For development iteration, greedy is the right call.

The most informative step is often the simplest: read the sample predictions. Look at what the model gets wrong. Are the errors phonetically plausible? Are they spacing issues (which CER penalizes but a human wouldn’t notice)? Are there hallucinations — the model generating fluent text that was never spoken? These qualitative observations guide the next round of data collection or training adjustments more than any single metric.

Closing the Loop

This post completes the three-part series. The full pipeline looks like this:

  1. Preprocess (Part 1): Compute mel spectrograms and tokenize transcriptions once. Save to disk as memory-mappable shards.
  2. Train (Part 2): Load precomputed features, run full fine-tuning with BF16, large batches, cosine schedule, and early stopping on CER.
  3. Evaluate (Part 3): Run the fine-tuned model on held-out data. Measure CER/WER. Benchmark against baselines and across data categories. Inspect predictions manually.

Each stage is decoupled. You can re-run evaluation without re-training, swap in a different model checkpoint without re-preprocessing, or add a new data category to the benchmark without touching the training pipeline. That modularity is the point — it turns a monolithic “train and hope” workflow into an iterative, measurable process.

The code for the entire pipeline is on GitHub. If you are fine-tuning Whisper for any language, the same structure applies: preprocess once, train with discipline, and measure everything.