This is Part 1 of a three-part series on fine-tuning Whisper for Korean speech-to-text: Preprocess → Train → Evaluate. In this post, we build the data preprocessing pipeline. Parts 2 and 3 will cover the training loop and evaluation/benchmarking, respectively.


When I first started fine-tuning OpenAI’s Whisper for Korean speech-to-text, I noticed something frustrating. Every single time I kicked off a training run — whether I was tweaking the learning rate, adjusting the batch size, or experimenting with a new scheduler — the framework would spend hours churning through raw audio files before a single gradient was computed. The preprocessing step was identical each time: load WAV files, resample, compute mel spectrograms, tokenize transcriptions. Nothing about the data had changed, yet I was paying the full cost of data preparation on every attempt.

It didn’t take long to realize: this entire step could be done once, saved to disk, and reused across every subsequent fine-tuning run. What followed was a dedicated preprocessing pipeline — one that turned hundreds of thousands of raw audio files into a training-ready dataset in a single pass. This post walks through why that matters and what, exactly, the preprocessing entails.

The full preprocessing script is available on GitHub: preprocess_whisper.py

The Cost of Repeated Preprocessing

Popular training frameworks like Hugging Face’s Trainer make it easy to define a data collator and a map function that transforms raw audio into model-ready features on the fly. This is convenient for prototyping, but the convenience hides a real cost.

For a large-scale dataset — say, several hundred thousand Korean speech samples totaling thousands of hours — the per-run preprocessing overhead is enormous. Each sample requires disk I/O to load the audio, a resampling pass (often from 44.1 kHz or 48 kHz down to 16 kHz), a Short-Time Fourier Transform to produce a mel spectrogram, and subword tokenization of the transcription. Multiply that across hundreds of thousands of files, and you’re looking at hours of compute that contribute nothing to model improvement.

Worse, if a training run crashes at epoch two due to a misconfigured hyperparameter, you’ve wasted all of that preprocessing time. And in an iterative workflow — where you might launch dozens of experimental runs over the course of a week — the accumulated waste is significant.

The solution is straightforward: decouple preprocessing from training. Run the expensive feature extraction once, persist the results, and point every subsequent training script at the precomputed dataset.

What Whisper Actually Needs

To understand what the preprocessing pipeline does, it helps to briefly review how Whisper consumes audio.

The Mel Spectrogram

Whisper is an encoder-decoder Transformer. Its encoder does not operate on raw audio waveforms directly. Instead, it expects a log-mel spectrogram — a two-dimensional representation of audio where one axis is time and the other is frequency, expressed on the mel scale (a perceptual scale that more closely approximates how humans hear pitch).

The conversion from raw waveform to mel spectrogram involves several steps. First, the audio is resampled to 16 kHz, which is the sample rate Whisper was trained on. Then a Short-Time Fourier Transform (STFT) is applied using a 25-millisecond window with a 10-millisecond hop, decomposing the signal into overlapping frames of frequency content. The resulting power spectrum is projected onto 80 mel-frequency bins and converted to a logarithmic scale. The output is a matrix of shape (80, 3000) for a standard 30-second Whisper input window — 80 frequency bins across 3,000 time steps. (One caveat: whisper-large-v3 raises the mel bin count from 80 to 128, so its input shape is (128, 3000); the frame arithmetic is otherwise identical.)
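The frame arithmetic behind that shape is easy to check in a few lines — this is a sketch of the bookkeeping only, not the full STFT and filterbank computation:

```python
# Frame bookkeeping for Whisper's standard 30-second input window.
SAMPLE_RATE = 16_000          # Hz -- the rate Whisper was trained on
WINDOW_MS, HOP_MS = 25, 10    # STFT window and hop sizes in milliseconds
N_MELS = 80                   # mel-frequency bins (whisper-large-v3 uses 128)
CHUNK_SECONDS = 30            # Whisper's fixed input window

win_length = SAMPLE_RATE * WINDOW_MS // 1000   # 400 samples per window
hop_length = SAMPLE_RATE * HOP_MS // 1000      # 160 samples per hop
n_samples = SAMPLE_RATE * CHUNK_SECONDS        # 480,000 samples per chunk
n_frames = n_samples // hop_length             # 3,000 time steps

print((N_MELS, n_frames))                      # (80, 3000)
```

Each hop of 160 samples produces one spectrogram column, so 480,000 samples yield exactly the 3,000 time steps the encoder expects.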

This mel spectrogram is what the Whisper encoder’s convolutional stem ingests before passing it through its Transformer layers. Computing it is not trivial: for each audio file, it involves resampling, windowing, an FFT, a filterbank projection, and log scaling. When you do this for hundreds of thousands of files, the aggregate cost is substantial.

Tokenized Transcriptions

On the decoder side, Whisper uses a byte-level BPE tokenizer to convert transcription text into a sequence of integer token IDs. For multilingual models like whisper-large-v3, the token sequence also includes special tokens that specify the language (<|ko|> for Korean) and the task (<|transcribe|>).
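To make that prompt structure concrete, here is a hypothetical helper showing how the label sequence for one sample is assembled. Token strings are shown instead of integer IDs for readability — the real pipeline gets IDs from WhisperTokenizer — and the `<|notimestamps|>` token reflects the common fine-tuning setup without timestamp prediction:

```python
def build_decoder_labels(text_tokens, language="ko", task="transcribe"):
    """Assemble the decoder-side label sequence for one sample.

    text_tokens: the BPE tokens of the transcription, shown here as
    strings for illustration; the actual pipeline uses integer token IDs.
    """
    prompt = ["<|startoftranscript|>", f"<|{language}|>",
              f"<|{task}|>", "<|notimestamps|>"]
    return prompt + list(text_tokens) + ["<|endoftext|>"]

labels = build_decoder_labels(["안녕", "하세요"])
# → ['<|startoftranscript|>', '<|ko|>', '<|transcribe|>',
#    '<|notimestamps|>', '안녕', '하세요', '<|endoftext|>']
```

The special-token prefix is what steers the decoder toward Korean transcription rather than, say, translation into English.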

Tokenization is comparatively cheap, but it still involves loading and applying the tokenizer for every sample. When bundled into the preprocessing pass, it adds negligible marginal cost while eliminating one more source of repeated work at training time.

Design of the Preprocessing Pipeline

The preprocessing script I wrote is built around a few key design decisions that target high-core-count machines (in my case, a machine with over 200 CPU cores).

Chunked Parallel Processing

Rather than using a simple map over the dataset with multiprocessing — which would require serializing large numpy arrays back to the main process via IPC — the pipeline splits the file list into chunks of roughly 500 files each. Each chunk is dispatched to a worker process that independently loads audio, computes features, and writes results directly to disk as a Hugging Face Dataset shard. The main process never touches the feature arrays; it only collects lightweight status dictionaries reporting success and error counts.

This design sidesteps a common bottleneck in Python multiprocessing. Sending hundreds of megabytes of float arrays through pipes or queues between processes is slow and memory-intensive. By having each worker save its own output, the IPC overhead drops to nearly zero, and the pipeline scales well to high worker counts.
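A condensed sketch of the chunk-and-dispatch logic — run sequentially here for clarity, where the real script fans the chunks out with a process pool, and `process_chunk`'s body is a placeholder for the load → features → save-shard work:

```python
def chunk(paths, size=500):
    """Split the file list into ~500-file work units, one per worker task."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]

def process_chunk(chunk_id, paths):
    """Worker body (placeholder). In the real pipeline each worker loads
    its WAVs, computes log-mel features, tokenizes transcripts, and saves
    its own Hugging Face Dataset shard to disk -- so only this small
    status dict ever crosses the process boundary."""
    ok, failed = len(paths), 0  # stand-in for per-file success/error counts
    return {"chunk": chunk_id, "ok": ok, "failed": failed}

files = [f"clip_{i:06d}.wav" for i in range(1_200)]
statuses = [process_chunk(i, c) for i, c in enumerate(chunk(files))]
# 1,200 files -> 3 chunks of 500, 500, and 200
```

Because the return value is a handful of integers per chunk rather than the feature arrays themselves, IPC cost stays flat no matter how large the spectrograms are.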

Per-Worker Model Initialization

Each worker process initializes its own WhisperFeatureExtractor and WhisperTokenizer instance. While this means the feature-extractor configuration and tokenizer vocabulary are loaded once per worker (rather than shared), it avoids the complexity and fragility of shared-memory schemes for read-only state. On a machine with ample RAM, the redundant memory footprint is an acceptable trade-off for simplicity and robustness.

Internal threading for the numerical libraries underneath NumPy (OpenBLAS, MKL, and friends) is also pinned to a single thread per process via environment variables. Without this, 200 worker processes each spawning their own internal thread pool would cause catastrophic oversubscription of CPU cores.
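The pinning itself is just a few environment variables. The catch is ordering: they must be set before NumPy or anything that links a BLAS backend is imported, because the thread pools are sized at library load time:

```python
import os

# Force single-threaded math inside each worker process. This must run
# BEFORE importing numpy or anything linking OpenBLAS/MKL -- the pools
# are sized when the library loads, so later changes are ignored.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # safe now: it sees the single-thread settings
```

With each process pinned to one thread, parallelism comes entirely from the process count, which the pipeline controls directly.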

Output Format

Each chunk is saved as a self-contained Hugging Face Dataset on disk. This format is memory-mappable at training time, meaning the training loop can read features directly from disk without loading the entire dataset into RAM. For large-scale fine-tuning, this is essential — the precomputed mel spectrograms alone can occupy hundreds of gigabytes in aggregate.

The optional use of float16 for the mel spectrogram features roughly halves storage requirements with negligible impact on training quality, since most fine-tuning runs use mixed precision anyway.
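The savings are simple to quantify for a single 30-second clip, assuming the standard 80 × 3000 spectrogram (the 500,000-clip extrapolation below is illustrative, not a measurement from my dataset):

```python
import numpy as np

mel = np.zeros((80, 3000), dtype=np.float32)   # one 30 s spectrogram
full_bytes = mel.nbytes                        # 960,000 bytes (~0.92 MiB)
half_bytes = mel.astype(np.float16).nbytes     # 480,000 bytes -- exactly half

# Illustrative scale-up: ~500,000 clips would save roughly 240 GB on disk.
saved_gb = (full_bytes - half_bytes) * 500_000 / 1e9
```
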

The Payoff

After a single preprocessing run, every subsequent fine-tuning experiment starts from precomputed features. A training script simply loads the Dataset shards from disk, defines a data collator for padding, and begins optimization immediately. There is no redundant audio loading, no redundant resampling, no redundant FFT computation.
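The collator that remains at training time is correspondingly small, since the expensive work is already on disk. A NumPy sketch of the padding step (a hypothetical `collate`, not the author's exact implementation; -100 is the conventional ignore-index for the cross-entropy loss):

```python
import numpy as np

def collate(batch, label_pad=-100):
    """Pad a batch of precomputed examples. The mel features already share
    a fixed (80, 3000) shape, so only the label sequences need padding."""
    feats = np.stack([np.asarray(ex["input_features"]) for ex in batch])
    max_len = max(len(ex["labels"]) for ex in batch)
    labels = np.full((len(batch), max_len), label_pad, dtype=np.int64)
    for i, ex in enumerate(batch):
        labels[i, : len(ex["labels"])] = ex["labels"]
    return {"input_features": feats, "labels": labels}

batch = [
    {"input_features": np.zeros((80, 3000), np.float16), "labels": [7, 8, 9]},
    {"input_features": np.zeros((80, 3000), np.float16), "labels": [7, 8]},
]
out = collate(batch)
print(out["labels"].tolist())  # [[7, 8, 9], [7, 8, -100]]
```

A real training loop would hand tensors to the model rather than NumPy arrays, but the padding logic is the same either way.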

In practice, this turned a workflow where each experiment had a multi-hour cold start into one where I could iterate on hyperparameters with near-instant data loading. For anyone fine-tuning Whisper on a dataset of meaningful size, the upfront investment in a proper preprocessing pipeline pays for itself almost immediately.

The broader takeaway extends beyond Whisper. Any time a training pipeline involves deterministic, expensive transformations of raw data, those transformations should be computed once and cached. The separation of data preparation from model training is not just a performance optimization — it is a prerequisite for efficient experimentation.


What’s Next

With the data pipeline in place, the preprocessed dataset is sitting on disk — mel spectrograms computed, transcriptions tokenized, shards ready to be memory-mapped. The expensive part of data preparation is behind us.

In Part 2, we pick up exactly where this leaves off: loading those precomputed features into a training loop, configuring the Whisper model for Korean fine-tuning, and working through the practical decisions around learning rates, scheduling, and mixed-precision training. Then in Part 3, we close the loop with evaluation — measuring CER/WER on held-out Korean test sets and benchmarking against baseline models.