1. Introduction
Reliable biometric speaker verification demands clean speech signals, yet real-world deployments—from forensic casework and telephone banking to smart-home assistants—routinely expose recordings to background noise, reverberation, and channel distortion that degrade recognition accuracy [1]. Deep learning (DL) has become the dominant paradigm for single-channel speech enhancement because learned, non-linear mappings from noisy to clean speech consistently surpass classical statistical estimators in both perceptual quality and intelligibility.
Modern architectures exploit encoder–decoder topologies, skip connections, and self-attention to handle non-stationary noise [2]. Among the most widely adopted designs, Wave-U-Net [3] operates directly on raw waveforms, preserving phase information that frequency-domain methods typically discard. The Conformer-based Metric GAN (CMGAN) [4] pairs convolutional feature extractors with multi-head self-attention inside an adversarial training loop, yielding perceptually realistic output even under rapidly changing noise [4,5]. A hybrid U-Net variant [6] applies multi-stage spectral feature fusion, retaining fine spectral detail at moderate computational cost.
Despite these architectural advances, two evaluation blind spots persist. First, most published comparisons test on the same corpus used for training, inflating performance estimates and masking generalization weaknesses. Second, almost no study quantifies how enhancement alters downstream speaker identity—a gap that is critical for forensic, security, and access-control applications.
This paper fills both gaps by benchmarking Wave-U-Net, CMGAN, and U-Net under identical training noise conditions and then evaluating them on three independent corpora—SpEAR, VPQAD, and Clarkson—that the models never see during training. Beyond conventional signal-level (SNR) and perceptual (PESQ) metrics, we introduce commercial-grade VeriSpeak match scores as a biometric fidelity measure, providing the first unified view of the three-way trade-off among noise suppression, perceptual quality, and speaker-identity preservation. While both speech quality enhancement and noise suppression are evaluated, the primary distinguishing contribution of this work is the biometric evaluation—assessing how speech enhancement impacts speaker identity preservation through commercial-grade VeriSpeak match scores, which is critical for security-sensitive applications.
1.1. Related Work
Early single-channel speech-enhancement systems relied on statistical estimators such as spectral subtraction [7], Wiener filtering [8], and MMSE log-spectral amplitude estimators [9]. While computationally light, these classical methods assume stationary or slowly varying noise and therefore falter in the highly non-stationary environments typical of modern mobile and IoT deployments. Deep learning (DL) models now dominate the field because they can learn complex, non-linear noise–speech mappings directly from data, often outperforming classical baselines by large margins in perceptual quality and intelligibility [10,11].
Wave-U-Net family. Stoller et al. [3] adapted the image-centric U-Net architecture to raw audio, introducing a multi-scale encoder–decoder with skip connections that preserve phase coherence while capturing long-range context. Subsequent variants either widen the receptive field via dilated convolutions (Dilated Wave-U-Net [12]) or re-weight encoder features with attention to emphasize speech-dominant regions (Attention Wave-U-Net [13]). These refinements boost distortion metrics (STOI, PESQ) but are still trained and validated on the same corpora, leaving cross-domain robustness unclear.
CMGAN line. CMGAN combines convolutional front-ends with Transformer-style multi-head attention inside a generative-adversarial framework [4]. The generator produces a clean magnitude spectrogram, while a PatchGAN discriminator enforces perceptual realism, and a phone-aware perceptual loss (CMGAN-PPL) further sharpens formant structure [14]. Zhang et al. [15] showed that augmenting CMGAN with urban noise improves generalization to city-sound scenes, yet the model remains evaluated on matched VCTK or VoiceBank–DEMAND test splits.
U-Net derivatives. U-Net remains popular because of its parameter efficiency and ease of adaptation. Belz et al. [6] added multi-stage feature fusion to capture both wide-band and narrow-band artifacts; Baloch et al. [16] inserted gated convolutions that act as dynamic spectral masks, yielding gains for English and Urdu corpora. Variational U-Net (V-UNet) injects a stochastic bottleneck so that the decoder can sample plausible clean spectra under heavy noise [17]. A supervised, lightweight U-Net tuned for real-time hearing-aid deployment was proposed by Hossain et al. [18]. Despite these advances, U-Net studies rarely examine how speech enhancement affects downstream tasks such as automatic speaker verification.
Evaluation practices and gaps. Table 1 summarizes representative studies. Nearly all works rely on matched data splits (training and evaluation from the same corpus) and report only signal-level or perceptual metrics (SDR, PESQ, STOI). Cross-age generalization (children vs. adults), synthetic-vs.-real noise transfer, and biometric utility remain largely unexplored.
Our position. We close this gap by retraining three state-of-the-art models—Wave-U-Net, CMGAN, and U-Net—under a common noise augmentation (DEMAND) and then challenging them with three independent, domain-diverse test sets: SpEAR (synthetic SNR sweep), VPQAD (real adult speakers in laboratory noise), and Clarkson (natural child speech in classrooms). Crucially, we extend evaluation beyond SNR/PESQ to VeriSpeak match scores, offering the first systematic view of enhancement–speaker-verification trade-offs. We note that the primary contribution of this work is not the proposal of a new architecture but the introduction of a rigorous, biometric-aware evaluation framework that is absent from the existing literature. The three models were selected to represent distinct architectural paradigms—time-domain encoder–decoder (Wave-U-Net), GAN-based conformer (CMGAN), and spectrogram-based U-Net—providing broad coverage of current design philosophies. While newer architectures such as Transformer-only and Mamba-based (state-space) models are emerging, they lack the mature, reproducible implementations and pretrained checkpoints necessary for a fair cross-architecture comparison at the time of this study. The evaluation framework established here is architecture-agnostic and readily extensible to such future models.
1.2. Contributions of This Paper
Unified benchmark. First side-by-side test of Wave-U-Net, CMGAN, and U-Net retrained under the same noise profile and evaluated on three unseen corpora (SpEAR, VPQAD, Clarkson). This shared yardstick lets future studies compare new architectures on equal footing and isolate genuine performance gains.
Biometric evaluation. Complements SNR/PESQ with VeriSpeak scores to gauge how well each model preserves speaker identity. Linking enhancement quality to biometric integrity guides algorithm design toward security-aware speech processing. To our knowledge, no prior study has systematically evaluated speech enhancement models using a commercial-grade biometric speaker verification system, making this the first benchmark of its kind.
Actionable insights. Finds U-Net best for noise suppression, CMGAN for perceptual quality, and Wave-U-Net for identity retention—offering clear model-selection guidance. Practitioners can directly map these trade-offs to real-world constraints in telephony, hearing aids, and voice-assistant deployments.
Open resources. Releases all code, configs, and notebooks for full reproducibility. Public artifacts accelerate replication, ablation studies, and community-driven innovation in speech enhancement.
3. Methodology
This study begins with dataset preparation, followed by model training details. The trained models are then assessed using objective metrics to determine their effectiveness in enhancing speech quality and their robustness in diverse conditions. All code pertaining to model architectures, data pre-processing, testing, training, and calculation of evaluation metrics is available in the following GitHub repository: DL Speech Enhancement Toolkit (
https://github.com/jahangirkhondkar/DL_SpeechEnhancementToolkit) (accessed on 1 December 2024).
3.1. Model Architectures
3.1.1. Wave-U-Net Architecture
Wave-U-Net operates entirely in the time domain, applying successive 1-D convolutions to the raw waveform without an intermediate spectral transform. Its symmetric encoder–decoder layout, bridged by skip connections at every resolution level, allows the network to capture both coarse temporal structure and fine transient detail simultaneously.
Figure 2 illustrates the architecture used in this study.
Encoder (Down-Sampling Path): Each encoder stage halves the temporal resolution through strided 1-D convolutions while doubling the feature-map count, progressively distilling high-level signal representations.
Skip Connections: Lateral links concatenate encoder activations with their mirror decoder layer, ensuring that fine-grained waveform details survive the bottleneck.
Decoder (Up-Sampling Path): Transposed convolutions restore the original sample rate; at each stage the decoder fuses up-sampled features with the corresponding skip-connection output to reconstruct the enhanced waveform.
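For concreteness, the PyTorch sketch below shows one down-/up-sampling level with its skip connection; the channel counts and kernel sizes are illustrative placeholders rather than the exact configuration of the trained network.

```python
import torch
import torch.nn as nn

class WaveUNetLevel(nn.Module):
    """One encoder/decoder level of a Wave-U-Net-style model (illustrative sizes)."""
    def __init__(self, in_ch=1, mid_ch=24):
        super().__init__()
        # Encoder: strided 1-D convolution halves the temporal resolution.
        self.down = nn.Conv1d(in_ch, mid_ch, kernel_size=15, stride=2, padding=7)
        # Decoder: transposed convolution restores the original sample rate.
        self.up = nn.ConvTranspose1d(mid_ch, mid_ch, kernel_size=2, stride=2)
        # After concatenating the skip connection, project back to a waveform.
        self.out = nn.Conv1d(mid_ch + in_ch, 1, kernel_size=1)

    def forward(self, x):                  # x: (batch, 1, samples), even-length assumed
        skip = x                           # lateral link kept at full resolution
        h = torch.relu(self.down(x))       # down-sampled features
        h = torch.relu(self.up(h))         # up-sampled back to the input length
        h = torch.cat([h, skip], dim=1)    # fuse skip connection with decoder output
        return self.out(h)                 # enhanced-waveform estimate
```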
3.1.2. CMGAN Architecture
Conformer-Based Metric GAN (CMGAN) tackles monaural speech enhancement within an adversarial framework. A conformer-augmented generator produces the enhanced spectrogram, while a metric discriminator scores output quality, jointly driving the system toward perceptually realistic speech.
Figure 3 outlines the conceptual architecture.
Generator: A shared encoder jointly processes the magnitude and complex (real/imaginary) spectral components. Two successive conformer blocks model temporal and frequency dependencies via interleaved convolution and self-attention. The decoder splits into parallel paths: one estimating a magnitude mask and another refining the complex-valued spectrogram.
Metric Discriminator: The discriminator is trained to predict the PESQ score of the generator’s output relative to the clean reference. By back-propagating a metric-aware loss, it steers the generator toward outputs that maximize perceptual quality—directly targeting non-differentiable evaluation criteria.
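A minimal sketch of the metric-discriminator idea follows, assuming a discriminator D that regresses a PESQ score normalized to [0, 1]; the function names, normalization, and loss choices are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def metric_losses(D, clean_mag, enhanced_mag, pesq_normalized):
    """Sketch of metric-GAN-style losses; `pesq_normalized` is the true PESQ of the
    enhanced output, scaled to [0, 1]. D(ref, test) -> predicted quality of `test`."""
    # Discriminator: regress the measured PESQ of the enhanced output and
    # assign the maximum score to clean speech.
    d_loss = F.mse_loss(D(clean_mag, clean_mag), torch.ones_like(pesq_normalized)) + \
             F.mse_loss(D(clean_mag, enhanced_mag.detach()), pesq_normalized)

    # Generator: rewarded when the discriminator rates its output as perfect,
    # providing a differentiable surrogate of the non-differentiable PESQ metric.
    g_metric_loss = F.mse_loss(D(clean_mag, enhanced_mag),
                               torch.ones_like(pesq_normalized))
    return d_loss, g_metric_loss
```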
3.1.3. U-Net Architecture
Originally developed for biomedical image segmentation, the U-Net architecture has been adapted for spectrogram-based audio tasks owing to its efficient symmetric encoder–decoder design. A contracting path captures increasingly abstract spectral context, while an expanding path recovers spatial (time–frequency) resolution, as depicted in Figure 4.
In the contracting path, pairs of convolutions with ReLU activations are followed by max-pooling steps that halve the spatial dimensions while doubling the channel count. The expanding path mirrors this structure with transposed convolutions and skip-connection concatenations, enabling the network to reconstruct clean spectrograms that retain the fine spectral detail needed for high-fidelity speech recovery [27].
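As a minimal sketch, one contracting step can be written as below; the kernel sizes and channel counts shown are placeholders, not necessarily those of the trained model.

```python
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two convolutions with ReLU, as used along the contracting path."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

# One contracting step: the conv pair raises the channel count, and max-pooling
# halves both the time and frequency dimensions of the spectrogram.
encoder_step = nn.Sequential(double_conv(1, 64), nn.MaxPool2d(kernel_size=2))
```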
3.2. Data Preprocessing
Preprocessing steps were taken for each of the three models—Wave-U-Net, CMGAN, and U-Net—before training and evaluation. Different preprocessing approaches were required due to variations in dataset formats and model requirements.
It is important to note that all evaluation in this study targets speech signals, not music. Although Wave-U-Net was originally developed for music source separation using the MUSDB18-HQ dataset, it is applied here exclusively to speech denoising tasks. All three evaluation datasets (SpEAR, VPQAD, and Clarkson) contain speech recordings corrupted by environmental noise (e.g., Gaussian noise, cafeteria noise, classroom ambient sounds), and the models are assessed solely on their ability to enhance speech quality and preserve speaker identity.
3.2.1. Wave-U-Net
Training Preprocessing: The MUSDB18-HQ dataset used for training has a sampling rate of 44.1 kHz, which matches the input requirement of Wave-U-Net. However, the noise signals added from the DEMAND dataset were originally recorded at 48 kHz. Therefore, all tracks from the DEMAND dataset were down-sampled to 44.1 kHz.
Enhancement Preprocessing: The Clarkson and VPQAD datasets were both recorded at 44.1 kHz, whereas the SpEAR dataset has a sampling rate of 16 kHz. The SpEAR data were therefore up-sampled to 44.1 kHz to be compatible with the trained Wave-U-Net model during evaluation.
3.2.2. CMGAN
Training Preprocessing: The VCTK Corpus dataset, initially in .flac format and sampled at 48 kHz, was converted to .wav format. To match the target sampling rate of the CMGAN model, the clean speech files were down-sampled to 16 kHz. The DEMAND dataset was similarly down-sampled to 16 kHz and mixed with the VCTK dataset to generate noisy speech samples for training.
Enhancement Preprocessing: CMGAN checkpoints were configured for 16 kHz audio, necessitating down-sampling of Clarkson and VPQAD datasets to 16 kHz before evaluation. The SpEAR dataset was already at 16 kHz, so it was directly compatible with CMGAN for enhancement.
3.2.3. U-Net
Training Preprocessing: The LibriSpeech Corpus dataset, sampled at 16 kHz, was originally in .flac format and converted to .wav format to be compatible with the U-Net model. The ESC-50 dataset, recorded at 44.1 kHz, was down-sampled to 16 kHz and mixed with the LibriSpeech dataset.
Enhancement Preprocessing: Since U-Net checkpoints were compatible only with 16 kHz audio, Clarkson and VPQAD datasets were down-sampled to 16 kHz for evaluation. The SpEAR dataset, at 16 kHz, was compatible without additional preprocessing.
Wave-U-Net follows Stoller et al. [3], whose original MUSDB18-HQ training data are sampled at 44.1 kHz; keeping this rate preserves the network's time-domain receptive field and avoids aliasing from down-sampling. CMGAN and U-Net, which employ STFT front-ends, are trained at the standard 16 kHz speech rate. After enhancement, all outputs are resampled to 16 kHz so that PESQ, SNR, and VeriSpeak scores remain directly comparable across models. We acknowledge that resampling operations can alter the spectral content of audio signals. Up-sampling from 16 kHz to 44.1 kHz (e.g., SpEAR with Wave-U-Net) introduces interpolated frequency content above 8 kHz that was absent in the original recording, while down-sampling from 44.1 kHz to 16 kHz (e.g., VPQAD and Clarkson with CMGAN and U-Net) removes frequency information above 8 kHz. These transformations may influence model performance, particularly for speaker-specific features residing in higher frequency bands. To mitigate evaluation bias, all enhanced outputs were resampled to a common 16 kHz rate before computing PESQ, SNR, and VeriSpeak scores, ensuring a consistent comparison baseline across all models.
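As an illustrative sketch (file names are hypothetical), the resampling steps described above can be implemented with librosa and soundfile:

```python
import librosa
import soundfile as sf

def resample_to(path_in, path_out, target_sr):
    """Load a recording at its native rate and write it at `target_sr` Hz."""
    audio, sr = librosa.load(path_in, sr=None)   # keep native rate on load
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    sf.write(path_out, audio, target_sr)

# Example: SpEAR (16 kHz) up-sampled for Wave-U-Net, then the enhanced
# output brought back to 16 kHz before PESQ/SNR/VeriSpeak scoring.
resample_to("spear_noisy.wav", "spear_44k.wav", 44_100)
resample_to("enhanced_44k.wav", "enhanced_16k.wav", 16_000)
```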
3.3. Training Setup
3.3.1. Training Environment
Training was carried out on a workstation equipped with an NVIDIA GeForce RTX 4080 GPU (16 GB VRAM, Compute Capability 8.9; NVIDIA Corporation, Santa Clara, CA, USA), 64 GB of system RAM, and an AMD (Advanced Micro Devices, Inc., Santa Clara, CA, USA) processor running Ubuntu 22.04. Isolated Python 3.10 virtual environments ensured dependency separation across models. Post-processing and metric computation used MATLAB 2022b (MathWorks, Natick, MA, USA).
3.3.2. Training Parameters
The training parameters varied across the three models—Wave-U-Net, CMGAN, and U-Net—according to their architectural requirements and dataset properties. Hyper-parameters for each model were initialized from the configurations published in the respective authors' repositories and largely retained, since our study uses the same corpora on which those models were originally trained; keeping the original setups allows fair comparison, with small adjustments such as periodic checkpointing to capture the best validation snapshot for our modified datasets. We augmented each training set with DEMAND noise (including common noisy files drawn from the CMGAN dataset) to increase acoustic diversity while preserving the native data format and structure. The per-model configurations are detailed below.
The Wave-U-Net model was trained using the Adam optimizer with an initial learning rate of 0.0001. A cyclic learning rate schedule was employed, oscillating between a lower and an upper bound around this base rate to help the model avoid local minima. The loss functions used were mean squared error (MSE) and L1 loss, which aided in preserving sharp transients in the audio signal. Training was conducted with a batch size of 16. Additionally, data augmentation was applied, involving random amplification between 0.7 and 1.0, to enhance the model's generalization capabilities.
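A hedged PyTorch sketch of this setup is shown below; the model is a trivial stand-in and the cyclic-schedule bounds are placeholders, since the exact range is not reproduced here.

```python
import torch
import torch.nn as nn

# Stand-in module; the full Wave-U-Net would be used in practice.
model = nn.Conv1d(1, 1, kernel_size=15, padding=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Cyclic schedule oscillating around the base rate (bounds are placeholders).
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-4, step_size_up=2000, cycle_momentum=False)
mse, l1 = nn.MSELoss(), nn.L1Loss()

def training_step(noisy, clean):
    """One optimization step on a (noisy, clean) waveform batch of shape (B, 1, T)."""
    optimizer.zero_grad()
    enhanced = model(noisy)
    loss = mse(enhanced, clean) + l1(enhanced, clean)  # L1 term helps keep transients sharp
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```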
The CMGAN model was trained using the AdamW optimizer for both the generator and discriminator, each with its own initial learning rate. A cyclic learning rate scheduler was employed, with decay by a factor of 0.5 every 12 epochs. The batch size was set to 4, ensuring efficient training while maintaining computational feasibility.
The training loss consisted of a combination of time-domain, time–frequency, and adversarial terms, balanced by weight factors,

$\mathcal{L}_G = \lambda_{\mathrm{t}}\,\mathcal{L}_{\mathrm{time}} + \lambda_{\mathrm{f}}\,\mathcal{L}_{\mathrm{TF}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}},$

where $\lambda_{\mathrm{t}}$, $\lambda_{\mathrm{f}}$, and $\lambda_{\mathrm{adv}}$ denote the weights of the time-domain, frequency-domain, and adversarial losses, respectively.
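A sketch of how such a weighted generator loss could be assembled is shown below; the weight defaults and the loss choices for the individual terms are placeholders, not the exact settings used.

```python
import torch
import torch.nn.functional as F

def generator_loss(enh_wave, clean_wave, enh_spec, clean_spec, d_score,
                   w_time=1.0, w_freq=1.0, w_adv=1.0):
    """Weighted combination of time-domain, TF-domain, and adversarial terms
    (weights are illustrative placeholders)."""
    l_time = F.l1_loss(enh_wave, clean_wave)                 # waveform fidelity
    l_freq = F.mse_loss(enh_spec, clean_spec)                # spectrogram fidelity
    l_adv  = F.mse_loss(d_score, torch.ones_like(d_score))   # push discriminator score to 1
    return w_time * l_time + w_freq * l_freq + w_adv * l_adv
```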
The U-Net model was trained using the Adam optimizer with an initial learning rate of 0.0002. A learning rate scheduler was applied to reduce the learning rate by half after every 20 epochs to stabilize training and improve convergence. The model used a batch size of 16 for efficient training, and the L1 loss function was used to minimize the difference between the predicted and clean speech signals. Data augmentation was performed through random noise addition and pitch shifts to improve the model's robustness to different audio conditions [6].
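The corresponding optimizer, scheduler, and augmentation setup could look as follows; the stand-in model, noise level, and pitch-shift range are illustrative assumptions, not the exact values used.

```python
import numpy as np
import torch
import librosa

# Stand-in module; the full spectrogram U-Net would be used in practice.
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Halve the learning rate every 20 epochs (scheduler stepped once per epoch).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = torch.nn.L1Loss()

def augment(clean, sr=16000, rng=np.random.default_rng()):
    """Random noise addition and pitch shift for robustness (levels illustrative)."""
    noisy = clean + 0.005 * rng.standard_normal(len(clean))
    return librosa.effects.pitch_shift(noisy, sr=sr, n_steps=float(rng.uniform(-2, 2)))
```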
3.3.3. Experimental Setup
For Wave-U-Net, the MUSDB18-HQ dataset consists of two folders: a training set ("train") of 100 songs and a test set ("test") of 50 songs. We used 120 folders for training and 30 for validation, ensuring an 80-20 split for model evaluation. In the "others" stem of the training folders, 100 folders were randomly replaced with noise from the DEMAND dataset to simulate more challenging training conditions. The training process ran for 68 epochs, and the model checkpoint with the lowest validation loss (0.0248) was selected as the final model. Early stopping was employed, halting training if there was no improvement in validation loss for 20 consecutive epochs. Validation loss was monitored throughout training using TensorBoard, and a graph of validation loss vs. epoch is presented in the results section to illustrate model convergence.
For CMGAN, the experiment involved mixing the Voice Bank Corpus with noise from the DEMAND dataset at SNRs varying between 10 and 15 dB. A total of 2000 audio tracks were generated by mixing clean audio with random noise segments from the DEMAND dataset; these were split into 1600 for training and 400 for validation to maintain an 80-20 split. Training ran for 120 epochs, and the best checkpoint by validation loss, found at the 34th epoch, was used for audio enhancement.
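A minimal sketch of mixing clean speech with noise at a randomly drawn SNR in the 10-15 dB range (array loading omitted; variable names hypothetical):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix clean speech with noise at a target SNR in dB."""
    noise = np.resize(noise, clean.shape)          # loop or trim the noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
snr = rng.uniform(10, 15)                          # random SNR within the 10-15 dB range
# noisy = mix_at_snr(clean_audio, demand_noise, snr)   # arrays loaded beforehand
```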
For U-Net, the experiment used 2000 files from the LibriSpeech Corpus as clean speech and a combination of 288 files from the DEMAND dataset and 1712 files from the ESC-50 dataset as noise sources. The files were divided into 1600 for training and 400 for validation for each dataset, maintaining an 80-20 split. Training ran for 100 epochs, with the model checkpoint saved based on the lowest validation loss. After validation, the best model was selected, providing optimal performance in enhancing noisy speech. Evaluation included standard objective metrics, PESQ and SNR, to assess the effectiveness of the enhanced audio.
3.4. Evaluation Metrics
3.4.1. Signal-to-Noise Ratio (SNR)
SNR expresses the ratio of desired-signal power to residual-noise power in decibels; a higher value indicates more effective noise removal [28]:

$\mathrm{SNR} = 10\log_{10}\!\left(\frac{P_{s}}{P_{n}}\right),$

where $P_{s}$ and $P_{n}$ denote signal and noise power, respectively. Our implementation estimates noise power from the first 10% of the unfiltered waveform (assumed to contain predominantly background noise) and then computes frame-wise SNR via the STFT with a 32 ms window (512 samples at 16 kHz) and 50% overlap [29,30].
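The following sketch approximates this procedure with librosa; the frame-aggregation details are an assumption on our part and may differ from the repository implementation.

```python
import numpy as np
import librosa

def framewise_snr(noisy, enhanced, sr=16000):
    """Noise power from the first 10% of the unfiltered waveform, then frame-wise
    SNR via a 32 ms / 50%-overlap STFT (512 samples at 16 kHz)."""
    n_fft, hop = 512, 256
    noise_ref = noisy[: max(len(noisy) // 10, n_fft)]
    p_noise = np.mean(np.abs(librosa.stft(noise_ref, n_fft=n_fft, hop_length=hop)) ** 2)
    frames = np.abs(librosa.stft(enhanced, n_fft=n_fft, hop_length=hop)) ** 2
    p_signal = frames.mean(axis=0)                 # per-frame signal power
    snr_frames = 10.0 * np.log10((p_signal + 1e-12) / (p_noise + 1e-12))
    return float(np.mean(snr_frames))
```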
3.4.2. Perceptual Evaluation of Speech Quality (PESQ)
PESQ (ITU-T P.862) models the human auditory response to quantify perceived speech quality on a scale from −0.5 to 4.5 [31,32]. We used the open-source Python wrapper [33], supplying the noisy recording as the reference and the enhanced output as the test signal. Higher scores reflect less audible distortion and greater naturalness.
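Assuming the cited wrapper is the `pesq` PyPI package, a usage sketch with hypothetical file paths looks like this:

```python
import soundfile as sf
from pesq import pesq   # open-source PESQ wrapper (assumed package; see [33])

ref, sr = sf.read("noisy_input.wav")        # noisy recording used as the reference
deg, _  = sf.read("enhanced_output.wav")    # enhanced output under test; signals time-aligned
assert sr == 16000                           # wideband P.862 mode expects 16 kHz
score = pesq(sr, ref, deg, "wb")             # ITU-T P.862 quality score (higher is better)
print(f"PESQ: {score:.2f}")
```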
3.4.3. VeriSpeak: Speaker Recognition Performance Evaluation
VeriSpeak (Neurotechnology) is a commercial speaker-verification engine that computes a similarity score between an enrolled voice template and a test utterance [34]. We automated enrollment and matching via a MATLAB wrapper, adopting the standard decision threshold of 60 (FAR ≈ 0.01%). An increase in match score after enhancement indicates that speaker-discriminative features have been preserved or recovered; a decrease signals harmful distortion of identity cues.
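Since enrollment and matching run inside the commercial SDK, the sketch below only illustrates the downstream decision logic around the threshold of 60, using hypothetical scores.

```python
# Post-processing of match scores only; enrollment and matching are performed
# through the commercial VeriSpeak SDK (scores below are hypothetical examples).
THRESHOLD = 60  # standard decision threshold (FAR ~ 0.01%)

def verification_change(score_noisy: float, score_enhanced: float) -> dict:
    """Summarize how enhancement shifted a speaker-verification attempt."""
    return {
        "delta": score_enhanced - score_noisy,      # positive = identity cues preserved/recovered
        "accepted_before": score_noisy >= THRESHOLD,
        "accepted_after": score_enhanced >= THRESHOLD,
    }

print(verification_change(score_noisy=52.0, score_enhanced=71.0))
```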
3.4.4. Computational Cost
Table 4 reports the approximate training and inference times for each model on the hardware described in Section 3.3.1.
CMGAN requires the longest training and inference times due to its adversarial training framework and transformer-based attention mechanisms, while U-Net is the most computationally efficient model. All inference times are averaged over the SpEAR dataset files at each model’s native sampling rate.
5. Discussion
5.1. Wave-U-Net
Wave-U-Net exhibits SNR decreases on all three corpora (SpEAR, VPQAD, and Clarkson). Rather than indicating poor denoising, these reductions stem from the model's aggressive isolation of the target speech signal, which lowers the overall signal power envelope. Complementary metrics (PESQ, VeriSpeak) confirm that Wave-U-Net suppresses noise effectively while retaining speech clarity and speaker-discriminative cues.
Perceptual quality improves consistently: mean PESQ reaches 2.91 on SpEAR, 1.19 on VPQAD, and 2.44 on Clarkson. The SpEAR and Clarkson figures indicate substantial gains in intelligibility and naturalness, while the lower VPQAD score reflects the inherently demanding cafeteria and laboratory noise conditions in that corpus. Overall, these results confirm that Wave-U-Net enhances listening quality without the aggressive spectral distortion observed in purely SNR-optimized models.
Biometric fidelity is Wave-U-Net's strongest asset. VeriSpeak match scores rise on all three corpora (SpEAR, VPQAD, and Clarkson), indicating that the model's time-domain processing preserves the spectral fine structure on which speaker embeddings rely. The pronounced gains on VPQAD and Clarkson—corpora with substantial ambient noise—suggest that Wave-U-Net is particularly well suited to biometric pipelines operating in uncontrolled acoustic environments.
5.2. CMGAN
CMGAN's SNR behavior mirrors its design emphasis on perceptual quality over raw noise attenuation: SNR drops on all three corpora (SpEAR, VPQAD, and Clarkson), with the largest reduction on VPQAD, attributable to the highly non-stationary cafeteria and laboratory noise in that corpus. As with Wave-U-Net, these decreases do not signify inadequate denoising; the conformer-based generator redistributes spectral energy to maximize listening comfort rather than wideband SNR, a strategy validated by the model's leading PESQ scores.
CMGAN achieves the highest PESQ values in the comparison: 3.33 (SpEAR), 1.35 (VPQAD), and 2.50 (Clarkson). The adversarial training objective, which penalizes perceptually implausible outputs through its metric discriminator, drives the model toward natural-sounding speech even under severe noise. These figures make CMGAN the strongest candidate for listening-comfort applications such as hearing aids, telecommunications, and broadcast media, where subjective speech quality outweighs raw noise suppression.
VeriSpeak scores improve on SpEAR, VPQAD, and Clarkson alike. Although these gains are smaller than Wave-U-Net's, they confirm that CMGAN's adversarial enhancement does not distort speaker identity, making it viable for dual-purpose (quality + verification) pipelines.
5.3. U-Net
U-Net is the clear SNR leader on all three corpora (SpEAR, VPQAD, and Clarkson). The outsized Clarkson gain reflects the model's ability to strip substantial ambient classroom noise, producing nearly silent backgrounds. These figures position U-Net as the top candidate when maximum noise attenuation is the overriding requirement.
The cost of U-Net’s aggressive noise removal is visible in its PESQ figures: 1.15 (SpEAR), 1.35 (VPQAD), and 1.89 (Clarkson). All values fall below those of CMGAN and Wave-U-Net, indicating that the spectrogram-domain max-pooling and hard masking introduce spectral artifacts that degrade perceived naturalness. This SNR–PESQ trade-off makes U-Net less appropriate for applications in which listening comfort is the primary objective, though its outputs may still be acceptable when subsequent processing (e.g., ASR) is tolerant of mild distortion.
U-Net’s heavy noise attenuation comes at a biometric cost. On SpEAR, VeriSpeak scores show no measurable gain; on VPQAD and Clarkson, results are inconsistent across subjects, with several speakers exhibiting declines in match scores (overall averages are therefore omitted from the table). The spectrogram-domain processing likely smooths the formant transitions and harmonic micro-structure that underpin speaker embeddings, explaining the poor identity-preservation performance. For deployments that require both clean audio and reliable speaker verification, U-Net would need to be paired with a downstream identity-restoration module.
5.4. Overall
Across all three corpora, the models occupy distinct regions of the noise-suppression/perceptual-quality/biometric-fidelity trade-off space. U-Net maximizes SNR but sacrifices naturalness (PESQ ≤ 1.89) and biometric utility. CMGAN maximizes PESQ (up to 3.33) with moderate identity preservation. Wave-U-Net delivers the largest VeriSpeak gains alongside competitive PESQ, making it the strongest all-round candidate when speaker identity matters.
In deployment terms: U-Net suits forensic pre-processing where intelligibility after heavy noise removal is acceptable; CMGAN fits hearing-aid and media workflows that prioritize listening comfort; Wave-U-Net is the natural choice for access-control and voice-banking pipelines that must maintain verification accuracy.
A focused analysis of Wave-U-Net on Clarkson Collections 7 and 8 confirms that enhancement raises both mated scores (146→165, 122→155) and non-mated scores (22→32, 16→24). Crucially, the mated–non-mated gap widens, improving discriminability while keeping all non-mated scores well below the standard VeriSpeak threshold of 60.
5.5. Limitations and Future Directions
Several limitations of this study should be acknowledged. First, all training and evaluation datasets consist exclusively of English speech. Language-specific phonetic, prosodic, and tonal characteristics may influence model effectiveness differently, and the generalizability of our findings to other languages remains unexplored. Cross-lingual evaluation represents an important direction for future research.
Second, the resampling operations required to match each model’s native sampling rate may introduce subtle spectral artifacts, as discussed in
Section 3.2. While we standardized all outputs to 16 kHz before metric computation to ensure fair comparison, future studies could train all models at a common sampling rate to eliminate this potential confound.
Third, the current evaluation uses three datasets of varying size and noise characteristics. Extending the analysis to larger-scale, multilingual datasets with a wider variety of noise types would further strengthen the generalizability of the conclusions. Additionally, evaluating model performance on resource-constrained devices (e.g., edge deployment, real-time applications) warrants further investigation.
Fourth, while the three models selected cover distinct architectural paradigms (time-domain encoder–decoder, GAN-based conformer, and spectrogram-based U-Net), recent advances in Transformer-only and Mamba-based (state-space) architectures for speech enhancement are promising. At the time of this study, these newer models lacked the mature, reproducible implementations necessary for a fair cross-architecture comparison. The biometric-aware evaluation framework introduced here is architecture-agnostic, and extending it to include Transformer, Mamba, and other emerging architectures is a natural and important next step.
6. Conclusions
This paper introduced a biometric-aware benchmark for speech enhancement by jointly evaluating three architecturally diverse models—Wave-U-Net, CMGAN, and U-Net—on three independent, cross-domain test sets that the models never encountered during training. Unlike prior work that relies solely on SNR and PESQ, we incorporated commercial-grade VeriSpeak speaker-verification scores, revealing a three-way trade-off that is invisible to conventional metrics alone: U-Net maximizes noise suppression at the cost of naturalness and identity fidelity; CMGAN delivers the highest perceptual quality; and Wave-U-Net provides the strongest biometric preservation alongside competitive perceptual scores. The Clarkson children’s speech corpus and VPQAD cafeteria recordings posed particularly stringent generalization challenges, confirming that these trade-offs hold under realistic, non-synthetic conditions. Our primary contribution is therefore not a new architecture but a reproducible, architecture-agnostic evaluation protocol that links enhancement quality directly to biometric integrity—a perspective that is critical for forensic, security, and access-control deployments. Among the three evaluation dimensions, the VeriSpeak-based biometric assessment constitutes the most novel contribution, offering direct, actionable guidance for security-sensitive applications. Future work should extend this framework to non-English languages, investigate the effects of language-specific acoustic characteristics, incorporate emerging architectures such as Transformer-only and Mamba-based models, and evaluate real-time performance on resource-constrained hardware.