1. Introduction
Reliable biometric speaker verification demands clean speech signals, yet real-world deployments—from forensic casework and telephone banking to smart-home assistants—routinely expose recordings to background noise, reverberation, and channel distortion that degrade recognition accuracy [1]. Deep learning (DL) has become the dominant paradigm for single-channel speech enhancement because learned, non-linear mappings from noisy to clean speech consistently surpass classical statistical estimators in both perceptual quality and intelligibility.
Modern architectures exploit encoder–decoder topologies, skip connections, and self-attention to handle non-stationary noise [2]. Among the most widely adopted designs, Wave-U-Net [3] operates directly on raw waveforms, preserving phase information that frequency-domain methods typically discard. The Conformer-based Metric GAN (CMGAN) [4] pairs convolutional feature extractors with multi-head self-attention inside an adversarial training loop, yielding perceptually realistic output even under rapidly changing noise [4,5]. A hybrid U-Net variant [6] applies multi-stage spectral feature fusion, retaining fine spectral detail at moderate computational cost.
Despite these architectural advances, two evaluation blind spots persist. First, most published comparisons test on the same corpus used for training, inflating performance estimates and masking generalization weaknesses. Second, almost no study quantifies how enhancement alters downstream speaker identity—a gap that is critical for forensic, security, and access-control applications.
This paper fills both gaps by benchmarking Wave-U-Net, CMGAN, and U-Net under identical training noise conditions and then evaluating them on three independent corpora—SpEAR, VPQAD, and Clarkson—that the models never see during training. Beyond conventional signal-level (SNR) and perceptual (PESQ) metrics, we introduce commercial-grade VeriSpeak match scores as a biometric fidelity measure, providing the first unified view of the three-way trade-off among noise suppression, perceptual quality, and speaker-identity preservation. While both speech quality enhancement and noise suppression are evaluated, the primary distinguishing contribution of this work is the biometric evaluation—assessing how speech enhancement impacts speaker identity preservation through commercial-grade VeriSpeak match scores, which is critical for security-sensitive applications.
1.1. Related Work
Early single-channel speech-enhancement systems relied on statistical estimators such as spectral subtraction [7], Wiener filtering [8], and MMSE log-spectral amplitude estimators [9]. While computationally light, these classical methods assume stationary or slowly varying noise and therefore falter in the highly non-stationary environments typical of modern mobile and IoT deployments. Deep learning (DL) models now dominate the field because they can learn complex, non-linear noise–speech mappings directly from data, often outperforming classical baselines by large margins in perceptual quality and intelligibility [10,11].
Wave-U-Net family. Stoller et al. [3] adapted the image-centric U-Net architecture to raw audio, introducing a multi-scale encoder–decoder with skip connections that preserve phase coherence while capturing long-range context. Subsequent variants either widen the receptive field via dilated convolutions (Dilated Wave-U-Net [12]) or re-weight encoder features with attention to emphasize speech-dominant regions (Attention Wave-U-Net [13]). These refinements boost distortion metrics (STOI, PESQ) but are still trained and validated on the same corpora, leaving cross-domain robustness unclear.
CMGAN line. CMGAN combines convolutional front-ends with Transformer-style multi-head attention inside a generative-adversarial framework [4]. The generator produces a clean magnitude spectrogram, while a PatchGAN discriminator enforces perceptual realism, and a phone-aware perceptual loss (CMGAN-PPL) further sharpens formant structure [14]. Zhang et al. [15] showed that augmenting CMGAN with urban noise improves generalization to city-sound scenes, yet the model remains evaluated on matched VCTK or VoiceBank–DEMAND test splits.
U-Net derivatives. U-Net remains popular because of its parameter efficiency and ease of adaptation. Belz et al. [6] added multi-stage feature fusion to capture both wide-band and narrow-band artifacts; Baloch et al. [16] inserted gated convolutions that act as dynamic spectral masks, yielding gains for English and Urdu corpora. Variational U-Net (V-UNet) injects a stochastic bottleneck so that the decoder can sample plausible clean spectra under heavy noise [17]. A supervised, lightweight U-Net tuned for real-time hearing-aid deployment was proposed by Hossain et al. [18]. Despite these advances, U-Net studies rarely examine how speech enhancement affects downstream tasks such as automatic speaker verification.
Evaluation practices and gaps. Table 1 summarizes representative studies. Nearly all works rely on matched data splits (training and evaluation from the same corpus) and report only signal-level or perceptual metrics (SDR, PESQ, STOI). Cross-age generalization (children vs. adults), synthetic-vs.-real noise transfer, and biometric utility remain largely unexplored.
Our position. We close this gap by retraining three state-of-the-art models—Wave-U-Net, CMGAN, and U-Net—under a common noise augmentation (DEMAND) and then challenging them with three independent, domain-diverse test sets: SpEAR (synthetic SNR sweep), VPQAD (real adult speakers in laboratory noise), and Clarkson (natural child speech in classrooms). Crucially, we extend evaluation beyond SNR/PESQ to VeriSpeak match scores, offering the first systematic view of enhancement–speaker-verification trade-offs. We note that the primary contribution of this work is not the proposal of a new architecture but the introduction of a rigorous, biometric-aware evaluation framework that is absent from the existing literature. The three models were selected to represent distinct architectural paradigms—time-domain encoder–decoder (Wave-U-Net), GAN-based conformer (CMGAN), and spectrogram-based U-Net—providing broad coverage of current design philosophies. While newer architectures such as Transformer-only and Mamba-based (state-space) models are emerging, they lack the mature, reproducible implementations and pretrained checkpoints necessary for a fair cross-architecture comparison at the time of this study. The evaluation framework established here is architecture-agnostic and readily extensible to such future models.
1.2. Contributions of This Paper
Unified benchmark. First side-by-side test of Wave-U-Net, CMGAN, and U-Net retrained under the same noise profile and evaluated on three unseen corpora (SpEAR, VPQAD, Clarkson). This shared yardstick lets future studies compare new architectures on equal footing and isolate genuine performance gains.
Biometric evaluation. Complements SNR/PESQ with VeriSpeak scores to gauge how well each model preserves speaker identity. Linking enhancement quality to biometric integrity guides algorithm design toward security-aware speech processing. To our knowledge, no prior study has systematically evaluated speech enhancement models using a commercial-grade biometric speaker verification system, making this the first benchmark of its kind.
Actionable insights. Finds U-Net best for noise suppression, CMGAN for perceptual quality, and Wave-U-Net for identity retention—offering clear model-selection guidance. Practitioners can directly map these trade-offs to real-world constraints in telephony, hearing aids, and voice-assistant deployments.
Open resources. Releases all code, configs, and notebooks for full reproducibility. Public artifacts accelerate replication, ablation studies, and community-driven innovation in speech enhancement.
3. Methodology
This study begins with dataset preparation, followed by model training details. The trained models are then assessed using objective metrics to determine their effectiveness in enhancing speech quality and their robustness in diverse conditions. All code pertaining to model architectures, data pre-processing, testing, training, and calculation of evaluation metrics is available in the following GitHub repository: DL Speech Enhancement Toolkit (
https://github.com/jahangirkhondkar/DL_SpeechEnhancementToolkit) (accessed on 1 December 2024).
3.1. Model Architectures
3.1.1. Wave-U-Net Architecture
Wave-U-Net operates entirely in the time domain, applying successive 1-D convolutions to the raw waveform without an intermediate spectral transform. Its symmetric encoder–decoder layout, bridged by skip connections at every resolution level, allows the network to capture both coarse temporal structure and fine transient detail simultaneously.
Figure 2 illustrates the architecture used in this study.
Encoder (Down-Sampling Path): Each encoder stage halves the temporal resolution through strided 1-D convolutions while doubling the feature-map count, progressively distilling high-level signal representations.
Skip Connections: Lateral links concatenate encoder activations with their mirror decoder layer, ensuring that fine-grained waveform details survive the bottleneck.
Decoder (Up-Sampling Path): Transposed convolutions restore the original sample rate; at each stage the decoder fuses up-sampled features with the corresponding skip-connection output to reconstruct the enhanced waveform.
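For concreteness, the PyTorch sketch below shows one down-/up-sampling level with its skip connection; the channel counts and kernel sizes are illustrative placeholders rather than the exact configuration of the trained network.

```python
import torch
import torch.nn as nn

class WaveUNetLevel(nn.Module):
    """One encoder/decoder level of a Wave-U-Net-style model (illustrative sizes)."""
    def __init__(self, in_ch=1, mid_ch=24):
        super().__init__()
        # Encoder: strided 1-D convolution halves the temporal resolution.
        self.down = nn.Conv1d(in_ch, mid_ch, kernel_size=15, stride=2, padding=7)
        # Decoder: transposed convolution restores the original sample rate.
        self.up = nn.ConvTranspose1d(mid_ch, mid_ch, kernel_size=2, stride=2)
        # After concatenating the skip connection, project back to a waveform.
        self.out = nn.Conv1d(mid_ch + in_ch, 1, kernel_size=1)

    def forward(self, x):                  # x: (batch, 1, samples), even-length assumed
        skip = x                           # lateral link kept at full resolution
        h = torch.relu(self.down(x))       # down-sampled features
        h = torch.relu(self.up(h))         # up-sampled back to the input length
        h = torch.cat([h, skip], dim=1)    # fuse skip connection with decoder output
        return self.out(h)                 # enhanced-waveform estimate
```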
3.1.2. CMGAN Architecture
Conformer-Based Metric GAN (CMGAN) tackles monaural speech enhancement within an adversarial framework. A conformer-augmented generator produces the enhanced spectrogram, while a metric discriminator scores output quality, jointly driving the system toward perceptually realistic speech.
Figure 3 outlines the conceptual architecture.
Generator: A shared encoder jointly processes the magnitude and complex (real/imaginary) spectral components. Two successive conformer blocks model temporal and frequency dependencies via interleaved convolution and self-attention. The decoder splits into parallel paths: one estimating a magnitude mask and another refining the complex-valued spectrogram.
Metric Discriminator: The discriminator is trained to predict the PESQ score of the generator’s output relative to the clean reference. By back-propagating a metric-aware loss, it steers the generator toward outputs that maximize perceptual quality—directly targeting non-differentiable evaluation criteria.
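A minimal sketch of the metric-discriminator idea follows, assuming a discriminator D that regresses a PESQ score normalized to [0, 1]; the function names, normalization, and loss choices are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def metric_losses(D, clean_mag, enhanced_mag, pesq_normalized):
    """Sketch of metric-GAN-style losses; `pesq_normalized` is the true PESQ of the
    enhanced output, scaled to [0, 1]. D(ref, test) -> predicted quality of `test`."""
    # Discriminator: regress the measured PESQ of the enhanced output and
    # assign the maximum score to clean speech.
    d_loss = F.mse_loss(D(clean_mag, clean_mag), torch.ones_like(pesq_normalized)) + \
             F.mse_loss(D(clean_mag, enhanced_mag.detach()), pesq_normalized)

    # Generator: rewarded when the discriminator rates its output as perfect,
    # providing a differentiable surrogate of the non-differentiable PESQ metric.
    g_metric_loss = F.mse_loss(D(clean_mag, enhanced_mag),
                               torch.ones_like(pesq_normalized))
    return d_loss, g_metric_loss
```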
3.1.3. U-Net Architecture
Originally developed for biomedical image segmentation, the U-Net architecture has been adapted for spectrogram-based audio tasks owing to its efficient symmetric encoder–decoder design. A contracting path captures increasingly abstract spectral context, while an expanding path recovers spatial (time–frequency) resolution, as depicted in Figure 4.
In the contracting path, pairs of convolutions with ReLU activations are followed by max-pooling steps that halve the spatial dimensions while doubling the channel count. The expanding path mirrors this structure with transposed convolutions and skip-connection concatenations, enabling the network to reconstruct clean spectrograms that retain the fine spectral detail needed for high-fidelity speech recovery [27].
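As a minimal sketch, one contracting step can be written as below; the kernel sizes and channel counts shown are placeholders, not necessarily those of the trained model.

```python
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two convolutions with ReLU, as used along the contracting path."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

# One contracting step: the conv pair raises the channel count, and max-pooling
# halves both the time and frequency dimensions of the spectrogram.
encoder_step = nn.Sequential(double_conv(1, 64), nn.MaxPool2d(kernel_size=2))
```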
3.2. Data Preprocessing
Preprocessing steps were taken for each of the three models—Wave-U-Net, CMGAN, and U-Net—before training and evaluation. Different preprocessing approaches were required due to variations in dataset formats and model requirements.
It is important to note that all evaluation in this study targets speech signals, not music. Although Wave-U-Net was originally developed for music source separation using the MUSDB18-HQ dataset, it is applied here exclusively to speech denoising tasks. All three evaluation datasets (SpEAR, VPQAD, and Clarkson) contain speech recordings corrupted by environmental noise (e.g., Gaussian noise, cafeteria noise, classroom ambient sounds), and the models are assessed solely on their ability to enhance speech quality and preserve speaker identity.
3.2.1. Wave-U-Net
Training Preprocessing: The MUSDB18-HQ dataset used for training has a sampling rate of 44.1 kHz, which matches the input requirement of Wave-U-Net. However, the noise signals added from the DEMAND dataset were originally recorded at 48 kHz. Therefore, all tracks from the DEMAND dataset were down-sampled to 44.1 kHz.
Enhancement Preprocessing: The Clarkson and VPQAD datasets were both recorded at 44.1 kHz, whereas the SpEAR dataset has a sampling rate of 16 kHz. The SpEAR data were therefore up-sampled to 44.1 kHz to be compatible with the trained Wave-U-Net model during evaluation.
3.2.2. CMGAN
Training Preprocessing: The VCTK Corpus dataset, initially in .flac format and sampled at 48 kHz, was converted to .wav format. To match the target sampling rate of the CMGAN model, the clean speech files were down-sampled to 16 kHz. The DEMAND dataset was similarly down-sampled to 16 kHz and mixed with the VCTK dataset to generate noisy speech samples for training.
Enhancement Preprocessing: CMGAN checkpoints were configured for 16 kHz audio, necessitating down-sampling of Clarkson and VPQAD datasets to 16 kHz before evaluation. The SpEAR dataset was already at 16 kHz, so it was directly compatible with CMGAN for enhancement.
3.2.3. U-Net
Training Preprocessing: The LibriSpeech Corpus dataset, sampled at 16 kHz, was originally in .flac format and converted to .wav format to be compatible with the U-Net model. The ESC-50 dataset, recorded at 44.1 kHz, was down-sampled to 16 kHz and mixed with the LibriSpeech dataset.
Enhancement Preprocessing: Since U-Net checkpoints were compatible only with 16 kHz audio, Clarkson and VPQAD datasets were down-sampled to 16 kHz for evaluation. The SpEAR dataset, at 16 kHz, was compatible without additional preprocessing.
Wave-U-Net follows Stoller et al. [3], whose original MUSDB18-HQ training data are sampled at 44.1 kHz; keeping this rate preserves the network's time-domain receptive field and avoids aliasing from down-sampling. CMGAN and U-Net, which employ STFT front-ends, are trained at the standard 16 kHz speech rate. After enhancement, all outputs are resampled to 16 kHz so that PESQ, SNR, and VeriSpeak scores remain directly comparable across models. We acknowledge that resampling operations can alter the spectral content of audio signals. Up-sampling from 16 kHz to 44.1 kHz (e.g., SpEAR with Wave-U-Net) introduces interpolated frequency content above 8 kHz that was absent in the original recording, while down-sampling from 44.1 kHz to 16 kHz (e.g., VPQAD and Clarkson with CMGAN and U-Net) removes frequency information above 8 kHz. These transformations may influence model performance, particularly for speaker-specific features residing in higher frequency bands. To mitigate evaluation bias, all enhanced outputs were resampled to a common 16 kHz rate before computing PESQ, SNR, and VeriSpeak scores, ensuring a consistent comparison baseline across all models.
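As an illustrative sketch (file names are hypothetical), the resampling steps described above can be implemented with librosa and soundfile:

```python
import librosa
import soundfile as sf

def resample_to(path_in, path_out, target_sr):
    """Load a recording at its native rate and write it at `target_sr` Hz."""
    audio, sr = librosa.load(path_in, sr=None)   # keep native rate on load
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    sf.write(path_out, audio, target_sr)

# Example: SpEAR (16 kHz) up-sampled for Wave-U-Net, then the enhanced
# output brought back to 16 kHz before PESQ/SNR/VeriSpeak scoring.
resample_to("spear_noisy.wav", "spear_44k.wav", 44_100)
resample_to("enhanced_44k.wav", "enhanced_16k.wav", 16_000)
```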
3.3. Training Setup
3.3.1. Training Environment
Training was carried out on a workstation equipped with an NVIDIA GeForce RTX 4080 GPU (16 GB VRAM, Compute Capability 8.9; NVIDIA Corporation, Santa Clara, CA, USA), 64 GB of system RAM, and an AMD (Advanced Micro Devices, Inc., Santa Clara, CA, USA) processor running Ubuntu 22.04. Isolated Python 3.10 virtual environments ensured dependency separation across models. Post-processing and metric computation used MATLAB 2022b (MathWorks, Natick, MA, USA).
3.3.2. Training Parameters
The training parameters varied across the three models—Wave-U-Net, CMGAN, and U-Net—according to their architectural requirements and dataset properties. Hyper-parameters for each model were initialized from the configurations published in the respective authors' repositories and largely retained, since our study uses the same corpora on which those models were originally trained; keeping the original setups allows fair comparison, with small adjustments such as periodic checkpointing to capture the best validation snapshot for our modified datasets. We augmented each training set with DEMAND noise (including common noisy files drawn from the CMGAN dataset) to increase acoustic diversity while preserving the native data format and structure. The per-model configurations are detailed below.
The Wave-U-Net model was trained using the Adam optimizer with an initial learning rate of 0.0001. A cyclic learning rate schedule was employed, oscillating between a lower and an upper bound around this base rate to help the model avoid local minima. The loss functions used were mean squared error (MSE) and L1 loss, which aided in preserving sharp transients in the audio signal. Training was conducted with a batch size of 16. Additionally, data augmentation was applied, involving random amplification between 0.7 and 1.0, to enhance the model's generalization capabilities.
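A hedged PyTorch sketch of this setup is shown below; the model is a trivial stand-in and the cyclic-schedule bounds are placeholders, since the exact range is not reproduced here.

```python
import torch
import torch.nn as nn

# Stand-in module; the full Wave-U-Net would be used in practice.
model = nn.Conv1d(1, 1, kernel_size=15, padding=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Cyclic schedule oscillating around the base rate (bounds are placeholders).
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=1e-4, step_size_up=2000, cycle_momentum=False)
mse, l1 = nn.MSELoss(), nn.L1Loss()

def training_step(noisy, clean):
    """One optimization step on a (noisy, clean) waveform batch of shape (B, 1, T)."""
    optimizer.zero_grad()
    enhanced = model(noisy)
    loss = mse(enhanced, clean) + l1(enhanced, clean)  # L1 term helps keep transients sharp
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```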
The CMGAN model was trained using the AdamW optimizer for both the generator and discriminator, each with its own initial learning rate. A cyclic learning rate scheduler was employed, with decay by a factor of 0.5 every 12 epochs. The batch size was set to 4, ensuring efficient training while maintaining computational feasibility.
The training loss consisted of a combination of time-domain, time–frequency, and adversarial terms, balanced by weight factors,

$\mathcal{L}_G = \lambda_{\mathrm{t}}\,\mathcal{L}_{\mathrm{time}} + \lambda_{\mathrm{f}}\,\mathcal{L}_{\mathrm{TF}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}},$

where $\lambda_{\mathrm{t}}$, $\lambda_{\mathrm{f}}$, and $\lambda_{\mathrm{adv}}$ denote the weights of the time-domain, frequency-domain, and adversarial losses, respectively.
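A sketch of how such a weighted generator loss could be assembled is shown below; the weight defaults and the loss choices for the individual terms are placeholders, not the exact settings used.

```python
import torch
import torch.nn.functional as F

def generator_loss(enh_wave, clean_wave, enh_spec, clean_spec, d_score,
                   w_time=1.0, w_freq=1.0, w_adv=1.0):
    """Weighted combination of time-domain, TF-domain, and adversarial terms
    (weights are illustrative placeholders)."""
    l_time = F.l1_loss(enh_wave, clean_wave)                 # waveform fidelity
    l_freq = F.mse_loss(enh_spec, clean_spec)                # spectrogram fidelity
    l_adv  = F.mse_loss(d_score, torch.ones_like(d_score))   # push discriminator score to 1
    return w_time * l_time + w_freq * l_freq + w_adv * l_adv
```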
The U-Net model was trained using the Adam optimizer with an initial learning rate of 0.0002. A learning rate scheduler was applied to reduce the learning rate by half after every 20 epochs to stabilize training and improve convergence. The model used a batch size of 16 for efficient training, and the L1 loss function was used to minimize the difference between the predicted and clean speech signals. Data augmentation was performed through random noise addition and pitch shifts to improve the model's robustness to different audio conditions [6].
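The corresponding optimizer, scheduler, and augmentation setup could look as follows; the stand-in model, noise level, and pitch-shift range are illustrative assumptions, not the exact values used.

```python
import numpy as np
import torch
import librosa

# Stand-in module; the full spectrogram U-Net would be used in practice.
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Halve the learning rate every 20 epochs (scheduler stepped once per epoch).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = torch.nn.L1Loss()

def augment(clean, sr=16000, rng=np.random.default_rng()):
    """Random noise addition and pitch shift for robustness (levels illustrative)."""
    noisy = clean + 0.005 * rng.standard_normal(len(clean))
    return librosa.effects.pitch_shift(noisy, sr=sr, n_steps=float(rng.uniform(-2, 2)))
```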
3.3.3. Experimental Setup
For Wave-U-Net, the MUSDB18-HQ dataset consists of two folders: a training set ("train") of 100 songs and a test set ("test") of 50 songs. We used 120 folders for training and 30 for validation, ensuring an 80-20 split for model evaluation. In the "others" stem of the training folders, 100 folders were randomly replaced with noise from the DEMAND dataset to simulate more challenging training conditions. The training process ran for 68 epochs, and the model checkpoint with the lowest validation loss (0.0248) was selected as the final model. Early stopping was employed, halting training if there was no improvement in validation loss for 20 consecutive epochs. Validation loss was monitored throughout training using TensorBoard, and a graph of validation loss vs. epoch is presented in the results section to illustrate model convergence.
For CMGAN, the experiment involved mixing the Voice Bank Corpus with noise from the DEMAND dataset at SNRs varying between 10 and 15 dB. A total of 2000 audio tracks were generated by mixing clean audio with random noise segments from the DEMAND dataset; these were split into 1600 for training and 400 for validation to maintain an 80-20 split. Training ran for 120 epochs, and the best checkpoint by validation loss, found at the 34th epoch, was used for audio enhancement.
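A minimal sketch of mixing clean speech with noise at a randomly drawn SNR in the 10-15 dB range (array loading omitted; variable names hypothetical):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix clean speech with noise at a target SNR in dB."""
    noise = np.resize(noise, clean.shape)          # loop or trim the noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
snr = rng.uniform(10, 15)                          # random SNR within the 10-15 dB range
# noisy = mix_at_snr(clean_audio, demand_noise, snr)   # arrays loaded beforehand
```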
For U-Net, the experiment used 2000 files from the LibriSpeech Corpus as clean speech and a combination of 288 files from the DEMAND dataset and 1712 files from the ESC-50 dataset as noise sources. The files were divided into 1600 for training and 400 for validation for each dataset, maintaining an 80-20 split. Training ran for 100 epochs, with the model checkpoint saved based on the lowest validation loss. After validation, the best model was selected, providing optimal performance in enhancing noisy speech. Evaluation included standard objective metrics, PESQ and SNR, to assess the effectiveness of the enhanced audio.
3.4. Evaluation Metrics
3.4.1. Signal-to-Noise Ratio (SNR)
SNR expresses the ratio of desired-signal power to residual-noise power in decibels; a higher value indicates more effective noise removal [28]:

$\mathrm{SNR} = 10\log_{10}\!\left(\frac{P_{s}}{P_{n}}\right),$

where $P_{s}$ and $P_{n}$ denote signal and noise power, respectively. Our implementation estimates noise power from the first 10% of the unfiltered waveform (assumed to contain predominantly background noise) and then computes frame-wise SNR via the STFT with a 32 ms window (512 samples at 16 kHz) and 50% overlap [29,30].
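The following sketch approximates this procedure with librosa; the frame-aggregation details are an assumption on our part and may differ from the repository implementation.

```python
import numpy as np
import librosa

def framewise_snr(noisy, enhanced, sr=16000):
    """Noise power from the first 10% of the unfiltered waveform, then frame-wise
    SNR via a 32 ms / 50%-overlap STFT (512 samples at 16 kHz)."""
    n_fft, hop = 512, 256
    noise_ref = noisy[: max(len(noisy) // 10, n_fft)]
    p_noise = np.mean(np.abs(librosa.stft(noise_ref, n_fft=n_fft, hop_length=hop)) ** 2)
    frames = np.abs(librosa.stft(enhanced, n_fft=n_fft, hop_length=hop)) ** 2
    p_signal = frames.mean(axis=0)                 # per-frame signal power
    snr_frames = 10.0 * np.log10((p_signal + 1e-12) / (p_noise + 1e-12))
    return float(np.mean(snr_frames))
```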
3.4.2. Perceptual Evaluation of Speech Quality (PESQ)
PESQ (ITU-T P.862) models the human auditory response to quantify perceived speech quality on a scale from −0.5 to 4.5 [31,32]. We used the open-source Python wrapper [33], supplying the noisy recording as the reference and the enhanced output as the test signal. Higher scores reflect less audible distortion and greater naturalness.
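Assuming the cited wrapper is the `pesq` PyPI package, a usage sketch with hypothetical file paths looks like this:

```python
import soundfile as sf
from pesq import pesq   # open-source PESQ wrapper (assumed package; see [33])

ref, sr = sf.read("noisy_input.wav")        # noisy recording used as the reference
deg, _  = sf.read("enhanced_output.wav")    # enhanced output under test; signals time-aligned
assert sr == 16000                           # wideband P.862 mode expects 16 kHz
score = pesq(sr, ref, deg, "wb")             # ITU-T P.862 quality score (higher is better)
print(f"PESQ: {score:.2f}")
```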
3.4.3. VeriSpeak: Speaker Recognition Performance Evaluation
VeriSpeak (Neurotechnology) is a commercial speaker-verification engine that computes a similarity score between an enrolled voice template and a test utterance [34]. We automated enrollment and matching via a MATLAB wrapper, adopting the standard decision threshold of 60 (FAR ≈ 0.01%). An increase in match score after enhancement indicates that speaker-discriminative features have been preserved or recovered; a decrease signals harmful distortion of identity cues.
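Since enrollment and matching run inside the commercial SDK, the sketch below only illustrates the downstream decision logic around the threshold of 60, using hypothetical scores.

```python
# Post-processing of match scores only; enrollment and matching are performed
# through the commercial VeriSpeak SDK (scores below are hypothetical examples).
THRESHOLD = 60  # standard decision threshold (FAR ~ 0.01%)

def verification_change(score_noisy: float, score_enhanced: float) -> dict:
    """Summarize how enhancement shifted a speaker-verification attempt."""
    return {
        "delta": score_enhanced - score_noisy,      # positive = identity cues preserved/recovered
        "accepted_before": score_noisy >= THRESHOLD,
        "accepted_after": score_enhanced >= THRESHOLD,
    }

print(verification_change(score_noisy=52.0, score_enhanced=71.0))
```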
3.4.4. Computational Cost
Table 4 reports the approximate training and inference times for each model on the hardware described in Section 3.3.1.
CMGAN requires the longest training and inference times due to its adversarial training framework and transformer-based attention mechanisms, while U-Net is the most computationally efficient model. All inference times are averaged over the SpEAR dataset files at each model’s native sampling rate.
5. Discussion
5.1. Wave-U-Net
Wave-U-Net exhibits SNR decreases on all three corpora (SpEAR, VPQAD, and Clarkson). Rather than indicating poor denoising, these reductions stem from the model's aggressive isolation of the target speech signal, which lowers the overall signal power envelope. Complementary metrics (PESQ, VeriSpeak) confirm that Wave-U-Net suppresses noise effectively while retaining speech clarity and speaker-discriminative cues.
Perceptual quality improves consistently: mean PESQ reaches 2.91 on SpEAR, 1.19 on VPQAD, and 2.44 on Clarkson. The SpEAR and Clarkson figures indicate substantial gains in intelligibility and naturalness, while the lower VPQAD score reflects the inherently demanding cafeteria and laboratory noise conditions in that corpus. Overall, these results confirm that Wave-U-Net enhances listening quality without the aggressive spectral distortion observed in purely SNR-optimized models.
Biometric fidelity is Wave-U-Net's strongest asset. VeriSpeak match scores rise on all three corpora (SpEAR, VPQAD, and Clarkson), indicating that the model's time-domain processing preserves the spectral fine structure on which speaker embeddings rely. The pronounced gains on VPQAD and Clarkson—corpora with substantial ambient noise—suggest that Wave-U-Net is particularly well suited to biometric pipelines operating in uncontrolled acoustic environments.
5.2. CMGAN
CMGAN's SNR behavior mirrors its design emphasis on perceptual quality over raw noise attenuation: SNR drops on all three corpora (SpEAR, VPQAD, and Clarkson), with the largest reduction on VPQAD, attributable to the highly non-stationary cafeteria and laboratory noise in that corpus. As with Wave-U-Net, these decreases do not signify inadequate denoising; the conformer-based generator redistributes spectral energy to maximize listening comfort rather than wideband SNR, a strategy validated by the model's leading PESQ scores.
CMGAN achieves the highest PESQ values in the comparison: 3.33 (SpEAR), 1.35 (VPQAD), and 2.50 (Clarkson). The adversarial training objective, which penalizes perceptually implausible outputs through its metric discriminator, drives the model toward natural-sounding speech even under severe noise. These figures make CMGAN the strongest candidate for listening-comfort applications such as hearing aids, telecommunications, and broadcast media, where subjective speech quality outweighs raw noise suppression.
VeriSpeak scores improve on SpEAR, VPQAD, and Clarkson alike. Although these gains are smaller than Wave-U-Net's, they confirm that CMGAN's adversarial enhancement does not distort speaker identity, making it viable for dual-purpose (quality + verification) pipelines.
5.3. U-Net
U-Net is the clear SNR leader on all three corpora (SpEAR, VPQAD, and Clarkson). The outsized Clarkson gain reflects the model's ability to strip substantial ambient classroom noise, producing nearly silent backgrounds. These figures position U-Net as the top candidate when maximum noise attenuation is the overriding requirement.
The cost of U-Net’s aggressive noise removal is visible in its PESQ figures: 1.15 (SpEAR), 1.35 (VPQAD), and 1.89 (Clarkson). All values fall below those of CMGAN and Wave-U-Net, indicating that the spectrogram-domain max-pooling and hard masking introduce spectral artifacts that degrade perceived naturalness. This SNR–PESQ trade-off makes U-Net less appropriate for applications in which listening comfort is the primary objective, though its outputs may still be acceptable when subsequent processing (e.g., ASR) is tolerant of mild distortion.
U-Net’s heavy noise attenuation comes at a biometric cost. On SpEAR, VeriSpeak scores show no measurable gain; on VPQAD and Clarkson, results are inconsistent across subjects, with several speakers exhibiting declines in match scores (overall averages are therefore omitted from the table). The spectrogram-domain processing likely smooths the formant transitions and harmonic micro-structure that underpin speaker embeddings, explaining the poor identity-preservation performance. For deployments that require both clean audio and reliable speaker verification, U-Net would need to be paired with a downstream identity-restoration module.
5.4. Overall
Across all three corpora, the models occupy distinct regions of the noise-suppression/perceptual-quality/biometric-fidelity trade-off space. U-Net maximizes SNR but sacrifices naturalness (PESQ ≤ 1.89) and biometric utility. CMGAN maximizes PESQ (up to 3.33) with moderate identity preservation. Wave-U-Net delivers the largest VeriSpeak gains alongside competitive PESQ, making it the strongest all-round candidate when speaker identity matters.
In deployment terms: U-Net suits forensic pre-processing where intelligibility after heavy noise removal is acceptable; CMGAN fits hearing-aid and media workflows that prioritize listening comfort; Wave-U-Net is the natural choice for access-control and voice-banking pipelines that must maintain verification accuracy.
A focused analysis of Wave-U-Net on Clarkson Collections 7 and 8 confirms that enhancement raises both mated scores (146→165, 122→155) and non-mated scores (22→32, 16→24). Crucially, the mated–non-mated gap widens, improving discriminability while keeping all non-mated scores well below the standard VeriSpeak threshold of 60.
5.5. Limitations and Future Directions
Several limitations of this study should be acknowledged. First, all training and evaluation datasets consist exclusively of English speech. Language-specific phonetic, prosodic, and tonal characteristics may influence model effectiveness differently, and the generalizability of our findings to other languages remains unexplored. Cross-lingual evaluation represents an important direction for future research.
Second, the resampling operations required to match each model’s native sampling rate may introduce subtle spectral artifacts, as discussed in
Section 3.2. While we standardized all outputs to 16 kHz before metric computation to ensure fair comparison, future studies could train all models at a common sampling rate to eliminate this potential confound.
Third, the current evaluation uses three datasets of varying size and noise characteristics. Extending the analysis to larger-scale, multilingual datasets with a wider variety of noise types would further strengthen the generalizability of the conclusions. Additionally, evaluating model performance on resource-constrained devices (e.g., edge deployment, real-time applications) warrants further investigation.
Fourth, while the three models selected cover distinct architectural paradigms (time-domain encoder–decoder, GAN-based conformer, and spectrogram-based U-Net), recent advances in Transformer-only and Mamba-based (state-space) architectures for speech enhancement are promising. At the time of this study, these newer models lacked the mature, reproducible implementations necessary for a fair cross-architecture comparison. The biometric-aware evaluation framework introduced here is architecture-agnostic, and extending it to include Transformer, Mamba, and other emerging architectures is a natural and important next step.
6. Conclusions
This paper introduced a biometric-aware benchmark for speech enhancement by jointly evaluating three architecturally diverse models—Wave-U-Net, CMGAN, and U-Net—on three independent, cross-domain test sets that the models never encountered during training. Unlike prior work that relies solely on SNR and PESQ, we incorporated commercial-grade VeriSpeak speaker-verification scores, revealing a three-way trade-off that is invisible to conventional metrics alone: U-Net maximizes noise suppression at the cost of naturalness and identity fidelity; CMGAN delivers the highest perceptual quality; and Wave-U-Net provides the strongest biometric preservation alongside competitive perceptual scores. The Clarkson children’s speech corpus and VPQAD cafeteria recordings posed particularly stringent generalization challenges, confirming that these trade-offs hold under realistic, non-synthetic conditions. Our primary contribution is therefore not a new architecture but a reproducible, architecture-agnostic evaluation protocol that links enhancement quality directly to biometric integrity—a perspective that is critical for forensic, security, and access-control deployments. Among the three evaluation dimensions, the VeriSpeak-based biometric assessment constitutes the most novel contribution, offering direct, actionable guidance for security-sensitive applications. Future work should extend this framework to non-English languages, investigate the effects of language-specific acoustic characteristics, incorporate emerging architectures such as Transformer-only and Mamba-based models, and evaluate real-time performance on resource-constrained hardware.