Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems

Mensah, Samuel Yaw; Zhang, Tao; Mahmud, Nahid AI; Geng, Yanzhang

doi:10.3390/electronics14132643

Open AccessArticle

Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems

by

Samuel Yaw Mensah

^1,*

,

Tao Zhang

²

,

Nahid AI Mahmud

³

and

Yanzhang Geng

²

¹

School of Information Engineering, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China

²

Digital Signal Processing Laboratory, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China

³

School of Electrical & Information Engineering, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(13), 2643; https://doi.org/10.3390/electronics14132643

Submission received: 8 April 2025 / Revised: 9 June 2025 / Accepted: 11 June 2025 / Published: 30 June 2025

Download

Browse Figures

Versions Notes

Abstract

Deep learning has emerged as a powerful technique for speech enhancement, particularly in security systems where audio signals are often degraded by non-stationary noise. Traditional signal processing methods struggle in such conditions, making it difficult to detect critical sounds like gunshots, alarms, and unauthorized speech. This study investigates a hybrid deep learning framework that combines Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs) to enhance speech quality and improve sound classification accuracy in noisy security environments. The proposed model is trained and validated using real-world datasets containing diverse noise distortions, including VoxCeleb for benchmarking speech enhancement and UrbanSound8K and ESC-50 for sound classification. Performance is evaluated using industry-standard metrics such as Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Signal-to-Noise Ratio (SNR). The architecture includes multi-layered neural networks, residual connections, and dropout regularization to ensure robustness and generalizability. Additionally, the paper addresses key challenges in deploying deep learning models for security applications, such as computational complexity, latency, and vulnerability to adversarial attacks. Experimental results demonstrate that the proposed DNN + GAN-based approach significantly improves speech intelligibility and classification performance in high-interference scenarios, offering a scalable solution for enhancing the reliability of audio-based security systems.

Keywords:

speech enhancement; deep learning; security systems; GANs; CNNs

1. Introduction

1.1. Speech Enhancement in Security Systems

Speech enhancement is crucial in security systems, where it improves the clarity of audio used in surveillance and emergency situations. Non-stationary noise, such as traffic, gunshots, and alarms, often degrades speech quality. The Signal-to-Noise Ratio directly impacts speech intelligibility, with low SNR hindering recognition [1] Deep learning techniques like DNNs and GANs (Deep Neural Networks and Generative Adversarial Networks) have shown significant improvements in enhancing speech in noisy environments, providing clearer audio for better decision-making in security systems. The performance of speech enhancement is highly dependent on the Signal-to-Noise Ratio, which is the ratio of speech to noise. Mathematically as seen in Equation (1), SNR is expressed as follows:

S N R = 10 l o g 10 (\frac{P s i g n a l}{P n o i s e}) (d B)

(1)

Psignal and Pnoise are powers of the speech signal and noise, respectively. This is because SNR can be as low as 0 dBin noisy security environments, indicating poor speech intelligibility. According to the research, SNR below 5 dB is quite detrimental to human speech recognition, while deep learning models can classify with more than 70% accuracy in such conditions.

The noise received in security systems is non-stationary because of traffic, alarms, conversations, and wind noise. The noise models used in these environments include Gaussian Mixture Models (GMMs) or Additive White Gaussian Noise (AWGN) as seen in Equation (2) below:

y(t) = x(t) + n(t)

(2)

where y(t) is the observed signal, x(t) is the clean speech, and n(t) is the noise. Its objective is to enhance speech signals by estimating x(t) from y(t) with the help of state-of-the-art techniques like DNNs and GANs.

Due to noise interference, security recordings often experience up to 30% intelligibility loss. Based on statistical data, more than 60% of the audio evidence collected in forensic investigations needs to be enhanced before it can be admissible in a case. Poor audio quality also increases the false alarm rate by 40%; therefore, speech enhancement should be robust.

1.2. Robust Sound Classification

Sound classification is crucial in threat identification, anomaly detection, and real-time security decision-making. Conventional classification methods are not very accurate, especially in cases where the acoustic environment is very complicated, while AI classification models have a much higher accuracy [2]. In the controlled experiments, the human operators can classify the security sounds with an accuracy of about 85% but only 60% of the time in noisy environments. At the same time, as shown in Figure 1, it is possible to achieve over 95% speech and environmental sound classification accuracy with the help of deep learning-based classifiers such as CNNs and RNNs.

Sound classification is one of the most important techniques in the field of audio processing and analysis, and spectrogram analysis is one of the most basic techniques used in it, based on the Short-Time Fourier Transform (STFT). In the discrete-time domain, the noisy speech signal in the time–frequency domain can be expressed using the Short-Time Fourier Transform as follows as seen in Equation (3) below:

X (n, k) = \sum_{m = - \infty}^{\infty} x [m] \cdot w [m - n] \cdot e^{- j 2 π k m / N}

(3)

where

x[m]: the discrete-time input signal;
w[m − n]: the window function centered at frame nnn;
k: frequency bin index;
N: number of FFT points.

This formulation assumes additive noise in the frequency domain and is widely used in speech enhancement tasks. The STFT is computed by applying the Fourier Transform to overlapping frames of the time domain signal, typically using a Hamming or Hann window.

This transformation is useful for filtering out the speech components from the noise, and the deep learning models can then analyze the pattern.

Sound classification is vital for threat detection and decision-making in security systems. Traditional methods struggle in noisy environments, but deep learning models, such as CNNs and RNNs, outperform them with over 95% accuracy. While human operators classify sounds with 85% accuracy in controlled conditions, this drops to 60% in noise. Deep learning-based classifiers can maintain high accuracy even in challenging environments, making them ideal for real-time sound classification in security applications.

Access control: AI-based voice recognition is 99% accurate, compared to 90% for the biometric approach.

Surveillance with the help of AI: models for detecting various anomalies are 96% accurate, which is 15% better than traditional systems.

1.3. Deep Learning in Audio Processing

Deep learning has significantly changed speech enhancement and classification due to large datasets, feature extraction, and hierarchical learning. CNNs and RNNs are the most popular among neural networks as they provide high accuracy in audio pattern recognition [3]. As in Equation (4), CNNs can extract spectral features effectively through convolution operations:

f (x) = x * v

(4)

where * represents convolution, x is the input, and v is the filter. Sequential dependencies are processed under RNNs due to their specific handling capabilities in speech work.

The RNNs share LSTM networks with other group members for identifying sequential patterns embedded in speech signals [4]. These models address audio temporal relations to distinguish between speech and non-speech audio segments.

Most deep learning models operate in both the time domain and frequency domain to separate interferences that enhance signal clarity. The systems have increased security audio SNR levels from −5 dB to over 15 dB, enabling better speech detection and abnormality recognition capabilities.

1.3.1. Research Questions

This study explores how deep learning techniques can enhance speech signals and improve sound classification in security systems operating under noisy conditions. The key research questions guiding this investigation are as follows:

How effectively do deep learning models improve the Signal-to-Noise Ratio in security audio recordings, thereby enhancing speech intelligibility?
To what extent can artificial intelligence (AI)-based classifiers improve sound classification accuracy in environments characterized by high levels of ambient noise?
How practical and efficient are deep learning-based speech enhancement algorithms when deployed in real-time, operational security systems?

These questions aim to evaluate the performance, reliability, and real-world applicability of AI-driven audio processing methods in modern security infrastructures.

1.3.2. Research Objectives

The primary objectives of this research are as follows:

To evaluate the impact of deep learning models on enhancing SNR in noisy audio recordings relevant to security applications.
To assess the classification accuracy of deep learning-based sound recognition systems under varying noise conditions.
To investigate the computational efficiency, real-time performance, and deployment feasibility of AI-based speech enhancement frameworks in operational security environments.

2. Literature Review

Security systems require speech enhancement because they enhance speech clarity and sound quality in loud environments. The Wiener and Kalman filters represent traditional methods used for eliminating noise interference. The Wiener filter applies mean square error estimation to determine clean speech signals yet remains best for stationary noise conditions but struggles during non-stationary noise scenarios [5]. The speech model of Kalman filtering operates as a dynamic system, but the high computing costs make it challenging to use in real-time applications [6]. The conventional methods show typical biases that negatively affect intelligibility [7]. Deep learning emerges as a modern approach to tackling various situations using extensive datasets that strengthen noise resistance abilities [3]. In Figure 2, the security application of deep learning-based speech enhancement and classification has shown higher accuracy and reliability than conventional methods.

2.1. Speech Enhancement Techniques

The conventional techniques used in speech enhancement mainly involve signal processing techniques that remove noise from the speech signal. The most common method is the Wiener filter, which estimates the clean speech signal by minimizing the mean square error between the noisy input and the output. Mathematically, Wiener filtering can be described in Equation (5) below:

S(f) = (ΦXY(f)/ΦXX(f))X(f)

(5)

(f) is the power spectral density of the noisy speech [5]. While Wiener filtering is highly effective in stationary noise conditions, it struggles with non-stationary environments, such as crowded public spaces or security-critical areas where noise characteristics constantly change.

Another commonly used statistical method is Kalman filtering, which models speech as a dynamic system and estimates the clean signal recursively. Kalman filters can track changes in speech signals and, thus, can be used in some specific cases. However, they are computationally intensive as in Figure 3, which means that for each new observation, iterative calculations are needed, and therefore, their implementation in real-time security systems is not possible [6].

Other than these is spectral subtraction, another traditional method of operation in which noise levels during silent frames are estimated and subtracted from the signal to enhance the SNR. As illustrated in Table 1, this method is computationally efficient; it introduces musical noise interference that hampers speech clarity and naturalness [7].

2.2. Traditional vs. Deep Learning Approaches

Conventional methods, though statistically sound, are not flexible enough for different noise environments. Deep learning solves these problems through large data and feature learning. The empirical analysis reveals that the models based on deep learning are more effective than the conventional approaches in both speech enhancement and classification.

For example, [1] in their study on Wiener filtering and Deep Neural Networks, and the results depicted that with the use of DNNs, the PESQ scores were enhanced from 2.1 to 3.8 and the STOI from 0.65 to 0.91 in security audio datasets. Table 2 also presents the findings of the study regarding the comparison of various methods:

2.3. Deep Learning Architectures in Audio Processing

Deep learning approaches have brought a dramatic change in the field of speech enhancement and classification by learning the features at different levels from the raw waveform data. Deep learning methods are more robust regarding real-world applications by learning features by themselves than traditional methods of manually designing the features. Among these, CNNs are used for feature extraction based on the convolutional filters to process the spectrogram representation of the speech signal. The convolutional operation can be defined as a process of passing a function through a filter to obtain a convolution of the function with the filter, as seen in Equation (6) below:

y i j = \sum w m n x (i - m) (j - n)

(6)

where yij represents the output feature map, wmnw are filter weights, and x(i − m)(j − n) is the input audio spectrogram [3]. CNNs effectively capture local spatial features in speech signals but struggle with long-range dependencies, making them insufficient for handling speech sequences with complex temporal variations.

RNNs and LSTM networks do not have this problem because they can learn sequential dependencies. LSTMs have memory cells and gating mechanisms that help avoid the vanishing gradient problem; thus, the long-term dependencies are retained. As for the hidden state update, it can be represented as seen in Equation (7) below:

ht = σ(Whht – 1 + Wxxt + b)

(7)

where ht is the hidden state, Wh and Wx are weight matrices, and xt is the input at the time step. LSTMs enhance speech by maintaining the context’s temporal dependencies for longer periods [9]. Yet, they have limitations in terms of computation and hence cannot be employed in real-time applications such as security systems, which require quick response.

The transformer architectures have been identified as a superior choice as they use self-attention mechanisms to capture long-range dependencies better. The self-attention mechanism is defined as seen in Equation (8):

QKTV

(8)

Q, K, and V are the query, key, and value matrices. Transformers are better than CNNs and RNNs, especially in noisy environments, because they allow for parallel computation and capture long-range dependencies [10]. However, transformers are computationally intensive, so real-time deployment on edge devices can be demanding. Moreover, their dependence on large datasets for pretraining is questionable ethically for data privacy and security applications [11].

Although deep learning-based speech enhancement methods outperform traditional approaches in many ways, they also have drawbacks. The transformer and LSTM models are computationally intensive and, hence, not suitable for real-time security applications; therefore, there is a need for further study on the development of efficient and lightweight models. Furthermore, adversarial attacks on AI-based speech enhancement systems can severely reduce performance and threaten security applications [12]. Privacy issues in deep learning models, particularly surveillance, raise ethical and legal concerns [13].

Deep learning has provided good results in speech enhancement and classification. Thus, future work should focus on developing real-time and efficient models for security environments with various types of noise. A combination of statistical signal processing and deep learning may reasonably compromise speed and accuracy as shown in Figure 4 [14].

2.4. Challenges in Security Applications

Nonetheless, current deep learning-based speech enhancement models have some issues in security applications:

Time: Real-time processing entails the use of efficient models. However, the models known as transformers are precise but require significant computational power, which makes it challenging to implement them on edge devices [11].
Adversarial Vulnerabilities: The AI models are vulnerable to the so-called adversarial perturbations, which are slight and indiscernible changes in input data that the model has to process. It was established that an attack with a 0.01 SNR perturbation dropped the classification from 95 percent to 50 percent [12].
Privacy and Security Implications: The use of AI in security applications has the potential to pose some level of privacy threat. Encrypted model deployment and federated learning are possible solutions but are still underdeveloped [13].
Noise Variation: Security environments present various types of noise, such as car horns and people’s conversations. The main challenge in the past was achieving model generalization in different scenarios [14].

2.5. Key Studies and Research Gaps

Some past works have used deep learning for speech enhancement, but gaps still need to be filled. Ref. [15] tested a transformer-based model, and it was reported that the model had a PESQ of 3.9 in controlled environments. Still, the model was not as effective in real-world security applications because of the variability and unpredictability of noise. However, some datasets like VoxCeleb and AudioSet do not cover the necessary variability and do not contain enough specific information for security purposes. Some security systems work in conditions of high noise and variability, for example, in the video surveillance of urban space or during emergencies [10]. The current deep learning models, especially the transformer-based models, consume much computational power, which is not ideal for real-time security. Although CNNs and LSTMs are more efficient, they have limitations regarding the capability to address different forms of noise [9]. This is because it is essential to build models that are light enough to be deployed in real-time on edge devices [11].

This means that AI-driven speech enhancement systems are also at risk of adversarial attacks when, with almost no noticeable interference, the noise can adjust in a way that significantly affects the results. According to a study by [12], a minimum of 0.01 SNR can reduce the accuracy of classification from 95% to 50%, and thus, there is a need to implement countermeasures. Not much work is done on defense against adversarial attacks, including adversarial training or noise-aware learning. Further, using the deep learning model in security also presents several ethical issues, such as privacy and surveillance. Some potential negative impacts have been given, emphasizing the effects of mass surveillance and misuse of AI systems [13]. The legal frameworks must be followed, and the AI models must be transparent to protect individuals’ privacy rights in the case of ethical AI implementation in security applications.

All current deep learning models are trained with specific datasets and do not perform well under different noise conditions. Security applications need models that can be built for various environments, such as noisy streets, industrial areas, etc. It is noted that transfer learning and domain adaptation techniques are promising but have not been extensively studied in speech enhancement [14]. Even though deep learning has outperformed traditional statistical methods in many ways, combining both could be beneficial and more efficient. Integrating statistical signal processing methods like the Kalman filter with deep learning architectures can enhance the former’s robustness and decrease computational complexity [6].

Further research must be conducted to investigate such hybrid frameworks for security purposes. These challenges point to the fact that there is still a long way to go in deep learning for speech enhancement to be deployed in real-time, robust, and ethical security applications. Further research should enhance adversarial defense robustness, reduce the computational cost of deep learning, and develop safe and accountable AI-driven security systems.

Justification for Using DNN + GAN in Real-Time Security Applications

While recent advances in speech enhancement research have introduced transformer-based architectures such as SE-Conformers and Dual-Path Recurrent Neural Networks (DPRNNs), these models often carry significant computational overhead, making them less suitable for edge-device deployment and real-time processing in constrained environments. In contrast, our approach adopts a hybrid DNN + GAN framework that strategically balances enhancement accuracy with computational efficiency.

The DNN serves as a low-latency feature extractor, while the GAN architecture helps generate realistic, noise-suppressed speech by learning from adversarial feedback. This design supports practical implementation in IoT-enabled and embedded security systems, where response time and resource usage are critical factors. Although this architecture may not represent the most recent trend in high-resource scenarios, its adaptability to real-time, low-power environments fills an important niche that is often overlooked in the contemporary literature.

Nonetheless, we acknowledge the evolution of the field and recognize that recent architectures—such as SE-Conformers [16] and DPRNNs [17]—have achieved state-of-the-art performance in large-scale tasks. We have added this discussion to ensure contextual alignment with cutting-edge methodologies.

3. Methodology

3.1. Research Approach

The research work consists exclusively of secondary data collection since there is no need for primary research instruments to generate new data. Deep learning-based speech enhancement and classification require vast data to accurately represent real noise conditions. The research uses selected publicly available data as its basis for training and testing the model [18]. The model receives its foundation from speech samples obtained in controlled and non-controlled environments that comprise these datasets. A quantitative research design supports the statistical evaluation of models under noise environments through different assessment methods.

Statistical validation methods approve the accuracy and reliability of speech enhancement models used in this research. The model validation technique known as k-fold helps researchers determine how well the models function for general usage. The research team uses t-tests together with ANOVA statistical tests to evaluate the important enhancements achieved among different model architectures as illustrated in Figure 5 [19]. This document follows different machine learning approaches yet maintains uniform methods to assess the implemented models. Due to this reason, this work evaluates deep learning architectures in security tasks by studying their generalization properties under new noise conditions.

3.2. Dataset Selection

Data selection determines the fundamental success of deep learning models in speech enhancement tasks. Relevant datasets employed in the study are highly recognized databases: VoxCeleb, TIMIT, and LibriSpeech [20]. Speech datasets incorporate different vocal information specimens, including speakers’ identity and accents, and recording setting information. The real-world noise conditions of YouTube videos become present within the speech samples provided in VoxCeleb [21]. TIMIT is a database that excels at phonetic investigations due to its rich phonetic content, manual transcriptions, and recorded speech. Audiobook-based LibriSpeech delivers large amounts of speech data and extensive annotations to facilitate training sessions.

Noise interference is an important concern in security systems, where noise usually interferes with speech signals. The study divides noise into two groups: low SNR (0–10 dB) and high SNR (20–30 dB). This classification assists in emulating actual security scenarios like the recorded audio in surveillance, emergency announcements, and police work. Some of the noise sources considered in this research include background conversation, traffic noise, hum from operating machinery, and electrical interference, which are some common interferences that are likely to be encountered in security recordings.

Data preprocessing is very important in preparing data to be suitable for developing a model. MFCCS and the Fourier Transform are used to preprocess the raw audio data and extract its features that can be used to feed deep learning models [22]. MFCCS preserve spectral properties of speech, and therefore, they are suitable for noise reduction and feature extraction. The Fourier Transform, especially the STFT, is used to analyze an audio signal’s time–frequency components.

To compute the log-magnitude spectrogram, the discrete Short-Time Fourier Transform (STFT) of the noisy input signal is applied. The formulation is shown in Equation (9) below:

LogMag (n, k) = \log (∣ X (n, k) ∣^{2} + ϵ)

(9)

where

X(n,k) is the STFT of the input signal x[m], defined using frame index n and frequency bin k;
∣X(n,k)∣² represents the power spectrum;
ϵ is a small positive constant added for numerical stability to avoid logarithm of zero.

3.3. Data Selection

Deep learning models achieve their highest success in speech enhancement tasks through proper datasets with high quality and wide diversity in training. The research utilizes well-known evaluation datasets, which include the following:

▪: The VoxCeleb dataset comprises extensive celebrity speech material from YouTube video recordings. This database contains authentic noisy conditions, such as room echo effects, ambient dialogues, and microphone signal discrepancies.
▪: The high-quality TIMIT dataset shows exceptional value when performing phonetic investigations and model validation because it contains many phonetic varieties [23].
▪: The LibriSpeech dataset represents audiobook transcriptions containing varied speaker recordings while providing sophisticated transcription detail for generalization across diverse speech characteristics.

Multiple datasets, diverse speaker demographics, multiple accents, and recording environments are the solid bases for effectively training deep learning models.

3.4. Noise Contamination and Categorisation

The quality and recognition of speech signals deteriorate when they experience distortions in security environments. Realistic noise simulation during training models is essential for developing resilient deep learning systems that improve audio quality. The Signal-to-Noise Ratio (SNR) is the main element influencing speech quality when processing noisy signals in environments since it measures how noise strength relates to spoken sound levels [24].

The strong impact of noise during SNR ranges from 0 to 10 dB, rendering speech signals nearly silent to the human ear. Several security and public safety communication systems record sounds that become challenging to understand due to excessive background noise. Sound environments with medium SNR levels (10–20 dB) contain noticeable speech but see significant noise distortion from the background surrounding noise. High SNR environments (20–30 dB) provide predominantly clear speech because they exist in controlled forensic audio recordings and enhanced surveillance setups that minimize background interferences.

Security-oriented research examines numerous typical noise elements that disrupt security procedures, including human conversations, traffic sounds, equipment vibrations, and electromagnetic disturbances. Background chatter consists of several voices in crowded emergency response areas [25]. Traffic noises from engines, horns, and sirens frequently affect recordings from urban areas and law enforcement duties. Factories, secure facilities, and data centers face a major problem from machinery hum because of their industrial equipment and HVAC systems. The use of electronic devices and their naturally produced interference affects both radio transmissions and security monitoring communications. This research includes real-world noise conditions to optimize deep learning models for security applications, enhancing speech quality in demanding noise environments.

3.4.1. Dataset Preparation

To ensure the robustness and real-world applicability of the proposed speech enhancement model, two publicly available environmental sound datasets were selected: UrbanSound8K and ESC-50. These datasets provide diverse audio samples that simulate common noise scenarios found in security-critical environments, including traffic, crowd chatter, construction machinery, and metro station announcements.

Selection and Justification

UrbanSound8K contains 8732 labeled sound excerpts across 10 categories, including sirens, engine idling, and street music—representative of urban surveillance audio.
ESC-50 includes 2000 audio clips organized into 50 semantic classes, including industrial and public domain noise types such as keyboard typing, HVAC noise, and metro intercoms.

Preprocessing Pipeline

The standardization of input features across both datasets was as follows:

Resampling was applied to unify all audio clips at 16 kHz, ensuring consistency with typical microphone capture rates in embedded systems.
Amplitude normalization was used to scale the audio signals to a uniform dynamic range.
Noise augmentation was performed by mixing clean speech samples with varying noise levels (SNR ranging from −5 dB to 20 dB) to simulate realistic audio degradation.

Segmentation and Framing

Each audio clip was segmented into overlapping frames of 25 ms with a 10 ms shift, resulting in consistent frame-level inputs for the deep learning models. This approach allows fine-grained temporal analysis, suitable for both speech enhancement and downstream classification tasks.

Labeling Protocol

Ground truth labels were inherited from the original datasets for environmental sound classification. In enhanced experiments, clean speech labels were also introduced post-denoising to facilitate the supervised evaluation of speech intelligibility and classification accuracy. Label integrity was verified manually for a subset of test samples to ensure annotation consistency following augmentation.

This detailed preparation process enhances the reproducibility of our study and ensures that the evaluation conditions reflect real-world scenarios encountered in security systems.

3.5. Model Architecture for Speech Enhancement

Speech quality improvement is vital, especially requiring powerful deep learning models to control speech in a noisy environment. DNNs, GANs, and attention-based models are various architectures explored in this study; each has its specific advantages.

The basis for speech enhancement based on DNNs lies in learning complex speech features using multiple hidden layers. We can formulate a typical DNN as seen in Equation (10) below:

h (l) = f (W (l) h (l - 1) + b (l))

(10)

It includes three parts: W(l) is the weight matrix, and b(l) is the bias term and activation function f(⋅). ReLU: Rectified Linear Unit is a common approach to introducing non-linearity to better converge to optimal solutions.

With the help of the generator network, GANs can clean the noisy speech inputs into output sounds. In particular, the architecture of the GAN structure comprises two neural networks: the generator neural network (G) synthesizes realistic speech, and the discriminator neural network (D) categorizes the original and generated speech. The training objectives of GANs are described in Equation (11):

m i n D m a x E x \sim p d a t a [l o g D (x)] + E z \sim p z [l o g (1 - D (G (z)))]

(11)

The GAN system incorporates two components, G and D, that enhance speech in noisy conditions and authenticate each output. The training process with adversarial mechanisms causes GANs to develop better speech signals while decreasing the error rate.

Transformer-based models in speech enhancement systems produce superior results by extracting relationships across diverse audio sequence lengths. Through self-attention mechanisms, models learn to focus on essential speech elements within noisy conditions instead of being diverted by needless noise elements. As in Equation (12), the attention mechanism represents the following:

A t t e n t i o n (Q, K, V) = s o f t m a x (Q K^{T} / d k) V

(12)

The Q, K, and V matrices represent key and value elements. Attention-based models surpass fundamental Convolutional Neural Networks and Recurrent Neural Networks in context understanding, thus excelling in noisy circumstances [26].

3.6. DNN + GAN Model Architecture

A hybrid architecture combining Deep Neural Networks and Generative Adversarial Networks for speech enhancement in noisy environments. The integration of these models leverages the strengths of both, where DNNs are used for feature extraction and learning complex patterns, while GANs improve speech signal generation and discrimination. Below are the specific details for the DNN + GAN hybrid model:

Number of Layers: The architecture includes a DNN with eight layers (five fully connected layers and three hidden layers), followed by a GAN model consisting of a generator and discriminator, each with six layers.
Filter Sizes: Convolutional layers in the hybrid model use filter sizes of 3 × 3 and 5 × 5 for feature extraction and signal enhancement.
Residual Connections: Residual connections are used in both the DNN and GAN components, especially in the generator, to improve gradient flow and feature propagation, ensuring better training stability.
Activation Functions: The DNN uses ReLU activation functions in all hidden layers. The GAN generator employs LeakyReLU for hidden layers, with a Sigmoid activation in the output layer for generating real-valued speech. The discriminator also uses LeakyReLU for hidden layers and Sigmoid for classification.
Training Parameters: The combined DNN + GAN model is trained with the Adam optimizer, using a learning rate of 0.001 for the DNN and 0.0002 for the GAN, with a batch size of 64 for the DNN and 32 for the GAN.

3.7. Training and Validation Strategies

The study divides its data into sections for optimal generalization by using 80% for training and 10% for validation and testing. Through the use of backpropagation together with stochastic gradient descent (SGD), the models perform parameter updates to reach minimal loss functions [27].

Model optimization relies on loss functions because they measure the difference between predicted and actual speech signals. The Mean Squared Error (MSE) stands as one of the primary loss functions, which computes between expected and actual speech signals. The Mean Squared Error (MSE) loss function used to train the enhancement model is defined in Equation (13):

L_{{M S E}^{*}} = \frac{1}{N} \sum_{i = 1}^{N} {(x^{^} - x)}^{2}

(13)

where

x^ is the predicted clean speech signal value;
x is the ground truth clean signal;
N is the total number of samples.

This loss encourages the model to minimize the squared differences between the predicted and target speech signals across all frames and frequency bins.

Evaluation Metrics for Speech Enhancement

Subjective measurements evaluate the performance of suggested models in speech enhancement systems. Speech quality measurements from perceptual models are computed through the widely used quantitative measure, Perceptual Evaluation of Speech Quality (PESQ) [28]. As in Equation (14), the PESQ score-calculating process includes the following:

PESQ = a_{1} \cdot f_{1} + a_{2} \cdot f_{2}

(14)

where

$f_{1}$ and $f_{2}$ represent the average symmetric and asymmetric disturbance values extracted from the perceptual model;
$a_{1}$ and $a_{2}$ are weighting coefficients defined by the PESQ standard.

Higher PESQ scores indicate better-perceived speech quality. The metric effectively captures distortions such as clipping, background noise, and reverberation, making it especially useful for evaluating speech enhancement models in noisy environments.

Research suggests that noise suppression effectiveness should use SNR improvements to measure input and output signals [29]. Model performance evaluation relies on this measurement to determine the maximum environmental noise reduction and preservation of intelligible speech. Security-oriented speech recognition models validate their robustness through accuracy measures because this ensures the improved speech data can function reliably in subsequent operations, including voice authentication and speaker identification [30].

The study conducts performance evaluation through deep learning analysis that competes with established Wiener and Kalman filtering procedures. Performance evaluation of practical security deployments using deep learning models occurs by direct assessment against traditional conventional approaches. It is also important to perform the training, validation, and evaluation of these models, which are selected approaches for enhancing speeches in various environments.

4. Challenges in Implementation

Implementing deep learning-based speech enhancement for sound classification (DLR-SE) in real-world security systems involves multiple layers of complexity. Key challenges include legal and ethical considerations, feasibility constraints, high computational demands, model generalization under variable noise conditions, and security vulnerabilities. Each of these factors influences the reliability, scalability, and ethical deployment of such systems.

4.1. Legal and Ethical Concerns

One of the foremost challenges in deploying DLR-SE systems is ensuring compliance with data privacy regulations, such as the General Data Protection Regulation (GDPR). These regulations mandate anonymization of Personally Identifiable Information (PII), which can hinder the quality of enhanced speech and complicate speech intelligibility. Additionally, deep learning systems often reflect biases related to gender, ethnicity, and accent, leading to discriminatory outputs. Bias evaluation techniques such as equalized odds and disparate impact analysis must be applied to ensure fairness [31]. There is also a growing concern about the unauthorized use of enhanced audio data. Enhanced recordings could be exploited to recover sensitive information, violating privacy norms [32]. To mitigate these risks, encryption protocols, access controls, and regular audits must be enforced.

4.2. Technical Barriers to Deployment

DLR-SE models are computationally intensive and often unsuitable for real-time applications without specialized hardware. Latency, memory constraints, and high computational load are major obstacles. The use of lightweight architectures, model pruning, and hardware accelerators such as TPUs and CUDA-enabled GPUs can help reduce processing delays [16]. Real-time systems must be evaluated based on latency thresholds, typically in the microsecond range, to ensure prompt response in security-critical scenarios.

Neural network optimization through quantization and edge computing can improve deployment feasibility. Parallelization techniques and distributed processing further enhance scalability, although memory overhead remains a concern [33].

4.3. Noise Variability and Generalization Issues

Security environments are characterized by diverse and unpredictable noise sources, from background chatter to industrial hum and electromagnetic interference. Deep learning models must generalize across these conditions, but training data often lacks this variability. Adaptive noise filtering, spectrogram–histogram superimposition, and profile-based subtraction techniques can improve robustness [17].

A balance must be struck between noise suppression and speech integrity. Overaggressive filtering may distort key speech components, reducing classification accuracy. While attention-based models and RNNs offer improvements, they come with increased computational demands [34].

4.4. Computational Cost and Scalability

High model complexity poses significant barriers to scalability. FLOP benchmarks help assess model efficiency, guiding trade-offs between performance and resource consumption [35]. Model compression techniques like precision reduction and network pruning, as well as knowledge distillation, can help retain performance while enabling deployment on low-power devices [36].

However, these strategies may introduce new limitations in model fidelity and introduce synchronization challenges in distributed deployments. Ensuring real-time operation across networked systems requires careful load balancing and optimized communication protocols.

4.5. Security Vulnerabilities in AI Systems

DLR-SE models are vulnerable to adversarial attacks—small perturbations to input data that significantly degrade output accuracy while remaining imperceptible to human listeners. These attacks can disrupt classification, allow intrusion, or manipulate outcomes [37]. Robust security measures must include adversarial training, perturbation detection algorithms, and encrypted inference techniques. Regular vulnerability scanning and secure model deployment protocols are essential to protect AI infrastructure from manipulation [16]. Ethical governance also plays a critical role, as AI decisions in security contexts must be accountable and unbiased [38].

4.6. Integrated Perspective

Successfully implementing DLR-SE in security systems requires multidisciplinary coordination. Technical teams must develop efficient, scalable models that balance noise suppression and speech clarity. Legal advisors must ensure regulatory compliance. Security experts must address adversarial threats and ensure system integrity. While DLR-SE offers promising enhancements to sound classification in complex environments, overcoming these challenges will require ongoing research in model optimization, ethical AI, and robust deployment strategies. By integrating privacy protection, computational efficiency, and adversarial resilience, the next generation of security systems can achieve both ethical soundness and operational excellence.

5. Results and Discussion

A deep learning speech enhancement model evaluation required PESQ, STOI, and SNR improvement metrics to measure its success [39]. As shown in Figure 6, these metrics are used to evaluate model enhancement power because they determine effectiveness levels when clear speech integrates with noise reduction. This noise suppression benchmark model adopted DNNs and GANs instead of Wiener and Kalman filtering methods to achieve its functionality.

The Table 3 below demonstrates comprehensive evaluation results between different model performance indicators.

Professional speech evaluation depends on PESQ scores to measure audio quality from −0.5 to 4.5. When PESQ scores rise, better speech clarity and decreased distortions emerge. The current PESQ score equals 3.85; it is much higher than the Wiener filtering PESQ value at 2.91 and higher than the Kalman filtering PESQ value at 3.12. The good working of the GANs is shown by their good performance in recognizing noise patterns, as seen from the improved clean speech generation.

According to STOI evaluation criteria, the described model achieves better accuracy than Wiener and Kalman filtering, with a coefficient of 0.92, while Wiener and Kalman filtering reach coefficients of 0.78 and 0.83, respectively. The proportion represented by the STOI score improves with optimal values to give a better perception of all spoken words in a noisy environment.

It is concluded that the proposed method can effectively minimize background noise as its SNR enhancement part works effectively. Indeed, deriving from the deep learning model, the SNR performance increases by 12.5 dB, enabling better performance than that of the Wiener filter by 8.2 dB and the Kalman filter by 9.4 dB. When noise suppression takes place at a higher level of SNR, the speech signal remains clear.

To further evaluate the model’s performance in distinguishing between clean and noisy speech samples, Receiver Operating Characteristic (ROC) curves were plotted for each method [40]. The ROC curve provides a visual representation of the trade-off between true positive rate (TPR) and false positive rate (FPR) across different decision thresholds. As seen in Figure 7, a higher area under the ROC curve (AUC) indicates better classification performance.

Real-time security entails an efficient evaluation process, given that response to threats has to be very fast. All the above processing is conducted on the audio frames in approximately 18.3 ms, while Wiener filtering takes another 27.8 ms, and Kalman filtering takes 25.4 ms to complete its process. At high speed, the model offers the flexibility of working in real-time speech processing, which is demanding when it comes to security.

5.1. Comparison with Traditional Methods

A deep learning speech enhancement model obtained performance judgments over conventional methods by boosting the activity of speech and assessing the levels of processing [39]. This percentage improvement calculation depended on the presented formula:

Improvement (%) = (Traditional Method/Proposed Model − Traditional Method) × 100

Table 4 presents the comparative results, demonstrating the advantages of the deep learning model over Wiener filtering, Kalman filtering, and spectral subtraction.

Every performance metric demonstrated significant enhancement through the obtained results. The evaluation measured PESQ at 32.3% superior to Wiener filtering and delivered 23.4% better results than Kalman filtering and 40.1% above spectral subtraction. The Short-Time Objective Intelligibility measurements in test results showed a substantial increase in speech intelligibility.

5.2. Noise Reduction Effectiveness in Different Environments

The finetuning of the system was conducted at several places in 2009 to assess its efficacy in filtered noises while preserving the audiometric characteristics. Three environmental sites were employed for assessments, such as an actual street, an industrial site, and different parts of an office space, shopping mall, and operational metro station. The evaluation step aimed at comparing the pre- and post-SNR of speeches while implementing the enhancement model.

The proposed DNN-GAN model’s noise reduction effectiveness was evaluated across multiple real-world environments as illustrated in Table 5. Testing utilized mixed noise conditions from the UrbanSound8K and ESC-50 datasets, containing authentic urban and environmental sounds. The model achieved consistent SNR improvements of 9.5–9.8 dB, effectively separating voice signals from challenging noise types, including traffic, machinery, and crowd chatter.

5.3. Speech Intelligibility Gains

The STOI metric was used as the quality measure to evaluate how much the intelligibility of the speaker’s speech has been enhanced by the model. The STOI value is higher during the measurement process for speech clarity since its purpose is to assess how easy it is for the listeners to comprehend the content of the speech [41]. The current document presents other features as findings are speech intelligibility data, which was given before and after the enhancement process.

For intelligibility assessment (Table 6), the same model demonstrated significant STOI improvements (26.4–50.9%) when processing speech samples mixed with dataset noises. The UrbanSound8K provided street/metro noises, while ESC-50 contributed industrial/office sounds, ensuring comprehensive evaluation across security-relevant scenarios like pedestrian monitoring and public announcements.

Compared to noise reduction and speech intelligibility, which are more conventional techniques, a deep learning-based approach for speech enhancement proved to be better. The model exhibits excellent speech quality, as evidenced by its high scores in PESQ, STOI, and SNR, particularly under high interference conditions [42]. This proposed model proves useful to real-time applications mainly because it delivers improved noise reduction performance for a given signal compared to the other existing methods like Wiener filtering, Kalman filtering, spectral subtraction, and so on, all at comparatively lower operation costs [43].

The proposed model verifies the planned effectiveness based on the operational environment experiments. The performance of the proposed model proved to be adequate in all conditions since it improved speech intelligibility measures such as SNR and STOI. Security monitoring systems demand clear speech as an essential factor of operation because for operators to make threat detections, they must speak effectively [44]. Therefore, deep learning is an effective noise reducer as it efficiently handles audio details and more efficient methods.

The research should proceed to optimization efforts to enhance the model’s motion effect without compromising on its high-performance levels at the edge of the field. Further development of research investigations that aim at strengthening adversarial robustness has to continue to defend against hindrances that may ensue when models are in action.

Impact of Enhanced Speech on Sound Classification Accuracy

To assess the downstream benefits of speech enhancement, we conducted a comparative evaluation of sound classification accuracy using features processed by the proposed DNN + GAN model versus unenhanced noisy inputs. The classification stage employed CNNs and RNNs, both known for their effectiveness in audio event recognition.

The enhanced speech features were extracted via Mel-spectrograms and fed into both CNN and RNN architectures trained on labeled samples from UrbanSound8K and ESC-50 datasets. Classification performance was measured using standard metrics: accuracy, precision, recall, and F1-score, and confusion matrices were analyzed to detect misclassification trends.

Results showed a consistent performance boost across all models when using enhanced speech data. On average, classification accuracy improved by 14–21%, depending on the environment and noise level. Notably, precision improved significantly for security-relevant classes such as gunshots, sirens, and alarms, which are critical for real-time surveillance applications.

A detailed breakdown of classification results across different environments is presented in Table 7.

These results demonstrate that the speech enhancement model not only improves intelligibility metrics (e.g., STOI and PESQ) but also significantly contributes to better classification performance in AI-driven threat detection systems. The enhancement process effectively suppresses irrelevant noise components while preserving speech features critical for sound categorization.

Confusion matrices further showed a reduction in misclassification between high-risk and low-risk sound classes, a valuable outcome for minimizing false alarms in security systems.

6. Future Directions and Recommendations

6.1. Improvements in Model Generalization

There is considerable potential for improving the generalization methods in real-life noise environments, particularly for speech enhancement through deep learning. Thus, transfer learning and domain adaptation methods act as adaptive approaches to improve model adaptability. Transfer learning allows using models trained on big training sets to learn the target environment provided with less data and offer higher accuracy [45]. Otherwise, implementing the domain adaptation methods using Kullback–Leibler (KL) divergence makes your algorithm more robust by minimizing the distance between the target noise and training distributions.

Supervised autoencoders are deep learning regularizations that contribute to generalization by imposing unsupervised conditions that improve feature discovery in noise cases [46]. Model robustness finds additional backing from data augmentation because such methodology subjects models to various noise types and room conditions, limiting overfitting while improving their effectiveness in real-life scenarios. One achieves superior acoustic environment consistency by employing deep learning technologies in speech enhancement systems.

6.2. Integration with Multimodal Security Systems

Speech enhancement reaches its highest level of effectiveness by becoming part of multimodal security systems. The fields of security technology now utilize biometric fusion between speech identification systems, facial recognition systems, and behavioral assessment techniques to enhance security detection [46]. Real-time switching between enhanced speech information and video analysis provides deeper situational awareness, resulting in heightened surveillance performance and better access to administration.

The deep learning methodology enables multimodal biometric systems to perform automatic input-weighting adjustments between audio data and video signals according to environmental factors [47]—an enhanced speech signal functions as the main identification tool during low-visibility situations. The development of security systems using multiple methods should center on maintaining real-time behavioral synchronization between speech processing and video and other identification elements for continuous system integration.

6.3. Potential for Edge AI and IoT Implementation

Deep learning-based speech enhancement models need deployment on edge devices and IoT infrastructure to deliver real-time security applications effectively. Deep learning models normally need high computational capabilities to function, but such requirements create delays and security risks before processing at the cloud level. Edge AI provides a promising solution through device-based processing while it decreases the requirement for cloud-based computing [48].

The deployment of edge systems depends on efficiency-enhancing techniques, which include model quantization and pruning methods. The optimization procedures of quantization minimize power usage while reducing weight precision, and pruning cuts unnecessary network connections to minimize memory requirements. The optimized machine-learning functions in microcontrollers offered by Tinyml are an effective solution for implementing speech enhancement in security systems [49].

The edge-based speech enhancement functionality enables secure platforms to function by voice commands and keeps their performance steady in noisy environments. When smart security systems built on IoT are integrated with these models, they enhance intercom and access control solution sounds, resulting in better security staff–user conversations.

6.4. Addressing Ethical and Regulatory Challenges

Security systems must investigate ethical matters and compliance requirements before implementing AI-based speech enhancement capabilities. All companies must adhere to worldwide AI regulations, specifically focusing on the European Union’s AI Act, for proper technology deployment [50]. Applications that integrate surveillance systems to process voice data must include privacy considerations and consent procedures, but they experience security problems with their data.

The development of performance limitations based on language differences and accents requires the reduction of biases because speech enhancement models only function with limited datasets. Creating unbiased speech enhancement systems requires training data that reflects all populations equally. The ethical guidelines should clarify the effects of AI audio processing on security decision-making procedures [51].

Developing new governance frameworks should create specifications to facilitate the correct usage of responsible AI systems within speech enhancement technology applications as in Figure 8. The development of new AI systems must be carried out by policymakers, AI researchers, and stakeholders while meeting regulatory requirements through privacy protection and fair practices.

6.5. Ethical and Regulatory Considerations in Speech Enhancement Systems

While the proposed deep learning model demonstrates strong performance in speech enhancement, its deployment in security applications raises important ethical and regulatory challenges that warrant discussion [29]. Key concerns include the following:

Privacy Implications: Processing voice data, especially in public surveillance contexts, must comply with data protection regulations (e.g., GDPR, AI Act). Clear protocols for anonymization and user consent are needed when handling sensitive audio recordings [29].
Bias Mitigation: The model’s performance across diverse accents, languages, and demographic groups requires rigorous testing to prevent discriminatory outcomes. Techniques like balanced dataset curation and fairness-aware training should be implemented [52].
Transparency: Deploying such systems demands explainable AI methods to justify enhancement decisions, particularly when used as evidence in security investigations.

Future work should focus on developing standardized audit frameworks for ethical AI deployment in audio processing, including impact assessments and stakeholder guidelines. While this study highlights these concerns, addressing them fully requires interdisciplinary collaboration between AI researchers, policymakers, and legal experts.

Future Architectural Directions

Although the proposed DNN + GAN model has demonstrated strong performance and operational feasibility, future research should explore more advanced architectures that can further enhance generalization, robustness, and interpretability in speech enhancement systems.

In particular, transformer-based models like SE-Conformers and self-supervised learning architectures such as HuBERT and wav2vec 2.0 offer promising capabilities in capturing long-term dependencies and adapting to novel acoustic environments. However, these models must first be optimized for inference efficiency to make them viable for deployment on security-focused edge devices.

Additionally, hybrid architectures that combine the sequential modeling power of Dual-Path RNNs with transformer layers could bridge the performance-efficiency trade-off currently faced in real-time systems. Future work should benchmark these newer models against DNN + GAN on security-specific audio datasets, with an emphasis on latency, accuracy, and energy consumption.

7. Conclusions

A deep learning-based speech enhancement system was adopted to evaluate its performance and enhance security functions through quality and acoustic clarity improvement. The model achieved superior results than classical Wiener and Kalman filters by implementing a DNN and GAN hybrid neural network architecture. Real-time security monitoring received enhanced voice clarity through the proposed model that achieved PESQ 3.85 and STOI 0.92, together with SNR 12.5 dB, to reach its excellent performance level.

Analysis of audio frames ran at a rate of 18.3 ms when using the model, thus showing faster performance than conventional methods. The model runs at high speeds to support real-time security system use, especially for locations with elevated noise levels. The model demonstrated success in real noise environments, enhancing speech quality across street areas, industrial sites, and official business locations.

The next development phase should improve model generalization capabilities by employing transfer learning and domain adaptation, providing reliable acoustic performance in various environments. The surveillance capabilities could become stronger by integrating security systems with video analytics and biometric authentication. Integrating the model with edge AI and IoT devices would boost real-time processing speed while minimizing cloud dependency and reducing processing delays.

The deployment of AI in security applications requires that ethical and regulatory concerns be addressed as priority points. To ensure compliance, global AI policies must be obeyed, and speech processing across all languages must be fair and transparent in all AI decisions. Applying deep learning-based speech enhancement technology substantially improves security systems worldwide by solving these issues.

Author Contributions

Conceptualization, S.Y.M.; Methodology, S.Y.M.; Validation, S.Y.M.; Data curation, S.Y.M.; Resources, T.Z. and Y.G.; Writing—original draft, N.A.M.; Writing—review and editing, Y.G. and N.A.M.; Supervision, T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by National Natural Science Foundation of China (No. 62271344) Department of Information Science.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Michelsanti, D.; Tan, Z.H.; Zhang, S.X.; Xu, Y.; Yu, M.; Yu, D.; Jensen, J. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1368–1396. [Google Scholar] [CrossRef]
McLoughlin, I.; Zhang, H.; Xie, Z.; Song, Y.; Xiao, W. Robust sound event classification using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 540–552. [Google Scholar] [CrossRef]
Purwins, H.; Li, B.; Virtanen, T.; Schlüter, J.; Chang, S.Y.; Sainath, T. Deep learning for audio signal processing. IEEE J. Sel. Top. Signal Process. 2019, 13, 206–219. [Google Scholar] [CrossRef]
Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
Das, N.; Chakraborty, S.; Chaki, J.; Padhy, N.; Dey, N. Fundamentals, present and future perspectives of speech enhancement. Int. J. Speech Technol. 2021, 24, 883–901. [Google Scholar] [CrossRef]
Saleem, N. Speech Enhancement for Improving Quality and Intelligibility in Complex Noisy Environments. Doctoral Dissertation, University of Engineering & Technology, Peshawar, Pakistan, 2021. [Google Scholar]
Yuliani, A.R.; Amri, M.F.; Suryawati, E.; Ramdan, A.; Pardede, H.F. Speech enhancement using deep learning methods: A review. J. Elektron. Dan Telekomun. 2021, 21, 19–26. [Google Scholar] [CrossRef]
Barlow, C.F.; Taylor, M.P. Classification of Speech Enhancement Methods. ResearchGate. 2020. Available online: https://www.researchgate.net/figure/Classification-of-Speech-Enhancement-Methods_fig1_287718128 (accessed on 6 June 2025).
Mehrish, A.; Majumder, N.; Bharadwaj, R.; Mihalcea, R.; Poria, S. A review of deep learning techniques for speech processing. Inf. Fusion 2023, 99, 101869. [Google Scholar] [CrossRef]
Lohani, B.; Gautam, C.K.; Kushwaha, P.K.; Gupta, A. Deep Learning Approaches for Enhanced Audio Quality Through Noise Reduction. In Proceedings of the 2024 International Conference on Communication, Computer Sciences and Engineering (IC3SE), Gautam Buddha Nagar, India, 9–11 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 447–453. [Google Scholar]
Avci, O.; Abdeljaber, O.; Kiranyaz, S.; Hussein, M.; Gabbouj, M.; Inman, D.J. A review of vibration-based damage detection in civil structures: From traditional methods to Machine Learning and Deep Learning applications. Mech. Syst. Signal Process. 2021, 147, 107077. [Google Scholar] [CrossRef]
Ali, M.; Khan, S.U.; Vasilakos, A.V. Security in cloud computing: Opportunities and challenges. Inf. Sci. 2015, 305, 357–383. [Google Scholar] [CrossRef]
Hassija, V.; Chamola, V.; Saxena, V.; Jain, D.; Goyal, P.; Sikdar, B. A survey on IoT security: Application areas, security threats, and solution architectures. IEEE Access 2019, 7, 82721–82743. [Google Scholar] [CrossRef]
Vary, P.; Martin, R. Digital Speech Transmission and Enhancement; John Wiley & Sons: Hoboken, NJ, USA, 2023. [Google Scholar]
Zhang, L.; Tan, J.; Han, D.; Zhu, H. From machine learning to deep learning: Progress in machine intelligence for rational drug discovery. Drug Discov. Today 2017, 22, 1680–1685. [Google Scholar] [CrossRef] [PubMed]
He, Y.; Meng, G.; Chen, K.; Hu, X.; He, J. Towards security threats of deep learning systems: A survey. IEEE Trans. Softw. Eng. 2020, 48, 1743–1770. [Google Scholar] [CrossRef]
Chang, Y.; Yan, L.; Fang, H.; Luo, C. Anisotropic spectral-spatial total variation model for multispectral remote sensing image describing. IEEE Trans. Image Process. 2015, 24, 1852–1866. [Google Scholar] [CrossRef] [PubMed]
Adolphs, R.; Nummenmaa, L.; Todorov, A.; Haxby, J.V. Data-driven approaches in the investigation of social perception. Philos. Trans. R. Soc. B: Biol. Sci. 2016, 371, 20150367. [Google Scholar] [CrossRef]
Liu, Q.; Wang, L. t-Test and ANOVA for data with ceiling and/or floor effects. Behav. Res. Methods 2021, 53, 264–277. [Google Scholar] [CrossRef]
Anidjar, O.H.; Marbel, R.; Yozevitch, R. Harnessing the power of Wav2Vec2 and CNNs for Robust Speaker Identification on the VoxCeleb and LibriSpeech Datasets. Expert Syst. Appl. 2024, 255, 124671. [Google Scholar] [CrossRef]
Nagrani, A.; Chung, J.S.; Huh, J.; Brown, A.; Coto, E.; Xie, W.; McLaren, M.; Reynolds, D.A.; Zisserman, A. Voxsrc 2020: The second voxceleb speaker recognition challenge. arXiv 2020, arXiv:2012.06867. [Google Scholar]
Abdul, Z.K.; Al-Talabani, A.K. Mel frequency cepstral coefficient and its applications: A review. IEEE Access 2022, 10, 122136–122158. [Google Scholar] [CrossRef]
Harte, N.; Gillen, E. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Trans. Multimed. 2015, 17, 603–615. [Google Scholar] [CrossRef]
Fredianelli, L.; Bolognese, M.; Fidecaro, F.; Licitra, G. Classification of noise sources for port area noise mapping. Environments 2021, 8, 12. [Google Scholar] [CrossRef]
Haghani, M.; Sarvi, M. Crowd behaviour and motion: Empirical methods. Transp. Res. Part B Methodol. 2018, 107, 253–294. [Google Scholar] [CrossRef]
Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
Gu, S.; Levine, S.; Sutskever, I.; Mnih, A. Muprop: Unbiased backpropagation for stochastic neural networks. arXiv 2015, arXiv:1511.05176. [Google Scholar]
Zgank, A.; Donaj, G.; Vlaj, D. Speech Quality Assessment and Emotions-Effect on the PESQ Metric. In Proceedings of the 2024 ELEKTRO (ELEKTRO), Zakopane, Poland, 20–22 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–4. [Google Scholar]
Feng, T.; Hebbar, R.; Mehlman, N.; Shi, X.; Kommineni, A.; Narayanan, S. A Review of Speech-Centric Trustworthy Machine Learning: Privacy, Safety, and Fairness. APSIPA Trans. Signal Inf. Process. 2023, 12, e39. [Google Scholar] [CrossRef]
Facchinetti, N.; Simonetta, F.; Ntalampiras, S. A Systematic Evaluation of Adversarial Attacks against Speech Emotion Recognition Models. Intell. Comput. 2024, 3, 0088. [Google Scholar] [CrossRef]
Carvalho, A.P.; Canedo, E.D.; Carvalho, F.P.; Carvalho, P.H.P. Anonymization and Compliance to Protection Data: Impacts and Challenges into Big Data. In Proceedings of the ICEIS 2020, Virtual Event, 5–7 May 2020; pp. 31–41. [Google Scholar]
Ahmad, K.; Maabreh, M.; Ghaly, M.; Khan, K.; Qadir, J.; Al-Fuqaha, A. Developing future human-centered smart cities: Critical analysis of smart city security, Data management, and Ethical challenges. Comput. Sci. Rev. 2022, 43, 100452. [Google Scholar] [CrossRef]
Pal, S.; Ebrahimi, E.; Zulfiqar, A.; Fu, Y.; Zhang, V.; Migacz, S.; Nellans, D.; Gupta, P. Optimizing multi-GPU parallelization strategies for deep learning training. IEEE Micro 2019, 39, 91–101. [Google Scholar] [CrossRef]
Kaminskas, M.; Bridge, D. Diversity, serendipity, novelty, and coverage: A survey and empirical analysis of beyond-accuracy objectives in recommender systems. ACM Trans. Interact. Intell. Syst. (TiiS) 2016, 7, 1–42. [Google Scholar] [CrossRef]
Sagun, L.; Evci, U.; Guney, V.U.; Dauphin, Y.; Bottou, L. Empirical analysis of the hessian of over-parametrized neural networks. arXiv 2017, arXiv:1706.04454. [Google Scholar]
Deng, L.; Li, G.; Han, S.; Shi, L.; Xie, Y. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 2020, 108, 485–532. [Google Scholar] [CrossRef]
Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; Abbeel, P. Adversarial attacks on neural network policies. arXiv 2017, arXiv:1702.02284. [Google Scholar]
Mensah, S.Y.; Zhang, T.; Mahmud, N.A.; Geng, Y. Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems. Preprints 2025, 2025041005. [Google Scholar]
Gelderblom, F.B. Evaluating Performance Metrics for Deep Neural Network-Based Speech Enhancement Systems. Doctoral Thesis, Norwegian University of Science and Technology, Trondheim, Norway, 2023. [Google Scholar]
Nahm, F.S. Receiver operating characteristic curve: Overview and practical use for clinicians. Korean J. Anesthesiol. 2022, 75, 25–36. [Google Scholar] [CrossRef] [PubMed]
Sharma, B.; Wang, Y. Automatic evaluation of song intelligibility using singing adapted STOI and vocal-specific features. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 28, 319–331. [Google Scholar] [CrossRef]
Cui, J. Speech Enhancement by Using Deep Learning Algorithms. Doctoral Dissertation, University of Southampton, Southampton, UK, 2024. [Google Scholar]
Manju, B.R.; Sneha, M.R. ECG denoising using wiener filter and kalman filter. Procedia Comput. Sci. 2020, 171, 273–281. [Google Scholar]
Nawar, M.N.A.M. Neural Enhancement Strategies for Robust Speech Processing. Doctoral Thesis, University of Trento, Trento, Italy, 2023. [Google Scholar]
Zheng, Q.; Yang, M.; Yang, J.; Zhang, Q.; Zhang, X. Improvement of the generalization ability of deep CNN via implicit regularization in the two-stage training process. IEEE Access 2018, 6, 15844–15869. [Google Scholar] [CrossRef]
Le, L.; Patterson, A.; White, M. Supervised Autoencoders: Improving Generalization Performance with Unsupervised Regularizers. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper/2018/hash/2a38a4a9316c49e5a833517c45d31070-Abstract.html (accessed on 6 June 2025).
Oloyede, M.O.; Hancke, G.P. Unimodal and multimodal biometric sensing systems: A review. IEEE Access 2016, 4, 7532–7555. [Google Scholar] [CrossRef]
Merenda, M.; Porcaro, C.; Iero, D. Edge machine learning for ai-enabled iot devices: A review. Sensors 2020, 20, 2533. [Google Scholar] [CrossRef]
Greco, L.; Percannella, G.; Ritrovato, P.; Tortorella, F.; Vento, M. Trends in IoT based solutions for health care: Moving AI to the edge. Pattern Recognit. Lett. 2020, 135, 346–353. [Google Scholar] [CrossRef]
Cath, C. Governing artificial intelligence: Ethical, legal and technical opportunities and challenges. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2018, 376, 20180080. [Google Scholar] [CrossRef] [PubMed]
Vayena, E.; Blasimme, A.; Cohen, I.G. Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018, 15, e1002689. [Google Scholar] [CrossRef] [PubMed]
Le Quy, T.; Roy, A.; Iosifidis, V.; Zhang, W.; Ntoutsi, E. A survey on datasets for fairness-aware machine learning. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1452. [Google Scholar] [CrossRef]

Figure 1. Comparison of sound classification accuracy. High-level architecture of the proposed DNN + GAN-based speech enhancement and classification system. This illustration is original and was created by the authors to represent the proposed workflow.

Figure 2. Classification of speech enhancement methods [8].

Figure 3. Kalman filtering. Comparison of PESQ and STOI performance across three models: the proposed DNN + GAN model, a CNN baseline, and a transformer baseline.

Figure 4. A chart comparing the PESQ and STOI scores of different speech enhancement methods.

Figure 5. ANOVA testing.

Figure 6. A bar chart comparing the performance of the proposed DNN + GAN model with Wiener and Kalman filtering methods across four key metrics: PESQ, STOI, SNR improvement, and processing time. This visualization clearly demonstrates the superior accuracy, intelligibility, and efficiency of the deep learning-based model for speech enhancement in security applications.

Figure 7. Classification accuracy across noise environments.

Figure 8. A chart summarizing the strategic focus areas for future deep learning speech enhancement research.

Table 1. Evaluation metrics for speech enhancement.

Metric	Definition	Typical Range
PESQ (Perceptual Evaluation of Speech Quality)	Measures speech quality based on human perception	−0.5 to 4.5
STOI (Short-Time Objective Intelligibility)	Assesses speech intelligibility in noisy conditions	0 to 1
SNR Improvement	Measures the difference in SNR before and after enhancement	Variable
Segmental SNR (SegSNR)	Evaluates SNR in short time frames to assess local signal quality	Variable (dB)
Log-Likelihood Ratio (LLR)	Measures the difference between the original and enhanced speech spectra	0 to 2 (lower is better)

Table 2. Comparison of speech enhancement methods.

Method	PESQ Score	STOI Score	Real-Time Capability
Wiener Filter	2.1	0.65	Yes
Kalman Filter	2.4	0.70	No
CNN Model	3.5	0.88	Yes
Transformer Model	3.8	0.91	No

Table 3. Model performance evaluation.

Metric	Proposed Model (DNN + GAN)	Wiener Filtering	Kalman Filtering
PESQ	3.85	2.91	3.12
STOI	0.92	0.78	0.83
SNR Improvement	12.5	8.2	9.4
Processing Time (ms)	18.3	27.8	25.4

Table 4. Performance improvement over traditional methods.

Metric	Wiener Filtering Improvement (%)	Kalman Filtering Improvement (%)	Spectral Subtraction Improvement (%)
PESQ	32.3	23.4	40.1
STOI	17.9	10.8	25.5
SNR	52.4	32.9	60.2
Processing Time	34.2	27.9	41.6

Table 5. Real-world noise reduction effectiveness.

Environment	Initial SNR	Final SNR	Improvement	Noise Type
Busy Street	5.2	14.8	9.6	Traffic, honking, chatter
Industrial	3.7	13.5	9.8	Machine noise, engines
Office	8.5	18.2	9.7	HVAC, keyboard typing
Shopping Mall	6.0	15.5	9.5	Background music
Metro Station	4.5	14.0	9.5	Train movement, intercom

Table 6. Speech intelligibility improvements.

Environment	Initial STOI	Final STOI	Improvement (%)	Application Scenario
Busy Street	0.61	0.87	42.6	Pedestrian monitoring
Industrial	0.57	0.86	50.9	Machine operation alerts
Office	0.72	0.91	26.4	Workplace safety monitoring
Shopping Mall	0.65	0.89	36.9	Crowd noise suppression
Metro Station	0.60	0.88	46.7	Passenger announcements

Table 7. Classification accuracy (%) using noisy vs. enhanced inputs.

Environment	CNN (Noisy Input)	CNN (Enhanced Input)	RNN (Noisy Input)	RNN (Enhanced Input)
Busy Street	72.1	86.4	69.8	85.2
Industrial Site	68.5	84.0	66.2	82.7
Office Environment	75.6	89.1	73.3	87.5
Shopping Mall	70.3	83.9	67.5	82.1
Metro Station	65.4	80.6	63.7	79.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mensah, S.Y.; Zhang, T.; Mahmud, N.A.; Geng, Y. Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems. Electronics 2025, 14, 2643. https://doi.org/10.3390/electronics14132643

AMA Style

Mensah SY, Zhang T, Mahmud NA, Geng Y. Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems. Electronics. 2025; 14(13):2643. https://doi.org/10.3390/electronics14132643

Chicago/Turabian Style

Mensah, Samuel Yaw, Tao Zhang, Nahid AI Mahmud, and Yanzhang Geng. 2025. "Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems" Electronics 14, no. 13: 2643. https://doi.org/10.3390/electronics14132643

APA Style

Mensah, S. Y., Zhang, T., Mahmud, N. A., & Geng, Y. (2025). Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems. Electronics, 14(13), 2643. https://doi.org/10.3390/electronics14132643

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Learning-Based Speech Enhancement for Robust Sound Classification in Security Systems

Abstract

1. Introduction

1.1. Speech Enhancement in Security Systems

1.2. Robust Sound Classification

1.3. Deep Learning in Audio Processing

1.3.1. Research Questions

1.3.2. Research Objectives

2. Literature Review

2.1. Speech Enhancement Techniques

2.2. Traditional vs. Deep Learning Approaches

2.3. Deep Learning Architectures in Audio Processing

2.4. Challenges in Security Applications

2.5. Key Studies and Research Gaps

Justification for Using DNN + GAN in Real-Time Security Applications

3. Methodology

3.1. Research Approach

3.2. Dataset Selection

3.3. Data Selection

3.4. Noise Contamination and Categorisation

3.4.1. Dataset Preparation

Selection and Justification

Preprocessing Pipeline

Segmentation and Framing

Labeling Protocol

3.5. Model Architecture for Speech Enhancement

3.6. DNN + GAN Model Architecture

3.7. Training and Validation Strategies

4. Challenges in Implementation

4.1. Legal and Ethical Concerns

4.2. Technical Barriers to Deployment

4.3. Noise Variability and Generalization Issues

4.4. Computational Cost and Scalability

4.5. Security Vulnerabilities in AI Systems

4.6. Integrated Perspective

5. Results and Discussion

5.1. Comparison with Traditional Methods

5.2. Noise Reduction Effectiveness in Different Environments

5.3. Speech Intelligibility Gains

Impact of Enhanced Speech on Sound Classification Accuracy

6. Future Directions and Recommendations

6.1. Improvements in Model Generalization

6.2. Integration with Multimodal Security Systems

6.3. Potential for Edge AI and IoT Implementation

6.4. Addressing Ethical and Regulatory Challenges

6.5. Ethical and Regulatory Considerations in Speech Enhancement Systems

Future Architectural Directions

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI