1. Introduction
Applications that rely on speech signals, such as automatic speech recognition (ASR), voice communication, speaker verification, and hearing aids, play a significant role in contemporary society. Sound sensors (microphones) capture speech together with background noise, and speech enhancement allows these applications to operate effectively in noisy environments; without it, most of them are not resilient to interference. As a result, speech enhancement (SE) [1,2,3,4], a technique that aims to improve the intelligibility and quality of degraded speech signals, has seen widespread use in these applications. Over the last few years, deep learning techniques have been used increasingly to build SE systems. One subset of SE systems, referred to as spectral-mapping-based methods, enhances frequency-domain acoustic features. In these methods [5,6,7,8], the short-time Fourier transform (STFT) and inverse STFT are used to analyse and reconstruct speech signals, respectively. Deep learning models, namely fully connected deep denoising auto-encoders [9], convolutional neural networks (CNNs) [10], recurrent neural networks (RNNs) [11], and long short-term memory (LSTM) networks [12,13], are then utilized as transformation functions that map noise-degraded spectral features to clean features. In parallel, various techniques have been developed that integrate different kinds of deep learning models (for example, CNN and RNN) to capture local and sequential correlations more efficiently [14,15].
Recently, an SE system built on stacked simple recurrent units (SRUs) [16,17] demonstrated denoising performance comparable to that of an LSTM-based SE system while requiring a significantly lower computational cost for training. Although the methodologies described above already provide remarkable performance, the enhanced speech signal cannot realize its full potential without proper phase information. To address this issue, some SE systems use complex spectral mapping and complex ratio masking to improve distorted speech [18,19]. In [20], phase estimation was recast as a classification problem and applied to source separation. A further category of SE techniques performs enhancement directly on the raw waveform [21,22]. These methods, typically referred to as waveform-mapping-based approaches, form a subcategory of SE methods. Fully convolutional networks (FCNs) are one kind of deep learning model that has been widely used to perform waveform mapping directly [23,24].
The WaveNet model, originally proposed for text-to-speech applications, has also been adopted in waveform-mapping-based SE systems [25,26]. Because fully convolutional architectures preserve more local information than fully connected architectures, they can represent the frequency characteristics of speech waveforms more precisely. More recently, a temporal convolutional neural network (TCNN) [10] was proposed to model temporal characteristics and carry out SE in the time domain. Some waveform-mapping-based SE techniques [27] employ an adversarial or perceptual loss, in addition to the point-to-point loss used for optimization, to capture high-level discrepancies between predictions and their targets. For the waveform-mapping-based SE algorithms described above, an efficient characterization of sequential and local patterns is a crucial factor in overall performance. Although integrating CNN with RNN/LSTM may be a workable solution, the high computational cost and model size of RNNs can drastically limit their applicability. In this research work, we introduce and extensively explore an E2E waveform-mapping-based SE technique that makes use of an alternative convolutional recurrent network (CRN). This technique improves efficiency by combining the advantages of CNNs with parallel recurrent models (LSTM, GRU, and SRU), allowing waveforms to be mapped end to end. In contrast to spectral-mapping-based CRN models [14,15], the proposed solutions directly estimate feature masks from unprocessed waveforms using highly parallelizable recurrent networks. Figure 1 shows the overall flow of the speech enhancement research presented in this work.
The remainder of the paper is structured as follows: Section 2 presents related studies. The methodology of the proposed E2E waveform-based SE is explained in Section 3. Section 4 presents the experiments, whereas the results and discussion are given in Section 5. The concluding remarks of this study are drawn in Section 6.
2. Related Studies
The majority of existing speech enhancement systems involve spectrogram features [28,29,30], which require a complex transformation and result in the loss of phase information. Earlier research has used convolutional networks to address these problems by learning the temporal correlation among high-resolution speech waveforms. However, memory-intensive dilated convolutions and the aliasing caused by upsampling restrict the performance of these models. Owing to their straightforward design workflow, E2E deep learning models have received a lot of attention for speech enhancement. To improve the performance of an E2E model, both the local and the sequential characteristics of the speech waveform should be modeled effectively.
The study in [31] presents a completely E2E recurrent neural network (RNN) for enhancing single-channel speech. By lowering the feature resolution without sacrificing information, an hourglass-shaped network effectively captured long-range temporal correlations. Additionally, the study leveraged residual connections to increase model adaptation and prevent gradient degradation across the layers. According to the experimental findings, the E2E-RNN model outperforms state-of-the-art techniques on six quantitative performance indicators. The study in [21] presents a fully convolutional network (FCN) for waveform-based SE, in which the waveforms are modeled using convolutional layers. Because the FCN contains only convolutional layers, local temporal speech features are retained with few weights. Experiments reveal that simple DNN- and CNN-based models cannot recover high-frequency waveform components, thereby reducing speech intelligibility. The proposed FCN model recovers waveforms successfully and outperforms the LPS-based DNN baseline in terms of intelligibility and speech quality. The study in [32] presents an efficient E2E SE model that employs a CNN module to retrieve local speech features and an SRU module to represent their sequential properties. The SRU can be effectively parallelized in computation and uses fewer model parameters than LSTM and GRU. With the SRU and the constrained feature map, the model performs favourably against other recent techniques with lower computational cost and running time.
A WaveNet-based E2E SE model is proposed in [26], where the suggested model adaptation preserves WaveNet's outstanding acoustic modeling capabilities while decreasing its temporal complexity. The model uses non-causal, dilated convolutions and predicts the target signals. The discriminative model adapts by reducing a regression loss with supervised learning. These changes make training and inference parallelizable. Both computational and perceptual assessments favour the suggested technique over Wiener filtering, which operates on the magnitude spectrogram. Due to high speech sampling rates, using a long temporal input context at the sample level is challenging yet essential for high-quality SE results. To this end, the study in [33] presents the Wave-U-Net, which resamples feature maps to compute and aggregate information at various time scales. With architectural changes, the study provides an additional output layer, an upsampling approach, and a context-aware prediction framework to reduce artifacts. Experiments on speech separation show that the Wave-U-Net architecture performs similarly to a state-of-the-art spectrogram-based U-Net architecture. Finally, the study highlights an issue with outliers in existing SDR assessment criteria and advises reporting rank-based statistics.
The study in [34] presents a CNN for real-time SE in the time domain. The suggested CNN uses an encoder-decoder architecture with a temporal convolutional module. The encoder of the temporal CNN projects a noisy input frame into a low-dimensional representation. The temporal convolutional module employs causal and dilated convolutional layers to exploit the present and previous frames of the encoder output. The decoder reconstructs the enhanced frames from these outputs. The model is both speaker- and noise-independent and, according to the experiments, consistently outperformed the state-of-the-art real-time convolutional recurrent model. Fully convolutional models have fewer trainable parameters than other models. The study in [35] proposes the temporal CRN, an E2E neural model that maps noisy waveforms to clean waveforms. The model efficiently exploits both short-term and long-term information. In addition, the study offers a forward-propagation architecture that downsamples and upsamples the speech waveforms. The proposed model outperformed CRNs and also provided crucial training-stabilization approaches. In terms of speech intelligibility and quality, the temporal CRN model exceeded the previous techniques.
The study in [36] examined how loss functions affect time-domain deep learning SE. Perceptually inspired loss functions may be better than MSE. The study demonstrated that the learning rate is a significant design parameter even for adaptive gradient-based optimizers, a fact that is typically disregarded. In addition, waveform-matching performance measures may fail completely in certain cases. Finally, a loss function based on the scale-invariant signal-to-distortion ratio was shown to yield strong overall performance across a variety of common SE assessment metrics, suggesting that it is a solid general-purpose loss function for SE systems. The study in [23] presents an E2E utterance-based SE framework employing FCNs. Thanks to utterance-based optimization, temporal correlation information is used to directly improve perception-based objective measures, and the FCN is utilised to optimise speech intelligibility. Because the training and assessment measures are consistent, the experimental findings suggest that the proposed SE improves intelligibility over standard MSE-optimized speech. By adding intelligibility into the model optimization, both human subjects and automatic ASR systems can understand the enhanced speech better than with the minimum-MSE criterion.
Using generative adversarial networks (GANs) on the raw signal, the study in [27] offers a generative technique to regenerate noisy signals into their clean versions. Different variants of the proposed system are investigated to determine the best architecture for an adversarially trained convolutional auto-encoder applicable to speech signals. The suggested approach is evaluated both objectively and subjectively: the former allows selecting among variants and tuning hyperparameters, while the latter is employed in a 42-subject listening experiment to confirm the approach's success. In addition, the study showed how the method may be used to regenerate whispered speech. The research in [37] offers time-domain SE using Metric GAN, an extension of the generative adversarial network in the time domain with metric assessment, which alleviates the scale issue and stabilizes model training, thereby improving performance. In addition, it provides a novel approach based on objective-function mapping to analyse Metric GAN's performance and explain why it is superior to the Wasserstein GAN. Experiments confirm that the suggested technique works and show the benefits of Metric GAN.
Table 1 summarizes the various neural models and the corresponding research gaps for SE.
In this paper, we propose and thoroughly examine an E2E waveform-mapping-based SE approach utilising an alternative CRN. This method achieves better efficiency by combining the benefits of CNNs and parallel recurrent models (LSTM, GRU, and SRU), which enables us to map waveforms end to end. In contrast to CRN models based on spectral mapping, the proposed methods directly estimate feature masks from unprocessed waveforms using highly parallelizable recurrent networks. The contributions of this study are as follows: (a) Unlike the CRNs proposed in [14,15], which are based on spectral mapping, the proposed E2E models directly generate feature masks from raw waveforms using highly parallelizable recurrent modules. For SE, we evaluated our methodology on publicly accessible datasets [38,39,40] and obtained speech quality scores comparable to state-of-the-art techniques while using a very straightforward architecture and an l1 loss function. (b) With raw speech waveforms as model inputs, there is no need for hand-crafted acoustic features or their processing. Furthermore, no linear interpolation techniques are needed for upsampling, which might otherwise result in the loss of essential information. The suggested E2E model has a simple design that outperforms a number of more complex neural network techniques. We believe this architecture may also be applied to regression problems other than speech enhancement that involve long-term dependencies and high-resolution time-series data. We examined our E2E models using various objective measures, confirming their potential to greatly improve speech quality and intelligibility.
3. Proposed E2E Waveform-Based SE Algorithm
This section describes the proposed E2E SE system in detail. The architecture is a fully end-to-end neural network that requires neither preprocessing nor customized acoustic features. It jointly represents local and sequential information by leveraging the benefits of CNNs and parallel RNNs. Figure 2 illustrates the general structure of the proposed SE model.
Our model adopts a 1D CNN input module to implement waveform-mapping-based SE. WaveCRN [18] is the foundation of these SE models. For feature-map extraction, the frames of the input noisy speech are convolved with a two-dimensional (2D) tensor. The convolution stride is set to half the kernel size so that the temporal dimension of the feature map is reduced from the speech length to the number of time steps and sequences can be computed efficiently. The 1D convolutional layer (Conv-1D) is followed by batch normalization (BN), a PReLU activation, a Bi-LSTM/Bi-GRU/Bi-SRU module, and a 1D deconvolutional layer (Deconv-1D). Conv-1D combined with a recurrent network is an effective module for transforming noisy waveforms into clean waveforms: the convolutional and recurrent networks process speech at the frame and utterance levels, respectively. Three types of temporal encoders are used for this purpose: the bidirectional LSTM (Bi-LSTM), the bidirectional GRU (Bi-GRU), and the bidirectional SRU (Bi-SRU). For every batch of feature maps, the Bi-LSTM-, Bi-GRU-, or Bi-SRU-based feature extractor produces encoded features, which are turned into a restricted feature mask (RFM). The RFM is multiplied element-wise with the feature maps to generate the masked feature maps. There are two residual connections: (i) the recurrent-net input is added to the recurrent-net output, and (ii) the module input is added to the Deconv-1D layer output. We found these residual connections to be important for building a deep neural architecture. Finally, a transposed 1D convolution layer estimates the enhanced waveform from the masked feature map.
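The following PyTorch sketch illustrates one possible realization of this block. The channel width, kernel size, hidden size, mask activation, linear mask projection, and the exact placement of the residual connections are assumptions made for the example, not a definitive reproduction of the proposed model; the Bi-GRU shown can be swapped for a Bi-LSTM or Bi-SRU.

```python
# Minimal sketch of the E2E CRN block (illustrative assumptions, not the authors' exact code).
import torch
import torch.nn as nn

class E2ECRNSketch(nn.Module):
    def __init__(self, channels=256, kernel_size=64, hidden=256):
        super().__init__()
        stride = kernel_size // 2        # stride = K/2 downsamples the waveform by a factor of K/2
        padding = kernel_size // 4       # P = K/4 keeps the conv/deconv lengths consistent (assumed)
        # Conv-1D input module replaces the STFT analysis stage
        self.conv = nn.Conv1d(1, channels, kernel_size, stride=stride, padding=padding)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.PReLU()
        # Bidirectional recurrent encoder (Bi-GRU shown; Bi-LSTM/Bi-SRU are drop-in alternatives)
        self.rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        # Linear projection of the encoded features to a restricted feature mask (assumed design)
        self.mask = nn.Linear(2 * hidden, channels)
        # Deconv-1D output module restores the waveform length
        self.deconv = nn.ConvTranspose1d(channels, 1, kernel_size, stride=stride, padding=padding)

    def forward(self, x):                                 # x: (N, 1, L) noisy waveform, L divisible by K/2
        f = self.act(self.bn(self.conv(x)))               # feature map: (N, C, T)
        r, _ = self.rnn(f.transpose(1, 2))                # encoded features: (N, T, 2*hidden)
        m = torch.sigmoid(self.mask(r)).transpose(1, 2)   # bounded mask (N, C, T); exact restriction assumed
        masked = f * m + f                                 # element-wise masking + residual connection (i)
        y_hat = self.deconv(masked) + x                    # residual connection (ii) to the output
        return y_hat                                       # enhanced waveform: (N, 1, L)
```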
Usually, the short-time Fourier transform (STFT) is used to transform speech waveforms into the spectral domain in spectral-mapping-based SE systems. To perform waveform-mapping-based SE, however, we replace the STFT processing with a 1D CNN module. The 1D convolutional module captures different local patterns of the speech signal, and the various feature maps relate to various periodic signal elements. In signal-processing terms, the convolutional kernels can be considered a collection of finite-impulse-response (FIR) filters and have the capacity to resemble ordinary filter banks [20]. The outputs of the time convolution are thus viewed as a concealed time-frequency (T-F) representation. Owing to the nature of neural networks, the CNN module is completely trainable. For every batch, the input noisy audio $X \in \mathbb{R}^{N \times 1 \times L}$ is convolved with a two-dimensional tensor $W \in \mathbb{R}^{C \times K}$ to extract the feature map $F \in \mathbb{R}^{N \times C \times T}$, where $N$ is the batch size, $C$ the number of channels, $K$ the kernel size, $T$ the number of time steps, and $L$ the speech length. Furthermore, to limit the sequence length for computational efficiency, the convolution stride is set to half the kernel size, reducing the temporal dimension of $F$ from $L$ to $T = 2L/K$.
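As a concrete illustration of the shapes involved, the short sketch below shows how a Conv1d layer with stride $K/2$ and padding $K/4$ reduces a length-$L$ waveform to a $T = 2L/K$ feature map; the numeric values of $N$, $L$, $C$, and $K$ are arbitrary examples, not the settings used in the experiments.

```python
# Shape check for the Conv-1D input module (example hyperparameters only).
import torch
import torch.nn as nn

N, L, C, K = 4, 16000, 256, 64                       # batch, samples, channels, kernel size
conv = nn.Conv1d(1, C, kernel_size=K, stride=K // 2, padding=K // 4)
x = torch.randn(N, 1, L)                             # batch of noisy waveforms
feature_map = conv(x)                                # hidden time-frequency-like representation
print(feature_map.shape)                             # torch.Size([4, 256, 500]); T = 2L/K = 500
```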
RNN-based SE models can obtain good results, but at a high computational load [19]. Therefore, various recurrent models are combined with the CNN in this study to examine SE performance. We capture the temporal correlation of the feature maps extracted by the input module in both directions using Bi-LSTM, Bi-GRU, and Bi-SRU. For each batch, the feature maps are passed through the LSTM-, GRU-, or SRU-based recurrent feature extractor, and the encoded features are formed by concatenating the hidden states extracted in both directions. The feature map $F$ is then multiplied element-wise by the restricted mask $M$ to obtain the transformed (masked) feature map. With 1D temporal deconvolution (Deconv-1D), the features are upsampled back to raw waveforms; the deconvolutional layer enables the model to construct a waveform segment from the transformed feature vector. However, this process is prone to uneven overlap, resulting in an irregular pattern of distortions, as shown in Figure 3. The deconvolutional layer exhibits uneven overlap when the kernel size is not divisible by the stride. For this reason, the stride is set to half the kernel size, ensuring that the outputs are evenly balanced and free of such distortions. Since the feature-map length was reduced, length restoration is required to generate waveforms whose lengths match the input waveforms.
Given the input and output lengths $L_{in}$ and $L_{out}$, and the stride and padding $S$ and $P$, the relationship between the input and output lengths of the Deconv-1D layer is expressed as:
$$L_{out} = (L_{in} - 1) \times S - 2P + K$$
With $S = K/2$, $P = K/4$, and $L_{in} = T$, $L_{out}$ is equal to $L$, which indicates that the output waveforms have the same length as the input waveforms.
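The sketch below checks this length relation numerically under the same assumed settings ($S = K/2$, $P = K/4$, and the example values of $C$, $K$, and $T$ used earlier), confirming that the transposed convolution restores the original waveform length.

```python
# Numerical check of L_out = (L_in - 1) * S - 2P + K for the Deconv-1D layer (assumed settings).
import torch
import torch.nn as nn

C, K, T = 256, 64, 500                                # channels, kernel size, time steps from the conv stage
deconv = nn.ConvTranspose1d(C, 1, kernel_size=K, stride=K // 2, padding=K // 4)
masked_feature_map = torch.randn(1, C, T)
y_hat = deconv(masked_feature_map)
print(y_hat.shape)                                    # torch.Size([1, 1, 16000]); (T-1)*32 - 32 + 64 = 16000
```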
The waveform error $E_{wav}$ is used as the time-domain loss function. For an output time-domain signal and the corresponding target signal with $N$ samples, $E_{wav}$ is defined as:
$$E_{wav} = \frac{1}{N} \sum_{n=1}^{N} \left| y(n) - \hat{y}(n) \right|$$
where $E_{wav}$ is the waveform error (loss function), $N$ is the number of samples of the target speech, $y(n)$ is the target (clean) speech, and $\hat{y}(n)$ is the estimated (enhanced) output speech. We investigated each of the three RNNs separately and created E2E SE models. To reduce the computational cost of the deep models while preserving their noise-suppression efficacy, the RNNs are incorporated to capture temporal correlations. The internal structures of the three RNN variants (LSTM, GRU, and SRU) are illustrated in
Figure 4. The three E2E SE models are denoted as E2E-BLSTM-CRN, E2E-BGRU-CRN, and E2E-BSRU-CRN, respectively.
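A minimal sketch of this l1 waveform loss is given below; the variable names y (target) and y_hat (estimate) are illustrative.

```python
# l1 (mean absolute error) waveform loss E_wav between the estimated and target waveforms.
import torch

def waveform_l1_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Average absolute sample-wise error between the estimated and target waveforms."""
    return torch.mean(torch.abs(y - y_hat))

# Equivalent built-in: torch.nn.functional.l1_loss(y_hat, y)
```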
The attention mechanism in the residual connections of the proposed model is composed of three components: the query $Q$, key $K$, and value $V$. The correlation scores between the rows of $Q$ and all the rows of $K$ are first calculated as:
$$W = QK^{T}$$
where $K^{T}$ is the transpose of $K$. The correlation scores are then converted to probabilities using the Softmax operator:
$$A = \mathrm{Softmax}(W)$$
Finally, the rows of $V$ are linearly combined using the weights in $\mathrm{Softmax}(W)$ to obtain the attention output. The attention mechanism is termed self-attention if $Q$ and $K$ are computed from the same sequence.
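A short sketch of this attention computation follows; the tensor shapes and batch dimension are assumptions made for illustration.

```python
# Attention as described above: W = Q K^T, softmax to probabilities, weighted combination of V.
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # Q, K, V: (batch, time, d) -- shapes assumed for the example
    W = Q @ K.transpose(-2, -1)          # correlation scores between the rows of Q and K
    A = torch.softmax(W, dim=-1)         # Softmax(W): scores converted to probabilities
    return A @ V                         # attention output: weighted combination of the rows of V

# When Q and K are computed from the same sequence, this reduces to self-attention.
```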
6. Conclusions
End-to-end deep learning models have attracted a lot of interest for improving degraded speech. To enhance the performance of E2E models, both the local and the sequential attributes of the speech signal should be modeled effectively. We have developed resource-efficient, compact, and noise-resistant neural models for waveform-based end-to-end speech enhancement. By fusing a Convolutional Encoder-Decoder (CED) and Recurrent Neural Networks (RNNs) in the Convolutional Recurrent Network (CRN) architecture, we developed three distinct speech enhancement systems based on LSTM, GRU, and SRU. The experiments show that the proposed models lead to improved quality and intelligibility with fewer trainable parameters, notably reduced model complexity, and lower inference time than existing recurrent and convolutional models. The E2E-BLSTM-CRN increased STOI and PESQ over babble-noisy speech by 23.37% and 36.02%, respectively. Important improvements were also observed for E2E-BGRU-CRN over noisy speech in exhibition-hall noise, with STOI and PESQ improving by 27.9% and 35.15%, respectively. The findings also show that the suggested E2E models outperformed the LSTM, DNN, CNN, FCNN, CNN-GRU, and GAN models in terms of speech intelligibility and quality, and that the enhanced speech generated by the proposed E2E SE models contains less distortion and residual noise. It is also concluded that the ASR system performed better with utterances processed by the E2E-CRN models than with those processed by the other SE models. In the forward/backward propagation passes during training, E2E-BSRU-CRN outperforms Wave-U-Net, E2E-BLSTM-CRN, and E2E-BGRU-CRN while using fewer parameters than Wave-U-Net and the other two E2E-CRNs. Finally, the cross-corpus findings show that the proposed and other deep models trained with the VoiceBank dataset outperformed their LibriSpeech- and TIMIT-trained counterparts.
Phase is an important aspect of modern speech enhancement systems since it plays a significant role in improving speech quality, whereas this paper emphasizes magnitude enhancement. In future work, we will include phase estimation [57] and incorporate it into the proposed SE model. Moreover, more robust loss functions will be investigated for better results.