Evaluation of Mixed Deep Neural Networks for Reverberant Speech Enhancement

Speech signals are degraded in real-life environments by background noise and other factors. The processing of such signals for voice recognition and voice analysis systems presents important challenges. One of the adverse conditions that is most difficult for those systems to handle is reverberation, produced by sound wave reflections that travel from the source to the microphone in multiple directions. To enhance signals in such adverse conditions, several deep learning-based methods have been proposed and proven to be effective. Recently, recurrent neural networks, especially those with long short-term memory (LSTM), have presented surprising results in tasks related to time-dependent processing of signals, such as speech. One of the most challenging aspects of LSTM networks is the high computational cost of the training procedure, which has limited extended experimentation in several cases. In this work, we evaluate hybrid neural network models that learn different reverberation conditions without any prior information. The results show that some combinations of LSTM and perceptron layers produce good results in comparison to those from pure LSTM networks, given a fixed number of layers. The evaluation was based on quality measurements of the signal’s spectrum, the training time of the networks, and statistical validation of results. In total, 120 artificial neural networks of eight different types were trained and compared. The results support the view that hybrid networks represent an important solution for speech signal enhancement, given that the reduction in training time is on the order of 30%, in processes that can normally take several days or weeks, depending on the amount of data. This gain in efficiency comes without a significant drop in quality.


Introduction
In real environments, audio signals are affected by conditions such as additive noise, reverberation, and other distortions, caused by elements that produce sounds simultaneously or that act as obstacles in the signal path to the microphone. In the case of speech signals, the performance of communication devices and applications of speech technologies may be affected [1][2][3][4] by the presence of such conditions.
In recent decades, many algorithms have been developed to enhance degraded speech; these try to suppress or reduce distortions while preserving or improving the perceived quality of the signal [5]. Many recent algorithms are based on deep neural networks (DNN) [6][7][8][9]. The most common implementation approximates a mapping function from the features of the degraded speech to the corresponding features of clean speech.
The benefits of achieving this type of speech signal enhancement can be applied to signal processing in mobile phone applications, voice over Internet protocol, speech recognition systems, and devices for people with diminished hearing ability [10].
In addition to the classical perceptron model, created in the 1950s, new types of neural networks have been developed, e.g., recurrent neural networks (RNNs). One example of an RNN is the LSTM neural network. In previous efforts to enhance speech, spectrum-derived features, such as Mel-frequency cepstrum coefficients (MFCC), have been mapped successfully from noisy speech to clean speech [11,12].
The main benefit of LSTM, as well as of other types of RNNs, is a superior ability to model the time-dependent nature of speech signals. Among the drawbacks of LSTM is the high computational cost of its training procedure.
In this work, we extend previous experiments with LSTM by evaluating deep neural networks, with a fixed number of three hidden layers, that combine bidirectional LSTM (BLSTM) layers with simpler layers based on perceptrons.
Such deep neural network algorithms have outperformed classical signal processing methods on speech degraded at several signal-to-noise ratios (SNR) [12][13][14][15] and on reverberant speech [16][17][18]. Some recent work has explored the use of mixed neural networks to achieve better performance in different tasks, such as classifying the temporal stages of sleep, analyzing the real-time behavior of an online buyer, or suppressing noise in a MEMS gyroscope, in which good results were obtained for specific situations and configurations [19][20][21]. The combination of different types of neural networks has also been presented successfully in [22], in the form of ensemble models to predict diseases from images.
The wide variety of models applied in other fields, where regression, classification, and prediction are required, has also been analyzed [23,24], and shows the multiple possibilities and the wide field of experimentation that is possible with deep neural networks.
Our main focus is on reducing the training time of the networks without a significant reduction in the capacity of the network. To this end, we consider all the different combinations of BLSTM and perceptron layers for de-reverberation. Because a perceptron layer has fewer connections than a BLSTM layer of the same size, replacing layers is expected to accelerate the training process and make it more efficient.
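To illustrate the argument about fewer connections, the following minimal sketch counts the trainable weights of a single fully connected (perceptron) layer and of a single BLSTM layer. The layer sizes are hypothetical (39 inputs, matching the MFCC dimension used later, and 128 units, a value not stated in this work), and the LSTM count assumes the common formulation without peephole connections.

```python
def dense_params(n_in, n_units):
    """Weights plus biases of one fully connected (perceptron) layer."""
    return n_in * n_units + n_units


def lstm_params(n_in, n_units):
    """Weights of one LSTM layer without peephole connections.

    Four blocks (input gate, forget gate, output gate, cell candidate),
    each with input weights, recurrent weights, and a bias vector.
    """
    return 4 * (n_in * n_units + n_units * n_units + n_units)


def blstm_params(n_in, n_units):
    """A bidirectional layer holds one forward and one backward LSTM."""
    return 2 * lstm_params(n_in, n_units)


# Hypothetical sizes, chosen only for illustration.
N_IN, N_UNITS = 39, 128
print("perceptron layer:", dense_params(N_IN, N_UNITS))  # 5120
print("BLSTM layer:     ", blstm_params(N_IN, N_UNITS))  # 172032
```

Under these assumed sizes the BLSTM layer has roughly 34 times as many weights as the perceptron layer. The number of weights is only one factor in training time, since recurrent layers must also be unrolled over every time step.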
For this purpose, several objective measures were used to verify the results, which comparatively show the capacity of the three-layer BLSTM, and of its combinations with perceptron layers, to enhance reverberant speech. The rest of this document is organized as follows. Section 2 provides the background and context of the problem of enhancing reverberant speech, and Section 3 describes the BLSTM networks. Section 4 describes the experimental setup. Section 5 presents the results with a discussion. In Section 6, conclusions are presented.

Problem Statement
In real-world environments where speech signals are registered with microphones, the presence of reverberation is common. It is caused by the reflections of the audio signal on its path to the microphone.
This phenomenon is accentuated when the space is large and the surfaces favor the reflection of the signals. It can be assumed that the reverberated signal $x$ is a degraded version of the clean signal $s$. The relationship between both signals is described by [25]:

$$x(n) = \mathbf{h}^{T} * s(n)$$

where $\mathbf{h} = [h_1, h_2, \ldots, h_L]^{T}$ is the impulse response of the acoustic channel from the source to the microphone, $L$ is the number of coefficients in the discrete-time impulse response vector, $(\cdot)^{T}$ denotes the transpose of a vector, and $*$ is the convolution operation.
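The convolution model above can be sketched in a few lines. This is only an illustration of the discrete convolution of a clean signal with an impulse response; the function name and the toy values are hypothetical and are not taken from the database used in this work.

```python
def reverberate(s, h):
    """Full discrete convolution of a clean signal s with an impulse response h.

    Each clean sample s[n] contributes a scaled, delayed copy of h to the
    output, so the result has len(s) + len(h) - 1 samples.
    """
    x = [0.0] * (len(s) + len(h) - 1)
    for n, s_n in enumerate(s):
        for k, h_k in enumerate(h):
            x[n + k] += h_k * s_n
    return x


# Toy example: a single click followed by silence, and a decaying
# three-tap impulse response (direct path plus two weaker reflections).
clean = [1.0, 0.0, 0.0]
impulse_response = [1.0, 0.5, 0.25]
print(reverberate(clean, impulse_response))  # [1.0, 0.5, 0.25, 0.0, 0.0]
```

The click is smeared over three samples, which is the short echo-like tail that a de-reverberation system tries to remove.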
The degraded speech signal with reverberation is perceived as distant or as having a very short type of echo. This effect generally increases as the speaker's distance to the microphone increases.
Since this effect is not desired for proper recognition and analysis of the speech signal, new algorithms have been proposed to minimize it. Mainly, in the last few years, algorithms based on deep learning have stood out.
By implementing deep neural networks, an approximation to $s(n)$ can be estimated using a function $f(\cdot)$ between the data of the reverberated signal and the clean signal:

$$\hat{s}(n) = f(x(n))$$

The quality of the approximation performed by $f(\cdot)$ usually depends on the amount of data and the algorithm selected. For the present work, we take as a base case the estimation of $f(\cdot)$ made by bidirectional LSTM (BLSTM) networks with three hidden layers.
The main motivation in applying these deep neural networks is their recent success in speech enhancement related tasks, where they surpassed other algorithms applied to improve speech in noisy or reverberant conditions. Most of these studies note the high computational cost of training LSTM and BLSTM networks as a constraint on extended experimentation.
In this work, we propose a comparison and statistical validation of results with mixed networks, which combine BLSTM layers and perceptron layers.

Autoencoders of BLSTM Networks
Since the appearance of RNNs, new alternatives have become available for modeling sequential dependencies in applications where the temporal nature of the parameters is relevant. These types of neural networks are capable of storing information through feedback connections between neurons in their hidden layers, either with themselves or with other neurons in the same layer [26,27].
With the purpose of expanding the capabilities of RNNs by storing information in the short and long term, the LSTM networks presented in [28] introduce a set of gates into the memory cells that control the access, storage, and propagation of values across the network. The results obtained when using LSTM networks in areas that depend on previous states of information, as is the case with voice recognition, musical composition, and handwriting synthesis, were encouraging [28][29][30].
In addition to the recurrent connections between the internal units, each unit in the network has additional gates for storing values: one for input, one for memory clearing, one for output, and one for activating memory. In this way, it is possible to store values for many steps or have them available at any time [28].
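The gates are implemented using the following equations:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the sigmoid activation function, $i$ is the input gate, $f$ is the memory erase (forget) gate, and $o$ is the output gate. $c$ is the activation of memory. $W_{mn}$ is the matrix that contains the values of the connections between each unit and the gates, $b$ denotes the corresponding bias terms, and $x_t$ is the input at time $t$. $h$ is the output of the LSTM memory unit. Additional details about the training process and the implications of this implementation can be found in [31].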
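As an illustration of how the gates interact, the following minimal sketch computes one time step of an LSTM layer in a common formulation with peephole connections (input gate, forget gate, cell update, output gate, in that order). It is not the implementation used in this work: the parameter names in the dictionary `p` are hypothetical, matrices are plain lists of rows, and the peephole weights are assumed to act element-wise.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b, with W and U given as lists of rows."""
    return [
        sum(w * x_k for w, x_k in zip(W_row, x))
        + sum(u * h_k for u, h_k in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One time step of an LSTM layer with element-wise peephole weights.

    x: input vector, h_prev / c_prev: previous output and memory vectors,
    p: dict of parameters (hypothetical names, e.g. p["W_xi"], p["w_ci"]).
    Returns the new output h and the new memory c.
    """
    a_i = affine(p["W_xi"], x, p["W_hi"], h_prev, p["b_i"])
    a_f = affine(p["W_xf"], x, p["W_hf"], h_prev, p["b_f"])
    a_c = affine(p["W_xc"], x, p["W_hc"], h_prev, p["b_c"])
    a_o = affine(p["W_xo"], x, p["W_ho"], h_prev, p["b_o"])

    # Input and forget gates look at the previous memory through peepholes.
    i = [sigmoid(a + w * c) for a, w, c in zip(a_i, p["w_ci"], c_prev)]
    f = [sigmoid(a + w * c) for a, w, c in zip(a_f, p["w_cf"], c_prev)]

    # New memory: keep part of the old value, write part of the candidate.
    c = [f_j * c_j + i_j * math.tanh(a)
         for f_j, c_j, i_j, a in zip(f, c_prev, i, a_c)]

    # The output gate looks at the updated memory.
    o = [sigmoid(a + w * c_j) for a, w, c_j in zip(a_o, p["w_co"], c)]
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c
```

With all parameters set to zero, every gate equals 0.5, so the memory is halved at each step. This makes the role of the forget gate easy to check by hand.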
An additional extension of LSTM networks that has shown a greater advantage in tasks related to temporal parameter dependence is the BLSTM. Here, the configuration of the network processes the sequence in both directions: one hidden layer reads the input forward in time and another reads it backward. In this work, these units are used to make comparisons. The structure of a simple bidirectional network with input i, output o, and two hidden layers (h_f and h_b) is shown in Figure 1. LSTM networks can handle information over long periods; bidirectional LSTM (BLSTM) networks, with two hidden layers connected to the same output layer, additionally have access to information in both directions. This allows bidirectional networks to take advantage of not just the past but also the future context [32].
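The idea of past and future context can be sketched independently of the LSTM equations. In the hypothetical helper below, `step` stands for any recurrent update (an LSTM step in the networks of this work, a simple running sum in the toy example), and each time step receives one state computed from the past and one computed from the future.

```python
def run_bidirectional(step, xs, h0):
    """Run a recurrent update over a sequence in both directions.

    step(h, x) returns the next hidden state. The result pairs, for every
    time step, the forward state (past context) with the backward state
    (future context).
    """
    forward, h = [], h0
    for x in xs:
        h = step(h, x)
        forward.append(h)

    backward, h = [], h0
    for x in reversed(xs):
        h = step(h, x)
        backward.append(h)
    backward.reverse()  # align backward states with the original time order

    return list(zip(forward, backward))


# Toy recurrent update: a running sum makes the two contexts visible.
states = run_bidirectional(lambda h, x: h + x, [1, 2, 3], 0)
print(states)  # [(1, 6), (3, 5), (6, 3)]
```

At the first time step the forward state has seen only the first input, while the backward state already summarizes the whole sequence.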
One of the main architectures applied to regression tasks (including speech enhancement) using deep neural networks is the autoencoder, which has been successful in various tasks related to speech [33]. This architecture consists of an encoder that transforms an input vector s into a representation h in the hidden layers through a mapping f. It also has a decoder that takes the hidden representation and transforms it back into a vector in the input space.
During training, the features of the distorted signal (with noise or reverberation) are used as inputs for the denoising autoencoders, while the features of the clean speech are presented as outputs.
To learn the complex relationships between these sets of features, the training algorithm adjusts the parameters of the network. Currently, computers and algorithms have the ability to process large datasets, as well as networks with several hidden layers.

Experimental Setup
To test our proposed mixed LSTM/perceptron neural networks for enhancing reverberated speech, the experiment can be summarized in the following steps:
1. Selection of conditions: Given the large number of impulse responses contemplated in the database, we randomly chose five reverberated speech conditions. Each of the conditions has the corresponding clean version in the database.
2. Extraction of features and input-output correspondence: A set of parameters was extracted from the reverberated and clean audio files. Those of the reverberated files were used as inputs to the networks, while the corresponding clean features were the outputs.
3. Training: During training, the weights of the networks were adjusted as the reverberated and clean parameters were presented to the network. As usual in recurrent neural networks, the internal weights were updated using the back-propagation through time algorithm. In total, 210 utterances were used for each condition (approximately 70% of the total database) to train each case. The details and equations of the algorithm followed can be found in [34].
4. Validation: After each training step, the sum of the squared errors within the validation set of approximately 20% of the utterances was calculated, and the weights of the network were updated at each improvement.
5. Test: A subset of 50 utterances, selected at random (about 10% of the total number of utterances in the database), was chosen for the test set, for each condition. These utterances were not part of the training process, to provide independence between training and testing.
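The data partition and the validation rule in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the utterance identifiers and the 100-item total are hypothetical, the split uses the approximate 70/20/10 proportions, and "updating the weights at each improvement" is interpreted here as keeping the weights from the epochs where the validation error reaches a new minimum.

```python
import random


def split_utterances(ids, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle utterance ids and split them into train, validation, test."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def sse(targets, outputs):
    """Sum of squared errors between known outputs and network outputs."""
    return sum((c - c_hat) ** 2 for c, c_hat in zip(targets, outputs))


def epochs_with_improvement(validation_errors):
    """Return the epochs at which the validation error reached a new minimum.

    Under the interpretation stated above, these are the epochs whose
    weights would be stored.
    """
    best = float("inf")
    improved = []
    for epoch, error in enumerate(validation_errors):
        if error < best:
            best = error
            improved.append(epoch)
    return improved


train, val, test = split_utterances(range(100))
print(len(train), len(val), len(test))                # 70 20 10
print(epochs_with_improvement([5.0, 4.0, 4.5, 3.0]))  # [0, 1, 3]
```

Because the three subsets come from one shuffled list, no utterance in the test set is seen during training or validation.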
In the following subsections, more details of the experimental procedure are provided.

Database
We used the Reverberant Voice Database created at the University of Edinburgh [35], which was designed to train and evaluate speech de-reverberation methods. The reverberated speech of the database was produced by convolving the recordings of 56 native English speakers with several impulse responses in various university halls. For this work, we randomly chose the following conditions: ACE Building Lobby 1, Artificial Room 1, MARDY Room 2, ACE Lecture Room 1, and ACE Meeting Room 2.

Feature Extraction
The pairs of WAV files corresponding to clean and reverberated speech were processed using the Ahocoder [36] software to obtain the speech parameters. These were extracted with a frame size of 160 samples and a frame shift of 80 samples. For each frame of speech, we extracted the spectrum parameters (39 MFCC), the fundamental frequency (f0), and the energy.
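The framing described above (160-sample frames advanced by 80 samples, so consecutive frames overlap by half) can be sketched as follows. The function is a hypothetical illustration and not part of Ahocoder; in particular, dropping an incomplete final frame is an assumption, since the handling of signal edges is not described here.

```python
def frame_signal(samples, frame_size=160, frame_shift=80):
    """Split a signal into overlapping frames.

    Frames start every frame_shift samples. A trailing segment shorter
    than frame_size is dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_size <= len(samples):
        frames.append(samples[start:start + frame_size])
        start += frame_shift
    return frames


signal = list(range(400))           # stand-in for 400 audio samples
frames = frame_signal(signal)
print(len(frames))                  # 4
print([f[0] for f in frames])       # [0, 80, 160, 240]
```

Each of these frames would then yield one feature vector (39 MFCC, f0, and energy) in the extraction step.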
For this work, neural networks were applied to enhance the 39 MFCC coefficients, while the rest of the parameters remained unchanged. During training, the parameters of the reverberated speech were presented as the inputs of the network, while the corresponding parameters of the clean speech were the outputs.
For the test set, the MFCC parameters of the reverberated speech were substituted with the enhanced version, and the evaluation measure was applied to the reconstructed WAV file, also produced with the Ahocoder system.
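The substitution step can be sketched as follows, assuming (hypothetically) that each frame is stored as a flat list with the 39 MFCC first, followed by f0 and the energy. The `enhance_sequence` argument stands for a trained network applied to the whole MFCC sequence of an utterance; here it is replaced by a placeholder function.

```python
N_MFCC = 39  # number of spectral coefficients enhanced by the network


def substitute_mfcc(frames, enhance_sequence):
    """Replace the MFCC of every frame and keep f0 and energy untouched.

    frames: list of per-frame feature lists, laid out as
            [39 MFCC values, f0, energy] (an assumed layout).
    enhance_sequence: function mapping a list of MFCC vectors to a list of
            enhanced MFCC vectors of the same length.
    """
    mfcc_sequence = [frame[:N_MFCC] for frame in frames]
    enhanced = enhance_sequence(mfcc_sequence)
    return [list(mfcc) + list(frame[N_MFCC:])
            for mfcc, frame in zip(enhanced, frames)]


# Placeholder "network" that halves every coefficient.
def halve(sequence):
    return [[0.5 * v for v in mfcc] for mfcc in sequence]


utterance = [[1.0] * N_MFCC + [120.0, 0.8] for _ in range(3)]
result = substitute_mfcc(utterance, halve)
print(result[0][:2], result[0][N_MFCC:])  # [0.5, 0.5] [120.0, 0.8]
```

The resulting frames keep the original excitation parameters, so any change measured after resynthesis comes from the spectral parameters alone.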

Evaluation
For the evaluation of the results, the following objective measures were applied:
• Perceptual evaluation of speech quality (PESQ): This measure uses a model to predict the subjective quality of speech, as defined in ITU-T Recommendation P.862. The results are in the range [-0.5, 4.5], where 4.5 corresponds to a perfectly enhanced signal. PESQ is calculated as [37]:

$$\mathrm{PESQ} = a_0 + a_1 D_{ind} + a_2 A_{ind}$$

where $D_{ind}$ is the average disturbance and $A_{ind}$ is the asymmetric disturbance. The $a_k$ were chosen to optimize PESQ in the measurement of general speech quality.
• Sum of squared errors (sse): This is the most common metric for the validation set error during the training process of a neural network. It is defined as:

$$\mathrm{sse} = \sum_{x} (c_x - \hat{c}_x)^2$$

where $c_x$ is the known value of the outputs and $\hat{c}_x$ is the approximation made by the network.
Additionally, Friedman's statistical test was used to determine the statistical significance of the results in the test sets. Figure 2 shows the procedure followed for the comparison between the different architectures tested in this work. To analyze all the architectures that can be formed with a mixture of BLSTM layers and MLP layers, eight different neural networks were tested for each reverberation condition:
• BLSTM-BLSTM-BLSTM
• BLSTM-BLSTM-MLP
• BLSTM-MLP-BLSTM
• BLSTM-MLP-MLP
• MLP-BLSTM-BLSTM
• MLP-BLSTM-MLP
• MLP-MLP-BLSTM
• MLP-MLP-MLP
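The eight architectures follow from choosing one of two layer types for each of the three hidden layers. A short sketch of this enumeration, with hypothetical names, also checks the total number of trained networks against the figures given in this paper (five conditions and three repetitions of each training).

```python
from itertools import product

LAYER_TYPES = ("BLSTM", "MLP")


def hidden_layer_configurations(n_layers=3):
    """All orderings of BLSTM and MLP layers over n_layers hidden layers."""
    return ["-".join(combo) for combo in product(LAYER_TYPES, repeat=n_layers)]


configurations = hidden_layer_configurations()
for name in configurations:
    print(name)

# 8 configurations x 5 reverberation conditions x 3 repetitions
n_conditions, n_repetitions = 5, 3
print(len(configurations) * n_conditions * n_repetitions)  # 120
```

The first entry, BLSTM-BLSTM-BLSTM, is the base system, and the last, MLP-MLP-MLP, contains no recurrent layer at all.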

Experiments
The metrics were applied to each of these configurations, which constitute all the possible combinations of BLSTM and MLP layers in three hidden layers. Table 1 shows the training results for all networks and all possible combinations of three hidden layers. The training of each set was repeated three times, and the average values are reported. Consistent with previously reported results, the network with only BLSTM layers provides the best results for most of the reverberation conditions. For the five reverberation conditions considered in this paper, the network that stands out as a competitive alternative to the three-layer BLSTM network is the MLP-BLSTM-BLSTM configuration. In addition to achieving the best result among all the architectures in two cases (under the conditions "Lecture Room" and "Meeting Room"), its training time per epoch is almost 30% lower than that of the BLSTM network. This is one of the main indicators sought in this work. Table 1 also shows that the training times of the configurations consisting of two BLSTM layers and one MLP layer are similar to those of the configurations consisting of only one BLSTM layer and two MLP layers. The MLP-MLP-MLP networks, despite having very low training times per epoch, as expected, do not present competitive results in comparison to the others.

Results and Discussion
In addition to the verification of the training efficiency of the networks, Table 2 shows the results in terms of the PESQ quality metric. This is of the utmost importance, since the problem being addressed is the de-reverberation of speech signals. Thus, improvements in efficiency and sse values must also be checked against the quality of the signal achieved. In the last table, the differences with respect to the BLSTM-BLSTM-BLSTM base system are presented, in terms of statistical significance, according to the Friedman test.
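For readers unfamiliar with Friedman's test, the following minimal sketch computes its chi-square statistic from rank sums. It assumes, hypothetically, that each row holds the scores of the compared systems on one test utterance; the toy PESQ values are invented for illustration, and no tie correction is applied. In practice a library routine such as `scipy.stats.friedmanchisquare` would also provide the p-value.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # mean of the 1-based positions
        for j in range(pos, end + 1):
            ranks[order[j]] = shared_rank
        pos = end + 1
    return ranks


def friedman_statistic(rows):
    """Friedman chi-square statistic (no tie correction).

    rows: one list of scores per block (e.g., per utterance), each with
          one score per system being compared.
    """
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Invented scores: three systems on four utterances, always in the same order.
toy_scores = [[2.1, 2.4, 2.9], [2.0, 2.5, 3.0], [1.9, 2.2, 2.8], [2.3, 2.6, 3.1]]
print(friedman_statistic(toy_scores))  # 8.0
```

A statistic near zero means the systems are ranked interchangeably across utterances, while large values indicate a consistent ordering; the statistic is compared against a chi-square distribution with k - 1 degrees of freedom.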
In each of the five reverberation conditions, the results of these tests can be summarized as follows:
• MARDY, Lecture Room, and Artificial Room: Only two of the mixed configurations present results that do not differ significantly from the base system. These mixed networks are BLSTM-BLSTM-MLP and MLP-BLSTM-BLSTM.
• ACE Building: In this case, three combinations of hidden layers present results that do not differ significantly from the base case.
• Meeting Room: This is a particular case, because the combination BLSTM-BLSTM-MLP is the one that presents the best result, although the improvement is not significant compared to the base system. On the other hand, MLP-BLSTM-BLSTM, BLSTM-MLP-BLSTM, and MLP-BLSTM-MLP present results that do not differ significantly from the base system.
Figure 3 shows the spectrograms corresponding to clean speech, as well as those corresponding to speech with reverberation and to two of the proposed configurations: one based solely on BLSTM layers and the mixed network that obtained better results (MLP-BLSTM-BLSTM). One can appreciate the improvements introduced by the neural networks and the visual similarity between the spectrogram of the mixed network and that of the base system.
Considering the previous efficiency results and how these are reflected in the PESQ metric, it is emphasized that there are combinations of mixed networks, especially MLP-BLSTM-BLSTM, which reduce training times considerably without significantly sacrificing the quality of the results in the de-reverberation of the signals. To increase efficiency further in future experiments, some processes can be parallelized and the proposal repeated in networks of greater depth.

Conclusions
In this work, the use of mixed neural networks, consisting of combinations of perceptron layers with BLSTM layers, was proposed as an alternative for reducing the training time of purely BLSTM networks. Training time has represented a limitation for extensive experimentation with this type of artificial neural network in different applications, including some related to the improvement of speech signals.
One of the eight possible combinations of layers presented competitive results in terms of the training metrics, and results that did not differ significantly from the purely BLSTM case in terms of the PESQ of the signals. The significance was determined with Friedman's statistical test. The reduction in training time is on the order of 30%, in processes that can normally take hours or days, depending on the amount of data.
The results presented here open the possibility for simplifying some neural network configurations to be able to perform extensive experimentation in different applications where it is required to map parameters with similar characteristics, as in the case of autoencoders.