Speech Enhancement for Hearing Impaired Based on Bandpass Filters and a Compound Deep Denoising Autoencoder

: Deep neural networks have been applied efficiently to speech enhancement. However, for large variations of speech patterns and noisy environments, an individual neural network with a fixed number of hidden layers suffers from strong interference, which can lead to a slow learning process, poor generalisation to new inputs with unknown signal-to-noise ratios, and residual noise in the enhanced output. In this paper, we present a new approach for the hearing impaired based on combining two stages: (1) a set of bandpass filters that split the signal into eight separate bands, each performing a frequency analysis of the speech signal; (2) multiple deep denoising autoencoder networks, each working on a small specific enhancement task and learning to handle a subset of the whole training set. To evaluate the performance of the approach, the hearing-aid speech perception index, the hearing aid sound quality index, and the perceptual evaluation of speech quality were used. Improvements in speech quality and intelligibility were evaluated using seven sensorineural hearing loss audiograms. We compared the performance of the proposed approach with that of individual denoising autoencoder networks with three and five hidden layers. The experimental results showed that the proposed approach yielded higher quality and was more intelligible than the three- and five-layer networks.


Introduction
Speech is a fundamental means of human communication. In most noisy situations, the speech signal is mixed with other signals transmitting energy at the same time, which can be noise or even different speech signals. Consequently, it is important to improve the speech quality and intelligibility of degraded speech. Speech enhancement (SE) techniques have been applied to advanced telecommunication systems, e.g., mobile communication, speech recognition, and hearing aid devices. The main aim of SE algorithms is to improve some perceptual aspects of speech that are corrupted by additive background noise. For hearing aids (HA), SE algorithms are used to clean the noisy signal before amplification by reducing the background noise, since hearing-impaired users experience extreme difficulty communicating in environments with varying levels and types of noise (caused by the loss of temporal and spectral resolution in the auditory system of the impaired ear) [1]. In many scenarios, reducing the background noise introduces speech distortion, which reduces speech intelligibility in noisy environments [2]. Since quality reflects the individual preferences of listeners, it is a subjective performance evaluation metric. Intelligibility is an objective measure, since it gives the percentage of words that can be correctly identified by listeners. Based on these two criteria, the considerable challenge in designing an effective SE algorithm for hearing aids is to boost the overall speech quality and increase intelligibility by suppressing noise without introducing any perceptible distortion in the signal.
In the last few years, many theories and approaches have been developed for SE [3]. Spectral subtraction (SS) algorithms were proposed in the correlation domain by Weiss et al. [4], and L. Chen et al. [5] proposed a spectral subtraction approach for real-time speech enhancement in modern hearing aids. The approach is based on a voice activity detector (VAD), which estimates the noise spectrum during speech pauses (silence) and subtracts it from the noisy speech to estimate the clean speech [4,5]. SS approaches frequently produce a new type of noise occurring at random frequency locations in each frame. This type of noise is referred to as musical noise, and it is sometimes more disturbing than the original distortion, not only for the human ear but also for SE systems. Harbach [6] used the Wiener filter method based on a priori SNR estimation to improve speech quality using a directional microphone. However, the all-pole spectrum of the speech signal might have unnaturally sharp peaks, which in turn can result in a significant decline in speech quality. The Minimum Mean Square Error (MMSE) approach based on log-magnitude was suggested in [7,8]. The approach finds the coefficients by minimising the mean square error (MSE) of the log-magnitude spectra. The test results of this approach showed lower levels of residual noise. Meanwhile, deep neural network (DNN) [9] approaches have offered great promise and attracted attention in addressing the challenges of SE [10]. For example, L. Ding et al. [11] used a DNN model for speech denoising. The model predicts clean speech spectra when presented with noisy speech inputs and does not necessitate RBM pre-training or complex recurrent structures. X. Lu et al. [12] presented a regression model of the denoising autoencoder (DAE) to improve the quality of speech. The model maps a noisy input to a clean signal based on the log-power spectra (LPS) feature.
The study used different types of noise in the training stage to achieve an excellent ability to generalise to unseen noise environments. The results of objective evaluations showed that the approach performs better than conventional SE approaches. S. Meng et al. [13] presented a separate deep autoencoder (SDAE) approach, which estimates the noise and clean spectra by minimising the total reconstruction error [14] of the noisy speech spectrum while adjusting the estimated clean speech signal. Lai et al. [15,16] suggested using a deep denoising autoencoder (DDAE) model in cochlear implant (CI) simulations to improve the intelligibility of vocoded speech. The method simulates speech signal processing in existing CI devices and the actual CI receiver. The study evaluated the results for different clean signals and noises in the training and testing phases, and the LPS feature was used for noise classification.
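As a baseline reference, the spectral subtraction scheme discussed above can be sketched in a few lines of NumPy. The frame length, hop size, noise-frame count, and spectral floor below are illustrative choices, not values from [4,5]; the noise spectrum is estimated from the first few frames, standing in for the frames a VAD would mark as silence.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, hop=128, noise_frames=5, floor=0.002):
    """Basic magnitude spectral subtraction (illustrative sketch).

    The noise magnitude spectrum is estimated from the first few frames
    (assumed speech-free, as a VAD would decide) and subtracted from each
    frame's magnitude; the noisy phase is reused for resynthesis.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    noise_mag = mag[:noise_frames].mean(axis=0)           # noise estimate
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # spectral floor avoids musical-noise "holes" going negative

    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):                             # overlap-add synthesis
        out[i * hop:i * hop + frame_len] += clean_frames[i]
    return out
```

Even with the floor, residual peaks of the subtracted spectrum surface as the musical noise the text describes.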
Previous studies have proven that DNNs suppress noise from noisy corrupted speech efficiently [17]. However, the experimental results of these approaches showed that, for large variations of speech patterns and noisy environments, an individual DNN with a fixed number of hidden layers causes strong interference effects that lead to (1) a slow learning process; (2) poor generalisation to new inputs at unknown signal-to-noise ratios (SNRs) [18]; (3) residual noise in the enhanced output, because the DNN converts the speech frame by frame, even when context features are used as input to the deep denoising autoencoder; and (4) poor generalisation performance in real-world environments. In this work, we propose a new method for amplification in hearing aids based on combining two stages: (1) a set of bandpass filters, each of which performs a frequency analysis of the speech signal based on the healthy human cochlea; (2) multiple DDAE networks, each of which works on a specific enhancement task and learns to handle a subset of the whole training set. The rest of this work begins in Section 2, where we describe hearing loss and speech perception. Section 3 describes the details of the proposed method. Section 4 presents our experimental setup and evaluation. Finally, Section 5 discusses the experimental results of the work.

Speech Perception and Hearing Loss
Speech perception refers to the ability to hear human speech, to interpret it, and to understand it (the human speech frequency range is roughly 1-8000 Hz). The speech signal from a single source is often mixed with other unwanted signals. The human ear can distinguish between 7000 different frequencies and enables the brain to locate sound sources. However, over 500 million people experience hearing loss (HL) [2], and 90% of hearing loss is what is known as sensorineural hearing loss (SNHL), which is caused by dysfunctions in the cochlea (the inner ear). There are tiny fine hairs in the cochlea responsible for sound transmission. The outer hair cells react to high frequencies, and the inner hair cells react to low frequencies. Together, they result in a smooth perception of the full sound range and a good separation of similar sounds. Typically, the high frequencies are affected by hearing loss first, as the respective hair cells are located at the entry of the cochlea, where every sound wave passes by (Figure 1). This usually results in difficulty in hearing and understanding high-frequency sounds such as "s" and "th", which in normal hearing distinguish words such as pass and path [1,19]. In the affected areas, the hair cells are no longer stimulated effectively, resulting in not enough impulses being transmitted to the brain for recognition. An HA helps by maximising a person's remaining hearing ability, increasing the volume of speech with minimum distortion [16]. However, treating all noise environments the same would result in unsatisfying performance for different hearing impairments [1]. People with SNHL have a poorer ability to hear high-frequency components than low-frequency components; consequently, adding a noise frequency classification stage based on the human cochlea is of great importance when designing an SE approach for HA.

Architecture of the Proposed System
This section presents the details of the proposed two-stage approach, which we call HC-DDAEs. The architecture of the proposed method is presented in Figure 2.


Bandpass Filter
A bandpass filter (BPF) passes only a certain range of frequencies without attenuation. The band of frequencies passed by the filter is known as the passband. During this stage (Figure 3), the input signal y is first pre-emphasised and then passed through a set of eight BPFs. The pre-emphasis is performed by finite impulse response (FIR) filters, which emphasise signals in eight passbands of 40 dB, and the forward and reverse coefficients (b_p, a_p) are given by the following, where F_s denotes the sampling frequency.


Each of these passbands passes only a specific frequency band [f_1(i), f_2(i)] of the entire input range [f_low, f_high] (Table 1), where f_1(i) and f_2(i) denote the lower and upper cutoff frequencies for the i-th BPF (Figure 4), and f_low and f_high are the lowest and highest frequencies of the n passbands, which are specified as follows: where C_b is the channel bandwidth, given by the following.
The gain in each filter can be set individually to any value between zero and one, where one corresponds to the gain provided by the original hearing aid. The gain adjustments are made on a computer keyboard and are shown graphically on the computer's display.
The n channels in the output of the filter bank are then added together to produce the synthesised speech waveform using the periodicity detection method as follows: where x_low and x_high are the two channels, channel one (0-4 kHz) and channel two (4-8 kHz), respectively.
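The pre-emphasis and filter-bank stage can be sketched with SciPy's Butterworth designs. The cutoff frequencies, filter order, and the first-order pre-emphasis coefficient below are illustrative assumptions rather than the paper's exact FIR parameters; the band edges follow the equal-bandwidth scheme described above, with C_b = (f_high - f_low) / n_bands.

```python
import numpy as np
from scipy.signal import butter, sosfilt, lfilter

def filter_bank(y, fs=16000, n_bands=8, f_low=100.0, f_high=7900.0, order=4):
    """Pre-emphasise, then split the signal into n equal-width bands.

    Band i spans [f_low + i*C_b, f_low + (i+1)*C_b] with channel bandwidth
    C_b = (f_high - f_low) / n_bands. Cutoffs, order, and the pre-emphasis
    coefficient are illustrative, not the paper's exact values.
    """
    y = lfilter([1.0, -0.97], [1.0], y)   # first-order FIR pre-emphasis
    c_b = (f_high - f_low) / n_bands
    bands = []
    for i in range(n_bands):
        f1, f2 = f_low + i * c_b, f_low + (i + 1) * c_b
        sos = butter(order, [f1, f2], btype="bandpass", fs=fs, output="sos")
        bands.append(sosfilt(sos, y))     # channel i of the analysis stage
    return np.stack(bands)                # shape (n_bands, len(y))
```

Per-band gains in [0, 1] can then be applied to each row before the channels are summed back into the synthesised waveform.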

Compound DDAEs (C-DDAEs)
The output of the first stage passes through the compound approach, which utilises multiple networks. Each network is a multi-layer DDAE (three hidden layers per DDAE) with a different number of hidden units, as follows:
1. DDAE-1: 128 units for each layer, F_DDAE1^(128x3). The 513-dimensional magnitude spectrum works as both the input and the target.
2. DDAE-2: 512 units for each layer, F_DDAE2^(512x3). Three frames of spectra are used, where the target is the single spectrum |s_t|.
3. DDAE-3: 1024 units for each layer, F_DDAE3^(1024x3). Five frames of spectra are used.
Each DDAE is specific to one enhancement task rather than one individual network being used for the general enhancement task. The total output of every network is the central frame of the model (Figure 5). This stage includes two phases, namely training and testing. In the training phase, the training set is divided into subsets, and each DDAE network has a specific enhancement task to learn: a subset of the whole training set. Consider the statistical problem where the input speech signal is given by the following:

y(t) = x(t) + n(t)

where y(t) and x(t) are the noisy and clean versions of the signal, respectively, and n(t) is unwanted noise at the t-th time index (t = 0, 1, ..., T - 1), which is assumed to be zero-mean random white noise or coloured noise that is uncorrelated with x(t). Firstly, the input vector y passes through feature extraction to produce the featured version y ∈ R^K. Then, training pairs (y, x) are prepared by calculating the magnitude ratio x = |x|/|x + n|. Next, a function F_DDAEi is mapped to produce an output vector X ∈ R^D that recovers the clean original signal x(t):

X = f(Wy + b)

where W ∈ R^(n×m) is a weight matrix, b ∈ R^m is a bias vector, and f(·) is a nonlinear function. If the output of DDAE_i is incorrect, the weights of the model are updated, and more tasks are assigned to the next DDAE network (Figure 5). The feedforward procedure computes the hidden representation h(y) ∈ R^m as follows:

h(y) = σ(Wy + b)

where σ is the logistic sigmoid function and h is a hidden layer. X̂_En is the vector containing the logarithmic amplitudes of the enhanced speech corresponding to the noisy counterpart Y_En. The error of the network is the expected value of the squared difference between the target and the actual output vector for each DDAE, which requires one to pass the remainder of the output vector to the next DDAE; p_i is the proportional contribution of DDAE_i to the combined output vector, and d_c is the desired output vector in case c.
If the error of a DDAE network is less than the weighted average of the errors of the C-DDAEs, its responsibility for that case is increased, and vice versa. To take into account how well each DDAE does in comparison with the other DDAEs, an error function is defined and compared with the output of each DDAE network. To minimise the sum of errors between the target and the estimated vector over the L hidden layers, the network parameters at the l-th layer, W^(l) ∈ R^(K^(l+1)×(K^(l)+1)), participate in the feedforward procedure, where z_l ∈ R^(K^(l)) is the vector of K^(l) hidden unit outputs.
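The feedforward pass h(y) = σ(Wy + b) described above can be sketched in NumPy. The layer sizes follow DDAE-2 in the list above (513-dimensional spectra, 512 units per hidden layer); the class name, random initialisation, and linear output layer are illustrative assumptions, not trained parameters from the paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid used as the hidden-layer nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

class DDAE:
    """Minimal denoising-autoencoder forward pass: three sigmoid hidden
    layers followed by a linear reconstruction, i.e. h_l = sigma(W_l h_{l-1} + b_l).
    Weights are randomly initialised here purely for illustration."""

    def __init__(self, dim_in=513, hidden=512, n_hidden=3, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [dim_in] + [hidden] * n_hidden + [dim_in]
        self.W = [rng.normal(0.0, 0.01, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(m) for m in sizes[1:]]

    def forward(self, y):
        h = y
        for W, b in zip(self.W[:-1], self.b[:-1]):
            h = sigmoid(W @ h + b)          # hidden representation h(y)
        return self.W[-1] @ h + self.b[-1]  # reconstructed magnitude spectrum
```

In the compound scheme, one such network would be trained per subset of the training data, with the squared reconstruction error deciding how responsibility is shifted to the next DDAE.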

Experiments and Evaluation
This section describes our experimental setup: (A) the data used to train and test the proposed system, and (B) the comparison of spectrograms of noisy and enhanced signals.

Experimental Setup
We conducted our experiment using the CMU_ARCTIC databases from Carnegie Mellon University, which include 1186 clean speech utterances spoken by US native English-speaking male (Bdl) and female (Slt) speakers [20]. The database was recorded at 16 bits and 32 kHz in a soundproof room. In this work, we divided the database into (1) 75% of the entire dataset for the training set (about 890 speech signals); (2) 20% for the validation set (about 237 speech signals); and (3) the remaining 5% for the testing set (about 56 utterances). Gaussian white and pink noise were generated and added to the training set at four different SNR levels (0, 5, 10, and 15 dB). The SNR levels were selected carefully to cover a range of noise levels (from light to high) for each noise type. Additionally, train and babble noise were added to the testing set, neither of which was used in the training set. The overlapping frames had a 16 ms duration with a shift of 16 ms. The C-DDAEs model was trained offline and tested online, and layer-by-layer pre-training was used [21] with 20 epochs. The number of epochs for fine-tuning was 40. The activation function was the sigmoid. Each windowed speech segment was processed with a 256-point FFT and then converted to an MFCC feature vector [22]. For comparison's sake, we trained two individual DDAE networks with three and five hidden layers each. A MATLAB R2019b-based simulator was used to implement the processing for impaired subjects, run and tested on Windows 10 Pro (Intel(R) Core(TM) i5-7200U CPU, 2.71 GHz, 8 GB memory). The objective of this study was to reduce the symptoms of high-frequency hearing loss (HFHL) for subjects 1-7 (Table 2) to test the performance of the approach.
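The step of adding noise to clean utterances at a target SNR can be sketched as follows: the noise is scaled so that the clean-to-noise power ratio, in dB, matches the requested level before the two are summed.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean signal at a target SNR.

    The noise is scaled so that 10*log10(P_clean / P_noise) equals snr_db,
    where P denotes mean signal power.
    """
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Applying this with white or pink noise at 0, 5, 10, and 15 dB reproduces the kind of training material described above.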

The Spectrograms Comparison
A spectrogram is a standard tool for analysing the time-varying spectral characteristics of speech signals [23]. Figure 6 shows five spectrograms for the utterance "arctic_a0006" (a male voice saying, "God bless him, I hope I will go on seeing them forever."): (a) the clean signal; (b) the noisy signal at the 0 dB SNR level; (c) the signal enhanced by DDAE-3; (d) the signal enhanced by DDAE-5; and (e) the signal enhanced by HC-DDAEs. Figure 7 shows the corresponding spectrograms for the utterance "arctic_b0493" (a male voice saying, "Your yellow giant thing of the frost.").
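A comparison of this kind can be reproduced with SciPy's spectrogram routine. The toy signal below stands in for the actual utterances; the 256-point frames match the FFT size used in the experiments, and the small floor inside the logarithm is an implementation detail to avoid log(0).

```python
import numpy as np
from scipy.signal import spectrogram

# Log-magnitude spectrogram of a toy signal (a tone in white noise),
# standing in for the clean/noisy/enhanced utterances compared in the figures.
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(4)
signal = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.normal(size=fs)

f, times, Sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
log_spec = 10.0 * np.log10(Sxx + 1e-10)   # dB scale for plotting
```

Plotting `log_spec` against `times` and `f` (e.g., with matplotlib's `pcolormesh`) gives the panels shown in Figures 6 and 7.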

Speech Quality and Intelligibility Evaluation
This section describes the evaluation of our method, as presented in Figure 8. We used well-known metrics to evaluate the quality and intelligibility of the enhanced speech (higher scores represent better quality). We compared the enhanced signals to the desired versions of the test signals.



Perceptual Evaluation of Speech Quality (PESQ)
The PESQ standard [23] is defined in the ITU-T P.862 recommendation: it uses an auditory model that includes auditory filtering and spectral and temporal masking. PESQ tests the quality of the speech by comparing the enhanced signal with the clean signal. It does so by predicting the quality with a good correlation over an extensive range of conditions, which may contain distortions, noise, filtering, and errors. PESQ is a weighted sum of the average disturbance d_sym and the average asymmetrical disturbance d_asym, which can be defined as follows: PESQ = a_0 + a_1·d_sym + a_2·d_asym (16), where a_0 = 4.5, a_1 = −0.1, and a_2 = −0.0309. It produces a score from −0.5 to 4.5, with higher values indicating better speech quality.
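The final PESQ combination in Equation (16) is a simple weighted sum. The sketch below assumes the symmetric and asymmetric disturbances have already been computed by the full P.862 auditory model, which is not reproduced here.

```python
def pesq_score(d_sym, d_asym, a0=4.5, a1=-0.1, a2=-0.0309):
    """Final PESQ combination of the symmetric disturbance d_sym and the
    asymmetric disturbance d_asym, per Equation (16). The disturbances
    themselves come from the ITU-T P.862 auditory model."""
    return a0 + a1 * d_sym + a2 * d_asym
```

With zero disturbance the score is the maximum a0 = 4.5, and it decreases as either disturbance grows.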

Hearing Aid Speech Quality Index (HASQI)
The HASQI is used to predict speech quality according to the hearing threshold of the hearing-impaired individual. The HASQI is based on two independent parallel pathways: (1) Q_nonlin, which captures the effects of noise and nonlinear distortion, and (2) Q_lin, which captures linear filtering and spectral changes by targeting differences in the long-term average spectrum [24]. Q_nonlin and Q_lin are calculated from the output of the auditory model to quantify specific changes between the clean reference signal and the enhanced signal by the following:

Hearing Aid Speech Perception Index (HASPI)
The HASPI predicts speech intelligibility based on an auditory model that incorporates changes due to hearing loss. The index first collects all aspects of normal and impaired auditory function [25]. Then, it compares the correlation values (c) of the outputs of the auditory model for a reference signal with the outputs for the degraded test signals over time. The generation of the unprocessed reference input signal is as follows: where j and r(j) denote the basis function number and the normalised correlation, respectively. The reference signal of the auditory model is set for normal hearing, and the test signal of the model incorporates hearing loss. The auditory model is used to measure c for the high-level part (expressed as a_High) of the clean signal and the enhanced signal in each frequency band. The envelope is sensitive to the dynamic signal behaviour related to consonants, and the cross-correlation tends to retain the harmonics in stable vowels. Finally, the HASPI score is calculated according to c and a_High. Let a_Low be the low-level auditory coherence value, a_Mid the mid-level value, and a_High the high-level value. Then, HASPI intelligibility is given by the following: More details of the HASPI auditory model can be found in [25].
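The final HASPI mapping can be sketched as a logistic function of the cepstral correlation c and the three auditory coherence levels. The coefficient values below are placeholders chosen only to illustrate the shape of the mapping; the fitted constants are given in [25].

```python
import math

def haspi_intelligibility(c, a_low, a_mid, a_high,
                          coeffs=(-9.0, 0.0, 0.0, 4.0, 10.0)):
    """Logistic mapping from cepstral correlation c and the low/mid/high
    auditory coherence values to an intelligibility score in [0, 1].
    The coefficients here are placeholders, not the fitted HASPI constants."""
    b0, b1, b2, b3, b4 = coeffs
    p = b0 + b1 * a_low + b2 * a_mid + b3 * a_high + b4 * c
    return 1.0 / (1.0 + math.exp(-p))
```

The logistic form guarantees a bounded score, with higher correlation and coherence pushing the prediction toward 1.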

Results and Discussion
In this section, we present the average scores of objective measurements on the test set for the proposed approach. Our method is compared with individual multi-layer state-of-the-art DDAE approaches (with three (DDAE-3) and five (DDAE-5) hidden layers) based on the studies in [1,20]. PESQ was measured for different types of noise (white, pink, babble, and train noise) and SNR conditions (0, 5, 10, and 15 dB). The results of the PESQ evaluation are listed in Table 3, which shows the average PESQ scores of the noisy, DDAE-3, DDAE-5, and HC-DDAEs signals for white, pink, babble, and train noise at four different SNR levels (0, 5, 10, and 15 dB). The difference between loudness spectra is computed and averaged over time and frequency to produce a prediction of the subjective Mean Opinion Score (MOS). The PESQ score ranges from −0.5 to 4.5 (a higher score indicates better speech quality). The experimental results of the listening test showed that (1) the proposed HC-DDAEs approach enhanced speech quality significantly more than the traditional DDAE with three or five hidden layers for most types of noise in all the test conditions; (2) for babble noise at 15 dB SNR, DDAE-5 achieved almost the same result as the proposed HC-DDAEs; (3) an individual DDAE network with three hidden layers was unable to remove the noise due to local minima, especially at high SNR levels, and increasing the number of epochs in the training stage helps but is time-consuming; and (4) increasing the hidden layers of the individual DDAE to five (DDAE-5) gave better speech quality than the DDAE network with three hidden layers (DDAE-3) at most SNR levels. From these results, we found that HC-DDAEs provide better speech quality for hearing loss users. Figure 9 presents the average HASQI scores for the noisy, DDAE-3, DDAE-5, and HC-DDAEs signals at four different SNR levels (0, 5, 10, and 15 dB) for seven hearing loss audiograms (1-7) with babble noise.
The HASQI score ranges from 0 to 0.5 (a higher score indicates better speech quality). The results clearly showed that (1) HC-DDAEs provided higher speech quality than DDAE-3 and DDAE-5 in all test cases for the seven audiograms, especially at low SNR levels, although HC-DDAEs performance is slightly degraded for audiogram seven at 15 dB SNR; (2) DDAE-3 performance is degraded for audiograms two and three at the 10 dB SNR level and for audiogram four at 10-15 dB SNR; and (3) increasing the hidden layers of the individual DDAE to five (DDAE-5) provided better speech quality than DDAE-3 in almost all hearing cases and roughly the same speech quality as HC-DDAEs for audiograms four and seven at all SNR levels, especially at 5 and 10 dB SNR.
The results of the average HASPI score are presented in Figure 10 for seven audiograms (1-7) at four different SNR levels (0, 5, 10, and 15 dB) with babble noise. The HASPI score ranges from 0 to 1 (a higher score indicates better speech intelligibility). The experimental results show that (1) the proposed HC-DDAEs achieved a higher HASPI score than DDAE-3 in all test conditions, gave almost the same results as DDAE-5 for audiograms 2, 3, and 5 at the 0 dB SNR level, and provided better intelligibility in the rest of the test cases; (2) DDAE-3 performance is degraded for audiogram 5 at 10 dB SNR and audiogram seven at the 15 dB SNR level; and (3) increasing the hidden layers of the individual DDAE to five (DDAE-5) gives a slightly better HASPI score than DDAE-3 in most listening cases, while providing the same or almost the same speech intelligibility as DDAE-3 for audiograms 1-5 at 15 dB SNR.
Figure 9. Average results of HASQI metrics for seven audiograms of sensorineural hearing loss. The vertical axis presents the HASQI score (between 0 and 0.5), and the horizontal axis presents the SNR levels.
Figure 10. Average results of HASPI metrics of seven audiograms of sensorineural hearing loss. The vertical axis presents the HASPI score (between 0 and 1), and the horizontal axis presents the SNR levels.

Based on the previous results, we found that the proposed HC-DDAEs improved both the quality and the intelligibility of the speech. However, the benefits of the proposed HC-DDAEs approach for hearing losses appear to degrade for audiograms three, five, and seven at high SNR levels (Figures 9 and 10). In other words, HC-DDAEs still have room for improvement in enhancing speech for hearing-impaired users in real-world noisy environments at high SNR levels. Additionally, the study was a computer-based simulation and was not tested on real hearing devices, which limits the benefit of the approach. More hearing loss information, such as the disturbance level and hearing threshold, must be considered in future work to improve the performance of the proposed approach.

Conclusions
In this work, we investigated the performance of the deep denoising autoencoder for speech enhancement. We proposed a new denoising approach for improving the quality and the intelligibility of speech for hearing applications based on bandpass filters and a compound of multiple deep denoising autoencoder networks. In the first stage, a set of bandpass filters splits the speech signal into eight separate channels between 0 and 8000 Hz based on the cochlea's frequency responses. Then, a compound model composed of multiple DDAE networks was proposed, each of which is specialised for a specific subtask of the whole enhancement task. To monitor the improvement of speech quality under adverse conditions, different speech utterances and noises were used for training and testing the model. We evaluated the speech intelligibility and quality of the proposed model based on PESQ, HASQI, and HASPI, which were applied to seven HFHL audiograms. Then, we compared the results of the proposed approach with those obtained by DDAE networks with three and five hidden layers separately. Based on the experimental results in this study, we concluded that: (1) the proposed HC-DDAEs approach provided better speech quality based on PESQ and HASQI in most test conditions than DDAE-3 and DDAE-5; (2) the enhanced speech based on the DDAE-3 network could not totally obliterate the noise because of local minima, especially at high SNR levels, and increasing the number of epochs in the training stage helps but is time-consuming.
(3) Increasing the number of hidden layers of an individual DDAE to five (DDAE-5) gave better speech quality than the DDAE network with three hidden layers (DDAE-3) at most SNR levels. Meanwhile, the proposed HC-DDAEs provided low speech intelligibility based on HASPI for audiograms 1-7 at the 0 dB SNR level. Several issues must be investigated further in future work. First, real-world noises at different SNR levels will be used to examine the proposed approach and to compare it with further state-of-the-art studies. Additionally, the study was a computer-based experiment, and the approach will be tested on real hearing aid devices in future work.
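The first-stage filterbank described above can be sketched as follows. The paper does not specify the exact band edges or filter design, so the log-spaced fourth-order Butterworth bands below (between 100 and 7900 Hz, a rough stand-in for cochlear frequency spacing) are only an illustrative assumption:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def make_filterbank(n_bands=8, f_lo=100.0, f_hi=7900.0, fs=16000, order=4):
    """Design n_bands Butterworth bandpass filters with logarithmically
    spaced edges between f_lo and f_hi (illustrative cochlear-like spacing)."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    return [butter(order, [edges[i], edges[i + 1]],
                   btype="bandpass", fs=fs, output="sos")
            for i in range(n_bands)]

def split_bands(signal, filterbank):
    """Zero-phase filter the signal through every band; returns an
    (n_bands, n_samples) array of band-limited signals."""
    return np.stack([sosfiltfilt(sos, signal) for sos in filterbank])

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)   # 1 s test tone at 440 Hz
bank = make_filterbank(fs=fs)
bands = split_bands(x, bank)
print(bands.shape)                 # (8, 16000)
```

In the full system, each of the eight band signals would then be fed to its own DDAE network, and the enhanced bands summed to reconstruct the output signal.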