Speech Enhancement for Hearing Aids with Deep Learning on Environmental Noises

Hearing aids are small electronic devices designed to improve hearing for persons with impaired hearing, using sophisticated audio signal processing algorithms and technologies. In general, the speech enhancement algorithms in hearing aids remove environmental noise and enhance speech while still giving consideration to hearing characteristics and the environmental surroundings. In this study, a speech enhancement algorithm was proposed to improve speech quality in a hearing aid environment by applying noise reduction algorithms with deep neural network learning based on noise classification. In order to evaluate speech enhancement in an actual hearing aid environment, ten types of noise were self-recorded and classified using convolutional neural networks. In addition, noise reduction for speech enhancement in the hearing aid was applied by deep neural networks based on the noise classification. As a result, the speech quality after noise removal using the deep neural networks, together with the associated environmental noise classification, exhibited a significant improvement over that of the conventional hearing aid algorithm. The improved speech quality was also evaluated by objective measures: the perceptual evaluation of speech quality score, the short-time objective intelligibility score, the overall quality composite measure, and the log likelihood ratio score.


Introduction
People with impaired hearing have difficulty hearing because they have a higher hearing threshold and a narrower dynamic range than people with normal hearing [1,2]. Consequently, hearing aids are worn to reduce the difficulties caused by hearing loss by compensating for that loss [3]. Hearing aids are small electronic devices that amplify speech in noisy environments and improve the quality of communication for the hearing impaired.
Daily life is full of sound, and hearing aid technology is constantly being developed to reduce noise such as car horns, restaurant noise, the buzzing of electrical equipment, and random voices in the surroundings. Accordingly, hearing aid technologies need to reduce environmental noise and amplify voices in order to improve speech intelligibility. However, one of the major complaints of hearing aid users is the inability of hearing aids to completely reduce environmental noise, along with the amplification of unexpected noise together with speech [4]. This is because actual hearing aids operate in various, irregular, and random environmental noise scenarios.
Digital hearing aids run various audio signal processing algorithms, among which the speech enhancement algorithms are extremely important. Furthermore, speech enhancement algorithms for hearing aids, unlike general signal processing algorithms, need to be realistic and specific to meet the expectations of people with impaired hearing.
A typical speech enhancement algorithm for hearing aids classifies the environmental noise around the hearing aid and estimates the noise power from the input signal in order to reduce the estimated noise. For example, popular hearing aid noise classification algorithms include the support vector machine (SVM) [5] and the Gaussian mixture model (GMM) [6]; popular noise estimation algorithms include the minimum statistics (MS) [7], minima controlled recursive averaging (MCRA) [8], and improved minima controlled recursive averaging (IMCRA) [9] algorithms; and popular noise reduction algorithms include spectral subtraction (SS) [10], noise reduction using the Wiener filter [11], the minimum mean square error short-time spectral amplitude (MMSE-STSA) [12], and the log minimum mean square error (logMMSE) [13] algorithms.
Yu et al. [14] modified the minimum mean square error-log scale amplitude (MMSE-LSA) estimator to improve speech perception with hearing aids. Nayan et al. [15] proposed a noise reduction and speech enhancement algorithm for hearing aids using the matrix Wiener filter and compared its performance with the multichannel Wiener filter. Chandan et al. [16] used power spectrum estimation and calculated the gain using spectral subtraction for speech enhancement in hearing aids, and Yuyong and Sangmin [17] proposed a speech enhancement algorithm for hearing aids using the IMCRA for noise estimation and logMMSE for noise removal. Subsequently, noises were estimated using adaptive IMCRA parameters based on noise classification with a GMM, and these parameters were used for noise reduction [18]. Although various conventional algorithms were proposed to improve the speech quality of hearing aids, they did not apply deep learning techniques, which are known to be effective in noise classification and speech recognition, to speech enhancement for hearing aids.
The speech quality evaluations of most typical speech enhancement algorithms for hearing aids use popular noise databases that are unnatural and homogenized, such as the NOISEX-92 database [19], the NOIZEUS database [20], the CHiME background noise data [21], and the AURORA-2 corpus [22]. However, actual hearing aids operate in various, irregular, and random noisy environments. The actual environmental noise may comprise two or more types of noise (for example, random babble mixed with subway sounds), and unexpected sounds may suddenly occur.
In this study, a speech enhancement algorithm was proposed that considers the environmental noise in which hearing aids operate and improves speech quality using deep learning techniques. In order to accurately recognize the type of environmental noise, noise classification results obtained with convolutional neural networks (CNNs) were used [23], and a noise reduction algorithm using deep neural networks (DNNs) was applied to improve the speech enhancement.

Materials and Methods
The DNNs used for the proposed speech enhancement and the results of previous studies on noise classification are described in Sections 2.1 and 2.2, respectively. Section 2.3 explains the process and application of the proposed algorithm for efficient speech enhancement in hearing aids. Section 2.4 introduces the conditions of the self-recorded noises and the determination of the DNN architecture and hyperparameters. The objective measures used to evaluate speech quality are described in Section 2.5. Overall, this section deals with the process of speech enhancement and the operation of the networks presented in this paper, based on the environmental noise for hearing aids.

Deep Neural Networks
Artificial neural networks (ANNs) are computational models based on the structure and functions of biological neural networks [24]. DNNs are ANNs with multiple layers between the input and output layers, and deep learning is a machine learning technique that constructs such artificial neural networks [25][26][27]. As the number of hidden layers increases, the representations learned by DNNs become more detailed and complex, deriving high-level functions from the input information. Based on this network model, the work of extracting features and determining the form of unstructured data is called deep learning.
DNNs consist of three types of neuron layers: the input layer, the hidden layers, and the output layer [28]. The input layer receives input data, and the output layer returns output data. The hidden layers perform mathematical computations on the inputs. Each layer computes a weighted sum of its inputs and applies an activation function, and these activations are propagated through the network until the neurons reach the output layer [29]. The difference between the output layer's values and the expected target values is measured using a cost function; reducing it by adjusting the weights determines the performance of the network model.
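The forward pass and cost function described above can be sketched as follows. This is a minimal NumPy illustration, not the network used in this study; the layer sizes, random weights, and data are placeholders.

```python
import numpy as np

def relu(x):
    """Rectified linear unit activation for the hidden layers."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Sigmoid activation, mapping the output layer into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Propagate input x through ReLU hidden layers to a sigmoid output."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                       # weighted sum + activation
    return sigmoid(a @ weights[-1] + biases[-1])  # output layer

def mse_cost(output, target):
    """Cost function: mean square error between output and target."""
    return np.mean((output - target) ** 2)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]                              # input, two hidden, output
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(1, 4))
y = forward(x, weights, biases)
print(y.shape, mse_cost(y, np.zeros((1, 2))))
```

Training would then adjust `weights` and `biases` to reduce the cost, as described in Section 2.4.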

Classification of Environmental Noise
To properly reduce the noise in a noisy environment, a different noise reduction algorithm has to be applied to each noise. To do so, the current noise characteristics need to be extracted according to the hearing aid's environment and distinctly classified into various noise categories.
The noise classification results obtained using CNNs were applied for speech enhancement. CNNs are a kind of deep learning technique widely used for image classification, as they maintain the spatial information of the image. Convolutional and pooling layers are added between the input and output layers of CNNs, giving excellent performance in processing data composed of multi-dimensional arrays, such as color images [23]. Feature maps of the input data are produced by moving a convolution filter across the input in the convolution layer, and values are then extracted from the final feature maps in the pooling layer to reduce computational complexity and improve accuracy [30].
In order to improve the classification rate for environmental noise, a spectrogram image of the noise was used to transform the sound signal from the time-frequency domain into an image signal. In addition, a sharpening mask and a median filter were applied to improve the noise classification rate [31,32]. The environmental noise was the same dataset as in the experiments outlined in this paper. The classification results using CNNs were better than those of the conventional noise classification algorithms for hearing aids, and this improved noise classification can contribute to the enhancement of speech for hearing aids [23].
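The preprocessing pipeline above (spectrogram image, sharpening mask, median filter) can be sketched as below. This is a hedged illustration assuming SciPy is available; the window length, filter sizes, and sharpening amount are illustrative choices, not values taken from this study.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import median_filter, uniform_filter

def noise_to_image(signal, fs=16000, sharpen=1.0, median_size=3):
    """Turn a noise clip into a filtered log-spectrogram image for a CNN."""
    f, t, sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
    img = 10 * np.log10(sxx + 1e-12)            # log-power spectrogram image
    blurred = uniform_filter(img, size=3)
    img = img + sharpen * (img - blurred)       # unsharp (sharpening) mask
    img = median_filter(img, size=median_size)  # suppress speckle artifacts
    return img

fs = 16000
rng = np.random.default_rng(1)
clip = rng.normal(size=fs)                      # 1 s white-noise stand-in
image = noise_to_image(clip, fs)
print(image.shape)
```

The resulting 2-D image would then be fed to the CNN classifier in place of the raw waveform.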

Proposed Algorithm
The noise reduction algorithm operated using a stored learning model corresponding to the environmental noise, which had been trained under ten kinds of noise conditions for effective noise reduction performance. For hearing aid operation, the DNNs were pre-trained and stored for each type of environmental noise.
The storage elements of the trained model were the training configuration, such as the architecture of the model, the weights of each node in the layers, the cost function, and the optimization method. In particular, since the weights of each node differed based on the type of environmental noise, the weight values varied depending on the noise classification. As a result, the number of weight sets increased in proportion to the number of environmental noise types learnt by the DNNs, making it possible to reduce the noise according to its type.
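The per-noise weight selection can be sketched as a simple lookup: one stored weight set per noise type, chosen at run time from the classifier's output. The noise-type names follow Section 2.4; the single matrix standing in for a full trained DNN is illustrative.

```python
import numpy as np

# One stored weight set per environmental noise type. A single 123x64 matrix
# here stands in for the full layer weights of a trained noise-specific DNN.
NOISE_TYPES = ["white", "cafe", "car_interior", "fan", "laundry",
               "library", "office", "restaurant", "subway", "traffic"]

rng = np.random.default_rng(5)
stored_models = {name: rng.normal(0, 0.1, (123, 64)) for name in NOISE_TYPES}

def select_model(classified_noise):
    """Load the pre-trained weight set matching the classified noise type."""
    return stored_models[classified_noise]

W = select_model("subway")    # e.g., the CNN classified the noise as subway
print(W.shape)
```

In operation, the CNN classification result (Section 2.2) would supply the key, so the noise reduction DNN always runs with weights trained on the matching noise type.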
The processes of the training stage and the test stage of the DNNs for noise reduction are shown in Figure 1. First, the characteristic feature sets of the noisy speech were extracted to generate input data for the DNNs [33]. The input data comprised 123 features, consisting of mel-frequency cepstral coefficients (MFCCs), the amplitude modulation spectrogram (AMS), relative spectral transformed perceptual linear prediction coefficients (RASTA-PLP), and 64 gamma-tone filterbanks (GFs) [34]. The DNNs had three hidden layers and an output layer, with 1024 nodes in each hidden layer.
In the training stage, the cost function of the output layer was calculated as the mean square error between the 64 IBM values derived from the GFs and the 64 output data to which the sigmoid activation function was applied. The ideal binary mask (IBM) is defined as [35]:

IBM(t, f) = S²(t, f) / N²(t, f),

where S²(t, f) and N²(t, f) are the power of the speech signal and the noisy speech signal, respectively, with values ranging from 0 to 1.
In the test stage, 64 output data (t1, t2, ..., t64), which were the gain values for the power spectrum in the gamma-tone filterbank, were generated from the 123 characteristic features of the noisy speech input [36]. The gain values were multiplied by the power spectrum of the noisy speech after applying the GF, as follows:

Ŝ²(t, fi) = ti · N²(t, fi), i = 1, ..., 64.

The result was inversely transformed from the GF domain to obtain the enhanced speech. Subsequently, the estimated enhanced speech was generated and evaluated using objective measures.
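The mask target and the gain application can be sketched numerically as follows. NumPy arrays stand in for the 64-band gamma-tone power spectra; the band analysis and resynthesis steps are omitted, and the "ideal" gains are used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
frames, bands = 10, 64
S2 = rng.uniform(0.0, 1.0, (frames, bands))      # clean speech power S²(t, f)
noise2 = rng.uniform(0.0, 1.0, (frames, bands))  # additive noise power
N2 = S2 + noise2                                 # noisy speech power N²(t, f)

# Training target: mask = S² / N², bounded in [0, 1] for additive noise.
mask = S2 / N2
assert (mask >= 0).all() and (mask <= 1).all()

# Test stage: the DNN's 64 outputs act as gains on the noisy band powers,
# yielding an estimate of the clean-speech power to be resynthesized.
gain = mask                                      # ideal gains, for illustration
S2_hat = gain * N2
print(np.allclose(S2_hat, S2))
```

With ideal gains the clean band powers are recovered exactly; in practice the DNN's estimated gains only approximate the mask, and the enhanced speech is obtained by inverse gamma-tone transformation of `S2_hat`.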

Experimental Setting
Since noise databases such as NOISEX-92, NOIZEUS, the CHiME background noise data, and AURORA-2 are processed and refined to be applicable to most audio signal processing tasks, it is easy to compare results with those of other studies. However, these databases do not provide the variety of mixed sound found in real life.
Ten types of noise were recorded in actual environments in which hearing aids are used: white noise (white, N0), café noise around Inha University, Korea (café, N1), interior noise in a moving car (car interior, N2), fan noise in a laboratory (fan, N3), laundry noise in a laundry room (laundry, N4), noise in the library at Inha University (library, N5), normal noise in a university laboratory (office, N6), various noises in a restaurant (restaurant, N7), noise in a subway car (subway, N8), and traffic noise around an intersection (traffic, N9). Each noise was recorded three times at different times on different days, and 30 min of noise data were generated for each noise type. The noises were recorded at 44.1 kHz, the highest sampling frequency of the recording microphone, to minimize the loss of audio data; the noise data then needed to be down-sampled to 16 kHz, because 44.1 kHz is too high for hearing aid signal processing. During down-sampling, a low-pass filter with a cut-off frequency of 8 kHz was applied to prevent aliasing. Unlike the noise database, the speech database used TIMIT, because the speech needs a constant speed and a uniform tone to be mixed with the noise. Noisy speech was generated by mixing TIMIT sentences with the recorded noises at 0, 5, 10, and 15 dB SNR.
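The two dataset preparation steps above, anti-aliased down-sampling from 44.1 kHz to 16 kHz and mixing speech with noise at a target SNR, can be sketched as follows. This is an illustrative SciPy/NumPy sketch with synthetic stand-ins for the recorded data, not the authors' MATLAB processing.

```python
import numpy as np
from scipy.signal import resample_poly

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(Ps/Pn) equals snr_db, then add it."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
recorded = rng.normal(size=44100)                 # 1 s recorded at 44.1 kHz
# resample_poly applies an anti-aliasing low-pass filter internally;
# 44100 * 160 / 441 = 16000 samples at the new rate.
down = resample_poly(recorded, up=160, down=441)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # speech stand-in
noisy = mix_at_snr(speech, down, snr_db=5)
print(len(down), len(noisy))
```

Repeating `mix_at_snr` at 0, 5, 10, and 15 dB for each noise type yields the noisy-speech conditions used in the experiments.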
The DNNs used three hidden layers, each with 1024 nodes and rectified linear unit (ReLU) activation. The standard backpropagation algorithm with dropout regularization (dropout rate of 0.2) was used to train the DNNs. The stochastic gradient descent algorithm with Adagrad was used to train the weights of the DNNs. A momentum rate of 0.5 was used for the first five epochs, after which the rate was increased to 0.9. The cost function in the last layer was the mean square error (MSE). The batch size was set to 1024, the learning rate was 0.001, and the network was trained for 20 epochs.
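One training step of this recipe (ReLU hidden layer, inverted dropout at rate 0.2, MSE cost, Adagrad-scaled update) can be sketched in NumPy. A single small hidden layer stands in for the three 1024-node layers, momentum and bias updates are omitted for brevity, and all data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hid, n_out = 123, 32, 64                # 123 features -> 64 gains
lr, drop_rate = 0.001, 0.2

W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, n_out)); b2 = np.zeros(n_out)
G1 = np.zeros_like(W1); G2 = np.zeros_like(W2)  # Adagrad accumulators

x = rng.normal(size=(8, n_in))                  # a small batch
t = rng.uniform(size=(8, n_out))                # mask targets in [0, 1]

# Forward pass with inverted dropout on the hidden layer.
h = np.maximum(0.0, x @ W1 + b1)
keep = (rng.uniform(size=h.shape) >= drop_rate) / (1 - drop_rate)
h_drop = h * keep
y = 1.0 / (1.0 + np.exp(-(h_drop @ W2 + b2)))   # sigmoid output
loss_before = np.mean((y - t) ** 2)             # MSE cost

# Backward pass (MSE through sigmoid and ReLU), then Adagrad updates.
dy = 2 * (y - t) / y.size * y * (1 - y)         # gradient at output pre-activation
dW2 = h_drop.T @ dy
dh = (dy @ W2.T) * keep * (h > 0)               # back through dropout and ReLU
dW1 = x.T @ dh
G2 += dW2 ** 2; W2 -= lr * dW2 / (np.sqrt(G2) + 1e-8)
G1 += dW1 ** 2; W1 -= lr * dW1 / (np.sqrt(G1) + 1e-8)
print(loss_before)
```

Looping this step over mini-batches of 1024 for 20 epochs corresponds to the schedule described above.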
The noise dataset was generated and the DNN architecture designed with MATLAB R2019a (MathWorks, Inc., Natick, MA, USA), and the Parallel Computing Toolbox in MATLAB was used for deep learning.

Performance Evaluation
For objective speech quality evaluation, the results of speech enhancement were compared using the perceptual evaluation of speech quality (PESQ), the short-time objective intelligibility (STOI), the log likelihood ratio (LLR), and the overall quality composite measure (OQCM). The PESQ is scored from −0.5 to 4.5 and the STOI from 0.0 to 1.0, with a higher score indicating better quality. The LLR is scored from 0.0 to 2.0, with a lower score indicating better speech quality. The OQCM was calculated as a combination of existing objective evaluation measures to form a new measure [37], as follows:

OQCM = 1.594 + 0.805 · PESQ − 0.512 · LLR − 0.007 · WSS.

Higher values of the OQCM represent lower signal distortion and higher speech quality. PESQ, LLR, and the weighted spectral slope (WSS) represent the perceptual evaluation of speech quality, the log likelihood ratio, and the weighted slope spectral distance in each frequency band, respectively.
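The composite measure can be computed directly from the three component scores. The coefficients below are those published for the overall-quality composite measure in the evaluation study cited as [37]; the example input scores are hypothetical.

```python
def oqcm(pesq, llr, wss):
    """Overall quality composite measure from PESQ, LLR, and WSS scores.

    Coefficients follow the composite measure of [37]; higher is better.
    """
    return 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss

# Hypothetical scores for a fairly clean enhancement result.
print(oqcm(pesq=2.5, llr=0.5, wss=40.0))
```

Because PESQ enters with a positive weight and LLR and WSS with negative weights, improvements in any component raise the OQCM, which is why the measure tracks subjective quality more closely than any single score.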

Results
The results of speech enhancement for hearing aids under various objective measurements and the detailed performance of the proposed algorithm, compared to a conventional hearing aid algorithm, are outlined below.

Speech Enhancement
This section presents the experimental results of the proposed speech enhancement for hearing aids using DNNs for each noise type. Table 1 shows the results of applying the speech enhancement algorithm (After) according to each objective evaluation measure. Speech enhancement of noisy signals with high SNRs produced better results than that of signals with low SNRs for all noise types. In particular, the subway, fan, and laundry noises exhibited higher PESQ, STOI, and OQCM scores and lower LLR scores, because these three noises had regular patterns and babble interfered less when speech enhancement was applied using the DNNs.

Table 1. Evaluation of speech quality by the proposed noise reduction algorithm based on noise classification.

As shown by the average over the ten noises in Table 1, all objective measurements after applying the proposed enhancement algorithm (After) demonstrated improved speech quality over the noisy speech without the algorithm (Before). In particular, the improvement in speech quality as measured by the OQCM, the measure closest to subjective evaluation, was 45% at 0 dB SNR, 45% at 5 dB SNR, 41% at 10 dB SNR, and 35% at 15 dB SNR.

Comparison Algorithms
In this section, the quality of speech enhancement using DNNs based on environmental noise classification is compared to that of the conventional hearing aid algorithm. In order to compare with the proposed algorithm, the results of the speech enhancement algorithm that did not use the noise classification results and of the conventional speech enhancement algorithm for hearing aids are presented in Table 2. The conventional speech enhancement algorithm for hearing aids classified the environmental noise using Gaussian mixture models with a covariance matrix, estimated the noise using MCRA with five parameters optimized for each noise type, and reduced the noise using logMMSE [16]. The PESQ and STOI scores of speech enhancement using the proposed DNNs without the classification results were similar to or slightly higher than those of the conventional algorithm for hearing aids. However, speech enhancement using the proposed DNNs with the classification results produced higher PESQ, STOI, and OQCM scores and lower LLR scores.
In particular, in the case of noisy speech with a low SNR, the improvement of speech quality was seen to be greater. Consequently, the speech enhancement for hearing aids can be expected to improve by applying different noise reduction algorithms depending on the type of noise.

Conclusions
In this study, speech enhancement for hearing aids was investigated in actual self-recorded noisy environments. In order to improve speech quality, the environmental noises were classified using convolutional neural networks, and noise reduction using DNNs was applied based on the classified noise. The environmental noise was recorded in the ten places most related to the environments in which hearing aids are used, and the objective evaluation of speech quality improvement was made using the PESQ, STOI, OQCM, and LLR scores.
With the proposed algorithm, comprising noise reduction based on noise classification, the PESQ score increased by 2.17% at 0 dB SNR, 3.50% at 5 dB SNR, 3.69% at 10 dB SNR, and 2.62% at 15 dB SNR compared with when the classification results were not applied. The STOI score increased by 3.23% at 0 dB SNR, 2.71% at 5 dB SNR, 1.89% at 10 dB SNR, and 1.30% at 15 dB SNR. The OQCM, which was calculated as a combination of existing objective evaluation measures, increased by 0.203 at 0 dB SNR, 0.243 at 5 dB SNR, 0.225 at 10 dB SNR, and 0.161 at 15 dB SNR. The LLR score was 7.57% lower at 0 dB SNR, 7.79% lower at 5 dB SNR, 5.86% lower at 10 dB SNR, and 4.20% lower at 15 dB SNR than that of noise reduction without the classification results.
The proposed speech enhancement for hearing aids provided the best speech quality in the various and irregular noisy environments of the typical hearing aid user. Because the recorded actual noises were closer to the environments of hearing aid use, the noise datasets were effectively used to improve speech quality for hearing aid users. Speech enhancement using deep learning in hearing aids resulted in improved speech quality compared with conventional speech enhancement algorithms for hearing aids. In addition, the proposed speech enhancement algorithm, which applied different DNN models according to the classified noise, achieved better speech quality than the case without the noise classification results. In summary, the increased noise classification rate using CNNs (with two types of image filters) improved the noise reduction performance using DNNs, and through the proposed algorithm, the quality of speech enhancement in the hearing aid improved, resulting in increased listening satisfaction for people with impaired hearing. As hearing aid chips advance and speech signal processing for hearing aids continues to develop, more complex algorithms could be applied to hearing aids, and more detailed hearing compensation could be provided for the diverse characteristics of hearing.

Conflicts of Interest:
The authors declare no conflict of interest.