Environmental Noise Classification Using Convolutional Neural Networks with Input Transform for Hearing Aids.

Hearing aids are essential for people with hearing loss, and noise estimation and classification are among the most important technologies used in these devices. This paper presents an environmental noise classification algorithm for hearing aids that uses convolutional neural networks (CNNs) and image signals transformed from sound signals. The algorithm was developed using data for ten types of noise recorded in the living environments where such noises occur. Spectrogram images transformed from the sound data are used as the input of the CNNs after the images are processed with a sharpening mask and a median filter. The classification results of the proposed algorithm were compared with those of other noise classification methods. A maximum classification accuracy of 99.25% was achieved by the proposed algorithm for a spectrogram time length of 1 s, with the accuracy decreasing as the spectrogram time length increased up to 8 s. For a spectrogram time length of 8 s with the sharpening mask and median filter, the classification accuracy was 98.73%, which is comparable with the 98.79% achieved by the conventional method for a time length of 1 s. The proposed hearing aid noise classification algorithm thus offers less computational complexity without compromising on performance.


Introduction
Hearing difficulty is a symptom of hearing loss caused by anomalies in the human sound signal transmission process. The difficulty in hearing a particular sound is due to an increase in the corresponding hearing threshold and narrowing of the dynamic range [1,2]. The use of a hearing aid is one of the methods for solving hearing difficulty and compensating for hearing loss [3]. Hearing aids use various technologies such as noise reduction, sound compensation, directional microphones, and feedback cancelation, and are tuned to the hearing characteristics of the users and the environments of use [4].
Daily life is full of noises, and hearing aid technologies are continuously being developed to reduce these noises, such as those in restaurants, car horns, buzzing from electrical equipment, and random voices in the surroundings. However, because the operating environment of a hearing aid varies with time, place, and other factors, 100% performance satisfaction is not achieved [5,6]. One of the biggest complaints of hearing aid users is the inability to completely reduce ambient noise, and the tendency of the noise to be amplified with the human voice [7]. Speech intelligibility is affected when surrounding noise is incorrectly interpreted as human voice, or if the voice is misinterpreted and removed with noise. This is due to performance problems and inaccurate sound classification of the noise reduction algorithm [8].
The traditional noise classification algorithm in hearing aids extracts characteristic features from the data, finds the class with the highest probability given those features, and assigns the input to that class [9]. The design of the noise classification algorithm focuses mainly on the performance of the hearing aid, which has to operate with low computational complexity and low power [10].
However, with the recent development of hearing aid chips with early smartphone-level CPU performance, such as the Ezairo 71XX, environmental noise classification algorithms based on deep learning are now feasible. Compared with image signals, it is generally not easy to extract sound signal characteristics that can be used as input data for deep learning. This is because the time-domain data of a sound reveal little about its signal content or its characteristics in the frequency domain. Therefore, various feature extraction algorithms are used to convert the data into the frequency domain and to distinguish the frequency characteristics of different sounds. Nevertheless, real sound signals are a mixture of different sounds, and it remains difficult to distinguish between the characteristics of the noises and voices they contain.
In this study, a spectrogram was used to transform noise signals into image signals in the time-frequency domain for hearing aid noise classification, as an alternative to the use of extracted frequency-domain features. A long noise estimation period was employed, and deep learning was used to improve the low classification accuracy. The image data transformed from the sound signals were used to classify environmental noises with convolutional neural networks (CNNs), which are among the best methods for image classification.

Conventional Noise Classification Algorithms
One of the most basic classification algorithms in use is the Bayesian classifier [11], which classifies with the help of histograms of the class-specific probabilities. The K-nearest neighbors classification algorithm is a simple process that determines the class of a new input from the classes of its nearest training samples [12]. It is suitable for simple classification problems with relatively few training features because, as the number of training features increases, both the computational complexity and the computation time increase. Support vector machines [13,14] and neural networks [15] are discriminative classification algorithms. These algorithms can be effective when there is enough sufficiently varied data to train the classifier, and they can work even when the underlying probability distributions of the features are unknown. Hidden Markov models (HMMs) [10,16,17] are a widely used statistical method for speech recognition. One major advantage of HMMs over the previously described classifiers is that they account for the temporal statistics of the occurrence of different states in the features. Clustering refers to a group of unsupervised processes that group features based on their measured similarity. Clustering is related to classification in that both divide unknown inputs into classes.

Convolutional Neural Networks
A CNN is a deep learning technology based on supervised learning, and it is widely used for image processing because it preserves the spatial information of the image [18,19]. As shown in Figure 1, convolutional and pooling layers were added between the input and output layers of the present CNN, which gives excellent performance in processing data composed of multi-dimensional arrays such as color images. The convolution operation extracts features, such as edges, from the input data. The pooling operation, in turn, reduces the spatial size of the convolved feature [20], which decreases the computational power required to process the data through dimensionality reduction.
The feature map of the input data is produced by sliding a convolution filter over the input in the convolutional layer, and the pooling layer then subsamples the values of the resulting feature maps to reduce the computational complexity and improve the accuracy [21]. In this paper, the CNN has two hidden layers, namely a 5 × 5 convolution layer and a max pooling layer that uses a 2 × 2 window. The activation function is the rectified linear unit (ReLU), one of the most commonly used activation functions, and the loss function is the cross-entropy. The data were divided into training and test sets at a ratio of 75:25, and the training batch size was set to 16. The number of epochs was 12, and the learning rate was set to 0.001.
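As a rough illustration of the layer operations described above, the following pure-Python sketch applies one 5 × 5 convolution, a ReLU, and 2 × 2 max pooling to a small array, and evaluates the cross-entropy loss for a single sample. The padding (none), the stride (1), the filter values, the toy input, and all function names are assumptions made for illustration; the paper does not specify them, and this is not the authors' implementation.

```python
import math

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    As is usual in CNNs, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping size x size max pooling; leftover rows/columns are dropped."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def cross_entropy(probs, true_class):
    """Cross-entropy loss for one sample given predicted class probabilities."""
    return -math.log(probs[true_class])

# Toy 8 x 8 input and a 5 x 5 averaging filter (both assumed values).
image = [[float((r + c) % 4) for c in range(8)] for r in range(8)]
kernel = [[1.0 / 25.0] * 5 for _ in range(5)]

features = max_pool(relu(conv2d_valid(image, kernel)), size=2)
shape = (len(features), len(features[0]))  # (8 - 5 + 1) // 2 = 2 per axis
loss = cross_entropy([0.1, 0.7, 0.2], true_class=1)
```

Under the same assumptions, a 904 × 713 input would shrink to 900 × 709 after the convolution and to 450 × 354 after pooling.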

The Proposed Algorithm
In this paper, the spectrogram images of noise signals were used as input data for the CNNs, without feature extraction or conversion to the frequency domain. A spectrogram is a visual representation of the frequency spectrum of a signal as it varies with time, with the amplitude at each frequency indicated by color. Because the spectrogram encodes amplitude as color, it has the advantage of showing the time and energy information in the frequency domain over a given period. The characteristics of the spectrogram images therefore vary with the amplitude at each frequency and with time. Figure 2 shows a flow chart of the proposed environmental noise classification algorithm. The input sound signals were transformed into spectrogram images represented in RGB colors for noise classification using the CNNs. Two types of filters were combined to distinguish the noise characteristics, because, unlike ordinary images, the spectral image of a sound signal contains irregular amplitude changes over time. Each filter was also applied on its own so that its results could be compared with those of the proposed algorithm.
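To make the notion of a spectrogram concrete, the sketch below computes short-time Fourier transform magnitudes with a naive DFT in pure Python. The frame length, hop size, Hann window, and test tone are assumptions chosen to keep the example small; the paper generated its spectrograms with MATLAB and does not state these parameters, and the mapping of magnitudes to RGB colors is omitted here.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(N^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(signal, frame_len=64, hop=32):
    """Return one column of |STFT| values per windowed frame (time runs along the list)."""
    window = hann(frame_len)
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        columns.append(dft_magnitudes(frame))
    return columns

FS = 16000          # sampling rate used in the paper after down-sampling
FRAME_LEN = 64      # assumed analysis frame length
TONE_HZ = 2000.0    # assumed test tone

signal = [math.sin(2.0 * math.pi * TONE_HZ * t / FS) for t in range(512)]
spec = spectrogram(signal, frame_len=FRAME_LEN, hop=32)

# The strongest bin of the first column should sit at the tone frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * FS / FRAME_LEN
```

Each column corresponds to one time step and each row to one frequency bin, so plotting the magnitudes as colors yields the kind of image that is fed to the CNNs.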
The first image filter uses a sharpening mask (method #1: spectrogram + Sharpening Mask), which enables enhancement of the boundaries of the noise characteristics [22]. The filter clearly identifies the boundaries of the colors, so that the area of the high-energy noise signals can be more clearly displayed.
The second image filter uses a median filter (method #2: spectrogram + Median Filter) [23]. In a conventional noise signal spectrogram, there are irregular low-energy pixels between the noise feature pixels that appear red. The use of the median filter compensates for these low-energy pixels when the data is used as input for the CNNs. Sets of input data with four different time lengths (Sets A, B, C, and D) were fed to the CNNs, and the corresponding noise classification accuracies were compared.
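The two image filters described above can be sketched on a single 8-bit image channel as follows. The sharpening coefficients, the replicated-border handling, and the function names are assumptions, since the paper gives the filter sizes later in the text but not the coefficients or the border treatment; the kernel below is simply a common 3 × 3 sharpening mask.

```python
from statistics import median

# A common 3 x 3 sharpening kernel (assumed; the paper does not list its coefficients).
SHARPEN_3X3 = [[0, -1, 0],
               [-1, 5, -1],
               [0, -1, 0]]

def _clamp(i, n):
    """Clamp an index into [0, n - 1] so that border pixels are replicated."""
    return min(max(i, 0), n - 1)

def sharpen(channel, kernel=SHARPEN_3X3):
    """Apply a 3 x 3 sharpening mask and clip the result to the 0..255 range."""
    h, w = len(channel), len(channel[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            acc = 0
            for i in range(-1, 2):
                for j in range(-1, 2):
                    pixel = channel[_clamp(r + i, h)][_clamp(c + j, w)]
                    acc += kernel[i + 1][j + 1] * pixel
            row.append(min(max(acc, 0), 255))
        out.append(row)
    return out

def median_filter(channel, size=5):
    """Replace each pixel by the median of its size x size neighbourhood."""
    half = size // 2
    h, w = len(channel), len(channel[0])
    return [
        [
            median(channel[_clamp(r + i, h)][_clamp(c + j, w)]
                   for i in range(-half, half + 1)
                   for j in range(-half, half + 1))
            for c in range(w)
        ]
        for r in range(h)
    ]

# An isolated low-energy pixel in a uniform region is filled in by the median filter.
patch = [[100] * 7 for _ in range(7)]
patch[3][3] = 0
smoothed = median_filter(patch)

# A step edge becomes steeper after sharpening.
edge = [[50, 50, 50, 150, 150, 150] for _ in range(4)]
sharpened = sharpen(edge)
```

In the toy edge example, the pixels on either side of the boundary move from 50 and 150 to 0 and 250, which is the boundary enhancement the sharpening mask is used for.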

Materials and Methods
The conditions of noise data and signal processing information will first be discussed in Section 4.1; in Section 4.2, the specific noise classification experiment and input data transformation process will be described. Overall, the determination of the input data and detailed pre-processing of image data in CNNs are described for the noise classification algorithm.

Recording Environmental Noises
Ten kinds of noise were recorded from real environments in which hearing aids are used: white noise (white, N0), café noise around Inha University, Korea (café, N1), interior noise in a moving car (car_interior, N2), single fan noise in a laboratory (fan, N3), laundry noise in a laundry room (laundry, N4), noise in the library of Inha University (library, N5), normal noise in a university laboratory (office, N6), various noises in a restaurant (restaurant, N7), noise in a subway car (subway, N8), and traffic noise around an intersection (traffic, N9).
Each noise was recorded three times at different times on different weekdays, and 30 min of noise data was generated for each noise type. To closely reflect the environments in which hearing aids are used, recording locations such as Starbucks and the biggest restaurant in the Inha Student Union building were selected. The noises were recorded on an iPhone 6S at 44.1 kHz, which is the highest possible sampling frequency, and the artificial noise generated at the beginning and end of each recording was excluded. The noise data was subsequently down-sampled to 16 kHz, which is a suitable sampling frequency for signal processing in hearing aids.

Experiment Data
MATLAB R2019b (MathWorks) was used to divide the recorded noise data into fixed time intervals. The noise signals consisted of 16,000 samples per second and were divided into four sets with time lengths of 1.0, 2.0, 4.0, and 8.0 s, respectively. Each frame overlapped its neighbors by 25% on either side to obtain a continuous noise signal and prevent data loss [24]. The spectrogram images obtained from the sound signals of the 10 noises comprised 23,960 images with a time length of 1 s (Set A), 11,960 with a length of 2 s (Set B), 6,000 with a length of 4 s (Set C), and 3,000 with a length of 8 s (Set D).
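The segmentation step can be sketched as follows, assuming that adjacent frames share 25% of a frame length (a hop of 75% of the frame) and that each recording is exactly 30 min long after trimming; both are interpretations of the text rather than stated facts, and the function name is hypothetical.

```python
def frame_starts(n_samples, frame_len, overlap=0.25):
    """Start indices of frames whose neighbours share `overlap` of a frame length."""
    hop = int(round(frame_len * (1.0 - overlap)))
    if n_samples < frame_len:
        return []
    return list(range(0, n_samples - frame_len + 1, hop))

FS = 16000               # samples per second after down-sampling
RECORDING_S = 30 * 60    # 30 min of data per noise type (from the text)

# Number of frames per noise type for the 1, 2, 4, and 8 s time lengths.
counts = {
    seconds: len(frame_starts(RECORDING_S * FS, seconds * FS))
    for seconds in (1, 2, 4, 8)
}
```

Under these assumptions the per-noise counts come out at 2399, 1199, 599, and 299 frames, which is close to one tenth of the reported set sizes; the small differences presumably come from the exact recording lengths and trimming, which the paper does not give.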
The conversion functions in the MATLAB Signal Processing Toolbox were used to transform the noise signals into spectrograms. Each spectrogram image had a resolution of 904 × 713 pixels and was used as the input of the CNNs. To increase the classification accuracy for the spectrogram images of the noise signals, a 3 × 3 sharpening mask was used to enhance the boundaries of the colors, while a 5 × 5 median filter was used to represent the pattern of the colors clearly and to smooth the colors of random noise pixels. Figure 3 shows the results of transforming sound signals into a spectrogram image and applying the sharpening mask and median filter. The filtered spectrogram images were also subsequently used as input data. Using Set A as an example, there were four types of input data (spectrogram image, spectrogram image + Sharpening Mask, spectrogram image + Median Filter, and spectrogram image + Sharpening Mask + Median Filter) for the 10 considered environments, from which 23,960 spectrogram images were obtained. The same number of images was obtained after the application of the sharpening mask and of the median filter.

Experimental Results
In this section, the classification results for hearing aids are presented under various conditions to show the detailed performance of the proposed algorithm. The results are presented as confusion matrices and receiver operating characteristic (ROC) curves. A confusion matrix is a table that is often used to describe the performance of a classifier on a set of test data [25]. A ROC curve is a graph showing the performance of a classification model at all classification thresholds [26].
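The two evaluation tools introduced above can be sketched in a few lines. The labels and scores below are made-up toy values, not results from the paper, and the one-vs-rest treatment of a single class for the AUC is an assumption about how the per-noise ROC curves were obtained.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def roc_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy three-class example (hypothetical labels).
true = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(true, pred, n_classes=3)
acc = accuracy(cm)

# One-vs-rest AUC for class 1, using hypothetical classifier scores for that class.
class1_scores = [0.1, 0.2, 0.9, 0.4, 0.3, 0.1]
auc_class1 = roc_auc(class1_scores, [t == 1 for t in true])
```

An AUC of 1.0 in this sketch means that every sample of the class received a higher score than every sample of the other classes, which corresponds to a ROC curve that follows the top and left-hand borders.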

Performance Evaluations
This section presents the experimental results of the proposed environmental noise classification for hearing aids using CNNs. The classification produced varying results because the noise signals were randomly divided into training (0.75) and test (0.25) sets, and the spectrogram images corresponded to different times. The total number of test data was 5990 when the time length was 1 s, so the number of test data for each noise was 599. Because the number of input data depends on the time length, the number of test data was 2990 for 2 s, 1500 for 4 s, and 750 for 8 s.
To obtain meaningful classification results, the data were randomly re-divided into training and test sets at a constant ratio in every experiment. As indicated in Table 1, the values in the tables are the classification accuracy (%), which is the ratio of the number of correct predictions to the total number of input data, and the noise classification was performed 10 times. The bold numbers in the bottom two rows of each table are the average and standard deviation of the classification accuracies, for comparison with other conditions. The conventional method is based on the deep convolutional neural network that became famous as the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 [21].
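The repeated random splitting and the summary statistics described above can be sketched as follows. The seed, the use of the sample (rather than population) standard deviation, the function names, and the example accuracies are all assumptions for illustration; the accuracies are made-up numbers, not values from Table 1.

```python
import random
from statistics import mean, stdev

def split_indices(n_items, test_fraction=0.25, seed=None):
    """Randomly split item indices into (train, test) at a fixed ratio."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_test = int(round(n_items * test_fraction))
    return indices[n_test:], indices[:n_test]

def summarize(accuracies):
    """Average and sample standard deviation over repeated runs."""
    return mean(accuracies), stdev(accuracies)

# Set A contains 23,960 images, so a 75:25 split leaves 5990 test images.
train_idx, test_idx = split_indices(23960, seed=0)

# Hypothetical accuracies (%) from repeated runs, for illustration only.
example_runs = [98.0, 99.0, 98.5, 99.5]
avg, sd = summarize(example_runs)
```

Re-drawing the split with a different seed for each of the 10 runs and then calling `summarize` on the resulting accuracies would give the two summary rows of each table.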
Method #1 classified the image data after applying the sharpening mask to emphasize the boundaries of the colors, while Method #2 used the median filter to remove ambient noise pixels. The proposed algorithm combined the sharpening mask and the median filter for a clear representation and removal of the noise pixels in the spectrogram. Regarding the time division, Set A produced the highest classification accuracy of all the sets, and the accuracy of the CNNs decreased as the time length of the spectrogram increased. This was because a longer time window is more likely to span changes in the noise environment, and because the reduced number of images increases the probability of classification error. The detailed confusion matrices of the classification results are analyzed further in Section 5.2, below.
Each number is the average of 10 classifications. Table 2 is a confusion matrix of the classification results for the time length of Set A using the CNNs. The vertical noise numbers in the table represent the true class (Target Class), while the horizontal noise numbers represent the predicted class (Output Class). The numbers in the diagonal cells are the numbers of correct classifications, while those in the off-diagonal cells are the numbers of incorrect classifications. The percentage of correct classifications relative to the total number of observations is also shown for each noise number. The results reveal high classification accuracies irrespective of the use or type of filter. In addition, there are no significant differences between the spectrogram image classifications for the four methods, because there was enough input data to classify and the CNN performed well. Note: in the tables, the correct classifications in the diagonal cells and the incorrect classifications in the remaining cells are shown in different colors.
Figure 4 shows the receiver operating characteristic (ROC) curves of the multiclass classification for Table 2. The classification performance was confirmed by the ROC curves of all noises, which lie close to the top and left-hand borders. The area under the ROC curve (AUC), a score that summarizes the quality of the classification, was 1.0 for all noises.
As can be seen from Table 3, when the environmental noises in the spectrogram images with the time length of Set B were classified using the four different methods, the classification accuracies were similar to those of Set A. For subway noise (N8) and traffic noise (N9), the classification rates with the sharpening mask and median filter were much better than those without the filters.

Data Analysis
Café noise (N1) was misclassified as subway noise (N8) and traffic noise (N9) because the irregular high-frequency noises in the café presented energy distributions similar to those of subway noise (N8) and traffic noise (N9). Subway noise (N8) was also misclassified as traffic noise (N9). Because the energy distributions of these two noises are similar, neither the sharpening mask (c) nor the median filter (d) had a significant effect on separating them.
Figure 5 shows the receiver operating characteristic (ROC) curves of the multiclass classification for Table 3. The classification performance was confirmed by the ROC curves of all noises, which lie close to the top and left-hand borders. The area under the ROC curve (AUC), a score that summarizes the quality of the classification, was 1.0 for all noises.
Table 4 presents the environmental noise classification results for the time length of Set C, which produced lower image classification accuracies for some noise types than the classifications for Sets A and B. Specifically, the classification accuracies obtained with the median filter (Method #2) were lower than the other results. This means that the characteristics and distribution of the noise could not be distinguished over longer time lengths because of the smoothing effect of the median filter.
In Table 4c, café noise (N1) is incorrectly classified as subway noise (N8) and traffic noise (N9), and traffic noise (N9) is incorrectly classified as café noise (N1) and subway noise (N8), resulting in a reduced overall classification accuracy. Café noise (N1) and traffic noise (N9) have similar energy distributions because they contain multiple voices in otherwise complex environments, with the sounds concentrated in the low-frequency range. For café noise (N1), subway noise (N8), and traffic noise (N9), for which the conventional method produces relatively low classification accuracies, the proposed algorithm affords significant improvements.
Table 4. Summary of the classification accuracy (%) applying different methods in Set C: (a) conventional method; (b) Method #1, only the sharpening mask is applied; (c) Method #2, only the median filter is applied; (d) proposed algorithm, both the sharpening mask and the median filter are applied. The time length of Set C is 4 s.
Figure 6 shows the receiver operating characteristic (ROC) curves of the multiclass classification for Table 4. The classification performance was confirmed by the ROC curves of all noises, which lie close to the top and left-hand borders. The area under the ROC curve (AUC), a score that summarizes the quality of the classification, was 0.99 for all noises.
Table 5 presents the environmental noise classification results for the time length of Set D. A comparison of Tables 5a and 5d shows that the proposed algorithm produces a 97.93% classification accuracy for café noise (N1), which is high compared with the other methods. In addition, the classification accuracy for traffic noise (N9) also increased to 98.22% with the proposed algorithm.

Overall, the proposed algorithm produces >96.4% classification accuracy for all environmental noises. This shows that the classification accuracy does not decrease significantly even for the time length of Set D when the two types of filters are applied to the input data of the CNNs.
Figure 7 shows the receiver operating characteristic (ROC) curves of the multilabel classification for Table 5. The classification performance was confirmed by the ROC curves of all noises, which lie close to the top and left-hand borders. The area under the ROC curve (AUC), a score that describes the quality of the classification performance, was 1.0 for all noise types.
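The per-noise AUC values reported for the ROC curves can be illustrated with a short sketch. The text does not state how the curves were computed, so the sketch assumes a one-vs-rest scheme in which each noise class is scored against all others using the class probabilities output by the network. The function names and the toy probabilities are hypothetical and are not data from this study.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, true_classes, num_classes):
    """Return one AUC per class, treating that class as positive
    and every other class as negative (assumed scheme)."""
    aucs = []
    for c in range(num_classes):
        scores = [row[c] for row in prob_rows]
        labels = [t == c for t in true_classes]
        aucs.append(binary_auc(scores, labels))
    return aucs


# Made-up class probabilities for three classes (illustration only).
probs = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],  # a class-0 sample that the model confuses with class 1
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]
truth = [0, 0, 1, 1, 2, 2]

aucs = one_vs_rest_auc(probs, truth, num_classes=3)
for cls, auc in enumerate(aucs):
    print(f"class {cls}: AUC = {auc:.4f}")
```

In this toy case the confused sample lowers the AUC of the two classes involved while the third class stays at 1.0, which is the same pattern of pairwise confusion discussed for café, subway, and traffic noise.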

Conclusions
In this study, we proposed an algorithm for the classification of environmental noises in hearing aids and verified its performance. The proposed algorithm transforms the sound data into image data for use as the input of CNNs. The spectrogram images were generated by dividing 10 environmental noises into segments of four different time lengths, and the correct classification accuracies were compared for the cases in which a sharpening mask, a median filter, or both were applied to the image data. We found that the correct noise classification accuracy of the CNNs gradually decreased with increasing time length of the spectrogram images owing to the randomly changing noise characteristics. Regarding the type of filter used, the classification accuracy for the sharpening mask was higher than that for the median filter. In other words, it was more effective to sharpen the boundaries of the energy distribution in the spectrogram images than to remove the noise pixels from the images with obvious colors. In particular, the combined use of the sharpening mask and median filter for the spectrogram time length of Set D increased the classification accuracy from 95.24% when no filter was used to 98.73%, which is comparable to the classification accuracy (98.79%) without a filter (conventional method) for the time length of Set A.

The proposed noise classification algorithm thus enables long-term noise estimation and classification for hearing aids with low computational complexity, as well as environmental noise monitoring over a period of time, eliminating the need for real-time noise estimation. In addition, other types of filters that can clearly identify noise characteristics could be combined with the proposed approach to further improve the use of CNNs for noise classification and thereby enhance the performance of hearing aids.
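The two image filters compared in this study can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: the 3x3 Laplacian-based sharpening kernel, the 3x3 median window, the order of the two filters, the edge replication, and the [0, 1] pixel range are all assumed, and the small array stands in for a real spectrogram image.

```python
import numpy as np

# Assumed 3x3 sharpening mask (a common Laplacian-based kernel); the mask
# coefficients actually used in the study are not given in this section.
SHARPEN_KERNEL = np.array([[0.0, -1.0, 0.0],
                           [-1.0, 5.0, -1.0],
                           [0.0, -1.0, 0.0]])


def neighborhoods3x3(image):
    """Return an (H, W, 9) array holding the 3x3 neighborhood of every pixel.

    Border pixels are handled by replicating the edge values.
    """
    height, width = image.shape
    padded = np.pad(image, 1, mode="edge")
    shifted = [padded[r:r + height, c:c + width]
               for r in range(3) for c in range(3)]
    return np.stack(shifted, axis=-1)


def sharpen3x3(image):
    """Weight each neighborhood by the sharpening kernel and sum."""
    return (neighborhoods3x3(image) * SHARPEN_KERNEL.ravel()).sum(axis=-1)


def median3x3(image):
    """Replace each pixel by the median of its 3x3 neighborhood."""
    return np.median(neighborhoods3x3(image), axis=-1)


def preprocess(image):
    """Sharpen, then median-filter, then clip to the assumed [0, 1] range."""
    return np.clip(median3x3(sharpen3x3(image)), 0.0, 1.0)


# Synthetic stand-in for a spectrogram image (not data from the study):
# a sharp energy boundary at column 3 plus one isolated outlier pixel.
spec = np.zeros((6, 6))
spec[:, 3:] = 1.0
spec[1, 1] = 1.0

despeckled = median3x3(spec)   # outlier removed, boundary preserved
sharpened = sharpen3x3(spec)   # contrast across the boundary exaggerated
processed = preprocess(spec)   # both filters combined
```

On this toy input the median filter removes the isolated pixel while keeping the boundary, and the sharpening mask overshoots on both sides of the boundary, which matches the qualitative roles the two filters play in the comparison above.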