Auditory Property-Based Features and Artificial Neural Network Classifiers for the Automatic Detection of Low-Intensity Snoring/Breathing Episodes

Abstract: The definitive diagnosis of obstructive sleep apnea syndrome (OSAS) is made using an overnight polysomnography (PSG) test. This test requires that a patient wear multiple measurement sensors during an overnight hospitalization. However, this setup imposes physical constraints and a heavy burden on the patient. Recent studies have reported on another technique for conducting OSAS screening based on snoring/breathing episodes (SBEs) extracted from recorded data acquired by a noncontact microphone. However, SBEs have a high dynamic range and are barely audible at intensities >90 dB. A method is needed to detect SBEs even in low-signal-to-noise-ratio (SNR) environments. Therefore, we previously developed a method for the automatic detection of low-intensity SBEs using an artificial neural network (ANN). However, for practical use, this method required further improvement in terms of detection accuracy and speed. To accomplish this, we propose in this study a new method to detect low-intensity SBEs based on neural activity pattern (NAP)-based cepstral coefficients (NAPCC) and ANN classifiers. Comparison results of the leave-one-out cross-validation demonstrated that our proposed method is superior to previous methods for the classification of SBEs and non-SBEs, even in low-SNR conditions (accuracy: 85.99 ± 5.69% vs. 75.64 ± 18.8%).


Introduction
Obstructive sleep apnea syndrome (OSAS) is characterized by complete or incomplete obstruction of the upper airway during sleep. The main symptoms of OSAS are light sleep, excessive daytime sleepiness, and snoring; these are said to increase the risk of developing serious illnesses, such as ischemic heart disease, hypertension, stroke, and cognitive dysfunction [1]. Furthermore, it is said that 6-19% of females and 13-33% of males have OSAS, with the prevalence rate increasing with age [2,3]. A definitive diagnosis of OSAS is currently made using polysomnography (PSG) tests. However, this test requires multiple measurement sensors (e.g., oral thermistor, nasal pressure cannula, chest belt) to be worn directly on the body all night, which imposes a heavy burden on the patient. Previous studies suggested that the discomfort of wearing multiple sensors during PSG and the resulting restriction of movement affect sleep efficiency, electroencephalographic (EEG) spectral power, and rapid eye movements [4][5][6][7].
In [8][9][10][11][12][13], snoring characteristics were extracted using ZCR, MFCC, and other statistical processing from the respiratory sounds during sleep which were obtained from patients; it was then shown that SBE sections could be classified using deep learning with accuracies in the range of 75.1-96.8% in various environments, including noise.
In [14][15][16][17][18], the effectiveness of various characteristics in OSAS and non-OSAS patients in terms of temporal, frequency, intensity, and clinical features was evaluated to characterize OSAS-related upper airway obstruction.
In [19][20][21], snoring sounds obtained from noncontact microphones were segmented, features were extracted by statistical processing and formant acoustic analysis, and machine learning tools, such as logistic regression and AdaBoost were used to classify OSAS/non-OSAS and sleep/waking states at sensitivities in the range of 80-90%, while doing so at a low cost.
As shown in [22][23][24][25], it has recently been suggested that sleep-awake activity and sleep quality could be estimated based on the analysis of respiratory sounds obtained during sleep. These results emphasized the importance of detecting SBEs during sleep. The automatic detection of SBEs from sleep sounds is the first step for automatic OSAS screening based on snoring. However, SBEs have a high dynamic range and are barely audible at intensities >90 dB. Specifically, there is a need for a method to automatically detect low-intensity SBEs without any contact, even in a low-SNR environment.
Therefore, our research group has been developing a system that automatically detects low-intensity SBEs from sleep sounds obtained by noncontact recording [10,12]. It has been suggested that the automatic detection of low-intensity SBEs has a high performance compared with other methods proposed in recent studies. However, the calculation speed and performance must be improved further for practical use.
The purpose of our study was to develop a more efficient method to detect low-intensity SBEs in sleep sound recordings.
Even if low-intensity SBEs are present in sleep sounds, human hearing can distinguish them from sleep sounds by careful listening. This is because the human auditory pathway has an innate function which is used to analyze the fine temporal characteristics of sound. The auditory image model (AIM) [26][27][28], which simulates a human auditory mechanism from an engineering perspective, was developed by Patterson in 1995 [26].
To generate a stabilized auditory image (SAI), the AIM applies strobed temporal integration, which models the transformation of the signal as it flows from the cochlea up the auditory nerve to the brain. For sound event classification [1,13,29], front-end, ear-like audio analysis has been conducted by generating features extracted from an SAI. However, the calculation of an SAI requires large computational and memory costs. Meanwhile, it has been reported that the peaks corresponding to glottal pulses, on which sound event detection is based, are already apparent in the neural activity pattern (NAP) from which the SAI is computed [30][31][32]. Furthermore, the NAP, which produces spectral profiles from AIM, has been used for sound recognition and for the analysis of cochlear implant representations [33,34]. From these reports, we hypothesize that the NAP carries information on the presence or absence of sound events even before SAI modeling.
A novel aspect of this study is that we propose the new feature, NAP-based cepstral coefficients (NAPCC), for the automatic, accurate, and faster detection of low-intensity SBEs in sleep sound recordings.
Based on leave-one-out cross validation of sleep sound data stored in a database, the performance of the proposed method was investigated and compared with that of the low-intensity SBEs detection method developed in our previous study in 2018 [10,12].
To date, sleep-awake evaluation methods and OSAS screening methods have been developed using SBEs obtained with a noncontact approach [19][20][21]. High-intensity SBEs can be detected by an energy-based approach; however, if low-intensity SBEs can also be detected efficiently and automatically with the method proposed in this study, then the presence or absence of the patient's breathing can be estimated from the recorded data, regardless of SBE intensity.
Noncontact approaches based on sleep sound analysis were developed with the objective of providing a cost-effective alternative for OSAS diagnosis. Incorporating the proposed method in these approaches may enable more accurate OSAS screening and sleep stage evaluations.

Snoring/Breathing Episodes
This study was conducted after obtaining approval from the ethics review boards of the Division of Science and Technology, Graduate School of Technology, Industrial and Social Sciences, Tokushima University, and Anan Kyoei Hospital. Sleep sounds were recorded during a PSG test conducted at the Anan Kyoei Hospital. A microphone (Model NT3, RODE, Sydney, Australia) was placed approximately 50 cm away from the patient, and its distance could vary from 40 to 70 cm depending on the patient's movements. Sleep sounds were recorded using a preamplifier (Mobile pre USB, M-Audio, CA, USA), with a sampling rate of 44.1 kHz and digital resolution of 16 bits/sample. However, in this study, the recorded data were downsampled to 11.025 kHz at the time of analysis in consideration of the main SBE components [35,36].
The SBEs and non-SBEs used in this study were identified by three annotators who carefully listened to the recorded data. The final SBE/non-SBE sections were determined by averaging the start and end points identified by the three annotators. The degree of agreement among the three annotators was calculated using Cohen's kappa [37,38] to guarantee the reliability of the annotations. The SNRs of the SBEs included in the recorded data were calculated from the annotation results using the following equation:

$$\mathrm{SNR} = 10 \log_{10} \frac{P_S}{P_N}$$

Herein, $P_S$ and $P_N$ denote the SBE and noise power, respectively. The recorded data used in this study consisted of a 120 s section extracted from the 6 h sleep sound data, in which multiple low-intensity SBEs were present and which was composed of SBE and non-SBE sections. To evaluate the performance of the proposed method in low-SNR conditions, the recorded data were selected to satisfy the following conditions:

1. The amplitude of the SBEs within the 120 s interval did not change considerably across all the recorded data.
2. SBEs with low SNR were repeated in the 120 s interval of recorded data.

Two experiments were conducted: (1) Exp-1: SBE detection from the recorded data of 25 individuals, wherein the non-SBEs included only silence periods, and (2) Exp-2: SBE detection from the recorded data of 15 individuals, wherein the non-SBEs included talking, alarm sounds, footsteps, and fan noise, which may occur during actual sleep. Table 1 shows the subject record databases used in Exp-1 and Exp-2. It can be observed from Table 1 that in Exp-1, the average SNR of the SBEs recorded from the OSAS (AHI > 10) and non-OSAS (AHI < 10) subjects ranged from −8.34 ± 1.40 to 0.88 ± 3.24 dB. In Exp-2, the average SNR of the SBEs recorded from the OSAS and non-OSAS subjects ranged from −13.84 ± 4.02 to 0.05 ± 3.22 dB. The average numbers of segments with SBEs and silence used in Exp-1 were 54.7 ± 13.2 and 54.1 ± 12.4, respectively. The number of segments with SBEs/non-SBEs used in Exp-2 is described in detail in Table 2. It can be observed from Table 2 that non-SBEs that are expected to occur during actual sleep were used in the experiment.
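As an illustration of how the SNR values reported for the SBEs could be obtained from annotated sections, the following minimal sketch computes an SNR from the mean power of an SBE section and of a noise section. It assumes the common decibel definition SNR = 10 log10(P_S / P_N); the function names and the toy samples are hypothetical and are not taken from the study.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(sbe_samples, noise_samples):
    """SNR in dB from the SBE power P_S and the noise power P_N.

    Assumption: SNR = 10 * log10(P_S / P_N).
    """
    p_s = mean_power(sbe_samples)
    p_n = mean_power(noise_samples)
    return 10.0 * math.log10(p_s / p_n)


# Toy example: an SBE whose amplitude is half that of the background noise,
# which gives a negative SNR (the SBE is weaker than the noise).
sbe = [0.5, -0.5, 0.5, -0.5]
noise = [1.0, -1.0, 1.0, -1.0]
snr = snr_db(sbe, noise)
```

A negative value, as in this toy case, corresponds to an SBE that is weaker than the surrounding noise, which is the condition targeted in this study.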

Auditory Property-Based Features and Artificial Neural Network Classifiers
We describe herein a new method that combines auditory model-based features with artificial neural network (ANN) classifiers to quickly detect low-intensity SBEs in sleep sound recordings. Humans can distinguish small sound events and silence from sleep sounds. Therefore, in this study, we used the AIM of Patterson et al. [26], which simulates the processing mechanism of the auditory system. AIM consists of precochlear processing (PCP), basilar membrane motion (BMM), NAP, and stabilized auditory image (SAI) modules, where the SAI is obtained from the NAP by strobed temporal integration. However, given that (i) the calculation of the SAI requires large computational and memory costs and (ii) the information needed to detect sound events is already available prior to the strobed temporal integration processing [39], we used only the PCP, BMM, and NAP modules of the AIM. Figures 1 and 2 show the flow charts of the automatic SBE detection system proposed in this study. As shown in Figure 1, the 120 s recorded data segment first underwent preprocessing with a bandpass filter that simulated the characteristics of the outer and middle ears at the PCP stage of AIM (lower cutoff frequency: 1000 Hz; upper cutoff frequency: 6000 Hz) [40].
After passing through the PCP stage, the recorded data were divided into windows with a width of 1024 samples and a shift width of 1 sample. Subsequently, at the BMM stage, filtering was conducted with an auditory filter bank that simulated the cochlear frequency analysis mechanism. In this study, we used a gammachirp filter bank composed of 50 channel filters between the asymptotic frequencies of 100 Hz and 5000 Hz as the auditory filter bank. Given that the NAP stage simulates the phase-locking characteristics observed when BMM physical information is converted into auditory nerve information, the filter output obtained from the BMM stage was processed by half-wave rectification and a low-pass filter. The output obtained from the NAP stage was framed with a frame size of 1024 samples and a shift size of 512 samples. Through this frame processing, a total of 2582 frames were obtained for the 120 s recorded data.
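The frame count stated above can be checked with a short calculation, and the rectification and smoothing step can be sketched at the same time. The example below is only a sketch under stated assumptions: incomplete trailing frames are assumed to be discarded, and the one-pole low-pass filter with coefficient `alpha` is a hypothetical stand-in, since the text does not specify the low-pass filter used in the NAP stage.

```python
def frame_starts(num_samples, frame_size=1024, shift=512):
    """Start indices of all complete frames (trailing partial frame dropped)."""
    if num_samples < frame_size:
        return []
    return list(range(0, num_samples - frame_size + 1, shift))


def half_wave_rectify(signal):
    """Keep positive values and set negative values to zero."""
    return [x if x > 0.0 else 0.0 for x in signal]


def one_pole_lowpass(signal, alpha=0.9):
    """Hypothetical smoothing filter: y[n] = alpha * y[n-1] + (1 - alpha) * x[n]."""
    out = []
    prev = 0.0
    for x in signal:
        prev = alpha * prev + (1.0 - alpha) * x
        out.append(prev)
    return out


SAMPLE_RATE_HZ = 11025   # sampling rate after downsampling
DURATION_S = 120         # length of each recorded segment
num_samples = SAMPLE_RATE_HZ * DURATION_S
num_frames = len(frame_starts(num_samples))
```

With 1,323,000 samples, a frame size of 1024, and a shift of 512, this yields 2582 complete frames, in agreement with the number given in the text.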
We applied a power-law nonlinearity with an exponent of 1/15 to each NAP frame to derive cepstral features. Furthermore, the discrete cosine transform (DCT), which has the property of energy compaction, was then applied.
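The compression and DCT steps can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes that each NAP frame has already been reduced to one non-negative value per filter-bank channel (for example, a per-channel mean, which the text does not specify), and it assumes an orthonormal DCT-II. The function names are hypothetical.

```python
import math


def power_law(values, exponent=1.0 / 15.0):
    """Power-law compression; inputs are assumed to be non-negative."""
    return [v ** exponent for v in values]


def dct_ii(values):
    """Orthonormal DCT-II of a list of real values (assumed DCT variant)."""
    n = len(values)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def cepstral_coefficients(channel_values, num_coeffs=13):
    """Compress the per-channel values, apply the DCT, keep the first coefficients."""
    return dct_ii(power_law(channel_values))[:num_coeffs]


# Toy frame: 50 channels with identical values, so all the energy
# is compacted into the first DCT coefficient.
flat_frame = [1.0] * 50
coeffs = cepstral_coefficients(flat_frame)
```

For a spectrally flat toy frame, only the first coefficient is non-zero, which illustrates the energy compaction property mentioned above.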
The output extracted from the DCT generally needed to be normalized before the classification process. However, there is no optimal way of normalization or formal correction, as described in [41]. Thus, the output extracted from the DCT was normalized using mean normalization [42] and sigmoid normalization [43], which are typically performed on the analysis data. Herein, we compared the effect of mean normalization with that of sigmoid normalization on the output from the DCT.
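The two normalization options compared here can be illustrated with a minimal sketch. The exact formulas are not given in the text, so the definitions below are assumptions: mean normalization is taken to be subtraction of the mean of a coefficient over all frames, and sigmoid normalization is taken to be a logistic function applied to the z-scored coefficient. Function names are hypothetical.

```python
import math
import statistics


def mean_normalize(values):
    """Assumed form: subtract the mean over frames from each value."""
    mu = statistics.mean(values)
    return [v - mu for v in values]


def sigmoid_normalize(values):
    """Assumed form: logistic function of the z-scored value, mapped into (0, 1)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        # A constant sequence carries no contrast; map it to the midpoint.
        return [0.5 for _ in values]
    return [1.0 / (1.0 + math.exp(-(v - mu) / sigma)) for v in values]


# One DCT coefficient observed over three frames (toy values).
track = [1.0, 2.0, 3.0]
centered = mean_normalize(track)
squashed = sigmoid_normalize(track)
```

Under these assumptions, the sigmoid variant bounds every feature to the interval (0, 1) and limits the influence of outlying frames, whereas mean normalization only removes the offset.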
In this study, we used a new feature extraction algorithm called NAPCC that was based on auditory processing which corresponded to the above procedure. Figure 2 shows the structure of the new NAPCC approach that we introduce in this study.
We used ANNs that take the NAPCC as input as discriminators for the classification of SBE and non-SBE sections in the recorded data. Two types of classifiers were used: a multilayer perceptron (MLP)-ANN and a radial basis function (RBF)-ANN. The MLP-ANN consists of three layers: an input layer, a hidden layer, and an output layer [44]. Herein, the output function of the hidden layer units was a hyperbolic tangent function, and the transfer function of the output layer unit was a linear function. The numbers of units in the hidden and output layers of the MLP-ANN were 10 and 1, respectively.
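The forward pass of an MLP with this structure (hyperbolic tangent hidden units and a single linear output unit) can be sketched as follows. The weights shown are placeholders chosen for illustration, not trained values, and the function and variable names are hypothetical.

```python
import math


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Three-layer MLP: tanh hidden units followed by one linear output unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Placeholder network: 13 inputs (one NAPCC vector), 10 hidden units, 1 output.
NUM_INPUTS, NUM_HIDDEN = 13, 10
w_hidden = [[0.0] * NUM_INPUTS for _ in range(NUM_HIDDEN)]
b_hidden = [0.0] * NUM_HIDDEN
w_out = [0.0] * NUM_HIDDEN
b_out = 0.25

napcc_frame = [0.5] * NUM_INPUTS
score = mlp_forward(napcc_frame, w_hidden, b_hidden, w_out, b_out)
```

In practice, the weights would be obtained by training, and the output score would be compared with a threshold to label each frame as SBE or non-SBE.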
The RBF-ANN is also composed of three layers: an input layer, a hidden radial basis function layer, and an output layer. The numbers of units in the hidden and output layers of the RBF-ANN were also 10 and 1, respectively. The weighted input of each RBF hidden unit is computed as the ratio of the Euclidean distance between the weight vector and the input vector to the spread parameter (σ), which controls the sensitivity of the radial basis neuron. In this work, the spread parameter (σ) was set to 1. This type of network is known to have a strong tolerance to input noise, to train quickly, and to respond well to test patterns [45,46].
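The hidden-layer computation described above can be sketched as follows. The ratio of the Euclidean distance to the spread parameter follows the description in the text; the Gaussian activation exp(-z^2) applied to that ratio is an assumption, since the text does not state the basis function explicitly. The centers and weights are placeholders, and all names are hypothetical.

```python
import math


def rbf_hidden(x, centers, spread=1.0):
    """Activation of each radial basis unit for the input vector x."""
    activations = []
    for center in centers:
        distance = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
        z = distance / spread            # weighted input: distance / sigma
        activations.append(math.exp(-z * z))  # assumed Gaussian basis function
    return activations


def rbf_forward(x, centers, w_out, b_out, spread=1.0):
    """Linear output unit applied to the radial basis activations."""
    hidden = rbf_hidden(x, centers, spread)
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Two placeholder units in a 2-D input space.
centers = [[0.0, 0.0], [1.0, 0.0]]
activations = rbf_hidden([0.0, 0.0], centers, spread=1.0)
output = rbf_forward([0.0, 0.0], centers, w_out=[1.0, 0.0], b_out=0.0)
```

A unit responds most strongly when the input coincides with its weight vector, and a larger spread makes the response decay more slowly with distance.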
The NAPCC extracted from each frame of the recorded data were given as input to the input layer of both ANNs. Target signals of 1 and 0 were assigned to the SBE and non-SBE sections, respectively. To prevent overfitting, both ANNs were trained by the error back propagation method based on the Levenberg-Marquardt method with early stopping [43,44]. The output obtained after learning was subjected to 4th-order median filtering to eliminate the influence of sudden changes.
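The smoothing of the classifier output can be illustrated with a small median filter. This is a sketch under assumptions: the "4th-order" filter is interpreted as a sliding window of four samples, the window is assumed to cover two samples before the current sample through one sample after it, and samples outside the signal are treated as zeros. None of these details are specified in the text.

```python
import statistics


def median_filter(values, order=4):
    """Sliding-window median with zero padding at both ends.

    Assumed alignment: the window for index k covers
    k - order // 2 through k - order // 2 + order - 1.
    """
    half = order // 2
    n = len(values)
    filtered = []
    for k in range(n):
        window = [
            values[j] if 0 <= j < n else 0.0
            for j in range(k - half, k - half + order)
        ]
        filtered.append(statistics.median(window))
    return filtered


# An isolated one-frame spike is removed, while a sustained run is preserved.
spike = median_filter([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
run = median_filter([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
```

The example shows the intended effect: isolated single-frame jumps in the output are suppressed, whereas runs of several consecutive frames survive the filtering.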

Evaluation of the Performance of the Proposed Detection Method
The effectiveness of the SBE/non-SBE classification of the proposed method was validated using leave-one-out cross-validation. The recorded data of N individuals were divided into the recorded data of one individual for testing and the recorded data of the remaining N − 1 individuals for training, and the validation process was repeated N times so that the recorded data of each individual were selected once as test data. The training data were composed of NAPCC patterns extracted frame by frame from the recorded data of the N − 1 individuals. The testing data were composed of NAPCC patterns extracted frame by frame from the recorded data of one individual.
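The subject-wise splitting described above can be sketched as a small generator. The subject identifiers are hypothetical; only the splitting logic is shown, without feature extraction or training.

```python
def leave_one_out_splits(subject_ids):
    """Yield (training subjects, test subject) pairs, one per subject."""
    for index, test_id in enumerate(subject_ids):
        train_ids = subject_ids[:index] + subject_ids[index + 1:]
        yield train_ids, test_id


# With N = 25 subjects (as in Exp-1), there are 25 folds with 24 training subjects each.
subjects = list(range(1, 26))
folds = list(leave_one_out_splits(subjects))
```

Because the split is made per subject rather than per frame, no frames from the test subject are seen during training.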
Receiver operating characteristic (ROC) analysis was conducted on the output of the ANN trained with the training data, and the test data were used to estimate the optimal threshold $T_h$ for classifying SBEs and non-SBEs. The optimal threshold $T_h$ is the value that minimizes the Euclidean distance to the point on the ROC plane at which both the sensitivity and the specificity equal 1 [47]. Based on the threshold value estimated in this way, the SBE/non-SBE classification accuracies for the test data were calculated. Specifically, the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score were estimated. As this study used the leave-one-out cross-validation method, validation was conducted N times in total, and the means and standard deviations of the classification accuracies over the N validations were calculated.
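The threshold selection rule and the evaluation metrics can be sketched as follows. This is a minimal illustration, not the authors' code: candidate thresholds are taken from the observed scores, a frame is labeled SBE when its score is greater than or equal to the threshold, and undefined ratios are returned as 0.0. These conventions and all names are assumptions.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting SBE (1) for scores >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, tn, fn):
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def optimal_threshold(scores, labels):
    """Threshold minimizing the distance to (sensitivity, specificity) = (1, 1)."""
    best_threshold, best_distance = None, float("inf")
    for candidate in sorted(set(scores)):
        m = metrics(*confusion_counts(scores, labels, candidate))
        distance = math.hypot(1.0 - m["sensitivity"], 1.0 - m["specificity"])
        if distance < best_distance:
            best_threshold, best_distance = candidate, distance
    return best_threshold


# Toy scores and labels (1 = SBE, 0 = non-SBE).
scores = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
t_h = optimal_threshold(scores, labels)
result = metrics(*confusion_counts(scores, labels, t_h))
```

In this perfectly separable toy case, the selected threshold lies at the lowest SBE score and all metrics equal 1; real data would yield a compromise between sensitivity and specificity.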

Results
Fifty-dimensional features extracted from the DCT were normalized using mean or sigmoid normalization. In consideration of the initial value dependence of the MLP-ANN, leave-one-out cross-validation was used, and a mean F1 score was obtained. Subsequently, the initial values of the MLP-ANN were changed, and the validation was repeated 10 times; a performance evaluation of the proposed method was then conducted based on the trial results that maximized the F1 score. Table 3 shows the results of the inter-annotator agreement of the three human judges using Cohen's kappa for our careful listening process of the sleep sound recordings. From this table, we can confirm that our annotators achieved a kappa coefficient > 0.9, which indicates almost perfect agreement.
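For reference, Cohen's kappa between two annotators can be computed as in the sketch below. It assumes that agreement is evaluated on per-frame binary labels and that the three annotators are compared pairwise; the text does not state these details, and the label sequences shown are toy values.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement: product of the marginal label frequencies.
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels for four frames (1 = SBE, 0 = non-SBE).
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 1, 0, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Values close to 1 indicate agreement well beyond chance; the toy example above gives a much lower value because the two sequences disagree on one of only four frames.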

Normalization Used for NAPCC and Optimum Number of NAPCC
In this section, we varied the number of NAPCC dimensions presented to the MLP-ANN to investigate how the number of dimensions and the normalization method influenced the automatic extraction performance of SBEs. Figure 3 shows the relationship between the mean F1 score and the number of NAPCC dimensions when sigmoid or mean normalizations were used in Exp-1. Figure 4 shows the respective results for Exp-2.
Appl. Sci. 2022, 12, x FOR PEER REVIEW


It can be observed from Figure 3 that the F1 score of sigmoid normalization was approximately 0.5% and 1% higher than that of mean normalization in Exp-1 and Exp-2, respectively. There were no large differences in the standard deviation between the sigmoid and the mean normalizations. Given that sigmoid normalization seems to perform better based on the results obtained in these experiments, we employed the sigmoid normalization in the NAPCC extraction process in this study.
According to the results obtained with sigmoid normalization in Figure 3, the product of the mean F1 scores of Exp-1 and Exp-2, calculated for each number of dimensions, reached its maximum at 13 dimensions. The 13-dimensional NAPCC was therefore adopted as the optimal feature vector of the proposed method. In the following sections, this feature vector was used to evaluate the performance of the proposed method and to compare it with the previous method. Figure 4 shows (as examples) (a) 20 s of recorded data, (b) the NAP output obtained by analyzing the recorded data, (c) the output of the trained MLP-ANN, and (d) the labeling results of the three annotators (1 for SBE sections, 0 for non-SBE sections). It can be confirmed from these figures that the NAP output and the trained MLP-ANN output reflected even the information of SBEs that appeared to be buried in background noise.
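The selection rule for the number of dimensions can be written compactly. The sketch below multiplies the mean F1 scores of the two experiments for each candidate dimension and returns the dimension with the largest product; the F1 values are hypothetical placeholders for illustration and are not the values plotted in Figure 3.

```python
def best_dimension(f1_exp1, f1_exp2):
    """Return the number of dimensions maximizing the product of the mean F1 scores."""
    products = {
        dim: f1_exp1[dim] * f1_exp2[dim] for dim in f1_exp1 if dim in f1_exp2
    }
    return max(products, key=products.get)


# Hypothetical mean F1 scores keyed by the number of NAPCC dimensions.
f1_exp1 = {12: 0.80, 13: 0.82, 14: 0.83}
f1_exp2 = {12: 0.78, 13: 0.81, 14: 0.79}
selected = best_dimension(f1_exp1, f1_exp2)
```

Using the product rather than either score alone favors a dimension that performs well in both experiments at once.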

Evaluation of the Performance of the Proposed Method and Comparison of the Proposed Method with Our Previous Method
In our previous work, we developed an SBE/non-SBE classification technique based on an MLP-ANN, which was used as a time-series classifier [12]. When this MLP-ANN was used as a subject-independent classifier, we showed that our previous method can classify low-intensity SBEs and low-intensity non-SBEs from sleep sounds recorded in noisy environments with an average accuracy of 75.10%. The performance of our previous technique was compared with that of other recent techniques. Even though we focused on the detection of low-intensity SBEs in sleep sounds, the classification accuracy was as good as that attained with recent techniques. In this study, we compared the performance of the proposed method with that of this previous method. For this purpose, Exp-1 and Exp-2 were performed using the same parameters adopted in our previous work. Tables 4 and 5 show the performance evaluation results of the proposed and the conventional methods for Exp-1 and Exp-2, respectively. As mentioned in Section 3.1, in consideration of the initial value dependence of the MLP-ANN, leave-one-out cross-validation was conducted 10 times, and performance evaluations of the proposed method and the previous method were conducted based on the trial results that maximized the F1 score. According to Table 4, in the case of Exp-1, the mean values obtained with the proposed method were larger than those of our previous method, and the standard deviations of the proposed method were smaller than those of our previous method. As shown in Table 5, the results of Exp-2 show the same trend as those of Exp-1. These results indicate that the performance of the proposed method was superior to that of our previous method even in noisy environments. Figure 5a,b shows the F1 scores of the subjects obtained via the proposed method or our previous method in Exp-1 and Exp-2, respectively. Please note that herein, the mean F1 score value was used after leave-one-out cross-validation was conducted 10 times.
It can be observed from these figures that the F1 score of the proposed method for each subject exceeded the F1 score of the conventional method in most cases. In particular, the results of Exp-2 suggest that the use of the proposed method improved the detection performance for subjects for whom detection was difficult with the conventional method. Exp-1 demonstrates that in most cases, the proposed method worked better than the previous method. In particular, for the recorded data (No. 9 and 10) of subjects whose SBEs were not detected, the F1 score of the proposed method was improved by about 20-30% compared with that of the previous method.
These findings suggest that the proposed method was more effective than the conventional method even when SBEs were detected from the recorded data which contained low-SNR non-SBEs and SBEs.
Regarding subject No. 10 in Exp-1 and subject No. 5 in Exp-2, a degradation in performance was found compared to the results obtained from the other subjects. This could be due to the shorter duration of low-intensity SBEs.
In practical use of the proposed method, the recorded data used in Exp-1 and in Exp-2 would be mixed together. Therefore, in a subsequent experiment, the effectiveness of the proposed method was validated on a dataset constructed by mixing the recorded data used in this study, using two classifiers, namely, MLP-ANN and RBF-ANN. Table 6 lists the performance outcomes of the proposed method for the mixture of the data used in Exp-1 and Exp-2 with both classifiers. These results suggest that, irrespective of the type of ANN, the proposed method can detect SBEs with higher accuracy than the conventional method, even for mixed recorded data.

Discussion and Conclusions
In this study, we proposed a new method that used the auditory property-based features of NAPCC and ANN discriminators to automatically detect low-intensity SBEs from recorded data that were acquired using a noncontact microphone. The effectiveness of the proposed method was investigated by detecting SBEs from the recorded data of 25 individuals, which comprised silence and SBEs (Exp-1), and from the recorded data of 15 individuals, which included non-SBEs regarded as noise generated in an actual environment (Exp-2). Leave-one-out cross-validation was used to evaluate the performance in each experiment. The results suggested that SBEs could be detected with an average accuracy of 85.83% in Exp-1 and of 85.99% in Exp-2. A comparison of performance with the MLP-ANN-based SBE detection method proposed in our previous work [12] showed that the proposed method was approximately 3% better in the case of Exp-1 and approximately 10% better in the case of Exp-2. In particular, the standard deviation in both Exp-1 and Exp-2 became smaller with the proposed method when compared with the conventional method. Hence, it is thought that the influence of each individual subject could be reduced. Furthermore, the large improvement in specificity and PPV suggests that the new method is useful for the effective and automatic detection of silent or apneic sections contained in sleep sounds.
To date, gammatone frequency cepstral coefficients (GFCC) [48] and BMM-based cepstrum coefficients [49] have been developed, but both rely on gammatone filter bank outputs or mean normalization. To model human auditory perception more accurately, a DCGA filter was developed to extend the domain of the gammatone auditory filter. This filter bank accommodates the nonlinear behavior observed in human psychophysics and can be useful for perceptual signal processing [50]. The DCGA filter bank, which is the front-end of NAPCC, may be more noise-robust than the gammatone filter bank. Furthermore, the results obtained in this work showed that the noise robustness of NAPCC was improved by using the sigmoid normalization instead of the mean normalization used in GFCC and BMM-based cepstrum coefficients. Therefore, NAPCC, which uses DCGA and sigmoid normalization, is expected to outperform GFCC and BMM-based cepstrum coefficients.
In particular, to detect low-intensity SBEs in the sleep sound recordings, the method developed in our previous study in 2018 outperformed the recent techniques published up to 2018 [12]. Since 2018, more recent technologies have been developed to classify snoring episodes and non-snoring episodes from sleep sounds.
Lim et al. proposed a recurrent neural network (RNN)-based classification method that was capable of classifying snoring episodes and non-snoring episodes with the use of features obtained from sleep data recorded with a smartphone. The RNN-based classifiers achieved an accuracy of 98.9% using relatively small datasets [9]. However, the snoring segments used in that work were created based on the peak points of snoring signals obtained via a peak-detection algorithm. This means that relatively high-intensity snoring was selected for the analysis.
Jiang et al. [51] proposed an automatic snore detection method using sound maps and a series of neural networks. The results demonstrated that the method is appropriate for identifying snores with an accuracy in the range of 91.8-95.1%. However, in that study, potential snoring episodes were segmented using the improved sub-band spectral entropy method, which is based on sub-band energy calculation.
Shen et al. [8] proposed the use of the MFCC feature extraction method and the LSTM model for the binary classification of snoring data. The experimental results showed that the developed method yielded the highest accuracy rate of 87%. However, very weak snoring sounds were not labeled in the PSG-based data used for the analysis in that study.
Furthermore, a sleep sound classification method based on AIM has been proposed for sleep sounds extracted by an energy-based approach, and it has been confirmed that sleep sounds can be classified with high accuracy [13]. However, there is a need to use multiple acoustic features obtained from SAI which is converted from NAP using strobed temporal integration (STI).
This study has the following advantages. The proposed method can be run at a low computational cost because it eliminates the computationally expensive STI processing used in AIM and can be built using only the stages up to NAP. Additionally, it has been confirmed that the proposed method detects low-intensity SBEs with higher performance than our previous method [12], and the computational speed was also significantly improved. Given that the performance of the proposed method was superior to that of our previous method even in the case of Exp-2, it is suggested that the new feature (NAPCC) proposed in this study is an acoustic feature that is robust against noise.
However, our study has some limitations: (i) a relatively small size of the dataset, which cannot satisfy sound variety, was used; (ii) for SBEs of short duration, the performance of the proposed method was degraded because the output of the NAPCC corresponding to the SBEs became small in the NAPCC spectrogram, which is the ordered series of the NAPCC of each frame for the recorded data; (iii) the proposed method uses the DCGA filter bank approach which has the highest calculation cost in the NAPCC calculation procedure.
The proposed method is expected to contribute as a pretreatment step to OSAS screening based on snoring and respiratory sounds. It is thought to be useful for the effective and automatic identification of respiratory sound information, particularly apneic sections and silence, from sleep sounds acquired without contact.