Electronics
  • Article
  • Open Access

29 June 2024

Detection and Classification of Obstructive Sleep Apnea Using Audio Spectrogram Analysis

1. Laboratory of Digital Signal Processing, Department of Engineering, University of Messina, C.da di Dio, 1 (Vill. S. Agata), 98166 Messina, Italy
2. CNIT Research Unit, Department of Engineering, University of Messina, C.da di Dio, 1 (Vill. S. Agata), 98166 Messina, Italy
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Image Processing and Computer Vision Based on Machine Learning

Abstract

Sleep disorders are steadily increasing in the population and can significantly affect daily life. Low-cost and noninvasive systems that can assist the diagnostic process will become increasingly widespread in the coming years. This work aims to investigate and compare the performance of machine learning-based classifiers for the identification of obstructive sleep apnea–hypopnea (OSAH) events, including apnea/non-apnea status classification, apnea–hypopnea index (AHI) prediction, and AHI severity classification. The dataset considered contains recordings from 192 patients. It is derived from a recently released dataset which contains, among others, audio signals recorded with an ambient microphone placed ∼1 m above the studied subjects and accurate apnea/hypopnea event annotations performed by specialized medical doctors. We employ mel spectrogram images extracted from the environmental audio signals as input to a machine-learning-based classifier for apnea/hypopnea event classification. The proposed approach involves a stacked model which utilizes a combination of a pretrained VGG-like audio classification (VGGish) network and a bidirectional long short-term memory (bi-LSTM) network. Performance analysis was conducted using a 5-fold cross-validation approach, in which patients used for training and validation of the models were left out of the testing step. Comparative evaluations with recently presented methods from the literature demonstrate the advantages of the proposed approach. The proposed architecture can be considered a useful tool for supporting OSAHS diagnoses by means of low-cost devices such as smartphones.

1. Introduction

Sleep, which accounts for a third of human life, is of great importance for maintaining health [1]. Unfortunately, a hidden sleep disorder known as obstructive sleep apnea–hypopnea syndrome (OSAHS) negatively affects the quality of life for many individuals [2,3].
OSAHS is primarily caused by the constriction of the upper airways at different levels. While increased muscle tone during wakefulness prevents the collapse of the upper airways, during sleep, a combination of extraluminal pressure from surrounding soft tissues and negative intraluminal pressure during inspiration can lead to upper airway collapse. In addition, obese people may experience further narrowing of the upper airways, which leads to even more pronounced clinical consequences. In individuals affected by sleep apnea, sleep quality may deteriorate, leading to daytime sleepiness, memory impairment, increased risk of accidents due to excessive sleepiness, and overall decreased productivity. In more severe cases, adults with OSAHS can develop conditions such as high blood pressure, coronary heart disease, stroke, cardiac arrhythmias, and other related conditions. Additionally, in infants, OSAHS can also lead to behavioral disorders and, in extreme cases, even sudden death [4]. People affected by this condition, especially adults, have an increased risk of causing traffic accidents. Moreover, they often suffer from mood swings and depression, which contributes to the significant financial and social impact of OSAHS [5]. Nowadays, approximately 6–13% of the world population suffers from this disease [6], but 80% of apnea patients remain undiagnosed [7]. To represent the severity of OSAHS, a specific parameter called “apnea/hypopnea index” (AHI) is used. This index counts the number of apneas and hypopneas per hour of sleep. A single apnea event occurs when peak inspiratory flow falls below 10% of baseline for at least 10 s [8]. OSAHS is classified as mild if the AHI is in the [5–15] range, moderate if the AHI is in the [16–30] range and severe if the AHI is above 30. To determine the AHI, patients must undergo a clinical examination known as polysomnography (PSG), which is usually performed in a hospital. 
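The AHI severity thresholds stated above can be summarized in a few lines. The following is an illustrative Python sketch (the function name and the "normal" label for AHI < 5 are ours, not taken from the paper):

```python
def ahi_severity(ahi: float) -> str:
    """Map an apnea-hypopnea index (events per hour of sleep) to the
    OSAHS severity classes used in the text: mild for AHI in [5-15],
    moderate for AHI in [16-30], severe for AHI above 30."""
    if ahi < 5:
        return "normal"
    elif ahi <= 15:
        return "mild"
    elif ahi <= 30:
        return "moderate"
    return "severe"
```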
During PSG, various biosignals are recorded, including breathing patterns, heart rate, movements, snoring, oxygen saturation, pharyngeal movements, and others. To capture all these signals, patients must wear sensors that can simultaneously capture and convert them into electrical signals. A recording device then acquires these electrical signals via cables. The presence of these cables, sensors, and devices can cause discomfort for patients, particularly in a setting that differs from the comfort of their own bedroom [9]. As a result, OSAHS diagnoses often go unrecognized, resulting in a significant number of individuals affected by OSAHS remaining untreated [10]. Therefore, it is crucial to promote the advancement of equipment and technologies that can facilitate the diagnosis of OSAHS in a more comfortable way [11].
The aim of the present study is along these lines. Indeed, we investigate the ability to recognize OSAHS just by recording an audio signal, which is subsequently processed with machine learning (ML) approaches to determine the AHI. The proposed investigation starts with an analysis of the audio signal in the frequency domain to obtain mel spectrograms. A spectrogram is a visual representation of the spectrum of frequencies in a signal changing over time. It is a three-dimensional graph where the x-axis represents time, the y-axis represents frequency, and the color intensity or darkness represents the strength of the signal’s energy at each time and frequency. A mel spectrogram is a type of spectrogram that takes into account the principles of human auditory perception, known as psychoacoustics [12]. It is derived from a standard spectrogram but uses a mel scale to map the frequency content of the signal to a perceptually relevant scale. The mel scale is a nonlinear scale that mimics the sensitivity of the human ear to different frequencies. In a mel spectrogram, the frequency axis is divided into mel frequency bins and the intensity of each bin is represented by a color or gray scale, similar to a conventional spectrogram.
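As a rough illustration of how a mel spectrogram is computed, the following numpy-only sketch builds a triangular mel filterbank and applies it to STFT magnitudes. It is a simplified stand-in, not the exact VGGish front end; the function names and parameter defaults (25 ms window, 10 ms hop at 16 kHz, 64 bands between 125 Hz and 7.5 kHz, matching the values reported later in the paper) are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, fmin, fmax):
    # Triangular filters with centers uniformly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def log_mel_spectrogram(x, sr=16000, win=400, hop=160, n_mels=64,
                        fmin=125.0, fmax=7500.0):
    # STFT magnitudes over Hann-windowed frames (25 ms window, 10 ms hop
    # at 16 kHz), then projection onto the mel filterbank and a log.
    n_frames = 1 + (len(x) - win) // hop
    w = np.hanning(win)
    frames = np.stack([x[i * hop:i * hop + win] * w
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=win, axis=1))
    fb = mel_filterbank(n_mels, win, sr, fmin, fmax)
    return np.log(mag @ fb.T + 1e-6)  # shape: (n_frames, n_mels)
```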
VGGish networks, designed for audio classification tasks [13], take these mel spectrograms as input representations and extract high-level features that contribute to the network’s ability to recognize and categorize audio signals.
Since the main objective of this work is the identification/classification of obstructive sleep apnea–hypopnea syndrome events based on features extracted from audio signals recorded during sleep, we investigated the capability of VGGish-based architectures widely used for audio classification. Specifically, we developed a bidirectional network with long short-term memory (bi-LSTM) trained to classify input sequences obtained from a pretrained VGGish network using a transfer learning (TL) approach.
This latter architecture was already presented in [14], where preliminary results were investigated using a reduced subgroup of patients. Specifically, in [14], we proposed a pretrained network based on a VGGish model to perform “apnea” detection based on audio signal processing. The network was trained to classify mel spectrograms extracted from audio signals with a duration of 1 s according to an “apnea”/“no-apnea” classification. To determine the output class for excerpts with a duration of 5 to 15 s, each individual prediction obtained with a step of 250 ms was combined with a majority decision rule. The analysis was performed on audio excerpts from a subset of 25 patients from the dataset described in [15]. We extracted 860 clips that were evenly distributed across both classes. We analyzed confusion matrix charts and compared the performance of our approach with the support vector machine-based classifiers proposed in [16], which served as a baseline.
The present study represents a significant advance by extending and building on these results:
  • Increasing the subgroup of patients from 25 to 192;
  • Performing the training and testing of the networks using a separate subset of data related to the patients (i.e., no data from patients used in the training appears in the test);
  • Presenting the bi-LSTM architecture and performance comparisons in terms of confusion matrix for the identification of OSAHS;
  • Providing results related to AHI prediction and AHI severity classification.
The remainder of the paper is organized as follows: Section 2 presents related work; in Section 3, the dataset used is described; the proposed methodology for the classification of apnea events is described in Section 4; the analysis of the simulation results is provided in Section 5; finally, the conclusions are drawn in Section 6.

3. OSAHS Dataset

As described in the publication by Korompili et al. (2021) [15], data were collected from a cohort of 212 subjects seeking a diagnosis of sleep apnea syndrome (SAS) at the Sleep Study Unit of the Sismanoglio–Amalia Fleming General Hospital in Athens. The audio signals were recorded with a portable two-channel multitrack recorder (Tascam DR-680 MK II, TEAC AMERICA, INC., Santa Fe Springs, CA, USA) and synchronized with the PSG data. The first channel was connected to a contact microphone (Clockaudio CTH100, Clockaudio Ltd., Waterlooville, Hampshire, UK) placed on the patient’s trachea. Simultaneously, the second channel was connected to an ultra-linear measurement condenser microphone (Behringer ECM8000, Behringer, Willich, Germany) positioned approximately 1 m above the patient’s bed, specifically over the head area. Both sound signals were sampled at a rate of 48 kHz and originally recorded at 24 bits per sample. To facilitate storage in European Data Format (EDF) while maintaining synchronization with other polysomnography signals, the bit depth per sample was later reduced to 16. The PSG data comprise 16 channels, incorporating electroencephalogram (EEG), electrooculogram (EOG), leg movement signal, electrocardiogram (ECG), RR interval in the ECG, pulse rate derived from the ECG, changes in thoracic volume, changes in abdominal volume, the nasal/oral flow pressure, the body position, and the oxygen content of the blood (oxygen saturation). The PSG study is conducted by medical professionals in the Sleep Studies Department of the Sismanoglio–Amalia Fleming General Hospital in Athens. Each patient’s recordings are analyzed by two specialists for “sleep stages” and “apnea events” scoring. First, a certified technician performs the initial evaluation, followed by a final evaluation performed by a certified doctor with 30 years of experience.
In this final assessment, the positively annotated events are reviewed and any missed events are added to ensure a comprehensive assessment. The original dataset includes EDF files containing polysomnogram signals for 287 patients, along with RML files containing all annotations provided by the medical team of the Sismanoglio General Hospital of Athens. For our study, we only considered the annotations of the medical team after the automatic rejection of the false-positive apneas, whose annotations appear in the subfolder named “APNEA_RML_clean”. The folder contains annotations for 194 patients. For each patient, we downloaded the RML file with the annotations and the corresponding EDF file from the folder “APNEA_EDF” with the PSG signals and audio recordings. From the latter, we extracted channel number 20, containing the recordings of the ambient microphone (placed 1 m above the patient’s head). We had to remove the recordings of 2 patients (IDs “00001339” and “00001394”) due to errors in the management of the associated EDF files. We then downsampled each recording to 16,000 samples per second using “sox” [31] and saved all files in Microsoft RIFF wav format. In our experiments, we only used the audio signals recorded from the ambient microphone. Microphones are characterized by several features: polar pattern, impedance, sensitivity, frequency response, and dynamic range.
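The 48 kHz to 16 kHz conversion performed with sox can be approximated as follows. This is a minimal numpy sketch using a windowed-sinc anti-aliasing filter and decimation, not sox's actual (much higher-grade) resampler; the function name and filter length are illustrative:

```python
import numpy as np

def downsample_48k_to_16k(x, taps=151):
    """48 kHz -> 16 kHz is an exact 3:1 ratio: low-pass the signal
    below the new Nyquist frequency (8 kHz), then keep every third
    sample. A windowed-sinc FIR with normalized cutoff 1/6 serves
    as the anti-aliasing filter."""
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / 3.0) * np.hamming(taps)  # ideal LPF x Hamming window
    h /= h.sum()                             # unity gain at DC
    return np.convolve(x, h, mode="same")[::3]
```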
The microphone used to record the ambient audio signals on channel 20 of the dataset is a Behringer ECM8000. It has an omnidirectional characteristic, i.e., it picks up sound evenly from all directions and is therefore ideal for recording ambient sounds. It has an impedance of 200 ohms, which ensures compatibility with most audio equipment and helps reduce signal loss. The sensitivity is −39.2 dBV/Pa, which means that even quiet sounds can be picked up clearly and distinctly. The frequency response of 20 Hz to 20,000 Hz enables precise reproduction of the entire spectrum of human hearing and ensures high-fidelity sound recording.
In our vision of a possible application of the proposed OSAHS automatic recognition system, based on the processing of audio signals, we took into account the very high prevalence of smartphones in people’s everyday lives. For their original function as a phone, these devices are all equipped with microphones. Usually, people do not switch off their device at night if they have it lying on their bedside table. Accordingly, for future applications, smartphones can be considered good candidates for capturing audio signals during the night, which can then be processed for automatic OSAHS diagnosis.
Nowadays, smartphones are equipped with high-quality microphones that have evolved considerably in recent years. Modern smartphone microphones use technology based on micro-electromechanical systems (MEMSs), as they are very small, robust, and powerful. They can have both omnidirectional and directional polar patterns. Many smartphones use multiple omnidirectional and directional microphones to improve sound recording quality and enable noise reduction features. To ensure compatibility with the phone’s audio processing circuits, reduce signal loss, capture a broad spectrum, and ensure high-quality audio recording suitable for both speech and ambient sounds, smartphone microphones typically have low impedance (∼200 ohms), high sensitivity (e.g., −42 dBV/Pa), and wide frequency response (20 Hz to 20,000 Hz). In addition, multiple microphones are typically used to filter out background noise for clearer voice pickup in noisy environments. The characteristics of the MEMS microphones in today’s smartphones are similar to or better than those of the Behringer ECM8000. Accordingly, we can assume that the results obtained with this dataset can be transferred to real-world applications with analogous performance.
The annotations of type “Respiratory” were used as a reference and, in particular, grouped into two classes. The first, named “Apnea”, includes the events “Central Apnea”, “Hypopnea”, “Mixed Apnea”, “Obstructive Apnea”, and “Periodic Breathing”. The second, called “No-Apnea”, includes everything that does not belong to the “Apnea” class. To extract data from the two classes in a balanced way, we processed the recordings for each patient and the corresponding annotation file. To obtain apnea/non-apnea excerpts, we analyzed each audio segment whose samples fell into the corresponding category and extracted sequences with a duration of 5 s with a step of 10 s, dividing each segment into evenly spaced windows of 5 s. The files of the “Apnea” segments were labeled with the “patient number”, the label “IN”, a label referring to the apnea category (“CA” = “Central Apnea”, “HA” = “Hypopnea”, “MA” = “Mixed Apnea”, “OA” = “Obstructive Apnea”, “PR” = “Periodic Breathing”), and a sequential number for each patient and subcategory. The “No-Apnea” excerpts were named with the “patient ID”, the designation “OUT”, and a consecutive number for each patient. Table 1 shows the number of excerpts obtained in each category.
Table 1. Distribution of excerpts in the various classes.
In total, the dataset under consideration contains 352,319 nonoverlapping audio excerpts, each lasting 5 s. According to the input format required by VGGish, we extracted mel spectrograms from each audio excerpt, each representing 1 s of audio and with a step size of 0.25 s. Each spectrogram was evaluated as a matrix of 96 × 64 values: 96 is the number of 25 ms partially overlapping frames in each mel spectrogram and 64 represents the number of mel bands, ranging from 125 Hz to 7.5 kHz.
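The patch count follows directly from the windowing parameters: a 5 s excerpt with 1 s patches and a 0.25 s step yields 17 patches. A tiny sketch of this arithmetic (the function name is ours):

```python
def n_patches(clip_s=5.0, patch_s=1.0, hop_s=0.25):
    """Number of 1 s mel-spectrogram patches extracted from one audio
    excerpt with a sliding window of step hop_s, as fed to VGGish."""
    return int(round((clip_s - patch_s) / hop_s)) + 1
```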

4. Proposed Methodology: Deep Neural Networks for Audio Processing

In recent years, the dramatic progress in computing power, mainly related to the introduction of massive multicore GPUs, has driven the use of deep learning techniques to develop classification and regression networks that are applied in many different fields [32]. ML approaches have been widely applied in both the fields of signal processing and telecommunication networks [33,34,35,36,37,38]. Sound classification and recognition is one of the research areas investigated, including music classification applications [39,40,41,42], speech recognition [43], and others.
In this work, among the various solutions available in the literature, we chose to use a CNN based on the VGG family, which was introduced in 2014 by the Visual Geometry Group (VGG) at the University of Oxford. CNNs are a type of artificial neural network architecture specifically designed for analyzing grid-like data, such as images or time series. They use convolutional filters to automatically extract features from the input data as they pass through multiple layers, allowing CNNs to learn increasingly complex representations of the data [44]. VGG models exemplify this approach by utilizing stacks of convolutional layers with small receptive fields (typically 3 × 3) and the rectified linear unit (ReLU) activation function to achieve state-of-the-art performance in various image recognition tasks. VGG networks have been successfully adapted for various applications, including audio classification. Building on the strengths of VGG networks, a pretrained model called VGGish was specifically designed for audio classification tasks. The underlying architecture was originally designed for image processing and was adapted by Google in 2017 for use with audio signals. The training was performed on a large YouTube dataset [45]. The pretrained VGGish network contains 24 layers, nine of which contain learnable weights: six convolutional and three fully connected layers. The input provided to the VGGish consists of a series of mel spectrograms obtained by decomposing the audio signals into a series of overlapping time frames. The original VGGish neural network was trained to classify a wide range (128 classes) of sound events, including, but not limited to, various environmental sounds such as music, speech, animal calls, and mechanical noises. To build our classification model and focus on task-specific features for sleep apnea classification, we employed a piecewise TL approach.
TL is an ML technique where knowledge acquired on one task is leveraged to enhance performance on a related task. In our study, we employed a bidirectional long short-term memory (bi-LSTM) network in conjunction with a pretrained VGGish model. Bi-LSTMs are particularly effective for processing sequential data, which are essential for capturing the temporal dynamics associated with sleep apnea events. The bidirectional nature of bi-LSTMs allows them to analyze sequences in both forward and backward directions, enhancing their ability to understand context and dependencies within the data. This approach leverages the strengths of bi-LSTMs in temporal modeling to accurately classify and identify patterns indicative of obstructive sleep apnea. Accordingly, we built a neural network architecture consisting of two submodels:
  • The first submodel consists of the VGGish convolutional stages up to the “pool4” layer; this submodel is not trained further and works with the same hyperparameters as the original VGGish;
  • On top of the previous submodel runs a bi-LSTM network: it takes as input the output of the VGGish at the “pool4” layer and is specifically trained to classify “Apnea” and “No-Apnea” events.
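The stacked model above was built in MATLAB. As a language-neutral illustration, the following numpy sketch shows the shape of the computation a bi-LSTM classifier performs over the sequence of "pool4" feature vectors: one LSTM pass forward in time, one backward, and a softmax over the concatenated final states. All weights here are random placeholders (this is not the trained model), and the function names are ours:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    # One LSTM time step; gates stacked in order: input, forget,
    # cell candidate, output.
    z = W @ x + U @ h + b
    H = h.shape[0]
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f = sig(z[:H]), sig(z[H:2 * H])
    g, o = np.tanh(z[2 * H:3 * H]), sig(z[3 * H:])
    c = f * c + i * g
    return o * np.tanh(c), c

def bilstm_classify(seq, params):
    """seq: (T, D) sequence of VGGish 'pool4' feature vectors.
    Runs one LSTM over the sequence in each direction, concatenates
    the two final hidden states, and applies a softmax over the two
    classes ('Apnea', 'No-Apnea')."""
    (Wf, Uf, bf), (Wb, Ub, bb), (Wo, bo) = params
    H = Uf.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in seq:               # forward direction
        h, c = lstm_step(x, h, c, Wf, Uf, bf)
    hb, cb = np.zeros(H), np.zeros(H)
    for x in seq[::-1]:         # backward direction
        hb, cb = lstm_step(x, hb, cb, Wb, Ub, bb)
    logits = Wo @ np.concatenate([h, hb]) + bo
    e = np.exp(logits - logits.max())
    return e / e.sum()          # class probabilities

def random_params(D=12288, H=15, seed=0):
    # Untrained placeholder weights with the right shapes
    # (D = pool4 feature size, H = number of hidden units).
    rng = np.random.default_rng(seed)
    def gate():
        return (0.01 * rng.standard_normal((4 * H, D)),
                0.01 * rng.standard_normal((4 * H, H)),
                np.zeros(4 * H))
    return gate(), gate(), (0.01 * rng.standard_normal((2, 2 * H)),
                            np.zeros(2))
```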
Figure 1 shows examples of sequences of mel spectrograms for the “Apnea” categories and the “No-Apnea” state, highlighting the differences between the two. All the “Apnea” categories of the patient show a continuous energy distribution in the middle and low frequency bands, whereas the sound captured in the “No-Apnea” state shows a noncontinuous distribution of energy in a specific band. Moreover, the sequences of spectrograms extracted in the “Apnea” state show an almost stationary behavior over time, while spectrograms extracted in the “No-Apnea” state are more irregular. This last behavior can be captured by a bi-LSTM network, which justifies its use in our architecture.
Figure 1. Typical examples of sequences of mel spectrograms extracted for different types of apnea category and in “No-Apnea” state: (a) “Central Apnea”, (b) “Hypopnea”, (c) “Mixed Apnea”, (d) “Obstructive Apnea”, (e) “No-Apnea”. The x-axis and y-axis are the time and frequency of the sound, respectively.
The block scheme of the proposed architecture is depicted in Figure 2; the sizes written at the output of each layer are related to the processing of a single mel spectrogram, showcasing the flow of information from the input through the VGGish and bi-LSTM layers to the final classification output.
Figure 2. Block scheme of the proposed method.
We constructed our model using MATLAB Version 23.2.0.2515942 (R2023b) Update 7 [46] and we ran training/validation and test on a Dell Server model PowerEdge R740xd, equipped with 2 × Intel(R) Xeon(R) Gold 6238R CPU @ 2.20 GHz, 6 × 64 GB RAM modules, 4 × GPU TESLA M10 with 8 GB memory.
To obtain input sequences for training the bi-LSTM network, we process each wav file of the dataset, which lasts 5 s, to extract the corresponding sequence of mel spectrograms (96 × 64 × 17). These sequences are the input to the VGGish network and are processed to obtain the activations at the output of layer “pool4” (12,288 × 17). These 17 vectors with 12,288 elements each are the input for the bi-LSTM network. By processing all 352,319 audio excerpts of the dataset, we obtained an equal number of sequences, which form the dataset used for defining the architecture, setting the thresholds, and training and testing the bi-LSTM network. We divided this dataset into 6 distinct folds, with each fold containing the sequences extracted from the audio excerpts of 32 patients. Consequently, sequences from each patient are exclusively present in one of the 6 folds. Patients were randomly assigned to each fold, and Table 2 illustrates this assignment.
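The patient-exclusive fold split can be sketched as follows; the function name and seed are illustrative, and the actual assignment used in the paper is the one reported in Table 2:

```python
import random

def assign_folds(patient_ids, n_folds=6, seed=42):
    """Randomly split patients into equally sized folds; all the
    sequences of a given patient end up in exactly one fold."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    per = len(ids) // n_folds
    return {k + 1: ids[k * per:(k + 1) * per] for k in range(n_folds)}

# 192 patients -> 6 folds of 32 patients each
folds = assign_folds(range(192))

# Sanity check: VGGish 'pool4' on a 96 x 64 input yields 6 x 4 x 512
# activations, consistent with the 12,288-element vectors above.
assert 6 * 4 * 512 == 12288
```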
Table 2. Association of patient IDs with folds.
For defining the architecture of the bi-LSTM and establishing thresholds to differentiate between the “Apnea” and “No-Apnea” classes, we used fold #6. Bayesian optimization was employed to select optimal hyperparameters for the bi-LSTM network, minimizing the complement to 1 of the “recall” metric (i.e., maximizing recall) evaluated on the validation subset. This metric was calculated as an average over the classes. The parameters optimized included the number of hidden units in the bi-LSTM network and the dropout probability of the subsequent layer.
Specifically, we considered the set {1, 3, 5, 6, 8, 10, 11, 13, 15, 17} as possible values for the number of hidden units and the set {0.2, 0.3, 0.4, 0.5} as possible values for the dropout probability. We chose discrete quantities to speed up the whole optimization process. The first set was determined as percentages, between 10% and 100%, of the number of vectors in each training sequence. We performed the training with the sequences of fold #6 with a training/validation splitting percentage of 75%/25%, equally distributed between the two classes “Apnea” and “No-Apnea”. Other relevant fixed parameters for training were the mini-batch size, set to 64, and the maximum number of epochs, set to 30.
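The optimization objective described above can be sketched as follows; this is our illustrative reading of the criterion (one minus the mean per-class recall on the validation confusion matrix), not the authors' MATLAB code:

```python
import numpy as np

def objective(conf):
    """Validation objective minimized during the hyperparameter
    search: 1 minus the mean per-class recall, computed from a
    confusion matrix whose rows are the true classes."""
    conf = np.asarray(conf, dtype=float)
    recalls = np.diag(conf) / conf.sum(axis=1)
    return 1.0 - recalls.mean()
```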
Figure 3 shows the result of the optimization process. In particular, Figure 3a shows the estimated objective function values compared to the hyperparameters, and Figure 3b shows the minimum objective function values compared to the number of function evaluations. The minimum estimated objective function value was obtained with a number of hidden units of 15 and a dropout probability of 0.5. These values are used in the following experiments.
Figure 3. Bi-LSTM architecture optimization: (a) Estimated objective function values versus hyperparameters; (b) Minimum objective function values versus number of function evaluations.

5. Experimentation and Results

5.1. Classification Results

Using the developed neural architecture, we trained five different classification models (M1, …, M5), each one using the subgroup of patients in the corresponding fold.
Table 3 shows the values of the parameters used for training (the parameters for the bi-LSTM layer, dropout layer, fully connected layer, and SoftMax layer not listed in the table are set to the default values of the corresponding Matlab creation functions). Figure 4 shows the curves of the “Loss” function evaluated for the training set (cyan) and the validation set (blue circles), and the curves of the “Recall” function evaluated for the training set (green) and the validation set (green circles). Each subfigure shows the result for a specific subset and the corresponding model. The trend of the curves is similar across the subfolds. Regarding the results on the training set, the values seem to be strongly influenced by the data in the respective batch. However, as expected, the general trend increases towards 1 for “Recall” and decreases towards 0 for “Loss”. The results on the validation set show the best performance after a few iterations (lower values for “Loss” and higher values for “Recall”). Moreover, the trend of “Recall” becomes quite constant after reaching the maximum value, while the trend of “Loss” slightly increases after reaching the minimum value due to the effect of overtraining. The policy used to select the weights of the network was “best validation loss”.
Table 3. Bi-LSTM training parameters.
Figure 4. Training/validation loss and training/validation recall for the (a) M1, (b) M2, (c) M3, (d) M4, and (e) M5 models.
We tested each model trained with a specific subfold against all the sequences stored in the remaining subfolds. Accordingly, we tested model Mi, trained with sequences in subfold i, against sequences in the other subfolds j ≠ i, with i, j ∈ {1, …, 5}. In this way, a model is never tested with sequences obtained from patients used for training it.
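The resulting evaluation schedule can be sketched as the full set of ordered (train fold, test fold) pairs with the diagonal excluded; the function name is ours:

```python
def test_pairs(n_folds=5):
    """Evaluation pairs (i, j): model M_i, trained on fold i, is
    tested on every other fold j != i, so no test patient was ever
    seen during training."""
    return [(i, j) for i in range(1, n_folds + 1)
            for j in range(1, n_folds + 1) if j != i]
```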
Figure 5 shows the confusion matrices obtained as output of the tests by each model. The confusion matrix indicates how many sequences of a given class (“IN” sequences related to “Apnea” events; “OUT” sequences related to the “No-Apnea” state) are correctly or incorrectly recognized. The columns on the left show the class-wise recall, i.e., the percentage of correctly/incorrectly classified class objects in relation to the number of all class objects. The lower rows show the class-wise precision, i.e., the percentage of correctly/incorrectly classified class objects in relation to the number of objects classified in the same way. The lowest value for class-wise recall, equal to 63.9%, is achieved by the model M4 for the class “IN”. For the class “OUT”, on the other hand, it has the highest class-wise recall at 90.8%. Model M1 shows the highest class-wise precision for the class “OUT” at 88.6%, while model M3 shows a value of 68.4% as class-wise precision for the class “IN”, which is the worst result. Comparing the true and predicted classes, we can identify the true positives (TP) as the number of “IN” sequences correctly identified as “IN” (elements at position (1, 1) in the matrices); true negatives (TN) as the number of “OUT” sequences correctly identified as “OUT” (elements at position (2, 2)); false negatives (FN) as the number of “IN” sequences incorrectly identified as “OUT” (elements at position (1, 2)); and false positives (FP) as the number of “OUT” sequences incorrectly identified as “IN” (elements at position (2, 1)). Accordingly, the overall performance indices can be evaluated as follows:
Figure 5. Confusion matrix, class-wise “precisions”, and class-wise “recalls” for the (a) M1, (b) M2, (c) M3, (d) M4, and (e) M5 models.
  • Precision: The ability to properly identify positive samples, P = TP / (TP + FP);
  • Recall: The fraction of positive samples correctly classified, R = TP / (TP + FN);
  • Specificity: The fraction of negative samples correctly classified, S = TN / (FP + TN);
  • Accuracy: The fraction of patterns correctly classified, A = (TP + TN) / (TP + FP + FN + TN);
  • F1-score: The harmonic mean between precision and recall, F1 = 2 × P × R / (P + R).
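The definitions above translate directly into code; a minimal sketch (function name ours) that computes all five indices from the four confusion-matrix counts:

```python
def metrics(tp, fn, fp, tn):
    """Overall performance indices from confusion-matrix counts,
    with class 'IN' (Apnea) taken as the positive class."""
    p = tp / (tp + fp)                       # precision
    r = tp / (tp + fn)                       # recall
    s = tn / (fp + tn)                       # specificity
    a = (tp + tn) / (tp + fp + fn + tn)      # accuracy
    f1 = 2 * p * r / (p + r)                 # F1-score
    return {"precision": p, "recall": r, "specificity": s,
            "accuracy": a, "f1": f1}
```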
Table 4 shows the performance parameters for the different trained models. The best values for each parameter are highlighted in bold. Model M4 outperforms all other models in terms of precision, specificity, and accuracy, while model M1 shows the best results in terms of recall and F1-score. When analyzing the individual parameters, it can be seen that the precision is always above 68%, the minimum value for the recall is around 64%, the specificity performs better with a minimum value of 86%, the accuracy is in the range of 81 to 83%, and the F1-score is in the range of 66 to 72%.
Table 4. Performance parameters on extracted sequences.

5.1.1. Discussion

It is important to emphasize that these results concern the classification between sounds recorded during an apnea phase and sounds recorded under non-apnea conditions. A direct comparison with recent works in the literature using the same dataset (dataset at the First Affiliated Hospital of Guangzhou Medical University and the dataset at Amalia Fleming General Hospital) [15], such as [28,29,30], cannot be performed because the contributions of these papers mainly focus on the classification of snoring sounds of healthy snorers and OSAHS patients.
Specifically, the authors of [28] achieved 90% accuracy, 95.65% precision, 91.67% sensitivity, and 93.62% F1-score. As reported in the study, nocturnal sounds recorded from a subset of 24 OSAHS patients and 6 simple snorers were first filtered using the Sox noise reduction algorithm and then segmented into potential snoring episodes using the adaptive thresholding method proposed by Wang et al. [24]. Classifiers based on GMMs fed with a set of 100 selected features were trained to discriminate snoring sounds of simple snorers from snoring sounds of OSAHS patients. Feature selection and classification experiments were performed using leave-one-subject-out cross-validation (LOSOCV). Accordingly, in each experiment, the audio data of 29 subjects were used for training and the remaining subject was used for testing. The authors presented the results by taking the average over the 30 different tests performed, each with one subject removed. No AHI estimation and/or AHI severity prediction was analyzed in this paper.
In [29], the authors developed three classification models, each using different features to classify simple snoring and abnormal snoring: a CNN and a pretrained ResNet18 based on the mel spectrum and an XGBoost classifier based on acoustic features. In this work, the number of subjects in the dataset was expanded to 40; 30 subjects were used to train and validate the proposed system, and the remaining 10 subjects were used to further estimate the AHI. From the training set, 9728 abnormal snoring sounds and 39,039 simple snoring sounds were obtained. During training, 10,000 simple snoring sounds were randomly selected to obtain a relatively balanced dataset. This dataset was then randomly divided into a training set, a validation set, and a test set in a ratio of 3:1:1. Accordingly, the results for the test set were obtained by including sound excerpts from subjects appearing in the training and validation sets. The best results were obtained with the CNN model: 81.83% accuracy, 78.21% precision, 78.13% sensitivity, 85.88% recall, and 81.87% F1-score. An improvement of these results was achieved with a fusion model and a soft voting approach, showing an accuracy of 83.68%, a precision of 79.30%, a sensitivity of 78.73%, a recall of 89.09%, and an F1-score of 83.91%.
In [30], the dataset was created considering 50 subjects of the original one (10 simple snorers and 40 OSAHS patients). The snoring sounds were labeled by ear, nose, and throat experts based on PSG. The snoring sounds of simple snorers (SSS) and the snoring sounds of OSAHS patients (SSP) were labeled. Based on the apnea and hypopnea events detected by PSG, the normal snoring of OSAHS patients (NSP) and apnea–hypopnea snoring of OSAHS patients (ASP) were also labeled. Three different experiments were carried out:
  • SSS versus SSP: For this experiment, the snoring sounds of all subjects were randomly divided into training, validation, and test sets in the ratio 8:1:1;
  • NSP versus ASP: To conduct this experiment, snoring sounds from OSAHS patients were randomly divided into training, validation and test sets with a ratio of 8:1:1;
  • NSP versus ASP-LOSOCV: For this experiment, the snoring sounds of one subject were selected as the test set, and the remaining 39 subjects were used as the training set. This process was repeated 40 times, rotating the subject under test, to calculate the average metrics.
The authors compared the performance of different combinations of models (ResNet50 + LSTM, CNN + LSTM, VGG19 + LSTM, and Xception + LSTM) and obtained the best result with the VGG19 + LSTM model. In particular, the latter model achieved an accuracy of 99.31%, a sensitivity of 99.13%, a precision of 99.58%, and an F1-score of 99.34% in experiment no. 1. For the discrimination of NSP and ASP in experiment no. 2, VGG19 + LSTM achieved 85.21% accuracy, 84.45% sensitivity, 84.65% precision, and 84.55% F1-score. Finally, to discriminate NSP and ASP in experiment no. 3, the proposed VGG19 + LSTM model achieved 66.29% accuracy, 67.27% sensitivity, 42.94% precision, and 51.54% F1-score.
These last results show how difficult it is to classify NSP and ASP, especially when the audio excerpts of the patient under test have never been used in training (as in our experiments). The models we propose can distinguish sounds recorded during “Apnea–Hypopnea” events from all other types of sounds. Performance was evaluated using only patients whose recordings were not included in the training phase. Comparing the results in Table 4 with experiment no. 3 of [30], the proposed models outperform those available in the literature when similar working hypotheses are considered.

5.2. AHI Estimation and Classification

Encouraged by the results obtained, we tried to verify the ability of the system to predict the AHI value using patients’ recordings over a whole night. An “Apnea”/“Hypopnea” event only needs to be reported if the apnea lasts for a certain time. To detect these types of events by analyzing the patient recordings, we consider a sliding window of 5 s duration moving at a step of 1 s. At each step, we process the sound samples within the window to obtain a sequence of 17 mel spectrograms, which is used as input to the VGGish network. The output of the VGGish network at the “pool4” layer is a sequence of 17 arrays. This sequence is the input for the trained bi-LSTM networks. Accordingly, the outputs of these networks are generated every 1 s and consist of the predicted values for the classes “IN” and “OUT”. These two prediction values are in the range of 0 to 1: an output value close to 1 means a strong prediction of class membership, while values approaching 0 indicate a weak prediction of class membership. We evaluated the difference d = IN − OUT (in the range [−1, 1]) between the two outputs of the architecture depicted in Figure 6.
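The sliding-window bookkeeping described above can be sketched as follows. This is a minimal illustration, not the authors’ code: `classify_window` is a hypothetical stand-in for the whole mel spectrogram → VGGish → bi-LSTM pipeline, returning the (“IN”, “OUT”) prediction pair for one window.

```python
def window_starts(total_s, win_s=5, step_s=1):
    """Start times (s) of the 5 s analysis windows moved at a 1 s step."""
    return list(range(0, total_s - win_s + 1, step_s))

def d_curve(total_s, classify_window, win_s=5):
    """One d = IN - OUT value per window, i.e., one value per second."""
    d = []
    for t0 in window_starts(total_s, win_s):
        p_in, p_out = classify_window(t0, t0 + win_s)  # both in [0, 1]
        d.append(p_in - p_out)                         # d in [-1, 1]
    return d

# A 30 s excerpt yields 26 overlapping windows, hence 26 samples of d(t).
print(len(d_curve(30, lambda a, b: (0.8, 0.1))))  # 26
```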
Figure 6. Flow diagram of the proposed architecture: from the process recordings to the class prediction.
The evaluation of d for each subsequent sliding window allows us to obtain a curve d(t), where values approaching −1 denote “No-Apnea” conditions, while values approaching 1 denote “Apnea/Hypopnea” conditions. To smooth the trend of these curves, we evaluate the median value in a sliding window of 17 elements, corresponding to the size of the audio processing window. The smoothed value is denoted as d̂(t). Figure 7 shows an example of the trend of d̂(t) for a time interval of half an hour.
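A minimal sketch of the median smoothing step. The 17-sample window matches the size of the audio processing window; clipping the window at the edges of the recording is an assumption of this sketch, not a detail stated in the text.

```python
from statistics import median

def smooth(d, k=17):
    """Median of d(t) over a centered window of k samples (clipped at edges)."""
    h = k // 2
    return [median(d[max(0, i - h): i + h + 1]) for i in range(len(d))]

# An isolated 1 s spike does not survive the median, so short glitches in
# d(t) cannot be mistaken for apnea/hypopnea events (all values stay 0 here).
print(smooth([0, 0, 1, 0, 0], k=3))
```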
Figure 7. Typical time evolution of the d̂(t) trend. The thresholds Th1 and Th2 are also reported.
In our study, we adopt a methodological approach that utilizes two thresholds, Th1 and Th2, to identify and classify “Apnea/Hypopnea” events from sleep recordings. Specifically, an interval of time is classified as an “Apnea/Hypopnea” event when the smoothed value d̂(t) remains above Th1 for a minimum duration specified by Th2. This criterion ensures robust event detection by requiring sustained deviations indicative of respiratory disturbance. In addition, OSAHS patients are medically categorized into four levels: normal (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), and severe (AHI ≥ 30). Accordingly, we can consider the classification performance of our models with respect to these classes.
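The two-threshold rule and the severity mapping can be sketched as follows. The Th1 and Th2 values used in the example are placeholders, not the optimized values discussed below; one sample of d̂(t) per second is assumed.

```python
def detect_events(d_hat, th1, th2):
    """Intervals [start, end), in seconds, where d_hat stays above th1
    for at least th2 consecutive seconds (one sample per second)."""
    events, start = [], None
    for i, v in enumerate(list(d_hat) + [float("-inf")]):  # sentinel flushes the last run
        if v > th1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= th2:
                events.append((start, i))
            start = None
    return events

def severity(ahi):
    """Medical AHI severity classes."""
    if ahi < 5:
        return "Normal"
    if ahi < 15:
        return "Mild"
    if ahi < 30:
        return "Moderate"
    return "Severe"

d_hat = [0.1, 0.8, 0.9, 0.7, 0.1, 0.9, 0.1]
print(detect_events(d_hat, th1=0.5, th2=3))  # [(1, 4)]: the lone spike at t=5 is too short
```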
Unfortunately, our dataset exhibits a significant imbalance across these severity classes, as depicted in Table 5, which presents challenges in model training and evaluation. To address this issue, we employ Bayesian optimization to define optimal values for T h 1 and T h 2 . This optimization strategy aims to enhance the sensitivity and specificity of our event detection method, thereby improving the accuracy of OSAHS classification.
Table 5. Distribution of recordings in the various folds according to the AHI classes.
We used the N_F = 32 recordings from patients in fold #6, which were never used to train or test the models. We denote by AHI_T(r) the true AHI for recording r and by AHI_P(r) the predicted AHI for the same recording, and consider three different objective functions:
  • The absolute value of the AHI error, averaged over all recordings:
    O_A = (1/N_F) Σ_{r=1}^{N_F} AHI_e(r);    (1)
  • The relative absolute value of the AHI error, averaged over all recordings:
    O_R = (1/N_F) Σ_{r=1}^{N_F} AHI_e(r) / AHI_T(r);    (2)
  • The class-weighted absolute value of the AHI error, averaged over all recordings:
    O_CA = Σ_{r=1}^{N_F} AHI_e(r) / N_x(r);    (3)
where AHI_e(r) = |AHI_T(r) − AHI_P(r)| is the absolute error in estimating the AHI for the processed recording r, and N_x(r) is the number of recordings in the fold that belong to the same class as recording r.
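A sketch of the three objective functions O_A, O_R, and O_CA defined above, given parallel lists of true and predicted AHI values over the optimization recordings (the class boundaries follow the medical AHI levels; function and variable names are ours):

```python
def ahi_class(ahi):
    """Index of the AHI severity class (0: Normal, 1: Mild, 2: Moderate, 3: Severe)."""
    return 0 if ahi < 5 else 1 if ahi < 15 else 2 if ahi < 30 else 3

def objectives(ahi_true, ahi_pred):
    """O_A, O_R, O_CA from parallel lists of true and predicted AHI values."""
    nf = len(ahi_true)
    err = [abs(t - p) for t, p in zip(ahi_true, ahi_pred)]  # AHI_e(r)
    cls = [ahi_class(t) for t in ahi_true]
    n_x = [cls.count(c) for c in cls]                       # N_x(r): same-class count
    o_a = sum(err) / nf                                     # mean absolute error
    o_r = sum(e / t for e, t in zip(err, ahi_true)) / nf    # mean relative error
    o_ca = sum(e / n for e, n in zip(err, n_x))             # class-weighted error
    return o_a, o_r, o_ca

print(objectives([10.0, 20.0], [12.0, 25.0]))  # (3.5, 0.225, 7.0)
```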
The objective function in Equation (2) takes into account that equal absolute errors are not of equal importance due to the different amplitudes of the AHI intervals used to define the classes. The objective function in Equation (3) aims to compensate for the unbalanced distribution of recordings in each class of the dataset. We searched for the optimal values of the two thresholds Th1 and Th2 for each of the five models. Figure 8 shows, as an example, the results of the objective function and the corresponding optimal pairs of thresholds (Th1, Th2) evaluated with the model M_3 and the three objective functions considered.
Figure 8. Outcome of the considered objective functions and related optimal values of threshold pairs (Th1, Th2) for model M_3: (a) O_A objective function, (b) O_R objective function, (c) O_CA objective function.
Table 6 shows the thresholds obtained by applying Bayesian optimization to each model according to the three proposed objective functions.
Table 6. Thresholds to evaluate the AHI from d̂(t), obtained with the MATLAB bayesopt function for the three different objective functions and five models.
To evaluate the performance of each model M_i, we processed the corresponding output d̂_{M_i,r}(t) and applied the corresponding pair of thresholds (Th1, Th2) optimized for M_i to each recording r ∈ R̄_i to obtain the predicted AHI_P(r). R̄_i is the complementary set of R_i, the set containing the 32 recordings used to train the model M_i (Table 2). Accordingly, the number of recordings used to test each model is equal to 128.
Figure 9 shows curves reporting, for each model and for the different objective functions, the percentage of recordings with AHI_e ≤ e.
Figure 9. Curves indicating the percentage of recordings with AHI_e ≤ e for each model, taking into account (a) the O_A objective function, (b) the O_R objective function, and (c) the O_CA objective function.
The results obtained with the different models largely overlap, except for M_4 with the objective function O_A and M_5 with the objective function O_CA. In general, about 50% of the tested recordings show an AHI_e ≤ 10.
The scatter plots (Figure 10) illustrate a good correlation between the true and the predicted AHI values. The Pearson correlation coefficient (PCC) assumes a larger value for thresholds optimized with the objective function O_A. The associated p-value is always very small.
Figure 10. Scatter diagram between true and predicted AHI. The legends report the value of the Pearson correlation coefficient (PCC) and the related p-value obtained by each model, with thresholds optimized by (a) the O_A objective function, (b) the O_R objective function, and (c) the O_CA objective function.
However, it is necessary to evaluate the result of the classification because the same absolute error may or may not produce a classification error, depending on the distance of the true AHI from the class boundaries. Figure 11 shows the confusion matrices obtained by processing the recordings for each model and for each threshold optimization approach. Analyzing the figure column by column shows the behavior of the different models under the same optimization approach. For example, the high number of misclassifications obtained using the model M_4 with the optimization approach O_A (Figure 11j) and using the model M_5 with the optimization approach O_CA reflects the AHI_e results obtained for the model M_4 in Figure 9a and for the model M_5 in Figure 9c.
Figure 11. Confusion matrices obtained to classify OSAHS severity for each model and optimization approach: (a) M 1 , O A ; (b) M 1 , O R ; (c) M 1 , O C A ; (d) M 2 , O A ; (e) M 2 , O R , (f) M 2 , O C A ; (g) M 3 , O A ; (h) M 3 , O R ; (i) M 3 , O C A ; (j) M 4 , O A ; (k) M 4 , O R ; (l) M 4 , O C A ; (m) M 5 , O A ; (n) M 5 , O R ; (o) M 5 , O C A .
Analyzing the rows of Figure 11, it is possible to infer how the different objective functions influence the classification performance. The O_A approach yields a higher number of correct classifications for “Severe” conditions, but the other classes are generally overestimated. This effect is partially reduced with the O_R and O_CA approaches, but the number of underestimations for the “Severe” class increases.
With the aim of presenting the results in a compact way, we analyzed the percentage of correctly estimated, overestimated, and underestimated recordings. Accordingly, starting from a general confusion matrix C, we evaluated
    E_R = (Σ_{i=1}^{N_c} C_{i,i}) / (Σ_{i=1}^{N_c} Σ_{j=1}^{N_c} C_{i,j}),
    E_O = (Σ_{i=1}^{N_c} Σ_{j=i+1}^{N_c} C_{i,j}) / (Σ_{i=1}^{N_c} Σ_{j=1}^{N_c} C_{i,j}),
    E_U = (Σ_{j=1}^{N_c} Σ_{i=j+1}^{N_c} C_{i,j}) / (Σ_{i=1}^{N_c} Σ_{j=1}^{N_c} C_{i,j}),    (4)
where N_c is the number of classes. Table 7 shows the results obtained for the three different objective functions and five models.
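The error-rate definitions above can be sketched as follows, assuming that the rows of C index the true class and the columns the predicted class (so entries above the diagonal are overestimations), which is the usual convention:

```python
def estimation_rates(c):
    """E_R, E_O, E_U from a confusion matrix c
    (rows: true class, columns: predicted class)."""
    n = len(c)
    total = sum(c[i][j] for i in range(n) for j in range(n))
    e_r = sum(c[i][i] for i in range(n)) / total                           # diagonal
    e_o = sum(c[i][j] for i in range(n) for j in range(i + 1, n)) / total  # j > i
    e_u = sum(c[i][j] for i in range(n) for j in range(i)) / total         # j < i
    return e_r, e_o, e_u

# 4-class example: 6 correct, 3 overestimated, 1 underestimated recording.
c = [[2, 1, 0, 0],
     [0, 2, 1, 0],
     [0, 0, 1, 1],
     [0, 1, 0, 1]]
print(estimation_rates(c))  # (0.6, 0.3, 0.1)
```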
Table 7. Percentage of right estimation E_R, underestimation E_U, and overestimation E_O for the three different objective functions and five models.
The best results in terms of correct estimation are obtained with the O_A approach; in particular, M_2 provides the highest value of 84.38%. From a medical point of view, it is important to keep the underestimation rate low, as it corresponds to sick patients being classified as healthy. Excluding the M_4 model, the O_A approach shows the better results (3.91% for the M_2 and M_3 models), while O_CA shows the worst performance, its lowest underestimation rate being 14.06% for the M_3 model.
The previous results can be refined by using a weighting matrix that assigns different weights to the different types of misclassification. Accordingly, it is possible to consider the modified performance parameters:
    E_R = (Σ_{i=1}^{N_c} A_{i,i} C_{i,i}) / (Σ_{i=1}^{N_c} Σ_{j=1}^{N_c} A_{i,j} C_{i,j}),
    E_O = (Σ_{i=1}^{N_c} Σ_{j=i+1}^{N_c} A_{i,j} C_{i,j}) / (Σ_{i=1}^{N_c} Σ_{j=1}^{N_c} A_{i,j} C_{i,j}),
    E_U = (Σ_{j=1}^{N_c} Σ_{i=j+1}^{N_c} A_{i,j} C_{i,j}) / (Σ_{i=1}^{N_c} Σ_{j=1}^{N_c} A_{i,j} C_{i,j}),    (5)
where A is specifically built to weight the type of misclassification. We consider an A matrix that weights each misclassification depending on the “distance” from the correct class. Accordingly, the weight for the correct class is equal to 1, the weight for two neighboring classes is equal to 2 (“Normal”–“Mild”, “Mild”–“Moderate”, “Moderate”–“Severe”), and so on:
    A = [ 1 2 3 4 ;
          2 1 2 3 ;
          3 2 1 2 ;
          4 3 2 1 ]
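The weighted error rates above, with this distance-based A, can be sketched as follows (again assuming rows of C index the true class and columns the predicted class):

```python
A = [[1, 2, 3, 4],
     [2, 1, 2, 3],
     [3, 2, 1, 2],
     [4, 3, 2, 1]]

def weighted_rates(c, a=A):
    """Weighted E_R, E_O, E_U: each confusion-matrix cell is scaled by the
    distance-based weight a[i][j] before normalization."""
    n = len(c)
    tot = sum(a[i][j] * c[i][j] for i in range(n) for j in range(n))
    e_r = sum(a[i][i] * c[i][i] for i in range(n)) / tot
    e_o = sum(a[i][j] * c[i][j] for i in range(n) for j in range(i + 1, n)) / tot
    e_u = sum(a[i][j] * c[i][j] for i in range(n) for j in range(i)) / tot
    return e_r, e_o, e_u

# A "Normal" recording misclassified as "Severe" (weight 4) counts twice as
# much as one misclassified as "Mild" (weight 2).
```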
Table 8 shows the results obtained for the three different objective functions and five models, taking into account the weighted errors. The O_A approach continues to give the best performance, especially for the M_3 model. Other A matrices can also be used, for example, to emphasize overestimation with respect to underestimation.
Table 8. Percentage of weighted correct estimation E_R, underestimation E_U, and overestimation E_O for the three different objective functions and five models.

Discussion

We base our discussion on the same papers that we analyzed in Section 5.1.1. In [28], no AHI estimation and/or AHI severity prediction was analyzed. In [30], the AHI was estimated based on the detected apnea–hypopnea snoring sounds and yielded a PCC of 0.966 (p-value < 0.001). This result was obtained by selecting the snoring sounds of 39 subjects as the training set and the snoring sounds of 1 subject as the test set; the process was repeated 40 times to calculate the average metrics. The authors showed the results only as a scatter plot, so the prediction of AHI severity cannot be analyzed further. In [29], the AHI determined in the experiment and the AHI measured by PSG had a PCC of 0.913 (p-value < 0.001). This result was obtained by analyzing the snoring sounds of 10 subjects who did not participate in the training phase. The results in terms of AHI severity classification appear to be good, with a detection rate of 90% (only one subject’s AHI severity was misclassified, from “Moderate” to “Mild”). However, it should be noted that the tested recordings do not include the “Mild” and “Normal” AHI severity grades. In addition, the number of tested recordings is very small, so the statistical significance of the result appears questionable.

5.3. Models Aggregation

As a final experiment, we propose the classification of AHI severity by aggregation of models. To classify a particular recording, it is possible to use the results of each individual model and create a unique classification by aggregating the results. The recordings r ∈ R_i are processed with the models M_j, j = 1, …, 5, j ≠ i; accordingly, each recording is processed with four different models. To perform the aggregate classification, we analyzed the class P_j(r) predicted by each model for the recording r and counted the results for each class:
    CP_class(r) = Σ_{j=1}^{4} [P_j(r) == class],    (6)
where class ∈ {Normal, Mild, Moderate, Severe} and [P_j(r) == class] equals 1 if model M_j predicts class for recording r and 0 otherwise. Specifically, we chose two different approaches:
  • P1:
    P̂(r) = class, if CP_class(r) > CP_class′(r) for all class′ ≠ class; undefined, otherwise;    (7)
  • P2:
    P̂(r) = class, if there exists a class such that CP_class(r) > 2; undefined, otherwise,    (8)
where P̂(r) is the aggregated predicted class. We introduced the undefined result to avoid assigning a class when the predictions of the models are not coherent. More precisely, the policy P1 assigns class to the aggregated prediction if the number of models predicting class exceeds that of every other class. For example, (CP_normal(r) = 1, CP_mild(r) = 0, CP_moderate(r) = 1, CP_severe(r) = 2) → P̂(r) = severe, while (CP_normal(r) = 2, CP_mild(r) = 0, CP_moderate(r) = 2, CP_severe(r) = 0) → P̂(r) = undefined. Policy P2 assigns class to the aggregated prediction if the number of models predicting class is at least half plus one of all models (i.e., at least 3 out of 4). For example, (CP_normal(r) = 1, CP_mild(r) = 0, CP_moderate(r) = 1, CP_severe(r) = 2) → P̂(r) = undefined, (CP_normal(r) = 2, CP_mild(r) = 0, CP_moderate(r) = 2, CP_severe(r) = 0) → P̂(r) = undefined, and (CP_normal(r) = 0, CP_mild(r) = 3, CP_moderate(r) = 1, CP_severe(r) = 0) → P̂(r) = mild.
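The two voting policies can be sketched as follows; the Counter-based tallying is our choice of implementation, and the class names follow the severity levels used throughout.

```python
from collections import Counter

def aggregate(preds, policy):
    """Aggregate the four per-model class predictions for one recording.
    policy "P1": strict plurality; policy "P2": more than half of the votes."""
    counts = Counter(preds)
    top, n = counts.most_common(1)[0]
    if policy == "P1":
        # a unique maximum is required, otherwise the prediction stays undefined
        ties = sum(1 for v in counts.values() if v == n)
        return top if ties == 1 else "undefined"
    if policy == "P2":
        return top if n > 2 else "undefined"  # at least 3 of the 4 models agree
    raise ValueError(policy)

votes = ["Normal", "Moderate", "Severe", "Severe"]
print(aggregate(votes, "P1"))  # Severe
print(aggregate(votes, "P2"))  # undefined
```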
Figure 12 and Figure 13 show the confusion matrices obtained with the two strategies, respectively. We analyzed the results using all the proposed approaches to set the threshold pair (Th1, Th2) used to evaluate the AHI. For both strategies, the number of undefined recordings is lowest when the thresholds are optimized using the O_A approach. As expected, the strategy P2 is more restrictive and classifies more recordings as “undefined” than P1. On the other hand, the number of underestimated recordings obtained with the strategy P2 is very low, and underestimation only occurs in the “Severe” class.
Figure 12. Confusion matrices obtained to classify OSAHS severity using aggregate model results and policy P1: thresholds optimized using (a) O_A, (b) O_R, and (c) O_CA.
Figure 13. Confusion matrices obtained to classify OSAHS severity using aggregate model results and policy P2: thresholds optimized using (a) O_A, (b) O_R, and (c) O_CA.
Table 9 and Table 10 show the performance of the aggregate models in terms of undefined severity results (E_ud), correct severity estimation (E_R), underestimation of severity (E_U), and overestimation of severity (E_O) for policies P1 and P2, respectively. Although the best performance is obtained with the P1 policy and the O_A threshold-setting approach, the most conservative results are obtained when the P2 policy and the O_CA threshold-setting approach are used. The percentage of undefined recordings is high, but the model never underestimates the “Moderate” and “Mild” classes. The percentage of “Severe” recordings that are underestimated is 4.55% (Figure 13c). Obviously, the “undefined” recordings can be further processed by doctors to assign the correct AHI class.
Table 9. Percentage of undefined recordings E_ud, right estimation E_R, underestimation E_U, and overestimation E_O for the three different objective functions using policy P1 and the aggregate model.
Table 10. Percentage of undefined recordings E_ud, right estimation E_R, underestimation E_U, and overestimation E_O for the three different objective functions using policy P2 and the aggregate model.

6. Conclusions

This study aimed to propose new strategies for the investigation of OSAHS by exploring the use of low-cost and noninvasive audio signals for OSAHS diagnosis. It investigated the identification of OSAHS events, including classification of apnea/non-apnea status, prediction of apnea–hypopnea index (AHI), and classification of AHI severity. We used a dataset consisting of recordings from a recently curated cohort of subjects undergoing diagnosis for sleep apnea syndrome. The proposed approach involves the use of a convolutional neural network based on a VGGish structure in combination with a bidirectional LSTM for sequence classification. The audio signals are processed into mel spectrograms, which are then fed into the CNN to extract high-level features that serve as input to the bi-LSTM. The results show promising performance of the proposed bi-LSTM in combination with the VGGish architecture, outperforming previous approaches in terms of precision, specificity, accuracy, recall, and F1-score. The classification models demonstrated the ability to discriminate between apnea and non-apnea events even when using data from patients who did not participate in the training phase. In addition, the optimal thresholds for apnea/hypopnea event detection and AHI severity classification were determined using Bayesian optimization. The results showed a strong correlation between the predicted and actual AHI values.
Aggregation strategies were also proposed for the developed model using different optimization methods and merging strategies. The results were good in terms of reducing overestimation and underestimation of severity. The presence of undefined cases can be addressed with the help of medical doctors whose effort in labeling the entire dataset is significantly reduced. To summarize, the research presents a promising approach to OSAHS diagnosis using audio signals acquired through low-cost environmental devices, and deep neural networks. The proposed models show robust performance in event classification and AHI estimation and offer a potentially more convenient and scalable alternative to traditional PSG methods. This work opens exciting avenues for future research, including expanding the training data by collecting information from more subjects and also addressing the problem of AHI severity imbalance using data augmentation techniques. In addition, incorporating explainable AI techniques (XAI) can help clinicians trust the results and give patients confidence in the diagnosis. Finally, exploring multimodal data fusion and validating the model in the clinical setting promises a more robust and practical OSAHS diagnostic tool.

Author Contributions

Conceptualization, S.S.; methodology, S.S. and L.P.; software, S.S.; validation, S.S.; formal analysis, S.S.; investigation, S.S.; resources, S.S. and M.S.; data curation, S.S.; writing—original draft preparation, S.S.; writing—review and editing, S.S., L.P., O.S. and M.S.; visualization, S.S.; supervision, S.S.; project administration, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in PSG-Audio at https://www.scidb.cn/en/detail?dataSetId=778740145531650048 (accessed on 29 May 2024), reference number CSTR: https://cstr.cn/31253.11.sciencedb.00345 (accessed on 29 May 2024), DOI: https://doi.org/10.11922/sciencedb.00345 (accessed on 29 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pavlova, M.K.; Latreille, V. Sleep disorders. Am. J. Med. 2019, 132, 292–299. [Google Scholar] [CrossRef] [PubMed]
  2. Armstrong, M.; Wallace, C.; Marais, J. The effect of surgery upon the quality of life in snoring patients and their partners: A between-subjects case-controlled trial. Clin. Otolaryngol. Allied Sci. 1999, 24, 510–522. [Google Scholar] [CrossRef] [PubMed]
  3. Gall, R.; Isaac, L.; Kryger, M. Quality of life in mild obstructive sleep apnea. Sleep 1993, 16, S59–S61. [Google Scholar] [CrossRef] [PubMed]
  4. Zhu, K.; Li, M.; Akbarian, S.; Hafezi, M.; Yadollahi, A.; Taati, B. Vision-based heart and respiratory rate monitoring during sleep—A validation study for the population at risk of sleep apnea. IEEE J. Transl. Eng. Health Med. 2019, 7, 1900708. [Google Scholar] [CrossRef] [PubMed]
  5. Imtiaz, S.A. A systematic review of sensing technologies for wearable sleep staging. Sensors 2021, 21, 1562. [Google Scholar] [CrossRef]
  6. Sabil, A.; Glos, M.; Günther, A.; Schöbel, C.; Veauthier, C.; Fietze, I.; Penzel, T. Comparison of apnea detection using oronasal thermal airflow sensor, nasal pressure transducer, respiratory inductance plethysmography and tracheal sound sensor. J. Clin. Sleep Med. 2019, 15, 285–292. [Google Scholar] [CrossRef] [PubMed]
  7. Fietze, I.; Laharnar, N.; Obst, A.; Ewert, R.; Felix, S.B.; Garcia, C.; Gläser, S.; Glos, M.; Schmidt, C.O.; Stubbe, B.; et al. Prevalence and association analysis of obstructive sleep apnea with gender and age differences—Results of SHIP-Trend. J. Sleep Res. 2019, 28, e12770. [Google Scholar] [CrossRef]
  8. Berry, R.B.; Brooks, R.; Gamaldo, C.E.; Harding, S.M.; Marcus, C.; Vaughn, B.V. The AASM manual for the scoring of sleep and associated events. Rules Terminol. Tech. Specif. Darien Illinois Am. Acad. Sleep Med. 2012, 176, 2012. [Google Scholar]
  9. Bhutada, A.M.; Broughton, W.A.; Garand, K.L. Obstructive sleep apnea syndrome (OSAS) and swallowing function—A systematic review. Sleep Breath. 2020, 24, 791–799. [Google Scholar] [CrossRef]
  10. Almazaydeh, L.; Elleithy, K.; Faezipour, M. Obstructive sleep apnea detection using SVM-based classification of ECG signal features. In Proceedings of the 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Diego, CA, USA, 28 August–1 September 2012; pp. 4938–4941. [Google Scholar]
  11. Mendonca, F.; Mostafa, S.S.; Ravelo-Garcia, A.G.; Morgado-Dias, F.; Penzel, T. A review of obstructive sleep apnea detection approaches. IEEE J. Biomed. Health Inform. 2018, 23, 825–837. [Google Scholar] [CrossRef]
  12. Zhou, Q.; Shan, J.; Ding, W.; Wang, C.; Yuan, S.; Sun, F.; Li, H.; Fang, B. Cough Recognition Based on Mel-Spectrogram and Convolutional Neural Network. Front. Robot. AI 2021, 8, 580080. [Google Scholar] [CrossRef] [PubMed]
  13. Castro-Ospina, A.E.; Solarte-Sanchez, M.A.; Vega-Escobar, L.S.; Isaza, C.; Martínez-Vargas, J.D. Graph-Based Audio Classification Using Pre-Trained Models and Graph Neural Networks. Sensors 2024, 24, 2106. [Google Scholar] [CrossRef] [PubMed]
  14. Serrano, S.; Patanè, L.; Scarpa, M. Obstructive Sleep Apnea identification based on VGGish networks. In Proceedings of the Proceedings—European Council for Modelling and Simulation, ECMS, Florence, Italy, 20–23 June 2023; pp. 556–561. [Google Scholar]
  15. Korompili, G.; Amfilochiou, A.; Kokkalas, L.; Mitilineos, S.A.; Tatlas, N.A.; Kouvaras, M.; Kastanakis, E.; Maniou, C.; Potirakis, S.M. PSG-Audio, a scored polysomnography dataset with simultaneous audio recordings for sleep apnea studies. Sci. Data 2021, 8, 1–13. [Google Scholar] [CrossRef] [PubMed]
  16. Yang, C.; Cheung, G.; Stankovic, V.; Chan, K.; Ono, N. Sleep apnea detection via depth video and audio feature learning. IEEE Trans. Multimed. 2016, 19, 822–835. [Google Scholar] [CrossRef]
  17. Amiriparian, S.; Gerczuk, M.; Ottl, S.; Cummins, N.; Freitag, M.; Pugachevskiy, S.; Baird, A.; Schuller, B. Snore sound classification using image-based deep spectrum features. In Proceedings of the INTERSPEECH 2017, Stockholm, Sweden, 20–24 August 2017. [Google Scholar]
  18. Dong, Q.; Jiraraksopakun, Y.; Bhatranand, A. Convolutional Neural Network-Based Obstructive Sleep Apnea Identification. In Proceedings of the 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS), Chengdu, China, 23–26 April 2021; pp. 424–428. [Google Scholar]
  19. Wang, L.; Guo, S.; Huang, W.; Qiao, Y. Places205-vggnet models for scene recognition. arXiv 2015, arXiv:1508.01667. [Google Scholar]
  20. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  21. Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar]
  22. Maritsa, A.A.; Ohnishi, A.; Terada, T.; Tsukamoto, M. Audio-based Wearable Multi-Context Recognition System for Apnea Detection. In Proceedings of the 2021 6th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Kyushu, Japan, 25–27 November 2021; Volume 6, pp. 266–273. [Google Scholar]
  23. Maritsa, A.A.; Ohnishi, A.; Terada, T.; Tsukamoto, M. Apnea and Sleeping-state Recognition by Combination Use of Openair/Contact Microphones. In Proceedings of the INTERACTION 2022; Information Processing Society of Japan (IPSJ): Tokyo, Japan, 2022; pp. 87–96. [Google Scholar]
  24. Wang, C.; Peng, J.; Song, L.; Zhang, X. Automatic snoring sounds detection from sleep sounds via multi-features analysis. Australas. Phys. Eng. Sci. Med. 2017, 40, 127–135. [Google Scholar] [CrossRef] [PubMed]
  25. Shen, F.; Cheng, S.; Li, Z.; Yue, K.; Li, W.; Dai, L. Detection of snore from OSAHS patients based on deep learning. J. Healthc. Eng. 2020, 2020, 8864863. [Google Scholar] [CrossRef] [PubMed]
  26. Wu, D.; Tao, Z.; Wu, Y.; Shen, C.; Xiao, Z.; Zhang, X.; Wu, D.; Zhao, H. Speech endpoint detection in noisy environment using Spectrogram Boundary Factor. In Proceedings of the 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong, China, 15–17 October 2016; pp. 964–968. [Google Scholar] [CrossRef]
  27. Cheng, S.; Wang, C.; Yue, K.; Li, R.; Shen, F.; Shuai, W.; Li, W.; Dai, L. Automated sleep apnea detection in snoring signal using long short-term memory neural networks. Biomed. Signal Process. Control 2022, 71, 103238. [Google Scholar] [CrossRef]
  28. Sun, X.; Ding, L.; Song, Y.; Peng, J.; Song, L.; Zhang, X. Automatic identifying OSAHS patients and simple snorers based on Gaussian mixture models. Physiol. Meas. 2023, 44, 045003. [Google Scholar] [CrossRef]
  29. Song, Y.; Sun, X.; Ding, L.; Peng, J.; Song, L.; Zhang, X. AHI estimation of OSAHS patients based on snoring classification and fusion model. Am. J. Otolaryngol. 2023, 44, 103964. [Google Scholar] [CrossRef] [PubMed]
  30. Ding, L.; Peng, J.; Song, L.; Zhang, X. Automatically detecting apnea-hypopnea snoring signal based on VGG19 + LSTM. Biomed. Signal Process. Control 2023, 80, 104351. [Google Scholar] [CrossRef]
  31. SoX-Sound eXchange. Available online: https://sourceforge.net/projects/sox/ (accessed on 29 May 2024).
  32. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
  33. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  34. Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.C.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174. [Google Scholar] [CrossRef]
  35. Bkassiny, M.; Li, Y.; Jayaweera, S.K. A survey on machine-learning techniques in cognitive radios. IEEE Commun. Surv. Tutor. 2012, 15, 1136–1159. [Google Scholar] [CrossRef]
  36. Serrano, S.; Scarpa, M.; Maali, A.; Soulmani, A.; Boumaaz, N. Random sampling for effective spectrum sensing in cognitive radio time slotted environment. Phys. Commun. 2021, 49, 101482. [Google Scholar] [CrossRef]
  37. Bithas, P.S.; Michailidis, E.T.; Nomikos, N.; Vouyioukas, D.; Kanatas, A.G. A survey on machine-learning techniques for UAV-based communications. Sensors 2019, 19, 5170. [Google Scholar] [CrossRef]
  38. Grasso, C.; Raftopoulos, R.; Schembra, G.; Serrano, S. H-HOME: A learning framework of federated FANETs to provide edge computing to future delay-constrained IoT systems. Comput. Netw. 2022, 219, 109449. [Google Scholar] [CrossRef]
  39. Serrano, S.; Sahbudin, M.A.B.; Chaouch, C.; Scarpa, M. A new fingerprint definition for effective song recognition. Pattern Recognit. Lett. 2022, 160, 135–141. [Google Scholar] [CrossRef]
  40. Sahbudin, M.A.B.; Chaouch, C.; Scarpa, M.; Serrano, S. IOT based song recognition for FM radio station broadcasting. In Proceedings of the 2019 7th International Conference on Information and Communication Technology (ICoICT), Kuala Lumpur, Malaysia, 24–26 July 2019; pp. 1–6. [Google Scholar]
  41. Sahbudin, M.A.B.; Scarpa, M.; Serrano, S. MongoDB clustering using K-means for real-time song recognition. In Proceedings of the 2019 International Conference on Computing, Networking and Communications (ICNC), Honolulu, HI, USA, 18–21 February 2019; pp. 350–354. [Google Scholar]
  42. Serrano, S.; Scarpa, M. Fast and Accurate Song Recognition: An Approach Based on Multi-Index Hashing. In Proceedings of the 2022 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), Split, Croatia, 22–24 September 2022; pp. 1–6. [Google Scholar] [CrossRef]
  43. Alharbi, S.; Alrazgan, M.; Alrashed, A.; Alnomasi, T.; Almojel, R.; Alharbi, R.; Alharbi, S.; Alturki, S.; Alshehri, F.; Almojil, M. Automatic speech recognition: Systematic literature review. IEEE Access 2021, 9, 131858–131876. [Google Scholar] [CrossRef]
  44. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  45. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN Architectures for Large-Scale Audio Classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
  46. The MathWorks Inc. MATLAB Version: 23.2.0.2515942 (R2023b) Update 7. Available online: https://www.mathworks.com (accessed on 29 May 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
