1. Introduction
The Sleep Apnea Syndrome (SAS) is a breathing disorder that was recognized as a chronic pathological condition in the late 1970s [1]. It is now proven that this syndrome, characterized by frequent cessation or partial reduction of the breathing process [2,3], is linked to severe daytime fatigue [4,5], headaches, depression [5,6], and an increased risk of work and vehicle accidents [7,8,9]. In the long term, this syndrome is a strong risk factor for severe pathological conditions such as hypertension [10,11] and cardiovascular diseases [12].
The clinical gold standard for diagnosing apnea syndrome is polysomnography, a full-night, supervised sleep study that is performed in a laboratory and includes continuous monitoring of the respiratory, cardiac, and neurological activity of the body during sleep [13]. The most common index for quantifying the severity of the syndrome is the Apnea-Hypopnea Index (AHI), defined as the average count of apnea or hypopnea events per hour of sleep. An apneic event is defined as a severe reduction (more than 90%) in the airflow during respiration for at least 10 s, while a hypopneic event is a partial reduction of the airflow (at least 30%) of the same minimum duration, provided a related oxygen desaturation or arousal is present [2]. Despite its broad use, recent studies suggest that the AHI is rather insufficient in providing a clear view of the syndrome's severity and the health state of the patient [14,15]. Due to the observed inter- and intra-night variation of the AHI [16], sleep scientists suggest employing additional indices reflecting the severity of the syndrome, such as: the Oxygen Desaturation Index (ODI) [17], defined as the number of events of a blood oxygen drop greater than 4% per hour of sleep; the Respiratory Distress Index (RDI) [14], defined as the combined number of apnea and hypopnea events along with the respiratory event-related arousals per sleeping hour; and the Mean Apnea Duration (MAD) index [18], defined as the average duration of all apneic/hypopneic events. It is also highly recommended to focus on each specific apnea/hypopnea episode rather than their average count [14]. The exact time localization of apnea/hypopnea events during the night offers the capacity to study their specific characteristics, e.g., their duration. Indeed, recent studies suggest a correlation between the MAD and both hypertension and mortality rate for patients suffering from severe or moderate apnea syndrome [18,19,20]. Although the MAD index is extracted as an average over the entire set of apnea/hypopnea events, its computation requires precise detection of the onset and termination of each event.
The gold standard for apnea diagnosis, polysomnography (PSG), is inconvenient for the patient, while the cost of in-hospital examination of the syndrome renders remote diagnosis through accurate portable apnea detection systems imperative. Audio recordings, and particularly tracheal sound, have been employed as a signal to detect SAS severity [21,22,23,24,25]. The recent advances in neural networks render them particularly attractive in this field of application as well, providing promising results [26,27]. Specifically, Nakano et al. developed a deep neural network capable of detecting sleep apnea events from tracheal sound with a detection resolution of 15 s by employing more than 1800 patients, reporting an accuracy of more than 90% in classifying the AHI into the four main severity groups [26]. Research by Kim et al., based on machine learning techniques, used 120 patients to classify the four severity groups with an accuracy of approx. 88%. In general, neural network approaches are highly dependent on the number of patients participating in the training and validation sets, while the reported accuracy is expected to be reduced when applied in the real-time processing of audio recordings. Additionally, the time detection of apnea/hypopnea events is rarely performed [26,28,29], with the developed diagnostic tools focusing mainly on SAS severity estimation [21,27]. In particular, Saha et al. [28] were the first to indicate the importance of detecting individual events and to discuss the time detection accuracy needed to discriminate between false and true positive detections. They reported a mean error of approx. 5 s in individual event detection; however, they acknowledged the need for further investigation in the field due to restrictions related to the small sample size (69 patients) and the use of additional sensing elements [28]. Alshaer et al. [29] also studied pattern recognition techniques to detect individual events rather than specific sound features for patient classification. However, their study, employing only 50 patients, does not report the time detection error of the episodes but only the AHI, with a reported accuracy of ~88% [29]. Therefore, the field remains open to more approaches that could enrich our understanding of the sound features that can identify apneic or hypopneic events during sleep.
In this context, and based on the new requirements for a deeper study of the characteristics of diagnosed SAS described above, we developed a simple algorithm for apnea/hypopnea event detection in patients diagnosed with severe or moderate SAS. The proposed algorithm is based solely on breathing sound recordings, making use of the events annotated according to the interpretation of the polysomnographic data. More specifically, we employ breathing sound recordings from tracheal and ambient microphones during the entire night, and we investigate the hypothesis of using standard voice activity detection (VAD) algorithms for identifying breathing sound during sleep. Breathing sound detection performed by VAD algorithms can provide information on the time detection of apnea and hypopnea events [30]. The proposed algorithm profits from (a) a validation process relying on a large dataset of patients diagnosed with "severe" or "moderate" apnea syndrome, (b) increased sensitivity in the detection of apneic/hypopneic events, and (c) adequate accuracy in the detection of the MAD of a patient with severe or moderate SAS. Thus, although it cannot be used as a standalone process for SAS diagnosis, we envision its use (a) for further examination of patients with SAS at home, to improve the treatment they receive, (b) for a deeper understanding of the breathing sound and its potential employment in apnea-related applications, and (c) as part of automated systems for apnea scoring in a hospital that take into consideration the sound characteristics of audio recordings from tracheal or ambient microphones.
2. Methods
2.1. Polysomnograms and Audio Data Collection and Storage
The presented study was based on the collection of polysomnographic data from 239 patients undergoing polysomnography (PSG) in the Sleep Study Unit of Sismanoglio General Hospital of Athens and diagnosed with "severe" or "moderate" SAS. In the supervised sleep study, the standard protocol for split-night PSG was followed [2]. According to this protocol, if a patient exhibits severe apnea within the first approx. 4 h of sleep, then the second part of the (typically 8-h long) PSG study is used for titration to the optimal pressure level to be used by a continuous positive airway pressure (CPAP) device. Consequently, the second part of the PSG study was excluded from the herein proposed algorithmic process. All the PSG recorded signals, comprising an electrocardiogram, an electrooculogram, an electromyogram, a flow rate signal recorded through pressure and thermal sensors, the respiratory effort signals monitoring the thoracoabdominal movement, and a low-quality tracheal contact microphone, were stored in EDF files through the PSG monitoring software "Sleepware G3".
Along with the PSG system, a separate dual-channel audio recording system was installed (Figure 1). A portable multitrack recorder (Tascam DR-680 MK II) was used to acquire and store audio signals from (a) a contact electret microphone (Clockaudio CTH100, 900 Ω input impedance) placed on the trachea of the patient and (b) an ambient, omnidirectional, condenser microphone (Behringer ECM8000, electret, 900 Ω input impedance) located approx. 1 m above the position of the patient's head. The contact tracheal microphone exhibits a spectral response in the range of 350 Hz to 8 kHz, while the ambient microphone has a flat frequency response in the range of 15 Hz to 20 kHz. The breathing sound signals from each of these microphones were sampled at 48 kHz and stored in dual-channel uncompressed WAV files. The maximum time length of each WAV file was approx. 2 h, and the entire night recording was stored in consecutive files of the same duration.
The data collection process was in full compliance with the European regulations for personal data protection [31]. The local ethics committee of Sismanoglio General Hospital of Athens approved the described data acquisition process, provided signed consent was given by the participating subjects. The patients participating in this study were asked to provide signed consent for the use of the recordings, both PSG and audio, acquired during their sleep study, provided that any data that might lead to their identification were removed. All such information was removed by the clinical staff prior to any processing of the data for the purposes of this research study.
2.2. Data Annotation and Medical Diagnosis
The medical characterization of the PSG recordings was performed by the clinical team of the Sleep Study Unit of Sismanoglio General Hospital of Athens. The labeling of respiratory events and sleep stages followed the standard protocol for Sleep Apnea Diagnosis [2]. The respiratory episodes related to this research study included apnea and hypopnea events: apneas are defined as events of at least 10 s duration (3 respiratory cycles) with a severe reduction in the airflow (more than 90%), while hypopneas are defined as events of the same minimum duration characterized by a moderate (at least 30%) reduction in the airflow, provided the event is associated with a relative oxygen desaturation of at least 3% or with an arousal [2]. Further categorization of apneic events includes their characterization as "central" (no respiratory effort is present), "obstructive" (respiratory effort is present during the entire event), or "mixed" (involving characteristics of both previous categories). According to the latest update of the Sleep Apnea Syndrome annotation protocol, hypopneas are not further subcategorized and are considered obstructive hypopneas.
The identification and classification of respiratory events leads to the extraction of the total count of apnea/hypopnea events per sleeping hour. This value, defined as the AHI, is the one most frequently used for the final diagnosis of sleep apnea syndrome, characterizing the patient in one of the following classes: non-apnea (AHI < 5 episodes/h), mild apnea (5 episodes/h ≤ AHI < 15 episodes/h), moderate apnea (15 episodes/h ≤ AHI < 30 episodes/h), severe apnea (AHI ≥ 30 episodes/h). In this study, we kept and examined only the patients suffering from severe or moderate SAS.
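The AHI thresholds above map directly to a severity class; as a minimal illustration (the function name is ours, not part of the study), the classification can be written as:

```python
def sas_severity(ahi):
    """Map an Apnea-Hypopnea Index (episodes/h) to the severity
    classes defined by the standard thresholds quoted in the text."""
    if ahi < 5:
        return "non-apnea"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"
```

Only patients falling in the "moderate" or "severe" classes were retained in this study.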
The acquired data and the annotations of sleep stages and apnea events are publicly available [32]. The online dataset is part of an ongoing research project; thus, further expansion is expected in the following months/years.
2.3. Audio Recordings Synchronization with PSG Channels
The PSG and the audio recording systems were activated manually by the specialist supervising each PSG study, within a time difference of a few seconds. Precise calculation of this time delay was imperative in order to transfer the time occurrence of apnea events to the audio recordings. The synchronization was performed using the low-quality contact microphone incorporated in the PSG system. This second contact microphone, recording tracheal sound throughout the entire night, had a sampling frequency of 500 Hz, too low to provide a high-quality breathing sound recording. The synchronization algorithm was applied to signal excerpts of a total length of 20 min and included the following steps: (a) filtering and downsampling of the high-quality contact microphone signal so that the two sensors had a common sampling rate of 500 Hz; (b) envelope extraction of the two signals based on the Hilbert transformation; (c) extraction of the cross-correlation of the two signal envelopes; and (d) determination of the time point of maximum cross-correlation, which gave the time delay between the acquired signals. The accuracy of the time delay determination was important to assure optimum validation of any system for apnea/hypopnea event detection. Thus, manual observation of all signals was performed to verify that the maximum error in the time delay did not exceed 2 s.
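Steps (a) to (d) can be sketched as follows (a hedged illustration using NumPy/SciPy; the function name and signature are our own framing, not the study's code):

```python
import numpy as np
from scipy.signal import correlate, hilbert, resample_poly

def estimate_time_delay(hq_signal, fs_hq, psg_mic, fs_psg=500):
    """Estimate the delay (in s) of the PSG contact-microphone channel
    relative to the high-quality tracheal recording."""
    # (a) Downsample the high-quality signal to the common 500 Hz rate.
    hq_lowrate = resample_poly(hq_signal, up=fs_psg, down=fs_hq)
    # (b) Amplitude envelopes via the Hilbert transform.
    env_hq = np.abs(hilbert(hq_lowrate))
    env_psg = np.abs(hilbert(psg_mic))
    # (c) Cross-correlation of the two envelopes.
    xcorr = correlate(env_psg, env_hq, mode="full")
    # (d) The lag of maximum cross-correlation gives the time delay.
    lag = int(np.argmax(xcorr)) - (len(env_hq) - 1)
    return lag / fs_psg
```

A positive return value means the event appears later in the PSG microphone channel than in the high-quality recording.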
2.4. Audio Signal Processing for the Detection of Breathing and the Extraction of Apnea/Hypopnea Events
The rationale of our proposed methodology was based on the following assumptions: (a) a VAD is capable of detecting breathing sounds as well (an assumption that has been made before, e.g., [30], and is also supported by the findings reported herein); (b) the detection of breathing sounds, or of their absence for that matter, can be used to infer episodes in which breathing has completely stopped or has been severely altered; and (c) completely stopped or severely altered breathing corresponds to apneic or hypopneic episodes, respectively.
The proposed algorithm was designed for application in real-time processing of audio recordings, but its use was herein emulated with stored WAV files of a maximum duration of approx. 2 h each. A window length of 0.1 s and an equal slide step (i.e., no overlap) were used to parse each stored WAV file. For each window, the probability of breathing was calculated using a statistical model-based VAD that is efficient in low signal-to-noise environments [33]. Particular emphasis was paid to estimating the noise power spectral density by extracting the power spectral density in all frequency bands, as described in [34]. Then, the noise variance was used to estimate the posterior signal-to-noise ratio (SNR), defined as the power of the observed noisy signal over the noise power, and the prior SNR, defined as the power of the clean signal over the noise power, according to the minimum mean-square error (MMSE) short-time spectral amplitude estimator [35]. A block diagram of our proposed algorithm is depicted in Figure 2.
Next, we calculated the breathing probability within each frame using a hidden Markov model (HMM) that relies on the a priori breathing-to-silence and silence-to-breathing transition probabilities as defined in [33]. In our case, these probabilities were calculated within the time duration of a typical respiratory cycle, assumed to be equal to 4 s. As such, the aforementioned a priori probabilities were calculated over 40 non-overlapping sliding windows of a duration of 0.1 s each. Among these frames, 2 were expected to represent a transition from silence to breathing and 2 a transition from breathing to silence. Thus, the two factors were both set equal to 2/40 = 5%.
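The combination of per-frame likelihood ratios with HMM smoothing can be sketched as follows. This is a rough, self-contained simplification, not the exact implementation of [33]: the maximum-likelihood prior-SNR estimate and all names are our own assumptions.

```python
import numpy as np

def breathing_probability(frames_power, noise_power, a01=0.05, a10=0.05):
    """Sketch of a statistical model-based detector in the spirit of [33]:
    per-frame likelihood ratios from prior/posterior SNR estimates,
    smoothed by a two-state (silence/breathing) HMM forward recursion.
    frames_power: (n_frames, n_bins) spectral powers; noise_power: (n_bins,).
    a01/a10: a priori silence-to-breathing / breathing-to-silence factors."""
    # Posterior SNR: observed noisy power over estimated noise power.
    gamma = np.maximum(frames_power / noise_power, 1e-12)
    # Crude maximum-likelihood prior SNR (a decision-directed estimate
    # would be used in a full implementation).
    xi = np.maximum(gamma - 1.0, 1e-12)
    # Mean log likelihood ratio across frequency bins (Gaussian model).
    llr = np.mean(gamma * xi / (1.0 + xi) - np.log1p(xi), axis=1)
    lr = np.exp(np.clip(llr, -50, 50))
    # HMM forward smoothing of the breathing-presence probability.
    p = np.empty(len(lr))
    prev = 0.5
    for t, l in enumerate(lr):
        predict = prev * (1 - a10) + (1 - prev) * a01
        odds = predict / (1 - predict) * l
        prev = odds / (1 + odds)
        p[t] = prev
    return p
```

The HMM step suppresses isolated spurious frames: a single high-likelihood frame in the middle of silence is damped by the small silence-to-breathing prior.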
Since the sound characteristics of breathing and background noise are expected to change throughout the entire night recording, the average breathing probability (averaged over 40 consecutive 0.1 s windows) was also expected to be volatile. Thereupon, we used a probability normalization step in order to detect the transitions from inhalation/exhalation to silence and vice versa. The threshold determining the transition from the breathing to the non-breathing state and vice versa was selected to be equal to the mean breathing probability of the latest 150 s of the observed audio signal (Figure 2a). The final detector derived from the breathing probability for the entire audio signal (Figure 3a) was smoothed by rejecting all detected regions that lasted less than 0.7 s (Figure 2a), since shorter durations are rather unlikely to represent a complete inhalation or exhalation (Figure 3b). The exact threshold value of 0.7 s was determined by taking into account the theoretical duration of a breathing cycle (3 to 5 s), combined with the expected imbalance between inhalation and exhalation duration; thus, in normal cases, the minimum inhalation frame was expected to last approximately 1 s. To refine this threshold, we relied on extensive observation of the acquired audio signals and trial-and-error selections that led to the final value of 0.7 s.
The output of the breathing detector consisted of ones and zeros denoting breathing or non-breathing, respectively, and was used to mask the audio signal in order to provide an output containing the original audio signal in fragments of detected breathing and zeros elsewhere (Figure 2a). For each detected breathing segment, the average root mean square (RMS) value was extracted and used as a representative metric of this segment (Figure 2a). Therefore, we ended up with a weighted breathing detector, with each detected frame represented by the mean RMS value of the audio signal within the frame (Figure 3c). We assumed that an apnea or hypopnea event presents radically reduced breathing activity within the event. Thus, we selected all regions of the weighted detector in which the mean RMS value of the detected breathing frames was reduced by at least 50% compared to the RMS values of the previous and following detected breathing frames (Figure 2a). The selected episodes should last at least 10 s but no more than 150 s. The upper threshold on apnea/hypopnea event duration was estimated based on statistics of previously reported datasets [36] and the dataset used in this study, both of which resulted in a maximum scored apnea duration slightly above 2 min. The threshold of 150 s was considered a safe selection to include all events of the maximum expected duration. We also required a separation of at least 8 s between detected events; indeed, statistically, we observed two complete respiratory cycles separating adjacent apneic episodes (corresponding to 6–10 s). The optimum value of 8 s was selected through a trial-and-error process. Thus, all events detected closer than this value to previously identified episodes were rejected as false detections.
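The event-selection rule can be illustrated with a simplified sketch operating on already-detected breathing segments (the tuple representation and the thresholds-as-arguments are our own framing of the rules above, not the study's code):

```python
def detect_apnea_events(segments, drop_ratio=0.5, min_dur=10.0,
                        max_dur=150.0, min_gap=8.0):
    """Hedged sketch of the event selection: a candidate event is a
    breathing segment whose RMS drops by at least 50% relative to the
    neighbouring detected segments, lasting 10-150 s, with events closer
    than 8 s to the previous accepted event rejected.
    segments: list of (start_s, end_s, rms) for detected breathing."""
    events = []
    for i in range(1, len(segments) - 1):
        prev_seg, cur, nxt = segments[i - 1], segments[i], segments[i + 1]
        # RMS reduced by at least drop_ratio versus both neighbours.
        if cur[2] <= drop_ratio * prev_seg[2] and cur[2] <= drop_ratio * nxt[2]:
            start, end = cur[0], cur[1]
            if not (min_dur <= end - start <= max_dur):
                continue  # outside the plausible event duration range
            if events and start - events[-1][1] < min_gap:
                continue  # too close to the previous event: false detection
            events.append((start, end))
    return events
```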
2.5. Methodology for Validating the Performance of the Algorithmic Process
To validate the performance of the developed algorithmic process, we quantified the sensitivity and precision in the detection of the total count of apneic or hypopneic episodes, irrespectively of the patient they belong to. These two metrics were defined as:

Sensitivity = TP/(TP + FN), Precision = TP/(TP + FP),

where TP is the count of True Positive detections, FN is the count of False Negative detections, and FP is the count of False Positive detections. An episode was considered a TP if there was an overlap between the detected and the annotated event. If an annotated episode did not present any overlap with the detected events, it was considered non-detected, i.e., an FN. If a detected episode did not have any overlap with the annotated events, it was counted as an FP.
For each detected episode, we also extracted the relationship between the detected event duration and the duration of the corresponding reference event, as annotated by the clinical team. The determination of the algorithmic performance at the level of individual events was the first purpose of this work, focusing on the possibility to accurately detect the apneic episodes in time and extract information relevant to each event.
In addition to the above factors, we also examined the duration distribution of all detected events (both true and false positives) compared to the distribution of the reference annotated events, and we compared the MAD index estimated for each patient with the reference MAD index derived from the reference episodes.
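The overlap-based counting of TP, FN, and FP described above can be sketched as follows (intervals as (start, end) pairs in seconds; the function names are ours):

```python
def overlap(a, b):
    """Overlap length (s) between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def event_detection_metrics(detected, reference):
    """Sensitivity and precision as defined in the text: a reference
    event with any overlap to a detection is a TP; reference events with
    no overlap are FNs; detections with no overlap are FPs."""
    tp = sum(1 for r in reference if any(overlap(d, r) > 0 for d in detected))
    fn = len(reference) - tp
    fp = sum(1 for d in detected if all(overlap(d, r) == 0 for r in reference))
    sensitivity = tp / (tp + fn) if reference else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision
```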
4. Discussion
The developed algorithm is based on a well-established signal processing methodology for VAD [33,34,35]. VAD algorithms have been widely used in breathing detection [30], and their application, especially in audio recordings during the night, is believed to perform particularly well due to the absence of speech. The hypothesis examined here is the possibility to identify the exact time location of apnea/hypopnea events through the estimation of the mean RMS value of the audio signal during the detected breathing segments. The achieved sensitivity proves that the proposed pattern is indeed present in the majority of apneic episodes (81%) and in about half of the hypopneas exhibited by patients with moderate or severe apnea syndrome. The developed algorithm was also tested for the case of an ambient microphone, where its performance is slightly reduced.
A comparison between the sensitivities provided by the two independently employed microphones gives a clear view of their performance in detecting apnea and hypopnea episodes. In the case of apnea events, a systematically increased sensitivity is observed for the tracheal microphone (approx. 11%) compared to the ambient microphone. This is attributed to the fact that the tracheal microphone is in direct contact with the source producing the targeted sound, the patient's trachea, and is practically insensitive to any external noise. The ambient microphone is sensitive to environmental noise; in the sleep study unit in particular, this includes the air conditioning sound, the sound of any other equipment in the room, and the speech of the clinical staff supervising the study. However, in the case of hypopneas, the difference in sensitivity is much lower. The developed algorithm does not categorize breathing sounds into different frequency bands or into classes of breathing and snoring. Thus, mild breathing and snoring may be recorded at the same sound level by the contact microphone. This is expected to be the case more often in hypopneas, in which some inhalation sounds are expected within the event. The ambient microphone may give a clearer view of these events by performing better in the aforementioned discrimination.
Despite the good sensitivity, the algorithmic process presents moderate precision, producing many false-positive detections per patient. Based on the extracted distribution of the estimated and reference events, it is clear that the low precision is attributed to an overestimation of the number of short-length episodes (10 s to 20 s) by both microphones (Figure 4). The medium-length events (25 to 35 s) are slightly underestimated. The detected duration of the events presents negligible error differences between events produced with the patient in a supine or lateral position ("right" and "left" side), and the same holds when the detected events are categorized with respect to sleep stages (REM, non-REM) (see Supplementary Material). However, future steps toward a deeper examination of the factors affecting the detected duration error (time spent in the supine position in the hospital and at home, the ratio between REM/non-REM stage durations) are imperative for an accurate interpretation of these results.
Overall, the number of detected events is higher than the number of reference events. The problem of an increased false detection rate is more pronounced for patients with moderate SAS than for those with severe SAS. We have to mention here that the algorithm in its current version is not intended for use in the extraction of the AHI; this was beyond the scope of the present work, and such use would lead to a severe overestimation of the AHI (see also Supplementary Materials). Further analysis of the properties of the detected episodes may result in effective discrimination between true and false positive episodes, which should be among the future perspectives of this work.
MAD is an index attracting increasing interest from the scientific community, as it can be indicative of the severity of the syndrome and the general health status of patients with severe or moderate SAS. In particular, a long average apnea duration is related to morning tiredness and hypertension [19], as well as to significantly worse blood oxygen levels and sleep parameters such as significantly shorter sleep latency, reduced duration of the NREM 3 stage, and an increased arousal index [20]. The extraction of the MAD requires the accurate time localization of apneic and hypopneic events and not just an estimation of breathing sound features. In this paper, we provide an algorithmic process for an adequate estimation of the MAD of each patient. Based on tracheal sound recordings, the MAD index is underestimated for patients with a MAD greater than 20 s. This underestimation worsens in the case of ambient microphone recordings. Only for a few patients with a low MAD index (10 to 20 s) is the MAD overestimated by the developed algorithm (see Supplementary Materials). For the majority of the patients (160 out of 239, 66.5%), the absolute error in the estimation of the MAD index does not exceed 5 s. This error can be partially explained by the resolution of the time scoring of reference apneas in the software used by the clinical team (Sleepware G3) and by the error in the synchronization of the audio signals with the polysomnography signals, estimated at a maximum of 2 s.
Based on the aforementioned results, we believe that the conducted study contributes to a better understanding of the breathing sound features during an apnea/hypopnea event. The present study shows that VAD algorithms can be used with increased sensitivity in the identification of apnea/hypopnea events. The VAD algorithm employed in this study is one of the earliest and simplest algorithms, in terms of computational complexity, developed in the field of voice activity detection, profiting especially from the accuracy of its noise estimation. This is of crucial importance in the field of sleep breathing detection, since breathing/snoring characteristics are much closer to those of noise than to those of speech. Future steps in this field of application include the testing of more sophisticated approaches, especially those depending on the latest advances in deep learning techniques [38]. Concerning possible clinical applications, this study can prove beneficial for: (a) the inclusion of automated sound analysis as part of the laboratory PSG interpretation; indeed, for the time being, breathing audio recordings are practically excluded from the apnea event scoring process; (b) home-based systems focusing on individual event detection to screen the progress of the disease in diagnosed patients; and (c) home-based systems aiming at identifying the syndrome's severity based on the MAD estimation. The latter is particularly important, taking into account that the MAD estimation is not included in the home-based systems proposed in the literature, and that the MAD is expected to be clearer when studied in the home environment than in the hospital, due to its dependence on sleep stages and sleep comfort.