Automated Detection of Sleep Apnea-Hypopnea Events Based on 60 GHz Frequency-Modulated Continuous-Wave Radar Using Convolutional Recurrent Neural Networks: A Preliminary Report of a Prospective Cohort Study

Radar is a promising non-contact sensor for overnight polysomnography (PSG), the gold standard for diagnosing obstructive sleep apnea (OSA). This preliminary study aimed to demonstrate the feasibility of the automated detection of apnea-hypopnea events for OSA diagnosis based on 60 GHz frequency-modulated continuous-wave radar using convolutional recurrent neural networks. The dataset comprised 44 participants from an ongoing OSA cohort, recruited from July 2021 to April 2022, who underwent overnight PSG with a radar sensor. All PSG recordings, including sleep and wakefulness, were included in the dataset. Model development and evaluation were based on a five-fold cross-validation. The area under the receiver operating characteristic curve for the classification of 1-min segments ranged from 0.796 to 0.859. Depending on OSA severity, the sensitivities for apnea-hypopnea events were 49.0–67.6%, and the number of false-positive detections per participant was 23.4–52.8. The estimated apnea-hypopnea index showed strong correlations (Pearson correlation coefficient = 0.805–0.949) and good to excellent agreement (intraclass correlation coefficient = 0.776–0.929) with the ground truth. There was substantial agreement between the estimated and ground truth OSA severity (kappa statistics = 0.648–0.736). The results demonstrate the potential of radar as a standalone screening tool for OSA.


Introduction
Obstructive sleep apnea (OSA) is the most common type of sleep-disordered breathing, characterized by the recurrent cessation of breathing during sleep due to complete or partial obstruction of the upper airway [1]. Repetitive episodes of hypopnea and apnea can result in sleep disruption and the alteration of neural activity due to intermittent hypoxemia, hypercapnia, microarousals, and fragmented sleep [2]. OSA is associated with an elevated risk of hypertension, stroke, and type 2 diabetes mellitus [3]. In addition, cognitive dysfunction, cardiovascular and cerebrovascular diseases, and increased mortality are highly prevalent in OSA patients [4,5]. Polysomnography (PSG) is the gold standard for diagnosing OSA [6]. However, PSG requires a specialized facility with trained technicians and various sensors directly attached to the patient's body, which causes inconvenience to patients.

Polysomnography
Sleep quality and disturbances during the month prior to the study were evaluated using the Pittsburgh Sleep Quality Index [31]. The PSG recording was conducted with Twin-PSG software (Natus Neurology Incorporated, West Warwick, RI, USA) using a standard PSG routine with the addition of the radar sensor. The recordings of bioelectrical potentials included a 6-channel electroencephalogram, a 4-channel electrooculogram, an electromyogram, and an electrocardiogram. A thermistor, a nasal air pressure monitoring sensor, an oximeter, piezoelectric bands, and a body position sensor were also applied to the patients. PSG data, including sleep parameters and respiratory events, were scored according to the American Academy of Sleep Medicine (AASM) manual [32]. Apnea was defined as a ≥90% reduction in airflow lasting at least 10 s. Apnea was further classified as central or obstructive if inspiratory effort was absent or present, respectively, throughout the entire period of absent airflow.
Hypopnea was defined as a ≥30% reduction in airflow lasting for at least 10 s and was associated with either a ≥3% oxygen desaturation or arousal. The apnea-hypopnea index (AHI) was defined as the number of apnea-hypopnea events divided by the total sleep time (hours) in the PSG scoring data. The severity of OSA was classified into four categories based on AHI: normal (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), and severe (AHI ≥ 30).
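For concreteness, the AHI computation and the severity categories defined above can be expressed as a small helper; this is an illustrative sketch, not code from the study.

```python
def compute_ahi(num_events: float, total_sleep_time_hours: float) -> float:
    """AHI: number of apnea-hypopnea events per hour of total sleep time."""
    return num_events / total_sleep_time_hours

def osa_severity(ahi: float) -> str:
    """Four-category OSA severity classification used in the study."""
    if ahi < 5:
        return "normal"
    elif ahi < 15:
        return "mild"
    elif ahi < 30:
        return "moderate"
    return "severe"
```

For example, 84 scored apnea-hypopnea events over 7 h of total sleep time give an AHI of 12, i.e., mild OSA.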

Radar Setup
An FMCW radar sensor (AU Inc., Daejeon, Korea) was placed 2 m from the patient's chest on the ceiling of the sleep laboratory (Figure 1). The architecture of the radar sensor is illustrated in Figure 2, and the general principles of radar signal acquisition are provided in the Supplementary Materials. Figure 3 shows the overall flow of extracting the respiratory signals from the raw radar signals. First, the distance to the target was determined by applying a fast Fourier transform to the ADC samples. Then, to remove high-frequency noise, the chirp-to-chirp variation signals at the target distance were filtered using a low-pass filter. Finally, the respiratory signals were demodulated from the filtered signals. Most previous studies [33] used phase demodulation to extract respiration signals. However, phase demodulation suffers from in-phase and quadrature (IQ) mismatch: the mismatch shifts the center coordinate of the respiration-induced trajectory in the complex plane. In this study, we instead accumulated the amplitude of the difference between two complex-domain vectors separated by one chirp interval. Because this demodulation method considers only the difference between the two vectors, the effect of IQ mismatch can be suppressed [34]. Figure 4 illustrates the demodulation method: respiratory signals in the complex domain are shown in Figure 4a, and the demodulated respiratory signals, obtained by integrating the amplitudes of the difference vectors with signs determined by the rotation direction, are shown in Figure 4b.
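The difference-vector demodulation described above can be sketched as follows. This is a minimal illustration of the idea, not the study's implementation: it assumes the complex slow-time signal at the target range bin has already been extracted and low-pass filtered, and it reads the rotation direction from the phase increment between chirps, which presumes a roughly origin-centered trajectory.

```python
import numpy as np

def demodulate_respiration(v: np.ndarray) -> np.ndarray:
    """Difference-vector demodulation of a complex slow-time signal.

    v: one complex sample per chirp at the target range bin, after
    low-pass filtering. The amplitude of each chirp-to-chirp difference
    vector (which is unaffected by a constant trajectory offset) is
    accumulated with a sign given by the rotation direction.
    """
    diff = v[1:] - v[:-1]
    # Rotation direction: sign of the phase increment between chirps,
    # read from Im(v[n] * conj(v[n-1])). This simple form assumes the
    # trajectory is approximately centered at the origin.
    direction = np.sign(np.imag(v[1:] * np.conj(v[:-1])))
    return np.cumsum(np.abs(diff) * direction)
```

For a purely circular (breathing-induced) trajectory, the accumulated signed chord lengths approximate the unwrapped phase displacement, i.e., the chest-wall motion.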

Data Preparation and Preprocessing
A full-night radar recording was obtained regardless of sleep stage to emulate a real-world application, because radar alone cannot determine sleep status. The respiratory signals acquired from the radar were preprocessed using minimal signal-processing methods. They were first downsampled from the original sampling frequency of 1000/33 Hz (approximately 30.3 Hz) to 8 Hz using Fourier transformation. Next, the radar data were segmented into 1-min segments with a stride of 30 s. A segment-wise z-score normalization was then performed based on the mean and standard deviation of the signal values. Event labels were annotated with a 1-s temporal resolution according to the reference PSG. Each segment was considered abnormal if it was labeled as apnea or hypopnea for at least 10 consecutive seconds.
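A minimal NumPy sketch of the segmentation, normalization, and segment-labeling steps described above (function and variable names are illustrative, and the downsampling step is assumed to have already been applied):

```python
import numpy as np

def make_segments(signal, labels, fs=8, seg_sec=60, stride_sec=30,
                  min_event_sec=10):
    """Cut an 8 Hz respiratory signal into 1-min segments with a 30-s
    stride, z-score normalize each segment, and mark a segment abnormal
    if the 1-s reference labels contain >= 10 consecutive abnormal
    seconds. labels: per-second binary array (1 = apnea or hypopnea)."""
    seg_len, stride = seg_sec * fs, stride_sec * fs
    segments, seg_labels = [], []
    for start in range(0, len(signal) - seg_len + 1, stride):
        seg = signal[start:start + seg_len]
        seg = (seg - seg.mean()) / (seg.std() + 1e-8)  # z-score per segment
        sec0 = start // fs
        lab = labels[sec0:sec0 + seg_sec]
        # Longest run of consecutive abnormal seconds in this segment.
        run = best = 0
        for x in lab:
            run = run + 1 if x else 0
            best = max(best, run)
        segments.append(seg)
        seg_labels.append(int(best >= min_event_sec))
    return np.stack(segments), np.array(seg_labels)
```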

Model Development
A five-fold cross-validation was performed for model development and evaluation. First, the study population was randomly split into five groups. In each fold, the signal segments obtained from patients in four of the groups served as the training set, and those from patients in the remaining group served as the validation set. In addition, two types of models were prepared: binary and multiclass. The apnea-hypopnea events were considered as a single abnormal class for the binary model and separated into distinct classes for the multiclass model. Inspired by SELDnet [28] for sound event localization and detection, a detection model with a CRNN architecture composed of convolutional neural network (CNN), recurrent neural network (RNN), and fully connected (FC) components was employed. The CNN component consisted of four convolutional blocks. Each block was implemented as a stack of two sets of 1-D convolution, batch normalization, and rectified linear unit (ReLU) layers, followed by a max-pooling layer and a dropout layer. All convolutional layers had 64 filters with a kernel size of 3 and a stride of 1. The max-pooling layers had a stride of 2, except for the last block, which had a stride of 1. The RNN component was a bidirectional long short-term memory (LSTM) layer with 128 units. The FC component consisted of an FC layer with 128 units, a ReLU layer, a dropout layer, and a final FC layer with as many units as classes. All dropout layers in the CRNN had a dropout rate of 0.2.
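One consequence of this design is the temporal resolution of the output: a 1-min segment at 8 Hz contains 480 samples, and the three max-pooling layers with stride 2 reduce the time axis to 60 steps, while the last block's pooling with stride 1 preserves the length (assuming 'same' padding, which is our assumption). The result is one prediction per second, matching the 1-s label resolution. A sketch of this shape trace:

```python
def crnn_time_steps(seg_sec=60, fs=8, pool_strides=(2, 2, 2, 1)):
    """Trace the time dimension through the four CNN blocks.
    Convolutions use stride 1, and pooling is assumed to use 'same'
    padding, so only the max-pooling strides change the length."""
    steps = seg_sec * fs  # 480 samples per 1-min segment at 8 Hz
    for s in pool_strides:
        steps //= s
    return steps
```

With the strides stated in the text, `crnn_time_steps()` returns 60, i.e., one RNN/FC prediction per second of input.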
The training objective was the sum of cross-entropy and Dice loss, adopted from medical image segmentation tasks [35], with a batch size of 64. The Adam optimizer [36] with an initial learning rate of 0.001 was used; the learning rate was halved if the validation loss did not improve for 10 epochs. The maximum number of training epochs was set to 100, and early stopping was applied if the validation loss did not improve for 25 epochs. The model development was implemented using Keras (version 2.8.0; https://keras.io/) with a TensorFlow (version 2.8.0; Google LLC, Mountain View, CA, USA) backend on a workstation with an NVIDIA GeForce RTX 3080 GPU (Nvidia, Santa Clara, CA, USA) and 31 GB RAM.
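A minimal NumPy sketch of the combined objective for the binary case (the exact Dice formulation of [35] may differ in details such as smoothing terms):

```python
import numpy as np

def ce_dice_loss(y_true, y_prob, eps=1e-7):
    """Sum of binary cross-entropy and soft Dice loss.
    y_true: binary targets; y_prob: predicted probabilities (same shape)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # Binary cross-entropy averaged over all time points.
    ce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    # Soft Dice over the positive (abnormal) class.
    intersection = np.sum(y_true * y_prob)
    dice = (2 * intersection + eps) / (np.sum(y_true) + np.sum(y_prob) + eps)
    return ce + (1 - dice)
```

The Dice term directly rewards overlap with the labeled event intervals, which complements cross-entropy when abnormal time points are a minority of the signal.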

Performance Evaluation
The performance evaluation was based on pooling the inference results from the validation sets of all folds in the cross-validation. Following the clinical significance of AHI, all apnea-hypopnea events were considered as a single abnormal class, and the model prediction for each time point was calculated as 1 minus the predicted probability of the Normal class for both the binary and multiclass models. Performance evaluation was conducted in three ways: per-segment classification, global event detection, and AHI estimation.
First, a receiver operating characteristic (ROC) curve was obtained for per-segment binary classification. Consistent with the ground truth labeling, the predicted probability of each 1-min segment was defined as the minimum threshold that yielded 10 consecutive seconds of above-threshold predictions. From the ROC curve, the area under the ROC curve (AUROC) was computed, along with 95% confidence intervals (CI) using the DeLong method [37]. In addition, the optimal cutoff point yielding the maximum value of the Youden index [38] was obtained from the ROC curve. Based on this cutoff point, the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated.
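The segment-level probability defined above is equivalent (up to strict vs. non-strict comparison) to taking, over all sliding 10-s windows of the per-second predictions, the maximum of the windowed minimum. A short illustrative sketch:

```python
def segment_probability(preds, window=10):
    """preds: per-second probabilities for one 1-min segment.
    Returns the smallest threshold at which 10 consecutive seconds
    remain above threshold, i.e., the maximum over all sliding 10-s
    windows of the minimum prediction within the window."""
    best = 0.0
    for i in range(len(preds) - window + 1):
        best = max(best, min(preds[i:i + window]))
    return best
```

A segment therefore scores high only if the model sustains a high abnormality probability for at least 10 s, mirroring the labeling rule.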
In addition to the per-segment evaluation, the global event detection performance was assessed by computing the abnormal respiratory event detection sensitivity and the number of false-positive detections per participant. The predictions from a patient's full-night PSG were first aggregated by averaging the prediction probabilities of the two adjacent overlapping segments, because the segments were 1 min long with a stride of 30 s. Estimated abnormal respiratory events were defined as consecutive above-threshold predictions lasting at least 10 s and were classified as true positives or false positives at an intersection-over-union (IoU) threshold of 0.5 against the ground truth. The optimal cutoff point yielding the highest F1 measure, defined as the harmonic mean of precision and recall, was used to obtain the results [39]. Based on this cutoff point, the sensitivity, PPV, and false-positive detections per patient were calculated. In addition, since sleep status was not accounted for in the predictions, the estimated abnormal respiratory events were further categorized depending on whether they occurred during sleep or wakefulness. To this end, in-sleep estimated events were defined as events during which the patient was asleep for more than half of the event time.
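The event extraction and IoU-based matching can be sketched as follows (illustrative only; the greedy one-to-one matching order is our assumption):

```python
def extract_events(preds, threshold, min_len=10):
    """Runs of >= 10 consecutive above-threshold seconds become events,
    returned as (start, end) second intervals with end exclusive."""
    events, start = [], None
    for i, p in enumerate(list(preds) + [-1.0]):  # sentinel flushes last run
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i))
            start = None
    return events

def match_events(pred_events, true_events, iou_thresh=0.5):
    """Greedy one-to-one matching at IoU >= 0.5; returns (TP, FP, FN)."""
    matched, tp = set(), 0
    for pe in pred_events:
        for j, te in enumerate(true_events):
            if j in matched:
                continue
            inter = max(0, min(pe[1], te[1]) - max(pe[0], te[0]))
            union = (pe[1] - pe[0]) + (te[1] - te[0]) - inter
            if union > 0 and inter / union >= iou_thresh:
                matched.add(j)
                tp += 1
                break
    return tp, len(pred_events) - tp, len(true_events) - tp
```

Sensitivity and PPV then follow as TP / (TP + FN) and TP / (TP + FP), respectively.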
Moreover, the estimated AHI was calculated as the number of estimated abnormal respiratory events divided by the total sleep time (hours), and OSA severity was estimated in line with the standard PSG scoring system [32]. Similarly, the corrected estimated AHI and OSA severity were calculated based on the number of in-sleep estimated abnormal respiratory events. The estimated AHIs were compared with the ground truths using the Pearson correlation coefficient (r), intraclass correlation coefficient (ICC), and Bland-Altman analysis. The estimated OSA severities were compared with the ground truths using linearly weighted kappa statistics (κ). The AHI estimation results were based on the same binarization thresholds used for the event detection evaluation.

Study Participants
Initially, 55 participants were enrolled in the study from July 2021 to April 2022. Among them, 11 were excluded from the study for the following reasons: withdrawal of consent (n = 9) and accidental failure to execute the radar data collection program (n = 2). Therefore, 44 participants underwent PSG integration using the FMCW radar sensor. Among them, there were nine normal participants, and the numbers of patients with mild, moderate, and severe OSA were 7, 15, and 13, respectively. The baseline demographic and sleep characteristics of the study population are presented in Table 1.

Per-Segment Classification Performance
Figure 5 shows the detailed per-segment classification performance. Across the OSA severity groups, the AUROC, sensitivity, and specificity ranged from 0.796 to 0.859, 62.5% to 80.9%, and 76.5% to 86.8%, respectively. The AUROC in patients with severe OSA was higher than that in the remaining patients for both the binary (0.859 vs. 0.814, p < 0.001) and multiclass (0.857 vs. 0.809, p < 0.001) models.

Global Event Detection Performance
The sensitivities of the binary and multiclass models were 63.3% (95% CI [62.1%, 64.5%]) and 62.2% (95% CI [61.0%, 63.4%]), respectively, for the overall study population, with a range of 49.0–67.6% across OSA severity. The number of false-positive detections per participant ranged from 23.4 to 52.8, and it was 17.2–31.0 when considering only estimated in-sleep events.
Moreover, per-class sensitivities in the overall study population ranged from 53.9% to 87.0%, with lower values for hypopnea than for apnea. The sensitivity for hypopnea across OSA severity was 47.6–58.1%, whereas the sensitivity for obstructive and central apnea showed a wide range of 0.0–100.0% due to the small number of apnea cases in non-OSA and mild OSA participants. In patients with moderate or severe OSA, the sensitivities for obstructive and central apnea were 86.1–92.0% and 78.9–88.0%, respectively. The detailed event detection performance is presented in Table 3. Figure 6 illustrates representative cases of different respiratory events with the model predictions.

AHI Estimation Performance
The estimated AHI (Figure 7) showed strong correlations (r = 0.805–0.829) and good agreement (ICC = 0.776–0.812) with the ground truth, and the corrected estimated AHI (Figure 8) showed further improved correlations (r = 0.937–0.949) and agreement (ICC = 0.916–0.929). The estimated OSA severities showed substantial agreement with the ground truth for both the binary (κ = 0.715) and multiclass (κ = 0.648) models. The OSA severities estimated from the corrected estimated AHI showed slightly higher agreement with the ground truth for both the binary (κ = 0.736) and multiclass (κ = 0.699) models (Figure 9).

Discussion
With the increasing interest in non-contact sensors for PSG, radar is a promising technology. Radar for sleep assessment primarily targets respiratory effort, usually measured by thoracoabdominal belts in conventional PSG. However, the literature on the use of radar in OSA remains scarce. Therefore, this preliminary study investigated the feasibility of the automated detection of apnea-hypopnea events for OSA diagnosis based on FMCW radar using deep learning.
The diagnostic performance was the highest for patients with severe OSA in terms of AUROC for per-segment classification and sensitivity for global event detection, consistent with previous studies using thoracoabdominal belts or radar [40][41][42][43]. This may be attributed to the higher proportion of apnea in patients with more severe OSA. While the sensitivities for hypopnea were not significantly different among the OSA severity groups, the proposed models showed higher sensitivities for apnea than hypopnea. In addition, our dataset included more patients with moderate or severe OSA than those without or with mild OSA. Therefore, because a much higher number of events were included in patients with more severe OSA, these events would have contributed more to the model training.
The proposed models showed limited sensitivities for hypopnea, unlike apnea, with a range of 47.6–58.1%, even though hypopnea was the major event type (71.0%, 4427/6239). This is expected and concordant with previous studies [40][41][42][43], since hypopnea is an incomplete form of respiratory abnormality between normal respiration and complete obstruction of the upper airway (obstructive apnea). The diagnostic criteria for hypopnea involve a more subtle change in airflow than apnea and require additional information on oxygen desaturation or arousal [32], which was not available in the single-channel input of our study. Furthermore, the lower sensitivities for hypopnea can be linked to greater vulnerability to interference from non-pathologic respiratory signal changes caused by changing body positions, limb movements, or sensor noise.
Most studies on OSA exclude data from waking hours, even when the sensor (e.g., thoracoabdominal belt) cannot directly determine the sleep status [40][41][42][43]. However, we included the entire PSG recording to simulate sleep monitoring environments where only radar was used and where it was incorporated with PSG. In the former, the estimated AHI showed strong correlations (r = 0.805–0.829) and good agreement with the ground truth (ICC = 0.776–0.812), and there was substantial agreement between the estimated and ground truth OSA severity (κ = 0.648–0.715). These results demonstrated the potential of the standalone use of radar for OSA screening. When considering only in-sleep estimated events, there were improvements in the estimation of AHI (r = 0.937–0.949 and ICC = 0.916–0.929) and OSA severity (κ = 0.699–0.736), primarily due to the reduction in false-positive detections during wakefulness. This is clearly shown in the case with poor sleep efficiency (Figure 10), which is an outlier in Figure 7.
We employed a detection model based on a CRNN architecture instead of the classification models used in most existing deep learning methods for OSA diagnosis [44]. The classification model has certain disadvantages. First, it uses weaker labels than a detection model. Previous works converted each segment of approximately 30 s to 1 min into a single class by thresholding the length of the in-segment event annotations [40,41]. However, the events manually annotated in PSG are strong labels, usually with a high temporal resolution of approximately 1 s, which a detection model can exploit without losing information, unlike a classification model. Second, a detection model provides more intuitive localization results for event detection than a classification model (Figure 6). The classification model requires postprocessing with a narrow sliding window to translate predictions into time-sequenced events (e.g., valid events if at least six consecutive segments are predicted to be abnormal) [43,45]. However, such postprocessing relies on a smoothing parameter that requires additional tuning after training the model. We experimented with binary and multiclass models to investigate the potential benefits of multiclass labels in detecting abnormal events. However, they showed similar results, which may be attributed to several factors. First, there was a class imbalance in respiratory events. Hypopnea accounted for the majority of the events (71.0%, 4427/6239), followed by obstructive apnea (26.6%, 1660/6239), with central apnea being the fewest (2.4%, 152/6239). Second, because a multiclass task is more complicated than a binary task, multiclass labels may not necessarily help the model learn to distinguish between abnormal and normal respiration.
The multiclass model may learn the distributions of normal respiration in the early training phase at a level similar to that of the binary model, with the rest of the training focusing on distinguishing different classes of events.
This study had several limitations that should be addressed in future work. First, the number of study participants, especially those with less than moderate OSA, is insufficient.
In addition, future studies should investigate more technical modifications of radar setups, such as signal processing or adding an under-the-mattress radar, and model development, including hyperparameter optimization and the comparison of different deep learning architectures. Specifically, advanced deep learning methods such as the Transformer [46] and deep domain adaptation [47] are potential alternatives. Moreover, regarding PSG, sleep stages (e.g., REM and non-REM), body position, and limb movements could be considered for model development and analysis. In particular, since sleep stages and body positions affect apnea-hypopnea frequency [48], future work could include the analysis of the effects of different sleep stages and body positions on diagnostic performances. In addition, a promising direction for future research is using an ensemble of the current model and models performing radar-based predictions of sleep stages and/or body movements.
In conclusion, we demonstrated that the automated detection of apnea-hypopnea events based on FMCW radar is feasible using a CRNN-based deep learning model. This preliminary study, which involved an OSA cohort still under recruitment and an in-development radar, showed the possibility of using the FMCW radar as a standalone screening tool for OSA. Integration with other non-contact sensors for sleep signals, such as oxygen desaturation sensors, is warranted to develop improved non-contact OSA diagnosis.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s22197177/s1. Figure S1: The signal processing flows in the FMCW radar to detect (a) distance and (b) velocity of the target.
Informed Consent Statement: Written informed consent was obtained from all subjects involved in the study.