Detection of Sleep Apnea from Electrocardiogram and Pulse Oximetry Signals Using Random Forest

Sleep apnea (SA) is a common sleep disorder which could impair the human physiological system. Therefore, early diagnosis of SA is of great interest. The traditional method of diagnosing SA is an overnight polysomnography (PSG) evaluation. When PSG has limited availability, automatic SA screening with fewer signals should be considered. The primary purpose of this study is to develop and evaluate a SA detection model based on electrocardiogram (ECG) and blood oxygen saturation (SpO2). We adopted a multimodal approach to fuse ECG and SpO2 signals at the feature level. Then, feature selection was conducted using the recursive feature elimination with cross-validation (RFECV) algorithm, and a random forest (RF) classifier was used to discriminate between apnea and normal events. Experiments were conducted on the Apnea-ECG database. The introduced algorithm obtained an accuracy of 97.5%, a sensitivity of 95.9%, a specificity of 98.4% and an AUC of 0.992 in per-segment classification, and outperformed previous works. The results showed that ECG and SpO2 are complementary in detecting SA, and that the combination of ECG and SpO2 enhances the ability to diagnose SA. Therefore, the proposed method has the potential to be an alternative to conventional detection methods.


Introduction
Sleep apnea (SA) is a common sleep disorder, also commonly known as obstructive sleep apnea (OSA) [1]. OSA occurs due to the abnormal function of the upper respiratory tract. When the hard palate muscles at the back of the throat that support the soft palate relax, the soft palate blocks the passage of air into the respiratory system. The clinical manifestation of SA is a cessation of nasal airflow or a decrease in airflow intensity by more than 30% compared to the base level, but the corresponding breathing movements are maintained [2]. At the same time, oxygen saturation decreases by more than 4% for more than 10 s. The prevalence of OSA in adults ranges from 9% to 38% and increases with age [3]. Low quality sleep accompanied by apnea usually leads directly to poor concentration, memory loss, slow response, and depression [4]. In addition, OSA is a potential threat to many physiological systems of the human body, especially the cardiovascular system. It can induce hypertension, heart failure, coronary artery disease, diabetes, and other diseases, which seriously threaten the health of patients [5]. If patients are identified and then treated at an early stage of OSA, the health risks can be reduced. Therefore, timely diagnosis of patients with OSA is essential.
Clinically, polysomnography (PSG) is the reference standard for the diagnosis of SA. PSG is effective in monitoring sleep conditions by collecting various physiological signals such as electrocardiogram (ECG), electroencephalogram (EEG), electromyogram (EMG), blood oxygen saturation (SpO2), airflow signals, respiratory effort, etc. [6]. However, wearing too many sensors during physiological signal collection can cause discomfort to the patient. In addition, the diagnosis of OSA requires sleep specialists to spend a lot of time manually analyzing PSG data [7]. Therefore, automatic detection of SA using fewer signals is necessary.
Researchers have typically developed SA detection algorithms using ECG signals. ECG is a non-invasive technique for recording the electrical activity of the heart, whose physiological activity is regulated by the autonomic nervous system (ANS). Studies have shown that hypoxia caused by SA can lead to dysregulation of the ANS. Clinically, heart rate variability (HRV) is an important indicator of ANS regulation [8]. Therefore, it is feasible to screen for apnea by monitoring ECG during sleep [9]. However, ECG signals are easily influenced by cardiovascular disease status, which makes the diagnosis of SA more challenging. Apart from ECG signals, SpO2 signals are also widely used to detect SA, as the lack of airflow during SA events can lead to a decrease in SpO2. Repetitive oxygen desaturation is highly specific for apnea. However, the sensitivity of oximetry is usually low, as not all apnea events lead to discernible desaturations [7]. Thus, SpO2 alone or ECG alone can serve as a potential means of diagnosing SA, but not as a reliable one.
With technological advances in sensors and low-power embedded systems, the collection of physiological signals has become easier and more economical [10]. Therefore, we consider using multiple signals to develop a more reliable detection algorithm of SA, rather than being limited to a single signal.
This study explores the efficiency and reliability of a multimodal approach to the automated detection of SA events using a combined channel of ECG and SpO2. To this end, we extracted features from ECG signal and SpO2 signal separately, and then fused the features of the two different modalities. Feature selection was performed using the recursive feature elimination with cross-validation (RFECV) algorithm. Then, the selected features were fed to the RF classifier to identify sleep apnea events.
Our study provides three main contributions to research. First, we verify the complementarity of ECG and SpO2 signals to automatically detect SA. When the two signals are combined, the diagnostic ability is increased. Second, the RFECV algorithm is employed to select the most important features. The proposed SA detection technique uses a smaller number of features and is computationally inexpensive compared to most of the existing methods. Third, we enrich the method in the field of the automated detection of SA by applying a multimodal approach to fuse ECG and SpO2 signals at the feature level. So far, most of the extant literature primarily used SpO2 alone or ECG alone, but did not consider the combination of ECG and SpO2.
The rest of this paper is organized as follows. The related works of SA detection are explored in Section 2. The explanation of the dataset, preprocessing steps, and the introduced SA detection technique is presented in Section 3. The Results and Discussions are presented in Sections 4 and 5, respectively. Finally, Section 6 concludes the paper.

Related Works
Over past studies, various physiological signals (e.g., ECG, EEG, SpO2, snoring or airflow) have been used to develop SA detection algorithms [11], the most widely used of which are ECG signal and SpO2 signal.
For ECG signal-based methods, the shallow characteristic signals of the ECG are usually analyzed in the time domain, frequency domain or nonlinear domain. The time intervals between successive heartbeats are sequentially combined to form the RR interval signal [12]. HRV analysis refers to the analysis of changes in the RR interval signal. Nakayama et al. [9] proposed a method for detecting sleep apnea based on HRV analysis. Their method was successfully applied to clinical PSG data and the performance was comparable to portable monitoring devices in sleep laboratories. ECG-derived respiratory (EDR) signals reflecting respiratory activity can be used as complementary information to HRV [13]. Khandoker et al. [14] analyzed the EDR signal and RR interval with wavelet transform and used SVM classifier to identify OSA patients. In their work, more than 90% of subjects in the test set were correctly classified. Further, Bsoul et al. [15] extracted a complete feature set containing 111 features from RR and EDR time series using time-frequency analysis methods. Sharma et al. [16] developed a SA detection model using Hermite basis functions. Sharma mainly considered the morphological changes occurring in the QRS wave complex of the ECG.
The occurrence of apnea is usually accompanied by a decrease in oxygen saturation, hence the SpO2 signal has been used in several studies. Some of these studies employed statistical methods to quantify the variation in oxygen saturation over time. For example, Ulysses et al. [17] used the time spent below a certain saturation level (TSA), the saturation variability index and other indicators to evaluate AHI, and compared the diagnostic performance of SA under different metrics. The oxygen desaturation index (ODI) is defined as the number of oxyhemoglobin desaturations below a certain threshold [18]. Ling et al. [19] found that the use of ODI improved the accuracy of moderate and severe OSA detection. However, the ODI is more suitable for prolonged SpO2 signals. In addition, some studies have explored nonlinear parameters. Alvarez et al. [20] used the central tendency measure (CTM) and Lempel-Ziv (LZ) complexity to identify OSA and showed that the sensitivities obtained using CTM and LZ complexity were 90.1% and 86.5%, respectively. Hornero et al. [21] performed a time series analysis of the SpO2 signal using approximate entropy and obtained a sensitivity of 82.09% and a specificity of 86.96% on the training set.
To conclude our brief review of SA detection algorithms, we have found that screening for SA using either ECG or SpO2 signals is effective, but the majority of the previous studies focused only on a single data modality. However, several machine learning tasks in other fields (e.g., medical image analysis, sentiment recognition, etc.) have demonstrated that fusing information from multiple data modalities can enhance the robustness of a model [22]. Therefore, our proposed multimodal approach for the detection of SA is more advanced.

Proposed Framework
This section is composed of six subsections. First, the Apnea-ECG dataset and the preprocessing step are described. In this step, the number of signals used, sampling frequency, denoising method, data segmentation, and the derivation of the RR interval and R-wave amplitude (RAMP) signals from the ECG segments are explained. Afterward, linear and nonlinear analysis methods are applied to extract features and fuse three different feature sets using an early fusion strategy. Then, the optimal features are selected from the fused feature vector. Finally, these features are used as input to the four different types of classifiers for discriminating normal and apnea events. The flow diagram of the proposed technique is illustrated in Figure 1.

Dataset
In this study, we used the Apnea-ECG database provided by Dr. Thomas Penzel of Philipps University. The data set consists of 70 records, divided into a learning set of 35 records and a test set of 35 records. The records range from 7 to 10 h in length and all contain ECG signals. Eight of the records (a01~a04, b01, c01~c03) contain four additional signals (Resp C and Resp A, the chest and abdominal respiratory effort signals; Resp N, nasal airflow; SpO2). All signals were digitized at 100 Hz with 16-bit resolution. Each record was labelled minute by minute by a sleep specialist as normal (N) or apnea (A) [23]. Examples of 1-min apnea and normal segments are shown in Figure 2. To satisfy the data requirements of this study, these eight records (a01~a04, b01, c01~c03), which contain both ECG and SpO2 signals, were selected from the above data set as experimental data.

Preprocessing
To remove noise in the ECG signal such as baseline drift and power frequency interference, we used an FIR bandpass filter with a passband of 3~50 Hz to denoise the original ECG signal [2]. Then, the entire ECG signal was segmented into 1-min segments according to the annotations in the database. For each 1-min ECG segment, we used the Hamilton algorithm to locate the R peaks and corrected the position of each R peak to the maximum value, so as to ensure the accuracy of R-peak detection. The RR interval signal was obtained from the intervals between successive R peaks, and RR interval outliers were removed following the method of [24]. The RAMP signal was obtained from the amplitudes of the R waves. In particular, one of the simplest approaches to obtain an EDR (ECG-derived respiration) signal is to interpolate the RAMP signal [14], so the RAMP signal is also called the EDR signal.
SpO2 and ECG recordings were collected simultaneously. Similarly, the entire SpO2 signal was split into 1-min segments, and physiologically implausible segments (SpO2 values less than 50%) were removed [25]. Then, the RR interval signal, RAMP signal and SpO2 signal were used for subsequent feature extraction.
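The segmentation and screening steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes R-peak positions are already available as sample indices (the Hamilton detector itself is not reproduced), and the function names are hypothetical.

```python
FS = 100  # sampling rate in Hz, as stated for the Apnea-ECG recordings


def rr_intervals(r_peak_samples, fs=FS):
    """RR intervals in seconds from sorted R-peak sample indices."""
    return [(b - a) / fs for a, b in zip(r_peak_samples, r_peak_samples[1:])]


def split_into_minutes(signal, fs=FS):
    """Cut a signal into consecutive, complete 1-min segments."""
    n = 60 * fs
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def keep_valid_spo2(segments, floor=50.0):
    """Drop SpO2 segments that contain values below the 50% floor."""
    return [seg for seg in segments if min(seg) >= floor]
```

A trailing partial minute is discarded here; how the original study handled it is not stated.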

Feature Extraction and Fusion
In this study, linear (time domain and frequency domain) analysis and nonlinear analysis methods were used to extract features. We obtained three sets of features from ECG and SpO2 signals, which were RR intervals features, R-wave amplitudes features, and SpO2 features. The details of these features and fusion strategy are described below.

RR Intervals Features
Linear analysis of HRV is widely used in clinical studies due to its theoretical maturity. We calculated RRmean, RMSSD, SDNN, NN50, pNN50, and HR from the time domain, while VLF1, LF1, HF1, LF/HF1, LFnorm1, and HFnorm1 were extracted from the frequency domain. The detailed descriptions of these 12 features are shown in Table 1. In the process of frequency domain analysis, by following [26], we applied cubic spline interpolation to resample the RR interval signal to 4 Hz. Then, the power spectral density (PSD) was estimated using the FFT-Welch (s, n = 256) method.
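As an illustration of the six time-domain features, the sketch below uses their conventional HRV definitions. Table 1 is not reproduced here, so the exact definitions (for example, the denominator of pNN50 and the use of the sample standard deviation for SDNN) are assumptions, and the function name is hypothetical.

```python
import math
import statistics


def hrv_time_features(rr):
    """Time-domain HRV features from RR intervals given in seconds."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    rr_mean = statistics.fmean(rr)
    # NN50: successive differences larger than 50 ms
    nn50 = sum(1 for d in diffs if abs(d) > 0.050)
    return {
        "RRmean": rr_mean,
        "SDNN": statistics.stdev(rr),
        "RMSSD": math.sqrt(statistics.fmean([d * d for d in diffs])),
        "NN50": nn50,
        "pNN50": 100.0 * nn50 / len(diffs),  # percentage of differences
        "HR": 60.0 / rr_mean,  # mean heart rate in beats per minute
    }
```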

R-Wave Amplitudes Features
It has been shown that the PSD of the RAMP signal has similar characteristics to that of the RR intervals and can serve as complementary information to HRV [15]. Therefore, we also extracted the above six frequency domain features (VLF2, LF2, HF2, LF/HF2, LFnorm2, HFnorm2) from the RAMP signal using the frequency domain analysis method of HRV. The detailed descriptions of these six features are shown in Table 1.

SpO2 Features
Six features were calculated from the SpO2 signal. These features are listed in Table 2. Based on statistical methods, Smin, Smean, and Svar were calculated from SpO2 segments. Three commonly used nonlinear features (ApEn, CTM, and LZC) were also added to the SpO2 feature set. Specifically, ApEn and LZC are suitable for small sample data and can reflect the complexity and degree of chaos of the signal [27]. The optimal parameters for calculating ApEn were a tolerance of 0.25 and an embedding dimension of 2, while LZC is a nonparametric measurement. In addition, CTM is computed from the second-order difference plot as the ratio of the number of points falling within a circle of radius R centered at the origin to the total number of points [20].
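The three nonlinear features can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the tolerance r is treated as an absolute value (whether the study scaled it by the standard deviation is not stated), LZC is returned as a raw LZ76 phrase count after binarising around the median, and all function names are hypothetical.

```python
import math
import statistics


def approximate_entropy(x, m=2, r=0.25):
    """ApEn with embedding dimension m and absolute tolerance r."""
    n = len(x)

    def phi(dim):
        templates = [x[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for a in templates:
            # Chebyshev distance; the self-match keeps the count >= 1
            matches = sum(
                1 for b in templates
                if max(abs(p - q) for p, q in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)


def central_tendency_measure(x, radius):
    """Fraction of second-order difference points within `radius` of the origin."""
    d = [b - a for a, b in zip(x, x[1:])]
    points = list(zip(d, d[1:]))
    inside = sum(1 for u, v in points if math.hypot(u, v) < radius)
    return inside / len(points)


def lempel_ziv_complexity(x):
    """Number of LZ76 phrases after binarising the signal around its median."""
    med = statistics.median(x)
    s = "".join("1" if v > med else "0" for v in x)
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        # grow the phrase while it can be copied from the text seen so far
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases
```

The ApEn loop is quadratic in the segment length, which is acceptable for a sketch but slow for full 1-min segments at 100 Hz.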
After feature extraction, in order to eliminate the distribution differences between various types of features and speed up the convergence of the model, we normalized the features with the following equation:
$$x^{*} = \frac{x - \bar{x}}{\sigma} \qquad (1)$$
where $x$ is the unnormalized feature, $\bar{x}$ represents the mean of the feature, $\sigma$ is the standard deviation of the feature, and $x^{*}$ is the normalized feature.
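A minimal sketch of this z-score normalization for one feature column is given below. It assumes the population standard deviation (the text does not say which estimator was used), and the function name is hypothetical.

```python
import statistics


def zscore(column):
    """Normalize one feature column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # population SD; an assumption
    if sd == 0:
        # a constant feature carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(v - mean) / sd for v in column]
```

In practice the mean and standard deviation would be estimated on the training set only and then reused for the test set.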

Feature Fusion
In the field of machine learning, multimodal fusion is a technique that integrates information from multiple modalities, and includes early, late, and hybrid fusion. Among them, early fusion, also known as feature-based multimodal fusion, refers to the concatenation of features from different modalities before model training [28].
In this study, the ECG and SpO2 signals collected by different sensors can be considered as two modalities. In order to combine the information from different modalities, we fused the above three feature sets using an early fusion strategy with the following steps: let In be the feature vector of RR intervals, let Rn be the feature vector of RAMP, and let Sn be the feature vector of SpO2; then, the concatenation of these three representations In, Rn, and Sn produced a feature vector of which the dimension is 24.
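The early fusion step amounts to a simple concatenation, sketched below for a single 1-min sample. The function name and the size check are illustrative additions; the sizes 12, 6, and 6 follow the feature counts given in the text.

```python
def early_fusion(rr_features, ramp_features, spo2_features):
    """Concatenate the three per-segment feature vectors into one vector."""
    fused = list(rr_features) + list(ramp_features) + list(spo2_features)
    if len(fused) != 24:
        # 12 RR-interval + 6 RAMP + 6 SpO2 features are expected
        raise ValueError(f"expected 24 fused features, got {len(fused)}")
    return fused
```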

Feature Selection
In machine learning tasks, it is important to eliminate irrelevant or redundant features to improve the accuracy and reduce the complexity of the model. Therefore, we chose the RFECV algorithm to search for the optimal feature subset [29], with the RF classifier as the estimator. The procedure of the RFECV method is illustrated in Figure 3. First, an RF classifier is trained on the current feature set. Then, the importance of each feature is calculated and the classification accuracy of that feature set is obtained by cross-validation. Lastly, the least important features are removed from the current feature set and the RF classifier is retrained using the updated feature set. This process is repeated until the feature set is empty, and the subset with the highest cross-validation accuracy is retained.
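The elimination loop can be sketched independently of any particular estimator. In the sketch below, `importance_fn` and `cv_score_fn` are hypothetical callables standing in for "fit an RF and read its feature importances" and "mean cross-validated accuracy"; this is a simplified skeleton of the procedure described above, not the authors' code. In scikit-learn the equivalent functionality is provided by `sklearn.feature_selection.RFECV`, which performs the elimination inside each fold.

```python
def rfecv(features, importance_fn, cv_score_fn, step=1):
    """Recursive feature elimination scored by cross-validation.

    features      -- list of feature names
    importance_fn -- subset -> {name: importance} from a fitted estimator
    cv_score_fn   -- subset -> mean cross-validated accuracy
    Returns (best_subset, best_score, history).
    """
    current = list(features)
    best_subset, best_score = list(current), float("-inf")
    history = []
    while current:
        score = cv_score_fn(current)
        history.append((len(current), score))
        if score >= best_score:  # ties favour the smaller subset
            best_subset, best_score = list(current), score
        importances = importance_fn(current)
        # drop the `step` least important features and iterate
        for name in sorted(current, key=importances.get)[:step]:
            current.remove(name)
    return best_subset, best_score, history
```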
Finally, the p-value of each selected feature was calculated using the Kruskal-Wallis one-way ANOVA (KW-ANOVA) test. KW-ANOVA is a non-parametric test for assessing whether two or more independent groups of data differ, without assuming any particular data distribution [30].
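For the two-class case considered here (normal versus apnea), the test statistic and p-value can be sketched with the standard library alone. The function name is hypothetical; the sketch uses average ranks with the usual tie correction, and the closed form for the chi-square tail with one degree of freedom.

```python
import math


def kruskal_wallis_two_groups(a, b):
    """Tie-corrected H statistic and p-value for two independent groups."""
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sums = [0.0, 0.0]
    for r, (_, g) in zip(ranks, pooled):
        rank_sums[g] += r
    h = 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(grp) for rs, grp in zip(rank_sums, (a, b))
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)
    # chi-square survival function with 1 degree of freedom
    p = math.erfc(math.sqrt(h / 2.0))
    return h, p
```

Calling it once per selected feature, with the feature values of normal and apnea segments as the two groups, yields one p-value per feature.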

Classifier
The appropriate classifier can lead to better diagnostic performance. Therefore, four different types of classifiers were pre-selected for experimentation in order to select the most suitable classifier for this study. Random forest (RF) belongs to ensemble learning, k-nearest neighbor (KNN) is representative of lazy learning, logistic regression (LR) is a regression model that enables classification, and the support vector machine (SVM) is a functional model. A brief description of these four classifiers is presented below.

Random Forest
RF is an ensemble learning model consisting of a set of decision tree classifiers $\{f_k(x, \theta_k) \mid k = 1, 2, \cdots, n\}$ [31]. It is built by drawing a training set $\theta_k$ from the original sample set $\theta$ by random sampling with replacement (the bootstrap method), and then using the sampled training set $\theta_k$ to train the decision tree $f_k(x, \theta_k)$. When a new sample $x$ is input to the random forest, every decision tree $f_k(x)$ classifies the new sample separately, and the final classification result is determined by voting:
$$Y = F(x) = \arg\max_{y} \sum_{k=1}^{n} I\big(f_k(x) = y\big) \qquad (2)$$
where $Y$ is the final result of the classification, $F(x)$ is the classification model, $f_k(x)$ is a single decision tree classifier, $y$ is the result of a single decision tree classification, and $I(\cdot)$ is the characteristic function. RF has the advantages of high prediction accuracy, fast training speed, and strong resistance to noise and outliers; because its training sets are generated by random sampling, it also reduces overfitting and improves generalization ability.
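The two ingredients described above, bootstrap sampling and majority voting, can be sketched as follows. Tree training is omitted: each tree is represented by any callable that maps a feature vector to a label, and all names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (one tree's training set)."""
    return [rng.choice(samples) for _ in samples]


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, x):
    """Classify x with every tree and combine the results by voting."""
    return majority_vote([tree(x) for tree in trees])
```

With an even number of trees a tie is possible; this sketch then returns the label that was encountered first.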

K-Nearest Neighbor
KNN is a popular supervised learning algorithm. KNN is implemented by finding the k closest training samples in the training set based on a certain distance measure, and then predicting based on the information of these k samples (where k is a positive integer). Usually, a voting method is used in classification tasks, where the most frequent category marker among these k samples is selected as the prediction result.
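A minimal sketch of this voting rule is shown below, assuming Euclidean distance as the distance measure (the text does not specify which measure was used) and a hypothetical function name.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k nearest training samples."""
    neighbours = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```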

Support Vector Machine
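SVM is a classification algorithm. In binary classification tasks, SVM constructs a separating hyperplane between two classes of samples ($y_i \in \{-1, 1\}$), where $x_i$ are the support vectors, $\{x_i, y_i\}$ is the training data, and $i = 1, 2, \cdots, n$ with $x_i \in \mathbb{R}^n$. If $x$ is the new feature vector, the result given by SVM is:
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right) \qquad (3)$$
where $b$ is the threshold and $\alpha_i$ are the Lagrangian coefficients, which are calculated by solving the dual Lagrangian form, i.e., minimizing:
$$W(\alpha) = -\sum_{i=1}^{n} \alpha_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \quad \text{subject to } \sum_{i=1}^{n} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C \qquad (4)$$
where $C$ is the regularization parameter that determines the trade-off between the maximum margin and the minimum classification error, and $K(\cdot)$ is the kernel function.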
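Once the coefficients are known, evaluating the SVM on a new feature vector is a weighted kernel sum. The sketch below assumes an RBF kernel with an arbitrary gamma (the kernel used in the study is not stated) and takes already-trained coefficients as input; solving the dual problem is not shown, and all names are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Sign of the kernel expansion over the support vectors."""
    s = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b
    return 1 if s >= 0 else -1
```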

Logistic Regression
LR allows estimation of the posterior probability of the occurrence of a certain event. In a binary problem, the dependent variable takes a positive or a negative class, while the predictors are the input features. Therefore, LR allows us to estimate the posterior probability of the output without making any a priori assumption about the statistical nature of the data. The expression of LR is as follows [32]:
$$f(x) = \frac{1}{1 + \exp\left(-\left(a_0 + \sum_{i=1}^{k} a_i x_i\right)\right)} \qquad (5)$$
where $f(x)$ is the posterior probability of the output, $a_0$ is the intercept (bias) term, $a_i$ $(i = 1, \cdots, k)$ is the coefficient of the $i$-th input feature, and $k$ is the number of input features. LR estimates $a_0$ and $a_i$ by the maximum likelihood optimization method.
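Evaluating the logistic model for given coefficients can be sketched as below. The function name is hypothetical and the coefficients are assumed to have been estimated already; the two-branch form only avoids floating-point overflow for large negative inputs.

```python
import math


def logistic_probability(x, a0, a):
    """Posterior probability of the positive class for feature vector x."""
    z = a0 + sum(ai * xi for ai, xi in zip(a, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)
```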

Performance Evaluation
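In this study, accuracy, sensitivity, and specificity, as defined in Equations (6)-(8), were used to evaluate the proposed model [33]. Here, accuracy is the proportion of SA segments and normal segments that were correctly identified among all of the samples, sensitivity is the proportion of correctly identified SA segments among all SA segments, and specificity is the proportion of correctly identified normal segments among all normal segments. In addition, the area under the receiver operating characteristic curve (AUC) was also used as an evaluation index of the model:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (7)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (8)$$
where TP and TN are the numbers of correctly classified apnea and normal segments, respectively, FP is the number of normal segments classified as apnea, and FN is the number of apnea segments classified as normal.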
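The three per-segment metrics follow directly from the confusion counts, as the short sketch below shows. The label "A" for apnea follows the database annotation; the function name is hypothetical.

```python
def segment_metrics(y_true, y_pred, positive="A"):
    """Accuracy, sensitivity and specificity for per-segment labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```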

Results
After the preprocessing step, the data set consisted of 3903 1-min samples, of which 2308 were normal samples and 1595 were sleep apnea samples. Three sets of features extracted from each sample were fused and, after feature selection, fed into a classifier for sleep apnea detection. During the experiment, the data set was divided into a training set (80%) and a test set (20%) by stratified sampling. On the training set, five-fold cross-validation was used to select the optimal features, optimize the classifier parameters, and train the model. Accuracy, sensitivity, specificity, and AUC were used on the test set to evaluate the model performance.
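A stratified 80/20 split keeps the normal-to-apnea ratio the same in both parts. The sketch below is one way to do it with the standard library; the function name and the seed are illustrative, and the study's exact split is not reproduced.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Applied to 2308 normal and 1595 apnea labels, this yields 3122 training and 781 test samples.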
The experiments were run on the Windows 10 operating system, the algorithm was developed in Python 3.6, and the hardware configuration was a Xeon E5-2640v4 CPU, an Nvidia GeForce RTX2080Ti GPU, and 32 GB of RAM.

Feature Selection and KW-ANOVA Test
As mentioned before, the optimal subset of features was selected using the RFECV algorithm and the features were statistically analyzed by the KW-ANOVA test. The optimal subset of features reduces the complexity of the model while maintaining the classification accuracy. The relationship between the number of selected features and the classification accuracy is illustrated in Figure 4. In Figure 4, the cross-validation score fluctuates as the number of features decreases, which is caused by the change in the data distribution during the five-fold cross-validation process. From Figure 4, the highest accuracy is obtained by selecting 13 features. The selected features are as follows: RMSSD, pNN50, HR, VLF1, HF1, LFnorm1, and HFnorm1 in the ECG feature set; Smin, Smean, Svar, ApEn, LZC, and CTM in the SpO2 feature set. The number corresponding to each feature is presented in Table 3. The results of the KW-ANOVA test show that for all 13 selected features, p << 0.01, which means that the selected features differ significantly between the normal and SA classes. Furthermore, Figure 5 shows the box plots of the selected features, which confirm that these features differ significantly between the two classes.

Using Combined ECG and SpO2 Feature Set
The classification results for per-minute segments are shown in Table 4. According to Table 4, the proposed method provided an accuracy of 97.5%, a sensitivity of 95.9%, a specificity of 98.4%, and an AUC of 0.992 using the RF classifier. In addition, we compared the RF classifier with other classical classifiers (SVM, KNN, and LR). Although these classifiers also achieved satisfactory results, the RF classifier had the highest accuracy of 97.5%. Furthermore, the sensitivity, specificity, and AUC of the RF classifier were also higher than those of the other classifiers. The ROC curves of the four classifiers are plotted in Figure 6. Thus, in this study, the RF classifier is more suitable for SA detection than the other machine learning algorithms mentioned above.

Using either ECG or SpO2 Feature Set
To compare the SA detection performance of the two signals, the ECG features and the SpO2 features from the optimal feature set were each used separately for SA detection. Table 4 shows that the RF classifier outperformed the other classifiers, so the experiments in this section were conducted using the RF classifier alone. Table 5 shows the accuracy, sensitivity, specificity, and AUC obtained with either the ECG feature set or the SpO2 feature set.
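A single-modality versus combined comparison of this kind can be sketched as follows. This is a hedged illustration on synthetic data, assuming scikit-learn is available; the feature counts, the class shifts, the number of trees, and the 5-fold cross-validation are assumptions and do not reproduce the experimental protocol or the numbers in Table 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_segments = 400
y = rng.integers(0, 2, size=n_segments)  # 1 = apnea, 0 = normal (synthetic)

# Synthetic stand-ins for per-segment features of each modality.
ecg = rng.normal(size=(n_segments, 3)) + 0.8 * y[:, None]
spo2 = rng.normal(size=(n_segments, 3)) - 1.2 * y[:, None]

feature_sets = {
    "ECG only": ecg,
    "SpO2 only": spo2,
    "ECG + SpO2": np.hstack([ecg, spo2]),  # feature-level fusion
}

mean_accuracy = {}
for name, X in feature_sets.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = float(scores.mean())
    print(f"{name}: mean CV accuracy = {mean_accuracy[name]:.3f}")
```

Training the same classifier on different column subsets keeps the comparison limited to the information content of each signal rather than to differences between models.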

Comparison among Different Signals
Reviewing the results of SA detection using either ECG signals or SpO2 signals alone in Section 4.3, we found that the SpO2 feature set had better accuracy, sensitivity, and specificity than the ECG feature set. Previous reviews of SA detection have also pointed out that SpO2 signals usually perform better than ECG signals [34]. This difference can be attributed to how directly each signal characterizes sleep apnea. When apnea occurs, the decrease in inhaled airflow directly causes significant fluctuations in SpO2. For the ECG signal, in addition to respiratory events, some cardiovascular diseases such as arrhythmias and heart block may also cause changes in HRV [34]. Therefore, detecting apnea from the ECG signal alone is challenging.
Although excellent results were obtained using SpO2 signals alone, there are still some drawbacks. For example, chronic obstructive pulmonary disease or alveolar hypoventilation can also cause a decrease in oxygen saturation [35]. This means that some non-apnea-induced oxygen desaturations masquerade as apnea-induced oxygen desaturations, which can eventually lead to a decrease in the sensitivity of the model. However, comparing Tables 4 and 5 shows that accuracy and sensitivity improved by approximately 1% and 2%, respectively, when using the combined signals rather than the SpO2 signal alone. This suggests that the two signal channels together provide the classifier with a richer representation of SA events. Furthermore, the RFECV algorithm selected six features from the ECG feature set and seven features from the SpO2 feature set. The retained features are therefore non-redundant, which indicates the complementarity between the two signals.
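The feature-level fusion and RFECV selection discussed above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, and the feature counts, the number of trees, the 3-fold cross-validation, and the accuracy scoring are all hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_segments = 300
y = rng.integers(0, 2, size=n_segments)  # 1 = apnea, 0 = normal (synthetic)

# Synthetic per-segment features; one column per modality is made informative.
ecg_features = rng.normal(size=(n_segments, 5))
spo2_features = rng.normal(size=(n_segments, 5))
ecg_features[:, 0] += 1.0 * y
spo2_features[:, 0] -= 1.5 * y

# Feature-level fusion: concatenate both modalities into one feature vector.
X = np.hstack([ecg_features, spo2_features])

# Recursive feature elimination with cross-validation around an RF estimator.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
selector.fit(X, y)

# Count how many retained features come from each modality.
selected = np.flatnonzero(selector.support_)
n_ecg_kept = int(np.sum(selected < ecg_features.shape[1]))
n_spo2_kept = int(np.sum(selected >= ecg_features.shape[1]))
print(f"kept {selector.n_features_} features: "
      f"{n_ecg_kept} ECG, {n_spo2_kept} SpO2")
```

Counting the retained features per modality mirrors the argument in the text: if the selector keeps features from both signals, each signal contributes information the other does not.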
Another advantage of combining ECG and SpO2 is broader applicability. To the best of our knowledge, automatic SA detection algorithms based on single-lead ECG signals are not suitable for cardiac patients, which limits their applicability to some extent. The proposed algorithm, however, uses feature-level fusion. An advantage of multimodal fusion techniques is that multimodal systems can still operate when one of the modalities is missing [28]. In other words, the proposed algorithm retains the ability to diagnose SA when one of the signals is unavailable.

Comparison with Other Related Works
We compared the performance of our method with that of other studies. Table 6 summarizes the results of our work and related work on per-segment SA detection. As shown in Table 6, some studies such as [12,31,36] used ECG signals; among them, [36] used an autoregressive model and a spectral autocorrelation function to extract features from ECG segments and reached an accuracy of up to 93.9%. Among the studies using SpO2 [25,37], the best performance was reported by [25]. According to Table 6, our proposed approach provided higher per-segment classification accuracy than the other studies. Moreover, [38] also used a combination of ECG and SpO2 signals and extracted 39 features. In contrast, our study achieved 97.5% accuracy using only 12 features, which reduces the complexity of the model while improving the accuracy.

Conclusions
In this study, an automated SA detection method was developed to accurately identify sleep apnea events using ECG and SpO2 signals. The best results in terms of accuracy, sensitivity, specificity, and AUC were obtained using the RF classifier after fusing the features of ECG and SpO2. The model takes full advantage of the complementary information of the two signals and outperforms models based on a single signal in terms of diagnostic performance. In addition, experimental results on the Apnea-ECG database showed that our method improves on the performance of previous studies. Although the evaluation results of the model met our expectations, there are still some limitations. The database provided by Dr. Thomas Penzel does not annotate hypoventilation events. Therefore, in future work, we will combine multiple datasets to distinguish apnea from hypoventilation events and further validate the proposed algorithm.