Sleep Apnea Detection Using Wavelet Scattering Transformation and Random Forest Classifier

Obstructive Sleep Apnea (OSA) is a common sleep-breathing disorder that severely reduces quality of life. The most powerful method for detecting and classifying sleep apnea is the polysomnogram (PSG). However, this method is time-consuming and costly. Therefore, several methods focus on using electrocardiogram (ECG) signals to detect sleep apnea. This paper proposes a novel automated approach to detect and classify apneic events from single-lead ECG signals. The Wavelet Scattering Transformation (WST) was applied to the ECG signals to decompose them into scattering coefficients. Then, a set of features, including higher-order statistics and entropy-based features, was extracted from the WST coefficients to form a feature space. The obtained features were fed to a random forest classifier to classify the ECG segments. The experiment was validated using 10-fold and hold-out cross-validation, which resulted in accuracies of 91.65% and 90.35%, respectively. The findings were compared with different classifiers to show the significance of the proposed approach, which achieved better performance measures than most existing methodologies.


Introduction
OSA is a common sleep breathing disorder caused by a decrease in airflow during sleep (hypopnea) or by a complete cessation of airflow (apnea) due to an upper-airway obstruction [1]. Polysomnography (PSG) is a widely used diagnostic tool for studying sleep disorders. The test is conducted in a sleep clinic while the patient sleeps with many electrodes placed on various body parts to record multiple physiological activities. This procedure involves connecting the patient to the PSG equipment and collecting biological signals overnight. A physician evaluates and diagnoses these signals under the established recommendations of the American Academy of Sleep Medicine (AASM). Unfortunately, this approach causes inconvenience to the patients due to the lengthy testing intervals. In addition, the increased cost of the lab setup is another difficulty of this approach [2].
Various methods have been applied to study OSA, such as snoring sound analysis [3], pulse oximetry [4][5][6], and ECG signals [7]. Since sleep apnea significantly affects heart rate, cardiovascular activity, and other related ECG characteristics, the ECG can capture these variations and quantify the disrupted breathing during apnea events. Moreover, the ECG provides an inexpensive and non-invasive alternative to the PSG. Researchers have used single-lead ECG signals, which also contain relevant information on the heart activity affected by OSA. When an apnea event occurs, a drop in the heart rate is commonly observed, followed by a rise before the end of the event [8]. These apneic events change the frequency and amplitude of the ECG signal for a certain period. Therefore, analyzing ECG signals for the classification and detection of OSA has gained much attention from several researchers [6,9].
A time-domain feature extraction within each R-peak-to-peak (RR) time interval combined with an SVM classifier was developed to distinguish sleep apnea [10]. In this paper, a framework was designed to reduce the original features into a compact group; PCA dimension reduction was also applied to determine the best reduction method. Then, the obtained compact set was fed to the random forest classifier to detect and classify the OSA.

Dataset Description
This paper used the Physionet Apnea-ECG dataset for training, testing, and benchmarking [24]. The ECG signals were collected at 100 Hz with 16-bit resolution using the modified lead V2 electrode configuration during overnight PSG recordings. The records varied in length from 7 to 10 h. According to the American Academy of Sleep Medicine guidelines, an expert affixed minute-by-minute apnea labels to the PSG signals; a minute of data is considered apneic if at least one apnea or hypopnea event occurs during that minute. The dataset was categorized into three main groups, apnea (A), borderline (B), and normal (C), according to the value of the apnea-hypopnea index (AHI). There are 70 ECG recordings in the dataset, divided into training and testing subsets of 35 recordings each. The training subset includes 10,454 segments annotated as normal and 6511 segments annotated as apneic, collected from 35 subjects. Figure 2 shows a random sample from the dataset to illustrate the difference between the healthy and apneic classes.

Preprocessing
The raw ECG signals are divided into 60 s intervals per patient, which proved to be the most effective interval duration [25]. The first step in the preprocessing phase is to apply a finite impulse response bandpass filter with cut-off frequencies from 3 to 45 Hz to remove power line interference. After applying the bandpass filter, the relative weight of each data segment for each patient was computed to identify the significance of each data segment and reduce noisy data. The autocorrelation function (ACF) was calculated for each segment. Then, the cosine similarity was applied between the obtained vectors to measure the similarity between the data segments as follows:

w_st = Σ_i (X_s(i) − X̄_s)(X_t(i) − X̄_t) / ( √(Σ_i (X_s(i) − X̄_s)²) √(Σ_i (X_t(i) − X̄_t)²) )    (1)

where w_st is the weight between two data segments s and t, X_s and X_t are the ACF vectors obtained from the segments, and X̄_s, X̄_t are the mean values of X_s and X_t, respectively. Applying Equation (1) generates a symmetric matrix containing the pairwise similarities of the patient's data segments. Then, the weight of each segment was obtained by calculating the average similarity score of that segment. A threshold value λ separates the clean and noisy segments; when a segment is classified as noisy, it is dropped from the computation. An example of the difference between noisy and clean segments with a 1-min interval for a random patient is shown in Figure 3. In this paper, the value of λ is set to 0.8, as suggested by [15].
Figure 3. An example of noisy and clean data segments from a random patient. (a) A clean sample ECG segment with a weight of 0.98. (b) A noisy sample ECG segment with a weight of 0.79.
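As an illustration, the ACF-plus-cosine-similarity weighting described above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the number of ACF lags, the synthetic segments, and the choice to exclude self-similarity from the average are illustrative assumptions.

```python
import numpy as np

def acf(x, n_lags=100):
    """Autocorrelation function of one segment, up to n_lags lags."""
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")[len(x) - 1:]
    return full[:n_lags] / full[0]

def segment_weights(segments, n_lags=100):
    """Weight of each segment: average mean-centred cosine similarity
    (Eq. (1)) between its ACF and the ACFs of the other segments."""
    acfs = np.array([acf(s, n_lags) for s in segments])
    centred = acfs - acfs.mean(axis=1, keepdims=True)
    unit = centred / np.linalg.norm(centred, axis=1, keepdims=True)
    sim = unit @ unit.T                       # symmetric similarity matrix
    n = len(segments)
    return (sim.sum(axis=1) - 1.0) / (n - 1)  # drop the self-similarity term

# Ten clean sinusoidal segments and two noisy ones (synthetic stand-ins
# for 1-min ECG segments); keep only segments with weight above lambda = 0.8.
rng = np.random.default_rng(0)
clean = np.sin(0.2 * np.arange(6000)).reshape(10, 600)
noisy = rng.normal(size=(2, 600))
segs = np.vstack([clean, noisy])
w = segment_weights(segs)
kept = segs[w > 0.8]
```

With these synthetic segments, the clean sinusoids receive noticeably higher weights than the noise segments, mirroring the clean/noisy split shown in Figure 3.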

Wavelet Decomposition Using Scattering Wavelet
The WST is an improved time-frequency analysis method based on the wavelet transformation. The WST produces a stable, translation-invariant, and informative signal representation. The WST is very effective for classification problems because of its stability to deformations and its preservation of class discriminability. It consists of cascaded decomposition and convolution operations performed on the input signal [26]. Let the function f(t) be the signal under study. Then, a low-pass filter φ(t) and a mother wavelet ψ are constructed to cover the signal's frequencies. The low-pass filter provides local translation-invariant descriptions of the input signal up to a scale T. The family of wavelet indices is denoted by Λ_k, with a corresponding octave frequency resolution of Q_k. The high-pass filter banks are constructed by dilating the wavelet function over the input signal and are denoted by ψ_{j_k} for each j_k ∈ Λ_k. The WST is calculated using a convolutional network that iterates over the traditional wavelet transformation. The convolved signal oscillates at a scale of 2^j, and averaging it yields zero; therefore, a nonlinear (modulus/rectifier) operator is applied to the convolved segment to eliminate these oscillations. The convolutional operator S_0 f(t) = f * φ_J(t) produces the locally translation-invariant features of the input function f. However, the high-frequency information of the input signal is lost because of this convolution operation.
Therefore, the modulus transformation was applied to recover the lost data as follows:

U_1 f(t) = | f * ψ_{j_1} |(t)    (2)

The first-order scattering coefficients are obtained by averaging the modulus coefficients with φ_J as follows:

S_1 f(t) = | f * ψ_{j_1} | * φ_J(t)    (3)

The information lost during the averaging process can be retrieved by extracting the high-frequency coefficients of | f * ψ_{j_1} | as follows:

U_2 f(t) = || f * ψ_{j_1} | * ψ_{j_2} |(t)    (4)

Then, the second-order scattering coefficients are obtained similarly to Equation (3):

S_2 f(t) = || f * ψ_{j_1} | * ψ_{j_2} | * φ_J(t)    (5)

By generalizing, the wavelet modulus convolutions are obtained as follows:

U_m f(t) = | ... || f * ψ_{j_1} | * ψ_{j_2} | ... * ψ_{j_m} |(t)    (6)

The higher-order scattering coefficients are calculated by averaging U_m f(t) with φ_J to obtain S_m f(t) as follows:

S_m f(t) = U_m f(t) * φ_J(t)    (7)

The final form of the scattering coefficients can be written in vector form as follows:

S f(t) = { S_0 f(t), S_1 f(t), ..., S_l f(t) }    (8)

where S f(t) aggregates the scattering coefficients of all orders to describe the features of the input signal, and l is the maximum level of decomposition. Figure 4 illustrates the steps required to calculate the WST coefficients at various levels of decomposition; the outcome features are obtained as an aggregation of these coefficients. As the number of layers increases, the energy of the scattering coefficients diminishes, and the first two levels contain 99% of the total power of the input signal [27]. Another investigation utilized WST to extract characteristics from ECG signals and found that the optimal number of scattering layers is l = 2 [28]. The obtained scattering features S f(t) are stable to local deformation, a property inherited from the wavelet transformation. The scattering decomposition can capture slight variations in the amplitude and duration of the ECG signals. Therefore, the wavelet scattering tree was employed to generate a robust representation of ECG signals that minimizes disparities. In this study, the characteristics of the ECG signals were extracted using a second-order scattering network to reduce the computational complexity.
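To make the cascade of band-pass convolution, modulus, and low-pass averaging concrete, the following is a toy Python sketch of a two-layer scattering transform. The Gaussian band-pass filters, their centre frequencies, and the bandwidths are illustrative assumptions standing in for a proper Gabor filter bank (libraries such as Kymatio or MATLAB's Wavelet Toolbox provide production implementations):

```python
import numpy as np

def gauss_freq(n, center, sigma):
    """Gaussian band-pass filter defined directly in the frequency domain."""
    f = np.fft.fftfreq(n)
    return np.exp(-((f - center) ** 2) / (2 * sigma ** 2))

def scattering_1d(x, centers1=(0.05, 0.1, 0.2, 0.4),
                  centers2=(0.1, 0.3), sigma=0.02, sigma_phi=0.005):
    """Toy two-layer scattering transform:
    band-pass convolution -> modulus -> low-pass averaging."""
    n = len(x)
    phi = gauss_freq(n, 0.0, sigma_phi)          # low-pass filter phi_J
    X = np.fft.fft(x)
    S0 = np.real(np.fft.ifft(X * phi))           # zeroth order: f * phi_J
    S1, S2 = [], []
    for c1 in centers1:
        U1 = np.abs(np.fft.ifft(X * gauss_freq(n, c1, sigma)))  # |f * psi_j1|
        S1.append(np.real(np.fft.ifft(np.fft.fft(U1) * phi)))
        for c2 in centers2:
            if c2 >= c1:                          # frequency-decreasing paths only
                continue
            U2 = np.abs(np.fft.ifft(np.fft.fft(U1) * gauss_freq(n, c2, sigma)))
            S2.append(np.real(np.fft.ifft(np.fft.fft(U2) * phi)))
    return S0, np.array(S1), np.array(S2)

sig = np.sin(2 * np.pi * 0.1 * np.arange(1024))
S0, S1, S2 = scattering_1d(sig)
```

Only frequency-decreasing second-order paths (c2 < c1) are kept, mirroring the energy-decay property noted above.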

High-Order Moments Features
Higher-order moments are valuable for processing non-Gaussian and non-stationary signals whose divergences are neither systematic nor predictable; the variations among these features can improve the classification. The statistical mean, standard deviation, skewness, and kurtosis are defined as follows:

μ = (1/N) Σ_{i=1}^{N} x_i    (9)

σ = √( (1/N) Σ_{i=1}^{N} (x_i − μ)² )    (10)

Skewness = (1/N) Σ_{i=1}^{N} ((x_i − μ)/σ)³    (11)

Kurtosis = (1/N) Σ_{i=1}^{N} ((x_i − μ)/σ)⁴    (12)
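The four moment features above can be computed directly in NumPy. This sketch uses the population (biased) definitions, which is an assumption, since the text does not state which estimator is used:

```python
import numpy as np

def moment_features(x):
    """Mean, standard deviation, skewness, and kurtosis of a signal."""
    mu = x.mean()
    sigma = x.std()            # population standard deviation
    z = (x - mu) / sigma       # standardized samples
    return mu, sigma, (z ** 3).mean(), (z ** 4).mean()

mu, sigma, skew, kurt = moment_features(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
# A symmetric sample has zero skewness; this one has kurtosis 1.7.
```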

Shannon Entropy
Shannon entropy is a fundamental measure of the complexity and uncertainty of any state. It is the foundational concept of information theory and is computed as follows:

H(X) = − Σ_{i=1}^{N} P(x_i) log₂ P(x_i)    (13)

where x_i is the instance variable, P(x_i) is the probability of the variable, and N is the number of variables. However, the Shannon entropy suffers from the following limitations:
• The patterns in the time-series data must be complex enough to model the data; therefore, a large amount of data is required to populate all histogram bins and obtain a dense histogram.
• The calculation of the Shannon entropy is time-consuming.
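A histogram-based estimate of the Shannon entropy illustrates both the measure and its stated limitation: the estimate is only reliable when enough samples populate every bin. The bin count of 16 is an illustrative assumption:

```python
import numpy as np

def shannon_entropy(x, bins=16):
    """Histogram-based Shannon entropy of a signal, in bits."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()   # empirical probabilities
    return -np.sum(p * np.log2(p))

uniform = np.repeat(np.arange(16), 10)   # every bin equally populated
constant = np.zeros(160)                 # all mass in a single bin
h_max = shannon_entropy(uniform)         # maximal: log2(16) = 4 bits
h_min = shannon_entropy(constant)        # minimal: 0 bits
```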

Approximate Entropy
ApEn is a time-domain measure used to handle short data lengths and complex signals. It is also used to quantify the irregularity of physiological signals. The ApEn is a robust measure used to determine the HRV irregularity of sleep apnea signals for detection and classification [29]. For a given signal u(i) of length N, the ApEn is determined as follows: let x(i) be a segment of u(i) of length m such that:

x(i) = [u(i), u(i+1), ..., u(i+m−1)],  1 ≤ i ≤ N − m + 1

Then, the distance between two segments x(i) and x(j) is determined as follows:

d[x(i), x(j)] = max_{k=0,...,m−1} |u(i+k) − u(j+k)|    (14)

Then, C_i^m(r), for all i ∈ {1, 2, ..., N − m + 1}, is defined as:

C_i^m(r) = (number of j with d[x(i), x(j)] ≤ r) / (N − m + 1)    (15)

where r is a similarity tolerance threshold defined as r = 0.25 × STD, and STD is the standard deviation of the data segment. The higher the value of r, the more information may be lost; when the value of r is low, the sensitivity to noise increases significantly. Then, the average logarithmic randomness over all segments is calculated as follows:

Φ^m(r) = (1/(N − m + 1)) Σ_{i=1}^{N−m+1} ln C_i^m(r)    (16)

Finally, by generalizing Equation (16), the value of the ApEn is calculated as:

ApEn(m, r, N) = Φ^m(r) − Φ^{m+1}(r)    (17)
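The ApEn computation can be sketched as follows; the template construction and the inclusion of self-matches follow the classical ApEn definition, and the two test signals are illustrative:

```python
import numpy as np

def apen(u, m=2, r_factor=0.25):
    """Approximate entropy with tolerance r = 0.25 * STD."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    r = r_factor * u.std()

    def phi(k):
        # all overlapping templates of length k
        x = np.array([u[i:i + k] for i in range(N - k + 1)])
        # Chebyshev distance between every pair of templates
        d = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
        C = (d <= r).mean(axis=1)       # match ratios; self-matches included
        return np.log(C).mean()

    return phi(m) - phi(m + 1)

regular = np.tile([1.0, 2.0], 50)       # perfectly periodic signal
rng = np.random.default_rng(1)
irregular = rng.normal(size=100)        # white noise
a_reg = apen(regular)
a_irr = apen(irregular)
```

A periodic signal yields an ApEn near zero, while white noise scores higher, which is why ApEn is useful for quantifying HRV irregularity.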

Sample Entropy
The SampEn is a complexity measure based on the concept of the ApEn. However, it is defined as the negative logarithm of the conditional probability that two similar segments of data remain similar when the segment length increases [29]. In addition, the Sample Entropy has two advantages compared to the ApEn:
1. The value of the ApEn depends on the value of r, which does not guarantee consistency.
2. The value of the ApEn depends on the length of the data segment.
The distance between two vectors is determined as shown in Equation (14). The SampEn is then calculated as follows:

SampEn(m, r, N) = − ln( A^m(r) / B^m(r) )    (18)

where m is the embedding dimension that determines the number of samples contained in each vector, B^m(r) is the number of segment pairs of length m whose distance is within r, and A^m(r) is the corresponding count for segments of length m + 1.
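SampEn differs from ApEn mainly in excluding self-matches. A minimal sketch follows, with template ranges and match counting per the standard SampEn definition and an illustrative periodic test signal:

```python
import numpy as np

def sampen(u, m=2, r_factor=0.25):
    """Sample entropy, -ln(A/B); self-matches are excluded."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    r = r_factor * u.std()

    def matches(k):
        x = np.array([u[i:i + k] for i in range(N - k)])
        d = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
        return ((d <= r).sum() - len(x)) / 2   # distinct pairs only

    B = matches(m)        # similar template pairs of length m
    A = matches(m + 1)    # pairs still similar when extended to m + 1
    return -np.log(A / B)

s_periodic = sampen(np.tile([1.0, 2.0], 50))   # near zero for a periodic signal
```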

Spectral Entropy
SpEn is a complexity measure used to quantify the uncertainty and randomness of the signal's power spectrum [30]. The signal is converted to its power spectral density; then, after normalization, the Shannon entropy is computed as follows:

SpEn = − Σ_{k=1}^{N} P_k log₂ P_k    (19)

where P_k is the normalized power at frequency k, with Σ_k P_k = 1, and N is the total number of frequencies.
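The spectral entropy can be sketched with an FFT-based power spectrum. Normalizing by log2 of the number of frequency bins, which bounds the result to [0, 1], is an assumption commonly made in practice:

```python
import numpy as np

def spectral_entropy(x, normalize=True):
    """Shannon entropy of the normalized power spectrum."""
    psd = np.abs(np.fft.rfft(x)) ** 2       # power spectral density
    p = psd / psd.sum()                     # normalize so probabilities sum to 1
    p = p[p > 0]
    h = -np.sum(p * np.log2(p))
    return h / np.log2(len(psd)) if normalize else h

tone = np.sin(2 * np.pi * 8 * np.arange(256) / 256)   # single spectral line
rng = np.random.default_rng(2)
noise = rng.normal(size=256)                           # broadband spectrum
h_tone = spectral_entropy(tone)
h_noise = spectral_entropy(noise)
```

A pure tone concentrates its power in one bin and scores near zero, while broadband noise scores much higher.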

Attention Entropy
AttEn is a novel complexity measure focusing on discovering key observations (local maxima and local minima) in the input data series [31]. The main advantages of this measure are its linear time complexity and robust performance. Moreover, it can detect different patterns in the time series, which solves the Shannon entropy's limitations. Instead of counting the frequency of all observations, the AttEn evaluates the frequency distribution of the intervals between the most significant observations in a time series. The calculation of the AttEn for an input signal x(t) is performed as follows: if a data point x_i belongs to the key pattern, then the interval between this key point and the previous key point x_j is calculated as f(i, j) = i − j. Then, the counts of the interval values are stored in a variable F. Finally, the Shannon entropy is computed over the frequency distribution of the intervals F using Equation (13).
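A minimal sketch of this idea follows, using only local maxima as the key pattern and counting index intervals between successive peaks. The full AttEn of [31] also uses local minima and several max/min interval combinations, so this is a simplified illustration:

```python
import numpy as np

def attention_entropy(x):
    """Shannon entropy of the distribution of intervals between
    successive key observations (here, strict local maxima)."""
    x = np.asarray(x, dtype=float)
    # indices of strict local maxima (the "key pattern")
    peaks = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    if len(peaks) < 2:
        return 0.0
    intervals = np.diff(peaks)                    # f(i, j) = i - j
    _, counts = np.unique(intervals, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

periodic = np.sin(2 * np.pi * np.arange(200) / 20)   # peaks every 20 samples
rng = np.random.default_rng(3)
noise = rng.normal(size=200)
```

A strictly periodic signal has identical peak intervals and therefore zero AttEn; irregular signals produce a spread of intervals and a positive value.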

Cumulative Residual Entropy
The CRE is an alternative to the Shannon entropy that retains some of its properties while overcoming some of its limitations [32]. Instead of using the density function as the Shannon entropy does, it uses the cumulative distribution function. The CRE is calculated as follows:

CRE(X) = − Σ_{i=1}^{N} P(X > x_i) log P(X > x_i)    (20)

where P(X > x_i) is the survival (complementary cumulative distribution) function and N is the length of the signal X.
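A discrete sketch of the CRE using the empirical survival function follows; the left-sum discretization over the order statistics is an illustrative choice:

```python
import numpy as np

def cumulative_residual_entropy(x):
    """Discrete cumulative residual entropy: uses the survival
    function P(X > x) instead of the density."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # empirical survival function evaluated at each sample point
    surv = 1.0 - np.arange(1, n + 1) / n          # P(X > x_(i))
    dx = np.diff(x)                               # spacing between order statistics
    s = surv[:-1]
    mask = s > 0
    return -np.sum(s[mask] * np.log(s[mask]) * dx[mask])

cre_uniform = cumulative_residual_entropy(np.linspace(0.0, 1.0, 2001))
```

For a Uniform(0, 1) sample the analytic value is 1/4, which the empirical estimate approaches as the sample grows.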

OSA Classification
Ensemble learning is a robust approach that combines the predictions of various simple low-accuracy models instead of searching for a complicated high-accuracy learning model. The weak learner's main advantages are its fast training speed and the low complexity of producing initial predictions. RF is a bootstrap aggregation method that creates multiple replicas of the training samples; each replica is used to train a weak classifier. Thus, integrating multiple uncorrelated decision tree models improves prediction accuracy. The dataset contains 10,454 normal and 6511 apnea ECG segments of 1-min duration extracted from 35 annotated recordings used for this experiment. Then, the weight calculation was performed according to Equation (1), which detected 120 noisy segments. Then, the Gabor wavelet function was used to perform the wavelet decomposition, and the invariance scale was set to 60 s with a sampling frequency of 100 Hz. The grid search obtained the best values of the scattering network layers as Q1 = 8 and Q2 = 1 wavelets per octave. Figure 5 shows the used Gabor wavelet function and its low-pass filter φ_J(t). The output of the constructed scattering wavelet is a 3D tensor with dimensions of [1227 × 6 × 489]. The scattering coefficients were then reshaped to [2934 × 1227] to fit the classifier's input, where each row represents a time window and each column a scattering path; since there are six windows, the total number of samples is 6 × 489 = 2934. The difference between the healthy and OSA scattering coefficients is presented in Figures 6 and 7, respectively. Then, the features described in Section 2.4 were extracted from these coefficients to form the feature vectors. Figure 8 shows the box-whisker plot of each feature obtained from the WST coefficients to illustrate its significance. The proposed method includes two steps of classification to ensure its effectiveness and efficiency.
The values of each feature were normalized using the z-score before being forwarded to the classifier, as formulated below:

z_i = (x_i − X̄) / σ(X)    (21)

where x_i represents the value of the feature vector X at the i-th location, and X̄ and σ(X) represent the mean and standard deviation of the feature vector X, respectively. The first step performs classification and detection of each 1-min segment on the entire training dataset using 10-fold cross-validation. In this step, the complete set of features was used for training and validation. In the second step, the classifier is applied subject-by-subject on 1-min segments using the hold-out selection method. In addition, different classifiers are used to compare the results with the RF and hence find the best classifier: AdaBoost, ExtraTrees, GaussianNB, KNN, LDA, LR, QDA, RF, SGD, SVM, and XGBoost.
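The z-score step can be sketched as a column-wise operation on the feature matrix; the small matrix below is illustrative, and the normalized features are what a classifier such as the RF then receives:

```python
import numpy as np

def zscore(features):
    """Column-wise z-score normalization of a feature matrix."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / sigma

# Four samples with two features on very different scales.
F = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
Z = zscore(F)
```

After normalization, every column has zero mean and unit variance, so no single feature dominates the distance-based or tree-based classifiers.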

Results
The proposed method used the Physionet dataset for validation and testing, as mentioned in Section 2.1, which contains 16,965 ECG segments of 1-min duration gathered from 35 subjects. These segments were annotated by a domain expert to distinguish the normal (healthy) and apnea segments. A total of 120 noisy segments were detected by the weight calculation mentioned in Section 2.2 using a threshold of 0.8. Then, the WST was applied to the cleaned segments only for signal decomposition. A set of higher-order statistical and entropy-based features was extracted from the obtained scattering coefficients, as mentioned in Section 2.4. Then, the dataset was given to a set of classifiers to perform the experiment and obtain the best results. The details of the experimental results and selection methods are given below.

Performance Measures
The performance of the proposed method is evaluated based on the analysis of the confusion matrix and additional statistical measures. The confusion matrix provides the main terms of classification performance: TP, TN, FP, and FN. From these, the ACC, SEN, SPE, precision, F1 score, ROC, and area under the ROC (AUC) are obtained. In this work, the ACC describes the total number of correctly classified ECG segments from both the apnea and normal conditions. Sensitivity describes the correctly identified apnea segments out of all apnea segments, and specificity describes the correctly classified normal cases out of all normal segments. Precision represents the ratio of correctly predicted positive cases to all positive predictions. In case of conflict between sensitivity and precision, the F1 measure, their harmonic mean, was used. The AUC measures the separation between the classification categories.

Experimental Results
The RF classifier was applied to the feature space extracted from the WST coefficients of the 1-min ECG signals to distinguish the normal and apnea segments. Additionally, the performance of the proposed method was studied using different classifiers to compare their performance and obtain the best one. For example, the Naive Bayes classifier, linear models such as LR and SGD, functional models such as SVM, LDA, and QDA, lazy classifiers such as KNN, and ensemble methods such as RF, ExtraTrees, AdaBoost, and XGBoost were all used to benchmark the performance of the proposed method. Two cross-validation methods, namely hold-out and k-fold, allow various testing and validation comparisons with existing methods. The hold-out procedure partitions the dataset into 50% for training and 50% for testing. The k-fold approach divides the dataset into k partitions, using (k − 1) of them for training and the remaining one for testing. The experiment was repeated ten times for each cross-validation method to obtain the average performance measures. Table 1 shows the experimental results using the hold-out method with different classifiers and performance measures, written in (mean ± standard deviation) form. It is observed from this table that the RF achieved the best average accuracy of 90.23%, a specificity of 91.99%, and a precision of 87.24%, which is 0.24% higher than ExtraTrees. The F1 score shows that ExtraTrees is better than the RF by a small margin. This experiment proved the effectiveness of RF and ExtraTrees compared to the other techniques. The experiment was also performed using 10-fold cross-validation to validate the results with different classifiers, as shown in Table 2. It is observed from this table that the RF achieved the best accuracy of 91.5%, with a higher deviation than the ExtraTrees, making these two the best classification algorithms for this problem. The RF also achieved the best results in the precision measure.
The ExtraTrees achieved the best F1 result with a slight deviation compared to the RF. Moreover, Cohen's kappa coefficient was used to examine the agreement between categorical variables X and Y. For example, kappa can be used to compare the ability of various estimators to classify subjects into one of several classes. Cohen's kappa coefficient was computed for each classifier with hold-out and 10-fold cross-validation. The experiment was repeated ten times to obtain the average results. The RF and ExtraTrees outperformed the other classifiers with the same kappa value of 0.81 using the 10-fold method. Then, the XGBoost, KNN, and SVM achieved the second level of agreement with kappa values of 0.78, 0.74, and 0.68, respectively. The remaining classifiers are at the lowest level of agreement with kappa values less than or equal to 0.5. The same results were obtained by repeating this procedure using the hold-out method, which indicates that the RF and ExtraTrees are the best classifiers. The AUC was used to evaluate the performance of the different classifiers by computing the area under the ROC curve. As shown in Figure 9, the RF and ExtraTrees outperformed the remaining classifiers with a value of 0.91. Then, the XGBoost, KNN, and SVM achieved second place with values of 0.89, 0.87, and 0.84, respectively. The GaussianNB achieved the worst result with a value of 0.65. The PCA is a robust dimension reduction approach used to reduce the feature space. The PCA was applied to the hold-out and 10-fold experiments to study the effect of dimension reduction, as shown in Tables 3 and 4, respectively. From Table 3, the RF achieved the best results with an average accuracy of 86.83%, followed by ExtraTrees with 86.54%. The KNN, XGBoost, and SVM results are 86.07%, 85.68%, and 82.68%, respectively. Then, the AdaBoost, LR, SGD, and LDA achieved 73.16%, 72.33%, 72.08%, and 71.23%, respectively. Finally, the QDA and GaussianNB obtained 69.45% and 69.01%, respectively.
It is observed from Table 4 that the RF remains the best classifier. However, its overall accuracy decreased to 88% with a low standard deviation. Then, the ExtraTrees, KNN, XGBoost, and SVM scored 87.38%, 87.04%, 86.71%, and 84.05%, respectively, while the remaining classifiers achieved accuracies close to 72.5%. The results show that the RF performs significantly better than the other classifiers. Then, an SFS was implemented to reduce the feature space and obtain a compact subset of features using the RF regressor. The forward selection method reduced the feature space from 10 attributes to 5: spectral entropy, standard deviation, kurtosis, approximate entropy, and the mean. The significance of the spectral entropy and the standard deviation equals 0.2 each; the kurtosis, approximate entropy, and mean scored 0.13, 0.1, and 0.1, respectively. The usage of WST could reduce the computational cost in the training phase, as the time complexity of the scattering wavelet is O(L × N log₂(N)), where N is the length of the signal and L is the number of decomposition levels.
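Cohen's kappa, used above to compare classifier agreement, can be computed from the observed and chance agreement; the toy labels below are illustrative:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    po = (y_true == y_pred).mean()                        # observed agreement
    pe = sum((y_true == c).mean() * (y_pred == c).mean()  # chance agreement
             for c in labels)
    return (po - pe) / (1.0 - pe)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])
kappa = cohens_kappa(y_true, y_pred)   # 6/8 agreement over 1/2 chance -> 0.5
```

A kappa of 1 indicates perfect agreement with the labels, 0 indicates chance-level agreement, and the 0.81 reported above therefore reflects strong agreement.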

Discussion
The results presented in the previous section demonstrated the details of the experiments and concluded that the ensemble RF classifier was the most significant. The WST was applied to the raw ECG signals using two filter banks with values of Q1 = 8 and Q2 = 4 obtained by the grid search. The scattering wavelet outperformed various transformation methods because of its behavior based on the aggregation and maximization of features. This behavior is similar to the CNN operation; however, it requires much less processing time than CNNs. The extracted features shown in Figure 8 were obtained from the scattering wavelet coefficients to produce the search space. These features show the significance of classifying OSA and healthy subjects through various techniques. The RF outperformed the other classifiers with significant outcomes using the k-fold and hold-out cross-validation methods, although the difference between the two methods was not substantial. The k-fold method improved on the hold-out results with gains in average accuracy, sensitivity, specificity, precision, and F1 of 1.3%, 1.12%, 0.32%, 0.54%, and 0.91%, respectively. Cohen's kappa was computed on the full feature set with the k-fold and hold-out methods, and the most significant results were obtained for the RF and ExtraTrees classifiers; a higher kappa value indicates stronger agreement between the predicted and actual labels. Then, the AUC was calculated for the same experiment and showed the same significance for the RF and ExtraTrees. Furthermore, two feature reduction approaches, PCA and SFS, were applied to validate the proposed method's performance. Firstly, PCA was applied to summarize the feature space into fewer dimensions using geometric projection. PCA is a powerful data reduction method when the hidden patterns increase the variance of the projections onto orthogonal components.
However, the results obtained by the PCA are less significant when compared with the original feature space. The RF scored an average accuracy equal to 88% and 86.83% for the k-fold and the hold-out methods, respectively. Secondly, the SFS was applied to the feature space using the bi-directional method to select the most valuable features from the original features set. The SFS obtained spectral entropy, standard deviation, kurtosis, approximate entropy, and the mean as the most significant features.

Comparative Analysis
The proposed method achieved an average accuracy of 91.65% in the minute-by-minute analysis using 10-fold cross-validation, as shown in Table 2. The proposed approach provided an effective solution to detect and classify OSA despite the low quality of some recordings. Moreover, the feature extraction did not depend on any parameters extracted from the ECG signals, reducing the probability of incorrect classification. Another advantage of the proposed approach is its low computational cost, making it suitable for integration with embedded systems and the Internet of Things [33][34][35][36]. The performance comparison with existing investigations using the Physionet Apnea-ECG dataset is presented in Table 5. It can be observed that some related works outperformed the proposed approach, such as [10,12,15,19]. However, the proposed method used the complete set of ECG data segments to increase the overall accuracy: the relative weight of each segment was computed to filter each ECG segment based on its quality instead of dropping it [10]. In addition, the proposed method did not include any estimation of ECG characteristics, such as the QRS complex, which reduces the probability of incorrect classification; the estimation of the QRS complex may lead to inaccurate findings if the algorithm is inadequate or the signal is noisy [12]. The proposed approach outperformed some deep-learning methods, such as the CNN-based models of [9,18]. The Gabor spectrogram improved the significance of the deep learning model when applied; however, deep learning methods usually suffer from high computational training costs. The proposed approach outperformed the remaining techniques with a significant difference. Table 5. Comparison of the performance measures between the proposed method and previous works.

Conclusions
This paper proposed a novel approach for the automated classification and detection of OSA using single-lead ECG signals. This approach used the wavelet scattering transformation to decompose the ECG segments. A set of higher-order statistics and entropy-based features was calculated from the obtained scattering coefficients to capture the hidden information patterns. A random forest classifier was applied to the obtained features to classify the OSA. The classification performance was evaluated using accuracy, sensitivity, specificity, AUC, the F1 measure, and the kappa coefficient. The experimental results reported a classification accuracy of 91.65% for 60 s ECG segments on the Physionet Apnea-ECG dataset. The proposed approach provided better performance measures than most of the existing methodologies. In future work, we aim to improve this work by integrating it with zero-shot learning and training the model on different datasets.