A Deep Learning Framework for Automatic Sleep Apnea Classification Based on Empirical Mode Decomposition Derived from Single-Lead Electrocardiogram

Background: Although polysomnography (PSG) is the gold standard tool for diagnosing sleep apnea (SA), it can reduce the patient's sleep quality through the placement of several disturbing sensors and can only be interpreted by a highly trained sleep technician or scientist. In recent years, electrocardiogram (ECG)-derived respiration (EDR) and heart rate variability (HRV) have been used to automatically diagnose SA and reduce the drawbacks of PSG. Up to now, most of the proposed approaches have focused on machine-learning (ML) algorithms and feature engineering, which require prior expert knowledge and experience. The present study proposes an SA detection algorithm to differentiate normal and apnea events using a deep-learning (DL) framework based on 1D and 2D deep CNNs with empirical mode decomposition (EMD) of a preprocessed ECG signal. EMD is ideally suited to extracting essential components that are characteristic of the underlying biological or physiological processes. In addition, the simple and compact architecture of the 1D deep CNN, which performs only 1D convolutions, and of the pretrained 2D deep CNNs makes them suitable for real-time and low-cost hardware implementation. Method: This study was validated using 7 h to nearly 10 h of overnight ECG recordings from 33 subjects with an average apnea-hypopnea index (AHI) of 30.23/h from the Apnea ECG PhysioNet Database (AEPD). In preprocessing, the raw ECG signal was normalized and filtered using an FIR band-pass filter. The preprocessed ECG signal was then decomposed using the empirical mode decomposition (EMD) technique to generate several features, the most important of which were selected using neighborhood component analysis (NCA). Finally, deep-learning algorithms based on 1D and 2D deep CNNs were used to classify normal and apnea events. The synthetic minority oversampling technique (SMOTE) was also applied to evaluate the influence of the imbalanced-data problem. Results: The segment-level classification achieved 93.8% accuracy with 94.9% sensitivity and 92.7% specificity based on 5-fold cross-validation (5fold-CV); meanwhile, the subject-level classification achieved 83.5% accuracy with 75.9% sensitivity and 88.7% specificity based on leave-one-subject-out cross-validation (LOSO-CV). Conclusion: A novel and robust SA detection algorithm based on the EMD-decomposed ECG signal and deep CNNs was successfully developed in this study.


Introduction
Sleep apnea (SA) is a common sleep disorder in which pauses in breathing or periods of shallow breathing during sleep occur more often than normal; it often remains undiagnosed and untreated [1]. Each pause may last between 10 and 20 s and happen from 5 to over 100 times per hour during overnight sleep. SA can present with or without symptoms and is accompanied by major neurocognitive and cardiovascular sequelae. SA is characterized by increased collapsibility of the upper airway during sleep, resulting in significantly reduced (hypopnea) or absent (apnea) airflow at the nose and/or mouth, which is usually accompanied by oxyhemoglobin desaturation and is typically terminated by a brief micro-arousal [2]. Repeated apnea episodes result in a prolonged decrease in oxyhemoglobin saturation and sleep fragmentation, with less slow-wave and rapid eye movement (REM) sleep.
SA has a high prevalence, especially among middle-aged and elderly people. According to the diagnostic criteria of the American Academy of Sleep Medicine (AASM), approximately 936 million adults aged 30 to 69 years have mild to severe SA syndrome and 425 million adults aged 30 to 69 years have moderate to severe SA [3,4]. Further research into the relationship between the incidence of SA and age found that 88% of men aged 65 to 69 years had five or more events per hour, while that incidence increased to 90% in men aged 70 to 85 years. In addition, SA has been linked to an increased risk of other diseases such as hypertension, heart failure, heart attack, and cardiovascular events [5].
Several diagnostic tools are in use for SA. For the AASM, the gold standard for an SA diagnosis remains comprehensive polysomnography [6], although alternative diagnostic methods have been studied. In recent decades, different SA detection methods have been proposed using a single-lead physiological signal to satisfy the non-intrusive demands and hardware constraints of wearable mobile equipment, such as pulse oximetry (SpO2), nasal airflow, thoracic (THO) and abdominal (ABD) movement, electroencephalography (EEG), and electrocardiography (ECG). Based on the SpO2 signal, Alvarez et al. built a home polysomnography-based SA screening system using machine learning [7], and Mendonca et al. designed a wireless home monitoring device for the automatic diagnosis of OSA patients [8]. Based on nasal airflow, Haidar et al. proposed an apnea-hypopnea event recognition system using a convolutional neural network (CNN) [9], and Yue et al. developed a multi-resolution residual network for the diagnosis and classification of OSA [10]. Based on THO and ABD movement signals, Lin et al. identified various types of apneas using wearable piezoelectric bands [11]. Based on the EEG signal, Zhao et al. developed an algorithm to automatically differentiate SA events using entropy-based EEG sub-band feature extraction and machine learning-based classifiers [12], Bhattacharjee et al. aimed to detect SA based on Rician modeling of feature variation in multi-band EEG [13,14], and Taran et al. introduced an apnea event detection algorithm using Hermite basis function adaptive decomposition of EEG [15].
ECG waveform analysis and heart rate variability (HRV) have recently been used and developed as alternative methods of detecting sleep-disordered breathing, including apnea and hypopnea events [16,17]. The ECG signal is particularly interesting to study because it physiologically reflects the occurrence of SA and is easy to record with wearable devices. When an apnea event occurs, blood oxygen levels drop, and the cardiovascular system is prompted to keep the body's oxygen supply adequate. As a result, abnormal heart activity or high heart rate variability may indicate the presence of SA. Respiratory effort produces an alteration in the ECG electrode's position, which severely influences the ECG signal's amplitude [18]. Meanwhile, HRV assesses the time-interval variation between successive heartbeats, known as the R-R interval (RRI). Variation in the RRI is a symptom of apnea events and hence can provide the physiological basis to detect SA. The possibility of diagnosing SA by single-lead ECG was presented in 1984 by Guilleminault et al. [19], followed by the publication of the PhysioNet apnea ECG database after a competition called the Computers in Cardiology 2000 Challenge [16], which provided minute-by-minute ECG data with apnea labeling. In recent years, more studies using single-lead ECG to detect apnea episodes have been proposed and validated using the PhysioNet database. The majority of them concentrated on machine learning (ML) and feature engineering, which necessitates the use of a specific feature extraction method to extract ECG features. In parallel, deep-learning (DL) approaches have been developed in recent years. The SA detection literature based on single-lead ECG using ML and DL frameworks is summarized in Table 1.

The novelty and significance of this study can mainly be described as follows: (1) The proposed method uses EMD features generated from preprocessed overnight ECG signals to identify various symptoms of SA episodes. Because of external stimulation, biomedical signals are in general non-linear and non-stationary; thus, EMD is ideally suited to extracting essential components that are characteristic of the underlying biological or physiological processes. (2) The proposed deep-learning framework can outperform state-of-the-art SA classification methods, with potential for real-time and low-cost hardware implementation. The deep CNN models have a simple and compact architecture; moreover, they reduce the need for feature extraction. (3) The proposed method applies the synthetic minority oversampling technique (SMOTE) to overcome the imbalance problem; a performance comparison with other resampling methods is also discussed.
The remainder of this paper is organized as follows. Section 2 describes the database that was used, as well as the proposed SA detection system based on EMD and deep CNN models. Section 3 contains the experimental results. Section 4 discusses the study findings. Finally, Section 5 concludes the study.

Overnight Apnea ECG Database
The Apnea ECG PhysioNet Database (AEPD) [40] is an open public database; it was utilized so that the proposed algorithm's performance could be validated, evaluated, and compared with the existing literature. The AEPD ECG recordings were sampled at 100 Hz, and the duration of each participant's sleep recording ranged between 7 and 10 h [16]. The database contains 35 participants' recordings with SA annotations on a minute-by-minute basis; each minute of ECG recording was annotated as either normal breathing or disordered breathing with an apnea episode.
AEPD had three participant categories: A: apnea (Table 2), B: borderline (Table 3), and C: healthy control (Table 4), with 20, 5, and 10 subjects, respectively. Group A participants corresponded to people with OSA and had total apnea durations of more than 100 min per recording. This group's participants ranged in age from 38 to 63, and their AHI ranged from 21 to 83. Group B participants were borderline, with total apnea episode durations ranging from 10 to 96 min. This group's ages ranged from 42 to 53, and the AHI ranged from 0 to 25. Group C participants were healthy controls with no SA or very low levels of disease and total apnea durations ranging from 0 to 3 min. Group B recording b05 and Group C recording c05 were eliminated because b05 contained a grinding noise and c05 was identical to c06. Consequently, only 33 recordings were included in this study, with a total of 9160 normal events and 6019 apnea events (an imbalanced dataset), after eliminating some contaminated events caused by patient movement, poor patch contact, electrical interference, measurement noise, or other disturbances. Tables 2-4 provide the details of each participant's characteristics and the sample-size information used in this study.

The Proposed SA Detection System

Figure 1 shows the flowchart of the overall proposed sleep apnea detection system based on the deep-learning framework and ECG signal decomposition. This study proposes an automatic SA classification algorithm based on a deep-learning framework and EMD derived from single-lead ECG signals of apnea and normal breathing. Overnight single-lead ECG signals from 33 participants, each with 7~8 h of recordings, were used as input and partitioned into subsequent 60-s ECG windows. The participants of the mixed group category comprised AEPD groups A, B, and C. The raw ECG signal was preprocessed with normalization, windowing, and band-pass filtering. The EMD algorithm was used to obtain decomposition features to discriminate SA and normal breathing episodes. Feature selection based on neighborhood component analysis (NCA) was then applied to reduce the number of EMD features (IMFs). In total, 4 IMFs were used in this study, specifically IMF1, IMF2, IMF12 (the sum of IMF1 and IMF2), and IMF123 (the sum of IMF1, IMF2, and IMF3). Thereafter, the synthetic minority oversampling technique (SMOTE) was applied to deal with the data-imbalance problem. Finally, a 1D deep CNN model was established to obtain the classification results using cross-validation.

ECG Signal Preprocessing and Filtering
Zero-mean computation and windowing were used in the ECG signal preprocessing. In this method, the zero-mean step subtracts the mean from the ECG signals to remove baseline-wandering effects, and the nocturnal ECG recordings are then divided into 60-s intervals, with windows containing large noise excluded from algorithm development. The AEPD used in this study labeled each 60-s window as containing either an apnea or a normal episode.
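As an illustration, the following is a minimal Python sketch of this preprocessing pipeline, assuming an FIR band-pass filter as stated in the abstract; the 0.5-40 Hz passband and filter order are illustrative assumptions, since the exact cutoff frequencies are not listed here.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 100                      # AEPD sampling rate (Hz)
WIN = 60 * FS                 # 60-s window = 6000 samples

def preprocess_ecg(ecg, fs=FS, band=(0.5, 40.0), numtaps=501):
    """Zero-mean the signal, apply an FIR band-pass filter, and
    split the recording into non-overlapping 60-s windows.
    The passband and filter order here are illustrative assumptions."""
    ecg = np.asarray(ecg, dtype=float)
    ecg -= ecg.mean()                          # zero-mean: remove baseline offset
    taps = firwin(numtaps, band, pass_zero=False, fs=fs)
    ecg = filtfilt(taps, 1.0, ecg)             # zero-phase FIR filtering
    n_win = len(ecg) // WIN
    return ecg[: n_win * WIN].reshape(n_win, WIN)
```

In practice, windows flagged as heavily contaminated would then be dropped before feature generation, as described above.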


Empirical Mode Decomposition (EMD)

Empirical mode decomposition (EMD) [41] is a time-frequency domain signal analysis that adaptively and locally decomposes any non-stationary time series into a sum of oscillatory waveforms known as intrinsic mode functions (IMFs), which represent zero-mean amplitude- and frequency-modulated components. The main idea of this method is to empirically identify these intrinsic oscillatory modes by their characteristic time scales in the data, and then decompose the data accordingly. The steps of EMD generation are as follows:

1. Connect all the local maxima and minima by applying a cubic spline to generate the upper and lower envelopes.
2. Compute the average of the upper and lower envelopes, denoted as m1(t).
3. Calculate the difference between the signal y(t) and the average m1(t), i1(t) = y(t) − m1(t) (the potential first IMF).
4. Verify whether i1(t) satisfies the two requirements of an IMF:
• In the entire data set, the number of extrema and zero-crossings must be equal or differ by no more than one.
• The mean value of the envelope defined by the local maxima and local minima is zero at every point.
If i1(t) meets both requirements, then i1(t) is the first IMF of the y(t) signal. If i1(t) does not meet the two IMF requirements, the sifting process is reiterated by treating i1(t) as the original signal until the requirements are met.
5. After subtracting the IMF from the y(t) signal, the sifting process is reiterated on the residue to break the data down into n IMFs.

Finally, the signal y(t) can be expressed as:

$$y(t) = \sum_{j=1}^{n} i_j(t) + r_n(t)$$

where $i_j(t)$ represents the IMFs of the original signal $y(t)$ with $j = 1, 2, 3, \ldots, n$ and $r_n(t)$ is the residue of $y(t)$. The differences between the ECG decomposed signals, IMF1 to IMF6, for normal and apnea episodes are clearly illustrated in Figures 2 and 3.
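To make the sifting procedure concrete, below is a minimal Python sketch of the decomposition loop, assuming SciPy's CubicSpline for the envelopes; a production implementation (e.g., the PyEMD package) adds boundary handling and more robust stopping criteria than the simple one used here.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_one_imf(y, max_iter=100, tol=1e-3):
    """Extract a single IMF from y by iterative sifting (steps 1-4)."""
    t = np.arange(len(y))
    h = y.copy()
    for _ in range(max_iter):
        maxima = argrelextrema(h, np.greater)[0]
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 2 or len(minima) < 2:
            break                                  # too few extrema to continue
        upper = CubicSpline(maxima, h[maxima])(t)  # upper envelope (step 1)
        lower = CubicSpline(minima, h[minima])(t)  # lower envelope (step 1)
        m = (upper + lower) / 2                    # mean envelope m(t) (step 2)
        h_new = h - m                              # candidate IMF i(t) (step 3)
        if np.mean(m**2) / (np.mean(h**2) + 1e-12) < tol:
            return h_new                           # mean envelope close to zero
        h = h_new
    return h

def emd(y, n_imfs=6):
    """Decompose y into IMFs plus a residue: y = sum(imfs) + residue (step 5)."""
    residue, imfs = y.astype(float), []
    for _ in range(n_imfs):
        imf = sift_one_imf(residue)
        imfs.append(imf)
        residue = residue - imf                    # subtract IMF, sift the residue
    return np.array(imfs), residue
```

In this study, the first few IMFs of each preprocessed 60-s ECG window (and their sums) are the components ultimately retained for classification.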

Feature Transformation Using Continuous Wavelet Transform (CWT)
The continuous wavelet transform (CWT) [42] is a technique for extracting time-localized information from a signal by calculating the time convolution of the signal with an analyzing wavelet. Computing the spectro-temporal representation of a signal using CWT requires the convolution of the signal x(t) with a family (set) of wavelets ψf(t) defined for a specific frequency range:

$$W(t, f) = \int_{-\infty}^{\infty} x(\tau)\, \psi_f^{*}(\tau - t)\, d\tau$$

To extract amplitude and phase, wavelet functions must be complex-valued and well-localized in time and frequency. The complex Morlet wavelet is the most commonly used wavelet for complex spectro-temporal signal characterization. It consists of a complex oscillation with a fixed frequency (frequency localization) that is constrained by a Gaussian window (time localization). The mother wavelet of the complex Morlet family is formulated as:

$$\psi_{f_0}(t) = A(\sigma_t)\, e^{-t^2 / 2\sigma_t^2}\, e^{j 2\pi f_0 t}$$

where $e^{j 2\pi f_0 t}$ is the complex oscillatory component with frequency $f_0$ Hz, $e^{-t^2/2\sigma_t^2}$ is the Gaussian window with a temporal standard deviation of $\sigma_t$, $A(\sigma_t)$ is applied to ensure that the wavelet energy is equal to one, and the parameter $n_c$ specifies the number of cycles at the frequency $f_0$ inside the Gaussian bell. The CWT spectrograms of normal and apnea episodes from the EMD-preprocessed ECG signal are presented in Figures 4 and 5.
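For illustration, here is a minimal sketch of a complex-Morlet CWT magnitude scalogram using PyWavelets; the "cmor1.5-1.0" bandwidth/center-frequency parameters and the 1-6 Hz frequency grid are assumptions chosen for display, not the exact settings used in the study.

```python
import numpy as np
import pywt

def cwt_scalogram(x, fs=100, freqs=np.linspace(1.0, 6.0, 64),
                  wavelet="cmor1.5-1.0"):
    """Complex-Morlet CWT magnitude scalogram of a 60-s ECG/IMF window.
    The wavelet parameters and frequency grid are illustrative assumptions."""
    fc = pywt.central_frequency(wavelet)       # center frequency of the wavelet
    scales = fc * fs / freqs                   # map target frequencies to scales
    coefs, out_freqs = pywt.cwt(x, scales, wavelet, sampling_period=1.0 / fs)
    return np.abs(coefs), out_freqs            # |W(t, f)| image for the 2D CNNs
```

Scalogram images of this kind serve as the 2D inputs for the pretrained CNNs described later.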


Feature Selection Using Neighborhood Component Analysis (NCA)
Feature selection is the process of reducing the number of input variables when developing a predictive model. Reducing the number of input variables both lowers the computational cost of modeling and, in some cases, improves the model's performance. Neighborhood component analysis (NCA) is a non-parametric method for selecting features with the goal of maximizing the prediction accuracy of classification algorithms [43].

The regularization parameter (λ) was tuned to obtain the best features. NCA discards irrelevant features, which reduces the feature dimensionality and improves algorithm performance. The steps involved in NCA are as follows [44] (a minimal sketch of this tuning loop is given at the end of this subsection):
• The data are divided into training and testing sets. The training data are then partitioned into 10 folds, in which the classifier leaves out one fold for testing and is trained on the other nine folds.
• The regularization parameter λ is tuned, and the NCA model is trained for each λ using each fold in the training set. This process is repeated for all folds and all λ values.
• The average loss obtained from each fold for each λ value is computed, and the best λ value is found, corresponding to the minimum average loss.
• Finally, the features with feature weights greater than the threshold T = τ · max(w) are extracted, where τ is a tolerance fixed to 0.02 and w is the updated feature-weight vector.

As a result, the selected EMD features were IMF1, IMF2, and IMF3, based on the NCA evaluation of the mean, standard deviation, and variance of each feature. The selected IMFs were also combined to generate new features, IMF12 and IMF123, to remove most of the noise and artifacts and construct denoised ECG signals. The combined IMFs were defined as follows:

$$\text{IMF}_{12} = \text{IMF}_1 + \text{IMF}_2, \qquad \text{IMF}_{123} = \text{IMF}_1 + \text{IMF}_2 + \text{IMF}_3$$

In total, there were 4 IMF features, IMF1, IMF2, IMF12, and IMF123, which were passed to the classification stage (Figures 6 and 7 provide details of the IMF features for normal and apnea episodes).
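As referenced above, the following is a minimal sketch of the λ-tuning loop under stated assumptions: `nca_feature_weights` is a hypothetical stand-in for an NCA fit that returns per-feature weights and a validation loss (in MATLAB this role is played by fscnca); it is not a real library call and must be supplied by the user.

```python
import numpy as np
from sklearn.model_selection import KFold

def nca_feature_weights(X_tr, y_tr, X_va, y_va, lam):
    """Hypothetical stand-in: fit NCA with regularization lam and return
    (feature_weights, validation_loss). Replace with a real NCA fit."""
    raise NotImplementedError

def select_features(X, y, lambdas, tau=0.02, n_folds=10):
    """Tune lambda by 10-fold CV, then keep features with w > tau * max(w)."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    avg_loss = []
    for lam in lambdas:
        losses = []
        for tr_idx, va_idx in kf.split(X):
            _, loss = nca_feature_weights(X[tr_idx], y[tr_idx],
                                          X[va_idx], y[va_idx], lam)
            losses.append(loss)
        avg_loss.append(np.mean(losses))
    best_lam = lambdas[int(np.argmin(avg_loss))]   # minimum average loss
    w, _ = nca_feature_weights(X, y, X, y, best_lam)
    return np.flatnonzero(w > tau * w.max())       # indices of selected features
```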


Data Augmentation Using Synthetic Minority Oversampling Technique (SMOTE)
The SMOTE algorithm [45] generates artificial examples based on feature-space, rather than data-space, similarities between existing minority examples, as a form of data augmentation. These synthetic examples are generated by connecting a portion or all of the minority class's K nearest neighbors. Neighbors from the K nearest neighbors are chosen at random depending on the amount of oversampling required.
In more detail, let S_min ⊂ S characterize the minority class. For each data point x_i ∈ S_min, find the k nearest neighbors, given a specified K. The k nearest neighbors are defined as the K elements of S_min whose Euclidean distance to x_i has the smallest magnitude in the feature space X. To generate a new data point, choose one of the k nearest neighbors at random, denoted x̂_i, and calculate the difference between it and the selected data point. Multiply this difference by a number λ that ranges from 0 to 1. Finally, add this vector to the selected data point x_i:

$$x_{\text{new}} = x_i + \lambda\, (\hat{x}_i - x_i)$$

where x_i ∈ S_min is the selected instance from the minority class, x̂_i ∈ S_min is one of the k nearest neighbors of x_i, and λ ∈ [0, 1] is a randomly generated number.
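A minimal NumPy sketch of this generation rule is shown below; in practice a library implementation such as imblearn.over_sampling.SMOTE would be applied to the training folds. The parameter names here are illustrative.

```python
import numpy as np

def smote_sample(X_min, k=5, n_synthetic=100, rng=None):
    """Generate synthetic minority samples by the SMOTE rule
    x_new = x_i + lambda * (x_hat - x_i)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude each point itself
    knn = np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbors
    out = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)                   # random minority instance x_i
        j = knn[i, rng.integers(k)]           # random neighbor x_hat
        lam = rng.random()                    # lambda drawn uniformly from [0, 1]
        out[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return out
```

Because the synthetic points lie on line segments between real minority examples rather than duplicating them, SMOTE tends to overfit less than plain random over-sampling, a point revisited in the discussion of Table 14.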

One-Dimensional Deep Convolutional Neural Network (1D Deep CNN)
Deep learning (DL) is the most recent achievement of the machine-learning (ML) era, displaying near-human [46], and now super-human, abilities in a variety of applications such as voice-to-text translation, object detection and recognition, anomaly detection, emotion recognition from audio or video recordings, and so on.
The proposed 1D deep CNN architecture, adapted from a study by Chang et al. [25], comprises input, feature extraction, classification, and output stages, as shown in Figure 8 and Table 5. To initialize the weights, the CNN and FC layers both use the He normal initialization method. The weights are initialized with the previous layer of neurons in mind, which helps the cost function reach the global minimum faster and more efficiently. The batch normalization layers, which are added after the CNN and FC layers in both the feature extraction and classification stages, normalize the data before it enters the ReLU activation layer, which improves the neural network's speed, performance, and stability. By selecting the maximum activation of a neuron in a feature map, the max pooling layers in the feature extraction stage reduce network complexity and the possibility of overfitting. When the pooling size is set to 2, the size of each feature map is cut in half. Dropout layers with a dropout rate of 0.5 are used to reduce overfitting by randomly omitting 50% of the nodes during the proposed CNN model's training process. Overfitting results in a high training accuracy but a low testing accuracy. The proposed 1D deep CNN model was trained to minimize cross entropy using the Adam optimizer, which is a stochastic gradient descent extension that computes individual adaptive learning rates for different parameters based on estimates of the gradient's first and second moments.
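As an illustration of these components, below is a minimal Keras sketch assembling Conv1D/BatchNorm/ReLU/MaxPool blocks with He-normal initialization, a dropout-regularized FC head, and Adam with cross-entropy loss. The number of blocks, filter counts, and kernel sizes are illustrative assumptions, not the exact configuration of Table 5; the input length of 6000 corresponds to a 60-s window at 100 Hz.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_1d_cnn(input_len=6000, n_classes=2):
    """Minimal 1D CNN sketch: Conv1D -> BatchNorm -> ReLU -> MaxPool blocks,
    He-normal init, dropout-regularized FC head, Adam + cross-entropy."""
    inp = layers.Input(shape=(input_len, 1))
    x = inp
    for filters in (16, 32, 64):                 # illustrative filter counts
        x = layers.Conv1D(filters, kernel_size=5, padding="same",
                          kernel_initializer="he_normal")(x)
        x = layers.BatchNormalization()(x)       # normalize before activation
        x = layers.ReLU()(x)
        x = layers.MaxPooling1D(pool_size=2)(x)  # pool size 2 halves each map
    x = layers.Flatten()(x)
    x = layers.Dense(64, kernel_initializer="he_normal")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Dropout(0.5)(x)                   # drop 50% of nodes in training
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```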

Two-Dimensional Deep Convolutional Neural Network (2D Deep CNN)
Transfer learning is the enhancement of learning in a new task through the transfer of knowledge from a previously learned network. Fine-tuning with transfer learning is typically considerably faster and easier than training a network with randomly initialized weights from the beginning. Three kinds of pretrained CNN models were employed for transfer learning in this study, i.e., AlexNet, GoogLeNet, and ResNet-50. Table 6 explains the main differences among these pretrained CNN models.
The AlexNet [47] architecture consists of 8 layers, including 5 convolution layers (with 2 Convolution 2D layers, 3 Grouped Convolution 2D layers, 5 Rectified Linear Unit (ReLU) layers, 2 Cross-Channel Normalization layers, and 3 Max-Pooling 2D layers) and 3 fully connected layers (with 3 Fully Connected layers, 2 ReLU layers, 2 Dropout layers for regularization, and a Softmax layer using a normalized exponential function). The network's originality lies in the effective implementation of the ReLU activation function, as well as the usage of the Dropout mechanism and a data augmentation technique to prevent overfitting. A Cross-Channel Normalization layer is used in the network to increase model generalization. Furthermore, overlapping max pooling is employed to eliminate the blurring effect generated by average pooling. GoogLeNet [48] is a pretrained CNN that has 22 layers with 9 inception layers. In a convolutional vision network, an inception layer determines the optimal local sparse structure, which can be approximated and covered by readily available dense components. The network includes an inception structure to expand the network's width and depth while eliminating the fully connected layer and replacing it with average pooling to prevent gradient disappearance.
ResNet-50 [49] denotes a residual network (ResNet) built with 50 layers. The central concept of a ResNet is the introduction of an "identity shortcut connection" that bypasses one or more layers. A shortcut (or skip) connection is used to solve the problem of vanishing or exploding gradients by re-routing the input and adding it to the output of the bypassed layers. A layer thus learns the concepts of the previous layer and merges them with inputs from that previous layer during learning.
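A minimal transfer-learning sketch in Keras is shown below, using ResNet-50 as an example: the ImageNet feature extractor is kept and the classification head is replaced. The frozen-base setting, input size, and learning rate are assumptions, since the exact fine-tuning schedule used in the study is not specified here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transfer_model(input_shape=(224, 224, 3), n_classes=2):
    """Fine-tune a pretrained ResNet-50 on CWT scalogram images:
    keep the ImageNet feature extractor, replace the classifier head."""
    base = tf.keras.applications.ResNet50(weights="imagenet",
                                          include_top=False,
                                          input_shape=input_shape)
    base.trainable = False                     # freeze pretrained weights first
    x = layers.GlobalAveragePooling2D()(base.output)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

After the new head converges, some or all of the base layers can optionally be unfrozen for further fine-tuning at a lower learning rate.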

Cross-Validation
Cross-validation is a statistical method for evaluating and comparing learning algorithms that divides data into two groups: a training set (for model training) and a testing set (for model testing or validation) [50]. The training and testing sets must cross over in successive rounds so that each data point can be validated. Cross-validation is used for two main reasons: First, the performance of the learned model from available data can be investigated using a single algorithm. In other words, it is used to assess an algorithm's generalizability. The second goal is to compare the performance of two or more different algorithms to find the best algorithm for the available data or, alternatively, to compare the performance of two or more parameterized model variants. k-fold cross-validation (kfold-CV) and leave-one-subject-out cross-validation (LOSO-CV) were performed for segment-level and subject-level validation, respectively.
In the kfold-CV (Figure 9a), the first step is to randomly divide the data samples into k subgroups. Afterward, each subgroup is chosen in turn as the testing set, with the remaining (k-1) subgroups as the training set. In this manner, kfold-CV repeats training and testing k times, and the final accuracy is the average of the k accuracy values from the iterations. Meanwhile, in the LOSO-CV, the records of only one subject are excluded from each training process. In other words, with N the number of people in the dataset, training is performed on N-1 subjects and validation is performed on the single held-out subject. This process is repeated N times (Figure 9b).
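A compact sketch of both validation schemes with scikit-learn is given below; `evaluate` is a hypothetical stand-in for training the classifier on one split and returning its test accuracy.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

def evaluate(X_tr, y_tr, X_te, y_te):
    """Hypothetical stand-in: train the classifier and return test accuracy."""
    raise NotImplementedError

def segment_level_cv(X, y, k=5):
    """5-fold CV over 60-s segments (segment-level validation)."""
    accs = [evaluate(X[tr], y[tr], X[te], y[te])
            for tr, te in KFold(n_splits=k, shuffle=True,
                                random_state=0).split(X)]
    return np.mean(accs)                       # average of the k fold accuracies

def subject_level_cv(X, y, subject_ids):
    """LOSO-CV: hold out all segments of one subject per round (N rounds)."""
    logo = LeaveOneGroupOut()
    accs = [evaluate(X[tr], y[tr], X[te], y[te])
            for tr, te in logo.split(X, y, groups=subject_ids)]
    return np.mean(accs)
```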

Experimental Results
This study was conducted using MATLAB 2021b software (developed by MathWorks, Natick, MA, USA) on several computers with 24 GB of installed RAM, an Intel® Core™ i5-8400 CPU @ 2.80 GHz, and an NVIDIA GeForce GTX 1060 6 GB graphics card. The commonly used performance evaluation parameters for the apnea detection system are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}$$

where TP indicates the number of positive events predicted correctly (apnea events predicted as apnea events), FP indicates the number of actual negative events (normal events) predicted as positive events (apnea events), TN indicates the number of negative events predicted correctly (normal events predicted as normal events), and FN indicates the number of actual positive events (apnea events) predicted as negative events (normal events). The receiver operating characteristic (ROC) curve was then plotted, and the area under the ROC curve (AUC) was computed to globally specify the apnea detection performance [51]. Youden's index is commonly used to assess the effectiveness of an overall diagnostic test when deciding between two or more diagnostic tests [52]. In other words, this index was used to choose the best classification result among a variety of input settings, such as dataset configuration and feature. Youden's index is a function of sensitivity and specificity and ranges from 0 to 1, with a value close to 1 indicating that the diagnostic test is meaningful and a value close to 0 indicating that the diagnostic test is useless. Youden's index (J) is calculated from the measurements correctly diagnosed for the diseased group (sensitivity) and healthy group (specificity):

$$J = (\text{Sensitivity} + \text{Specificity}) - 1$$
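These definitions translate directly into code; the short helper below computes all four quantities from confusion-matrix counts (a sketch; the example numbers are purely illustrative, not results from this study).

```python
def apnea_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and Youden's index J
    from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)     # true positive rate (apnea recall)
    specificity = tn / (tn + fp)     # true negative rate (normal recall)
    j = sensitivity + specificity - 1
    return accuracy, sensitivity, specificity, j

# Illustrative counts only:
print(apnea_metrics(tp=5700, fp=260, tn=8900, fn=319))
```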

Per Segment Classification
Per-segment classification was performed using 5-fold cross-validation (5fold-CV). The detailed classification results are given in Table 7 and Figures 10-13. The best per-segment classification performance on the mixed-group SA imbalanced dataset was from IMF12 with ResNet-50, with 92.5% accuracy, 91.4% sensitivity, 93.1% specificity, and an AUC value of 0.9758. The best per-segment classification performance on the mixed-group SA balanced dataset based on SMOTE was from IMF123, with 93.8% accuracy, 94.9% sensitivity, 92.7% specificity, and an AUC value of 0.9833.


Per Subject Classification
A more applicable approach is the so-called per-subject classification, which is required to test a new, unseen participant with the trained model. Per-subject classification was performed using leave-one-subject-out cross-validation (LOSO-CV). The detailed classification results are given in Tables 8-11. The best per-subject classification performance for the mixed-group SA was from IMF12 with ResNet-50, with 83.5% accuracy, 75.9% sensitivity, and 88.7% specificity.

Comparison with Existing Methods

Table 12 compares the proposed SA detection algorithm with several of the latest best-practiced DL and ML algorithms from the existing literature using AEPD. In recent years, DL frameworks for SA classification have been developed. Qin et al. [21] proposed a dual-model deep-learning method to perform representation learning and capture long-term temporal dependence, with the adaptive synthetic (ADASYN) sampling method, based on the R-R interval to classify SA events (91.1% accuracy, 88.9% sensitivity, and 92.4% specificity). Yeh et al. [22] developed filter bank decomposition to decompose the ECG signal and a 1D-CNN to extract and classify each sub-band of the decomposed ECG signal (88.6% accuracy, 83.8% sensitivity, and 91.5% specificity). Feng et al. [23] developed an SA detection model based on a frequential stacked sparse auto-encoder (FSSAE) and time-dependent cost-sensitive (TDCS) classification (85.1% accuracy, 86.2% sensitivity, and 84.4% specificity). Sheta et al. [24] proposed an automated computer-aided diagnosis system to diagnose apnea events based on ECG using deep-learning models (86.3% accuracy and 88.8% sensitivity). Chang et al. [25] developed a 1D deep CNN model using single-lead ECG recordings to construct a sleep apnea detection system (87.9% accuracy, 92% sensitivity, and 81.1% specificity). Singh et al. [30] proposed a CNN-based deep-learning approach using the time-frequency scalogram transformation of an ECG signal (86.22% accuracy, 90% sensitivity, and 83.8% specificity). Li et al. [53] proposed an SA detection system using a deep neural network and a Hidden Markov model (HMM) based on single-lead ECG (84.7% accuracy, 88.9% sensitivity, and 82.1% specificity). The ML framework development for SA classification is highly dependent on feature engineering. Lin et al. [20] developed an SA detection method using an ML framework and bag-of-features derived from an ECG spectrogram (91.4% accuracy, 89.8% sensitivity, and 92.4% specificity). Bozkurt et al. [27] aimed to develop new machine learning-based respiratory scoring in the SA diagnostic process with feature engineering of HRV and the ECG signal (85.1% accuracy, 85% sensitivity, and 86% specificity). Surrel et al. [34] used the R-R intervals and R-S amplitude series with apnea-scoring energy feature generation and an SVM (85.70% accuracy, 81.40% sensitivity, and 88.40% specificity). Hassan et al. [35] addressed automated sleep apnea detection using a data-adaptive signal decomposition, the tunable-Q factor wavelet transform (TQWT), for single-lead ECG and a random under-sampling boosting (RUSBoost) classifier (88.9% accuracy, 87.6% sensitivity, and 91.5% specificity). The proposed method achieved significantly better performance in accuracy, sensitivity, specificity, and AUC compared to all other considered methods.

Dealing with Imbalanced Dataset Problem
The imbalanced dataset problem exists when there is an uneven distribution of examples across the decision classes: one class of data (typically the diseased class) is smaller than the others. In the study of SA classification, most events are normal and only a minority of apnea events occur, since not only participants with apnea were included but also borderline and healthy control participants. This problem causes a condition where the trained classifier's performance on normal-event data, expressed as the specificity, is much higher than the sensitivity. The sensitivity metric represents the ability of the classifier to recognize patients (the apnea-event class), and a classifier that is accurate only for normal people is not valuable. To overcome this problem, certain synthetic data-balancing methods need to be applied to balance the dataset (see Table 13 for some existing research dealing with imbalanced data in SA classification). The main objective of balancing classes is either to increase the frequency of the minority class or to decrease the frequency of the majority class through re-sampling techniques. Random over-sampling (ROS) aims to increase the number of instances in the minority class by randomly replicating them to give the minority class a higher representation in the sample. Random under-sampling (RUS) balances the class distribution by randomly eliminating majority-class examples. Table 14 presents the mixed-group SA classification performance comparison using AEPD for several balancing methods, including ROS, RUS, and SMOTE (the proposed method).
As shown in Table 14, all methods improved the SA classification performance, especially the sensitivity, which became much closer to the specificity. SMOTE and ROS achieved superior performance, and RUS had slightly lower performance for all IMF features. Even though the best classification performance was also achieved by ROS, SMOTE is still favored: it can mitigate the overfitting caused by ROS, as synthetic examples are generated rather than instances being replicated.

Intrinsic Mode Decomposition Feature Phenomenon of Apnea ECG Signal
Many studies have described the ECG signal to be highly correlated with respiratory variation [38,56,58,59]. During apnea onset, some key features can particularly be observed on the ECG signal, such as a decline in heart rate and an increase in heart rate followed by the end of the apnea episode. By decomposing the ECG signal using EMD, these key features are enhanced in certain IMFs. EMD deconstructs a signal into its key components (IMFs), which may then be used in evidence-specific dynamics to provide a more complete description of the signal [41,60]. EMD directly extracts energy connected with distinct intrinsic time scales.
The key-feature phenomena of the apnea IMF ECG signal are clearly shown in Figure 14. This observation was performed only on three IMFs of the preprocessed apnea ECG (IMF1, IMF2, and IMF3), since these IMFs were also the selected feature inputs of the proposed method. In the normal event, all IMFs showed almost stable magnitude and frequency (relatively high); this also indicates that the heart rate does not change. Meanwhile, in the apnea event, the key features of heart rate were obviously emphasized. The apnea-onset frequency of the preprocessed ECG and IMF1 was unstable (tending to be higher at the end of the episode) and lower than at the apnea offset. Furthermore, in IMF2 and IMF3, there were some explicit magnitude and frequency fluctuations.
The key-feature differences between normal and apnea IMF ECG are emphasized and can also be observed from the CWT time-frequency spectrogram pattern visualization, as shown in Figures 4 and 5. In the normal IMF1 ECG spectrogram, the dominant frequency range (yellow region) was approximately 2-3 Hz, whereas in the apnea IMF1 ECG spectrogram, the dominant frequency range was around 2-4 Hz with unstable frequency power from 3 to 4 Hz. Additionally, the dominant frequency range of the normal IMF2 ECG spectrogram was 1-2 Hz, and that of the apnea IMF2 ECG spectrogram was 1-3 Hz with lower frequency power in 2-3 Hz (light blue region). These two spectrogram patterns indicate that the apnea episode had unstable frequency components compared to the normal episode.

Study Limitations and Future Developments
Although the proposed algorithm performed exceptionally well, this study has several limitations. First, the proposed method was validated using a small sample size from AEPD. Second, insufficient physiological data and subject disease history were available for further analysis because AEPD does not provide such information. More data collection from the NCKUH Sleep Center, as well as the combination of multiple databases, may be a solution to these drawbacks. Third, the proposed algorithm cannot be applied to patients with cardiovascular disease (CVD) complications because their ECG signal is somewhat irregular and not affected only by SA.
This study's future developments include using explainable AI (XAI) to automatically identify physiological meanings for OSA features, testing the proposed method's robustness in a large-scale group of participants from several databases, and developing an algorithm to discriminate SA and CVD using ECG data.

Conclusions
This paper proposed a novel algorithm to classify SA and normal breathing episodes using the signal decomposition method. Four different EMD features were considered along with 1D and 2D deep learning classifiers. When the ECG signal of SA and normal breathing episodes was decomposed, high accuracy was obtained. In addition, the algorithm outperformed the current best-practiced approaches in terms of accuracy.

By decomposing the ECG signal using EMD, the key features of normal and apnea episodes are enhanced in certain IMFs. Furthermore, by transforming the IMFs into time-frequency spectrograms using CWT, visualization and recognition of the key-feature patterns made it possible to successfully differentiate the normal and apnea episodes.