Assessment of Airflow and Oximetry Signals to Detect Pediatric Sleep Apnea-Hypopnea Syndrome Using AdaBoost

The reference standard to diagnose pediatric Obstructive Sleep Apnea (OSA) syndrome is an overnight polysomnographic evaluation. When polysomnography is unavailable or has limited availability, OSA screening may rely on the automatic analysis of a minimum number of signals. The primary objective of this study was to evaluate the complementarity of airflow (AF) and oximetry (SpO2) signals to automatically detect pediatric OSA. A secondary goal was to assess the utility of a multiclass AdaBoost classifier to predict OSA severity in children. We extracted the same features from the AF and SpO2 signals of 974 pediatric subjects. We also obtained the 3% Oxygen Desaturation Index (ODI) as a common clinically used variable. Then, feature selection was conducted using the Fast Correlation-Based Filter method and AdaBoost classifiers were evaluated. Models combining ODI 3% and AF features outperformed each signal alone, reaching a Cohen's kappa of 0.39 in the four-class classification task. OSA vs. No OSA accuracies reached 81.28%, 82.05% and 90.26% at the apnea–hypopnea index cutoffs of 1, 5 and 10 events/h, respectively. The most relevant information from SpO2 was redundant with ODI 3%, and AF was complementary to them. Thus, the joint analysis of AF and SpO2 using AdaBoost improved on the diagnostic performance of each signal alone, thereby enabling a potential screening alternative for OSA in children.


Introduction
Childhood Obstructive Sleep Apnea (OSA) syndrome is a sleep disorder in which airflow is intermittently interrupted or decreased during sleep, mainly due to the obstruction of the upper airway [1,2]. Events of absence (apnea) or reduction (hypopnea) in air exchange caused by these obstructions reduce the oxygenation of blood and disturb the normal progression of sleep stages, which leads to restless sleep, daytime sleepiness and behavioral problems [1,2]. Untreated pediatric

The main hypothesis of this study is that an adequate combination of the information from AF and SpO2 signals can yield higher diagnostic performance than each of these signals separately. Therefore, our primary goal is to compare the information of AF and SpO2 signals to detect OSA in children and evaluate their complementarity. Additionally, the secondary objective is to evaluate the diagnostic ability of AdaBoost classifiers using features from each signal separately and combined.

Database
The database used in this study comprised 974 pediatric subjects referred to the Comer Children's Hospital, University of Chicago Medicine (Chicago, IL, USA), with clinical suspicion of OSA. This study was conducted according to the Declaration of Helsinki. The legal caretakers of each subject provided informed consent and the Ethics Committee of the University of Chicago Medicine approved the study protocol (#11-0268-AM017, #09-115-B-AM031, and #IRB14-1241). In-laboratory sleep studies were performed with a digital polysomnography device (Nihon Kohden America Inc., Irvine, CA, USA). Subjects were evaluated according to the rules defined by the American Academy of Sleep Medicine (AASM) [14], including the computation of the AHI. The subjects of the database were classified according to OSA severity into four groups: No OSA (AHI < 1 e/h), Mild OSA (1 ≤ AHI < 5 e/h), Moderate OSA (5 ≤ AHI < 10 e/h) and Severe OSA (AHI ≥ 10 e/h). These severity groups were chosen in accordance with previous studies [8,9,26]. Table 1 summarizes the sociodemographic data (age, number of males and females) and the clinical data (normalized body mass index (BMI z-score), AHI, number of patients with OSA) of the subjects involved in this study. They were randomly split into a training set (60%) and a test set (40%). No significant differences were found in age, sex, BMI z-score and AHI between the two sets (p > 0.01, Mann-Whitney U test). The training set was used to fix the optimum values of the method parameters using a bootstrap approach and to train the classifiers. The test set was used to evaluate the diagnostic performance of our algorithm.

Table 1. Sociodemographic and clinical data of the subjects involved in the study. Subject distributions are given as N (%). Age, BMI z-score and AHI are given as median (interquartile range). BMI z-score: normalized body mass index; AHI: apnea-hypopnea index; OSA: Obstructive Sleep Apnea.
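The four severity groups defined in this section can be written as a simple threshold rule on the AHI. The following is a minimal sketch; the function name `osa_severity` is hypothetical and only the cutoffs (1, 5 and 10 e/h) come from the text.

```python
def osa_severity(ahi: float) -> str:
    """Map an apnea-hypopnea index (events/h) to one of the four severity groups."""
    if ahi < 0:
        raise ValueError("AHI must be non-negative")
    if ahi < 1:
        return "No OSA"        # AHI < 1 e/h
    if ahi < 5:
        return "Mild OSA"      # 1 <= AHI < 5 e/h
    if ahi < 10:
        return "Moderate OSA"  # 5 <= AHI < 10 e/h
    return "Severe OSA"        # AHI >= 10 e/h
```

The same cutoffs are reused later as binary decision thresholds (OSA vs. No OSA at 1, 5 and 10 e/h).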

Methods
In this study, AF and SpO2 signals extracted from 974 PSG recordings were analyzed. AF signals were sampled at fs = 100 Hz, while SpO2 signals were obtained from the pulse oximeter at fs = 25 Hz, as recommended by the AASM [14]. Figure 1 shows the workflow of the proposed methodology. After preprocessing, features were extracted using time and frequency-based analyses. This study was intended to assess the complementarity of the features extracted from AF and SpO2 signals, and therefore different feature sets were assessed: AF-derived features, SpO2-derived features and both AF and SpO2 features. We also split the experiments into two scenarios: with and without ODI 3%. Six settings were thus investigated, namely: 'AF', 'SpO2', 'AF + SpO2', 'AF + ODI', 'SpO2 + ODI' and 'AF + SpO2 + ODI'. Feature selection was conducted on these feature sets independently to establish optimum subsets of features before the classification stage. Finally, the selected features were used to train and evaluate six independent AdaBoost classifiers.

Preprocessing
AF and SpO2 signals were preprocessed in order to remove artifacts and signal loss intervals, as well as to normalize the amplitude values. In the case of SpO2 signals, samples with values lower than 50% of saturation and intervals with abrupt changes of oxygen saturation greater than 4% per second were removed [26,29]. AF signals were filtered using a low-pass filter (cutoff frequency of 1.5 Hz) and subsequently normalized [31,32]. Artifacts in the AF signal were removed using a method based on the standard deviation and the kurtosis of 30 s segments, as in previous studies [32].

Feature Extraction: Time and Frequency Domain Analyses
The feature extraction stage comprised the characterization of AF and SpO2 signals using automatic signal processing algorithms. In this study, analyses were performed in the time and frequency domains. The extracted features summarize the information about alterations of the signal properties and the recurrence of apneic events, and have been widely assessed in previous studies dealing with automatic detection of adult and pediatric OSA [19,20,26,31].
CTM is a measure of the variability of a signal [47]. It is based on plots of first-order differences: given a signal x(n) of length N, the values x(n + 2) − x(n + 1) are plotted against x(n + 1) − x(n) in a scatter plot. CTM is the proportion of points that lie inside a circle of fixed radius r [20,47]:

$$\mathrm{CTM} = \frac{1}{N-2}\sum_{n=1}^{N-2}\delta(n), \qquad \delta(n) = \begin{cases} 1, & \sqrt{[x(n+2)-x(n+1)]^{2} + [x(n+1)-x(n)]^{2}} < r \\ 0, & \text{otherwise} \end{cases}$$

The computation of CTM relies on the parameter r. The values of r were independently set for AF and SpO2 signals by maximizing the absolute value of Spearman's correlation coefficient (ρ) between CTM and the AHI in the training set [26,31].
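As an illustration of the CTM definition above, the following minimal sketch counts the difference-plot points that fall inside the circle of radius r. The function name `ctm` and the strict inequality at the circle boundary are assumptions made for this example, not details confirmed by the study.

```python
import math

def ctm(x, r):
    """Central tendency measure: fraction of points of the first-order
    difference plot that lie inside a circle of radius r centred at the origin."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 samples")
    inside = 0
    for i in range(n - 2):
        d1 = x[i + 1] - x[i]       # horizontal coordinate of the scatter plot
        d2 = x[i + 2] - x[i + 1]   # vertical coordinate of the scatter plot
        if math.hypot(d1, d2) < r:  # assumption: boundary points are excluded
            inside += 1
    return inside / (n - 2)
```

A constant signal gives CTM = 1 for any r > 0, while a highly variable signal pushes points outside the circle and lowers CTM. In the study, r would then be swept over a grid and the value maximizing |ρ| with the AHI retained.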
LZC is a nonparametric measure of the complexity of a time series. The LZC of a sequence increases as more distinct subsequences are contained in it. To analyze the subsequences, the signal is converted to a binary sequence by applying a threshold, usually the median value of the samples [20,26,48]. Then, the binary data are scanned and a counter c(n) is increased each time a new subsequence is found. LZC is the coefficient [20,48]:

$$\mathrm{LZC} = \frac{c(n)}{b(n)} \tag{2}$$

In (2), the normalizing factor b(n) is equal to the theoretical upper bound of c(n) [48].
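The LZC procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the parsing follows the Lempel-Ziv (1976) scheme, samples strictly above the median are coded as 1, and b(n) is taken as n / log2(n), the usual upper bound for binary sequences. The function names are hypothetical.

```python
import math
from statistics import median

def lz76_count(s):
    """Number of words c(n) in the Lempel-Ziv (1976) parsing of the string s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate word while it can be copied from earlier text
        # (the copy source may overlap the word, hence the i + length - 1 limit).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def lzc(x):
    """Normalized Lempel-Ziv complexity of a signal binarized at its median."""
    if len(x) < 2:
        raise ValueError("need at least 2 samples")
    threshold = median(x)
    # Assumption: samples strictly above the median are coded as '1'.
    bits = "".join("1" if v > threshold else "0" for v in x)
    n = len(bits)
    b_n = n / math.log2(n)  # assumed upper bound for binary sequences
    return lz76_count(bits) / b_n
```

For the classic test string `0001101001000101` the parsing yields six words, and a strictly alternating signal produces a low normalized complexity because almost the whole sequence can be copied from its own past.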
SampEn is a statistic used to measure irregularity in biomedical signals [49]. It has been widely employed to characterize fluctuations of AF and SpO2 signals [18,20,27,49]. Given a signal of length N, SampEn is defined as the negative logarithm of the conditional probability that two sequences that are similar for m samples (distance lower than r) remain similar when the length of the sequence increases by one sample (length m + 1) [20,49]:

$$\mathrm{SampEn}(m, r, N) = -\ln\left[\frac{A^{m}(r)}{B^{m}(r)}\right]$$

where A^m(r) and B^m(r) are the average number of similar sequences of length m + 1 and m, respectively. For each signal, we fixed the parameters m and r to the values that maximized the absolute value of Spearman's ρ between SampEn and the AHI in the training set [26,31]. The trials were set with m in the range 1-3 and r in the range 0.05-0.3 times the standard deviation of the signal [50].
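A direct, unoptimized sketch of SampEn as defined above is shown next. It assumes the Chebyshev (maximum) distance between templates, excludes self-matches, and expresses r as a fraction of the population standard deviation; the name `sampen` and the parameter `r_factor` are hypothetical. The quadratic cost makes it suitable for short illustrative signals only.

```python
import math
from statistics import pstdev

def sampen(x, m=2, r_factor=0.2):
    """Sample entropy with tolerance r = r_factor * standard deviation of x."""
    n = len(x)
    r = r_factor * pstdev(x)

    def count_matches(length):
        # Count template pairs (i < j) whose Chebyshev distance is below r.
        # The same n - m starting points are used for both template lengths.
        total = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if max(abs(x[i + k] - x[j + k]) for k in range(length)) < r:
                    total += 1
        return total

    b = count_matches(m)      # proportional to B^m(r)
    a = count_matches(m + 1)  # proportional to A^m(r)
    if a == 0 or b == 0:
        return float("inf")   # no matches found: entropy is unbounded
    return -math.log(a / b)
```

A perfectly periodic signal yields SampEn = 0, since every match of length m is still a match at length m + 1; more irregular signals give larger values.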

Spectral Analysis
Frequency domain features were obtained after estimating the Power Spectral Density (PSD) of the signals using the non-parametric Welch method [51]. The signals were segmented into epochs with 50% overlap using a Hamming window of $2^{14}$ and $2^{16}$ samples for SpO2 and AF, respectively. Window lengths were defined as the minimum power of two that encompasses a segment duration greater than 10 min, a tradeoff between spectral resolution and number of segments [51]. The PSD estimate was then obtained by averaging the PSDs of the segments [51].
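The window lengths quoted above follow directly from the sampling rates and the 10 min rule. As a small worked check (the function name `welch_window_length` is hypothetical):

```python
def welch_window_length(fs_hz, min_duration_s=600):
    """Smallest power of two (in samples) lasting longer than min_duration_s."""
    samples_needed = fs_hz * min_duration_s
    n = 1
    while n <= samples_needed:  # strictly greater than the minimum duration
        n *= 2
    return n

# SpO2 at 25 Hz: 15,000 samples needed -> 2**14 = 16,384 samples (about 10.9 min)
# AF at 100 Hz:  60,000 samples needed -> 2**16 = 65,536 samples (about 10.9 min)
```

Both signals therefore end up with segments of the same duration (about 655 s) despite their different sampling rates.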
Once the PSDs of the signals were estimated, we defined the spectral band of interest (BOI) of the AF signal as the band where the amplitude of the PSD of AF differs among severity groups. In the case of the SpO2 signal, we employed the spectral BOI of 0.020-0.044 Hz defined in previous studies [26]. Following the same methodology, we sought a spectral BOI for the AF signal. The Mann-Whitney U test was used to compare the values of the PSD and find the frequency ranges showing the highest statistically significant differences among OSA severity groups [18,26,33]. Figure 2a shows the average PSDs of No OSA, Mild, Moderate and Severe OSA subjects in the training set. The p-values obtained for each frequency are shown in Figure 2b. Only frequencies between 0 and 0.36 Hz are displayed to allow a proper visualization of the BOI. As four severity groups were involved in the comparison, a total of six pairwise comparisons were conducted. A spectral BOI was found at 0.134-0.176 Hz, where the maximum number of comparisons showed statistically significant differences (p < 0.05/6, Bonferroni correction).
Seven features were obtained from the PSD values in the spectral BOI [26,30]: first to fourth statistical moments (M1F-M4F), median (MedF), maximum (MaxF) and minimum (MinF). Additionally, the full spectrum of the signals was characterized by four features: the median frequency (FreqM), the spectral entropy (SpecEn) and the quadratic and cubic spectral entropies (SpecEn2 and SpecEn3, respectively) [20,31]. FreqM is defined as the frequency f below which 50% of the total signal power is contained.
SpecEn_i is defined as the Shannon entropy of the frequency distribution provided by the normalized i-th power of the PSD (PSD_n^i) [31]:

$$\mathrm{SpecEn}_i = -\sum_{j=1}^{N} \mathrm{PSD}_n^{\,i}(f_j)\,\ln\left[\mathrm{PSD}_n^{\,i}(f_j)\right], \qquad i = 1, 2, 3$$

where N is the number of samples of PSD_n in the range 0-fs/2. SpecEn indirectly estimates the irregularity of a signal, since higher SpecEn values are expected from flatter PSDs with no dominant frequencies [31].
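The full-spectrum features can be illustrated with a short sketch that takes an already estimated one-sided PSD as input. The function names are hypothetical, the entropy is left unnormalized (the text does not state whether it is divided by ln N), and FreqM is taken as the first frequency bin at which the cumulative power reaches half of the total.

```python
import math

def spectral_entropy(psd, power=1):
    """Shannon entropy of the normalized `power`-th power of the PSD
    (power=1, 2, 3 for SpecEn, SpecEn2 and SpecEn3)."""
    raised = [v ** power for v in psd]
    total = sum(raised)
    normalized = [v / total for v in raised]  # sums to 1, like a probability mass
    return -sum(p * math.log(p) for p in normalized if p > 0)

def median_frequency(freqs, psd):
    """First frequency at which the cumulative power reaches 50% of the total."""
    half = 0.5 * sum(psd)
    accumulated = 0.0
    for f, v in zip(freqs, psd):
        accumulated += v
        if accumulated >= half:
            return f
    return freqs[-1]
```

A flat PSD with N bins gives the maximum entropy ln N, whereas raising the PSD to the second or third power emphasizes dominant peaks and lowers the entropy.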

Oxygen Desaturation Index
Finally, ODI 3% was computed from the SpO2 signal as the number of desaturations greater than or equal to 3% from the baseline per hour of recording. This oximetric index has been found useful in previous approaches focused on the detection of childhood OSA [25-27,34].

Feature Selection: Fast Correlation-Based Filter
In the feature extraction stage, 19 signal processing-derived features were extracted for each signal: five time-domain statistics, three nonlinear measures, seven statistics from the BOI and four full-spectrum measures. ODI 3% was added to these 38 features, so the total number of features was 39. Feature selection was implemented in this study to identify relevant and complementary features of AF and SpO2 signals and to derive simpler models with reduced chances of overfitting [52]. We employed the Fast Correlation-Based Filter (FCBF) method [53] prior to the classification stage. This classifier-independent method identifies the most relevant features and removes redundant ones to obtain the optimum subset of features [53]. FCBF is based on the symmetrical uncertainty (SU) between features X_i and X_j, which is defined as [53]:

$$SU(X_i|X_j) = 2\left[\frac{H(X_i) - H(X_i|X_j)}{H(X_i) + H(X_j)}\right]$$

where H(X_i) is the Shannon entropy of X_i, and H(X_i|X_j) is the Shannon entropy of the feature X_i when X_j is observed. The relevance of X_i is defined as SU(X_i|Y), Y being the AHI, and the redundancy between two features is defined as SU(X_i|X_j). A feature X_i is removed when a more relevant feature X_j, i.e., one with SU(X_j|Y) ≥ SU(X_i|Y), satisfies [53]:

$$SU(X_i|X_j) \geq SU(X_i|Y)$$

FCBF was combined with bootstrapping to reduce dependency on the training data and improve generalization [44,52]. We obtained 1000 bootstrap replicates from the training data and the FCBF algorithm was applied to each one [26,29,44]. Features selected at least 500 times formed the optimum subset of features [26,29,52].
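The relevance and redundancy analysis of FCBF can be sketched for already discretized variables as follows. This is an illustrative simplification rather than the implementation used in the study: continuous features and the AHI would first need to be discretized (a step not described here), the function names are hypothetical, and the relevance threshold of zero is an assumption.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU = 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)], bounded between 0 and 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    joint = entropy(list(zip(x, y)))
    information_gain = hx + hy - joint  # equals H(X) - H(X|Y)
    return 2.0 * information_gain / (hx + hy)

def fcbf(features, target, threshold=0.0):
    """features: dict name -> discrete values. Returns the selected names."""
    relevance = {k: symmetrical_uncertainty(v, target) for k, v in features.items()}
    ranked = sorted((k for k in features if relevance[k] > threshold),
                    key=lambda k: -relevance[k])
    selected = []
    for name in ranked:
        # Keep the feature only if no already selected (more relevant) feature
        # shares at least as much information with it as the target does.
        if all(symmetrical_uncertainty(features[name], features[s]) < relevance[name]
               for s in selected):
            selected.append(name)
    return selected
```

In the bootstrap scheme described above, `fcbf` would be run on each of the 1000 replicates and a feature retained if it is selected in at least 500 of them.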

Classification: Multiclass AdaBoost
The classification stage was aimed at predicting the severity of OSA using the features selected in the previous stage. As we described in Section 2, subjects were classified in four groups according to the severity of OSA. We employed the multiclass AdaBoost classifier, an ensemble learning method based on boosting [44,46]. The main idea behind ensembles is to combine several classifiers to build a robust one with an increased generalization ability [44]. Base classifiers used to construct ensembles are usually weak and simple decision rules, such as decision trees or LDA classifiers [46]. The crucial rule of ensembles is diversity, that is, weak classifiers need to be trained with different representations of the training set [46]. This way, each weak classifier becomes an expert in a certain area of the feature space and the ensemble makes its predictions based on a committee of diverse and complementary classifiers [46]. Boosting methods are ensembles characterized by sequential training of base classifiers. At each iteration, a new base classifier is trained giving higher weights to instances in which the previous base classifier failed to make a prediction. After a sufficient number of base classifiers are trained, final predictions are obtained by weighted vote of base classifiers [46].
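AdaBoost is the most widespread boosting method [46]. In this study, we employed the AdaBoost.M2 algorithm, which allows multiclass classification [54]. We used LDA as the base classifier since it has proven useful in previous OSA-related studies [19,22]. The training data comprised N feature vectors, x_i, with labels, y_i (i = 1, . . . , N). The number of iterations, L, was experimentally tuned. For each iteration t (t = 1, . . . , L), a single base classifier is trained using a weighted version of the training data, where a weight w_{i,y}^t is kept for each instance i and each incorrect label y ≠ y_i. First, the distribution D_t(i) is calculated as [54]:

$$D_t(i) = \frac{W_i^t}{\sum_{i=1}^{N} W_i^t}$$

where

$$q_t(i, y) = \frac{w_{i,y}^t}{W_i^t}, \quad y \neq y_i$$

with W_i:

$$W_i^t = \sum_{y \neq y_i} w_{i,y}^t$$

The base classifier is then trained with the distribution D_t(i). The trained base classifier generates a weak prediction h_t(x, y) and the pseudo-loss ε_t is calculated [54]:

$$\varepsilon_t = \frac{1}{2}\sum_{i=1}^{N} D_t(i)\left(1 - h_t(x_i, y_i) + \sum_{y \neq y_i} q_t(i, y)\, h_t(x_i, y)\right)$$

Then, the weight update coefficient of base classifier t, β_t, is obtained [54]:

$$\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$$

Additional regularization was added using a modified β_t that depends on a learning rate parameter ν, with ν in the range 0-1. Then, the weights of the instances x_i for the next iteration t + 1 are computed as [54]:

$$w_{i,y}^{t+1} = w_{i,y}^{t}\, \beta_t^{\frac{1}{2}\left(1 + h_t(x_i, y_i) - h_t(x_i, y)\right)}$$

The final prediction of AdaBoost, H(x), is obtained by means of a weighted vote [54]:

$$H(x) = \arg\max_{y} \sum_{t=1}^{L} \left(\ln\frac{1}{\beta_t}\right) h_t(x, y)$$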
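To make the flow of one boosting iteration concrete, the sketch below computes the distribution, pseudo-loss, β and updated weights of AdaBoost.M2 as published by Freund and Schapire [54], given the outputs of an already trained base classifier. It is an illustration only: the function name and data layout are hypothetical, the base classifier itself (LDA in the study) is not trained here, and the learning-rate modification of β is omitted because its exact form is not given in the text.

```python
def adaboost_m2_round(w, h, y):
    """One AdaBoost.M2 iteration.

    w[i][c]: mislabel weight of instance i for class c (entry for c == y[i] unused)
    h[i][c]: base-classifier plausibility in [0, 1] that instance i is of class c
    y[i]:    index of the true class of instance i
    Returns (D, pseudo_loss, beta, new_w).
    """
    n, k = len(w), len(w[0])
    # Total mislabel weight per instance and resulting distribution D_t(i).
    W = [sum(w[i][c] for c in range(k) if c != y[i]) for i in range(n)]
    total = sum(W)
    D = [Wi / total for Wi in W]
    # Pseudo-loss: penalizes low plausibility on the true label and
    # high plausibility on incorrect labels (weighted by q_t(i, c)).
    eps = 0.0
    for i in range(n):
        wrong = sum((w[i][c] / W[i]) * h[i][c] for c in range(k) if c != y[i])
        eps += 0.5 * D[i] * (1.0 - h[i][y[i]] + wrong)
    beta = eps / (1.0 - eps)
    # Weight update: well-separated (instance, label) pairs are down-weighted.
    new_w = [[0.0 if c == y[i]
              else w[i][c] * beta ** (0.5 * (1.0 + h[i][y[i]] - h[i][c]))
              for c in range(k)] for i in range(n)]
    return D, eps, beta, new_w
```

In a toy case with three instances of which one is misclassified, the pseudo-loss is 1/3 and β = 0.5, so the two correctly classified instances have their weights halved while the misclassified one keeps its weight and dominates the next iteration.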

Model Optimization and Training
Our database was split into a training set and a test set. The training set was used to derive the optimal number of iterations (L) and the learning rate (ν) of the AdaBoost algorithm, while the test set was intended to evaluate the models on new data. The hyperparameters L and ν determine the number of base classifiers to be trained and the calculation of the weight update coefficient β_t, respectively. We set trials to estimate the performance of AdaBoost in the training set using Cohen's kappa (κ) [55], with L in the range between two and 10,000 classifiers and ν in the range 0.1-1. Cohen's κ is less sensitive to class imbalance than the error rate [55]. We used the 0.632 bootstrap validation method to estimate κ with reduced chances of overfitting [44]. We obtained 1000 new bootstrap replicates from the training data and trained a model for each one. Repeated instances are frequent in a bootstrap replicate, whereas other instances are not selected [44]. Unselected instances formed a validation set used to evaluate the trained model. The estimate of κ using 0.632 bootstrap, κ_B(i) (i = 1, . . . , 1000), is [44]:

$$\kappa_B(i) = 0.632\,\kappa_{B\,\mathrm{Validation}}(i) + 0.368\,\kappa_{B\,\mathrm{Training}}(i)$$

where κ_BValidation(i) and κ_BTraining(i) are the values of κ obtained when the model is evaluated on the validation and training bootstrap datasets, respectively [44]. The final κ_B is the average of κ_B(i) over i [44]. AdaBoost classifiers were then trained using the overall training set with the optimum L and ν fixed.
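The validation scheme above can be summarized in a few helper functions. This is a minimal sketch under stated assumptions: the names are hypothetical, the model training step is left out, and the degenerate case in which κ is undefined is returned as NaN by choice.

```python
import random
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both labelings use a single class
    return (observed - expected) / (1.0 - expected)

def bootstrap_split(n, rng):
    """Draw n indices with replacement; unselected indices form the validation set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    validation = [i for i in range(n) if i not in chosen]
    return train, validation

def kappa_632(kappa_training, kappa_validation):
    """0.632 bootstrap estimate for a single replicate."""
    return 0.368 * kappa_training + 0.632 * kappa_validation
```

For each candidate pair (L, ν), one would draw 1000 replicates with `bootstrap_split`, train a model on each, compute both κ values, combine them with `kappa_632` and average the results. The weighting reflects that roughly 63.2% of the distinct instances appear in a bootstrap replicate.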

Statistical Analysis
A correlation analysis was conducted in the training set to evaluate the relationship between the extracted features and the AHI using Spearman's ρ. Statistically significant differences between severity groups were also examined in the training set using the Kruskal-Wallis test (p < 0.01/6, Bonferroni correction), since the features did not pass the Lilliefors normality test. Results obtained in the test set were summarized in a four-class confusion matrix. The agreement between the predicted severity and the gold standard was assessed using the four-class accuracy (Acc4) and κ. Diagnostic ability at the common AHI cutoffs was evaluated using Sensitivity (Se, percentage of diseased subjects correctly classified), Specificity (Sp, percentage of healthy subjects correctly classified), Accuracy (Acc, percentage of subjects correctly classified), Positive and Negative Predictive Value (PPV, NPV, percentage of subjects correctly classified as positives/negatives) and Positive and Negative Likelihood

Preprocessing. Parameters Optimization in the Training Set
Artifacts in both AF and SpO2 signals were removed in the preprocessing stage. The rates of rejected data (median [interquartile range]) were 5.65% [1.76%, 10.30%] and 5.36% [1.51%, 9.17%] of the total recording time for AF and SpO2 signals, respectively. The amount of discarded data was low compared with the length of the overnight recordings, and both signals were similarly affected by artifacts (ρ = 0.5394, Spearman's rank correlation). In addition, no substantial differences were found between the rates of rejected data of AF and SpO2 (p = 0.4175, Wilcoxon signed rank test). Figure 3 shows the absolute value of Spearman's ρ of CTM with the AHI for varying r in the training set. The maximum values of |ρ(r)| were reached using r = 0.0004 in AF and r = 0.025 in SpO2. Following the same criterion for SampEn, the optimum parameters were m = 2 and r = 0.05 for AF, and m = 3 and r = 0.05 for SpO2 (Table 2).
Table 3 summarizes the results of the correlation and statistical differences between severity groups in the training set. Several features extracted from both signals showed significant differences between severity groups. Two nonlinear features from AF as well as some time and frequency-domain measures from both signals showed no statistically significant differences. These features were generally associated with the lowest correlations obtained in this study. In general, SpO2 features obtained the highest correlations with the AHI, whereas correlations of several AF features were weaker but significant. Notably, both time and frequency domain analyses showed statistically significant correlations with the AHI. CTM obtained the highest correlation among AF features and the highest correlations among nonlinear features in both signals. Regarding the spectral analysis-derived features, correlations were also higher in the SpO2 signal. Nevertheless, SpecEn-derived features showed higher correlations when they were applied to the AF signal than to SpO2. Overall, ODI 3% achieved the highest correlation with the AHI.

Feature Selection in the Training Set
Figure 4 shows the histograms of the number of times each feature was selected using different groups of features in the training set: 'AF', 'SpO2', 'AF + SpO2', 'AF + ODI', 'SpO2 + ODI' and 'AF + SpO2 + ODI'. Results of feature selection without ODI 3% are shown in Figure 4a. The features selected from the 'AF' set (CTM_AF, SpecEn2_AF) and the 'SpO2' set (CTM_SpO2, M4F_SpO2) were selected again using the 'AF + SpO2' set. In this case, no redundant features were found when both signals were combined. Results with ODI 3% are shown in Figure 4b. In these three cases, ODI 3% was found to be the most relevant feature and made SpecEn2_AF and CTM_SpO2 redundant. Furthermore, CTM_AF and M4F_SpO2 were nonredundant.

Model Optimization in the Training Set
We trained an independent AdaBoost ensemble model for each of the six subsets of selected features using the training data. Hence, different optimum values of L and ν were obtained in each case to optimize the performance. Figure 5 shows the bootstrap estimate of κ in the training set for the corresponding trials. AdaBoost models trained with features from the AF signal did not yield higher κ as L increased, as shown in Figure 5a. In this case, a large value of L was not necessary to retrieve the most useful information from AF. The remaining experiments showed increasing κ as L became higher until the maximum was reached. In general, the optimum κ was reached by combining intermediate values of L and ν, except for the AF + SpO2 subset. This setting reached the maximum κ with a large L and the lowest ν (Figure 5c). Nevertheless, differences between the maximum κ for different values of ν were not high.

Tables 4 and 5 show the confusion matrices along with their respective κ and Acc4 values obtained on the test set. In addition, the classification results of ODI 3% in the test set are shown in Table 6. Regarding multiclass classification, both Acc4 and κ increased when features from both signals were combined. The highest performances were obtained when ODI 3% was also included. The highest overall Acc4 and κ were achieved using the AF + SpO2 + ODI subset, although the same Acc4 but a slightly lower κ were reached using AF + ODI. It is important to note that AdaBoost models were more accurate than ODI 3%, except for the AF model. Table 7 shows the diagnostic performance in the test set for each setting in terms of their ability to predict the presence of OSA using the reference AHI cutoffs.
Despite the lower κ in the four-class classification task, the AF + ODI subset reached the maximum Acc at all AHI cutoffs: Acc = 81.28% (Se = 92.06%; Sp = 36.00%), Acc = 82.05% (Se = 76.03%; Sp = 85.66%), and Acc = 90.26% (Se = 62.65%; Sp = 97.72%) at 1, 5 and 10 e/h, respectively. These results were the same for the AF + SpO2 + ODI subset at 5 e/h and 10 e/h, but it reached lower diagnostic performance at 1 e/h. Therefore, the AF + ODI subset showed the highest diagnostic ability at all AHI cutoffs, outperforming the SpO2 + ODI and AF + SpO2 + ODI subsets at 1 e/h. Nevertheless, the SpO2 + ODI model also reached high diagnostic performance.

Table 7. Diagnostic performances of AdaBoost models and ODI 3% in the test set at the apnea-hypopnea index cutoffs of 1, 5 and 10 events/hour (e/h).

Discussion
This study aimed to assess AF and SpO2 signals in the context of pediatric OSA and to evaluate whether these signals can provide complementary information to predict OSA severity in children. Furthermore, the diagnostic ability of multiclass AdaBoost classifiers was evaluated using six different combinations of features extracted from AF and SpO2 signals. Feature selection revealed that the relevant features from each signal remained non-redundant when both signals were combined, thus suggesting their complementarity. Moreover, the diagnostic ability increased when both signals were combined. Two novel contributions have been introduced in this paper. First, we have compared the diagnostic ability of the automatic signal processing of AF and SpO2 signals in the context of pediatric OSA. Second, we have designed and validated multiclass AdaBoost classifiers to predict the severity of OSA in children. To the best of our knowledge, this is the first time that AF and SpO2 signals have been jointly evaluated in the context of pediatric OSA detection.

Feature Extraction and Selection
We characterized AF and SpO2 signals using time-domain statistics, nonlinear measures and spectral analysis. We defined a BOI in the AF signal between 0.134 and 0.176 Hz. Previous studies also focused on the analysis of specific BOIs in the context of pediatric OSA. Gutiérrez-Tobal et al. found two spectral BOIs (0.119-0.192 Hz and 0.784-0.890 Hz) using an AHI cutoff of 3 e/h [30]. In our study, however, 1, 5 and 10 e/h cutoffs were used. Our BOI is consistent with the first BOI defined in that work [30] and may be related to the presence of apneic events. Intermittent disruptions of at least two cycles in the normal respiratory flow define apneas and hypopneas and can increase the power in frequencies around and below one half of the normal respiratory frequency. Both BOIs are centered at approximately 0.155 Hz, which is about half of the central frequency of the normal respiratory band in children [30]. In contrast, no significant differences between severity groups were found in the PSDs in higher frequencies. This might be due to the use of three different AHI cutoffs and the analysis of a larger cohort.
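The relationship between respiratory disruptions and power inside the BOI can be illustrated with a synthetic example. The sketch below is not the PSD estimation procedure of this study: it uses a plain periodogram computed with a direct DFT, and the sampling rate (2 Hz), the breathing frequency (0.31 Hz), the amplitude of the added 0.155 Hz component and the function names are all assumptions made for illustration.

```python
import cmath
import math
from typing import List, Tuple

BOI = (0.134, 0.176)  # band of interest in Hz, as defined in the text


def periodogram(x: List[float], fs: float) -> Tuple[List[float], List[float]]:
    """One-sided periodogram of a mean-removed signal via a direct DFT."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coef = sum(xc[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        power.append(abs(coef) ** 2 / n)
    return freqs, power


def relative_band_power(x: List[float], fs: float, band: Tuple[float, float]) -> float:
    """Fraction of total spectral power that falls inside the band."""
    freqs, power = periodogram(x, fs)
    in_band = sum(p for f, p in zip(freqs, power) if band[0] <= f <= band[1])
    return in_band / sum(power)


# Synthetic signals (assumed parameters, 200 s sampled at 2 Hz).
FS, N = 2.0, 400
regular = [math.sin(2 * math.pi * 0.31 * i / FS) for i in range(N)]
# Same breathing rhythm plus a slower component at half that frequency,
# standing in for the effect of repeated disruptions lasting two cycles.
disrupted = [
    math.sin(2 * math.pi * 0.31 * i / FS) + 0.8 * math.sin(2 * math.pi * 0.155 * i / FS)
    for i in range(N)
]

rel_regular = relative_band_power(regular, FS, BOI)
rel_disrupted = relative_band_power(disrupted, FS, BOI)
print(f"relative BOI power: regular={rel_regular:.3f}, disrupted={rel_disrupted:.3f}")
```

In this toy setting the regular signal has almost no power in the BOI, whereas the disrupted one concentrates a substantial share there. Real AF recordings would need windowing and averaging to obtain a stable PSD estimate.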
Previous studies focused on the spectral analysis of AF signals in the context of pediatric OSA found relevant features from their respective BOIs [30,33]. In our study, features from the BOI were discarded in the feature selection stage due to redundancy. CTM AF and SpecEn 2 AF were the most relevant and complementary AF features while time domain statistical moments were found redundant. Previous studies addressing irregularity and variability of AF signals in the context of pediatric OSA reported the positive association of CTM and SpecEn with OSA severity [31]. In this study, CTM AF and SpecEn 2 AF were found relevant and nonredundant among AF features, thus reinforcing previous findings.
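For readers unfamiliar with CTM, the measure is commonly defined on a second-order difference plot as the fraction of points that lie within a circle of a given radius around the origin, so that lower values indicate higher variability. The following is a minimal sketch of that common definition; the radius and the example series are hypothetical, and the radius actually used for the AF and SpO2 signals is not specified in this section.

```python
import math
from typing import Sequence


def ctm(x: Sequence[float], radius: float) -> float:
    """Central tendency measure of a series.

    Each point of the second-order difference plot is
    (x[i+1] - x[i], x[i+2] - x[i+1]); CTM is the fraction of those
    points whose distance to the origin is below `radius`.
    """
    n = len(x)
    if n < 3:
        raise ValueError("CTM needs at least three samples")
    inside = 0
    for i in range(n - 2):
        d_first = x[i + 1] - x[i]
        d_second = x[i + 2] - x[i + 1]
        if math.hypot(d_first, d_second) < radius:
            inside += 1
    return inside / (n - 2)


# Hypothetical series: a slowly changing one and a strongly oscillating one.
smooth = [0.1 * i for i in range(20)]
oscillating = [float(i % 2) for i in range(20)]

print(ctm(smooth, radius=0.5))       # all points fall inside the circle
print(ctm(oscillating, radius=0.5))  # no point falls inside the circle
```

The radius acts as a scale parameter: it has to be chosen relative to the amplitude of the signal, otherwise CTM saturates at 0 or 1 and loses discriminative power.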
The correlations with the AHI were higher in SpO2-derived features in comparison with AF-derived ones, suggesting that features from the SpO2 signal were more relevant. However, the majority of SpO2 features were removed due to redundancy with ODI 3%. Only M4F SpO2 was found non-redundant with CTM SpO2 and ODI 3%. These results confirm that the most useful information of SpO2 to detect OSA is summarized in ODI 3%. This finding is also supported by previous studies. Hornero et al. [26] assessed a similar set of features from SpO2 recordings, resulting in ODI 3% and M3F SpO2 being selected. In addition, Vaquerizo-Villar et al. [29] found one SpO2-derived feature from Detrended Fluctuation Analysis that was complementary to ODI 3%.
A novel contribution of this study is the joint assessment of AF and SpO2 signals using signal processing algorithms. It is remarkable that features from both signals were selected when AF and SpO2 features were combined. However, two different situations need to be analyzed. When feature selection was conducted on the AF + SpO2 set, selected features matched the features selected from AF and SpO2 sets separately. These features were thus non-redundant and may indicate complementarity between both signals. Conversely, SpecEn 2 AF and CTM SpO2 were found redundant in settings with ODI 3%. Overall results of feature selection suggest that the information of AF and SpO2 signals could be complementary. These findings are in accordance with previous studies combining AF-derived features with ODI 3%, which reported not only their complementarity, but also an increase in the diagnostic performance when used together [30,32]. Accordingly, the complementarity of the information from AF and SpO2 signals in the context of adult OSA [34] is also confirmed in this study using a pediatric population.

Diagnostic Ability and Comparison with Previous Studies
In this study, novel multiclass AdaBoost classifiers have been introduced to predict OSA severity in children. The highest four-class accuracies were reached using the AF + ODI and the AF + SpO2 + ODI subsets. These results, together with the low accuracies reached using AF, SpO2 and ODI 3% separately, suggest that AdaBoost was able to take advantage of the information of AF and SpO2 signals. Moreover, the most useful information of the SpO2 signal seems to be summarized in ODI 3%. It is necessary to note that AF + SpO2 + ODI reached the highest κ but AF + ODI obtained the same Acc4. This slight difference can be related to the calculation of κ, which gives more importance to class imbalance [47]. The AF + SpO2 + ODI setting was slightly more accurate than AF + ODI in classifying actual No OSA and Moderate OSA subjects, which were the least represented groups in our database. Thus, κ was higher in the AF + SpO2 + ODI subset. Both Acc4 and κ were slightly lower using SpO2 + ODI, showing that oximetry alone can also achieve high diagnostic ability by means of AdaBoost. Nevertheless, the number of Moderate and Severe OSA subjects misclassified as No OSA was lower using AF + ODI. Another difference between these settings was observed in the number of overestimated subjects (the predicted severity of OSA was higher than the actual severity), which was also higher using SpO2 + ODI. The rates of underestimated and overestimated subjects were the most balanced in the AF + SpO2 + ODI setting: 20.77% and 21.28%, respectively. Using ODI 3% only, 40.26% and 14.62% of the subjects were underestimated and overestimated, respectively. Previous studies reported that ODI 3% alone underestimates the severity of OSA [29,32]. In this study, this tendency was also observed. On the other hand, the MLP neural networks used in previous approaches tended to overestimate the severity of OSA [29,32,42]. Vaquerizo-Villar et al. reported 12.75% of underestimated subjects and 27.30% of overestimated patients [29], while in Xu et al. the rates of underestimated and overestimated severity were 15.05% and 31.25%, respectively [42]. In our study, this behavior was not observed, since AdaBoost achieved a more balanced ratio of underestimated and overestimated subjects.
These AdaBoost models were aimed at predicting OSA in a pediatric population. PSG data from boys and girls up to 13 years old were equally distributed in training and test sets. In general, no significant differences (p > 0.01) were found in age, sex or BMI z-score between patients correctly and incorrectly classified. We only found some differences in age and BMI z-score between correctly and incorrectly classified patients, which were limited to Mild OSA patients. Overall, diagnostic performances seem not to be biased towards any specific age, sex or BMI subgroup.
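The quantities compared in this discussion (κ and the rates of underestimated and overestimated subjects) all derive from the four-class confusion matrix. The sketch below shows the standard computation of Cohen's κ together with the two rates, assuming rows are actual classes and columns are predicted classes, both ordered by increasing severity. The matrix `TOY_CM` is hypothetical and does not reproduce any table of this study.

```python
from typing import List, Tuple

# Hypothetical confusion matrix (rows = actual, columns = predicted),
# classes ordered No OSA, Mild, Moderate, Severe. Illustration only.
TOY_CM = [
    [10, 8, 1, 1],
    [6, 30, 8, 2],
    [1, 7, 10, 4],
    [0, 2, 3, 7],
]


def cohen_kappa(cm: List[List[int]]) -> float:
    """Cohen's kappa: agreement beyond what is expected by chance."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(k)) / n
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


def under_over_rates(cm: List[List[int]]) -> Tuple[float, float]:
    """Fractions of subjects predicted below / above their actual class."""
    n = sum(sum(row) for row in cm)
    under = sum(c for a, row in enumerate(cm) for p, c in enumerate(row) if p < a)
    over = sum(c for a, row in enumerate(cm) for p, c in enumerate(row) if p > a)
    return under / n, over / n


kappa = cohen_kappa(TOY_CM)
under, over = under_over_rates(TOY_CM)
print(f"kappa={kappa:.3f} underestimated={under:.2%} overestimated={over:.2%}")
```

Since the chance-agreement term depends on the row and column totals, two models with identical four-class accuracy can obtain different κ values when their errors are distributed differently across the less represented classes.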
Regarding the results of binary classification, the top performing subset in 1 e/h was AF + ODI, reaching the highest Acc and NPV as well as the lowest LR-. These results suggest that AF + ODI is more suitable to discard the presence of OSA in 1 e/h since it was able to reduce false negatives. In comparison with AF + SpO2 + ODI, Se was higher and Sp slightly lower in AF + ODI. Nevertheless, differences were not high. AF + ODI and AF + SpO2 + ODI obtained the same diagnostic performance in 5 and 10 e/h. These settings obtained the highest Acc and the most balanced PPV and NPV in both cutoffs. Moreover, the value of LR+ in 10 e/h is remarkable since it indicates a very high likelihood of severe OSA when the AdaBoost model predicts a subject as Severe OSA. The differences between AF + ODI, SpO2 + ODI and AF + SpO2 + ODI were not high, which might suggest that the benefits of including AF are minor. Nevertheless, the diagnostic performance of models combining ODI 3% and AF reflects the complementarity of both signals. The contribution of AF reduced the number of false positives in 1 e/h using AF + ODI, which may compensate for the added complexity and inconvenience of recording AF in children. Previous studies have successfully evaluated the usefulness of simplified devices to detect pediatric OSA using AF and SpO2 [10,11]. To the best of our knowledge, this is the first study that jointly evaluates the diagnostic ability of AF and SpO2 signals in children using signal processing methods. It would be convenient to enhance the diagnostic ability of these signals using signal processing methods alternative to those used in this study [29,41]. Table 8 summarizes the results achieved in previous studies focused on OSA detection in children. Simple and widespread binary classifiers (e.g., LR, LDA) were assessed in smaller cohorts, while MLP neural networks were proposed in studies comprising a larger number of subjects and using holdout validation (i.e., training and test sets).
Most of the studies employed the 5 e/h AHI cutoff for binary classification, with Acc in the range 76.0-86.6%. Our proposal reached the highest Acc in 5 e/h among approaches using three AHI cutoffs. Moreover, it was close to the highest Acc among binary classifiers. In this study, Sp in 5 e/h was close to the highest ones in comparison with both binary and multiclass approaches, while Se was similar to those reached by MLP-based methods. Nevertheless, methods with higher Sp also exhibited lower Se. Fewer studies assessed their diagnostic ability in 1 e/h and 10 e/h cutoffs. Some of them developed independent binary LR models for each cutoff, while others relied on MLP neural networks. The former group reached more balanced Se-Sp pairs in both cutoffs, while the latter achieved higher Acc. The AdaBoost classifiers evaluated in this study also reached high Acc in all cutoffs. Other MLP-based approaches tended to overestimate OSA severity, resulting in low Sp in 1 e/h [29,32,42]. Our multiclass AdaBoost classifier achieved a higher Sp while maintaining high Se in 1 e/h using AF + ODI. On the other hand, Acc in 1 e/h was close to those reached using MLP networks. Therefore, a smaller proportion of symptomatic children without polysomnographically diagnosed OSA would be incorrectly diagnosed as suffering from OSA in comparison with other studies. Overall, the results of this study suggest that our ensemble learning-based approach succeeded in achieving high diagnostic ability. The performance of our AdaBoost-based approach strengthens the usefulness of ensemble learning as a valid alternative to other machine learning algorithms.
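The likelihood ratios discussed in this section follow directly from Se and Sp through the standard definitions LR+ = Se / (1 - Sp) and LR- = (1 - Se) / Sp. As a minimal worked sketch, the code below applies them to the Se and Sp pairs reported for the AF + ODI subset. Because the inputs are rounded percentages, the outputs may differ slightly from the values tabulated in Table 7; the function name is hypothetical.

```python
import math
from typing import Tuple


def likelihood_ratios(se: float, sp: float) -> Tuple[float, float]:
    """Positive and negative likelihood ratios from Se and Sp (as fractions)."""
    lr_pos = se / (1.0 - sp) if sp < 1.0 else math.inf
    lr_neg = (1.0 - se) / sp if sp > 0.0 else math.inf
    return lr_pos, lr_neg


# Se and Sp of the AF + ODI subset as reported in the text, keyed by AHI cutoff (e/h).
AF_ODI = {
    1: (0.9206, 0.3600),
    5: (0.7603, 0.8566),
    10: (0.6265, 0.9772),
}

for cutoff, (se, sp) in AF_ODI.items():
    lr_pos, lr_neg = likelihood_ratios(se, sp)
    print(f"{cutoff} e/h: LR+={lr_pos:.2f} LR-={lr_neg:.2f}")
```

With these inputs, LR+ in 10 e/h is roughly 27 and LR- in 1 e/h is roughly 0.22, which is in line with the interpretation above: a Severe OSA prediction strongly raises the odds of severe disease, while a No OSA prediction lowers the odds of any OSA to about one fifth of their pre-test value.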

Limitations and Future Work
In spite of the promising performance of our proposal, some limitations and future investigations have to be pointed out. The database employed in this study comprised 974 subjects. Although this cohort is large, all subjects were recruited in the same center. It would be desirable to expand our database by including new subjects from different sleep laboratories to further generalize our results. Secondly, we successfully evaluated AF and SpO2 signals separately and jointly in the context of childhood OSA detection. Future investigations may rely on the potential incorporation of other useful signals included in the PSG. In this sense, the AF signals employed in this study were recorded using a thermistor. Comparison between nasal pressure sensor and thermistor AF signals would also constitute a future goal. In addition, the AF-derived inter-breath interval series can be considered for future studies to enhance the diagnostic ability of AF. Signals have been characterized using widespread signal processing methods in the context of OSA. Future work may comprise alternative approaches like bispectrum and wavelets, as well as other nonlinear analyses. Finally, although AdaBoost classifiers yielded high diagnostic performance, other ensemble learning methodologies like bagging or stacking can also be assessed to compare their diagnostic performance using SpO2 and AF signals.

Conclusions
The results of this study showed the usefulness of the joint analysis of AF and SpO2 signals in the context of pediatric OSA. A remarkable diagnostic performance was achieved using a multiclass AdaBoost classifier fed with a combination of relevant and complementary information from both signals. The most accurate AdaBoost model successfully combined CTM AF with ODI 3%, which was found the most useful parameter of the SpO2 signal. This joint model outperformed the diagnostic ability of each of these signals separately. Furthermore, we derived an accurate and unbiased AdaBoost model able to decrease the underestimation of the OSA severity previously observed in ODI 3%. Our dual-channel approach is thus a potential alternative to single-channel methodologies, one that might be useful to deploy in the context of simplified screening methods aimed at detecting OSA in children.

Funding: This work was supported by the 'Ministerio de Ciencia, Innovación y Universidades' and 'European Regional Development Fund (FEDER)' under projects DPI2017-84280-R and RTC-2017-6516-1, by 'European Commission' and 'FEDER' under projects 'Análisis y correlación entre el genoma completo y la actividad cerebral para la ayuda en el diagnóstico de la enfermedad de Alzheimer' and 'Análisis y correlación entre la epigenética y la actividad cerebral para evaluar el riesgo de migraña crónica y episódica en mujeres' ('Cooperation Programme Interreg V-A Spain-Portugal POCTEP 2014-2020'), and by 'CIBER en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN)' through 'Instituto de Salud Carlos III' co-funded with FEDER funds. J.J.-G. was in receipt of an 'Ayudas para la contratación de personal técnico de apoyo a la investigación' grant from the 'Junta de Castilla y León' funded by the European Social Fund and Youth Employment Initiative. A.M.-M. was in receipt of an 'Ayudas para contratos predoctorales para la Formación de Doctores' grant from the Ministerio de Ciencia, Innovación y Universidades (PRE2018-085219). D.G. and L.K.-G. are supported by US National Institutes of Health grants HL130984 (L.K.-G.) and HL140548 (D.G.).