Sensors
  • Article
  • Open Access

10 October 2025

Assessing Obstructive Sleep Apnea Severity During Wakefulness via Tracheal Breathing Sound Analysis

and
1
Biomedical Engineering Program, University of Manitoba, Winnipeg, MB R3T 5V6, Canada
2
Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada
*
Author to whom correspondence should be addressed.
This article belongs to the Section Biomedical Sensors

Highlights

What are the main findings?
  • This study demonstrates a significant correlation between tracheal breathing sounds (TBS) recorded during wakefulness, anthropometric features, and the apnea–hypopnea index (AHI).
  • A machine learning model trained on these features can form the basis of classifications of OSA severity in standard clinics.
  • Categories (Non-, Mild-, Moderate-, and Severe-OSA) are formed without the need for sleep-based recordings.
What is the implication of the main finding?
  • The proposed method enables the rapid, low-cost, and accessible estimation of OSA severity using brief, wakefulness-based TBS and basic anthropometric data.
  • This approach can serve as a reliable screening and triage tool in clinical settings, helping reduce perioperative risks by informing earlier intervention and referral for full diagnosis.

Abstract

Obstructive sleep apnea (OSA) is a commonly underdiagnosed condition that not only increases the risk of accidents but also contributes significantly to a wide range of health complications, including heightened perioperative morbidity and mortality risks during surgeries under general anesthesia. Polysomnography (PSG), the diagnostic gold standard, is costly, time-consuming, dependent on skilled technicians, and not always accessible. This study presents a fast, objective, and non-invasive method for detecting OSA severity by analyzing tracheal breathing sounds (TBS) recorded during wakefulness in the supine position. Features were extracted for six binary (1-vs-1) comparisons among four severity groups (Non-OSA, Mild, Moderate, and Severe) and combined with anthropometric characteristics for classification. Data from 199 subjects (74 Non-OSA, 35 Mild, 50 Moderate, and 40 Severe) were analyzed: 169 subjects were used for training and 30 for blind testing, and the training dataset was shuffled 10 times to avoid bias during training. Multiple machine learning models were evaluated, and the best-performing model for each comparison was saved. Across the six experimental models comparing OSA severity levels, the most balanced performance was achieved by the Base Model of Non-OSA vs. Severe-OSA using the support vector machine algorithm, with 88.2% accuracy, 83.3% sensitivity, and 90.9% specificity. Although Random Forests in the Base Model of Non-OSA vs. Mild-OSA achieved 100% sensitivity, their accuracy was lower (81.2%). The results confirm the reliability and robustness of the proposed approach, providing a basis for OSA severity screening in under 10 min during wakefulness.

1. Introduction

Obstructive sleep apnea (OSA) is a common but underdiagnosed sleep-related breathing disorder, affecting nearly 20% of adults in Canada and the United States [,]. Alarmingly, up to 90% of cases remain undiagnosed, with affected individuals often unaware of their condition or left untreated []. The absence of diagnosis and treatment carries substantial healthcare and economic consequences; in the United States, the added direct and indirect costs of untreated OSA are estimated at USD 65–165 billion annually [,,].
OSA accounts for more than 75% of sleep apnea cases and is caused by recurrent collapse of the upper airway during sleep, leading to complete (apnea) or partial (hypopnea) airflow obstruction [,]. Events lasting longer than 10 s with an oxygen desaturation of at least 3% are classified as apneas or hypopneas [,]. Clinically, OSA presents with both nighttime symptoms (e.g., loud snoring, gasping, frequent awakenings) and daytime symptoms (e.g., fatigue, morning headaches, depression, excessive sleepiness) []. The severity of OSA is defined by the apnea–hypopnea index (AHI), with thresholds of 0–5 (Non-OSA), 5–15 (Mild), 15–30 (Moderate), and >30 (Severe) events per hour [,]. The diagnostic gold standard is overnight polysomnography (PSG), but PSG is costly, resource-intensive, and often associated with waiting times of 3–12 months []. Portable monitors offer a more accessible alternative, but they still require overnight use and physician confirmation [,].
Identifying OSA severity prior to surgery is particularly important for perioperative risk stratification, as undiagnosed OSA significantly increases the risk of adverse outcomes [,,]. Current alternatives to PSG often rely on screening questionnaires (e.g., STOP-Bang, Berlin) and anthropometric measures (e.g., age, BMI, gender), which are highly sensitive but limited by low specificity (~10%) [,]. Given the limitations of overnight PSG and questionnaire-based tools, there is a pressing need for objective, wakefulness-based methods that can directly assess OSA severity.
Our group and others have pioneered the use of tracheal breathing sounds (TBS) recorded during wakefulness to screen for OSA in a binary manner with high accuracy [,,,,,,,]. However, existing studies have not addressed OSA severity classification, despite its critical role in perioperative planning. The risk of complications varies significantly across severity levels, with severe OSA associated with increased rates of respiratory failure and cardiovascular events [,,]. Accurate severity detection could therefore guide anesthetic management, postoperative monitoring, and preoperative interventions, ultimately improving surgical safety.
In this study, we introduce a novel algorithm for multi-class OSA severity classification during wakefulness, using features extracted from TBS combined with anthropometric data. We further interpret the extracted features from both physiological and feature-importance perspectives, laying the groundwork for a non-invasive and practical screening framework.

2. Literature Review of Tracheal Breathing Sounds Analysis

As this research is based on tracheal breathing sound (TBS) analysis during wakefulness, it is important to review prior studies in this field. Spectral and bispectrum features of the TBS have been the focus of several studies to classify OSA and non-OSA groups [,,,,]. Early works applied power spectral density, kurtosis, and fractal dimension of tracheal sounds during wakefulness for OSA severity classification, achieving up to 91.7% accuracy in distinguishing severe OSA (AHI > 30) from non-OSA (AHI < 5) using LDA and QDA classifiers []. Combining anthropometric and TBS features with support vector machines (SVMs) yielded 83.9% accuracy in detecting OSA at an AHI ≥ 10 []. A subsequent ensemble framework based on subgroup-specific anthropometric models improved robustness, achieving 81.4% accuracy, 80.9% sensitivity, and 82.1% specificity for detecting OSA at the clinically relevant threshold of AHI > 15 []. More recently, combining spectral and bispectrum features with anthropometric data enabled prediction of PSG-derived parameters such as arousal index and mean SpO2 with 88.8% accuracy in blind testing [].
Machine learning has further advanced this field. Logistic regression with LASSO-based feature selection achieved 79.3% ± 6.1% accuracy in blind testing []. Comparative studies later showed that Random Forest (RF) outperformed regularized logistic regression in both sensitivity and specificity for OSA detection using TBS and anthropometric data, at thresholds of AHI < 5 (Non-OSA), 5 ≤ AHI < 15 (Mild OSA), and AHI ≥ 15 (Moderate-to-Severe OSA) []. Beyond spectral measures, formant features extracted from tracheal breathing sounds showed significant group differences, with a sensitivity of 88.9% and specificity of 84.6%, when combined with anthropometrics []. Speech-based approaches have also been explored, where a composite system analyzing breathing segments, vowels, and continuous speech achieved 77.1% accuracy for distinguishing OSA at an AHI threshold of 15, offering a complementary alternative to TBS [].
A recent review has summarized these methodologies, highlighting the strong diagnostic potential of TBS analysis during wakefulness as a cost-effective and accessible screening tool []. Advanced acoustic and anthropometric-aware machine learning methods show particular promise, but nearly all studies to date focus on binary classification at thresholds such as AHI ≥ 15 or AHI ≥ 10 versus ≤ 5. Multi-class severity classification (mild, moderate, severe) during wakefulness remains a significant challenge and, to the best of our knowledge, no study has yet addressed this gap. Our study advances severity classification by integrating image-based morphological features and SHAP-guided feature selection. Given the importance of OSA severity detection, especially for perioperative risk stratification, this study proposes a non-invasive, multi-class wakefulness-based framework that could support earlier diagnosis and reduce reliance on overnight sleep assessments.

3. Materials and Methods

The present study aims to classify OSA severity into three classes—Mild, Moderate, and Severe OSA—and include healthy subjects (non-OSA) by utilizing features from different domains and representations. The proposed technique is comprehensively detailed in the following subsections, and Figure 1 shows a block diagram of the proposed methodology.
Figure 1. Block diagram of the proposed methodology. The process includes (1) preprocessing of tracheal breathing sounds (segmentation, filtering, normalization), (2) extraction of spectral, temporal, nonlinear, and morphological features, (3) feature selection using t-test, SHAP ranking, and RFE, and (4) classifier training with bootstrap aggregation and OOB validation.

3.1. Tracheal Breathing Sounds Dataset

In this work, the dataset was adopted from our team’s previous works []. Data were collected from 199 subjects while they were awake and lying supine with a pillow. Each individual’s TBS were recorded using a Sony ECM-77B omnidirectional condenser microphone (Sony, Tokyo, Japan; sensitivity: −52 dB ± 3.5 dB, frequency response: 40 Hz–20 kHz) positioned at the suprasternal notch via a custom 2 mm plastic chamber []. This setup minimized ambient noise and ensured consistent skin-to-microphone coupling. A schematic of microphone placement is provided in Figure 2. Each subject completed five cycles of deep breathing through the nose with the mouth closed and another five breaths through the mouth while wearing a nasal clip.
Figure 2. Experimental setup for TBS recordings. A Sony ECM-77B condenser microphone was positioned at the suprasternal notch using a 2 mm custom plastic chamber to ensure consistent skin coupling and minimize ambient noise. Signals were sampled at 10,240 Hz using the Biopac DA100C (Biopac, Goleta, CA, USA), while participants were in a supine position during controlled wakefulness breathing maneuvers.
In this research, unlike our previous works, subjects were grouped by AHI into four categories: Non-OSA (n = 74, AHI < 5), Mild-OSA (n = 35, 5 ≤ AHI < 15), Moderate-OSA (n = 50, 15 ≤ AHI < 30), and Severe-OSA (n = 40, AHI ≥ 30). Table 1 presents the total number of subjects in each severity group, along with their corresponding anthropometric data, for the dataset used in this study.
Table 1. Participants’ severity groups and anthropometric information.
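The grouping rule above can be written as a small function. This is a minimal sketch, assuming the half-open intervals given in the text (AHI < 5, 5 ≤ AHI < 15, 15 ≤ AHI < 30, AHI ≥ 30); the function name `ahi_to_severity` is hypothetical and not taken from the study.

```python
def ahi_to_severity(ahi: float) -> str:
    """Map an apnea-hypopnea index (events per hour) to a severity label."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 5:
        return "Non-OSA"
    if ahi < 15:
        return "Mild-OSA"
    if ahi < 30:
        return "Moderate-OSA"
    return "Severe-OSA"


# Example AHI values, one per severity band.
for value in (1.2, 8.7, 21.5, 69.5):
    print(value, ahi_to_severity(value))
```

Boundary values fall into the higher class, so an AHI of exactly 15 is labelled Moderate-OSA.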

3.1.1. Splitting Dataset for Training and Testing

The data were split once into training and testing sets with ratios of 85% and 15%, respectively. These percentages were chosen to balance model training efficiency and evaluation reliability for the six distinct experiment models, as explained below. The training was repeated 10 times with a shuffled training dataset to avoid any bias during training. The 85% training portion provided enough samples to support the learning needs of each individual experiment model without leading to overfitting. Meanwhile, the 15% testing subset was carefully curated to maintain class balance and preserve the distributional characteristics of the original dataset across key variables, including sex, Mallampati score (MPS), apnea–hypopnea index (AHI), body mass index (BMI), age, and neck circumference (NC). For instance, the AHI averages in the testing dataset are close to those in the overall dataset for each OSA class (e.g., Non-OSA: 1.2 vs. 0.86; Mild: 8.7 vs. 6.7; Moderate: 21.5 vs. 20.7; Severe: 69.5 vs. 80.0), indicating that disease severity is well represented in the testing data.
Similarly, the distributions of BMI, age, and NC in the testing set were verified to fall within one standard deviation of the overall means, reflecting a non-biased sample selection. This stratified approach ensured that the trained model was evaluated on a representative, diverse, and clinically meaningful subset of patients, thereby enhancing the generalizability and robustness of the findings. Moreover, this data split allows for sufficient subgroup representation, even within smaller classes (e.g., Mild-OSA), thereby avoiding skewed model evaluation due to under-sampling or class imbalance. The chosen ratio also aligns with standard practice in medical machine learning studies, where datasets are typically limited and a larger training set can significantly improve model convergence and stability. Table 2 and Table 3 show the distribution of the anthropometric data of the training and testing subjects for one split.
Table 2. Participants’ severity groups and anthropometric information for training set.
Table 3. Participants’ severity groups and anthropometric information for the testing set.
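A class-stratified 85%/15% split of the kind described above might be sketched as follows. This is only an illustration under stated assumptions: it stratifies by severity class alone, whereas the curated test set in the study also preserved sex, MPS, BMI, age, and NC distributions; the function name `stratified_split`, the seed, and the round-half-up rule are all hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_percent=15, seed=0):
    """Return (train_idx, test_idx) with about test_percent of each class held out."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Integer round-half-up avoids floating-point rounding surprises.
        n_test = (len(members) * test_percent + 50) // 100
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Group sizes as reported in the abstract (199 subjects in total).
group_sizes = {"Non-OSA": 74, "Mild": 35, "Moderate": 50, "Severe": 40}
labels = [name for name, n in group_sizes.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these group sizes the sketch holds out 30 subjects and keeps 169 for training, the same totals as reported for the study.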

3.1.2. Splitting Dataset for K-Fold

Regular stratified K-fold cross-validation was not suitable for this research because strict stratification had to be maintained across clinically significant anthropometric factors in addition to the severity classes, and variability in class representation within the smaller OSA classes had to be avoided. Therefore, a custom stratified K-fold approach was designed in which the folds were balanced simultaneously across both the OSA severity groups and the key confounding anthropometric thresholds, including age (<50 vs. ≥50), BMI (<35 vs. ≥35), neck circumference (≤40 vs. >40), sex (male vs. female), and Mallampati scores. This ensured that each fold preserved the joint distribution of clinically relevant subgroups, thereby reducing the risk of bias and making the training and evaluation processes more representative of the real-world population’s heterogeneity.
By enforcing these stratification rules, each training and test split captured not only the proportional distribution of OSA severity classes but also the underlying demographic and anatomical risk factors. This level of control was essential for producing generalizable and reliable models, especially in subgroups with limited sample sizes that might otherwise be underrepresented in conventional splitting strategies. Table 4 shows the distribution of subjects’ anthropometric data of the k-fold splits.
Table 4. Participants’ severity groups and anthropometric information for K-folds.
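One way to balance folds jointly over severity and the anthropometric thresholds is to build a composite stratum label per subject and deal the members of each stratum out across folds in turn. The sketch below assumes that design; it is not the authors' implementation. The number of folds, the field names, and the synthetic subjects are hypothetical, and the Mallampati score is left out for brevity.

```python
import random
from collections import defaultdict


def stratum_key(subject):
    """Composite stratum: severity plus the binary thresholds named in the text."""
    return (
        subject["severity"],
        subject["age"] >= 50,
        subject["bmi"] >= 35,
        subject["nc"] > 40,
        subject["sex"],
    )


def composite_stratified_folds(subjects, k=5, seed=0):
    """Assign each subject index to one of k folds, spreading every stratum evenly."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, subject in enumerate(subjects):
        strata[stratum_key(subject)].append(idx)
    folds = [[] for _ in range(k)]
    cursor = 0  # running position, so small strata do not all land in fold 0
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        for idx in members:
            folds[cursor % k].append(idx)
            cursor += 1
    return folds


# Synthetic subjects for illustration only (not study data).
demo_rng = random.Random(1)
subjects = [
    {
        "severity": demo_rng.choice(["Non-OSA", "Mild", "Moderate", "Severe"]),
        "age": demo_rng.randint(20, 80),
        "bmi": demo_rng.uniform(20, 50),
        "nc": demo_rng.uniform(30, 50),
        "sex": demo_rng.choice(["F", "M"]),
    }
    for _ in range(199)
]
folds = composite_stratified_folds(subjects, k=5)
print([len(fold) for fold in folds])
```

Because consecutive members of a stratum go to consecutive folds, the count of any stratum differs by at most one between folds, which is the property the custom scheme is meant to guarantee.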

3.2. Tracheal Breathing Pre-Processing

A series of pre-processing steps was applied to prepare the raw audio recordings for analysis. First, all recorded signals were inspected in the time and frequency domains for background or vocal noise. The signals were then segmented into inspiratory and expiratory phases, and the signal-to-noise ratio (SNR) between each phase and the background was calculated so that any phase with a very low SNR could be removed [,]. This separation was crucial, as upper-airway obstructions often manifest differently in each respiratory phase, particularly in patients with OSA []. The segmentation was achieved by thresholding the logarithm of the signal variance, log(var), to isolate the breath cycles [,]. Following segmentation, a 4th-order Butterworth bandpass filter with a 75–3000 Hz passband was applied to reduce the effect of heartbeats, microphone artifacts, muscle motion, 60 Hz harmonics, and ambient noise [,]. Finally, all filtered signals were normalized in two steps: first by their variance envelope (a smoothed version of the sample moving average) and then by their standard deviation (energy), to eliminate the effect of plausible airflow fluctuation between the breathing cycles [,]. Figure 3 shows the results of the preprocessing techniques on a sample breathing phase.
Figure 3. The results of preprocessing techniques on a sample breathing phase.
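The filtering, log-variance, and standard-deviation normalization steps can be sketched with NumPy and SciPy (both assumed to be installed). Several details are assumptions rather than facts from the study: zero-phase filtering via `sosfiltfilt`, the prototype order passed to `butter`, the 50 ms window, and the midpoint threshold. The variance-envelope normalization is not shown, and the input is a synthetic signal, not a recording.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 10_240  # sampling rate in Hz, as stated in the Figure 2 caption


def bandpass_75_3000(signal, fs=FS):
    """Zero-phase Butterworth bandpass between 75 and 3000 Hz."""
    # N=2 gives a 4th-order bandpass in SciPy (band filters double the order);
    # whether the study used N=2 or N=4 is not stated, so this is an assumption.
    sos = butter(2, [75, 3000], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)


def log_variance(signal, fs=FS, window_s=0.05):
    """Log of the variance in consecutive non-overlapping windows."""
    win = int(window_s * fs)
    n_win = len(signal) // win
    frames = np.reshape(signal[: n_win * win], (n_win, win))
    return np.log(frames.var(axis=1) + 1e-12)


def normalize_by_std(signal):
    """Scale a breathing phase to unit standard deviation."""
    return signal / (np.std(signal) + 1e-12)


# Synthetic stand-in: a 500 Hz burst (the "breath") from 0.5 s to 1.5 s,
# plus a 10 Hz drift that the bandpass filter should suppress.
t = np.arange(2 * FS) / FS
burst = np.where((t >= 0.5) & (t < 1.5), np.sin(2 * np.pi * 500 * t), 0.0)
raw = burst + np.sin(2 * np.pi * 10 * t)

filtered = bandpass_75_3000(raw)
logvar = log_variance(filtered)
threshold = (logvar.min() + logvar.max()) / 2  # hypothetical midpoint threshold
active = logvar > threshold  # windows labelled as breath
phase = normalize_by_std(filtered[int(0.5 * FS): int(1.5 * FS)])
print(int(active.sum()), "of", active.size, "windows marked as breath")
```

In practice the threshold would be tuned on real recordings, since background noise raises the log-variance floor well above that of this clean synthetic example.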

3.3. Anthropometric Missing Value Imputation

Missing anthropometric values were imputed using a severity-specific k-nearest neighbors (k-NN) method to maintain internal group distributions and minimize bias [,]. The full imputation methodology and implementation steps are detailed in Appendix A, Figure A1.

3.4. Feature Extractions

The feature extraction methodology spans multiple analytical domains, including spectral, temporal, nonlinear, and cross-domain analyses, ensuring a holistic and multidimensional representation of linear and nonlinear signal dynamics. The extracted features are grouped and optimized specifically for each base model (binary classifiers). This model-specific feature selection process enables the creation of personalized feature sets that enhance model robustness, improve classification accuracy, and support high-performance modeling for diagnostic and predictive applications. The parameters for power spectrum and bispectrum gaps from confidence intervals are calculated from the training dataset and then applied to the testing dataset. Figure 4 illustrates the steps involved in feature extraction.
Figure 4. Flow chart of feature extraction from training data.
Finding gaps in the power spectrum and bispectrum using confidence intervals. Identifying meaningful deviations in frequency-domain representations, such as the power spectrum and bispectrum, is critical for understanding the underlying dynamics of non-stationary signals and pinpointing the regions to focus on during feature extraction. Traditional spectral analysis often relies on peak detection or energy thresholds, which can overlook subtle but statistically significant features. To enhance this process, we employed confidence interval-based gap detection, allowing for the quantification of spectral features that deviate meaningfully from expected background fluctuations.
For the power spectral density (PSD), we first estimated the mean spectrum across subjects or trials and computed the standard deviation at each frequency bin. Assuming normality in the spectral estimates, a valid approximation under the Central Limit Theorem for large sample sizes, a 95% confidence interval was constructed as follows:
$$CI(f) = \mu(f) \pm 1.96\,\sigma(f)$$
where $\mu(f)$ and $\sigma(f)$ represent the mean and standard deviation of spectral power at the frequency $f$, respectively. Frequencies where the spectral power of a subject or a class-specific average exceeded or fell below this confidence range were marked as spectral “gaps” or “anomalies,” depending on the context. The same principle was extended to the bispectrum, which captures quadratic phase coupling between frequency components and reveals nonlinear interactions not evident in the power spectrum alone. Due to the higher dimensionality and more complex distribution of bispectrum estimates, we employed a bootstrap resampling method [] to compute empirical confidence intervals for the bispectrum magnitude and phase at each frequency pair $(f_1, f_2)$. This approach avoids the Gaussianity assumption, which is often violated in higher-order spectral domains.
Significant Bispectrum gaps were identified where the observed Bispectrum values lay outside the bootstrapped 95% confidence bounds. These gaps may indicate regions of suppressed nonlinear interactions or phase-coupling loss and can be critical in distinguishing pathological signal dynamics from normal states [,]. By focusing on confidence-interval-defined deviations, this method provides a statistically principled framework for highlighting underexplored or weakly represented frequency components in both linear and nonlinear spectral representations. Figure 5 shows samples of detected gaps on both PSD and Bispectrum.
Figure 5. A sample of the detected gaps regions of both PSD and bispectrum where (a) shows the PSD detection gaps, highlighted in yellow, using the proposed method; (b) shows the regions containing bispectrum detection gaps, highlighted as red boxes, using the proposed method.
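The two interval constructions described in this subsection can be illustrated in a few lines of NumPy (assumed to be installed). This is a sketch on synthetic spectra, not the study's code: `find_gaps` flags bins where a candidate spectrum leaves the mean ± 1.96 standard deviation band of a reference group, and `bootstrap_ci` computes a percentile bootstrap interval of the mean with no Gaussian assumption. Function names, the number of resamples, and the data are hypothetical.

```python
import numpy as np


def ci_bounds(spectra, z=1.96):
    """Per-frequency mean +/- z * std over rows (one row per subject)."""
    mu = spectra.mean(axis=0)
    sigma = spectra.std(axis=0, ddof=1)
    return mu - z * sigma, mu + z * sigma


def find_gaps(reference_spectra, candidate_spectrum):
    """Boolean mask of bins where the candidate lies outside the reference band."""
    lower, upper = ci_bounds(reference_spectra)
    return (candidate_spectrum < lower) | (candidate_spectrum > upper)


def bootstrap_ci(samples, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval of the mean along axis 0."""
    rng = np.random.default_rng(seed)
    n = samples.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))  # resample subjects with replacement
    boot_means = samples[idx].mean(axis=1)
    lower = np.percentile(boot_means, 100 * alpha / 2, axis=0)
    upper = np.percentile(boot_means, 100 * (1 - alpha / 2), axis=0)
    return lower, upper


# Synthetic reference group: 40 subjects x 64 frequency bins.
data_rng = np.random.default_rng(7)
reference = data_rng.normal(loc=10.0, scale=1.0, size=(40, 64))
candidate = reference.mean(axis=0).copy()
candidate[20:25] += 5.0  # inject a deviation in five neighbouring bins

gaps = find_gaps(reference, candidate)
boot_lower, boot_upper = bootstrap_ci(reference)
print("flagged bins:", np.flatnonzero(gaps))
```

For a bispectrum, `samples` would carry two frequency axes, one per component of the pair, and the same resampling over subjects would apply to each pair.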
Initially, PSD was estimated using Welch’s method to identify frequency bands with significant differences across groups []. Within these bands, a set of representative spectral features was extracted, including mean power, spectral centroid, bandwidth, and spectral entropy. To capture nonlinear interactions and higher-order harmonics, bispectrum analysis (higher-order spectral domain) [] was performed. Features such as bispectrum magnitude, total bispectrum energy, and symmetry metrics were extracted from statistically significant regions identified using confidence intervals.
Time-domain descriptors were also included to account for amplitude dynamics and signal complexity, such as zero-crossing rate, root mean square (RMS), fractal dimension [], waveform length, shimmer, and jitter []. Measures such as Noise-to-Harmonic Ratio (NHR) [] and correlation coefficients were also used to quantify voice quality and signal regularity. Complementary time-frequency features were extracted using wavelet transforms [], Mel-Frequency Cepstral Coefficients (MFCCs) [,], and Constant-Q Transform (CQT) analysis [], which capture transient, perceptual, and frequency-localized aspects of the signal, respectively.
To assess underlying chaotic dynamics and recurrence properties, we extracted features based on Lyapunov exponents [], recurrence quantification analysis (RQA) [], and entropy metrics []; cycle-based statistics were also incorporated to enhance discriminability and robustness. A dedicated set of pattern-based features was also included to represent the structural characteristics of the signals. These included Local Binary Patterns (LBP), Probabilistic Binary Patterns (PBP) [,], and texture-based descriptors such as contrast, correlation, homogeneity, and energy []. These features captured spatial and temporal regularities, symmetry, and variation in the time-frequency representations, providing discriminative power in distinguishing between classes. Additional features were extracted from the bispectrum bounding box geometry (e.g., area, perimeter, aspect ratio) [], spectral shape (e.g., spectral flux, centroid, bandwidth) [], and signal dynamics (e.g., amplitude modulation, autocorrelation, CDF metrics) []. Features from the tunable Q-factor wavelet transform (TQWT) further captured oscillatory patterns across resolutions. A complete list of extracted features (including formulation and rationale) is provided in Appendix D, Table A1.
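Three of the simpler descriptors named above (zero-crossing rate, RMS, and spectral centroid) are sketched below using NumPy (assumed to be installed). The definitions are the common textbook ones and may differ in detail from the formulations in Table A1; the test tone is synthetic.

```python
import numpy as np


def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.signbit(x)
    return float(np.mean(signs[1:] != signs[:-1]))


def root_mean_square(x):
    """Square root of the mean squared amplitude."""
    return float(np.sqrt(np.mean(np.square(x))))


def spectral_centroid(x, fs):
    """Power-weighted mean frequency of the one-sided spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(np.sum(freqs * power) / np.sum(power))


fs = 10_240  # sampling rate used for the recordings
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 400 * t)  # one second of a 400 Hz test tone

zcr = zero_crossing_rate(tone)
rms = root_mean_square(tone)
centroid = spectral_centroid(tone, fs)
print(f"ZCR={zcr:.4f}  RMS={rms:.4f}  centroid={centroid:.1f} Hz")
```

For a pure tone the centroid sits at the tone frequency and the RMS is the amplitude divided by the square root of two, which makes this a convenient sanity check before applying the functions to breathing phases.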

3.5. Automatic Feature Normalization

To ensure optimal preprocessing, an adaptive algorithm selected the most appropriate normalization method based on information shared between features and labels [,,,]. Details of the normalization methods evaluated, mutual information calculation, and flow chart are provided in Appendix B, Figure A2.

3.6. Selecting Best Features

A three-stage pipeline integrating filter (t-test), embedded (SHAP-based ranking), and wrapper (Recursive Feature Elimination with RUSBoost) methods was used to select informative and stable features [,,,,,,]. The full description, algorithmic workflow (Figure A3), and associated figure (Figure A4) are included in Appendix C.

3.7. Experiments Models Training

The training of the experiment models follows a systematic methodology that combines hyperparameter optimization, bootstrap aggregation (bagging), and ensemble-based validation to ensure robust model selection. This approach is designed to evaluate a range of classifiers under different configurations while addressing common issues such as overfitting, high variance, and unreliable performance estimates. By integrating multiple techniques into a structured pipeline, the methodology aims to produce models that generalize well to unseen data and provide reproducible, high-quality results. Six binary base experiments were trained to capture pairwise distinctions between the different severity levels of OSA. These include the following:
  • Non-OSA vs. Mild-OSA;
  • Non-OSA vs. Moderate-OSA;
  • Non-OSA vs. Severe-OSA;
  • Mild-OSA vs. Moderate-OSA;
  • Mild-OSA vs. Severe-OSA;
  • Moderate-OSA vs. Severe-OSA.
The complete training methodology was applied to every pairwise comparison. This ensured that base classifiers were explicitly optimized for the discriminative characteristics relevant to each subset of the data. The process is divided into several key stages, as described below.
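The six pairwise tasks listed above are simply all unordered pairs of the four severity levels, which can be enumerated as in the following sketch. The helper names are hypothetical.

```python
from itertools import combinations

SEVERITY_LEVELS = ["Non-OSA", "Mild-OSA", "Moderate-OSA", "Severe-OSA"]


def pairwise_tasks(levels):
    """All unordered 1-vs-1 pairs, in the same order as the list in the text."""
    return list(combinations(levels, 2))


def subset_for_task(labels, task):
    """Indices of subjects that belong to either class of a pairwise task."""
    return [i for i, label in enumerate(labels) if label in task]


tasks = pairwise_tasks(SEVERITY_LEVELS)
for first, second in tasks:
    print(f"{first} vs. {second}")
```

Each task would then be trained only on the subjects returned by `subset_for_task` for that pair.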

3.7.1. Classifier Configuration and Hyperparameter Optimization

The first stage of base model training involves configuring a diverse set of classifiers and systematically optimizing their hyperparameters. A total of eighteen classifier configurations were explored to capture a broad spectrum of modeling paradigms []. These included the following:
  • Traditional models: Decision Trees, Naïve Bayes, and Logistic Regression (L1 and L2 regularized).
  • Distance- and projection-based models: k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA).
  • Margin-based models: Support Vector Machines (SVMs) with linear, radial basis function (RBF) and polynomial kernels.
  • Ensemble methods: Random Forests, Bagged Trees, Gradient Boosting Machines (GBM), RUSBoost, and Subspace kNN.
  • Neural networks: Both shallow and deep architectures.
The heterogeneous nature of breathing sound features and associated clinical data guided the selection of these classifiers. To address the varying degrees of complexity in the input space, the methodology balanced interpretable models (e.g., Logistic Regression, Decision Trees) with nonlinear learners (e.g., Neural Networks, SVMs, GBMs) [,]. Special emphasis was placed on ensemble methods, which are known for their robustness and ability to reduce overfitting, particularly in imbalanced and moderately sized datasets.
Each classifier was paired with a custom-defined hyperparameter search space tailored to its critical tuning variables (as detailed in Table 1). Using the Expected Improvement Plus (EI+) acquisition function, Bayesian optimization was employed to navigate these spaces. This method efficiently balances the exploration of high-dimensional parameter spaces with the exploitation of high-performing regions, leading to faster convergence than grid or random search methods []. The optimization objective was to minimize the misclassification rate under 5-fold stratified cross-validation, using the following loss function:
$$L(\theta) = \frac{1}{K} \sum_{k=1}^{K} I\left(y_i \neq \hat{y}_i(\theta)\right)$$
where $\theta$ represents the hyperparameters, $K$ is the number of cross-validation folds, $y_i$ is the actual label, $\hat{y}_i(\theta)$ is the predicted label given the hyperparameters, and $I(\cdot)$ is the indicator function [,]. Depending on classifier complexity, convergence was typically achieved within 50 iterations. Using 5-fold stratified cross-validation ensured that each model’s performance estimate was reliable and not biased by a particular subset of the data. By covering a diverse set of modeling strategies, the methodology maximizes the likelihood of identifying experimental models with complementary strengths, which is critical for the subsequent ensemble learning and modeling phases [,].
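As a worked example of the objective, the sketch below computes a cross-validated misclassification rate from per-fold labels and predictions. It assumes one common reading of the loss, namely the error rate within each fold averaged over the K folds; the fold contents are made-up values for illustration.

```python
def fold_error(y_true, y_pred):
    """Fraction of misclassified samples in one validation fold."""
    wrong = sum(1 for truth, pred in zip(y_true, y_pred) if truth != pred)
    return wrong / len(y_true)


def cv_loss(folds):
    """Mean misclassification rate over K folds of (y_true, y_pred) pairs."""
    return sum(fold_error(truth, pred) for truth, pred in folds) / len(folds)


# Five hypothetical validation folds: (true labels, predicted labels).
folds = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # 1 of 4 wrong
    ([1, 0, 1, 1], [1, 0, 1, 1]),  # 0 of 4 wrong
    ([0, 0, 1, 1], [1, 0, 0, 1]),  # 2 of 4 wrong
    ([1, 1, 0, 0], [1, 1, 0, 1]),  # 1 of 4 wrong
    ([0, 1, 0, 1], [0, 1, 0, 1]),  # 0 of 4 wrong
]
loss = cv_loss(folds)
print(f"cross-validated misclassification rate: {loss:.2f}")
```

A hyperparameter optimizer would call such a function once per candidate setting and keep the setting with the lowest value.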

3.7.2. Bootstrap Aggregation with OOB Validation

Following hyperparameter optimization, each classifier underwent bootstrap aggregation (bagging) to enhance generalization and reduce model variance []. For each base model, $B = 50$ bootstrap samples were generated by resampling the training data with replacement, where each sample $D_b$ had size $N$, equal to the number of original training instances. Using each $D_b$, a base learner $f_b$ was trained with its respective optimized hyperparameters $\theta$ []. For the $b$-th bootstrap sample $D_b$:
1. A resample $D_b$ was generated with replacement, with $|D_b| = N$, where $N$ is the number of training instances.
2. Classifier $f_b$ was trained on $D_b$ using the optimized $\theta$.
3. OOB samples $D_{\mathrm{oob}}^{(b)} = D \setminus D_b$ were retained for validation.
An essential advantage of this strategy is its natural support for out-of-bag (OOB) validation, which enables unbiased performance estimation without requiring a separate validation set. For each instance $x_i$, an OOB prediction was computed by aggregating predictions from all base learners for which $x_i \notin D_b$, i.e., from models that did not see that instance during training []. Formally, the OOB prediction is given by
$$\hat{y}_i^{\mathrm{OOB}} = \operatorname{mode}\left(\left\{ f_b(x_i) \mid x_i \in D_{\mathrm{oob}}^{(b)} \right\}\right)$$
where $D_{\mathrm{oob}}^{(b)} = D \setminus D_b$ represents the set of OOB samples for the $b$-th bootstrap. This mechanism yields a robust estimate of each classifier’s generalization performance while fully utilizing the available training data. For each classifier configuration, the following OOB-based metrics were computed from the ensemble’s predictions:
  • OOB Accuracy: Overall correct classification rate.
  • OOB Sensitivity: True positive rate, capturing the ability to detect positive cases.
  • OOB Specificity: True negative rate, reflecting the ability to identify negative cases correctly.
By aggregating predictions across 50 model instances and leveraging OOB samples, this ensemble approach not only improves model stability and robustness but also provides reliable, data-efficient validation suitable for imbalanced or limited-size datasets [,,,].
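The bagging and OOB procedure can be sketched in plain Python. The base learner here is a deliberately simple one-dimensional threshold rule standing in for the optimized classifiers of the study, and the data are synthetic; only the resampling and the majority vote over out-of-bag models follow the description above.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """Toy base learner: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    threshold = (mean0 + mean1) / 2
    sign = 1 if mean1 >= mean0 else -1
    return lambda x: int(sign * (x - threshold) > 0)


def bagging_with_oob(xs, ys, n_models=50, seed=0):
    """Train n_models on bootstrap resamples; return models and OOB predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models, oob_votes = [], [[] for _ in range(n)]
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # |D_b| = N, with replacement
        model = train_stump([xs[i] for i in sample], [ys[i] for i in sample])
        models.append(model)
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:  # x_i is out-of-bag for this model
                oob_votes[i].append(model(xs[i]))
    # Majority vote (mode) over models that never saw x_i; None if it was never OOB.
    oob_pred = [Counter(v).most_common(1)[0][0] if v else None for v in oob_votes]
    return models, oob_pred


# Synthetic two-class data for illustration only.
data_rng = random.Random(42)
xs = [data_rng.gauss(0, 1) for _ in range(30)] + [data_rng.gauss(4, 1) for _ in range(30)]
ys = [0] * 30 + [1] * 30

models, oob_pred = bagging_with_oob(xs, ys)
scored = [(pred, truth) for pred, truth in zip(oob_pred, ys) if pred is not None]
oob_acc = sum(pred == truth for pred, truth in scored) / len(scored)
print(f"OOB accuracy: {oob_acc:.3f}")
```

OOB sensitivity and specificity follow from the same `oob_pred` list by restricting the comparison to positive or negative instances, respectively.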

3.7.3. Class Imbalance Mitigation

For datasets with skewed class distributions, three measures were applied [,]:
  • Cost-sensitive learning: Class-weighted loss functions scale misclassification costs inversely to class frequencies [].
  • Stratified bootstrapping: Maintains original class ratios in bootstrap samples [].
  • OOB-balanced metrics: Performance evaluation weights classes by inverse frequency [].
This three-pronged approach prevents bias toward majority classes while preserving detection capability for rare categories [,].
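The three measures can each be illustrated with a short function. The exact weighting formula is not given in the text, so the sketch assumes the common convention of weighting class c by N / (number of classes × count of c); the stratified bootstrap and the balanced accuracy are likewise generic versions, shown on the Non-OSA and Mild-OSA group sizes.

```python
import random
from collections import Counter, defaultdict


def inverse_frequency_weights(labels):
    """Weight each class by N / (n_classes * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def stratified_bootstrap(labels, seed=0):
    """Resample indices with replacement within each class, keeping class counts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    sample = []
    for c in sorted(by_class):
        members = by_class[c]
        sample.extend(rng.choice(members) for _ in members)
    return sample


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, i.e., accuracy with classes weighted by inverse frequency."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, truth in enumerate(y_true) if truth == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


labels = ["Non-OSA"] * 74 + ["Mild-OSA"] * 35
weights = inverse_frequency_weights(labels)
resample = stratified_bootstrap(labels)
majority_only = ["Non-OSA"] * len(labels)  # a classifier that ignores the minority
print(weights, balanced_accuracy(labels, majority_only))
```

A classifier that always predicts the majority class scores 0.5 on balanced accuracy even though its plain accuracy is 74/109, which is why inverse-frequency weighting matters for the smaller severity groups.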

3.7.4. Robust Model Selection via Repeated Trials

Recognizing that machine learning training processes are inherently stochastic due to factors like data shuffling, bootstrap sampling, and optimization randomness, the methodology incorporated multiple independent training trials [,]. For each classifier configuration, five independent trials were conducted. Each trial involved reinitializing Bayesian hyperparameter optimization to ensure that different regions of the search space could be explored, thus avoiding convergence to suboptimal local minima []. Additionally, 50 new bootstrap models were generated in each trial to introduce variability into the ensemble learning process. After training, performance metrics were evaluated using OOB samples and a held-out test set. Among the five trials, the one achieving the highest OOB accuracy was selected for final model comparison. This process ensured that the best-performing model was chosen based on generalizable results rather than random fluctuations in performance.

3.7.5. Final Model Selection

For each of the six pairwise models, the classifier with the highest OOB accuracy was selected as the final classifier. This criterion favors classifiers that perform well on samples they were not trained on, a benefit inherent to bootstrap aggregation and OOB validation []. Because the best classifier for each experiment was chosen on OOB performance rather than test performance, the test set played no part in model selection and remains available for an unbiased final evaluation []. All results, including accuracy, sensitivity, and specificity, were compiled into a unified table to facilitate cross-dataset comparisons and meta-analyses. This organization allows the relative performance of different models to be compared across datasets and conditions.

3.8. Model Evaluation

To assess the effectiveness of the models, the evaluation framework relies on standard classification performance metrics. This approach emphasizes both predictive accuracy and the balanced assessment of model performance across different classes, particularly in the presence of class imbalance.

3.8.1. Evaluation Protocol

(a) Out-of-Bag (OOB) Validation: During ensemble training, each base learner was evaluated on the training samples excluded from its bootstrap resample. These out-of-bag predictions were used to estimate performance without requiring a separate validation set. The resulting OOB metrics provide an approximately unbiased estimate of generalization error.
(b) Independent Test Evaluation: After model training, performance was assessed using the same metrics on a held-out test set. To account for randomness in training (e.g., bootstrap sampling or stochastic optimization), the evaluation was repeated 25 times over independent trials with different random seeds.

3.8.2. Performance Metrics

Models were evaluated using three core metrics—accuracy, sensitivity, and specificity—on both out-of-bag (OOB) samples and independent test sets [].
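For reference, the three metrics follow the standard confusion-matrix definitions, with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives for the class treated as positive in a given pairwise comparison:

```latex
\[
\begin{aligned}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Sensitivity} &= \frac{TP}{TP + FN}, \\
\text{Specificity} &= \frac{TN}{TN + FP}.
\end{aligned}
\]
```

As an arithmetic check under hypothetical counts (not reported in this section), TP = 5, FN = 1, TN = 10, and FP = 1 give a sensitivity of 5/6 ≈ 83.3%, a specificity of 10/11 ≈ 90.9%, and an accuracy of 15/17 ≈ 88.2%, which match the values quoted in the abstract for Non-OSA vs. Severe-OSA.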

4. Results

The performance of the proposed OSA severity screening framework was evaluated across multiple classification tasks using both tracheal breathing sounds (TBS) and anthropometric features. Results are reported for six pairwise comparisons of OSA severity levels (Non-OSA, Mild, Moderate, Severe), emphasizing classifier robustness, generalization, and clinical relevance. Model evaluation was conducted using out-of-bag (OOB) validation during training and a fully independent blind test set to assess real-world applicability. Accuracy, sensitivity, and specificity were computed to reflect diagnostic balance. The results support the feasibility of assessing OSA severity during wakefulness with these models, offering a rapid, non-invasive screening method. The following subsections present the results of the proposed methodology.

4.1. Feature Selection Results

The feature selection process identified distinct and informative sets of features for each model, capturing the essential aspects of breathing sounds during the mouth and nose inspiration and expiration phases. Table A2 provides a summary of selected features for each base model.
The best-performing model (Model 1) leveraged a combination of spectral and image-based features to effectively classify breathing patterns. Key features included spectral entropy and crest values extracted from specific frequency bands during mouth expiration gaps, wavelet-based spectral bandwidth, and kurtosis metrics derived from nose inspiration and expiration phases. Image-derived features, such as the number of holes, bounding box area, and texture contrast across the mouth and nose regions further enhanced the model’s discriminatory capability. Additional contributing features included fractal dimension estimates, peak frequency, and statistical measures derived from Mel-Frequency Cepstral Coefficients (MFCCs). A detailed list of selected features for all models (Models 1–6) is provided in Appendix D, Table A2.
The feature selection process across all six models identified a diverse set of acoustic, spectral, fractal, and image-based characteristics that effectively capture the nuances of breathing sounds. Most of these features were extracted predominantly from the mouth inspiration segments, which provided rich spectral and fractal information, while some were derived from expiration and combined inspiration–expiration phases. The selected features include spectral measures such as spectral centroid, entropy, skewness, flux, power spectral density statistics, fractal dimensions, and wavelet-based coefficients. Statistical descriptors of MFCCs, zero crossing rates, Bispectrum entropy, and harmonic–percussive source separation features further enrich the dataset. Additionally, morphological features extracted from image representations of the respiratory signals, such as bounding box area, number of holes, connected components, and Euler numbers, contributed to capturing structural variations in the signal. Together, these features comprehensively represent both the temporal and frequency-domain properties of the respiratory signals, enabling robust discrimination between subject classes.
These acoustic and morphological features were combined with seven key anthropometric variables, including body mass index (BMI), age, sex, smoking history, neck circumference (NC), and Mallampati score (MPS). Integrating these physiological and demographic factors with the breathing sound features enhances the models’ ability to reflect intrinsic body characteristics and breathing dynamics. This fusion yields 41 features for each model, resulting in a robust, multidimensional dataset for subsequent classification and analysis tasks.

4.2. Experiments Models Results

4.2.1. Training and Testing Results

The selected base classifiers, Random Forest, Support Vector Machines (SVMs) with polynomial kernels of degrees 3, 5, and 7, Subspace K-Nearest Neighbors (Subspace KNN), and Linear Discriminant Analysis (LDA), demonstrate consistently strong performance according to their out-of-bag (OOB) estimates. These classifiers were chosen for their reliability and robustness across different datasets. The OOB accuracies remain high across all models, typically exceeding 80%, indicating that the classifiers are well-calibrated and effectively generalized during internal validation. Additionally, the OOB sensitivity and specificity values are balanced, suggesting that these models strike a good balance between identifying true positives and minimizing false positives. Table 5 presents the out-of-bag (OOB) performance metrics for these classifiers.
Table 5. OOB results for experimental models.
Test results were then obtained for the classifiers that achieved the highest OOB performance on each dataset. These independent evaluations further validate the generalization capability of the selected classifiers. The test accuracies, sensitivities, and specificities closely mirror the trends observed in the OOB evaluations, with deviations generally remaining within 10%, an acceptable range in practical classification tasks. Several models, including Random Forest and SVMs with polynomial kernels, achieved perfect sensitivity or specificity on specific datasets, highlighting their potential for robust classification in real-world applications. Table 6 shows the test performance metrics for these classifiers.
Table 6. Test results for experimental models.
To provide a comprehensive overview, we additionally report performances from all evaluated classifiers in Figure 6 and Figure 7. These figures illustrate the distribution of accuracy, sensitivity, and specificity across classifiers, complementing the summary in Table 5 and Table 6.
Figure 6. Expanded classifier performance metrics across six binary experiments. For each experiment, the top six classifiers were selected based on their mean performance. Grouped bar plots display both out-of-bag (OOB) and test set results for accuracy, sensitivity, and specificity. This visualization highlights the differences between training (out-of-bag, OOB) and generalization (test) performance.
Figure 7. Receiver Operating Characteristic (ROC) curves of the top six classifiers for representative binary experiments. The curves illustrate the discrimination ability of each classifier across sensitivity–specificity trade-offs, complementing Table 5 and Table 6 by providing a visual comparison of performance beyond single-value metrics.

4.2.2. K-Fold Results

The selected base classifiers demonstrated consistently strong performance across 3-fold cross-validation. The OOB accuracies remained high, generally exceeding 80%, indicating effective generalization and stability across folds. Sensitivity and specificity values were also well-balanced, suggesting that the models achieved a good trade-off between detecting true positives and minimizing false positives. Table 7 presents the 3-fold OOB performance metrics, which are consistent with the trends observed in the previous training–testing evaluations.
Table 7. OOB 3-fold cross-validation results of experimental models.
The test results correspond to models that achieved strong performance during internal validation. These independent evaluations further confirm the generalization capability of the selected models. The test accuracies, sensitivities, and specificities closely reflected the patterns observed in cross-validation, with deviations generally remaining within 10%, an acceptable range for practical classification tasks. Table 8 presents the test performance metrics, which are consistent with the trends observed in previous evaluations.
Table 8. Test 3-fold cross-validation results of experimental models.

5. Discussion

The proposed framework for OSA severity screening during wakefulness is a promising advance in non-invasive diagnostic tools, particularly because it integrates tracheal breathing sound analysis with anthropometric data. The structured evaluation across six pairwise OSA severity classifications offers robust performance metrics and insights into the physiological and acoustic distinctions among severity levels. This discussion synthesizes these findings, emphasizing the methodology’s strengths and implications for clinical practice.
Incorporating SHAP into the feature selection pipeline provided an interpretable and data-driven mechanism to quantify each feature’s contribution toward OSA severity classification. Unlike traditional ranking approaches, SHAP integrates cooperative game theory principles to assign fair importance values to features based on their marginal contributions across multiple model predictions. This allowed us to confirm that physiologically relevant features, such as spectral entropy, bispectrum-derived texture measures, and anthropometric variables like BMI and neck circumference were consistently influential across models. Significantly, SHAP enhanced transparency by linking acoustic and morphological variations in tracheal breathing sounds with interpretable physiological correlations, thereby bridging the gap between clinical understanding and algorithmic decision-making. When combined with Recursive Feature Elimination (RFE), the SHAP-guided ranking ensured that only the most stable and clinically meaningful features were retained, which improved model robustness while reducing dimensionality. This integration strengthens the interpretability and reliability of the proposed wakefulness-based OSA severity screening framework.
The feature selection process revealed diverse spectral, temporal, fractal, and morphological characteristics extracted from mouth- and nose-breathing segments during different respiratory phases. Notably, features such as spectral entropy, crest, kurtosis, fractal dimension, and MFCC-based statistics emerged repeatedly across models. These features are physiologically meaningful, as they capture underlying variations in airway obstruction, turbulence, and breathing effort associated with different severities of OSA. Notably, morphological features derived from bispectrum image representations of the respiratory signals, such as bounding box area, Euler number, and connected components, provide a novel dimension to acoustic analysis. These image-based descriptors translate subtle acoustic changes into quantifiable structural patterns, enhancing interpretability and model transparency. Furthermore, the consistent contribution of mouth inspiration segments as primary sources of discriminative features underscores their diagnostic richness, likely due to greater variability in the upper-airway resistance during inspiration in OSA patients. This is because mouth-breathing bypasses the nasal passages and exposes the more collapsible pharyngeal airway to direct airflow. During inspiration, this region is more prone to dynamic narrowing and turbulence in individuals with OSA, resulting in acoustic patterns that better reflect underlying structural abnormalities compared to nasal breathing. This aligns with the prior literature highlighting mouth-breathing as a compensatory mechanism in individuals with compromised nasal airflow and may reflect airway collapse dynamics during wakeful states [,].
The selected features encompass a broad range of physiological representations of breathing sound dynamics, each reflecting essential aspects of upper-airway structure and function that are affected by obstructive sleep apnea (OSA). Spectral features, such as spectral entropy, skewness, kurtosis, crest, centroid, and bandwidth, quantify the distribution and organization of energy across frequencies. In patients with OSA, upper-airway obstruction during inspiration and expiration leads to increased turbulence, reflected in broader spectral distributions (higher entropy), asymmetric power distribution (skewness), and heavier spectral tails (kurtosis). These features can capture abnormal airflow patterns due to pharyngeal collapse, especially during inspiration, which is more sensitive to airway resistance [].
Fractal dimensions and nonlinear dynamics measures such as the Hurst exponent, Lyapunov exponent, and Katz fractal dimension reflect the irregularity and complexity of breathing signals, which tend to increase with the severity of OSA due to variable airflow and compensatory muscle activity [,,]. MFCCs (Mel-Frequency Cepstral Coefficients), though commonly used in speech processing, effectively capture spectral envelope variations that correlate with airway resonance characteristics, which are particularly altered in OSA due to anatomical and functional airway changes []. Bispectrum features and bicoherence quantify quadratic phase coupling between frequencies, providing insights into the nonlinear and harmonic interactions typical of turbulent breathing in OSA.
Time-domain features such as zero-crossing rate, peak frequency, and signal energy characterize oscillatory behavior and airflow strength []. These are sensitive to inspiratory effort and upper-airway resistance []. Additionally, wavelet-based features offer a multi-resolution analysis of signal transients, making them suitable for identifying events such as partial obstructions or arousals during respiration. CQT (Constant-Q Transform) and entropy-based features derived from wavelet or CQT domains reflect subtle changes in airflow rhythm and complexity [], which may not be detectable in standard spectral measures.
Morphological and image-based features such as bounding box area, number of holes, Euler number, and texture measures are extracted from spectrogram or bispectrum image representations and serve as indirect quantifiers of structural variation in acoustic patterns. These are physiologically linked to airway geometry and dynamic obstruction events [,], as turbulent flow often generates unique spatial textures in time–frequency representations. Finally, anthropometric features such as BMI, neck circumference, Mallampati score, and age are directly related to anatomical risk factors for OSA, including fat deposition around the neck, airway collapsibility, and tongue size.
Together, these features form a multidimensional physiological signature of breathing under different severities of OSA. Their combined use enhances the ability to non-invasively and objectively screen OSA severity during wakefulness with high reliability. While anthropometric data improved classification accuracy, reliance on such features may limit usability in home-based screenings. A direct comparison with tools like STOP-Bang and Berlin questionnaires was not conducted in this study; future work should benchmark performance against these established screening methods to better contextualize the advantages.
In the training–testing evaluation, the base classifiers demonstrated strong internal validation performance through out-of-bag (OOB) evaluation, with accuracies generally exceeding 80% and balanced sensitivity and specificity. These results indicate that the models are well-calibrated and generalize effectively within the training data. The SVM with polynomial kernels (particularly degrees 3 and 5) and Random Forests emerged as consistently high performers, achieving high accuracy and balanced diagnostic metrics across severity comparisons. Such consistency highlights their ability to model nonlinear interactions and feature dependencies in complex acoustic-physiological data. External validation using a blind test set reinforced the models’ real-world applicability. Test performance mirrored the OOB results, with deviations typically within 10%. For instance, Model 1 (Random Forest) achieved 100% sensitivity, indicating its ability to capture OSA-positive cases, which is crucial for clinical screening scenarios. Similarly, Model 4 (SVM Poly 7) achieved 100% specificity, emphasizing its strength in confidently identifying non-OSA subjects. These complementary outcomes across classifiers suggest strengths that could be combined through ensemble integration.
The 3-fold cross-validation further confirmed the robust performance of the base classifiers, with accuracies generally remaining above 80%, and sensitivities and specificities showing balanced values across folds. These results indicate that the models are stable under repeated resampling and generalize effectively within the training data. Notably, SVMs with polynomial kernels (particularly degrees 3 and 9) and Random Forests consistently achieved high accuracy and well-balanced diagnostic metrics, reflecting their ability to capture nonlinear relationships and complex interactions in the data. External validation using the test sets reinforced these findings, with deviations from OOB results typically within 10%. For example, Model 2 (Random Forest) achieved 87% sensitivity, demonstrating reliable detection of true positives, while Model 3 (Gradient Boosting) reached 93% specificity, emphasizing the accurate identification of true negatives. This complementary performance across classifiers underscores their potential value in ensemble modeling for enhanced predictive reliability.
An essential next step is to investigate whether predicted OSA severity correlates with perioperative complication rates, which would reinforce the translational value of this method for surgical risk stratification.

6. Limitations

Although the system shows strong potential, several limitations must be acknowledged. First, all data in this study were collected under controlled conditions at a single center using one microphone type (Sony ECM77B). This may limit generalizability, as real-world environments, such as different clinics or home settings, introduce variability in background noise, microphone type, and user behavior. Future work should therefore validate the framework across centers and devices and include analyses of self-placement errors to assess robustness in more realistic scenarios.
Another issue is the reliance on consistent microphone placement over the trachea. While this was carefully managed during data collection, it is possible that in real-world use, especially in self-administered or remote settings, the placement may not always be accurate. Small changes in position could affect the sound quality and, in turn, the model’s predictions. Future versions of the system should account for this, possibly by incorporating signal quality checks or providing user guidance.
There is also the matter of anthropometric data. Some features, like neck circumference or jaw position, might not always be easy to measure, particularly outside of a clinical environment. Although combining these features with acoustic data improves accuracy, it could limit the tool’s practicality in settings where full measurements are not available. Exploring ways to work with partial data or identifying simpler substitutes would make the system more accessible.
Although the dataset is relatively small and imbalanced across severity groups, we mitigated overfitting through 3-fold stratified cross-validation during hyperparameter tuning, bootstrap aggregation with OOB validation, and repeated independent training trials. Nevertheless, larger and multi-center datasets are needed for stronger statistical power.
Finally, while the tool performs well as a screening aid, it is not a replacement for clinical diagnosis. Its role should be clearly defined within the broader diagnostic process, helping to flag potential cases but not making final decisions. More work is needed to understand how clinicians would use the results in practice and how to communicate the model’s output in a way that is both useful and trustworthy.

7. Conclusions

This study lays the foundation for a fast, objective, and non-invasive method for screening obstructive sleep apnea (OSA) severity using tracheal breathing sounds (TBS) recorded during wakefulness. By combining features from binary severity comparisons with anthropometric data, the proposed approach achieved high classification performance across multiple machine learning models. Notably, the SVM with a third-degree polynomial kernel delivered strong out-of-bag and test results, while Random Forests achieved perfect test sensitivity. These findings support the potential of TBS-based analysis as a practical and accessible alternative to polysomnography, enabling a reliable assessment of OSA severity in under 10 min and representing a significant advancement for early detection and perioperative risk management.
The developed framework successfully harnesses the multidimensional richness of breathing sound analysis and anthropometric data to screen for OSA severity levels with high accuracy and clinical relevance. The effective combination of feature selection, robust base classifiers, and a well-structured approach illustrates a scalable and interpretable method for non-invasive respiratory assessment. These findings represent a crucial step toward AI-driven, accessible sleep disorder diagnostics that can bridge the current gaps in OSA identification and management.

Author Contributions

Conceptualization, A.M.A. and Z.M.; methodology, A.M.A. and Z.M.; software, A.M.A.; validation, A.M.A. and Z.M.; formal analysis, A.M.A. and Z.M.; investigation, A.M.A. and Z.M.; data curation, A.M.A. and Z.M.; writing—original draft preparation, A.M.A.; writing—review and editing, A.M.A. and Z.M.; visualization, A.M.A. and Z.M.; supervision, Z.M.; project administration, Z.M.; funding acquisition, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Institutional Review Board Statement

The study was approved by the University of Manitoba’s Biomedical Research Ethics Board. All experimental procedures were conducted in accordance with the protocol approved by the board and its regulations.

Data Availability Statement

To access the anonymized data for research purposes, one may contact the PI of the study (last author).

Acknowledgments

We acknowledge the support of the NSERC (Natural Sciences and Engineering Research Council of Canada).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AHI: Apnea–Hypopnea Index
BMI: Body Mass Index
CQT: Constant-Q Transform
DFA: Detrended Fluctuation Analysis
ECOC: Error-Correcting Output Codes
GBM: Gradient Boosting Machine
kNN: k-Nearest Neighbors
LASSO: Least Absolute Shrinkage and Selection Operator
LDA: Linear Discriminant Analysis
LR: Logistic Regression
MFCC: Mel-Frequency Cepstral Coefficients
MPS: Mallampati Score
NC: Neck Circumference
OOB: Out-of-Bag
OSA: Obstructive Sleep Apnea
PSG: Polysomnography
RF: Random Forest
RFE: Recursive Feature Elimination
RQA: Recurrence Quantification Analysis
SHAP: SHapley Additive exPlanations
SVM: Support Vector Machine
TBS: Tracheal Breathing Sounds
TQWT: Tunable Q-Factor Wavelet Transform
VMD: Variational Mode Decomposition

Appendix A. Anthropometric Missing Value Imputation

Missing anthropometric data were imputed within each severity group (Non-, Mild, Moderate, and Severe OSA) using a k-nearest neighbors (k-NN) approach []. The k-NN imputation estimates missing entries by leveraging the similarity among samples within the same OSA category, thus preserving the inherent structure and distribution of the data. This method enhances data completeness while minimizing the potential bias introduced by missingness [].
By combining severity-specific grouping with localized imputation, the preprocessing approach ensures robust and consistent feature sets that improve the reliability of downstream modeling and statistical analyses [,]. Figure A1 illustrates the steps involved in filling in missing anthropometric data.
Figure A1. Flow chart of missing anthropometric data filling.
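A minimal sketch of group-wise k-NN imputation is given below. The distance (root mean squared difference over the features both rows have observed), the value of k, and the toy BMI and neck-circumference values are all assumptions made for illustration; the source does not specify them.

```python
import math

def knn_impute_group(rows, k=3):
    """Fill None entries using the k nearest rows of the same group."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a donor must have feature j observed
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def impute_by_severity(rows, groups, k=3):
    """Apply k-NN imputation separately inside each severity group."""
    out = [None] * len(rows)
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        for i, r in zip(idx, knn_impute_group([rows[i] for i in idx], k)):
            out[i] = r
    return out

# Hypothetical [BMI, neck circumference] rows; None marks a missing value.
# In practice features would be put on a common scale before computing distances.
rows = [[22.0, 36.0], [23.0, 37.0], [24.0, None],
        [35.0, 44.0], [36.0, 45.0], [34.0, None]]
groups = ["Non", "Non", "Non", "Severe", "Severe", "Severe"]
imputed = impute_by_severity(rows, groups, k=2)
```

Restricting donors to the same severity group keeps an imputed value from being pulled toward the distribution of another group.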

Appendix B. Automatic Feature Normalization

An automatic selection algorithm was implemented to choose the most suitable normalization method for a given dataset, thereby optimizing feature preprocessing. For each feature $X_i \in X \in \mathbb{R}^{n \times d}$, four normalization techniques were evaluated: min–max scaling, z-score normalization, mean-range scaling, and robust scaling. Each feature was normalized individually and discretized into ten bins, and its dependency on the categorical labels y was quantified using mutual information [,,]:
$$I(X;Y) = \sum_{i,j} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$$
The normalization method that yielded the highest cumulative mutual information across all features was selected. This adaptive selection ensured that feature–label relationships were preserved and enhanced during preprocessing []. Then, the chosen normalization was performed using one of four methods:
  • Min–Max Scaling: Rescales data to the [0, 1] range [] using
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
  • Z-Score Normalization: Standardizes features to a zero mean and unit variance []:
$$x' = \frac{x - \mu}{\sigma}$$
  • Mean-Range Scaling: Centers by the mean and scales by the range []:
$$x' = \frac{x - \mu}{\operatorname{range}(x)}$$
  • Robust Scaling: Centers by the median and scales by the interquartile range (IQR) []:
$$x' = \frac{x - \operatorname{median}(x)}{\operatorname{IQR}(x)}$$
For each method, parameters (mean, standard deviation, minimum, maximum, median, IQR) were computed from the training data if not pre-specified, enabling their consistent application to the testing sets. The proposed automatic normalization selection ensures the preprocessing step is systematically adapted to the underlying data distribution. The method enhances feature relevance and model discriminability by maximizing the mutual information between rescaled features and categorical labels []. Furthermore, it increases resilience to outliers, non-Gaussian distributions, and varying feature ranges. This adaptive strategy enhances the robustness and generalization capabilities of subsequent learning models. Figure A2 shows a flow chart of the automatic feature normalization logic.
Figure A2. Flow chart of automatic feature normalization.
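The building blocks of this procedure can be sketched as follows: the four scalers with parameters fitted on training data only, ten-bin discretization, and the mutual information score. Equal-width bins, the population standard deviation, the quartile convention of `statistics.quantiles`, and the toy data are assumptions; the source does not state which variants were used.

```python
import math
from collections import Counter
from statistics import mean, median, pstdev, quantiles

def fit_params(x_train):
    """Estimate scaling parameters from the training data only."""
    q1, _, q3 = quantiles(x_train, n=4)
    return {"min": min(x_train), "max": max(x_train), "mean": mean(x_train),
            "std": pstdev(x_train), "median": median(x_train), "iqr": q3 - q1}

SCALERS = {
    "minmax": lambda v, p: (v - p["min"]) / (p["max"] - p["min"]),
    "zscore": lambda v, p: (v - p["mean"]) / p["std"],
    "meanrange": lambda v, p: (v - p["mean"]) / (p["max"] - p["min"]),
    "robust": lambda v, p: (v - p["median"]) / p["iqr"],
}

def discretize(x, n_bins=10):
    """Equal-width binning (assumption) into n_bins integer codes."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(x)
    return [min(int((v - lo) / width), n_bins - 1) for v in x]

def mutual_information(bins, labels):
    """I(X;Y) in nats from empirical joint and marginal frequencies."""
    n = len(bins)
    pxy, px, py = Counter(zip(bins, labels)), Counter(bins), Counter(labels)
    return sum((c / n) * math.log(c * n / (px[b] * py[l]))
               for (b, l), c in pxy.items())

# Toy feature that separates the two classes perfectly (hypothetical data).
x_train = [0, 1, 2, 3, 10, 11, 12, 13]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
params = fit_params(x_train)
scaled = [SCALERS["minmax"](v, params) for v in x_train]
mi = mutual_information(discretize(scaled), y_train)
# Training parameters are reused unchanged for a new (test) value.
test_scaled = {name: f(6.5, params) for name, f in SCALERS.items()}
```

The selection loop is left out of this sketch on purpose. All four scalers are affine, so equal-width bins computed on the scaled values assign every sample to the same bin under each method; any difference in mutual information between methods therefore depends on the binning scheme, which this section does not specify.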

Appendix C. Selecting Best Features

We implemented a robust, three-stage feature selection pipeline to ensure that the learning algorithms, particularly ensemble models, operated on a compact and informative set of inputs. This section describes the methodology in detail and explains the rationale for its use in models’ learning. The feature selection process combines filter, wrapper, and embedded methods to progressively refine the feature set. A pipeline flow chart is shown in Figure A3.
Figure A3. Flow chart of feature selection methodology.
The pipeline consisted of three main stages: univariate filtering using a t-test, model-based feature ranking via SHAP values, and Recursive Feature Elimination (RFE) with a RUSBoost ensemble. This hierarchical approach helped eliminate redundant, noisy, or non-informative features, thereby enhancing model generalizability and interpretability.

Appendix C.1. Stage 1: Filtering by Univariate t-Test

We applied a two-sample unpaired t-test to each feature in the first stage. The objective was to detect statistically significant differences in feature values between the two label groups. Features with a p-value ≤ 0.05 were retained for the next stage [,]. This method is computationally efficient and effective for removing features that are unlikely to contribute to class separation. However, it does not consider interactions between features or their impact on the learning algorithm’s performance [,].
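For reference, the test statistic can be written out under one stated assumption: the equal-variance (pooled) form is shown here, since the source does not say whether the pooled or the unequal-variance (Welch) variant was used. For a feature with sample means $\bar{x}_1, \bar{x}_2$, sample variances $s_1^2, s_2^2$, and group sizes $n_1, n_2$:

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2},
\qquad
\nu = n_1 + n_2 - 2 .
\]
```

The feature is retained when the two-sided p-value of $t$ under a Student distribution with $\nu$ degrees of freedom is at most 0.05.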

Appendix C.2. Stage 2: SHAP-Based Feature Ranking

Features retained from the t-test were further ranked using SHAP (SHapley Additive exPlanations) values. A classifier was trained using an Error-Correcting Output Codes (ECOC) ensemble model, and SHAP values were computed over multiple iterations [,]. In each iteration:
  • The classifier was trained and evaluated.
  • SHAP values were computed for each feature.
  • The top N/2 features with the highest mean absolute SHAP value were stored.
Finally, we selected the features that appeared most frequently across iterations, ensuring robustness to sampling variance. This step introduces model explainability into the selection process, allowing for interpretability and greater confidence in the selected features [,]. Figure A4 shows the Shapley values calculated for the selected features when using the proposed method for base model 6.
Figure A4. Shapley values calculated for selected features of base model 6.
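The frequency-based selection across iterations can be sketched as below. The SHAP values are made-up numbers supplied directly (no model is trained here), N is taken to be the number of features entering this stage, and the feature names and `n_keep` are hypothetical.

```python
from collections import Counter

def top_half_by_mean_abs(shap_rows, names):
    """Rank features by mean |SHAP| over samples and keep the top N/2."""
    n_feat = len(names)
    mean_abs = [sum(abs(r[j]) for r in shap_rows) / len(shap_rows)
                for j in range(n_feat)]
    order = sorted(range(n_feat), key=lambda j: mean_abs[j], reverse=True)
    return [names[j] for j in order[: n_feat // 2]]

def most_frequent(iterations, names, n_keep):
    """Count how often each feature reaches the top half across iterations."""
    counts = Counter()
    for shap_rows in iterations:
        counts.update(top_half_by_mean_abs(shap_rows, names))
    return [name for name, _ in counts.most_common(n_keep)], counts

names = ["spectral_entropy", "bmi", "zcr", "kurtosis"]
# One list of per-sample SHAP rows per iteration (illustrative values only).
iterations = [
    [[0.40, -0.30, 0.05, 0.10], [-0.50, 0.20, 0.00, -0.10]],
    [[0.30, 0.10, 0.00, 0.20], [0.35, -0.05, 0.02, -0.30]],
    [[-0.20, 0.30, 0.01, 0.05], [0.40, -0.25, 0.00, 0.10]],
]
selected, counts = most_frequent(iterations, names, n_keep=2)
```

Counting appearances rather than averaging SHAP magnitudes keeps a single iteration with unusually large values from dominating the final choice.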

Appendix C.3. Stage 3: Embedded Recursive Feature Elimination (RFE) with Ensemble Model

The final stage employed Recursive Feature Elimination (RFE) using a RUSBoost ensemble classifier [,]. This method was embedded within the model training process, recursively ranking and removing the least essential features until the desired number of features was retained. RFE was conducted within a cross-validation loop to enhance generalizability and avoid overfitting a specific training partition. The most consistently retained features across folds were selected. This embedded method leverages the inherent feature importance estimates of the RUSBoost model and tailors feature selection to the classification task [,].
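The elimination loop itself can be sketched generically. In this sketch a simple class-mean separation score (`mean_gap_importance`) stands in for the RUSBoost feature importances, and the cross-validation wrapper is omitted; both are simplifications, and the data and feature names are hypothetical.

```python
from statistics import mean, pstdev

def mean_gap_importance(X, y):
    """Stand-in importance (assumption): |class-mean gap| / column std."""
    scores = []
    for col in zip(*X):
        pos = [v for v, t in zip(col, y) if t == 1]
        neg = [v for v, t in zip(col, y) if t == 0]
        spread = pstdev(col) or 1.0  # guard against constant columns
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    return scores

def rfe(X, y, names, n_keep, importance_fn):
    """Repeatedly refit the importances and drop the weakest feature."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        scores = importance_fn(X_active, y)
        weakest = min(range(len(active)), key=lambda k: scores[k])
        del active[weakest]
    return [names[j] for j in active]

# Hypothetical data: f0 is informative, f1 weakly so, f2 is uninformative.
X = [[0.0, 1.0, 5.0], [0.2, 1.2, 5.1], [0.1, 0.8, 4.9],
     [1.0, 1.1, 5.0], [1.2, 1.3, 4.9], [1.1, 0.9, 5.1]]
y = [0, 0, 0, 1, 1, 1]
names = ["f0", "f1", "f2"]
kept_two = rfe(X, y, names, n_keep=2, importance_fn=mean_gap_importance)
kept_one = rfe(X, y, names, n_keep=1, importance_fn=mean_gap_importance)
```

With a model-based importance, the scores of the remaining features change after each removal, which is why they are recomputed on every pass instead of being ranked once.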
The proposed multi-stage feature selection pipeline offers several advantages, particularly within ensemble learning frameworks such as stacked generalization. The pipeline effectively balances speed, robustness, and task relevance by integrating filter methods such as t-tests, wrapper methods like RFE, and embedded techniques including SHAP-based ranking. Early-stage filtering removes non-discriminative features, reducing noise and enhancing the efficiency of subsequent modeling stages. SHAP values support model transparency and interpretability, which are essential in biomedical and clinical AI applications. Moreover, RFE aligns the selected features with the learning algorithm, ensuring they are optimized for predictive performance. As a result, the selected features are compact and stable and can be applied to the different experimental models used within the ensemble. In stacked ensemble systems, where individual experimental models are trained independently and a meta-learner combines their predictions, the quality of the features provided to each base model plays a critical role []. If experimental models are trained on noisy or inconsistent features, the ensemble’s performance can suffer due to weak or conflicting individual predictions. The proposed pipeline addresses this by ensuring each base model receives a consistent and informative feature set. Consequently, the meta-learner benefits from high-quality base predictions, which improves the overall strength and stability of the ensemble. This also reduces the generalization gap across both training and unseen data distributions. Therefore, the proposed feature selection pipeline is crucial for enhancing the accuracy, interpretability, and robustness of ensemble learning systems [].

Appendix D

Table A1. Summary of extracted features for all models.
| Power Spectrum Features | Bi-Spectrum Features | Wavelet | Wavelet | MFCC | TP | CDF |
|---|---|---|---|---|---|---|
| MeanPower | Hf2 2 | Wavelet Approx Mean L1 | Wavelet Approx Mean L4 | MFCC Mean C1 | TP Histogram 1 | CDFMean |
| StandardDeviation | En F Bi D 3 | Wavelet Approx Std L1 | Wavelet Approx Std L4 | MFCC Std C1 | TP Histogram 2 | CDFStd |
| Maximum | Mean Bi D F 3 | Wavelet Approx Skewness L1 | Wavelet Approx Skewness L4 | MFCC Skewness C1 | TP Histogram 3 | Lyapunov |
| Slope | WCOB D Fx 3 | Wavelet Approx Kurtosis L1 | Wavelet Approx Kurtosis L4 | MFCC Kurtosis C1 | TP Entropy | LyapunovExponentMean |
| SCBWCenter | WCOB D Fy 3 | Wavelet Approx Entropy L1 | Wavelet Approx Entropy L4 | MFCC Median C1 | TP Energy | LyapunovExponentStd |
| SCBWBandwidth | Hf1 3 | Wavelet Approx Log Energy L1 | Wavelet Approx Log Energy L4 | MFCC Range C1 | TP Skewness | LyapunovExponentMax |
| SpectralSkewness | Hf2 3 | Wavelet Approx Max To Min Ratio L1 | Wavelet Approx Max To Min Ratio L4 | MFCC Entropy C1 | TP Kurtosis | LyapunovExponentMin |
| SpectralKurtosis | Total Energy | Wavelet Approx Spectral Centroid L1 | Wavelet Approx Spectral Centroid L4 | MFCC Mean Abs Diff C1 | TP Max Prob | Recurrence |
| PeakFrequency | Normalized Energy | Wavelet Approx Spectral Bandwidth L1 | Wavelet Approx Spectral Bandwidth L4 | Spectral Centroid C1 | TP Ratio1 | RecurrenceDeterminism |
| SpectralEnergy | Max Abs Bi | Wavelet Detail Mean L1 | Wavelet Detail Mean L4 | Spectral Bandwidth C1 | TP Ratio2 | Amplitude Modulation |
| SpectralEntropy | Mean Abs Bi | Wavelet Detail Std L1 | Wavelet Detail Std L4 | MFCC Mean C2 | EP | AmplitudeModulationMean |
| ZeroCrossingRate | Entropy Skewness | Wavelet Detail Skewness L1 | Wavelet Detail Skewness L4 | MFCC Std C2 | EP Energy | AmplitudeModulationStd |
| RMS | Entropy Kurtosis | Wavelet Detail Kurtosis L1 | Wavelet Detail Kurtosis L4 | MFCC Skewness C2 | EP Mean Energy | AmplitudeModulationMax |
| SpectralFlatness | Symmetry Metric | Wavelet Detail Entropy L1 | Wavelet Detail Entropy L4 | MFCC Kurtosis C2 | EP Max Energy | AmplitudeModulationMin |
| FM2MFreq1 | Asymmetry Ratio | Wavelet Detail Log Energy L1 | Wavelet Detail Log Energy L4 | MFCC Median C2 | EP Min Energy | AmplitudeModulationMedian |
| FM2MFreq2 | Sym Mean | Wavelet Detail Max To Min Ratio L1 | Wavelet Detail Max To Min Ratio L4 | MFCC Range C2 | EP Std Energy | Miscellaneous |
| FreqSkewness1 | Sym Max | WaveletApproxSkewnessL3 | Wavelet Detail Spectral Centroid L4 | MFCC Entropy C2 | AVP | DFA_ScalingExponent |
| FreqSkewness2 | Sym Std | WaveletApproxKurtosisL3 | Wavelet Detail Spectral Bandwidth L4 | MFCC Mean Abs Diff C2 | AVP Histogram 1 | |
| SpectralCrest | Sym Var | WaveletApproxEntropyL3 | TQWT | Spectral Centroid C2 | AVP Histogram 2 | |
| BandPower | Mean Value | WaveletApproxLogEnergyL3 | TQWT | Spectral Bandwidth C2 | AVP Mean | |
| Bi-Spectrum Features | Std Value | Wavelet Approx Entropy L2 | SpectralEntropy | MFCC Mean C3 | AVP Std | |
| En T Bi | Skewness Value | Wavelet Approx Log Energy L2 | BandPowerLow | MFCC Std C3 | AVP Max | |
| En T Bi D | Kurtosis Value | Wavelet Approx Max To Min Ratio L2 | BandPowerMid | MFCC Skewness C3 | AVP Min | |
| En F Bi | Range Value | Wavelet Approx Spectral Centroid L2 | BandPowerHigh | MFCC Kurtosis C3 | AVP Entropy | |
| En F Bi D | Energy Value | Wavelet Approx Spectral Bandwidth L2 | CQT | MFCC Median C3 | AVP Energy | |
| Mean Bi T F | Median Value | Wavelet Detail Mean L2 | CQT Mean Power | MFCC Range C3 | DIR | |
| Mean Bi D F | IQR Value | Wavelet Detail Std L2 | CQT Std Power | MFCC Entropy C3 | DIR Histogram 1 | |
| Mean Bi T F G | Coef Variation | Wavelet Detail Skewness L2 | CQT Skewness Power | MFCC Mean Abs Diff C3 | DIR Histogram 2 | |
| Mean Bi D F G | Region Area | Wavelet Detail Kurtosis L2 | CQT Kurtosis Power | Spectral Centroid C3 | DIR Histogram 3 | |
| Mean Bi T F H | Bounding Box Area | Wavelet Detail Entropy L2 | CQT Total Energy | Spectral Bandwidth C3 | MAG Mean | |
| Mean Bi D F H | Aspect Ratio | Wavelet Detail Log Energy L2 | CQT Temporal Centroid | MFCC Mean C4 | MAG Std | |
| WCOB Tx | Centroid X | Wavelet Detail Max To Min Ratio L2 | CQT Temporal Spread | MFCC Std C4 | MAG Max | |
| WCOB Ty | Centroid Y | Wavelet Detail Spectral Centroid L2 | CQT Spectral Centroid | MFCC Skewness C4 | MAG Min | |
| WCOB Dx | Perimeter | Wavelet Detail Spectral Bandwidth L2 | CQT Spectral Bandwidth | MFCC Kurtosis C4 | DIR Entropy | |
| WCOB Dy | Compactness | Wavelet Approx Mean L3 | CQT Spectral Flatness | MFCC Median C4 | DIR Energy | |
| WCOB T Fx | Bounding Box Diagonal | Wavelet Approx Std L3 | CQT Time Entropy | MFCC Range C4 | Noise Metrics | |
| WCOB T Fy | Peak Value | Wavelet Approx Skewness L3 | CQT Freq Entropy | MFCC Entropy C4 | NoiseToHarmonicRatio | |
| WCOB D Fx | Frequency Centroid X | Wavelet Approx Kurtosis L3 | CQT Gabor Energy Std | MFCC Mean Abs Diff C4 | Shimmer | |
| WCOB D Fy | Frequency Centroid Y | Wavelet Approx Entropy L3 | CQT Gabor Mean Mean | Spectral Centroid C4 | Jitter | |
| H1 | Frequency Bandwidth X | Wavelet Approx Log Energy L3 | CQT Gabor Std Mean | Spectral Bandwidth C4 | PBP | |
| H2 | Frequency Bandwidth Y | Wavelet Approx Max To Min Ratio L3 | CQT Gabor Skewness Mean | MFCC Mean C5 | PBP Mean | |
| En F Bi D 1 | Spectral Flux | Wavelet Approx Spectral Centroid L3 | CQT Gabor Kurtosis Mean | MFCC Std C5 | PBP Variance | |
| Mean Bi D F 1 | Entropy Value | Wavelet Approx Spectral Bandwidth L3 | CQT Freq Shifts Mean | MFCC Skewness C5 | PBP Skewness | |
| WCOB D Fx 1 | Texture Contrast | Wavelet Detail Mean L3 | CQT Freq Shifts Std | MFCC Kurtosis C5 | PBP Kurtosis | |
| WCOB D Fy 1 | Texture Correlation | Wavelet Detail Std L3 | CQT Freq Shifts Dynamic Range | MFCC Median C5 | PBP Entropy | |
| Hf1 1 | Texture Energy | Wavelet Detail Skewness L3 | CQT Freq Intervals Mean | MFCC Range C5 | LBP | |
| Hf2 1 | Texture Homogeneity | Wavelet Detail Kurtosis L3 | CQT Freq Intervals Std | MFCC Entropy C5 | LBP Mean | |
| En F Bi D 2 | Fractal Dimension | Wavelet Detail Entropy L3 | CQT Bandwidth Mean | MFCC Mean Abs Diff C5 | LBP Variance | |
| Mean Bi D F 2 | Connected Components | Wavelet Detail Log Energy L3 | CQT Bandwidth Std | Spectral Centroid C5 | LBP Skewness | |
| WCOB D Fx 2 | Euler Number | Wavelet Detail Max To Min Ratio L3 | CQT Bandwidth Dynamic Range | Spectral Bandwidth C5 | LBP Kurtosis | |
| WCOB D Fy 2 | Num Holes | Wavelet Detail Spectral Centroid L3 | | | LBP Entropy | |
PSD: Power Spectral Density; Welch’s method: Welch’s method spectral estimation; CQT: Constant-Q Transform; MFCCs: Mel-Frequency Cepstral Coefficients; LBP: Local Binary Patterns; PBP: Probabilistic Binary Patterns; TP: Ternary Patterns; GBP: Gradient Binary Patterns; EP: Energy Patterns; AVP: Amplitude Variation Patterns; TQWT: Tunable Q-Factor Wavelet Transform; VMD: Variational Mode Decomposition; PSR: Phase Space Reconstruction; DFA: Detrended Fluctuation Analysis; NHR: Noise-to-Harmonics Ratio.
Table A2. Summary of selected features for each model.
| Feature Number | Non-OSA vs. Mild-OSA | Non-OSA vs. Moderate-OSA | Non-OSA vs. Severe-OSA | Mild-OSA vs. Moderate-OSA | Mild-OSA vs. Severe-OSA | Moderate-OSA vs. Severe-OSA |
|---|---|---|---|---|---|---|
| 1 | MouthExpiration_Gap_163-296_SpectralEntropy | Average_Gap_45-329_SpectralSkewness | NoseInspiration_Gap_816-885_SpectralKurtosis | NoseExpiration_Gap_299-707_MeanPower | MouthExpiration_Gap_367-515_SCBW_Bandwidth | NoseExpiration_Gap_384-567_SpectralCrest |
| 2 | MouthExpiration_Gap_163-296_SpectralCrest | NoseInspiration_Gap_321-577_SpectralSkewness | NoseInspiration_Gap_816-885_FreqSkewness1 | Mouth Expiration_BBox_229_712_22_26_iqrValue | Average_Average_BBoxes_EnFBiD_2 | Average_BBox_584_362_21_35_numHoles |
| 3 | MouthExpiration_Gap_1375-1499_SpectralCrest | Average_BBox_548_370_24_29_entropyValue | NoseExpiration_Gap_986-1325_SCBW_Bandwidth | Mouth_Inspiration_FNMidFlow_MFCCMean_C1 | Nose Expiration_BBox_787_244_15_15_rangeValue | Mouth_Inspiration_FNMidFlow_WaveletApproxEntropyL2 |
| 4 | NoseInspiration_Gap_1066-1424_SCBW_Bandwidth | Mouth Inspiration_BBox_203_786_12_11_rangeValue | Average_BBox_527_294_11_18_textureEnergy | Nose_Expiration_FNMidFlow_WaveletApproxSkewnessL3 | Average_Gap_73-307_SpectralCrest | Nose_Expiration_FNMidFlow_PeakCount |
| 5 | Mouth_Inspiration_FNMidFlow_MFCCMean_C1 | Mouth_Expiration_FNMidFlow_WaveletApproxSpectralBandwidthL1 | Average_BBox_526_336_15_18_centroidY | Mouth_Inspiration_FN_MFCCMean_C1 | Nose Inspiration_BBox_766_223_15_15_medianValue | Mouth_Inspiration_FN_MFCCMean_C2 |
| 6 | Mouth_Inspiration_FNMidFlow_MFCCMedian_C5 | Mouth_Expiration_FNMidFlow_WaveletDetailEntropyL3 | Nose Inspiration_BBox_0_0_22_23_aspectRatio | Mouth_Inspiration_FN_MFCCStd_C1 | Nose Inspiration_BBox_220_207_581_615_peakValue | MouthExpiration_Gap_282-460_SCBW_Bandwidth |
| 7 | Nose_Expiration_FNMidFlow_WaveletApproxMeanL3 | Mouth Inspiration_BBox_229_756_15_17_iqrValue | Mouth Inspiration_BBox_0_934_73_89_eulerNumber | Mouth_Inspiration_FN_SpectralCentroid_C1 | Nose Inspiration_BBox_766_223_15_15_frequencyCentroidY | MouthInspiration_Gap_79-281_RMS |
| 8 | NoseExpiration_Gap_534-1003_BandPower | Nose Inspiration_BBox_533_433_14_13_coefVariation | Mouth Inspiration_BBox_74_955_16_16_meanValue | Mouth_Inspiration_FN_SpectralBandwidth_C1 | Nose Inspiration_BBox_989_487_34_58_peakValue | NoseInspiration_Gap_327-532_MeanPower |
| 9 | NoseExpiration_Gap_1029-1329_SpectralSkewness | Nose Expiration_BBox_515_348_11_29_entropyValue | Mouth Inspiration_BBox_74_955_16_16_stdValue | Mouth_Inspiration_FN_MFCCKurtosis_C2 | Nose Inspiration_BBox_989_487_34_58_frequencyCentroidY | NoseInspiration_Gap_327-532_SCBW_Bandwidth |
| 10 | Mouth Inspiration_BBox_10_480_22_39_numHoles | Average_FNMidFlow_DIR_Histogram_2 | Mouth Expiration_Average_BBoxes_totalEnergy | Mouth_Expiration_FN_WaveletDetailSpectralCentroidL2 | Nose Inspiration_BBox_989_487_34_58_spectralFlux | NoseInspiration_Gap_327-532_SpectralEnergy |
| 11 | Mouth Expiration_BBox_30_0_11_13_frequencyCentroidX | Average_FNMidFlow_AVP_Histogram_2 | Mouth Expiration_Average_BBoxes_normalizedEnergy | Mouth_Expiration_FN_CQTBandwidthDynamicRange | Nose Inspiration_BBox_989_487_34_58_connectedComponents | NoseInspiration_Gap_327-532_BandPower |
| 12 | Mouth Expiration_BBox_504_333_51_42_kurtosisValue | Mouth_Inspiration_FNMidFlow_ZeroCrossing1 | Mouth Expiration_Average_BBoxes_stdAbsBi | Mouth_Expiration_FN_MFCCStd_C1 | Nose Inspiration_BBox_989_487_34_58_eulerNumber | Mouth Inspiration_BBox_290_464_45_80_regionArea |
| 13 | Mouth Expiration_BBox_504_333_51_42_boundingBoxArea | Mouth_Inspiration_FNMidFlow_WaveletDetailSpectralBandwidthL1 | Mouth Expiration_Average_BBoxes_symStd | Mouth_Expiration_FN_SpectralCentroid_C1 | Nose Inspiration_BBox_226_683_94_101_iqrValue | Nose Inspiration_BBox_367_371_287_304_iqrValue |
| 14 | Mouth Expiration_BBox_504_333_51_42_boundingBoxDiagonal | Mouth_Inspiration_FNMidFlow_WaveletDetailSpectralBandwidthL4 | Mouth Expiration_BBox_588_259_169_356_perimeter | Mouth_Expiration_FN_MFCCSkewness_C2 | Nose Inspiration_BBox_226_683_94_101_frequencyCentroidX | Nose Inspiration_BBox_367_371_287_304_textureHomogeneity |
| 15 | Mouth Expiration_BBox_977_478_46_82_numHoles | Mouth_Inspiration_FNMidFlow_MFCCMean_C2 | Mouth Expiration_BBox_266_279_435_530_centroidY | Mouth_Expiration_FN_MFCCKurtosis_C2 | Nose Inspiration_BBox_241_721_42_42_boundingBoxArea | Nose Inspiration_BBox_367_371_287_304_numHoles |
| 16 | Mouth Expiration_BBox_0_975_48_48_aspectRatio | Mouth_Inspiration_FNMidFlow_MFCCKurtosis_C5 | Mouth Expiration_BBox_774_364_23_29_perimeter | Mouth_Expiration_FN_MFCCKurtosis_C3 | Nose Inspiration_BBox_241_721_42_42_frequencyCentroidY | Nose Expiration_BBox_315_321_390_424_textureContrast |
| 17 | Mouth Expiration_BBox_0_975_48_48_textureContrast | Mouth_Inspiration_FNMidFlow_PBP_Skewness | Mouth Expiration_BBox_300_553_19_16_centroidX | Nose_Inspiration_FN_ZeroCrossing2 | Nose Inspiration_BBox_243_745_14_14_textureEnergy | Nose Expiration_BBox_582_344_24_27_aspectRatio |
| 18 | Nose Inspiration_BBox_0_959_64_64_boundingBoxDiagonal | Mouth_Inspiration_FNMidFlow_TP_Histogram_1 | Nose Inspiration_Average_BBoxes_EnFBiD_1 | Nose_Inspiration_FN_WaveletApproxSkewnessL1 | Nose Inspiration_BBox_243_745_14_14_eulerNumber | Nose Expiration_BBox_582_344_24_27_perimeter |
| 19 | Nose Inspiration_BBox_0_959_64_64_textureEnergy | Mouth_Inspiration_FNMidFlow_TP_MaxProb | Nose Inspiration_Average_BBoxes_MeanBiDF_1 | Nose_Inspiration_FN_WaveletApproxSpectralCentroidL2 | Nose Inspiration_BBox_223_765_15_16_stdValue | Average_FNMidFlow_WaveletDetailMaxToMinRatioL3 |
| 20 | Average_FNMidFlow_WaveletApproxMaxToMinRatioL1 | Mouth_Inspiration_FNMidFlow_EP_MaxEnergy | Nose Inspiration_Average_BBoxes_WCOBDFx_1 | Nose_Inspiration_FN_WaveletDetailKurtosisL3 | Nose Inspiration_BBox_223_765_15_16_entropyValue | Average_FNMidFlow_WaveletDetailKurtosisL4 |
| 21 | Nose_Inspiration_FNMidFlow_HurstExponent | Mouth_Expiration_FNMidFlow_WaveletApproxKurtosisL1 | Nose Inspiration_Average_BBoxes_WCOBDFy_1 | Nose_Inspiration_FN_CQTStdPower | Nose Inspiration_BBox_202_787_14_17_entropyValue | Mouth_Inspiration_FNMidFlow_LyapunovExponentMean |
| 22 | Nose_Expiration_FN_KatzFD | Mouth_Expiration_FNMidFlow_WaveletApproxSkewnessL2 | Nose Inspiration_Average_BBoxes_Hf1_1 | Nose_Inspiration_FN_CQTSkewnessPower | Nose Inspiration_BBox_202_787_14_17_textureContrast | Mouth_Inspiration_FNMidFlow_BandPowerHigh |
| 23 | MouthInspiration_Gap_1217-1499_PeakFrequency | Mouth_Expiration_FNMidFlow_PBP_Kurtosis | Nose Inspiration_Average_BBoxes_Hf2_1 | Nose_Inspiration_FN_CQTTemporalCentroid | Nose Inspiration_BBox_202_787_14_17_textureEnergy | Nose_Inspiration_FNMidFlow_WaveletApproxEntropyL1 |
| 24 | Mouth Expiration_BBox_504_333_51_42_centroidY | Mouth_Expiration_FNMidFlow_PBP_Entropy | Nose Inspiration_Average_BBoxes_reserved1 | Nose_Inspiration_FN_CQTSpectralCentroid | Nose Inspiration_BBox_0_989_39_34_stdValue | Nose_Inspiration_FNMidFlow_WaveletApproxEntropyL2 |
| 25 | Mouth Expiration_BBox_504_333_51_42_peakValue | MouthInspiration_Gap_0-18_FM2MFreq1 | Nose Inspiration_Average_BBoxes_EnFBiD_2 | Nose_Inspiration_FN_CQTBandwidthDynamicRange | Nose Expiration_BBox_0_0_39_44_connectedComponents | Nose_Inspiration_FNMidFlow_WaveletDetailSpectralCentroidL3 |
| 26 | Mouth Expiration_BBox_977_478_46_82_textureCorrelation | MouthInspiration_Gap_64-362_MeanPower | Nose Inspiration_Average_BBoxes_MeanBiDF_2 | Nose_Inspiration_FN_MFCCSkewness_C1 | Nose Expiration_BBox_465_0_66_46_medianValue | Nose_Expiration_FNMidFlow_WaveletApproxSkewnessL2 |
| 27 | Nose Inspiration_BBox_0_0_38_38_rangeValue | MouthInspiration_Gap_64-362_RMS | Nose Inspiration_Average_BBoxes_WCOBDFx_2 | Nose_Inspiration_FN_MFCCKurtosis_C1 | Nose Expiration_BBox_465_0_66_46_iqrValue | Average_FN_AVP_Mean |
| 28 | Nose Inspiration_BBox_477_0_55_39_stdValue | MouthInspiration_Gap_64-362_FM2MFreq1 | Nose Inspiration_Average_BBoxes_WCOBDFy_2 | Nose_Inspiration_FN_SpectralBandwidth_C1 | Nose Expiration_BBox_465_0_66_46_textureCorrelation | Mouth_Inspiration_FN_KatzFD |
| 29 | Nose Expiration_BBox_0_981_41_42_fractalDimension | MouthInspiration_Gap_64-362_BandPower | Nose Inspiration_Average_BBoxes_Hf1_2 | Nose_Inspiration_FN_MFCCKurtosis_C2 | Nose Expiration_BBox_465_0_66_46_textureEnergy | Mouth_Inspiration_FN_LyapunovExponentMax |
| 30 | Nose Expiration_BBox_495_986_47_37_stdValue | MouthInspiration_Gap_378-565_StandardDeviation | Nose Inspiration_Average_BBoxes_Hf2_2 | Nose_Inspiration_FN_TP_Skewness | Nose Expiration_BBox_465_0_66_46_fractalDimension | Mouth_Expiration_FN_MFCCMedian_C3 |
| 31 | Mouth_Expiration_FNMidFlow_MFCCMedian_C1 | MouthInspiration_Gap_378-565_FM2MFreq1 | Nose Inspiration_Average_BBoxes_reserved2 | Nose_Expiration_FN_WaveletDetailSkewnessL2 | Nose Expiration_BBox_465_0_66_46_connectedComponents | Nose_Inspiration_FN_WaveletDetailEntropyL3 |
| 32 | Mouth_Expiration_FNMidFlow_PBP_Kurtosis | MouthInspiration_Gap_584-810_FM2MFreq1 | Nose Inspiration_Average_BBoxes_bisEntropy | Nose_Expiration_FN_CQTMeanPower | Nose Expiration_BBox_873_128_24_23_centroidY | Nose_Inspiration_FN_CQTGaborEnergyMean |
| 33 | Mouth_Expiration_FNMidFlow_TP_Histogram_1 | MouthInspiration_Gap_837-1499_FM2MFreq1 | Nose Inspiration_BBox_140_140_733_744_spectralFlux | Nose_Expiration_FN_CQTSkewnessPower | Nose Expiration_BBox_873_128_24_23_compactness | Nose_Inspiration_FN_MFCCMedian_C3 |
| 34 | Mouth_Expiration_FNMidFlow_TP_MaxProb | MouthExpiration_Gap_0-353_StandardDeviation | Nose Inspiration_BBox_140_140_733_744_connectedComponents | Nose_Expiration_FN_CQTSpectralDynamicsStd | Nose Expiration_BBox_841_162_21_21_energyValue | Average_Gap_306-496_Maximum |
| 35 | Mouth_Expiration_FNMidFlow_TP_Ratio1 | MouthExpiration_Gap_815-1435_StandardDeviation | Nose Inspiration_BBox_782_151_70_73_skewnessValue | Mouth_Inspiration_FNMidFlow_ZeroCrossing1 | Nose Expiration_BBox_787_244_15_15_frequencyCentroidX | Average_Gap_1282-1359_SCBW_Bandwidth |

References

  1. Rizzo, D.; Baltzan, M.; Sirpal, S.; Dosman, J.; Kaminska, M.; Chung, F. Prevalence and Regional Distribution of Obstructive Sleep Apnea in Canada: Analysis from the Canadian Longitudinal Study on Aging. Can. J. Public Health 2024, 115, 970–979.
  2. Lechat, B.; Naik, G.; Reynolds, A.; Aishah, A.; Scott, H.; Loffler, K.A.; Vakulin, A.; Escourrou, P.; McEvoy, R.D.; Adams, R.J.; et al. Multinight Prevalence, Variability, and Diagnostic Misclassification of Obstructive Sleep Apnea. Am. J. Respir. Crit. Care Med. 2022, 205, 563–569.
  3. Singh, M.; Liao, P.; Kobah, S.; Wijeysundera, D.N.; Shapiro, C.; Chung, F. Proportion of Surgical Patients with Undiagnosed Obstructive Sleep Apnoea. Br. J. Anaesth. 2013, 110, 629–636.
  4. Espiritu, J.R.D. Health Consequences of Obstructive Sleep Apnea. In Management of Obstructive Sleep Apnea; Springer International Publishing: Cham, Switzerland, 2021; pp. 23–43.
  5. American Academy of Sleep Medicine. Hidden Health Crisis Costing America Billions: Underdiagnosing and Undertreating Obstructive Sleep Apnea Draining Healthcare System, 1st ed.; Frost & Sullivan: Darien, IL, USA, 2016.
  6. The Harvard Medical School Division of Sleep Medicine. The Price of Fatigue: The Surprising Economic Costs of Unmanaged Sleep Apnea; Harvard Medical School Division of Sleep Medicine Boston: Boston, MA, USA, 2010.
  7. Colten, H.R.; Altevogt, B.M. Sleep Disorders and Sleep Deprivation; National Academies Press: Washington, DC, USA, 2006; ISBN 978-0-309-10111-0.
  8. American Academy of Sleep Medicine. International Classification of Sleep Disorders: Diagnostic & Coding Manual, 2nd ed.; American Academy of Sleep Medicine: Westchester, IL, USA, 2005; ISBN 0965722023.
  9. Noda, A.; Yasuma, F.; Miyata, S.; Iwamoto, K.; Yasuda, Y.; Ozaki, N. Sleep Fragmentation and Risk of Automobile Accidents in Patients with Obstructive Sleep Apnea—Sleep Fragmentation and Automobile Accidents in OSA. Health 2019, 11, 171–181.
  10. Young, T.; Finn, L.; Peppard, P.E.; Szklo-Coxe, M.; Austin, D.; Nieto, F.J.; Stubbs, R.; Hla, K.M. Sleep Disordered Breathing and Mortality: Eighteen-Year Follow-up of the Wisconsin Sleep Cohort. Sleep 2008, 31, 1071.
  11. Yoshihisa, A.; Takeishi, Y. Sleep Disordered Breathing and Cardiovascular Diseases. J. Atheroscler. Thromb. 2019, 26, 315–327.
  12. Berry, R.B.; Brooks, R.; Gamaldo, C.E.; Harding, S.M.; Marcus, C.; Vaughn, B.V. The AASM Manual for the Scoring of Sleep and Associated Events, Rules, Terminology and Technical Specifications; American Academy of Sleep Medicine: Darien, IL, USA, 2012; Volume 176.
  13. Kushida, C.A.; Littner, M.R.; Morgenthaler, T.; Alessi, C.A.; Bailey, D.; Coleman, J.; Friedman, L.; Hirshkowitz, M.; Kapen, S.; Kramer, M.; et al. Practice Parameters for the Indications for Polysomnography and Related Procedures: An Update for 2005. Sleep 2005, 28, 499–523.
  14. Bradley, T.D.; Floras, J.S. Sleep Apnea: Implications in Cardiovascular and Cerebrovascular Disease, 2nd ed.; Bradley, T.D., Floras, J.S., Eds.; CRC Press: Boca Raton, FL, USA, 2013; ISBN 978-0-429-11867-8.
  15. Butt, M.; Dwivedi, G.; Khair, O.; Lip, G.Y.H. Obstructive Sleep Apnea and Cardiovascular Disease. Int. J. Cardiol. 2010, 139, 7–16.
  16. Chen, L.; Pivetta, B.; Nagappa, M.; Saripella, A.; Islam, S.; Englesakis, M.; Chung, F. Validation of the STOP-Bang Questionnaire for Screening of Obstructive Sleep Apnea in the General Population and Commercial Drivers: A Systematic Review and Meta-Analysis. Sleep Breath. 2021, 25, 1741–1751.
  17. Mazzotti, D.R.; Keenan, B.T.; Thorarinsdottir, E.H.; Gislason, T.; Pack, A.I.; Pack, A.I.; Schwab, R.; Maislin, G.; Keenan, B.T.; Jafari, N.; et al. Is the Epworth Sleepiness Scale Sufficient to Identify the Excessively Sleepy Subtype of OSA? Chest 2022, 161, 557–561.
  18. Yadollahi, A.; Moussavi, Z. Acoustic Obstructive Sleep Apnea Detection. In Proceedings of the 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, MN, USA, 3–6 September 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 7110–7113.
  19. Elwali, A.; Moussavi, Z. A Novel Decision Making Procedure during Wakefulness for Screening Obstructive Sleep Apnea Using Anthropometric Information and Tracheal Breathing Sounds. Sci. Rep. 2019, 9, 11467.
  20. Elwali, A.; Moussavi, Z. Obstructive Sleep Apnea Screening and Airway Structure Characterization During Wakefulness Using Tracheal Breathing Sounds. Ann. Biomed. Eng. 2017, 45, 839–850.
  21. Montazeri, A.; Giannouli, E.; Moussavi, Z. Assessment of Obstructive Sleep Apnea and Its Severity during Wakefulness. Ann. Biomed. Eng. 2012, 40, 916–924.
  22. Hajipour, F.; Jozani, M.J.; Moussavi, Z. A Comparison of Regularized Logistic Regression and Random Forest Machine Learning Models for Daytime Diagnosis of Obstructive Sleep Apnea. Med. Biol. Eng. Comput. 2020, 58, 2517–2529.
  23. Hajipour, F.; Jozani, M.J.; Elwali, A.; Moussavi, Z. Regularized Logistic Regression for Obstructive Sleep Apnea Screening during Wakefulness Using Daytime Tracheal Breathing Sounds and Anthropometric Information. Med. Biol. Eng. Comput. 2019, 57, 2641–2655.
  24. Simply, R.M.; Dafna, E.; Zigel, Y. Diagnosis of Obstructive Sleep Apnea Using Speech Signals From Awake Subjects. IEEE J. Sel. Top. Signal Process. 2020, 14, 251–260.
  25. Sola-Soler, J.; Fiz, J.A.; Torres, A.; Jane, R. Identification of Obstructive Sleep Apnea Patients from Tracheal Breath Sound Analysis during Wakefulness in Polysomnographic Studies. In Proceedings of the 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, USA, 26–30 August 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 4232–4235.
  26. Alqudah, A.M.; Elwali, A.; Kupiak, B.; Hajipour, F.; Jacobson, N.; Moussavi, Z. Obstructive Sleep Apnea Detection during Wakefulness: A Comprehensive Methodological Review. Med. Biol. Eng. Comput. 2024, 62, 1277–1311.
  27. Tregear, S.; Reston, J.; Schoelles, K.; Phillips, B. Obstructive Sleep Apnea and Risk of Motor Vehicle Crash: Systematic Review and Meta-Analysis. J. Clin. Sleep Med. 2009, 5, 573.
  28. Hajipour, F.; Moussavi, Z. Spectral and Higher Order Statistical Characteristics of Expiratory Tracheal Breathing Sounds During Wakefulness and Sleep in People with Different Levels of Obstructive Sleep Apnea. J. Med. Biol. Eng. 2019, 39, 244–250.
  29. Elwali, A.; Moussavi, Z. Predicting Polysomnography Parameters from Anthropometric Features and Breathing Sounds Recorded during Wakefulness. Diagnostics 2021, 11, 905.
  30. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics 2001, 17, 520–525.
  31. Batista, G.E.A.P.A.; Monard, M.C. An Analysis of Four Missing Data Treatment Methods for Supervised Learning. Appl. Artif. Intell. 2003, 17, 519–533.
  32. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman and Hall/CRC: Boca Raton, FL, USA, 1994; ISBN 9780429246593.
  33. Rangayyan, R.M.; Reddy, N.P. Biomedical Signal Analysis: A Case-Study Approach; Pergamon Press: New York, NY, USA, 2002; Volume 30.
  34. Mendel, J.M. Tutorial on Higher-Order Statistics (Spectra) in Signal Processing and System Theory: Theoretical Results and Some Applications. Proc. IEEE 1991, 79, 278–305.
  35. Astfalck, L.C.; Sykulski, A.M.; Cripps, E.J. Debiasing Welch’s Method for Spectral Density Estimation. Biometrika 2023, 111, 1313–1329.
  36. Jiang, M.; Wang, D.; Kuang, Y.; Mo, X. A Bicoherence-Based Nonlinearity Measurement Method for Identifying the Quadratic Phase Coupling of Nonlinear Systems. Int. J. Non Linear Mech. 2021, 131, 103–109.
  37. Dlask, M.; Kukal, J. Hurst Exponent Estimation from Short Time Series. Signal Image Video Process. 2019, 13, 263–269.
  38. Farrús, M.; Hernando, J.; Ejarque, P. Jitter and Shimmer Measurements for Speaker Recognition. In Proceedings of the Interspeech 2007, Antwerp, Belgium, 27–31 August 2007; ISCA: Singapore, 2007; pp. 778–781.
  39. Jotz, G.P.; Cervantes, O.; Abrahão, M.; Settanni, F.A.P.; de Angelis, E.C. Noise-to-Harmonics Ratio as an Acoustic Measure of Voice Disorders in Boys. J. Voice 2002, 16, 28–31.
  40. Gosala, B.; Kapgate, P.D.; Jain, P.; Chaurasia, R.N.; Gupta, M. Wavelet transforms for feature engineering in EEG data processing: An application on Schizophrenia. Biomed. Signal Process. Control 2023, 85, 104811.
  41. Abdul, Z.K.; Al-Talabani, A.K. Mel Frequency Cepstral Coefficient and Its Applications: A Review. IEEE Access 2022, 10, 122136–122158.
  42. Wang, J.-C.; Wang, J.-F.; Weng, Y.-S. Chip design of MFCC extraction for speech recognition. Integration 2002, 32, 111–131.
  43. Kohlrausch, A. Binaural Masking Experiments Using Noise Maskers with Frequency-Dependent Interaural Phase Differences. II: Influence of Frequency and Interaural-Phase Uncertainty. J. Acoust. Soc. Am. 1990, 88, 1749–1756.
  44. Rosenstein, M.T.; Collins, J.J.; De Luca, C.J. A Practical Method for Calculating Largest Lyapunov Exponents from Small Data Sets. Phys. D 1993, 65, 117–134.
  45. Zhao, K.; Wen, H.; Guo, Y.; Scano, A.; Zhang, Z. Feasibility of Recurrence Quantification Analysis (RQA) in Quantifying Dynamical Coordination among Muscles. Biomed. Signal Process. Control 2023, 79, 104042.
  46. Borowska, M. Entropy-Based Algorithms in the Analysis of Biomedical Signals. In Studies in Logic, Grammar and Rhetoric; University of Bialystok: Bialystok, Poland, 2015; Volume 43, pp. 21–32.
  47. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
  48. Divya, S.; Suresh, L.P.; John, A. Image Feature Generation Using Binary Patterns—LBP, SLBP and GBP. In ICT Analysis and Applications; Springer: Singapore, 2022; pp. 233–239.
  49. Selesnick, I.W. Wavelet Transform with Tunable Q-Factor. IEEE Trans. Signal Process. 2011, 59, 3560–3575.
  50. Márton, L.F.; Brassai, S.T.; Bakó, L.; Losonczi, L. Detrended fluctuation analysis of EEG signals. Procedia Technol. 2014, 12, 125–132.
  51. Vergara, J.R.; Estévez, P.A. A Review of Feature Selection Methods Based on Mutual Information. Neural Comput. Appl. 2014, 24, 175–186.
  52. Zhao, Z.; Liu, H. Feature Selection Based on Mutual Information with Correlation Coefficient. Appl. Intell. 2022, 52, 1169–1180.
  53. Liu, S.; Motani, M. Improving Mutual Information Based Feature Selection by Boosting Unique Relevance. arXiv 2022, arXiv:2212.06143.
  54. Singh, D.; Singh, B. Feature Wise Normalization: An Effective Way of Normalizing Data. Pattern Recognit. 2022, 122, 108307.
  55. Haury, A.-C.; Gestraud, P.; Vert, J.-P. The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures. PLoS ONE 2011, 6, e28210.
  56. Wang, D.; Zhang, H.; Liu, R.; Lv, W.; Wang, D. t-Test Feature Selection Approach Based on Term Frequency for Text Categorization. Pattern Recognit. Lett. 2014, 45, 1–10.
  57. Khoshgoftaar, T.M.; Wang, H.; Liang, Q.; Hancock, J.T. Feature Selection Strategies: A Comparative Analysis of SHAP-Value and Importance-Based Methods. J. Big Data 2024, 11, 44.
  58. Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 40, 185–197.
  59. Mounce, S.R.; Ellis, K.; Edwards, J.M.; Speight, V.L.; Jakomis, N.; Boxall, J.B. Ensemble Decision Tree Models Using RUSBoost for Estimating Risk of Iron Failure in Drinking Water Distribution Systems. Water Resour. Manag. 2017, 31, 925–942.
  60. Wolpert, D.H. Stacked Generalization. Neural Netw. 1992, 5, 241–259.
  61. Wang, M.; Qian, Y.; Yang, Y.; Chen, H.; Rao, W.-F. Improved Stacking Ensemble Learning Based on Feature Selection to Accurately Predict Warfarin Dose. Front. Cardiovasc. Med. 2024, 10, 1320938.
  62. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2021, 2, 160.
  63. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C.J., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25.
  64. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
  65. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
  66. Bühlmann, P.; Yu, B. Analyzing Bagging. Ann. Stat. 2002, 30, 927–961.
  67. Kwon, Y.; Zou, J. Data-OOB: Out-of-Bag Estimate as a Simple and Efficient Data Value. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: Cambridge, MA, USA, 2023; Volume 202, pp. 18135–18152.
  68. Klevak, E.; Lin, S.; Martin, A.; Linda, O.; Ringger, E. Out-Of-Bag Anomaly Detection. arXiv 2020, arXiv:2009.09358.
  69. Japkowicz, N.; Stephen, S. The Class Imbalance Problem: A Systematic Study. Intell. Data Anal. 2002, 6, 429–449.
  70. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  71. Varma, S.; Simon, R. Bias in Error Estimation When Using Cross-Validation for Model Selection. BMC Bioinform. 2006, 7, 91.
  72. Cawley, G.C.; Talbot, N.L.C. On Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107.
  73. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence—Volume 2; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; pp. 1137–1143.
  74. Dietterich, T.G. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput. 1998, 10, 1895–1923.
  75. Alqudah, A.M.; Moussavi, Z. A Review of Deep Learning for Biomedical Signals: Current Applications, Advancements, Future Prospects, Interpretation, and Challenges. Comput. Mater. Contin. 2025, 83, 3753–3841.
  76. Finkelstein, Y.; Wolf, L.; Nachmani, A.; Lipowezky, U.; Rub, M.; Shemer, S.; Berger, G. Velopharyngeal Anatomy in Patients With Obstructive Sleep Apnea Versus Normal Subjects. J. Oral Maxillofac. Surg. 2014, 72, 1350–1372.
  77. Goldshtein, E.; Tarasiuk, A.; Zigel, Y. Automatic Detection of Obstructive Sleep Apnea Using Speech Signals. IEEE Trans. Biomed. Eng. 2011, 58, 1373–1382.
  78. Qi, F.; Li, C.; Wang, S.; Zhang, H.; Wang, J.; Lu, G. Contact-Free Detection of Obstructive Sleep Apnea Based on Wavelet Information Entropy Spectrum Using Bio-Radar. Entropy 2016, 18, 306.
  79. Shams, E.; Karimi, D.; Moussavi, Z. Bispectral Analysis of Tracheal Breath Sounds for Obstructive Sleep Apnea. In Proceedings of the 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Diego, CA, USA, 28 August–1 September 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 37–40.
  80. Gramegna, A.; Giudici, P. Shapley Feature Selection. FinTech 2022, 1, 72–80.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
