4.2. Data Processing and Experimental Setup
SpO2 channel signals are extracted from polysomnography (PSG) recordings, and samples are categorized as normal (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), or severe (AHI ≥ 30) according to the Apnea–Hypopnea Index (AHI).
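The AHI thresholds above translate directly into a labeling rule. A minimal sketch follows; the function name `ahi_to_severity` is illustrative and not taken from the study's code.

```python
def ahi_to_severity(ahi: float) -> str:
    """Map an Apnea-Hypopnea Index (events per hour) to one of four severity labels."""
    if ahi < 0:
        raise ValueError("AHI must be non-negative")
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


labels = [ahi_to_severity(a) for a in (2.0, 5.0, 22.5, 30.0)]
print(labels)  # ['normal', 'mild', 'moderate', 'severe']
```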
Figure 7 compares SpO2 signals from diseased and non-diseased patients.
Because deep learning models require uniform sequence lengths and recording durations vary substantially in SHHS1 (~3 to ~8.9 h), all SpO2 sequences are truncated or zero-padded to 8 h at 1 Hz (8 × 3600 = 28,800 samples). The resulting sequences are denoised using OSA-aware adaptive smoothing (see Section 3.3 for details). Min–max normalization is applied to all SpO2 signals, mapping values to the [0, 1] interval.
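The length standardization and normalization steps can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the text does not say which end of a long recording is truncated, where the padding is placed, or whether the min–max bounds are computed per record, so the sketch keeps the start of the recording, pads at the end, and normalizes each record separately. The function names are hypothetical.

```python
TARGET_LEN = 8 * 3600  # 8 h at 1 Hz = 28,800 samples


def fix_length(seq, target_len=TARGET_LEN, pad_value=0.0):
    """Truncate to target_len, or right-pad with pad_value (assumed: keep start, pad end)."""
    seq = list(seq)
    if len(seq) >= target_len:
        return seq[:target_len]
    return seq + [pad_value] * (target_len - len(seq))


def min_max_normalize(seq):
    """Rescale values to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0] * len(seq)
    return [(v - lo) / (hi - lo) for v in seq]


# Toy 3-hour recording at a constant 96% saturation with one dip to 88%.
recording = [96.0] * (3 * 3600)
recording[100] = 88.0
fixed = fix_length(recording)
normalized = min_max_normalize(fixed)
print(len(fixed), min(normalized), max(normalized))  # 28800 0.0 1.0
```

Under these assumptions the zero padding defines the minimum of a short record, so genuine SpO2 values are compressed toward the top of the [0, 1] range. Normalizing with fixed physiological bounds would avoid this, but the text does not state which variant is used.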
To address class imbalance, oversampling, undersampling, and the SMOTE (Synthetic Minority Over-sampling Technique) algorithm are employed to obtain a balanced distribution of samples across categories during training. The model is trained with a batch size of 64 for 200 epochs, using the cross-entropy loss and the Adam optimizer with an initial learning rate of 0.01 to improve training efficiency.
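To make the SMOTE step concrete, the sketch below shows only its core idea: a synthetic minority sample is created by interpolating between a minority sample and one of its nearest minority-class neighbours. It is a simplified illustration; the neighbour count `k`, the seed, and the function name `smote_like` are assumptions, and the study's actual sampler settings are not given in the text.

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [s for j, s in enumerate(minority) if j != idx]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for minority-class samples.
minority = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
new_samples = smote_like(minority, n_new=4, k=2)
print(len(new_samples))  # 4
```

Each synthetic point lies on the segment between two real minority samples, so the class is enlarged without exact duplication.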
Table 2 lists the development environment used in this study.
To avoid overfitting and training stagnation, a learning rate scheduler is used: when the validation loss fails to decrease for five consecutive epochs, the learning rate is halved, down to a floor of 1 × 10⁻⁴, after which it is held constant. This procedure helps maintain stable convergence by reducing the step size whenever training reaches a plateau.
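The scheduling rule can be sketched in a framework-independent way. The class name `PlateauHalving` is hypothetical, and the sketch assumes that "failed to decrease" means the validation loss did not fall below the best value seen so far.

```python
class PlateauHalving:
    """Halve the learning rate after `patience` epochs without improvement, down to `min_lr`."""

    def __init__(self, lr=0.01, patience=5, factor=0.5, min_lr=1e-4):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update internal state with this epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauHalving()
# One improving epoch followed by five stagnant epochs triggers the first halving.
history = [scheduler.step(loss) for loss in (0.9, 0.9, 0.9, 0.9, 0.9, 0.9)]
print(history[-1])  # 0.005
```

Starting from 0.01, seven halvings would fall below the floor, so the rate settles at 1 × 10⁻⁴ after the seventh plateau.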
4.3. Evaluation Metrics
Model performance is evaluated by computing a confusion matrix for each of the four classes and calculating three metrics for each class: sensitivity, specificity, and accuracy. Class-wise averages for each metric and the overall accuracy are then computed. The per-class metrics are defined as:

$$\mathrm{Sen}_i = \frac{TP_i}{TP_i + FN_i} \tag{1}$$

$$\mathrm{Spe}_i = \frac{TN_i}{TN_i + FP_i} \tag{2}$$

$$\mathrm{Acc}_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \tag{3}$$

where 1 ≤ i ≤ C = 4 indexes the OSA severity levels {normal, mild, moderate, severe}; for example, i = 3 corresponds to moderate cases. True positives (TP) and false positives (FP) are cases in which a record is correctly or incorrectly predicted as class i, respectively. Similarly, true negatives (TN) and false negatives (FN) are cases in which a record is correctly identified as not belonging to class i, or is incorrectly excluded from class i, respectively.
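Next, the class-wise averages of the metrics in (1)–(3) and the overall accuracy are given by:

$$\overline{\mathrm{Sen}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Sen}_i, \qquad \overline{\mathrm{Spe}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Spe}_i, \qquad \mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{C} TP_i$$

where N is the total number of test records.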
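The per-class metrics and their averages can be computed from a single multi-class confusion matrix. The sketch below assumes that rows index reference labels and columns index predictions, a convention the text does not state; the function names and the toy counts are illustrative only and are not results from the study.

```python
def one_vs_rest_counts(cm, i):
    """Return (TP, FP, TN, FN) for class i of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    tp = cm[i][i]
    fn = sum(cm[i]) - tp                # class-i records predicted as another class
    fp = sum(row[i] for row in cm) - tp  # other records predicted as class i
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


def ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def summarize(cm):
    """Per-class sensitivity/specificity/accuracy, their means, and overall accuracy."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    sen, spe, acc = [], [], []
    for i in range(n):
        tp, fp, tn, fn = one_vs_rest_counts(cm, i)
        sen.append(ratio(tp, tp + fn))
        spe.append(ratio(tn, tn + fp))
        acc.append(ratio(tp + tn, total))
    return {
        "sensitivity": sen,
        "specificity": spe,
        "accuracy": acc,
        "mean_sensitivity": sum(sen) / n,
        "mean_specificity": sum(spe) / n,
        "overall_accuracy": ratio(sum(cm[i][i] for i in range(n)), total),
    }


# Toy 4-class counts (normal, mild, moderate, severe); not data from the study.
toy_cm = [
    [40, 8, 2, 0],
    [6, 30, 9, 1],
    [1, 7, 35, 5],
    [0, 1, 4, 45],
]
report = summarize(toy_cm)
print(round(report["overall_accuracy"], 4))
```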
In addition to sensitivity, specificity, and accuracy, the F1-score is reported to better account for class imbalance. For each OSA severity level i, the F1-score is defined as:

$$\mathrm{F1}_i = \frac{2 \cdot \mathrm{Pre}_i \cdot \mathrm{Sen}_i}{\mathrm{Pre}_i + \mathrm{Sen}_i}$$

where $\mathrm{Pre}_i = TP_i / (TP_i + FP_i)$ and $\mathrm{Sen}_i$ denote the precision and sensitivity of class i, respectively. The macro-averaged F1-score is obtained by taking the average of $\mathrm{F1}_i$ over all classes.
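A short sketch of the macro-averaged F1 computation is given below, assuming a confusion matrix whose rows are reference labels and whose columns are predictions (a convention not stated in the text). The function name is illustrative.

```python
def macro_f1(cm):
    """Return (per-class F1 list, macro-averaged F1) from a square confusion matrix."""
    n = len(cm)
    f1_scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(row[i] for row in cm) - tp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + sensitivity
        f1_scores.append(2 * precision * sensitivity / denom if denom else 0.0)
    return f1_scores, sum(f1_scores) / n


# Toy 2-class counts; not data from the study.
per_class, macro = macro_f1([[5, 1], [2, 4]])
print([round(v, 4) for v in per_class], round(macro, 4))
```

Because every class contributes equally to the macro average, a minority class with poor F1 lowers the score as much as a majority class would, which is why the measure is informative under class imbalance.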
To quantify the overall reliability beyond chance agreement, Cohen’s kappa coefficient (κ) is calculated between the predicted labels and the reference annotations. Let $p_o$ denote the observed accuracy and $p_e$ the expected accuracy by chance based on the marginal distributions of the confusion matrix; then

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where values of κ above 0.6 are generally interpreted as indicating substantial agreement, and values above 0.8 as almost perfect agreement.
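Cohen's kappa can be computed directly from the confusion matrix: the observed agreement is the diagonal mass, and the chance agreement is the sum over classes of the product of the row and column marginals. The sketch below is a plain-Python illustration with a hypothetical function name and toy counts.

```python
def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix of counts."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    p_o = sum(cm[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(row[j] for row in cm) for j in range(n)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Toy counts: p_o = 9/12 = 0.75, p_e = (6*7 + 6*5)/144 = 0.5, so kappa = 0.5.
print(cohens_kappa([[5, 1], [2, 4]]))  # 0.5
```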
Receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC) are also computed. In the four-class setting, a one-vs-rest strategy is applied, yielding a separate ROC curve and AUC value for each severity level. For the binary experiments with AHI cutoffs of 5 and 15, ROC curves are generated for discriminating normal from apneic subjects at each threshold.
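The one-vs-rest AUC can be illustrated without plotting the curve by using the rank interpretation of AUC: the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that identity in plain Python. The function names and toy probabilities are assumptions, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs, y_true):
    """One AUC per class: class c is treated as positive and all other classes as negative."""
    n_classes = len(probs[0])
    return [
        binary_auc([row[c] for row in probs], [1 if y == c else 0 for y in y_true])
        for c in range(n_classes)
    ]


# Toy predicted probabilities for classes 0..3 (normal, mild, moderate, severe).
probs = [
    [0.7, 0.2, 0.1, 0.0],
    [0.3, 0.5, 0.1, 0.1],
    [0.1, 0.3, 0.5, 0.1],
    [0.0, 0.1, 0.2, 0.7],
    [0.4, 0.4, 0.1, 0.1],
]
y_true = [0, 1, 2, 3, 0]
print(one_vs_rest_auc(probs, y_true))
```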
4.4. OSA Classification Results
In this study, a 10% subset of SHHS1 and the full SHHS2 cohort are used to evaluate the model’s four-class classification performance, as summarized in Table 3. The SHHS1 and SHHS2 test sets comprise 580 and 2651 records, respectively. The confusion matrices for the two datasets are presented in Figure 8 and Figure 9.
On SHHS1, the model achieves 80.51% accuracy, with an average sensitivity of 81.98%, specificity of 92.93%, precision of 81.12%, and F1-score of 81.86%. On SHHS2, the corresponding values are 76.61%, 79.26%, 92.06%, 76.44%, and 77.39%, respectively.
For the four-class task, the macro-averaged F1-scores are 0.8186 on SHHS1 and 0.7739 on SHHS2, indicating balanced performance across the four severity levels despite the class imbalance. Cohen’s kappa coefficients are 0.729 and 0.687, respectively, corresponding to substantial agreement between the model predictions and the reference labels. These results suggest that the proposed model maintains good overall reliability beyond chance agreement.
Two sets of experiments are designed using AHI thresholds of 5 and 15 to examine the effect of these cutoffs on the performance of the proposed model, with the goal of enhancing its ability to distinguish normal signals from those of patients with apnea. Accordingly, the four OSA severity levels are collapsed into two categories: normal and apneic. Records with an AHI below the designated threshold are classified as normal, whereas those at or above the threshold are classified as apneic.
Figure 10 and Figure 11 show the confusion matrices obtained from these experiments.
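The collapse from four severity levels to two categories can be sketched as follows. Because both cutoffs (5 and 15) coincide with boundaries between severity levels, the binary label can be derived either from the raw AHI or from the four-class label. The names below are illustrative and not taken from the study's code.

```python
# Lower AHI bound of each severity level, as defined for the four-class task.
SEVERITY_LOWER_BOUND = {"normal": 0, "mild": 5, "moderate": 15, "severe": 30}


def binarize_ahi(ahi, threshold):
    """Label a record as apneic if its AHI is at or above the threshold, else normal."""
    return "apneic" if ahi >= threshold else "normal"


def collapse_severity(severity, threshold):
    """Collapse a four-class label; valid only when the threshold is a class boundary."""
    if threshold not in SEVERITY_LOWER_BOUND.values():
        raise ValueError("threshold must coincide with a severity boundary")
    return "apneic" if SEVERITY_LOWER_BOUND[severity] >= threshold else "normal"


for cutoff in (5, 15):
    print(cutoff, [collapse_severity(s, cutoff) for s in SEVERITY_LOWER_BOUND])
# 5 ['normal', 'apneic', 'apneic', 'apneic']
# 15 ['normal', 'normal', 'apneic', 'apneic']
```

With a cutoff of 15, mild cases are grouped with normal records, so the two binary tasks pose different decision boundaries and are expected to differ in class balance.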
The test results are presented in Table 4. High overall accuracy is observed at both cutoffs. Using SHHS1 as an example, when the AHI threshold is set to 5, a diagnostic accuracy of 91.37% is achieved; the sensitivity, specificity, F1-score, and accuracy are 93.81%, 88.32%, 92.38%, and 90.99%, respectively.
Figure 12 shows the one-vs-rest ROC curves for the four OSA severity levels on SHHS1 and SHHS2. On SHHS1, the AUC values are 0.934, 0.871, 0.894, and 0.972 for the normal, mild, moderate, and severe classes, respectively. The high AUC for the severe class indicates excellent discrimination between severe OSA and the remaining classes, while the slightly lower AUC for the mild class reflects the clinical overlap between normal and mild OSA.
Similar trends are observed on SHHS2, with AUCs of 0.947, 0.838, 0.842, and 0.971 for the four classes, demonstrating that the proposed model maintains robust discrimination across datasets, particularly for moderate-to-severe OSA.
For the binary setting, as shown in Figure 13, the AUCs are 0.916 and 0.928 for an AHI cutoff of 5 on SHHS1 and SHHS2, respectively, and 0.965 and 0.950 for an AHI cutoff of 15. The consistently high AUC values indicate strong diagnostic performance in distinguishing normal from apneic subjects at clinically relevant thresholds.
4.5. Impact of OSA-Aware Adaptive Smoothing
To assess the contribution of OSA-aware adaptive smoothing, an additional set of experiments is conducted in which the adaptive smoothing step is replaced by alternative preprocessing strategies, while all other components of the pipeline remain unchanged. The following methods are compared:
Fixed median filter: a standard 11-point median filter applied uniformly to the entire SpO2 sequence. This method uses a fixed window length for all regions, regardless of local signal variability, and thus represents a typical non-adaptive denoising strategy.
Butterworth low-pass filter: a 4th-order low-pass Butterworth filter applied to the SpO2 signal. The cut-off frequency is chosen within the dominant frequency band of desaturation events, with the aim of attenuating high-frequency noise while preserving slower physiological fluctuations. The same cut-off parameter is applied globally over the whole recording.
Proposed OSA-aware adaptive smoothing: the full method described in Section 3.3, where the local standard deviation within a sliding window is used to adaptively adjust the effective smoothing window. High-variance regions, which usually correspond to desaturation events and recovery phases, are smoothed with a shorter window to preserve detailed morphology, whereas low-variance baseline segments are smoothed with a longer window for stronger noise suppression.
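The contrast between a fixed-window filter and variance-driven smoothing can be sketched as follows. This is not the method of Section 3.3: the window sizes, the standard-deviation threshold, the two-level window choice, and the use of a moving average are all assumptions made for illustration, and the function names are hypothetical.

```python
import statistics


def median_filter(x, window=11):
    """Fixed-window median filter; the window is clipped at the sequence edges."""
    half = window // 2
    n = len(x)
    return [statistics.median(x[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def adaptive_smooth(x, std_window=31, short=3, long=11, std_threshold=1.0):
    """Moving average whose window shrinks where the local standard deviation is high."""
    n = len(x)
    std_half = std_window // 2
    out = []
    for i in range(n):
        local = x[max(0, i - std_half):min(n, i + std_half + 1)]
        local_std = statistics.pstdev(local)
        # High local variability (e.g. a desaturation event): use the short window.
        width = short if local_std > std_threshold else long
        half = width // 2
        segment = x[max(0, i - half):min(n, i + half + 1)]
        out.append(sum(segment) / len(segment))
    return out


# Toy signal: stable baseline at 96% with a 10-second drop to 85%.
signal = [96.0] * 50 + [85.0] * 10 + [96.0] * 50
smoothed = adaptive_smooth(signal)
print(smoothed[54], smoothed[0])  # 85.0 96.0
```

In the toy signal the short window keeps the bottom of the drop at 85.0, whereas an 11-point moving average centred on the same sample would mix in baseline values and flatten the event.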
Each preprocessing variant is used to train and evaluate the complete MSC-Mamba-LSTM network on SHHS1 and SHHS2 under identical training and evaluation settings. Table 5 summarizes the performance in terms of overall classification accuracy (ACC). On SHHS1, accuracies of 0.7879, 0.7914, and 0.8052 are obtained for the fixed median filter, Butterworth low-pass filter, and OSA-aware adaptive smoothing, respectively. On SHHS2, the corresponding accuracies are 0.7602, 0.7533, and 0.7661. Thus, OSA-aware adaptive smoothing achieves the highest accuracy on both datasets, with absolute improvements over the fixed-parameter filters of approximately 1.4 to 1.7 percentage points on SHHS1 and 0.6 to 1.3 percentage points on SHHS2.
4.6. Ablation Experiment
In this study, the contributions of multi-scale convolution, the Mamba module, the MCA module, and the LSTM branch are systematically assessed by using ablation experiments. Overall, each component makes a positive contribution to performance across the two test sets, although some simplified variants achieve accuracy that is close to that of the full model on SHHS2.
As a baseline, a two-branch architecture comprising a conventional CNN and an LSTM was used, referred to as Model1. This configuration captures local features and temporal information, but its classification accuracy is significantly lower than that of the full model, which includes the additional modules.
In a subsequent ablation study, the multi-scale convolution module was replaced with a single-scale convolution (kernel size of 3), referred to as Model2. This modification diminished the model’s ability to capture features across multiple temporal scales, particularly when both high and low frequency components are present. As a result, recognition of complex patterns such as desaturation events and rapidly fluctuating SpO2 signals degraded substantially, highlighting the necessity of multi-scale convolution for extracting frequency-specific information through hierarchical representations. The adverse effect is particularly pronounced for physiological signals characterized by complex dynamics.
Next, the Mamba module was replaced with standard convolutional layers (Model3), which led to a marked performance drop. The Mamba module captures temporal dependencies and dynamic changes in time series, enhancing the model’s responsiveness to rapid transitions or event-driven patterns. Without it, Model3 struggled to model these dependencies and underperformed on dynamic events (e.g., sudden desaturation). These findings highlight the Mamba module’s central role in modeling temporal relationships and trends.
In an additional ablation, the MCA module was removed, resulting in a variant referred to as Model4. The MCA module adaptively re-weights channel-wise features, enabling the network to emphasize salient information. In the absence of MCA, Model4 fails to highlight informative channels, thereby impairing the detection of segments pertinent to obstructive sleep apnea (OSA) severity. Therefore, MCA is critical for estimating feature importance in SpO2 signals, particularly with multi-channel inputs, by directing attention to the channels most informative for classification.
Finally, eliminating the LSTM branch and retaining only the deep branch resulted in Model5, which substantially degraded overall performance. The LSTM branch captures long-term dependencies characteristic of physiological time series, particularly SpO2. It enables the model to represent extended temporal context and periodic fluctuations. Without LSTM, Model5 fails to encode these long-range patterns, reducing sensitivity to long-term structure and lowering classification accuracy. Accordingly, LSTM is indispensable for modeling long-term dependencies and periodicity in temporal signals.
As a whole, these modules enhance multi-level feature extraction for complex physiological signals, yielding substantial gains in OSA severity classification. Each component contributes in a unique and complementary manner to overall performance. The results of the ablation experiment are listed in Table 6.
These results indicate that each module complements the others in processing complex physiological signals (such as SpO2), jointly improving the multi-level feature extraction ability and classification performance of the model. All experiments are conducted under the same training configuration.
It is noteworthy that on SHHS2 the accuracies of some ablated variants are very close to that of the proposed full model. These differences, which are within 0.3 percentage points, are comparable to the expected variability due to random initialization and data imbalance in SHHS2. In addition, although not shown in Table 6, the full model consistently yields higher sensitivity for moderate and severe OSA and more stable macro-averaged performance across classes, indicating that the additional modules mainly improve robustness and class-wise balance rather than producing large gains in overall accuracy on every individual test set.
Furthermore, the pattern observed in Table 6 suggests a degree of redundancy and regularization among modules. Removing a single component can occasionally lead to a slightly higher accuracy on one dataset by reducing model capacity and mitigating overfitting, but this comes at the cost of reduced generalization to other datasets or OSA severity levels.
In summary, the ablation results support the conclusion that multi-scale convolution, Mamba, MCA, and LSTM provide complementary benefits, even though the numerical improvements in overall accuracy are modest for some ablated variants.