Article

Evaluation of Data Augmentation Under Label Scarcity for ECG-Based Detection of Sleep Apnea

1 Department of Artificial Intelligence Convergence, Hallym University, Chuncheon 24254, Republic of Korea
2 Cerebrovascular Disease Research Center, Hallym University, Chuncheon 24254, Republic of Korea
3 Department of Population Health Science and Policy, Icahn School of Medicine, Mount Sinai, New York, NY 10029, USA
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13231; https://doi.org/10.3390/app152413231
Submission received: 19 November 2025 / Revised: 15 December 2025 / Accepted: 16 December 2025 / Published: 17 December 2025
(This article belongs to the Section Biomedical Engineering)

Abstract

Supervised ECG-based sleep apnea detection typically depends on large and fully annotated datasets, yet the rarity and cost of labeling apneic events often lead to substantial annotation scarcity in practice. This study provides a controlled evaluation of how such scarcity degrades classification performance and, as a key contribution, investigates whether a constrained, morphology-preserving ECG augmentation framework can compensate for reduced apnea-label availability. Using the PhysioNet Apnea–ECG dataset, we simulated seven levels of label retention (r = 5–100%) and trained a lightweight CNN–BiLSTM model under both subject-dependent (SD) and subject-independent (SI) five-fold protocols. Offline augmentation was applied only to apnea segments and consisted of simple, physiologically motivated time-domain perturbations designed to retain realistic cardiac and respiratory dynamics. Across both evaluation settings, augmentation substantially mitigated performance loss in the low- and mid-scarcity regimes. Under SI evaluation, the mean F1-score improved from 0.57 to 0.72 at r = 5% and from 0.63 to 0.76 at r = 10%, with scores at r = 10–40% (0.75–0.77) approaching the full-label baseline of 0.79. Temporal and spectral analyses confirmed preservation of P–QRS–T morphology and respiratory modulation without distortion. These results demonstrate that simple and interpretable ECG augmentations provide an effective and reproducible baseline for data-efficient apnea screening and offer a practical path toward scalable annotation and robust single-lead deployment under label scarcity.

1. Introduction

Obstructive sleep apnea (OSA) is a prevalent sleep-related breathing disorder associated with cardiovascular, metabolic, and cognitive risks [1,2]. Polysomnography (PSG) remains the diagnostic gold standard; however, it is resource-intensive and requires specialized laboratory settings [3,4]. These limitations have inspired the development of low-cost screening approaches based on single-lead electrocardiography (ECG), particularly for wearable and home-monitoring applications [5,6,7].
The PhysioNet Apnea–ECG dataset has become a standard benchmark for ECG-based OSA detection [3]. Recent studies demonstrated strong performance using convolutional neural networks (CNNs), hybrid CNN–LSTM architectures, and transformer-based models [8,9,10,11,12]. However, these advances often assume access to fully labeled datasets and may be dependent on subject-overlapping splits that yield optimistic estimates of real-world generalization [13]. In practice, annotating apnea events is costly and labor-intensive, making limited and imbalanced labels the norm rather than the exception [14,15].
Data augmentation offers a practical means of compensating for limited labels by diversifying waveform amplitude, timing, and noise properties [14,16]. Nonetheless, its quantitative contribution under progressively reduced apnea–label availability has not been systematically analyzed, and whether augmentation can meaningfully compensate for low supervision in clinically realistic evaluation settings remains unclear.
To address this gap, we established a controlled baseline that examines how conventional ECG augmentations influence apnea detection under varying label–retention ratios. Using the PhysioNet Apnea–ECG dataset, seven scarcity levels were simulated (r = 5, 10, 20, 40, 60, 80, 100%), where only a fraction of real apnea labels was retained at r < 100% and missing apnea segments were replenished through augmentation.
A compact CNN–BiLSTM classifier was evaluated using five-fold subject-dependent (SD) and subject-independent (SI) protocols. This design enables a reproducible assessment of how scarcity and augmentation interact under clinically realistic settings, without aiming for state-of-the-art performance. The resulting analysis provides a practical baseline for future research on data-efficient ECG-based apnea screening.
The primary contributions of this study are threefold: (i) we provide a controlled evaluation of how apnea–label scarcity affects ECG-based screening performance under both subject-dependent (SD) and subject-independent (SI) protocols; (ii) we introduce an augmentation framework based on simple, morphology-preserving time-domain perturbations and assess its ability to compensate for reduced apnea supervision; and (iii) we demonstrate that conventional ECG augmentation substantially mitigates performance degradation in low- to mid-label regimes, establishing a practical baseline for data-efficient single-lead apnea screening in clinically realistic settings.

2. Related Studies

2.1. ECG-Based Sleep Apnea Detection

ECG-based detection of sleep apnea has been investigated for more than two decades, with early studies focusing on heart rate variability (HRV) and respiration-modulated features [17,18]. Recent studies have applied CNNs [5,8], compact convolutional backbones [19], and transformer-based encoders [11] to improve the classification performance and generalization of the models. Hybrid architectures combining convolutional feature extractors with recurrent or attention modules [10,20] achieve high F1 scores in subject-dependent settings; however, they tend to degrade when evaluated under subject-independent protocols [15].

2.2. Data Augmentation for ECG Signals

Data augmentation effectively improves model robustness and mitigates class imbalances across various biosignal domains. In ECG classification, the commonly applied operations include amplitude scaling, temporal stretching, random jitter, and controlled noise injection [14,16]. Most studies have focused on enhancing generalizability or reducing overfitting rather than addressing label-scarcity scenarios. In the detection of sleep apnea, the apnea class typically constitutes only 30–40% of the available data, and real-world datasets may contain fewer than 10% of annotated apnea events [15,21]. Therefore, assessing the performance gain that is achievable through augmentation under reduced label availability is crucial for the practical deployment of ECG-based apnea screening.

2.3. Evaluation Protocols and Reproducibility

The evaluation methodology critically affects the reported performance. Segment-wise or record-wise splitting can lead to data leakage, where a subject’s data appear in both the training and test sets and thereby inflate the results [2]. Recent benchmarks emphasize subject-independent validation to ensure a fair comparison [13]. Metrics such as the F1-score and area under the precision–recall curve (AUPRC) are more informative than accuracy in class-imbalanced settings [16]. This study followed these guidelines to ensure reproducibility and clinical relevance.

2.4. Label-Efficient and Semi-Supervised ECG Learning

Recent studies that have explored label-efficient ECG learning using contrastive or self-supervised objectives have facilitated the learning of robust representations from large unlabeled datasets [22,23]. When applied to arrhythmia classification and cardiorespiratory monitoring, these approaches have shown substantial improvements in downstream performance when labeled data are scarce [24,25]. Recently, contrastive ECG encoders have been optimized for wearable single-lead signals, enabling robust representations without multi-lead redundancy [26]. Furthermore, semi-supervised frameworks have been investigated for OSA detection by combining pseudo-labeling and unsupervised feature learning with a relatively small number of annotated ECG recordings [15,21]. Although label-efficient and semi-supervised methods provide powerful alternatives to fully supervised training, their performance typically depends on large-scale pretraining, multi-view inputs, or additional sensing modalities. Conversely, the present study focuses on a purely supervised setting and isolates the contribution of conventional label-preserving augmentation under controlled label-scarcity levels.

2.5. Synthetic ECG and Advanced Augmentation

In addition to simple time-domain perturbations, several studies have explored the generation of synthetic ECG using generative adversarial networks, transformer-based generators, or simulator-driven pipelines to address class imbalances and expand data diversity [27,28,29]. In addition, recent studies have explored diffusion-based generative ECG models, which can synthesize rhythm-preserving waveforms with improved distributional coverage [30,31]. In OSA detection specifically, generative augmentation has been used to amplify respiration-linked variability in ECG signals [32]. These synthetic approaches can increase the effective sample size for rare events and improve robustness; however, they require careful tuning to avoid distributional drift, mode collapse, or overreliance on artificial samples [33]. Herein, we deliberately restricted the augmentation to interpretable morphology-preserving transformations that were applied to real apnea segments. By examining their performance across multiple label-scarcity levels within this conventional augmentation regime, our framework provides a simple and reproducible baseline that can be systematically compared against more complex generative or self-supervised approaches.

3. Materials and Methods

Figure 1 shows the overall workflow of this study from raw ECG acquisition to model evaluation. Starting from the PhysioNet Apnea–ECG database, the signals were subjected to preprocessing and five-fold splitting. By using data augmentation, we then simulated different levels of apnea-label retention and compensated for the resulting label scarcity. The resulting datasets were used to train the CNN–BiLSTM classifier, and its performance was assessed on held-out test splits under each fold. The following subsections describe each of these components.

3.1. Dataset and Preprocessing

Experiments were conducted using the publicly available PhysioNet Apnea–ECG database [3,4], which provides single-lead ECG recordings that are annotated at 1-min resolution for sleep-apneic events. All signals were sampled at 100 Hz and segmented into non-overlapping 60-s windows according to the reference annotations. Because consecutive windows originated from the same overnight recording, temporal dependence between adjacent segments is expected; the resulting samples should therefore be interpreted as clustered observations within subjects rather than fully independent events. After discarding the invalid, flat, or truncated segments, 34,238 valid samples were retained (21,182 normal; 13,056 apnea) that corresponded to an overall normal-to-apnea ratio of approximately 1:0.6.
Signals were filtered using a 0.5–40 Hz fourth-order zero-phase Butterworth band-pass filter to suppress baseline drift and high-frequency artifacts. Each segment was normalized via z-score scaling and clipped to ±4σ to mitigate extreme outliers while preserving morphological fidelity. Segments containing NaN values or with extremely low variance were removed. The processed segments were stored as float32 arrays of shape (N, 1, 6000) for efficient mini-batch loading.
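The preprocessing chain described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes SciPy, assumes that "fourth-order zero-phase" means an order-4 design run forward and backward, and uses a hypothetical `min_std` threshold because the text gives no numeric cut-off for "extremely low variance".

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 100  # sampling rate in Hz, as stated in the text


def preprocess_segment(x, fs=FS, low=0.5, high=40.0, clip_sigma=4.0, min_std=1e-6):
    """Band-pass filter, z-score, and clip one 60-s ECG segment.

    Returns None for segments with NaNs or near-zero variance.
    min_std is a hypothetical threshold; the paper does not specify one.
    """
    x = np.asarray(x, dtype=np.float64)
    if np.isnan(x).any() or x.std() < min_std:
        return None
    # Assumption: order-4 Butterworth design applied forward and backward.
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    y = sosfiltfilt(sos, x)
    if y.std() < min_std:
        return None
    z = (y - y.mean()) / y.std()  # per-segment z-score
    return np.clip(z, -clip_sigma, clip_sigma).astype(np.float32)


# Synthetic stand-ins for raw segments: one noisy sinusoid and one flat line.
rng = np.random.default_rng(0)
t = np.arange(60 * FS) / FS
raw = [np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.standard_normal(t.size),
       np.zeros(t.size)]
kept = [s for s in map(preprocess_segment, raw) if s is not None]
batch = np.stack(kept)[:, None, :]  # shape (N, 1, 6000)
```

The flat segment is rejected, so the stacked array holds a single float32 segment in the (N, 1, 6000) layout used for mini-batch loading.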

3.2. Evaluation Protocols

To assess performance under both subject-independent and subject-dependent scenarios, we constructed two complementary five-fold cross-validation schemes: Subject-Independent (SI) and Subject-Dependent (SD). All the splits were ascertained deterministically using a fixed random seed.
  • Subject-Independent (SI) Splitting. In the SI protocol, subjects were partitioned such that no individual appeared in more than one split, ensuring that all segments from a given subject were assigned exclusively to a single set. Because the Apnea–ECG dataset contains four recording prefixes (a, b, c, and x), the SI allocation was stratified by prefix prior to shuffling to preserve the dataset’s natural distribution. All 70 subjects (20 “a”, 5 “b”, 10 “c”, and 35 “x”) were divided into five folds using this prefix-stratified strategy, so that each fold received a comparable mixture of subjects from all prefixes. For each fold, the held-out subjects formed the test set, and the remaining subjects were split into training and validation subsets using an 80–20 prefix-stratified split. The validation and test subjects for each fold are presented in Table 1, and the corresponding segment-level class statistics are summarized in Table 2. This protocol evaluates the performance of entirely unseen participants.
  • Subject-Dependent (SD) Splitting. In the SD protocol, every subject contributes segments to all folds; however, each subject’s own segments remain strictly partitioned across training, validation, and testing for any given fold. For each subject, apnea and normal segments were first separated and stratified, after which the subject’s data were divided into five label-balanced partitions. For a global fold k, the subject’s k-th partition was assigned as the test set, the (k + 1)-th partition (in cyclic order) served as the validation set, and the remaining three partitions were designated as the training set. This scheme maintains a per-subject label balance while preventing segment-level leakage. Table 3 summarizes the segment distributions for the SD setting.
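The two protocols can be sketched as follows, using only the standard library. The round-robin dealing of shuffled subjects, the seed value, and the function names `si_folds` and `sd_roles` are assumptions made for illustration; the 80–20 training/validation split of the remaining SI subjects is not shown.

```python
import random

# Subject counts per recording prefix, as given in the text.
PREFIX_COUNTS = {"a": 20, "b": 5, "c": 10, "x": 35}


def si_folds(n_folds=5, seed=0):
    """Assign subjects to test folds, stratified by recording prefix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for prefix, count in PREFIX_COUNTS.items():
        subjects = [f"{prefix}{i:02d}" for i in range(1, count + 1)]
        rng.shuffle(subjects)
        # Deal shuffled subjects round-robin so every fold sees each prefix.
        for j, subject in enumerate(subjects):
            folds[j % n_folds].append(subject)
    return folds


def sd_roles(k, n_parts=5):
    """Role of each per-subject partition in global fold k (cyclic order)."""
    test, val = k % n_parts, (k + 1) % n_parts
    return {p: "test" if p == test else "val" if p == val else "train"
            for p in range(n_parts)}


folds = si_folds()
roles_last_fold = sd_roles(4)  # partition 4 is test, partition 0 wraps to val
```

With these prefix counts, each SI fold holds 4 "a", 1 "b", 2 "c", and 7 "x" subjects, and no subject appears in more than one fold.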

3.3. Label-Scarcity Simulation and Augmentation

Label scarcity was applied exclusively to the training portion of each fold. The validation and test sets were fully labeled to ensure consistent evaluation. To emulate realistic annotation constraints, which make the labeling of apneic events costly and time-intensive, we retained only a fraction r% of apnea segments in the training set while keeping all normal segments unchanged. This preserved the natural class prior of the fully labeled training set before augmentation. Scarcity levels were r = {5, 10, 20, 40, 60, 80, 100}%. For each level, a subject-balanced subset of real apnea segments was sampled according to r to form a reduced labeled apnea pool. The missing apnea samples were then replenished via offline augmentation to restore the total apnea count to the full-label reference size. Therefore, r governs the number of real apnea labels, whereas the size of the effective training set remains comparable across scarcity conditions.
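A minimal sketch of the retention step, assuming NumPy. "Subject-balanced" is interpreted here as keeping the same fraction of each subject's apnea segments with at least one segment per subject; that interpretation, the seed, and the function name `retain_apnea` are assumptions, not details from the paper.

```python
import numpy as np


def retain_apnea(subject_ids, r_percent, seed=0):
    """Pick a subject-balanced subset of real apnea training segments.

    subject_ids: one subject label per real apnea segment in the training split.
    Returns (kept_indices, n_synthetic), where n_synthetic is the number of
    augmented segments needed to restore the full-label apnea count.
    """
    rng = np.random.default_rng(seed)
    subject_ids = np.asarray(subject_ids)
    kept = []
    for subject in np.unique(subject_ids):
        idx = np.flatnonzero(subject_ids == subject)
        # Assumption: same fraction per subject, at least one segment each.
        n_keep = max(1, int(round(len(idx) * r_percent / 100)))
        kept.extend(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.array(kept))
    return kept, len(subject_ids) - len(kept)


# Toy training split: three subjects with 200, 100, and 40 apnea segments.
subjects = np.repeat(["a01", "a02", "a03"], [200, 100, 40])
kept, n_synth = retain_apnea(subjects, r_percent=10)
```

In this toy case 34 real segments are kept and 306 synthetic ones are needed, so the apnea count seen by the classifier stays at 340 regardless of r.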
Two complementary augmentation strategies were used: (i) offline synthesis of additional apnea samples to compensate for reduced label availability and (ii) lightweight on-the-fly perturbations applied during mini-batch sampling to improve generalization. Although some atomic operations (e.g., small shifts or noise injection) appear in both, their roles and operating regimes are fundamentally different: offline augmentation expands the apnea sample pool, whereas on-the-fly augmentation provides stochastic regularization for all training samples.
  • Offline augmentation (apnea class only). Offline augmentation was applied exclusively to apnea-class training samples to restore the full apnea count under label scarcity. New samples were generated from a pool of real apnea segments using a set of morphology-preserving transformations. The specific transformations used in offline augmentation are summarized in Table 4. Each transformation is activated independently with a fixed probability (e.g., p = 0.4–0.7, depending on the perturbation type) to produce a diverse collection of synthesized variants. This probabilistic design avoids deterministic augmentation patterns, increases morphological diversity, and maintains physiological plausibility while preventing overfitting to a narrow set of handcrafted transformations. All synthesized segments were stored on disk and used only within the training split.
  • On-the-fly augmentation (all training samples). A lightweight perturbation pipeline was applied to every training sample (normal and apnea) at mini-batch time. The operations used for on-the-fly augmentation are likewise summarized in Table 4. All operations were implemented stochastically, and the magnitude of each perturbation was sampled from a predefined distribution, which resulted in a negligible effect when the sampled value was near zero. This yielded natural variability across mini-batches without systematically altering the ECG morphology. In contrast with offline augmentation, which expands the apnea dataset, on-the-fly perturbations act solely for regularization and do not change the total number of samples.
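The independent-activation design of the offline stage can be sketched as follows, assuming NumPy. Table 4 is not reproduced in this section, so the four operations, their probabilities (chosen inside the stated 0.4–0.7 range), and their magnitudes are placeholders rather than the paper's settings; `augment_apnea` and `replenish` are hypothetical names.

```python
import numpy as np


def augment_apnea(x, rng, fs=100):
    """Return one synthetic variant of a real apnea segment x (1-D array).

    Each operation fires independently. All operations, probabilities, and
    magnitudes here are illustrative placeholders, not values from Table 4.
    """
    y = np.array(x, dtype=np.float32, copy=True)
    n = y.size
    if rng.random() < 0.7:  # amplitude scaling
        y *= rng.uniform(0.9, 1.1)
    if rng.random() < 0.5:  # small circular time shift (up to 0.5 s)
        y = np.roll(y, int(rng.integers(-(fs // 2), fs // 2 + 1)))
    if rng.random() < 0.5:  # low-amplitude Gaussian noise
        y += rng.normal(0.0, 0.02, size=n).astype(np.float32)
    if rng.random() < 0.4:  # mild time stretch, resampled back to n samples
        factor = rng.uniform(0.97, 1.03)
        src = np.clip(np.arange(n) * factor, 0, n - 1)
        y = np.interp(src, np.arange(n), y).astype(np.float32)
    return y


def replenish(real_pool, n_synthetic, seed=0):
    """Draw source segments with replacement and augment each one."""
    rng = np.random.default_rng(seed)
    sources = rng.integers(0, len(real_pool), size=n_synthetic)
    return np.stack([augment_apnea(real_pool[i], rng) for i in sources])


# Toy pool of four "real" segments; ten synthetic variants are generated.
pool = np.random.default_rng(1).standard_normal((4, 6000)).astype(np.float32)
synth = replenish(pool, n_synthetic=10)
```

Because every operation is gated by its own random draw, two variants of the same source segment rarely share the same combination of perturbations.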

3.4. Classifier Architecture

A compact 1D CNN–BiLSTM architecture with attention pooling was used to capture both local ECG morphology and long-range temporal structure across each 60-s segment. The network receives inputs of shape (B, 1, 6000) and produces a single logit per segment.
The model begins with a convolutional stem consisting of two convolutional layers that extract low-level morphological features while reducing temporal resolution:
$$\mathrm{Conv1d}(1 \to 64,\ k = 15,\ s = 2),\ \mathrm{BN},\ \mathrm{SiLU};\quad \mathrm{Conv1d}(64 \to 128,\ k = 11,\ s = 2),\ \mathrm{BN},\ \mathrm{SiLU}.$$
These layers are followed by two residual depthwise-separable convolution blocks (kernel sizes 9 and 7), each equipped with squeeze-and-excitation gating to adaptively recalibrate channel-wise representations.
Temporal downsampling is then performed using a strided projection layer,
$$\mathrm{Conv1d}(128 \to 192,\ k = 5,\ s = 5),$$
which produces a sequence of T = 300 time steps with d_model = 192 feature channels.
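The value T = 300 follows from the three strided layers, whose combined stride is 2 × 2 × 5 = 20 on a 6000-sample input. The short check below uses the standard 1-D convolution length formula; the padding values are assumptions ("same"-style padding for the stem and none for the projection), and the residual blocks are assumed to preserve length.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


length = 6000  # 60 s at 100 Hz
# (kernel, stride, padding); kernels and strides are from the text,
# the padding values are assumed.
for kernel, stride, padding in [(15, 2, 7), (11, 2, 5), (5, 5, 0)]:
    length = conv1d_out_len(length, kernel, stride, padding)
```

Under these assumptions the lengths run 6000, 3000, 1500, 300, matching the stated sequence length.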
To capture temporal dependencies across the full segment, the architecture employs a single-layer bidirectional LSTM (H = 192 per direction), yielding contextualized features of shape (B, T, 384). A learned attention pooling mechanism computes weights α = softmax(Xw) and aggregates a pooled representation,
$$z = \sum_{t=1}^{T} \alpha_t X_t,$$
which summarizes the most salient temporal dynamics.
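The pooling step can be written out in a few lines of NumPy. This sketch assumes that w is a single learned vector of length 384 and that the softmax runs over the time axis; in the actual model, w would be a trainable parameter rather than the random vector used here.

```python
import numpy as np


def attention_pool(X, w):
    """X: (B, T, D) BiLSTM features, w: (D,) scoring vector -> z: (B, D)."""
    scores = X @ w                                       # (B, T)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)            # softmax over time
    z = (alpha[:, :, None] * X).sum(axis=1)              # sum_t alpha_t * X_t
    return z, alpha


rng = np.random.default_rng(0)
X = rng.standard_normal((2, 300, 384))  # stand-in for BiLSTM outputs
w = 0.01 * rng.standard_normal(384)     # stand-in for the learned vector
z, alpha = attention_pool(X, w)
z_uniform, _ = attention_pool(X, np.zeros(384))
```

When w is all zeros the weights are uniform and the pooled vector reduces to the plain temporal mean, which makes a convenient sanity check.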
Finally, a two-layer MLP head,
$$\mathrm{Linear}(384 \to 384),\ \mathrm{SiLU},\ \mathrm{Dropout}(0.3),\ \mathrm{Linear}(384 \to 1),$$
maps the pooled descriptor to a single logit, with sigmoid activation applied only at inference.
This CNN–BiLSTM design was chosen as a compact yet expressive baseline rather than as a new state-of-the-art architecture. The convolutional stem and depthwise-separable residual blocks are tailored to capture local P–QRS–T morphology and noise characteristics, while the BiLSTM layer summarizes longer-range temporal dependencies over each 60-s segment. This hybrid CNN–RNN pattern mirrors prior ECG-based OSA detectors and other single-lead ECG classifiers that combine convolutional feature extractors with recurrent encoders [5,8,10,13,19]. In contrast, temporal-convolutional networks and transformer-based or convolution–attention hybrids [10,11,12,34] typically require deeper stacks and larger parameter budgets. Under the severe label-scarcity regimes considered in this work (e.g., 5–10% real apnea labels), such higher-capacity models may be more prone to overfitting and less representative of resource-constrained wearable deployment scenarios. We therefore deliberately employed a lightweight CNN–BiLSTM architecture to maintain the focus on the effect of conventional label-preserving augmentation under clinically realistic model complexity, while leaving the exploration of more advanced architectures for future research.

3.5. Experimental Setup

Two experimental conditions were evaluated.
(i) Full-label baseline (100% labels): the r = 100 setting, using only on-the-fly augmentation.
(ii) Label-scarcity settings (r ∈ {5, 10, 20, 40, 60, 80}%): reduced apnea supervision, where offline augmentation was applied to restore the apnea count to the full-label level.
Each condition was assessed using both SI and SD five-fold cross-validation. For each scarcity level, independent models were trained across all five folds, and the results were summarized as the fold-averaged mean with 95% bootstrap confidence intervals (CI).
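A percentile bootstrap over the five fold scores is one common way to obtain such an interval. The sketch below assumes NumPy, fold-level resampling, and the percentile method, none of which the text specifies; the fold scores are made-up numbers.

```python
import numpy as np


def bootstrap_ci(fold_scores, n_boot=10000, level=0.95, seed=0):
    """Mean of fold scores with a percentile bootstrap confidence interval."""
    scores = np.asarray(fold_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample folds with replacement and recompute the mean each time.
    idx = rng.integers(0, scores.size, size=(n_boot, scores.size))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return float(scores.mean()), float(lo), float(hi)


# Illustrative (made-up) F1-scores from five folds.
mean, lo, hi = bootstrap_ci([0.74, 0.78, 0.71, 0.80, 0.77])
```

With only five folds the interval is necessarily coarse, since every resampled mean is built from the same five values.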
At every training epoch, the exponential-moving-average (EMA) model (decay ~0.999) was evaluated on the validation split. EMA was used to obtain smoother and more reliable validation estimates under stochastic mini-batch updates. A decision threshold was selected for each epoch by scanning the values in [0.1, 0.9] and choosing the value that maximized the validation F1-score. A short moving average was applied to stabilize the fluctuations. The test split was evaluated using this validation-optimized threshold, and no additional tuning or post-hoc adjustment was performed on the test data. The primary evaluation metric is the F1-score. In addition, we reported the AUPRC, AUROC, and accuracy.
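The threshold scan can be sketched as a grid search on validation probabilities, assuming NumPy. The grid step of 0.01 is an assumption (the text gives only the range), the moving-average smoothing across epochs is omitted, and the validation labels and probabilities are toy values.

```python
import numpy as np


def f1_at(y_true, prob, thr):
    """F1-score of the apnea class when predicting prob >= thr."""
    pred = prob >= thr
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def best_threshold(y_true, prob, grid=None):
    """Scan thresholds in [0.1, 0.9] and return the F1-maximizing one."""
    if grid is None:
        grid = np.linspace(0.1, 0.9, 81)  # step of 0.01 is an assumption
    scores = [f1_at(y_true, prob, t) for t in grid]
    i = int(np.argmax(scores))
    return float(grid[i]), scores[i]


# Toy validation labels and sigmoid outputs.
y_val = np.array([0, 0, 0, 1, 1, 1, 0, 1])
p_val = np.array([0.05, 0.20, 0.50, 0.60, 0.80, 0.95, 0.30, 0.58])
thr, f1 = best_threshold(y_val, p_val)
```

The selected threshold is then frozen and applied to the test split, so the test data never influence the operating point.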
All models were implemented in PyTorch 2.2 and trained on a workstation equipped with four NVIDIA RTX Titan GPUs (24 GB each). The experiments were conducted on Ubuntu 22.04 with CUDA 11.8 and cuDNN 9.0. Optimization was performed using AdamW with a learning rate of 1 × 10⁻³, a weight decay of 5 × 10⁻⁴, and a warmup–cosine learning-rate schedule (5% warmup). The loss function was BCEWithLogitsLoss, with a positive class weight computed from the imbalance of the training set. Mini-batches were constructed using a weighted random sampler to ensure label-balanced sampling. The gradients were clipped to ‖g‖₂ ≤ 1.0, and the EMA of the model parameters was maintained throughout training to stabilize the validation performance.
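Three of these ingredients reduce to short formulas, sketched here in plain Python. The linear warmup shape, the decay to zero at the end of training, and the normal-to-apnea ratio as the positive class weight are assumptions; the text gives only the base rate, the warmup fraction, and the EMA decay.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=1e-3, warmup_frac=0.05):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    # Assumption: the schedule decays all the way to zero.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def pos_weight(n_normal, n_apnea):
    """Positive class weight as the normal-to-apnea ratio (assumed form)."""
    return n_normal / n_apnea


def ema_update(ema, params, decay=0.999):
    """One exponential-moving-average step over a list of parameter values."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


peak_lr = warmup_cosine_lr(49, total_steps=1000)     # last warmup step
final_lr = warmup_cosine_lr(1000, total_steps=1000)  # end of schedule
# Whole-dataset counts from Section 3.1, used only as an illustration;
# the actual weight would come from each training fold.
w_pos = pos_weight(21182, 13056)
```

For the whole-dataset counts the ratio is about 1.62, so each apnea segment would weigh roughly 1.6 times a normal one in the loss under this assumed definition.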

4. Results

4.1. Effect of Label Scarcity and Augmentation

Figure 2 shows how the F1-score and AUPRC vary with the apnea-label retention ratio r under SD and SI evaluations, with and without data augmentation. In both settings, the performance degrades as r decreases; however, this degradation is substantially mitigated when offline augmentation is applied to replenish the missing apnea samples.
Without offline augmentation, both SD and SI models exhibited a gradual decline in F1-score as r decreased from 100% to 5%. When augmentation was enabled, the SD results showed that F1-scores at r ≥ 10% were nearly restored to the full-label baseline; even at r = 5%, a large fraction of the performance loss was recovered.
A similar trend appeared in the SI setting; however, owing to the more challenging subject-independent evaluation, the absolute performance was lower. The full-label SI baseline (r = 100%) achieves an average F1-score of 0.79. At r = 5%, the F1 decreased sharply to 0.57 without augmentation, but increased to 0.72 with augmentation. For most scarcity levels (r = 10–80%), the SI performance with augmentation remained close to the full-label reference and yielded F1-scores in the range of 0.75–0.77.
A corresponding pattern was observed for AUPRC. In the SI evaluation, the full-label baseline achieved an AUPRC of 0.87. At r = 5%, AUPRC fell to 0.56 without augmentation, but improved to 0.75 with augmentation. For all r ≥ 10%, AUPRC values remained within 0.78–0.86, indicating strong recovery toward the full-label baseline. The results of SD evaluation showed the same qualitative behavior, with a higher absolute AUPRC and minimal degradation for r ≥ 10%.
Table 5 and Table 6 provide a quantitative summary of all the metrics across r. Each metric is reported as a five-fold mean with a 95% confidence interval obtained via non-parametric bootstrap resampling. In the SD setting, all metrics across all r levels yielded identical Wilcoxon p-values (p = 0.0625), indicating a consistent marginal trend favoring augmentation despite not reaching conventional significance (α = 0.05). In the SI setting, similar marginal evidence was observed only for severe scarcity, particularly for F1-score at r = 5 and 10, and for accuracy at r = 5 (p = 0.0625), while other metrics showed positive but nonsignificant changes. These results suggest that augmentation may be most impactful when labeled apnea events are extremely limited, with diminishing but directionally consistent influence as more labels become available.
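The recurring value p = 0.0625 has a simple reading if the test is taken to be a two-sided exact Wilcoxon signed-rank test over the five folds with no ties or zero differences (the text does not state sidedness, so this is an assumption). Under the null hypothesis each of the 2⁵ = 32 sign patterns is equally likely, and the most extreme outcome, all five folds favoring augmentation, gives p = 2/32. The enumeration below uses only the standard library.

```python
from itertools import product


def exact_wilcoxon_two_sided(n_pairs, w_plus):
    """Exact two-sided p-value for the signed-rank statistic W+.

    Assumes no ties and no zero differences among the paired values.
    """
    ranks = range(1, n_pairs + 1)
    # Under H0 every sign pattern over the ranks is equally likely.
    sums = [sum(r for r, s in zip(ranks, signs) if s)
            for signs in product([0, 1], repeat=n_pairs)]
    upper = sum(v >= w_plus for v in sums) / len(sums)
    lower = sum(v <= w_plus for v in sums) / len(sums)
    return min(1.0, 2 * min(upper, lower))


# All five folds favor augmentation: W+ = 1 + 2 + 3 + 4 + 5 = 15.
p_min = exact_wilcoxon_two_sided(5, w_plus=15)
# One fold (the smallest difference) goes the other way: W+ = 14.
p_next = exact_wilcoxon_two_sided(5, w_plus=14)
```

Under this reading, 0.0625 is the smallest p-value the test can produce with five pairs, so no effect size could reach α = 0.05 in this design.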

4.2. Impact of Augmentation on Data Variability

To assess whether offline augmentation preserved the inherent statistical properties of apnea segments, we compared the spectral distributions obtained under label scarcity with those obtained from the fully labeled baseline.
The power spectral density (PSD) was computed using Welch’s method, and for each r the mean PSD and 10–90% percentile bands were derived over all apnea segments in the training fold. Representative comparisons for r = 5% and 10% (Figure 3) show that the augmented datasets reproduce the full-label baseline (r = 100%) remarkably well across the physiologically relevant range of 0.5–40 Hz. Even under extreme label scarcity (r = 5%), the spectral similarity remains high, with Pearson correlation exceeding 0.998 and KL-divergence values of the order of 10⁻³ (Table 7).
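A sketch of this comparison, assuming SciPy's `welch`. The window length, the use of linear-scale PSD for the correlation, the normalization of each PSD to unit sum before the KL divergence, and the direction KL(reference ‖ augmented) are all assumptions; the signals are synthetic stand-ins.

```python
import numpy as np
from scipy.signal import welch


def mean_psd(segments, fs=100, nperseg=1024, band=(0.5, 40.0)):
    """Mean Welch PSD over segments, restricted to the 0.5-40 Hz band."""
    f, pxx = welch(segments, fs=fs, nperseg=nperseg, axis=-1)
    keep = (f >= band[0]) & (f <= band[1])
    return f[keep], pxx[:, keep].mean(axis=0)


def psd_similarity(p_ref, p_aug, eps=1e-12):
    """Pearson correlation and KL(ref || aug) between two mean PSDs."""
    r = np.corrcoef(p_ref, p_aug)[0, 1]
    p = p_ref / p_ref.sum()  # treat each PSD as a distribution over frequency
    q = p_aug / p_aug.sum()
    kl = float(np.sum(p * np.log((p + eps) / (q + eps))))
    return float(r), kl


# Synthetic stand-ins: a reference set and a lightly perturbed copy.
rng = np.random.default_rng(0)
t = np.arange(6000) / 100
ref = np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.standard_normal((20, 6000))
aug = ref + 0.01 * rng.standard_normal((20, 6000))
_, p_ref = mean_psd(ref)
_, p_aug = mean_psd(aug)
r, kl = psd_similarity(p_ref, p_aug)
```

A correlation near 1 together with a KL value near 0 indicates that the perturbed set has the same spectral shape as the reference.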
Band power analysis further indicated that the deviations in the low-frequency (0.5–4 Hz) and mid-frequency (4–10 Hz) bands remained within approximately ±3% across all r. Larger differences appeared in the high-frequency range (10–25 Hz), particularly for r = 5% and 10%; however, this region contributed to only a small fraction of the total spectral energy. Therefore, these deviations are consistent with the introduction of mild noise-like perturbations instead of the distortion of the morphology-bearing components of the ECG waveform.
To verify that offline augmentation preserves physiologically meaningful characteristics of apneic ECG, we evaluated three signal properties capturing different mechanisms: beat morphology, cardiac timing, and respiration-modulated variability. Beat morphology was assessed by extracting individual beats via R-peak detection using a smoothed energy envelope, followed by temporal alignment at the R-peak and z-score normalization within each segment. A representative mean beat waveform was computed separately for the real and augmented datasets, and compared against the real mean with its ±1 standard deviation band. As shown in Figure 4, the augmented mean waveform remains confined within the real envelope, indicating that beat morphology is preserved without visually detectable distortion.
Cardiac timing was evaluated using QRS duration and R–R intervals (RRI). QRS duration was measured as the contiguous interval around each R-peak where the absolute ECG amplitude exceeded 50% of the peak value, constrained to 30–300 ms to ensure physiologically valid values. RRIs were computed from successive R-peaks, and only intervals within 300–2000 ms were retained. Table 8 shows that mean deviations between the real and augmented data were less than 2% for QRS duration and less than 1.5% for RRI, indicating that augmentation does not perturb cardiac timing or beat-to-beat structure.
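Both timing measures can be sketched directly from R-peak sample indices, assuming NumPy and the 100 Hz sampling rate. Whether out-of-range QRS values were discarded or clipped is not stated; this sketch discards them, which is an assumption, and the waveform and peak positions are toy values.

```python
import numpy as np


def rr_intervals_ms(r_peaks, fs=100, lo=300.0, hi=2000.0):
    """R-R intervals in ms, keeping only values within [lo, hi]."""
    rri = np.diff(np.asarray(r_peaks)) * 1000.0 / fs
    return rri[(rri >= lo) & (rri <= hi)]


def qrs_duration_ms(x, r_peak, fs=100, frac=0.5, lo=30.0, hi=300.0):
    """Width of the contiguous run around r_peak where |x| > frac * |x[r_peak]|.

    Returns None when the width falls outside [lo, hi] ms (assumed handling).
    """
    a = np.abs(np.asarray(x, dtype=float))
    thr = frac * a[r_peak]
    left = right = r_peak
    while left > 0 and a[left - 1] > thr:
        left -= 1
    while right < a.size - 1 and a[right + 1] > thr:
        right += 1
    dur = (right - left + 1) * 1000.0 / fs
    return dur if lo <= dur <= hi else None


# Toy beat: a triangular QRS centred on sample 100.
x = np.zeros(300)
x[98:103] = [0.2, 0.6, 1.0, 0.6, 0.2]
qrs = qrs_duration_ms(x, r_peak=100)
# Peak spacings of 800, 50, 850, and 3300 ms; only two are physiological.
rr = rr_intervals_ms([100, 180, 185, 270, 600])
```

At 100 Hz the QRS width is resolved only in 10 ms steps, so small relative deviations in mean QRS duration correspond to sub-sample shifts on average.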
Respiration-modulated variability was quantified using an ECG-derived respiration (EDR) surrogate computed from R-peak amplitudes. After detecting R-peaks at {r_k}, amplitude samples a_k = |x(r_k)| were extracted and linearly interpolated at the original sampling rate to produce an amplitude-tracking envelope reflecting respiratory influence on QRS amplitude. The temporal variance of this signal was used as a scalar measure of respiratory modulation. As summarized in Table 8, the relative deviation of EDR variance remained below 10% across r = 5–80%, with most cases below 5%, suggesting preserved respiration-linked amplitude dynamics.
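The EDR surrogate and its relative deviation can be sketched as follows, assuming NumPy. `np.interp` holds the first and last amplitudes constant outside the peak range, which is an assumption about edge handling; the peak positions and amplitudes are toy values.

```python
import numpy as np


def edr_variance(x, r_peaks):
    """Variance of the R-peak amplitude envelope, plus the envelope itself."""
    x = np.asarray(x, dtype=float)
    r_peaks = np.asarray(r_peaks)
    amps = np.abs(x[r_peaks])  # a_k = |x(r_k)|
    # Linear interpolation back to one value per original sample.
    envelope = np.interp(np.arange(x.size), r_peaks, amps)
    return float(envelope.var()), envelope


def relative_deviation(real_value, mixed_value):
    """Absolute deviation of the mixed-data value from the real one, in %."""
    return abs(mixed_value - real_value) / abs(real_value) * 100.0


# Toy segment with six R-peaks of slowly varying amplitude.
x = np.zeros(600)
r_peaks = np.array([50, 150, 250, 350, 450, 550])
x[r_peaks] = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0]
var, env = edr_variance(x, r_peaks)
dev = relative_deviation(2.0, 2.1)  # a 5% deviation, for illustration
```

Because the envelope depends only on peak amplitudes, amplitude-scaling augmentations affect this measure more directly than they affect QRS duration or RRI.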
Collectively, the deviations across all three indicators (less than 2% for QRS duration, less than 1.5% for RRI, and generally below 5–10% for the EDR variance) represent mild amplitude modulation rather than physiological drift. These results confirm that offline augmentation increases sample diversity while maintaining clinically meaningful characteristics of apneic ECG signals.
Overall, the offline augmentation increased sample diversity while preserving essential physiological characteristics of apneic ECG segments—including morphology, cardiac timing, respiration-modulated variability, and spectral structure—without introducing distributional drift away from the real apnea manifold.

5. Discussion

5.1. Summary of Main Findings

This study systematically evaluated the effect of conventional ECG-based data augmentation on apnea detection under varying levels of label scarcity. Across both SD and SI protocols, the classifier’s F1-score decreased as the apnea label-retention ratio r decreased, consistent with prior reports that OSA-detection models are sensitive to limited supervision and class imbalance [14,15,16]. Nonetheless, offline augmentation substantially mitigated this reduction. In the SD setting, F1-scores for r ≥ 10% were nearly indistinguishable from the full-label baseline, whereas the r = 5% condition recovered much of its degraded performance.
In the SI evaluation, the full-label baseline achieved an average F1-score of 0.79, which dropped to 0.57 and 0.63 at r = 5% and 10%, respectively. Offline augmentation increased these values to 0.72 and 0.76, and maintained performance between 0.75 and 0.77 for all r ≥ 10%. Thus, only the extreme scarcity condition (r = 5%) exhibited a noticeable remaining gap, suggesting that subject-independent generalization may still be constrained when supervision becomes exceedingly limited.
To determine whether these changes reflected systematic benefits rather than incidental numerical variation, we applied non-parametric Wilcoxon tests. In the SD protocol, all metrics across all r values produced the same marginal p-value (p = 0.0625), suggesting a consistent directional benefit of augmentation, albeit one that did not cross the conventional significance threshold (α = 0.05). In SI evaluation, similar marginal evidence emerged only under severe label scarcity (F1-score at r = 5% and 10%, and accuracy at r = 5%), indicating that augmentation becomes most impactful when supervision is insufficient for learning pathological variability. This borderline behavior likely reflects the limited number of paired comparisons (five folds), since small-sample Wilcoxon tests have coarse p-value resolution and may understate statistical significance despite consistent directional effects.
Collectively, these findings imply that label-preserving ECG perturbations are most beneficial when the dataset lacks adequate pathological diversity. Their impact diminishes once the model has already encountered sufficient apnea events, consistent with the diminishing returns of augmentation in well-supervised regimes. Thus, augmentation acts not as a universal performance enhancer, but as a pragmatic mechanism for improving data efficiency in low-label clinical settings, without requiring architectural modifications or advanced learning paradigms [22,27,28].

5.2. Implications for Data-Efficient Apnea Screening

These results have practical implications for the development of low-cost apnea-screening systems that use single-lead ECG. First, the finding that r ≈ 20–40% labels are often sufficient to recover most of the full-label performance presents a feasible annotation target for the development of wearable ECG applications, where manual labeling is cost- and time-intensive [35,36]. Second, the trend of diminishing returns at higher r values suggests that augmentation primarily benefits low- and mid-scarcity regimes, which is consistent with the observations across various ECG classification tasks [14]. In high-label regimes (e.g., r ≥ 60%), augmentation behaves more like a mild regularizer than a mechanism for reconstructing missing variability.
Third, the persistent performance gap between SD and SI highlights the importance of subject-independent evaluations because record-wise or segment-wise splits can overestimate real-world performance [2]. Even with identical preprocessing, architecture, and augmentation, the SI results remained consistently lower than the SD results (full-label F1-score of 0.79 versus 0.90; Tables 5 and 6), a trend that has been consistently reported in recent OSA- and ECG-generalization studies [13].
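The label-retention ratio r that underlies these annotation targets can be illustrated with a short sketch. The function below is hypothetical and rests on two assumptions that this section does not spell out: that normal segments are always kept, and that unretained apnea segments are simply dropped from the training set rather than relabeled.

```python
import random


def retain_apnea_labels(labels, r, seed=0):
    """Return sorted indices of the training segments kept at retention ratio r.

    labels: sequence of 0 (normal) / 1 (apnea) segment labels.
    r:      fraction of apnea segments whose labels are retained, in (0, 1].
    Assumption: normal segments are untouched; dropped apnea segments are discarded.
    """
    if not 0.0 < r <= 1.0:
        raise ValueError("r must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    apnea = [i for i, y in enumerate(labels) if y == 1]
    normal = [i for i, y in enumerate(labels) if y == 0]
    n_keep = max(1, round(r * len(apnea))) if apnea else 0
    kept_apnea = rng.sample(apnea, n_keep)
    return sorted(normal + kept_apnea)


# Toy example: 60 normal and 40 apnea segments, 10% apnea-label retention.
toy_labels = [0] * 60 + [1] * 40
kept = retain_apnea_labels(toy_labels, r=0.10)
print(len(kept), sum(toy_labels[i] for i in kept))  # 64 segments, 4 of them apnea
```

Using a fixed seed per fold keeps the retained subset identical between the baseline and augmented runs, so the two conditions differ only in the added synthetic segments.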

5.3. Augmentation Quality and Physiological Preservation

Ensuring that the generated variability remains physiologically plausible is a crucial aspect of data augmentation. In this study, augmentation quality was assessed along two complementary axes: frequency-domain preservation of spectral structure and time-domain preservation of beat morphology, cardiac timing, and respiration-modulated variability. Together, these provide a consistency check that augmented segments remain close to the manifold of real apneic ECG.
Our PSD analysis revealed that augmented apnea segments closely track the spectral profile of the full-label baseline across 0.5–40 Hz (Figure 3), with Pearson correlations exceeding 0.998 and KL-divergence values on the order of 10⁻³ for all r (Table 7). Differences in the low- and mid-frequency band powers remained within approximately ±3%, whereas comparatively large deviations in the high-frequency band were confined to a region with relatively low absolute energy. Therefore, these discrepancies likely reflect benign, noise-like perturbations rather than deformation of the morphology-bearing components of the ECG [33].
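The similarity measures reported in Table 7 can be sketched as follows, assuming the two PSDs have already been estimated on a common frequency grid. The helper names, the treatment of each PSD as a normalized distribution for the KL divergence, and the band edges passed to the band-power helper are assumptions for illustration; the exact definitions used in the study are not given in this section.

```python
import math


def normalize(psd):
    """Scale a PSD so that its bins sum to 1 (treat it as a distribution)."""
    total = sum(psd)
    return [p / total for p in psd]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two PSDs after normalization; eps guards log(0)."""
    p, q = normalize(p), normalize(q)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def band_power_change(freqs, psd_ref, psd_mix, lo, hi):
    """Relative band-power change in percent for lo <= f < hi (band edges assumed)."""
    ref = sum(p for f, p in zip(freqs, psd_ref) if lo <= f < hi)
    mix = sum(p for f, p in zip(freqs, psd_mix) if lo <= f < hi)
    return 100.0 * (mix - ref) / ref


# Toy spectra on a 4-bin grid: the mixed PSD loses 10% power in the top two bins.
freqs = [1.0, 2.0, 3.0, 4.0]
psd_real = [4.0, 3.0, 2.0, 1.0]
psd_mix = [4.0, 3.0, 1.8, 0.9]
print(pearson(psd_real, psd_mix))
print(kl_divergence(psd_real, psd_mix), kl_divergence(psd_mix, psd_real))
print(band_power_change(freqs, psd_real, psd_mix, 3.0, 5.0))
```

Reporting the KL divergence in both directions, as Table 7 does, is useful because the measure is asymmetric; near-equal values in the two directions indicate that neither spectrum has mass where the other has almost none.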
Complementary time-domain analyses corroborated this conclusion. Mean beat overlays (Figure 4) showed that the augmented average waveform remains confined within the real mean ± 1 standard deviation envelope, indicating preserved QRS complexes and T-wave morphology even at r = 5–10%. Quantitative comparisons in Table 8 further revealed that, across r = 5–80%, the mean QRS duration and R–R interval of the mixed real+augmented data deviated by less than 2% and 1.5%, respectively, from the fully real reference, while the variance of the ECG-derived respiration surrogate remained within approximately 5–10%. These small discrepancies are consistent with mild amplitude modulation and breath-related variability rather than physiological drift.
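The percentage deviations quoted above follow from a simple relative difference against the fully real reference. As a worked check, the sketch below recomputes the QRS and R–R deviations for the r = 5% and r = 10% rows of Table 8 from the published (rounded) values; the function name is illustrative.

```python
def percent_deviation(mixed, reference):
    """Relative deviation of a mixed-data statistic from the real reference, in %."""
    return 100.0 * (mixed - reference) / reference


# Reference values for 100% real data (Table 8).
QRS_REF_MS = 86.81
RRI_REF_MS = 679.67

# (QRS_mix, RRI_mix) in ms for r = 5% and r = 10% (Table 8).
rows = {5: (86.89, 689.28), 10: (88.32, 677.51)}

for r, (qrs_mix, rri_mix) in rows.items():
    d_qrs = percent_deviation(qrs_mix, QRS_REF_MS)
    d_rri = percent_deviation(rri_mix, RRI_REF_MS)
    print(f"r = {r:>2}%  dQRS = {d_qrs:+.2f}%  dRRI = {d_rri:+.2f}%")
```

Some other rows of Table 8 differ from this recomputation in the last digit, which is expected when the published inputs are themselves rounded.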
This combination of spectral and morphological preservation is consistent with findings from recent ECG augmentation and synthetic-signal studies, which emphasize that successful augmentation should expand data-manifold coverage without inducing distributional drift [25,28,37]. The close agreement observed between augmented and fully real distributions across both the frequency and time domains suggests that the applied transformations respect underlying physiological constraints and do not introduce artifacts that could mislead the classifier.

5.4. Limitations and Future Research

This study has several limitations. First, experiments were conducted on a single dataset using a compact CNN–BiLSTM architecture. Although this setup ensures reproducibility and computational efficiency, recent studies have suggested that transformer-based or hybrid temporal models can improve generalization [10,11,12,34]. Future studies should therefore investigate whether the observed label-scarcity effects persist across more advanced architectures and larger-scale supervised or semi-supervised frameworks.
Second, only conventional time-domain augmentation was examined. Although these methods are widely used and interpretable [38], more advanced approaches, such as mixup, contrastive pre-training, and adversarial synthesis, may yield additional gains, especially in ultra-low-label regimes (r < 5%) [23,27]. A unified comparison under the same scarcity protocol would clarify whether complex augmentation strategies truly outperform simple label-preserving perturbations.
Third, this study used the Apnea–ECG dataset [3], which was collected using clinical-grade equipment under controlled recording conditions. The resulting signals are relatively clean, with limited motion artifacts, electrode drift, or sensor instability. Such characteristics are not fully representative of wearable or home-sleep screening scenarios, potentially leading to optimistic estimates of model robustness and augmentation effectiveness. The reported gains should therefore be interpreted as being conditional on this idealized recording environment. Moreover, the Apnea–ECG recordings were collected more than two decades ago, and may not fully reflect modern sensor characteristics or demographic variability present in contemporary wearable datasets. Future studies should evaluate scarcity-driven augmentation on noisier wearable datasets or incorporate artifact-aware perturbations that mimic realistic acquisition variability, such as motion-induced baseline wander, posture changes, and impedance-dependent noise.
Fourth, our analysis was performed at the segment level using non-overlapping 60-s ECG windows. Because these windows are extracted consecutively from overnight recordings, temporally adjacent segments from the same subject are likely autocorrelated. Although our cross-validation design avoids any reuse of raw segments across train, validation, and test splits and enforces subject-wise separation in the SI setting, it does not eliminate within-subject temporal dependence. Accordingly, the reported sample size should be interpreted as a collection of clustered segment-level observations rather than fully independent apnea events, and the performance metrics reflect segment-level discrimination, consistent with the statistical implications of clustered physiological measurements [39]. Because the same segmentation procedure, splitting rules, and subject composition are used for every value of r, the temporal clustering structure is preserved across all label-scarcity conditions. As a result, the relative comparison of augmentation efficacy is not confounded by changes in the effective number of independent observations but instead reflects differences in the amount of real apnea supervision. Furthermore, window-based segment-level inference is widely used in ECG-based sleep-apnea detection [8], whereas patient-level metrics such as the apnea–hypopnea index can only be derived by aggregating information across multiple consecutive segments. Future research studies should therefore extend the present framework to models that explicitly leverage temporal context across multiple consecutive windows and to patient-level evaluations based on apnea–hypopnea indices [40].
Fifth, in the SD protocol, the model is exposed to multiple segments from the same individual during both training and testing. Under this setting, augmentation can unintentionally reinforce subject-specific morphology or noise characteristics, potentially enhancing identity-related cues rather than purely pathology-driven discrimination. As a result, augmentation-induced gains in SD should be interpreted as improvements in intra-subject generalization (i.e., adapting to a patient once some of their data are available), rather than as evidence of subject-independent performance. Conversely, the SI protocol more directly reflects deployment to unseen patients, where such identity cues cannot be exploited. Future research may further mitigate identity-driven effects by incorporating subject-level regularization or domain randomization strategies.
Finally, we examined label scarcity for the apnea class only. Real-world deployments must also address label noise, device variability, nighttime drift, and population heterogeneity [13,33]. Future studies should investigate how augmentation interacts with these additional sources of variability, particularly cross-device and cross-cohort robustness, to support scalable ECG-based OSA screening.

5.5. Overall Significance

In summary, this study provides controlled empirical evidence that conventional ECG augmentation can substantially improve apnea detection performance in low-label regimes, especially under SI evaluation. The results demonstrate that physiologically meaningful augmentation enables strong baselines, even without generative modeling or self-supervised pretraining, and offers a practical foundation for future research on data-efficient ECG-based apnea screening.

6. Conclusions

This work presents a controlled investigation of the influence of conventional label-preserving ECG augmentations on ECG-based apnea classification under varying levels of apnea-label scarcity. Across both SD and SI evaluations, offline augmentation substantially mitigated performance degradation in low- and mid-scarcity regimes, restoring much of the full-label F1-score even when only 10–40% of the real apnea labels were available. These gains were most pronounced under subject-independent evaluation, where offline augmentation increased the F1-score from 0.57 to 0.72 at r = 5% and from 0.63 to 0.76 at r = 10%, against a full-label SI baseline of 0.79.
Spectral analyses demonstrated that the augmented apnea segments closely preserved the physiological frequency characteristics of the real segments, exhibiting a high correlation and low KL divergence across all r. Complementary time-domain analyses showed that augmented segments also maintained beat morphology, cardiac timing, and respiration-modulated variability: deviations in QRS duration and R–R intervals were below approximately 2% and 1.5%, respectively, and the variance of the ECG-derived respiration surrogate remained within a few percent to 10% of the fully real reference. The observed discrepancies were localized primarily in high-frequency regions with low absolute energy, indicating that augmentation introduced benign variability rather than morphology-distorting artifacts.
Collectively, these results show that simple and interpretable time-domain augmentation can provide a strong, reproducible baseline for data-efficient ECG-based apnea screening without relying on complex generative models or self-supervised pretraining. Future extensions may include the evaluation of additional architectures, the integration of more advanced augmentation or contrastive learning strategies, and the assessment of patient-level performance using broader clinical datasets and wearable device cohorts.

Author Contributions

Conceptualization, S.R. and I.c.J.; methodology, S.R. and I.c.J.; software, S.R. and J.K.; validation, S.R., J.K. and I.c.J.; writing—original draft preparation, S.R.; writing—review and editing, S.R., J.K. and I.c.J.; funding acquisition, I.c.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from the National Research Foundation of Korea (NRF), funded by the Korean government (MSIT) (grant No. NR070859).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in the Apnea–ECG Database at https://doi.org/10.1109/CIC.2000.898505.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUPRC: Area Under the Precision–Recall Curve
AUROC: Area Under the Receiver Operating Characteristic Curve
BN: Batch Normalization
BiLSTM: Bidirectional Long Short-Term Memory
CNN: Convolutional Neural Network
EMA: Exponential Moving Average
HRV: Heart Rate Variability
KL: Kullback–Leibler Divergence
MLP: Multi-Layer Perceptron
OSA: Obstructive Sleep Apnea
PSD: Power Spectral Density
SD: Subject-Dependent
SI: Subject-Independent

References

  1. Dey, D.; Chaudhuri, S.; Munshi, S. Obstructive Sleep Apnoea Detection Using Convolutional Neural Network Based Deep Learning Framework. Biomed. Eng. Lett. 2018, 8, 95–100.
  2. Ramachandran, A.; Karuppiah, A. A Survey on Recent Advances in Machine Learning Based Sleep Apnea Detection. Healthcare 2021, 9, 914.
  3. Penzel, T.; Moody, G.B.; Mark, R.G.; Goldberger, A.L.; Peter, J.H. The Apnea–ECG Database. In Proceedings of the Computers in Cardiology 2000, Cambridge, MA, USA, 24–27 September 2000; pp. 255–258.
  4. Goldberger, A.L.; Amaral, L.A.N.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 2000, 101, e215–e220.
  5. Sheta, A.; Turabieh, H.; Thaher, T.; Too, J.; Mafarja, M.; Hossain, M.S.; Surani, S.R. Diagnosis of Obstructive Sleep Apnea from ECG Signals Using Machine Learning and Deep Learning Classifiers. Appl. Sci. 2021, 11, 6622.
  6. Liu, M.H.; Chien, S.Y.; Wu, Y.L.; Sun, T.H.; Huang, C.S.; Hsu, K.C.; Hang, L.W. EfficientNet-based Machine Learning Architecture for Sleep Apnea Identification in Clinical Single-Lead ECG Signal Data Sets. BioMed. Eng. Online 2024, 23, 57.
  7. Yamane, T.; Fujii, M.; Morita, M. Clinical-Level Screening of Sleep Apnea Syndrome with Single-Lead ECG Alone Using Machine Learning with Appropriate Time Windows. Sleep Breath. 2025, 29, 156.
  8. Wang, T.; Lu, C.; Shen, G.; Hong, F. Sleep apnea detection from a single-lead ECG signal with automatic feature-extraction through a modified LeNet-5 convolutional neural network. PeerJ 2019, 7, e7731.
  9. Almutairi, H.; Hassan, G.M.; Datta, A. Classification of Obstructive Sleep Apnoea from Single-Lead ECG Signals Using Convolutional Neural and Long Short-Term Memory Networks. Biomed. Signal Process. Control 2021, 69, 102906.
  10. Pham, D.T.; Mouček, R. Efficient Sleep Apnea Detection Using Single-Lead ECG: A CNN–Transformer–LSTM Approach. Comput. Biol. Med. 2025, 196, 110655.
  11. Hu, S.; Cai, W.; Gao, T.; Wang, M. A Hybrid Transformer Model for Obstructive Sleep Apnea Detection Based on Self-Attention Mechanism Using Single-Lead ECG. IEEE Trans. Instrum. Meas. 2022, 71, 2514011.
  12. Nguyen, H.X.; Nguyen, D.V.; Pham, H.H.; Do, C.D. MPCNN: A Novel Matrix Profile Approach for CNN-based Sleep Apnea Classification. arXiv 2023, arXiv:2311.15041.
  13. Ahmadzadeh, S.; Luo, J.; Wiffen, R. Review on Biomedical Sensors, Technologies and Algorithms for Diagnosis of Sleep Disordered Breathing: Comprehensive Survey. IEEE Rev. Biomed. Eng. 2022, 15, 4–22.
  14. Rahman, M.M.; Rivolta, M.W.; Badilini, F.; Sassi, R. A Systematic Survey of Data Augmentation of ECG Signals for AI Applications. Sensors 2023, 23, 5237.
  15. Hu, S.; Wang, Y.; Liu, J.; Yang, C.; Wang, A.; Li, K.; Liu, W. Semi-Supervised Learning for Low-Cost Personalized Obstructive Sleep Apnea Detection Using Unsupervised Deep Learning and Single-Lead Electrocardiogram. IEEE J. Biomed. Health Inform. 2023, 27, 5281–5292.
  16. Safdar, M.F.; Nowak, R.M.; Pałka, P. Pre-Processing Techniques and Artificial Intelligence Algorithms for Electrocardiogram (ECG) Signals Analysis: A Comprehensive Review. Comput. Biol. Med. 2024, 170, 107908.
  17. Fatimah, B.; Singh, P.; Singhal, A.; Pachori, R.B. Detection of Apnea Events from ECG Segments Using Fourier Decomposition Method. Biomed. Signal Process. Control 2020, 61, 102005.
  18. Mashrur, F.R.; Islam, M.S.; Saha, D.K.; Islam, S.R.; Moni, M.A. SCNN: Scalogram-Based Convolutional Neural Network to Detect Obstructive Sleep Apnea Using Single-Lead ECG Signals. Comput. Biol. Med. 2021, 134, 104532.
  19. Pan, H.; Yu, Y.; Ye, J.; Zhang, X. MobileNetV2: A Lightweight Classification Model for Home-Based Sleep Apnea Screening. arXiv 2024, arXiv:2412.19967.
  20. Mohammadi, Z.; Mohammadi, S. SleepLiteCNN: Energy-Efficient Sleep Apnea Subtype Classification with 1-Second Resolution Using Single-Lead ECG. arXiv 2025, arXiv:2508.02718.
  21. Islam, M.A.; Chaki, S.; Yousuf, M.A.; Moni, M.A. Sleep Apnea Detection through HRV and SpO2 Analysis of Wearable Sensors. In Proceedings of the ICCA 2024: 3rd International Conference on Computing Advancements, Dhaka, Bangladesh, 17–18 October 2024.
  22. Mehari, T.; Strodthoff, N. Self-supervised representation learning from 12-lead ECG data. Comput. Biol. Med. 2022, 141, 105114.
  23. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607.
  24. Eldele, E.; Ragab, M.; Chen, Z.; Wu, M.; Kwoh, C.K.; Li, X.; Guan, C. Time-series representation learning via temporal and contextual contrasting. arXiv 2021, arXiv:2106.14112.
  25. Kachuee, M.; Fazeli, S.; Sarrafzadeh, M. ECG Heartbeat Classification: A Deep Transferable Representation. In Proceedings of the 2018 IEEE International Conference on Healthcare Informatics (ICHI), New York, NY, USA, 4–7 June 2018; Volume 32.
  26. Zou, Y.; Wang, P.; Du, L.; Chen, X.; Li, Z.; Song, J.; Fang, Z. A Multi-Level Multiple Contrastive Learning Method for Single-Lead Electrocardiogram Atrial Fibrillation Detection. Bioengineering 2025, 12, 44.
  27. Aslam, M.; Naqvi, S.S.; Khan, T.M.; Holmes, G.; Naffa, R. Trainable guided attention based robust leather defect detection. Eng. Appl. Artif. Intell. 2023, 124, 106438.
  28. Zanchi, B.; Monachino, G.; Fiorillo, L.; Conte, G.; Auricchio, A.; Tzovara, A.; Faraci, F.D. Synthetic ECG Signals Generation: A Scoping Review. Comput. Biol. Med. 2025, 184, 109453.
  29. Venugopal, A.; Resende Faria, D. Boosting EEG and ECG Classification with Synthetic Data: A WGAN-GP Approach. Appl. Sci. 2024, 14, 10818.
  30. Alcaraz, J.M.L.; Strodthoff, N. Diffusion-based conditional ECG generation with structured state space models. Comput. Biol. Med. 2023, 163, 107115.
  31. Zama, M.H.; Schwenker, F. ECG Synthesis via Diffusion-Based State Space Augmented Transformer. Sensors 2023, 23, 8328.
  32. Wicaksono, P.; Philip, S.; Alam, I.N.; Isa, S.M. Dealing with Imbalanced Sleep Apnea Data Using Deep Convolutional Generative Adversarial Networks. Traitement du Signal 2022, 39, 1527–1536.
  33. Shajari, S.; Kuruvinashetti, K.; Komeili, A.; Sundararaj, U. The emergence of AI-based wearable sensors for digital health technology: A review. Sensors 2023, 23, 9498.
  34. Cheng, L.; Bai, J.; Liu, A.; Feng, S.; Sun, R. Automated OSAHS detection from ECG using temporal convolutional network. Sci. Rep. 2025, 15, 35915.
  35. Osa-Sanchez, A.; Ramos-Martinez-de Soria, J.; Mendez-Zorrilla, A.; Ruiz, I.O.; Garcia-Zapirain, B. Wearable Sensors and Artificial Intelligence for Sleep Apnea Detection: A Systematic Review. J. Med. Syst. 2025, 49, 66.
  36. Lee, S.; Yun, S. Sleep Apnea Detection Using Wireless Wearable Single-Lead ECG: A One Dimensional Convolutional Neural Network Approach. Digit. Health Res. 2024, 2, e3.
  37. Bagga, M.; Jeon, H.; Issokson, A. ECGNet: A Generative Adversarial Network Approach to the Synthesis of 12-Lead ECG Signals from Single-Lead Inputs. arXiv 2023, arXiv:2310.03753.
  38. Iglesias, G.; Talavera, E.; González-Prieto, Á.; Mozo, A.; Gómez-Canaval, S. Data augmentation techniques in time series domain: A survey and taxonomy. Neural Comput. Appl. 2023, 35, 10123–10145.
  39. Eisner, D.A. Pseudoreplication in physiology: More means less. J. Gen. Physiol. 2021, 153, e202012826.
  40. Papini, G.B.; Fonseca, P.; van Gilst, M.M.; van Dijk, J.P.; Pevernagie, D.A.A.; Bergmans, J.W.M.; Vullings, R.; Overeem, S. Estimation of the apnea-hypopnea index in a heterogeneous sleep-disordered population using optimised cardiovascular features. Sci. Rep. 2019, 9, 17448.
Figure 1. Overall pipeline of the proposed label-scarcity evaluation framework.
Figure 2. F1-score and AUPRC as a function of apnea-label retention r with and without data augmentation under subject-dependent (SD) and subject-independent (SI) evaluation. Error bars denote 95% CI for baseline models, and shaded regions denote 95% CI for augmented models. Each point represents statistics aggregated over five cross-validation folds. (a) SI–F1-score vs. r; (b) SD–F1-score vs. r; (c) SI–AUPRC vs. r; (d) SD-AUPRC vs. r.
Figure 3. Average power spectral density (PSD) of apnea segments under the SI setting, comparing the full-label baseline (r = 100%) with the label-scarce conditions (r = 5%, 10%) augmented offline. (a) Welch PSD, SI, fold 1, r = 5%; (b) Welch PSD, fold 1, r = 10%.
Figure 4. Mean beat comparison under the SI setting (fold 1) for r = 5% and r = 10%. The augmented mean waveform (dashed) remains confined within the real mean ± 1 standard deviation envelope (gray band), indicating preserved QRS duration, amplitude profile, and T-wave morphology.
Table 1. Validation and test subject identities for subject-independent (SI) 5-fold cross-validation. Training subjects are the remaining subjects in each fold.
| Fold | Validation Subjects | Test Subjects |
|---|---|---|
| 1 | a07, a10, a19, b05, c04, c05, x05, x07, x12, x16, x17, x34 | a14, a17, a18, a20, b01, c01, c06, x01, x03, x08, x20, x25, x27, x29 |
| 2 | a01, a02, a10, b02, c05, c08, x05, x06, x13, x16, x30, x34 | a06, a08, a11, a16, b04, c03, c07, x07, x18, x19, x24, x28, x32, x35 |
| 3 | a05, a06, a14, b04, c07, c08, x04, x10, x11, x19, x21, x24 | a02, a09, a15, a19, b03, c05, c09, x06, x12, x15, x16, x22, x26, x33 |
| 4 | a04, a10, a15, b05, c06, c07, x07, x13, x16, x19, x20, x33 | a01, a05, a07, a12, b02, c08, c10, x02, x04, x05, x14, x21, x23, x31 |
| 5 | a15, a16, a17, b03, c05, c09, x04, x08, x14, x19, x20, x23 | a03, a04, a10, a13, b05, c02, c04, x09, x10, x11, x13, x17, x30, x34 |
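Since Table 1 lists only the validation and test subjects, the training set of each fold has to be derived as the complement. The sketch below does this for fold 1 and checks subject-wise separation. It assumes the standard record naming of the PhysioNet Apnea–ECG database (a01–a20, b01–b05, c01–c10, x01–x35, 70 recordings in total) and treats each recording as one subject; the function names are hypothetical.

```python
def record_ids():
    """All record names, assuming the Apnea-ECG naming a01-a20, b01-b05, c01-c10, x01-x35."""
    ids = [f"a{i:02d}" for i in range(1, 21)]
    ids += [f"b{i:02d}" for i in range(1, 6)]
    ids += [f"c{i:02d}" for i in range(1, 11)]
    ids += [f"x{i:02d}" for i in range(1, 36)]
    return ids


def training_subjects(validation, test, all_ids=None):
    """Return the subjects left for training, after checking that the splits are disjoint."""
    all_ids = all_ids if all_ids is not None else record_ids()
    val_set, test_set = set(validation), set(test)
    overlap = val_set & test_set
    if overlap:
        raise ValueError(f"subjects in both validation and test: {sorted(overlap)}")
    unknown = (val_set | test_set) - set(all_ids)
    if unknown:
        raise ValueError(f"unknown subject ids: {sorted(unknown)}")
    return [s for s in all_ids if s not in val_set and s not in test_set]


# Fold 1 of Table 1.
FOLD1_VAL = "a07 a10 a19 b05 c04 c05 x05 x07 x12 x16 x17 x34".split()
FOLD1_TEST = "a14 a17 a18 a20 b01 c01 c06 x01 x03 x08 x20 x25 x27 x29".split()

train = training_subjects(FOLD1_VAL, FOLD1_TEST)
print(len(train))  # 44 training subjects when 70 recordings are assumed
```

Running the same check on every fold is a cheap guard against subject leakage, which matters here because the SI results are only meaningful if no test subject contributes segments to training or validation.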
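Table 2. Subject-Independent (SI) 5-fold distribution (number of segments).
| Fold | Training Normal | Training Apnea | Validation Normal | Validation Apnea | Test Normal | Test Apnea |
|---|---|---|---|---|---|---|
| 1 | 12,867 | 8626 | 4463 | 1367 | 3852 | 3063 |
| 2 | 13,094 | 8510 | 3896 | 2088 | 4192 | 2458 |
| 3 | 12,258 | 9417 | 4188 | 1516 | 4736 | 2123 |
| 4 | 13,984 | 7392 | 3590 | 2250 | 3608 | 3414 |
| 5 | 13,012 | 8583 | 3376 | 2475 | 4794 | 1998 |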
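Table 3. Subject-Dependent (SD) 5-fold distribution (number of segments).
| Fold | Training Normal | Training Apnea | Validation Normal | Validation Apnea | Test Normal | Test Apnea |
|---|---|---|---|---|---|---|
| 1 | 12,665 | 7797 | 4252 | 2622 | 4265 | 2637 |
| 2 | 12,692 | 7823 | 4238 | 2611 | 4252 | 2622 |
| 3 | 12,723 | 7846 | 4221 | 2599 | 4238 | 2611 |
| 4 | 12,755 | 7870 | 4206 | 2587 | 4221 | 2599 |
| 5 | 12,711 | 7832 | 4265 | 2637 | 4206 | 2587 |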
Table 4. Comparison of offline and on-the-fly ECG augmentation strategies used in this study.
| Strategy | Target Samples | Transformation |
|---|---|---|
| Offline | Apnea only | Temporal shift (±2%) |
| Offline | Apnea only | Amplitude scaling (±5%) |
| Offline | Apnea only | Gaussian noise (0.5–1.5% of s.d.) |
| Offline | Apnea only | Low-frequency drift |
| Offline | Apnea only | Local dropout (0.5–1.0%) |
| Offline | Apnea only | Global time warping (±1%) |
| On-the-fly | All training | Temporal shift (±3%) |
| On-the-fly | All training | Amplitude jitter [0.9, 1.1] |
| On-the-fly | All training | Gaussian noise (∼2% of s.d.) |
| On-the-fly | All training | Random masking (1–3%) |
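To make the offline row of Table 4 concrete, the following is a minimal standard-library sketch that applies the six perturbations in sequence to one segment. Only the parameter ranges come from Table 4. Everything else is an assumption for illustration: the function name, the 100 Hz sampling rate, applying all six transformations in this order, the circular shift, the sinusoidal form and 2% amplitude of the drift, zero-filling for dropout, and linear interpolation for time warping. It should not be read as the study's implementation.

```python
import math
import random


def augment_apnea_segment(x, fs=100.0, rng=None):
    """Apply the six offline perturbations of Table 4 to one ECG segment (list of floats)."""
    rng = rng or random.Random()
    n = len(x)

    # 1. Temporal shift: circular shift by up to +/-2% of the segment length (assumed circular).
    max_shift = int(0.02 * n)
    shift = rng.randint(-max_shift, max_shift)
    y = list(x[-shift:]) + list(x[:-shift]) if shift else list(x)

    # 2. Amplitude scaling by a factor in [0.95, 1.05].
    scale = rng.uniform(0.95, 1.05)
    y = [scale * v for v in y]

    # Standard deviation of the scaled segment, used to size the noise and drift.
    mean = sum(y) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / n)

    # 3. Additive Gaussian noise with s.d. equal to 0.5-1.5% of the segment s.d.
    noise_sd = rng.uniform(0.005, 0.015) * sd
    y = [v + rng.gauss(0.0, noise_sd) for v in y]

    # 4. Low-frequency drift: one slow sinusoid (frequency range and 2% amplitude are assumptions).
    drift_hz = rng.uniform(0.05, 0.30)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    y = [v + 0.02 * sd * math.sin(2.0 * math.pi * drift_hz * i / fs + phase)
         for i, v in enumerate(y)]

    # 5. Local dropout: zero one contiguous run covering 0.5-1.0% of the samples (zero-fill assumed).
    width = max(1, int(rng.uniform(0.005, 0.010) * n))
    start = rng.randrange(0, n - width + 1)
    for i in range(start, start + width):
        y[i] = 0.0

    # 6. Global time warping by +/-1%: linear-interpolation resampling, output length kept at n.
    factor = 1.0 + rng.uniform(-0.01, 0.01)
    warped = []
    for i in range(n):
        pos = min(i * factor, n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append(y[lo] * (1.0 - frac) + y[hi] * frac)
    return warped


# Toy 60-s segment at the assumed 100 Hz: a 1.2 Hz sinusoid standing in for an ECG.
segment = [math.sin(2.0 * math.pi * 1.2 * i / 100.0) for i in range(6000)]
augmented = augment_apnea_segment(segment, rng=random.Random(0))
print(len(augmented))  # same length as the input segment
```

Passing a seeded `random.Random` makes each synthetic segment reproducible, which is convenient for offline augmentation because the generated set can be stored once and reused across runs.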
Table 5. Subject-independent (SI) performance across label-retention ratios r. All values are reported as the fold-averaged mean with 95% CI. For each r, the first row is the baseline (without augmentation) and the second row is the performance after offline augmentation; r = 100% has a baseline only.
| r (%) | Setting | F1-Score | AUPRC | AUROC | Accuracy |
|---|---|---|---|---|---|
| 5 | Baseline | 0.565 [0.472–0.649] | 0.563 [0.396–0.734] | 0.671 [0.522–0.812] | 0.571 [0.437–0.701] |
| 5 | Augmented | 0.716 [0.654–0.769] | 0.747 [0.638–0.838] | 0.832 [0.780–0.883] | 0.792 [0.703–0.861] |
| 10 | Baseline | 0.630 [0.580–0.677] | 0.698 [0.514–0.858] | 0.782 [0.701–0.853] | 0.706 [0.600–0.802] |
| 10 | Augmented | 0.764 [0.725–0.798] | 0.828 [0.783–0.872] | 0.876 [0.831–0.914] | 0.825 [0.774–0.868] |
| 20 | Baseline | 0.696 [0.614–0.773] | 0.800 [0.699–0.882] | 0.877 [0.835–0.913] | 0.805 [0.767–0.840] |
| 20 | Augmented | 0.754 [0.712–0.789] | 0.779 [0.687–0.857] | 0.855 [0.786–0.919] | 0.802 [0.735–0.859] |
| 40 | Baseline | 0.742 [0.655–0.818] | 0.817 [0.739–0.889] | 0.873 [0.818–0.917] | 0.819 [0.769–0.865] |
| 40 | Augmented | 0.768 [0.712–0.816] | 0.823 [0.743–0.887] | 0.877 [0.811–0.932] | 0.821 [0.756–0.879] |
| 60 | Baseline | 0.768 [0.731–0.799] | 0.858 [0.822–0.892] | 0.896 [0.865–0.924] | 0.838 [0.801–0.869] |
| 60 | Augmented | 0.773 [0.726–0.812] | 0.859 [0.797–0.912] | 0.902 [0.846–0.948] | 0.825 [0.769–0.874] |
| 80 | Baseline | 0.770 [0.707–0.825] | 0.844 [0.793–0.886] | 0.887 [0.824–0.939] | 0.829 [0.772–0.878] |
| 80 | Augmented | 0.752 [0.673–0.820] | 0.852 [0.795–0.903] | 0.887 [0.828–0.936] | 0.820 [0.758–0.873] |
| 100 | Baseline | 0.789 [0.766–0.810] | 0.865 [0.833–0.894] | 0.906 [0.882–0.926] | 0.842 [0.817–0.865] |
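How much of the full-label performance is recovered at each retention level can be read off Table 5 with a short calculation. The sketch below uses only the SI mean F1-scores from the table; the "recovered fraction" (F1 divided by the full-label F1) is an illustrative summary introduced here, not a metric defined in the study, and it ignores the confidence intervals.

```python
FULL_LABEL_F1 = 0.789  # SI baseline at r = 100% (Table 5)

# r (%) -> (baseline F1, augmented F1), SI means from Table 5.
SI_F1 = {
    5: (0.565, 0.716),
    10: (0.630, 0.764),
    20: (0.696, 0.754),
    40: (0.742, 0.768),
    60: (0.768, 0.773),
    80: (0.770, 0.752),
}


def recovered_fraction(f1, full=FULL_LABEL_F1):
    """Share of the full-label F1-score reached by a given F1-score."""
    return f1 / full


for r, (base, aug) in SI_F1.items():
    print(f"r = {r:>2}%  baseline {recovered_fraction(base):.3f}  "
          f"augmented {recovered_fraction(aug):.3f}  gain {aug - base:+.3f}")
```

Read this way, the augmented model at r = 10% already reaches about 97% of the full-label F1-score, while the gain shrinks to a few thousandths at r = 60% and turns slightly negative at r = 80%, matching the diminishing-returns pattern described in the Discussion.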
Table 6. Subject-dependent (SD) performance across label-retention ratios r. All values are reported as the fold-averaged mean with 95% CI. For each r, the first row is the baseline (without augmentation) and the second row is the performance after offline augmentation; r = 100% has a baseline only.
| r (%) | Setting | F1-Score | AUPRC | AUROC | Accuracy |
|---|---|---|---|---|---|
| 5 | Baseline | 0.609 [0.568–0.646] | 0.727 [0.492–0.913] | 0.796 [0.653–0.919] | 0.671 [0.602–0.732] |
| 5 | Augmented | 0.878 [0.872–0.884] | 0.950 [0.946–0.954] | 0.963 [0.958–0.967] | 0.910 [0.907–0.914] |
| 10 | Baseline | 0.762 [0.744–0.779] | 0.908 [0.883–0.928] | 0.936 [0.917–0.951] | 0.847 [0.823–0.868] |
| 10 | Augmented | 0.893 [0.889–0.897] | 0.958 [0.957–0.959] | 0.970 [0.969–0.971] | 0.919 [0.918–0.920] |
| 20 | Baseline | 0.851 [0.833–0.868] | 0.930 [0.921–0.937] | 0.957 [0.951–0.963] | 0.893 [0.880–0.905] |
| 20 | Augmented | 0.904 [0.901–0.906] | 0.964 [0.963–0.966] | 0.974 [0.973–0.975] | 0.927 [0.926–0.928] |
| 40 | Baseline | 0.883 [0.879–0.887] | 0.944 [0.941–0.947] | 0.966 [0.964–0.969] | 0.911 [0.908–0.914] |
| 40 | Augmented | 0.913 [0.909–0.916] | 0.969 [0.967–0.971] | 0.979 [0.977–0.981] | 0.933 [0.931–0.936] |
| 60 | Baseline | 0.893 [0.889–0.897] | 0.952 [0.946–0.957] | 0.970 [0.967–0.972] | 0.918 [0.914–0.921] |
| 60 | Augmented | 0.918 [0.914–0.921] | 0.972 [0.969–0.975] | 0.981 [0.979–0.983] | 0.937 [0.934–0.939] |
| 80 | Baseline | 0.897 [0.860–0.927] | 0.956 [0.936–0.970] | 0.972 [0.963–0.980] | 0.921 [0.901–0.937] |
| 80 | Augmented | 0.920 [0.915–0.925] | 0.972 [0.969–0.974] | 0.982 [0.980–0.984] | 0.939 [0.938–0.940] |
| 100 | Baseline | 0.899 [0.895–0.902] | 0.956 [0.952–0.959] | 0.973 [0.970–0.975] | 0.922 [0.918–0.925] |
Table 7. Spectral similarity between the full-real apnea PSD and the mixed real + augmented PSD under the SI setting (fold 1).
| r (%) | Pearson r | KL (real ‖ mix) | KL (mix ‖ real) | LF ΔP (%) | MF ΔP (%) | HF ΔP (%) |
|---|---|---|---|---|---|---|
| 5 | 0.9981 | 4.7190 × 10⁻³ | 4.4930 × 10⁻³ | −2.40 | −3.10 | −13.46 |
| 10 | 0.9984 | 4.5270 × 10⁻³ | 4.2910 × 10⁻³ | −0.01 | −3.31 | −12.25 |
| 20 | 0.9989 | 2.9320 × 10⁻³ | 2.8160 × 10⁻³ | +0.07 | −2.99 | −10.82 |
| 40 | 0.9989 | 1.7900 × 10⁻³ | 1.7530 × 10⁻³ | +1.10 | −2.05 | −8.20 |
| 60 | 0.9997 | 7.4800 × 10⁻⁴ | 7.3200 × 10⁻⁴ | −0.19 | −1.51 | −5.56 |
| 80 | 0.9999 | 2.3100 × 10⁻⁴ | 2.2800 × 10⁻⁴ | +0.29 | −0.48 | −2.80 |
Table 8. Comparison of QRS duration, R–R interval, and EDR variance between real and augmented apnea segments under the SI setting (fold 1). Reference values (100% real data): QRS = 86.81 ms, RRI = 679.67 ms, EDR variance = 2.72 × 10⁻¹.
| r (%) | QRSmix (ms) | ΔQRS (%) | RRImix (ms) | ΔRRI (%) | EDRmix | ΔEDR (%) |
|---|---|---|---|---|---|---|
| 5 | 86.89 | 0.09 | 689.28 | 1.41 | 2.48 × 10⁻¹ | −8.66 |
| 10 | 88.32 | 1.74 | 677.51 | −0.32 | 2.51 × 10⁻¹ | −7.49 |
| 20 | 87.08 | 0.32 | 679.98 | 0.05 | 2.61 × 10⁻¹ | −3.88 |
| 40 | 87.46 | 0.75 | 679.67 | 0.00 | 2.61 × 10⁻¹ | −3.91 |
| 60 | 87.14 | 0.38 | 680.79 | 0.17 | 2.67 × 10⁻¹ | −1.93 |
| 80 | 87.22 | 0.47 | 679.43 | −0.04 | 2.70 × 10⁻¹ | −0.69 |

Share and Cite

MDPI and ACS Style

Ryu, S.; Koh, J.; Jeong, I.c. Evaluation of Data Augmentation Under Label Scarcity for ECG-Based Detection of Sleep Apnea. Appl. Sci. 2025, 15, 13231. https://doi.org/10.3390/app152413231
