End-to-End Learnable Recurrence Plot for Sleep Stage Classification Using Non-Contact Ballistocardiography

Jeong, Jiseong; Yoo, Sunyong

doi:10.3390/electronics15091798



by Jiseong Jeong ¹,² and Sunyong Yoo ²,³,*

¹ Korea Electronics Technology Institute, 226, Cheomdangwagi-ro, Buk-gu, Gwangju 61011, Republic of Korea

² Department of Intelligent Electronics and Computer Engineering, Chonnam National University, Gwangju 61186, Republic of Korea

³ R&D Center, MATILO AI Inc., Gwangju 61186, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1798; https://doi.org/10.3390/electronics15091798

Submission received: 18 March 2026 / Revised: 17 April 2026 / Accepted: 21 April 2026 / Published: 23 April 2026

(This article belongs to the Section Bioelectronics)


Abstract

Accurate sleep stage classification is essential for evaluating sleep quality, yet clinical polysomnography is impractical for continuous home-based monitoring. Ballistocardiography (BCG) enables unobtrusive sleep monitoring through sensors embedded in sleep furniture; however, existing BCG-based approaches either rely on complex physiological feature extraction or employ fixed-parameter signal-to-image transformations that cannot adapt to inter-subject variability. This study proposes a learnable recurrence plot (RP) framework for three-stage sleep classification (Wake, NREM, REM) from single-channel BCG signals. The Learnable RP introduces three innovations: multi-scale phase-space reconstruction at physiologically motivated time delays (τ = 5, 10, 20), differentiable per-scale thresholds optimized end-to-end, and attention-based spatial fusion of multi-scale recurrence maps. The framework was evaluated through 10-fold stratified cross-validation across six backbone architectures using 50 overnight recordings. The Learnable RP consistently outperformed four baseline transformation methods (GAF, MTF, Classical RP, Modified RP), achieving an aggregate mean accuracy of 73.60%, with EfficientNet-B5 reaching 78.91%. Statistical validation across all 24 pairwise comparisons (4 baselines × 6 backbones) confirmed consistent superiority (all p < 0.001). The proposed framework achieves competitive performance without explicit physiological feature engineering, offering a viable path toward end-to-end unobtrusive sleep monitoring.

Keywords:

ballistocardiography; sleep stage classification; recurrence plot; learnable transformation; deep learning; unobtrusive monitoring; time-series imaging

1. Introduction

Approximately 16% of the global adult population, corresponding to over 800 million individuals, are reported to suffer from sleep disorders, underscoring that sleep health has emerged as a critical public health concern at the global scale [1]. Degraded sleep quality exerts wide-ranging effects on physiological functions throughout the body and elevates the risk of premature mortality [2,3,4]. Accurate classification of sleep stages is essential for evaluating sleep quality, because representative metrics such as wake after sleep onset (WASO) and sleep efficiency can only be computed when sleep stages have been correctly identified [5,6]. Structural alterations in sleep architecture have been closely associated with a broad spectrum of clinical conditions, including neurodegenerative diseases, dementia risk, obstructive sleep apnea, type 2 diabetes, cardiovascular disease, and traumatic brain injury [7,8,9,10,11,12,13].

Currently, polysomnography (PSG) performed according to standardized scoring criteria published by the American Academy of Sleep Medicine (AASM, Darien, IL, USA) is recognized as the gold standard for evaluating sleep stages [14]. PSG simultaneously records multiple physiological signals, including electroencephalography (EEG), electromyography (EMG), electrooculography (EOG), respiratory effort, and peripheral oxygen saturation (SpO₂), enabling precise analysis of sleep structure [15]. However, PSG is typically performed for only a single overnight session in a hospital laboratory setting, making it difficult to capture the long-term patterns of habitual sleep that reflect changes in sleep habits, environmental conditions, and circadian rhythm. Moreover, the discomfort caused by the attachment of numerous sensors, the artificiality of the testing environment, high costs, and the need for expert scoring all limit the feasibility of repeated PSG use in the home setting [16,17]. For these reasons, home-based continuous monitoring technologies capable of observing natural sleep over extended periods during daily life are urgently needed.

To address these limitations, sleep stage estimation using wearable devices in the form of rings, watches, and headbands has been actively investigated [18,19,20,21,22]. Yet these devices rely on contact-based sensors such as photoplethysmography (PPG), EEG, and skin thermometers, which introduce discomfort during prolonged wear. Such constraints have driven ongoing research into unobtrusive sleep monitoring approaches based on ballistocardiography (BCG), ultra-wideband (UWB) radar, cameras, and similar technologies [23,24,25,26,27]. BCG signals, in particular, can be unobtrusively acquired by embedding piezoelectric sensors, air pressure sensors, or polyvinylidene fluoride (PVDF) films into sleep furniture such as mattresses and pillows [23,28,29,30]. Because no cumbersome sensors are attached to the body, subjects can be monitored continuously over long periods in their natural sleeping environment.

Sustained efforts have been devoted to realizing automatic sleep stage classification from BCG signals. One classical strategy is to directly feed the raw one-dimensional (1D) BCG waveform into a learning model. Ahmed et al. employed a 1D convolutional neural network (CNN) for binary sleep/wake classification from BCG signals and reported favorable performance on this relatively simple two-class task [31]. Yang et al. trained a CNN–LSTM ensemble on 1D time-series data to perform sleep/wake discrimination [23]. A clear limitation of these studies, however, is that they are confined to binary classification and therefore fail to capture the full sleep stage structure. Additionally, learning directly from raw time-series input without dedicated feature extraction tends to be inefficient.

Sleep stages are known to exhibit strong correlations with autonomic nervous system–derived indices such as heart rate variability (HRV), R–R interval variability, and breath-to-breath interval variability [32,33]. Extracting features through these indicators can therefore serve as an important strategy for improving the accuracy of sleep stage classification models. Several studies have computed secondary physiological indices, including HRV, respiratory rate variability (RRV), and respiratory patterns, from BCG signals to enhance classification performance [28,29,30,34]. These methods, however, require complex signal processing pipelines for heart rate and respiration rate extraction. Furthermore, the conversion process may discard the rich fine-grained vibration information originally embedded in the raw BCG signal.

As an alternative that minimizes information loss without elaborate preprocessing, techniques that transform 1D time series into two-dimensional (2D) images, such as Gramian Angular Fields (GAF), Markov Transition Fields (MTF), and Recurrence Plots (RP), have gained widespread adoption [35,36,37,38,39]. These approaches tend to yield higher accuracy than direct 1D BCG classification while being more efficient than methods based on derived physiological features. Padovani et al. applied GAF, MTF, and RP transformations to smartwatch data and demonstrated improved 2D CNN-based sleep stage classification accuracy [39]. Beyond replacing complex physiological-feature-based preprocessing while maintaining strong classification performance, the 2D image input format is readily compatible with a wide range of CNN and Vision Transformer (ViT) backbone architectures, allowing flexible adoption of state-of-the-art models. Nonetheless, existing 2D transformation techniques operate with fixed parameters throughout the conversion process, which limits their ability to accommodate the subject-specific signal characteristics of BCG, characteristics that vary considerably with changes in sleeping posture, body weight, and body composition.

To address this gap, this study proposes an end-to-end Learnable Recurrence Plot (Learnable RP) framework for three-stage sleep classification (Wake, NREM (Non-Rapid Eye Movement), REM (Rapid Eye Movement)) from single-channel BCG signals. Unlike existing fixed-parameter transformations, the proposed method co-optimizes the signal-to-image transformation and the downstream classifier, enabling the representation to adapt to inter-subject BCG variability without any handcrafted feature engineering. The main contributions of this work are as follows:

(1): We propose a differentiable recurrence plot module with per-scale learnable thresholds, trained end-to-end with a standard image classification backbone, eliminating the need for manual threshold selection that has limited conventional RP-based approaches.
(2): We introduce a multi-scale phase-space reconstruction strategy using physiologically motivated time delays (τ = 5, 10, 20), combined with a convolutional attention mechanism that spatially fuses complementary recurrence maps from each scale, capturing BCG dynamics at multiple temporal resolutions simultaneously.
(3): We conduct a systematic evaluation involving 50 overnight BCG recordings, 10-fold stratified cross-validation, and six backbone architectures, demonstrating consistent and statistically significant superiority over four baseline transformation methods (GAF, MTF, Classical RP, Modified RP) across all 24 pairwise comparisons (all p < 0.001, Cohen’s d = 1.82–14.64).

2. Materials and Methods

2.1. Dataset

BCG signals were collected from 50 healthy adults (age range: 20–80 years), each recorded for one overnight session, yielding a total of 50 recording nights. Signals were acquired at 100 Hz using piezoelectric sensors embedded in a mattress-type sleep monitoring device. The hardware acquisition circuit incorporates an analogue bandpass filter (0.1–30 Hz), which attenuates low-frequency baseline drift and high-frequency noise outside the physiologically relevant BCG band prior to digitisation. The device housed five BCG sensor channels (bcg_1 through bcg_5), one PPG channel, and one force-sensing resistor (FSR) channel; only the primary BCG channel (bcg_1) was used for sleep stage classification in this study.

Reference sleep stage labels were obtained using the Dreem Headband (Rythm, Paris, France), a commercially available EEG-based wearable device that has been validated against expert-scored PSG [22]. The Dreem Headband estimates sleep stages from frontal EEG signals in accordance with AASM guidelines. Each recording was segmented into 30 s epochs following AASM conventions. The original five-class labels (Wake, N1, N2, N3, REM) were consolidated into three classes: Wake (class 0), NREM (class 1, merging N1, N2, and N3), and REM (class 2).
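The label consolidation described above can be sketched as a simple lookup. This is a minimal illustration only: the stage strings and the function name are hypothetical, since the source specifies just the class indices.

```python
# Hypothetical stage names; the source only fixes the class indices 0, 1, 2.
STAGE_TO_CLASS = {"Wake": 0, "N1": 1, "N2": 1, "N3": 1, "REM": 2}

def consolidate(labels):
    """Map five-class stage names to the three-class scheme (Wake/NREM/REM)."""
    return [STAGE_TO_CLASS[stage] for stage in labels]

hypnogram = ["Wake", "N1", "N2", "N3", "REM", "N2"]  # one label per 30 s epoch
print(consolidate(hypnogram))  # [0, 1, 1, 1, 2, 1]
```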

The study protocol was approved by the Institutional Review Board under three separate approvals: P01-202508-01-023 (24 participants), P01-202506-01-005 (13 participants), and P01-202308-01-041 (13 participants). All participants provided written informed consent prior to data collection.

2.2. Overview of the Proposed Framework

The proposed framework processes raw single-channel BCG epochs through five successive stages, as illustrated in Figure 1. First, each epoch is independently z-score normalized. Second, the normalized signal is embedded into phase space at three time delays (τ ∈ {5, 10, 20}) in parallel. Third, pairwise Euclidean distances are computed for each embedding and min–max normalized to [0, 1]. Fourth, each distance matrix is converted into a soft recurrence map via a learnable threshold ε_k = σ(α_k) × 0.5, and scaled by a learned importance weight w_k = σ(β_k). Fifth, the three weighted maps are fused through a convolutional spatial attention block and a fusion head into a single 224 × 224 image, which is subsequently classified into three sleep stages (Wake, NREM, REM) by a pretrained image classification backbone. The detailed formulation of each stage is provided in Section 2.4.2.

What distinguishes the Learnable RP from conventional transformations is that all internal parameters—threshold logits {α_k}, scale weights {β_k}, and convolutional weights in the attention-fusion block—are jointly optimized end-to-end with the backbone via backpropagation. This design is particularly relevant for BCG, where inter-subject amplitude differences, postural artifacts, and sensor-placement variability render any single fixed parameterization suboptimal.

2.3. Signal Preprocessing

BCG signals were sampled at 100 Hz. Each recording was segmented into non-overlapping 30 s epochs, yielding a signal vector x ∈ ℝ^L with L = 3000 samples per epoch. No additional software-level bandpass filtering was applied, as the acquisition hardware already incorporates an analogue bandpass filter (0.1–30 Hz) that suppresses low-frequency baseline drift and high-frequency noise prior to digitisation. Each epoch was independently z-score normalized to zero mean and unit variance prior to transformation, standardising inter-subject amplitude scaling differences arising from differences in body mass, sleeping posture, and sensor-mattress coupling without discarding any spectral content. For the Learnable RP, z-score normalization is performed within the transformation module itself, ensuring that the normalization step is part of the differentiable computational graph.
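A minimal NumPy sketch of the segmentation and per-epoch z-score normalization follows. The function name, the handling of a trailing partial epoch, and the small stabilizing constant are assumptions for illustration; the source states only the sampling rate, the epoch length, and the normalization target.

```python
import numpy as np

FS = 100         # sampling rate in Hz (from the text)
EPOCH_SEC = 30   # epoch length in seconds (from the text)

def segment_and_normalize(signal, fs=FS, epoch_sec=EPOCH_SEC, eps=1e-8):
    """Split a 1D recording into non-overlapping epochs and z-score each one."""
    length = fs * epoch_sec                       # 3000 samples per epoch
    n_epochs = len(signal) // length              # assumption: drop a trailing partial epoch
    epochs = np.asarray(signal[: n_epochs * length], dtype=float)
    epochs = epochs.reshape(n_epochs, length)
    mean = epochs.mean(axis=1, keepdims=True)
    std = epochs.std(axis=1, keepdims=True)
    return (epochs - mean) / (std + eps)          # assumption: eps guards against flat epochs

rng = np.random.default_rng(0)
recording = rng.normal(loc=5.0, scale=2.0, size=FS * 95)  # 95 s of synthetic data
epochs = segment_and_normalize(recording)
print(epochs.shape)  # (3, 3000)
```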

2.4. 1D-to-2D Transformation Methods

Five transformation strategies were compared. Four of them, namely GAF, MTF, Classical RP, and Modified RP, use fixed, predetermined parameters and serve as baselines. The fifth, the proposed Learnable RP, incorporates trainable parameters that are co-optimized with the classifier.

2.4.1. Baseline Methods: GAF, MTF, Classical RP, and Modified RP

Gramian Angular Field (GAF). GAF encodes temporal correlations by mapping normalized signal values to the angular domain and computing a cosine-based pairwise matrix [35,36]. The diagonal captures instantaneous values, while off-diagonal entries encode pairwise temporal dependencies.
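As a rough illustration of the GAF construction, the sketch below computes the summation variant, cos(φ_i + φ_j), after rescaling the signal to [−1, 1]. The choice of the summation variant is an assumption, since the text says only "cosine-based", and any downsampling to the final image size is omitted.

```python
import numpy as np

def gramian_angular_summation_field(x):
    """GASF sketch: cos(phi_i + phi_j) with phi = arccos(rescaled x)."""
    x = np.asarray(x, dtype=float)
    x_scaled = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    x_scaled = np.clip(x_scaled, -1.0, 1.0)   # keep rounding errors inside the arccos domain
    phi = np.arccos(x_scaled)
    return np.cos(phi[:, None] + phi[None, :])

t = np.linspace(0.0, 1.0, 64)
gaf = gramian_angular_summation_field(np.sin(2 * np.pi * 3 * t))
print(gaf.shape)  # (64, 64)
```

With this variant, each diagonal entry equals cos(2φ_i) = 2x̃_i² − 1, so the diagonal depends only on the instantaneous rescaled value.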

Markov Transition Field (MTF). MTF takes a probabilistic approach: the signal is discretized into quantile bins, and a first-order Markov transition matrix is estimated [35,37]. Each element of the resulting image corresponds to the transition probability between the quantile bins of the respective time-point pair. Unlike GAF, MTF preserves the dynamic transition structure of the signal.
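The MTF construction can be sketched in the same spirit. The number of quantile bins (`n_bins=8`) is a hypothetical value chosen for illustration; the source does not report it.

```python
import numpy as np

def markov_transition_field(x, n_bins=8):
    """MTF sketch: image[i, j] = P(bin of x_i -> bin of x_j) under a first-order chain."""
    x = np.asarray(x, dtype=float)
    # Interior quantile edges, then a bin index in 0..n_bins-1 for every sample
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    q = np.digitize(x, edges)
    # Count transitions between consecutive samples and row-normalize
    W = np.zeros((n_bins, n_bins))
    np.add.at(W, (q[:-1], q[1:]), 1.0)
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    # Spread the transition probabilities over all time-point pairs
    return W[q[:, None], q[None, :]]

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 200)
mtf = markov_transition_field(np.sin(2 * np.pi * 1.2 * t) + 0.1 * rng.standard_normal(200))
print(mtf.shape)  # (200, 200)
```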

Classical Recurrence Plot (Classical RP). Recurrence plots were first introduced by Eckmann et al. as a graphical tool for visualizing recurrent states in dynamical systems [40]. Given a time series, the phase-space trajectory is reconstructed via Takens’ time-delay embedding with embedding dimension m and delay τ, producing a set of embedded state vectors. The Classical RP then applies a binary (Heaviside) threshold ε to the pairwise Euclidean distances between these embedded vectors: points whose distance falls below ε are marked as recurrent, while all others are marked as non-recurrent [38]. In this study, m = 3 and τ = 10 were used. Following Takens’ embedding theorem (m ≥ 2D₂ + 1) [41], the BCG cardiac ejection cycle is well approximated by a limit cycle attractor (D₂ ≈ 1), yielding a minimum sufficient embedding dimension of m = 3. The choice of τ = 10 is motivated by the temporal structure of the BCG waveform: Kim et al. (2016) reported the I–K interval—encompassing the complete primary ejection complex from the onset of the ascending aortic pressure gradient (I wave) to the peak decay of the descending aortic pressure gradient (K wave)—at approximately 158–163 ms [42]. Given m = 3, τ = 10 (100 ms) yields a state-vector span of (m − 1) × τ = 200 ms, which is sufficient to embed this complete ejection complex within a single phase-space point. While this binary representation provides a clean visualization of the system’s recurrence structure, it discards distance magnitude information that could aid downstream classification.
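The following sketch illustrates the Classical RP with m = 3 and τ = 10 as stated above. The threshold value `eps=0.5` is a hypothetical placeholder, since the fixed ε used for this baseline is not reported here, and a short synthetic signal stands in for a real BCG epoch.

```python
import numpy as np

def time_delay_embed(x, m=3, tau=10):
    """Takens-style embedding: row i is (x_i, x_{i+tau}, ..., x_{i+(m-1)tau})."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau
    return np.stack([x[j * tau : j * tau + n] for j in range(m)], axis=1)

def classical_rp(x, m=3, tau=10, eps=0.5):
    """Binary recurrence plot: 1 where the embedded-state distance falls below eps."""
    s = time_delay_embed(x, m, tau)
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    return (d < eps).astype(np.uint8)   # eps is a hypothetical value

fs = 100
t = np.arange(3 * fs) / fs              # 3 s of synthetic data at 100 Hz
rp = classical_rp(np.sin(2 * np.pi * 1.2 * t))
span_ms = (3 - 1) * 10 * 1000 // fs     # state-vector span (m - 1) * tau in milliseconds
print(rp.shape, span_ms)  # (280, 280) 200
```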

Modified Recurrence Plot (Modified RP). Modified RP retains richer textural detail by replacing the binary Heaviside function with min–max normalization of pairwise distances to [0, 1]. The continuous-valued output preserves gradient information in the distance landscape, which CNNs can exploit more effectively than the binary output of the Classical RP. However, the embedding parameters (m = 3, τ = 10) remain fixed, offering no mechanism for signal-specific adaptation.
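For comparison, a minimal sketch of the Modified RP keeps the same fixed embedding but returns min–max normalized distances instead of a binary map. The small constant in the denominator is an assumption added to avoid division by zero for constant signals.

```python
import numpy as np

def modified_rp(x, m=3, tau=10):
    """Continuous-valued RP: pairwise distances min-max normalized to [0, 1]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau
    s = np.stack([x[j * tau : j * tau + n] for j in range(m)], axis=1)
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
    return (d - d.min()) / (d.max() - d.min() + 1e-12)   # assumption: stabilizing constant

t = np.arange(300) / 100.0
mrp = modified_rp(np.sin(2 * np.pi * 1.2 * t))
print(mrp.shape)  # (280, 280)
```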

2.4.2. Proposed: Learnable Recurrence Plot

The central contribution of this work is a recurrence plot module whose key parameters are learned rather than prescribed. Three design choices set it apart from fixed-parameter alternatives: (a) multi-scale phase-space reconstruction using several time delays in parallel; (b) soft, differentiable thresholding with trainable ε values; and (c) an attention-driven mechanism that fuses the resulting multi-scale maps into a single image.

Multi-scale time-delay embedding. Instead of committing to a single delay value, the signal is embedded at K = 3 different scales: τ ∈ {5, 10, 20}. At a 100 Hz sampling rate, these correspond to delays of 50, 100, and 200 ms, respectively. These delays should be interpreted as complementary embedding offsets that emphasize different local dynamical structures within the BCG waveform, rather than as direct representations of full cardiac or respiratory cycle periods. Specifically, τ = 5 (50 ms) emphasizes short-lag morphological structure within the BCG waveform, including the fine structure of the cardiac ejection complex; τ = 10 (100 ms) captures intermediate temporal dependencies that may reflect short-lag variations in cardiac dynamics [32,33]; and τ = 20 (200 ms) yields a state-vector span of 400 ms, encompassing beat-to-beat interval modulations attributable to respiratory sinus arrhythmia (RSA; 0.15–0.4 Hz) [32,33]—a well-established autonomic modulation of cardiac rhythm that decreases systematically from NREM to REM and is suppressed during wakefulness. The relevant temporal scale for each delay is the state-vector span (m − 1) × τ, not τ alone: at m = 3, the spans are 100 ms, 200 ms, and 400 ms for τ = 5, 10, and 20, respectively. The empirical validation of this τ selection is reported in Section 3.4. For each scale k, the standard time-delay embedding is applied with embedding dimension m = 3:

s_{i}^{(k)} = (\tilde{x}_{i}, \tilde{x}_{i + τ_{k}}, \tilde{x}_{i + 2τ_{k}}), \quad i = 1, 2, \dots, L - 2τ_{k}

(1)

By running these three embeddings in parallel, the module constructs three distinct phase-space trajectories from the same epoch, each emphasizing different local dynamical structures without requiring explicit signal decomposition.
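The parallel embeddings and their state-vector spans can be checked with a short sketch. The function name is hypothetical, and a random vector stands in for a z-scored epoch of L = 3000 samples.

```python
import numpy as np

FS = 100            # Hz (from the text)
M = 3               # embedding dimension (from the text)
TAUS = (5, 10, 20)  # delays in samples (from the text)

def multi_scale_embed(x_tilde, taus=TAUS, m=M):
    """Return one embedded trajectory per delay, keyed by tau."""
    x_tilde = np.asarray(x_tilde, dtype=float)
    out = {}
    for tau in taus:
        n = len(x_tilde) - (m - 1) * tau   # number of state vectors, L - 2*tau for m = 3
        out[tau] = np.stack([x_tilde[j * tau : j * tau + n] for j in range(m)], axis=1)
    return out

x = np.random.default_rng(0).standard_normal(3000)   # stand-in for a normalized epoch
embeddings = multi_scale_embed(x)
for tau, s in embeddings.items():
    span_ms = (M - 1) * tau * 1000 // FS
    print(tau, s.shape, span_ms)
# 5 (2990, 3) 100
# 10 (2980, 3) 200
# 20 (2960, 3) 400
```

Note that the three trajectories have different lengths, so the resulting distance matrices differ in size before being brought to a common resolution; how that is done is not detailed in this section.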

Differentiable threshold. A major shortcoming of the Classical RP is the need to choose ε manually. In the proposed method, this fixed threshold is replaced with a learned parameter. For each scale k, the pairwise Euclidean distances are computed and min–max normalized to [0, 1]. A sigmoid-parameterized threshold ε_k = σ(α_k) × 0.5 then converts these distances into soft recurrence values:

R^{(k)}(i, j) = σ(-γ \cdot (\tilde{D}^{(k)}(i, j) - ε_{k}))

(2)

where α_k is a scalar learnable parameter, σ denotes the sigmoid function, and γ = 10 controls transition sharpness. Constraining ε_k to the interval (0, 0.5) keeps thresholds within a meaningful range. During training, gradients flow back through Equation (2) into α_k, allowing each scale’s threshold to converge to whatever value maximizes classification performance.
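A forward-pass sketch of the soft threshold in Equation (2) is given below in plain NumPy. It only evaluates the expression; in the actual module α_k would be a trainable parameter inside an automatic-differentiation framework, which this sketch does not reproduce. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_recurrence(d_norm, alpha, gamma=10.0):
    """Soft recurrence values from normalized distances in [0, 1]."""
    eps_k = sigmoid(alpha) * 0.5                  # threshold constrained to (0, 0.5)
    return sigmoid(-gamma * (d_norm - eps_k)), eps_k

d_norm = np.array([0.0, 0.25, 0.5, 1.0])          # example normalized distances
r, eps_k = soft_recurrence(d_norm, alpha=0.0)
print(eps_k)   # 0.25
print(r[1])    # 0.5, since the distance equals the threshold
```

Distances below ε_k map to values above 0.5 and distances above ε_k map to values below 0.5, with γ controlling how sharply the transition occurs.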

Scale-wise attention and fusion. Each scale-specific map R^{(k)} is first multiplied by a learned importance weight w_k = σ(β_k) to form the weighted map \tilde{R}^{(k)} = w_k · R^{(k)}. The three weighted maps are then stacked into a tensor \tilde{R} ∈ ℝ^{K×N×N} and processed by a lightweight convolutional attention block:

A = Softmax(Conv_{1 \times 1}(ReLU(BN(Conv_{3 \times 3}(\tilde{R})))))

(3)

where BN denotes batch normalization and the softmax is applied along the scale dimension k at each spatial location (i, j), yielding A ∈ ℝ^{K×N×N}. The attention map A reweights each scale at every spatial position, and the attended tensor \tilde{R} ⊙ A (element-wise multiplication) is then collapsed into a single-channel output through a fusion head consisting of a 3 × 3 convolution followed by batch normalization and ReLU activation, and a final 1 × 1 convolution with sigmoid activation, i.e., R_final = Sigmoid(Conv_{1 \times 1}(ReLU(BN(Conv_{3 \times 3}(\tilde{R} ⊙ A))))). Both the attention and fusion blocks employ 16 intermediate feature channels, keeping the module lightweight to avoid overshadowing the backbone’s role in feature extraction.
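The scale-wise reweighting step can be illustrated without a deep learning framework. In the sketch below the convolutional layers are deliberately omitted: random logits stand in for the output of the attention block, and random maps stand in for the soft recurrence maps. Only the scale weighting, the softmax over the scale axis, and the element-wise product are shown.

```python
import numpy as np

def softmax_over_scales(logits):
    """Numerically stable softmax along axis 0 (the scale dimension k)."""
    z = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
K, N = 3, 8                                   # three scales, toy spatial size
R = rng.random((K, N, N))                     # stand-in soft recurrence maps
beta = np.zeros(K)                            # scale logits; sigma(0) = 0.5
w = 1.0 / (1.0 + np.exp(-beta))               # importance weights w_k = sigma(beta_k)
R_tilde = w[:, None, None] * R                # weighted maps, shape (K, N, N)

logits = rng.standard_normal((K, N, N))       # stand-in for the conv attention output
A = softmax_over_scales(logits)               # sums to 1 over k at each (i, j)
attended = R_tilde * A                        # input to the fusion head
print(attended.shape)  # (3, 8, 8)
```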

End-to-end joint optimization. Every parameter in the Learnable RP module—threshold logits {α_k}, scale weights {β_k}, and all convolutional weights in the attention-fusion block—is updated through standard backpropagation alongside the backbone. A higher learning rate (6 × 10⁻⁴) is assigned to the transformation parameters than to the pretrained backbone (3 × 10⁻⁴), so that the module adapts quickly while the backbone fine-tunes gradually. It should be noted that ‘end-to-end’ here refers specifically to the joint optimization of the RP module parameters, scale weights, attention-fusion convolutional weights, and backbone classifier via a single backpropagation pass; z-score normalization and the fixed time-delay embedding geometry are standard preprocessing steps and are not subject to gradient-based optimization.

The fixed τ geometry does constrain the phase-space reconstruction prior to any learned transformation. However, inter-subject adaptability is provided by the learned components operating on the resulting distance matrices: the per-scale thresholds ε_k adapt the recurrence boundary to the BCG amplitude distribution, and the scale weights w_k = σ(β_k), together with the spatial attention map A, selectively emphasise the temporal scales that are most discriminative for a given cardiac morphology. This is directly evidenced by the 5.02–7.24 pp accuracy gain over Modified RP, which employs an identical fixed phase-space geometry but uses non-adaptive parameters, confirming that learned thresholds and attention weights provide inter-subject adaptation that fixed parameterisation cannot achieve. Extension to a continuous, learnable τ space is identified as a direction for future work (Section 4.5).

2.5. Backbone Architectures

Transformed images (224 × 224, single channel) were fed to seven pretrained image classification backbones spanning different design paradigms: EfficientNet-B5 and EfficientNet-B0 (compound scaling), ResNet-50 (skip connections), DenseNet-121 (dense connectivity), MobileNetV3-Large (depthwise-separable convolutions), Swin-Tiny (shifted-window self-attention), and ConvNeXt-Tiny (modernized convolutional design). All models were sourced from the timm library and initialized with ImageNet-pretrained weights. The first convolutional layer was modified to accept a single input channel, and the final fully connected layer was replaced with a three-class output head.

2.6. Training Configuration

Cross-entropy loss with label smoothing (s = 0.1) was employed. Because sleep datasets are inherently imbalanced, with NREM epochs typically far outnumbering Wake or REM, class weights were automatically computed from inverse class frequencies in the training fold. The AdamW optimizer was used with an initial learning rate of 3 × 10⁻⁴ and weight decay of 1 × 10⁻⁴. The learning rate was halved whenever the monitored metric plateaued for five consecutive epochs (ReduceLROnPlateau). Gradient norms were clipped at 5.0, and mixed-precision training (FP16) was enabled throughout. The batch size was set to 256.
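One common way to derive class weights from inverse class frequencies is sketched below. The specific normalization (total count divided by the number of classes times the per-class count) is an assumption, as the exact formula is not given, and the label counts are synthetic.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes=3):
    """Class weights proportional to 1 / class frequency (assumes every class occurs)."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)   # assumption: 'balanced' normalization

# Synthetic training fold: 10 Wake, 70 NREM, 20 REM epochs
labels = np.array([0] * 10 + [1] * 70 + [2] * 20)
weights = inverse_frequency_weights(labels)
print(np.round(weights, 3))  # [3.333 0.476 1.667]
```

Under this scheme the majority NREM class receives a weight below 1, while the rarer Wake and REM classes are up-weighted in the loss.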

Training was terminated after 15 epochs without improvement. Rather than tracking accuracy alone, a composite early stopping metric was monitored:

C = 0.5 · F1_macro + 0.3 · Accuracy + 0.2 · κ

(4)

with a minimum improvement threshold δ = 0.001. Assigning the largest weight to F1_macro guards against models that achieve high overall accuracy by neglecting minority classes. Cohen’s κ further penalizes agreement attributable to chance.
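Equation (4) and the stopping rule can be sketched as follows. The class and function names are hypothetical, and all three metrics are assumed to be expressed on a 0 to 1 scale. The example input reuses the aggregate values reported in Section 3.1 purely to illustrate the arithmetic; they are not validation scores from a training run.

```python
def composite_score(f1_macro, accuracy, kappa):
    """Composite early-stopping metric C from Equation (4)."""
    return 0.5 * f1_macro + 0.3 * accuracy + 0.2 * kappa

class EarlyStopper:
    """Stop after `patience` epochs without an improvement larger than `min_delta`."""

    def __init__(self, patience=15, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

c = composite_score(f1_macro=0.6887, accuracy=0.7360, kappa=0.5202)
print(f"{c:.4f}")  # 0.6692
```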

2.7. Evaluation Procedure

A 10-fold stratified cross-validation scheme was adopted. Within each outer fold, an inner 5-fold stratified split allocated 80% of the training partition for optimization and 20% for validation, thereby preventing data leakage and yielding robust performance estimates. The following metrics are reported: overall accuracy, macro-averaged F1-score, weighted F1-score, Cohen’s κ, and multi-class Receiver Operating Characteristic-Area Under the Curve (ROC-AUC), all expressed as mean ± standard deviation across the 10 folds. For pairwise comparisons between the Learnable RP and each baseline, paired t-tests were conducted on the per-fold accuracy values. Cohen’s d was computed to quantify practical significance, with effect sizes interpreted following standard thresholds: |d| < 0.2 negligible, 0.2–0.5 small, 0.5–0.8 medium, and >0.8 large. Bonferroni correction was applied to control the family-wise error rate across all 24 comparisons (4 baselines × 6 backbones).
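The paired statistics can be sketched with the standard library alone. Two assumptions are made here: Cohen's d is taken as the paired form (mean difference divided by the standard deviation of the differences), which the text does not specify, and the per-fold accuracies are invented for illustration. The p-value itself would additionally require the t-distribution with n − 1 degrees of freedom and is omitted.

```python
import math

def paired_stats(a, b):
    """Mean difference, paired Cohen's d (d_z), and paired t statistic."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))  # sample SD of differences
    d_z = mean / sd                      # assumption: paired form of Cohen's d
    t = mean / (sd / math.sqrt(n))       # t statistic with n - 1 degrees of freedom
    return mean, d_z, t

# Hypothetical per-fold accuracies (percent), not values from the study
learnable = [73.1, 74.0, 73.5, 73.9, 73.2]
baseline = [68.0, 68.7, 68.1, 68.9, 68.3]
mean_diff, d_z, t_stat = paired_stats(learnable, baseline)

alpha_bonferroni = 0.05 / 24             # 4 baselines x 6 backbones
print(round(alpha_bonferroni, 5))  # 0.00208
```

Because d_z divides by the spread of the fold-wise differences, it grows very large whenever the differences are nearly constant across folds, and t is simply d_z multiplied by the square root of the number of folds.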

3. Results

3.1. Aggregate Performance

Table 1 summarizes classification performance across all transformation–backbone combinations. A total of 350 experiments (5 methods × 7 backbones × 10 folds) were conducted. ConvNeXt-Tiny was excluded from the aggregate analysis because it failed to converge under any transformation method (see Section 4.3), yielding 300 runs (5 methods × 6 backbones × 10 folds) for the results reported below.

The Learnable RP achieved the highest performance across all metrics, attaining 73.60% accuracy, 68.87% macro F1-score, a Cohen’s κ of 0.5202, and an ROC-AUC of 0.8353. It outperformed the runner-up (Modified RP) by 5.35 percentage points (pp) in accuracy and 6.53 pp in macro F1-score. Notably, its standard deviation was the lowest among the five methods (3.19%), indicating that the learned parameters not only improve mean performance but also stabilize it across folds and backbones. Classical RP trailed by nearly 15 pp in accuracy, consistent with the expectation that its binary thresholding discards fine-grained distance information critical for distinguishing sleep stages in BCG signals.

Figure 2 presents the aggregate results across all four evaluation metrics. As shown in Figure 2a, the Learnable RP achieved 73.6% accuracy, exceeding Modified RP (68.2%) by 5.4 pp and Classical RP (58.7%) by 14.9 pp. A consistent pattern is observed for F1-macro (Figure 2b), where the Learnable RP attained 68.9% compared to 62.3% for Modified RP. The Learnable RP also demonstrated the largest Cohen’s κ (0.520 versus 0.428 for Modified RP; Figure 2c) and ROC-AUC (0.835 versus 0.790 for Modified RP; Figure 2d), confirming that the performance gains extend to class-balanced agreement and probabilistic discrimination.

The Learnable RP ranked first for all six backbones, with mean fold-wise accuracy ranging from 69.58% (MobileNetV3-Large) to 78.91% (EfficientNet-B5), followed by DenseNet-121 (74.64%), Swin-Tiny (74.62%), ResNet-50 (72.39%), and EfficientNet-B0 (71.43%). EfficientNet-B5 produced the best overall result, nearly 10 pp above MobileNetV3-Large, reflecting the larger model’s superior capacity for extracting fine-textured patterns from recurrence images. The Swin-Tiny transformer backbone reached 74.62%, suggesting that shifted-window attention can also exploit RP-derived textures effectively, though not as well as compound-scaled CNNs in this setting.
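As a quick consistency check, the six backbone-level accuracies listed above can be averaged and compared with the aggregate figure. The values are taken from the text; only the variable names are introduced here.

```python
# Mean fold-wise accuracy (%) of the Learnable RP per backbone, as listed in the text
accuracy = {
    "EfficientNet-B5": 78.91,
    "DenseNet-121": 74.64,
    "Swin-Tiny": 74.62,
    "ResNet-50": 72.39,
    "EfficientNet-B0": 71.43,
    "MobileNetV3-Large": 69.58,
}

mean_acc = sum(accuracy.values()) / len(accuracy)
spread = max(accuracy.values()) - min(accuracy.values())
print(f"{mean_acc:.3f} {spread:.2f}")
```

The mean comes to approximately 73.595, in line with the reported aggregate of 73.60%, and the gap between the best and worst backbone is about 9.33 pp.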

3.2. Statistical Validation

All 24 pairwise comparisons (4 baselines × 6 backbones) reached significance at p < 0.001 after Bonferroni correction (α = 0.05/24 = 0.00208), with Cohen’s d ranging from 1.82 to 14.64. Table 2 presents a representative subset ordered by effect size.

Effect sizes were uniformly large (all d > 1.8), with the most extreme values observed against Classical RP on EfficientNet-B5 (d = 14.64) and MTF on ResNet-50 (d = 14.45). We note that in a 10-fold cross-validation on 50 subjects, fold-to-fold performance differences exhibit low variance by construction, which mechanically inflates Cohen’s d; extreme values such as d = 14.64 should not be interpreted as population-level effect size estimates, but as indicators of directional consistency of fold-wise differences within this dataset. The practically meaningful quantities are the accuracy differences themselves: +17.26 pp over Classical RP and +5.02–7.24 pp over Modified RP across backbone combinations. The uniformly large d values nonetheless confirm that the performance advantage of the Learnable RP is consistent in direction across all folds and backbone combinations.

3.3. Best-Case Analysis: Confusion Matrix and ROC Curves

The single best model across all 300 analyzed runs was the Learnable RP paired with EfficientNet-B5 in Fold 8, achieving 80.24% test accuracy.

The confusion matrix (Figure 3) reveals per-class recall patterns. REM was classified most reliably (85.63%), followed by Wake (78.10%) and NREM (66.81%). The relatively lower NREM recall is primarily attributable to confusion with REM: 28.25% of NREM epochs were misclassified as REM. Wake and REM exhibited moderate mutual confusion (16.26% of Wake epochs misclassified as REM), which has a physiological basis, as both stages feature elevated sympathetic tone relative to NREM, producing overlapping BCG morphologies. Notably, misclassifications between Wake and NREM were comparatively rare (5.64% and 4.94%), suggesting that the model effectively distinguishes the strong physiological contrast between wakefulness and non-REM sleep.
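The row-normalized percentages quoted above can be cross-checked: for Wake and NREM, the three stated entries of each row should sum to 100%. The REM row is left out because only its recall is given in the text; the split of its remaining share between Wake and NREM is not stated.

```python
# Row-normalized confusion percentages (true class -> predicted class), from the text
rows = {
    "Wake": {"Wake": 78.10, "NREM": 5.64, "REM": 16.26},
    "NREM": {"Wake": 4.94, "NREM": 66.81, "REM": 28.25},
}

row_totals = {stage: sum(preds.values()) for stage, preds in rows.items()}
rem_misclassified = 100.0 - 85.63   # REM epochs not recalled; the split is not reported

for stage, total in row_totals.items():
    print(stage, f"{total:.2f}")
print("REM misclassified", f"{rem_misclassified:.2f}")
```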

The multi-class ROC curves (Figure 4) confirmed strong discriminative capacity across all three classes, with all per-class AUC values exceeding 0.83. Wake achieved the highest AUC (0.904), indicating that the model assigns high confidence scores to true Wake epochs with minimal false-positive trade-off. REM followed at 0.857, while NREM yielded 0.840, consistent with the confusion matrix observation that NREM is the most challenging class owing to its overlap with REM. The micro-average AUC of 0.854 indicates robust overall probabilistic discrimination well above the chance level of 0.5.

3.4. Ablation Study

To empirically validate the choice of τ ∈ {5, 10, 20} and to quantify the contribution of the learned transformation parameters, two ablation experiments were conducted using the best-performing backbone (EfficientNet-B5, 5-fold stratified cross-validation). Results are summarised in Table 3.

Regarding τ selection, the results reveal a consistent pattern centred on the necessity of τ = 5. The two configurations that exclude τ = 5 produced the largest accuracy degradations: S1 (τ = {10} only, −0.90 pp, p = 0.025, d = 1.562) and S5 (τ = {10, 20, 40}, −1.30 pp, p = 0.068, d = 1.112). S1 reached statistical significance, while S5 showed a trend-level effect (p = 0.068) that did not reach α = 0.05; the limited power of 5-fold CV (n = 5 paired observations) is consistent with d ≈ 1.1 being detectable as a trend rather than a confirmed effect. Both configurations eliminate the early-ejection temporal scale associated with the I–J interval (~68–75 ms; see Section 4.2). In contrast, all configurations retaining τ = 5, namely S2 (τ = {5, 10}), S4 (τ = {5, 10, 30}), and S6 (τ = {1, 5, 10}), showed no statistically significant difference from the proposed set (all p > 0.20). Although S6 yielded a marginally higher point estimate (+0.47 pp), this difference was not significant (p = 0.205), and τ = 1 (10 ms) falls below the cardiac refractory period (~200 ms), providing no physiological basis for inclusion. Furthermore, the proposed set demonstrates lower fold-to-fold variance (±0.37 pp) compared to the two-scale alternative S2 (±0.44 pp), indicating that τ = 20 contributes representational stability. Together, these results confirm that τ = 5 is the necessary and sufficient component for the multi-scale advantage, and that the three-scale combination τ = {5, 10, 20} represents the minimum physiologically complete and empirically stable design.

To quantify the independent and synergistic contributions of the three proposed components, a second ablation was conducted using five configurations (EfficientNet-B5, 5-fold stratified CV). Results are summarised in Table 4.

4. Discussion

4.1. Why Learnable Parameters Matter

The results demonstrate that making the RP transformation learnable yields consistent accuracy gains of 5.02–17.26 pp across all 24 baseline-backbone comparisons (Table 2). For every backbone tested, the Learnable RP outperformed all fixed-parameter alternatives, with gains ranging from approximately 5 pp (versus Modified RP) to over 17 pp (versus Classical RP).

The dominant contributor is multi-scale phase-space reconstruction combined with spatial attention fusion (+1.03 pp when comparing C0 and C2): BCG signals encode cardiac and respiratory rhythms at different temporal scales, and a single fixed delay inevitably prioritizes one rhythm over the other. The learnable threshold provides negligible gain in a single-scale setting (−0.06 pp when comparing C0 and C1), but yields a meaningful synergistic contribution within a multi-scale context (+0.76 pp when comparing C2 and C4), accompanied by a reduction in fold-to-fold standard deviation from ±0.85 pp to ±0.37 pp. This reduction in variability is particularly relevant given the small cohort size (50 subjects), where prediction stability across folds is a practically important property alongside mean accuracy. These results suggest that the three components interact synergistically, and that the 5.02–7.24 pp gap over Modified RP (Table 2) reflects the combined effect of adaptive thresholding and multi-scale attention operating on a richer structural basis.
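The contrasts quoted above follow directly from the mean accuracies in Table 4. As a minimal check, the snippet below recomputes them; only the accuracy values are taken from the table, while the dictionary layout and helper name are illustrative.

```python
# Mean accuracies (%) copied from Table 4.
ACC = {"C0": 77.55, "C1": 77.49, "C2": 78.58, "C3": 78.81, "C4": 79.34}


def delta_pp(variant: str, reference: str) -> float:
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(ACC[variant] - ACC[reference], 2)


contrasts = {
    "learnable threshold, single-scale (C1 - C0)": delta_pp("C1", "C0"),
    "multi-scale + attention, fixed threshold (C2 - C0)": delta_pp("C2", "C0"),
    "learnable threshold, multi-scale + attention (C4 - C2)": delta_pp("C4", "C2"),
    "attention, learnable multi-scale (C4 - C3)": delta_pp("C4", "C3"),
    "all three components (C4 - C0)": delta_pp("C4", "C0"),
}

for name, value in contrasts.items():
    print(f"{name}: {value:+.2f} pp")
```

The total gain of C4 over C0 is larger than the sum of the two contrasts measured from C0 (C1 − C0 and C2 − C0), which is the non-additivity that the text describes as synergy.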

These performance patterns are consistent with sleep-stage-specific differences in BCG recurrence structure. During NREM sleep, parasympathetic dominance produces regular beat-to-beat intervals, manifesting as long continuous diagonal lines in the recurrence image [41]. During REM sleep, intermittent sympathetic activation disrupts this regularity, yielding fragmented diagonal structures. During wakefulness, gross body movements eliminate quasi-periodicity, producing matrices dominated by isolated recurrent points. These stage-dependent textural characteristics are consistent with the confusion matrix pattern observed: the elevated NREM-to-REM confusion rate (28.25%) reflects genuine autonomic overlap between N2 and light REM sleep, whereas Wake-to-NREM and NREM-to-Wake misclassifications were comparatively rare (5.64% and 4.94%, respectively), in accordance with the pronounced physiological contrast between wakefulness and non-REM sleep.

4.2. Physiological Grounding of the Multi-Scale Design

The choice of τ ∈ {5, 10, 20} was motivated by the temporal structure of the BCG ejection complex. Kim et al. (2016) reported the I–J interval (peak ascending aortic pressure gradient, early ejection phase) at approximately 68–75 ms, the J–K interval (descending aortic pressure gradient relaxation) at approximately 88–91 ms, and the full I–K interval (complete primary ejection complex) at approximately 158–163 ms [42]. Given m = 3, τ = 5 (50 ms) yields a state-vector span of 100 ms, sensitive to the early ejection phase corresponding to the I–J interval; τ = 10 (100 ms) yields a span of 200 ms, covering the complete I–K ejection complex; and τ = 20 (200 ms) yields a span of 400 ms, capturing post-ejection oscillations and beat-to-beat interval modulations attributable to respiratory sinus arrhythmia (RSA; 0.15–0.4 Hz) [32,33]. RSA magnitude decreases systematically from NREM to REM and is suppressed during wakefulness, providing a physiologically specific basis for the 400 ms temporal scale. Each delay thus provides a structurally distinct phase-space projection of the cardiac cycle, without claiming to directly represent full cardiac or respiratory cycle periods.
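The state-vector spans quoted above follow from span = (m − 1)·τ samples. The sketch below reproduces that arithmetic and shows a plain time-delay embedding. A sampling rate of 100 Hz is inferred from the text equating τ = 5 with 50 ms; the forward-delay indexing convention is an assumption, since Equation (1) is not reproduced in this section.

```python
from typing import List, Sequence

FS_HZ = 100       # inferred: tau = 5 samples is described as 50 ms
M = 3             # embedding dimension stated in the text
TAUS = (5, 10, 20)


def span_ms(tau: int, m: int = M, fs_hz: int = FS_HZ) -> float:
    """Time covered by one state vector: (m - 1) * tau samples, in milliseconds."""
    return (m - 1) * tau * 1000.0 / fs_hz


def delay_embed(x: Sequence[float], tau: int, m: int = M) -> List[List[float]]:
    """State vectors [x[i], x[i + tau], ..., x[i + (m - 1) * tau]] (assumed convention)."""
    n_vectors = len(x) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("signal too short for this (m, tau)")
    return [[x[i + j * tau] for j in range(m)] for i in range(n_vectors)]


spans = {tau: span_ms(tau) for tau in TAUS}
example = delay_embed(list(range(50)), tau=5)

for tau, span in spans.items():
    print(f"tau = {tau:2d} samples -> span = {span:.0f} ms")
```

Each additional scale doubles the span, so the three embeddings sample the ejection complex, the full I–K interval, and slower post-ejection dynamics respectively, as described above.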

Previous BCG-based sleep staging studies have relied on explicit extraction of HRV and RRV features to capture these dynamics [28,29,30,34]. Rather than requiring explicit signal decomposition or peak detection, the Learnable RP encodes temporal structure at timescales consistent with cardiac and respiratory dynamics through gradient-based optimization. Direct validation through feature-level correspondence analysis with extracted HRV or RRV features remains a direction for future work.
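To make the idea of gradient-based optimization of a recurrence threshold concrete, the following sketch replaces the hard threshold of a classical RP with a sigmoid, so that each entry has a non-zero derivative with respect to the threshold ε. The sigmoid form, the temperature parameter, and all names are assumptions for illustration; the paper's own Equation (2) is not reproduced in this section and may differ.

```python
import math
from typing import List, Sequence


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_recurrence(states: Sequence[Sequence[float]], eps: float,
                    temperature: float = 0.1) -> List[List[float]]:
    """R[i][j] = sigmoid((eps - ||s_i - s_j||) / temperature) (assumed form)."""
    return [
        [_sigmoid((eps - math.dist(si, sj)) / temperature) for sj in states]
        for si in states
    ]


def d_recurrence_d_eps(r_value: float, temperature: float = 0.1) -> float:
    """Derivative of one soft-recurrence entry with respect to eps."""
    return r_value * (1.0 - r_value) / temperature


# Toy periodic signal (period 25 samples), embedded with m = 3 and tau = 5.
signal = [math.sin(2 * math.pi * i / 25) for i in range(60)]
tau, m = 5, 3
states = [[signal[i + j * tau] for j in range(m)]
          for i in range(len(signal) - (m - 1) * tau)]

R = soft_recurrence(states, eps=0.3)
print(len(R), round(R[0][0], 3), round(d_recurrence_d_eps(R[0][3]), 3))
```

With a hard threshold the derivative with respect to ε is zero almost everywhere, which is why some smooth relaxation of this kind is needed before a threshold can be trained end-to-end.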

4.3. The Role of Backbone Architecture

Although the Learnable RP was the top-performing transformation method for all backbones, absolute accuracy ranged from 69.58% (MobileNetV3-Large) to 78.91% (EfficientNet-B5), a span of 9.33 pp that underscores backbone selection as a consequential design decision. The compound scaling of EfficientNet-B5 yields approximately 30 million parameters, providing greater capacity for building hierarchical features from textured recurrence images. Lighter architectures such as MobileNetV3, while appealing for edge deployment, appear to lack sufficient depth to fully exploit the fine spatial patterns that differentiate sleep stages.

ConvNeXt-Tiny failed to converge under any transformation method and was consequently excluded. To investigate this failure, supplementary trials were conducted using learning rates from 1 × 10⁻³ to 1 × 10⁻⁵ (with and without linear warmup and reduced weight decay of 1 × 10⁻⁵) for all five transformation methods. Convergence failure persisted across all configurations and all transformation methods, indicating that the issue is not specific to the Learnable RP but reflects a general architectural incompatibility. Gradient norm monitoring revealed near-zero gradients in the early depthwise convolutional stages, consistent with the known sensitivity of LayerScale initialization—which scales residual branch outputs by near-zero learnable scalars—to inputs whose pixel statistics deviate substantially from ImageNet distributions. Recurrence plots, bounded to [0, 1] with strong spatial symmetry and diagonal structure, satisfy this condition. This finding is consistent with prior reports that time-series image representations induce architecture-dependent performance [38]. Domain-adaptive initialization strategies for LayerScale-based architectures are noted as a direction for future work.
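A toy scalar example may help show why near-zero LayerScale scalars can attenuate gradients reaching a residual branch. It assumes a block of the form y = x + γ·f(x) with a linear branch f(x) = w·x and an initial γ of 1e-6; these are simplifying assumptions for illustration and do not reproduce the ConvNeXt block or the gradient measurements reported above.

```python
def block_output(x: float, w: float, gamma: float) -> float:
    """Residual block with a scaled branch: y = x + gamma * (w * x)."""
    return x + gamma * (w * x)


def branch_weight_grad(x: float, upstream_grad: float, gamma: float) -> float:
    """dL/dw for y = x + gamma * (w * x), given dL/dy = upstream_grad."""
    return upstream_grad * gamma * x


# Hypothetical input value in [0, 1], as for a recurrence-plot pixel.
x_in = 0.5

grad_small = branch_weight_grad(x=x_in, upstream_grad=1.0, gamma=1e-6)
grad_unit = branch_weight_grad(x=x_in, upstream_grad=1.0, gamma=1.0)

print(f"gamma = 1e-6: dL/dw = {grad_small:.1e}")
print(f"gamma = 1.0 : dL/dw = {grad_unit:.1e}")
print(f"block output with gamma = 1e-6: {block_output(x_in, w=2.0, gamma=1e-6):.7f}")
```

In this simplified setting the branch gradient scales linearly with γ, so the branch learns slowly until γ itself grows. Whether this mechanism alone accounts for the observed non-convergence is not established by the sketch.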

To quantify the computational overhead of the Learnable RP module, per-sample inference latency was measured on an NVIDIA RTX A6000 GPU with 48 GB GDDR6 (NVIDIA Corporation, Santa Clara, CA, USA), using batch size = 1 and n = 100 repeated measurements after 20 warmup iterations. The Learnable RP achieved 15.55 ± 0.27 ms, compared to 14.10 ± 0.51 ms for Modified RP, 15.44 ± 2.42 ms for Classical RP, 14.31 ± 0.40 ms for GAF, and 13.95 ± 0.19 ms for MTF. The RP transformation module itself accounts for 1.32 ms (8.5% of total inference time). The total parameter count is the same across all methods at the reported precision (28.35 M), as the Learnable RP module introduces fewer than 600 additional parameters relative to the EfficientNet-B5 backbone. The incremental latency of 1.45 ms over Modified RP is negligible relative to the 30 s epoch granularity of the target application under server-grade GPU conditions. We acknowledge that latency measured on this hardware does not directly characterize embedded deployment scenarios. However, because the module's computational footprint is small in both latency and parameter count, its contribution to edge-device latency would be expected to scale proportionally with backbone inference time. Optimization strategies such as parameter freezing after training, pre-computation of recurrence representations, and knowledge distillation (see Section 4.5) remain the primary path toward real-time embedded deployment.
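The measurement protocol described above (untimed warmup iterations followed by repeated timed runs, reported as mean ± SD) can be sketched as follows. The workload is a hypothetical stand-in for a single-sample forward pass, and the harness is an illustration rather than the authors' benchmarking code.

```python
import statistics
import time
from typing import Callable, Tuple


def measure_latency_ms(fn: Callable[[], object], n_warmup: int = 20,
                       n_runs: int = 100) -> Tuple[float, float]:
    """Mean and sample SD of wall-clock latency in milliseconds."""
    for _ in range(n_warmup):
        fn()  # warmup iterations are executed but not timed
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


def dummy_inference() -> int:
    """Hypothetical stand-in for one forward pass at batch size 1."""
    return sum(i * i for i in range(10_000))


mean_ms, sd_ms = measure_latency_ms(dummy_inference)
print(f"latency: {mean_ms:.3f} +/- {sd_ms:.3f} ms")

# Arithmetic on the figures reported in the text.
overhead_ms = round(15.55 - 14.10, 2)         # Learnable RP minus Modified RP
module_share = round(100 * 1.32 / 15.55, 1)   # RP module as % of total latency
print(overhead_ms, module_share)
```

On a GPU, kernel launches are asynchronous, so the device generally needs to be synchronized before each clock read for wall-clock timings to be meaningful; this section does not state how that was handled.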

4.4. Comparison with Existing BCG-Based Methods

Direct comparison with published BCG sleep staging studies is not appropriate owing to fundamental differences across multiple confounding dimensions listed below. The following figures are provided for contextual orientation only and should not be interpreted as evidence of competitive equivalence. Wang et al. (2025) employed a 5-class scheme on 50 subjects with PSG reference labels and 167 manually engineered HRV/RRV features [28]; Li et al. (2024) achieved 5-class staging on over 9600 subjects with PSG reference labels via cross-modality transfer learning [30]; and Mitsukura et al. (2020) used a 5-class scheme on 22 subjects with PSG reference labels and HRV-based logistic regression [29]. The present study, by contrast, uses a 3-class scheme on 50 subjects with Dreem surrogate labels and no feature engineering. A 3-class problem is structurally easier than a 5-class problem because the merged NREM class subsumes N1/N2/N3 distinctions; any numerical comparison between 3-class and 5-class accuracies is therefore uninformative.
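The claim that a 3-class problem is structurally easier can be illustrated by collapsing 5-class labels into the 3-class scheme: every disagreement confined to N1/N2/N3 disappears, so agreement can only stay the same or increase. The label sequences below are hypothetical and the stage codes are illustrative.

```python
from typing import Dict, List

# Mapping from a 5-class scheme to the 3-class scheme used in this study.
FIVE_TO_THREE: Dict[str, str] = {
    "W": "Wake", "N1": "NREM", "N2": "NREM", "N3": "NREM", "REM": "REM",
}


def accuracy(pred: List[str], ref: List[str]) -> float:
    """Fraction of epochs on which prediction and reference agree."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def collapse(labels: List[str]) -> List[str]:
    """Merge N1, N2, and N3 into a single NREM class."""
    return [FIVE_TO_THREE[label] for label in labels]


# Hypothetical epoch labels, NOT data from any of the cited studies.
reference = ["W", "N1", "N2", "N2", "N3", "REM", "REM", "N1", "W", "N3"]
predicted = ["W", "N2", "N2", "N3", "N3", "REM", "N1", "N1", "N1", "N2"]

acc5 = accuracy(predicted, reference)
acc3 = accuracy(collapse(predicted), collapse(reference))
print(f"5-class accuracy: {acc5:.2f}, 3-class accuracy: {acc3:.2f}")
```

Because identical predictions score differently under the two schemes, a 3-class accuracy cannot be placed on the same scale as a 5-class accuracy.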

A notable practical characteristic of the proposed framework is its conceptual simplicity and end-to-end trainability. The pipeline requires no peak detection, no beat-to-beat interval computation, and no handcrafted feature selection: the raw BCG epoch serves as the sole input and a sleep-stage label as the output. For deployment scenarios such as home sleep monitors and consumer wellness devices, where signal-processing expertise may be limited, this streamlined architecture may reduce the barrier to deployment—though its practical utility requires validation in real-world home environments.

4.5. Limitations and Future Work

Several limitations should be acknowledged. First, the present study should be interpreted as a proof-of-concept demonstration rather than a large-scale validation. The cohort of 50 subjects is consistent with prior BCG sleep staging studies—Mitsukura et al. (2020) used 22 subjects [29] and Wang et al. (2025) used 50 subjects [28]—but is insufficient to establish generalizability across diverse populations, sensor configurations, and clinical settings. The fold-to-fold standard deviations in Table 1 (e.g., ±3.19% for the Learnable RP) reflect this variability, and the cross-validated mean accuracy of 78.91% (EfficientNet-B5) is the representative performance estimate; the best-case single-fold result (80.24%, Fold 8) should not be interpreted as typical performance. Multi-site replication with larger cohorts remains the primary direction for future work.

Second, and most critically, reference sleep-stage labels were obtained using the Dreem Headband rather than manually scored polysomnography (PSG). This constitutes a primary limitation of the present work. The Dreem system has been validated against PSG in a 5-class scoring scheme, achieving an overall accuracy of 83.5 ± 6.4% and an F1 score of 83.8 ± 6.3%, compared to 86.4 ± 8.0% and 86.3 ± 7.4% for the average of five expert scorers [22]. The 3-class consolidation used in the present study (merging N1, N2, and N3 into NREM) would be expected to yield higher Dreem–PSG agreement than these 5-class figures, as the most ambiguous inter-class boundaries (N1/N2/N3) are subsumed. Nevertheless, the Dreem system does not replicate gold-standard PSG annotations, and its agreement with expert PSG scoring is lowest at N1/N2-to-REM transitional epochs. Mechanistically, because the per-scale thresholds ε_k and spatial attention weights are optimized against training labels, systematic mislabeling of these transitional epochs directly biases the learned parameters toward recurrence structures associated with ambiguous stage boundaries rather than physiologically distinct states. Such label noise is a plausible contributor, alongside the autonomic overlap discussed in Section 4.1, to the elevated NREM-to-REM confusion rate of 28.25% observed in the confusion matrix. All reported accuracy figures reflect agreement with Dreem-derived labels and should not be interpreted as PSG-level clinical staging performance. Furthermore, because the input modality is BCG, whereas conventional sleep-stage definitions are grounded in EEG/EOG/EMG signals, prospective validation against manually scored PSG is necessary before clinical relevance can be established.

Third, the current three-class scheme merges all non-REM stages; extending the framework to four- or five-class classification (e.g., separating N1/N2 from N3) would enhance clinical relevance.

Fourth, the Learnable RP generates recurrence representations during both training and inference, introducing additional computational overhead. Therefore, further optimization strategies, such as parameter freezing, pre-computation of recurrence representations, or knowledge distillation, should be explored to facilitate real-time embedded deployment.

Fifth, the embedding time delays τ ∈ {5, 10, 20} are fixed across all subjects, constraining the phase-space geometry prior to any learnable transformation. Although the learned per-scale thresholds, scale weights, and spatial attention weights provide inter-subject adaptation within this fixed geometry, inter-subject differences in BCG morphology arising from body habitus, posture, and sensor-mattress coupling may not be fully captured by threshold and weight adaptation alone. Future work includes extending the time-delay representation to a continuous, learnable τ space, integrating temporal sequence models (e.g., LSTM or Transformer) to capture inter-epoch dependencies, leveraging self-supervised learning on large-scale unlabeled BCG data, and conducting prospective validation in real-world home environments.

5. Conclusions

This study presented a learnable recurrence plot (Learnable RP) framework for three-stage sleep classification (Wake, NREM, REM) from single-channel BCG signals. Its three core innovations, namely differentiable per-scale thresholds, multi-scale phase-space reconstruction at physiologically motivated delays (τ = 5, 10, 20), and attention-based spatial fusion, work in concert to produce image representations that adapt to the characteristics of the input signal during end-to-end training.

The framework was evaluated through 10-fold stratified cross-validation across six backbone architectures using 50 overnight BCG recordings. The Learnable RP consistently outperformed four baseline transformation methods (GAF, MTF, Classical RP, Modified RP), attaining a mean accuracy of 73.60%, a Cohen’s κ of 0.520, and an ROC-AUC of 0.835, with the highest backbone-level mean accuracy reaching 78.91% on EfficientNet-B5. All 24 pairwise statistical comparisons were significant at p < 0.001, accompanied by uniformly large effect sizes (Cohen’s d = 1.82–14.64). The best single-fold model achieved 80.24% accuracy with per-class ROC-AUC values of 0.904 (Wake), 0.857 (REM), and 0.840 (NREM). These results demonstrate that replacing fixed transformation parameters with learned ones yields statistically significant and consistent accuracy gains of 5.02–17.26 pp across all 24 pairwise comparisons without any explicit physiological feature engineering, offering a viable path toward simpler and fully end-to-end unobtrusive sleep monitoring systems.

Author Contributions

Conceptualization, J.J. and S.Y.; methodology, J.J.; software, J.J.; validation, J.J. and S.Y.; formal analysis, J.J.; investigation, J.J.; resources, S.Y.; data curation, J.J.; writing—original draft preparation, J.J.; writing—review and editing, S.Y.; visualization, J.J.; supervision, S.Y.; project administration, S.Y.; funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Technology Innovation Program (RS-2024-00507555, Demonstration of development of domestic SoC-based on-device AI hyper-personalized home appliance system) funded by the Ministry of Trade, Industry & Energy (MOTIE, Republic of Korea).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Public Institutional Bioethics Committee (PIBC), Republic of Korea (protocol codes P01-202508-01-023, approved on 7 August 2025; P01-202506-01-005, approved on 23 December 2025; and P01-202308-01-041, approved on 17 September 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are not publicly available due to privacy and ethical restrictions related to human participant data.

Conflicts of Interest

Author Sunyong Yoo was employed by the company MATILO AI Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AASM	American Academy of Sleep Medicine
AUC	Area Under the Curve
BCG	Ballistocardiography
BN	Batch Normalization
CNN	Convolutional Neural Network
EEG	Electroencephalography
EMG	Electromyography
EOG	Electrooculography
FSR	Force-Sensing Resistor
GAF	Gramian Angular Field
HRV	Heart Rate Variability
MTF	Markov Transition Field
NREM	Non-Rapid Eye Movement
PPG	Photoplethysmography
PSG	Polysomnography
PVDF	Polyvinylidene Fluoride
REM	Rapid Eye Movement
ROC	Receiver Operating Characteristic
RP	Recurrence Plot
RRV	Respiratory Rate Variability
UWB	Ultra-Wideband
ViT	Vision Transformer
WASO	Wake After Sleep Onset

References

1. Benjafield, A.V.; Kuniyoshi, F.H.S.; Malhotra, A.; Martin, J.L.; Morin, C.M.; Maurer, L.F.; Cistulli, P.A.; Pépin, J.-L.; Wickwire, E.M. Estimation of the Global Prevalence and Burden of Insomnia: A Systematic Literature Review-Based Analysis. Sleep Med. Rev. 2025, 82, 102121.
2. Windred, D.P.; Burns, A.C.; Lane, J.M.; Saxena, R.; Rutter, M.K.; Cain, S.W.; Phillips, A.J.K. Sleep Regularity Is a Stronger Predictor of Mortality Risk than Sleep Duration: A Prospective Cohort Study. Sleep 2024, 47, zsad253.
3. Zhang, W.; Sun, Q.; Chen, B.; Basta, M.; Xu, C.; Li, Y. Insomnia Symptoms Are Associated with Metabolic Syndrome in Patients with Severe Psychiatric Disorders. Sleep Med. 2021, 83, 168–174.
4. Ali, E.; Shaikh, A.; Yasmin, F.; Sughra, F.; Sheikh, A.; Owais, R.; Raheel, H.; Virk, H.U.H.; Mustapha, J.A. Incidence of Adverse Cardiovascular Events in Patients with Insomnia: A Systematic Review and Meta-Analysis of Real-World Data. PLoS ONE 2023, 18, e0291859.
5. Shrivastava, D.; Jung, S.; Saadat, M.; Sirohi, R.; Crewson, K. How to Interpret the Results of a Sleep Study. J. Community Hosp. Intern. Med. Perspect. 2014, 4, 24983.
6. Reed, D.L.; Sacco, W.P. Measuring Sleep Efficiency: What Should the Denominator Be? J. Clin. Sleep Med. 2016, 12, 263–266.
7. Zhang, F.; Zhong, R.; Li, S.; Fu, Z.; Wang, R.; Wang, T.; Huang, Z.; Le, W. Alteration in Sleep Architecture and Electroencephalogram as an Early Sign of Alzheimer’s Disease Preceding the Disease Pathology and Cognitive Decline. Alzheimer’s Dement. 2019, 15, 590–597.
8. Wong, R.; Lovier, M.A. Sleep Disturbances and Dementia Risk in Older Adults: Findings from 10 Years of National U.S. Prospective Data. Am. J. Prev. Med. 2023, 64, 781–787.
9. Bianchi, M.T.; Cash, S.S.; Mietus, J.; Peng, C.-K.; Thomas, R. Obstructive Sleep Apnea Alters Sleep Stage Transition Dynamics. PLoS ONE 2010, 5, e11356.
10. Pallayova, M.; Donic, V.; Gresova, S.; Peregrim, I.; Tomori, Z. Do Differences in Sleep Architecture Exist between Persons with Type 2 Diabetes and Nondiabetic Controls? J. Diabetes Sci. Technol. 2010, 4, 344–352.
11. Liu, H.; Zhu, H.; Lu, Q.; Ye, W.; Huang, T.; Li, Y.; Li, B.; Wu, Y.; Wang, P.; Chen, T. Sleep Features and the Risk of Type 2 Diabetes Mellitus: A Systematic Review and Meta-Analysis. Ann. Med. 2025, 57, 2447422.
12. Nambiema, A.; Lisan, Q.; Vaucher, J.; Perier, M.-C.; Boutouyrie, P.; Danchin, N.; Thomas, F.; Guibout, C.; Solelhac, G.; Heinzer, R. Healthy Sleep Score Changes and Incident Cardiovascular Disease in European Prospective Community-Based Cohorts. Eur. Heart J. 2023, 44, 4968–4978.
13. Mantua, J.; Grillakis, A.; Mahfouz, S.H.; Taylor, M.R.; Brager, A.J.; Yarnell, A.M.; Balkin, T.J.; Capaldi, V.F.; Simonelli, G. A Systematic Review and Meta-Analysis of Sleep Architecture and Chronic Traumatic Brain Injury. Sleep Med. Rev. 2018, 41, 61–77.
14. American Academy of Sleep Medicine. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications; Version 3.0; American Academy of Sleep Medicine: Darien, IL, USA, 2023.
15. Rundo, J.V.; Downey, R., III. Polysomnography. In Handbook of Clinical Neurology; Elsevier: Amsterdam, The Netherlands, 2019; Volume 160, pp. 381–392.
16. Markun, L.C.; Sampat, A. Clinician-Focused Overview and Developments in Polysomnography. Curr. Sleep Med. Rep. 2020, 6, 309–321.
17. Oeverland, B.; Akre, H.; Kvaerner, K.J.; Skatvedt, O. Patient Discomfort in Polysomnography with Esophageal Pressure Measurements. Eur. Arch. Oto-Rhino-Laryngol. 2005, 262, 241–245.
18. Lee, T.; Cho, Y.; Cha, K.S.; Jung, J.; Cho, J.; Kim, H.; Kim, D.; Hong, J.; Lee, D.; Keum, M. Accuracy of 11 Wearable, Nearable, and Airable Consumer Sleep Trackers: Prospective Multicenter Validation Study. JMIR mHealth uHealth 2023, 11, e50983.
19. Strumpf, Z.; Gu, W.; Tsai, C.-W.; Chen, P.-L.; Yeh, E.; Leung, L.; Cheung, C.; Wu, I.-C.; Strohl, K.P.; Tsai, T. Belun Ring (Belun Sleep System BLS-100): Deep Learning-Facilitated Wearable Enables Obstructive Sleep Apnea Detection, Apnea Severity Categorization, and Sleep Stage Classification in Patients Suspected of Obstructive Sleep Apnea. Sleep Health 2023, 9, 430–440.
20. Altini, M.; Kinnunen, H. The Promise of Sleep: A Multi-Sensor Approach for Accurate Sleep Stage Detection Using the Oura Ring. Sensors 2021, 21, 4302.
21. Song, T.-A.; Zhang, Y.; Zhou, Z.; Hou, L.; Malekzadeh, M.; Behzad, A.; Dutta, J. AI-Driven Sleep Staging Using Instantaneous Heart Rate and Accelerometry: Insights from an Apple Watch Study. IEEE Trans. Biomed. Eng. 2025; online ahead of print.
22. Arnal, P.J.; Thorey, V.; Debellemaniere, E.; Ballard, M.E.; Bou Hernandez, A.; Guillot, A.; Jourde, H.; Harris, M.; Guillard, M.; Van Beers, P. The Dreem Headband Compared to Polysomnography for Electroencephalographic Signal Acquisition and Sleep Staging. Sleep 2020, 43, zsaa097.
23. Yang, C.; Ku, G.; Jung, J.; Choi, J.; Kim, K. A Study of BCG Signal-Based Sleep Classification Technology through Ensemble Running Signal Processing and Piezoelectric Sensor Surface Material Change. J. Electr. Eng. Technol. 2023, 18, 3881–3886.
24. Kwon, H.B.; Choi, S.H.; Lee, D.; Son, D.; Yoon, H.; Lee, M.H.; Lee, Y.J.; Park, K.S. Attention-Based LSTM for Non-Contact Sleep Stage Classification Using IR-UWB Radar. IEEE J. Biomed. Health Inform. 2021, 25, 3844–3853.
25. Park, J.; Yang, S.; Chung, G.; Zanghettin, I.J.L.; Han, J. Ultra-Wideband Radar-Based Sleep Stage Classification in Smartphone Using an End-to-End Deep Learning. IEEE Access 2024, 12, 61252–61264.
26. Wang, Q.; Cheng, H.; Wang, W. Video-PSG: An Intelligent Contactless Monitoring System for Sleep Staging. IEEE Trans. Biomed. Eng. 2025; online ahead of print.
27. van Meulen, F.B.; Grassi, A.; van den Heuvel, L.; Overeem, S.; van Gilst, M.M.; van Dijk, J.P.; Maass, H.; van Gastel, M.J.H.; Fonseca, P. Contactless Camera-Based Sleep Staging: The HealthBed Study. Bioengineering 2023, 10, 109.
28. Wang, K.; Zhu, B.; Liu, B.; Liang, J.; Lv, T.; Wu, J. Automatic Sleep Staging Based on Single-Channel Ballistocardiogram Signals and Multiple Scales Temporal Feature Analysis. Eng. Appl. Artif. Intell. 2025, 156, 111249.
29. Mitsukura, Y.; Sumali, B.; Nagura, M.; Fukunaga, K.; Yasui, M. Sleep Stage Estimation from Bed Leg Ballistocardiogram Sensors. Sensors 2020, 20, 5688.
30. Li, S.; Chen, Y.; Chen, X.; Gao, R.; Zhang, Y.; Yu, C.; Li, Y.; Ye, Z.; Huang, W.; Yi, H. SleepNetZero: Zero-Burden Zero-Shot Reliable Sleep Staging with Neural Networks Based on Ballistocardiograms. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies; Association for Computing Machinery: New York, NY, USA, 2024; Volume 8, pp. 1–25.
31. Ahmed, N.; Srivyshnav, K.S.; Chokalingam, K.; Rawooth, M.; Kumar, G.; Parchani, G.; Saran, V. Classification of Sleep-Wake State in Ballistocardiogram System Based on Deep Learning. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; pp. 1944–1947.
32. Zschocke, J.; Bartsch, R.P.; Glos, M.; Penzel, T.; Mikolajczyk, R.; Kantelhardt, J.W. Long- and Short-Term Fluctuations Compared for Several Organ Systems across Sleep Stages. Front. Netw. Physiol. 2022, 2, 937130.
33. Ma, Y.J.X.; Zschocke, J.; Glos, M.; Kluge, M.; Penzel, T.; Kantelhardt, J.W.; Bartsch, R.P. Automatic Sleep-Stage Classification of Heart Rate and Actigraphy Data Using Deep and Transfer Learning Approaches. Comput. Biol. Med. 2023, 163, 107193.
34. Wu, L.; Ren, P.; Zhao, Y.; Lv, R.; Ding, Q.; Zuo, Y. Sleep Stage Classification Based on BCG Using Improved Deep Convolutional Generative Adversarial Networks. In Proceedings of the 2023 7th International Conference on Imaging, Signal Processing and Communications (ICISPC), Chongqing, China, 28–30 July 2023; pp. 65–69.
35. Wang, Z.; Oates, T. Imaging Time-Series to Improve Classification and Imputation. arXiv 2015, arXiv:1506.00327.
36. Altunkaya, D.; Okay, F.Y.; Ozdemir, S. Image Transformation for IoT Time-Series Data: A Review. arXiv 2023, arXiv:2311.12742.
37. Quan, S.; Sun, M.; Zeng, X.; Wang, X.; Zhu, Z. Time Series Classification Based on Multi-Dimensional Feature Fusion. IEEE Access 2023, 11, 11066–11077.
38. Hatami, N.; Gavet, Y.; Debayle, J. Classification of Time-Series Images Using Deep Convolutional Neural Networks. In Proceedings of the Tenth International Conference on Machine Vision (ICMV 2017), Vienna, Austria, 13–15 November 2017; SPIE: Bellingham, WA, USA, 2018; Volume 10696, pp. 242–249.
39. Padovani Ederli, R.; Vega-Oliveros, D.A.; Soriano-Vargas, A.; Rocha, A.; Dias, Z. Time-Series Visual Representations for Sleep Stages Classification. PLoS ONE 2025, 20, e0323689.
40. Eckmann, J.-P.; Oliffson Kamphorst, S.; Ruelle, D. Recurrence Plots of Dynamical Systems. Europhys. Lett. 1987, 4, 973–977.
41. Marwan, N.; Romano, M.C.; Thiel, M.; Kurths, J. Recurrence Plots for the Analysis of Complex Systems. Phys. Rep. 2007, 438, 237–329.
42. Kim, C.-S.; Ober, S.L.; McMurtry, M.S.; Finegan, B.A.; Inan, O.T.; Mukkamala, R.; Hahn, J.-O. Ballistocardiogram: Mechanism and Potential for Unobtrusive Cardiovascular Health Monitoring. Sci. Rep. 2016, 6, 31297.

Figure 1. Overview of the proposed Learnable Recurrence Plot (Learnable RP) framework for BCG-based sleep stage classification. Multi-scale time-delay embedding follows Equation (1); the learnable soft threshold is defined in Equation (2); attention-based scale fusion is computed as in Equation (3).

Figure 2. Aggregate performance comparison of five transformation methods across six backbone architectures (ConvNeXt-Tiny excluded). (a) Classification accuracy; (b) F1-macro score; (c) Cohen’s κ; (d) ROC-AUC. Error bars indicate ± 1 SD. The red bar indicates the proposed Learnable RP; gray bars indicate baseline methods. Bold numbers above bars indicate the mean value for each method.

Figure 3. Confusion matrix of the best-performing model (Learnable RP + EfficientNet-B5, Fold 8, accuracy = 80.24%). Values represent row-normalized proportions (per-class recall).

Figure 4. Multi-class ROC curves for the best-performing model (Learnable RP + EfficientNet-B5, Fold 8). Per-class AUC: Wake = 0.904, REM = 0.857, NREM = 0.840; micro-average AUC = 0.854.

Table 1. Aggregate performance of five transformation methods (mean ± SD, N = 60 per method). ConvNeXt-Tiny excluded.

Method	Accuracy (%)	F1-Macro (%)	F1-Weighted (%)	Cohen’s κ	ROC-AUC
Learnable RP	73.60 ± 3.19	68.87 ± 3.66	73.54 ± 3.09	0.5202	0.8353
Modified RP	68.25 ± 3.32	62.34 ± 3.57	68.26 ± 3.06	0.4278	0.7902
GAF	67.52 ± 3.48	60.76 ± 3.86	67.26 ± 3.24	0.4048	0.7722
MTF	61.40 ± 3.74	53.15 ± 3.70	61.09 ± 3.23	0.2984	0.7129
Classical RP	58.69 ± 2.95	49.28 ± 1.97	58.09 ± 2.07	0.2380	0.6733

Table 2. Selected results from paired t-tests (10-fold CV). All 24 comparisons significant at p < 0.001 after Bonferroni correction.

Baseline	Backbone	ΔAccuracy (%)	p-Value	Cohen’s d
Classical RP	EfficientNet-B5	+17.26	5.13 × 10⁻¹²	14.64
MTF	ResNet-50	+10.95	5.74 × 10⁻¹²	14.45
MTF	EfficientNet-B5	+11.80	1.04 × 10⁻¹⁰	10.46
Classical RP	DenseNet-121	+13.98	1.06 × 10⁻¹⁰	10.44
MTF	DenseNet-121	+11.12	1.22 × 10⁻¹⁰	10.27
Classical RP	ResNet-50	+13.11	1.05 × 10⁻⁹	8.07
MTF	Swin-Tiny	+13.83	3.01 × 10⁻⁹	7.16
Modified RP	ResNet-50	+5.04	1.45 × 10⁻⁸	6.00
GAF	DenseNet-121	+5.34	2.37 × 10⁻⁷	5.17
Modified RP	EfficientNet-B5	+5.02	2.79 × 10⁻⁷	5.07

Table 3. Ablation study results for time-delay selection (EfficientNet-B5, 5-fold stratified CV). All comparisons are against the proposed configuration S3; ΔAcc denotes the accuracy difference. Statistical significance was assessed using paired t-tests on per-fold accuracy values.

Variant	Acc (%)	ΔAcc (pp)	p-Value	Cohen’s d
S1: τ = {10}	78.44 ± 0.55	−0.90	0.025	1.562
S2: τ = {5, 10}	78.92 ± 0.44	−0.42	0.290	0.544
S3: τ = {5, 10, 20}	79.34 ± 0.37	—	—	—
S4: τ = {5, 10, 30}	79.13 ± 0.84	−0.21	0.596	0.257
S5: τ = {10, 20, 40}	78.04 ± 0.88	−1.30	0.068	1.112
S6: τ = {1, 5, 10}	79.81 ± 0.33	+0.47	0.205	−0.677

Table 4. Component ablation results (EfficientNet-B5, 5-fold stratified CV).

Variant	Acc (%)	F1-Macro (%)	Cohen’s κ	ROC-AUC
C0: Fixed-ε/Single-scale/No-attn	77.55 ± 0.53	73.51	0.5867	0.8441
C1: Learnable-ε/Single-scale/No-attn	77.49 ± 0.70	73.45	0.5866	0.8449
C2: Fixed-ε/Multi-scale/Attention	78.58 ± 0.85	74.54	0.6031	0.8493
C3: Learnable-ε/Multi-scale/No-attn	78.81 ± 0.84	75.13	0.6140	0.8595
C4 (Proposed) *	79.34 ± 0.37	75.55	0.6197	0.8545

* Proposed configuration (Learnable-ε/Multi-scale/Attention).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

