1. Introduction
Approximately 16% of the global adult population, corresponding to over 800 million individuals, are reported to suffer from sleep disorders, underscoring that sleep health has emerged as a critical public health concern at the global scale [1]. Degraded sleep quality exerts wide-ranging effects on physiological functions throughout the body and elevates the risk of premature mortality [2,3,4]. Accurate classification of sleep stages is essential for evaluating sleep quality, because representative metrics such as wake after sleep onset (WASO) and sleep efficiency can only be computed when sleep stages have been correctly identified [5,6]. Structural alterations in sleep architecture have been closely associated with a broad spectrum of clinical conditions, including neurodegenerative diseases, dementia risk, obstructive sleep apnea, type 2 diabetes, cardiovascular disease, and traumatic brain injury [7,8,9,10,11,12,13].
Currently, polysomnography (PSG) performed according to standardized scoring criteria published by the American Academy of Sleep Medicine (AASM, Darien, IL, USA) is recognized as the gold standard for evaluating sleep stages [14]. PSG simultaneously records multiple physiological signals, including electroencephalography (EEG), electromyography (EMG), electrooculography (EOG), respiratory effort, and peripheral oxygen saturation (SpO₂), enabling precise analysis of sleep structure [15]. However, PSG is typically performed for only a single overnight session in a hospital laboratory setting, making it difficult to capture the long-term patterns of habitual sleep that reflect changes in sleep habits, environmental conditions, and circadian rhythm. Moreover, the discomfort caused by the attachment of numerous sensors, the artificiality of the testing environment, high costs, and the need for expert scoring all limit the feasibility of repeated PSG use in the home setting [16,17]. For these reasons, home-based continuous monitoring technologies capable of observing natural sleep over extended periods during daily life are urgently needed.
To address these limitations, sleep stage estimation using wearable devices in the form of rings, watches, and headbands has been actively investigated [18,19,20,21,22]. Yet these devices rely on contact-based sensors such as photoplethysmography (PPG), EEG, and skin thermometers, which introduce discomfort during prolonged wear. Such constraints have driven ongoing research into unobtrusive sleep monitoring approaches based on ballistocardiography (BCG), ultra-wideband (UWB) radar, cameras, and similar technologies [23,24,25,26,27]. BCG signals, in particular, can be unobtrusively acquired by embedding piezoelectric sensors, air pressure sensors, or polyvinylidene fluoride (PVDF) films into sleep furniture such as mattresses and pillows [23,28,29,30]. Because no cumbersome sensors are attached to the body, subjects can be monitored continuously over long periods in their natural sleeping environment.
Sustained efforts have been devoted to realizing automatic sleep stage classification from BCG signals. One classical strategy is to directly feed the raw one-dimensional (1D) BCG waveform into a learning model. Ahmed et al. employed a 1D convolutional neural network (CNN) for binary sleep/wake classification from BCG signals and reported favorable performance on this relatively simple two-class task [31]. Yang et al. trained a CNN–LSTM ensemble on 1D time-series data to perform sleep/wake discrimination [23]. A clear limitation of these studies, however, is that they are confined to binary classification and therefore fail to capture the full sleep stage structure. Additionally, learning directly from raw time-series input without dedicated feature extraction tends to be inefficient.
Sleep stages are known to exhibit strong correlations with autonomic nervous system–derived indices such as heart rate variability (HRV), R–R interval variability, and breath-to-breath interval variability [32,33]. Extracting features through these indicators can therefore serve as an important strategy for improving the accuracy of sleep stage classification models. Several studies have computed secondary physiological indices, including HRV, respiratory rate variability (RRV), and respiratory patterns, from BCG signals to enhance classification performance [28,29,30,34]. These methods, however, require complex signal processing pipelines for heart rate and respiration rate extraction. Furthermore, the conversion process may discard the rich fine-grained vibration information originally embedded in the raw BCG signal.
As an alternative that minimizes information loss without elaborate preprocessing, techniques that transform 1D time series into two-dimensional (2D) images, such as Gramian Angular Fields (GAF), Markov Transition Fields (MTF), and Recurrence Plots (RP), have gained widespread adoption [35,36,37,38,39]. These approaches tend to yield higher accuracy than direct 1D BCG classification while being more efficient than methods based on derived physiological features. Padovani et al. applied GAF, MTF, and RP transformations to smartwatch data and demonstrated improved 2D CNN-based sleep stage classification accuracy [39]. Beyond replacing complex physiological-feature-based preprocessing while maintaining strong classification performance, the 2D image input format is readily compatible with a wide range of CNN and Vision Transformer (ViT) backbone architectures, allowing flexible adoption of state-of-the-art models. Nonetheless, existing 2D transformation techniques operate with fixed parameters throughout the conversion process, which limits their ability to accommodate the subject-specific signal characteristics of BCG, characteristics that vary considerably with changes in sleeping posture, body weight, and body composition.
To address this gap, this study proposes an end-to-end Learnable Recurrence Plot (Learnable RP) framework for three-stage sleep classification (Wake, NREM (Non-Rapid Eye Movement), REM (Rapid Eye Movement)) from single-channel BCG signals. Unlike existing fixed-parameter transformations, the proposed method co-optimizes the signal-to-image transformation and the downstream classifier, enabling the representation to adapt to inter-subject BCG variability without any handcrafted feature engineering. The main contributions of this work are as follows:
- (1) We propose a differentiable recurrence plot module with per-scale learnable thresholds, trained end-to-end with a standard image classification backbone, eliminating the need for manual threshold selection that has limited conventional RP-based approaches.
- (2) We introduce a multi-scale phase-space reconstruction strategy using physiologically motivated time delays (τ = 5, 10, 20), combined with a convolutional attention mechanism that spatially fuses complementary recurrence maps from each scale, capturing BCG dynamics at multiple temporal resolutions simultaneously.
- (3) We conduct a systematic evaluation involving 50 overnight BCG recordings, 10-fold stratified cross-validation, and six backbone architectures, demonstrating consistent and statistically significant superiority over four baseline transformation methods (GAF, MTF, Classical RP, Modified RP) across all 24 pairwise comparisons (all p < 0.001, Cohen’s d = 1.82–14.64).
2. Materials and Methods
2.1. Dataset
BCG signals were collected from 50 healthy adults (age range: 20–80 years), each recorded for one overnight session, yielding a total of 50 recording nights. Signals were acquired at 100 Hz using piezoelectric sensors embedded in a mattress-type sleep monitoring device. The hardware acquisition circuit incorporates an analogue bandpass filter (0.1–30 Hz), which attenuates low-frequency baseline drift and high-frequency noise outside the physiologically relevant BCG band prior to digitisation. The device housed five BCG sensor channels (bcg_1 through bcg_5), one PPG channel, and one force-sensing resistor (FSR) channel; only the primary BCG channel (bcg_1) was used for sleep stage classification in this study.
Reference sleep stage labels were obtained using the Dreem Headband (Rythm, Paris, France), a commercially available EEG-based wearable device that has been validated against expert-scored PSG [
22]. The Dreem Headband estimates sleep stages from frontal EEG signals in accordance with AASM guidelines. Each recording was segmented into 30 s epochs following AASM conventions. The original five-class labels (Wake, N1, N2, N3, REM) were consolidated into three classes: Wake (class 0), NREM (class 1, merging N1, N2, and N3), and REM (class 2).
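The label consolidation described above amounts to a simple lookup. The following is a minimal sketch; the string stage names and the function name `consolidate` are illustrative assumptions, since the text specifies only the class indices.

```python
# Map five-class AASM stage names to the three classes used in this study.
# The string keys are hypothetical; only the class indices come from the text.
STAGE_TO_CLASS = {
    "Wake": 0,
    "N1": 1, "N2": 1, "N3": 1,  # all NREM sub-stages merge into class 1
    "REM": 2,
}

def consolidate(labels):
    """Convert a per-epoch list of five-class stage names to three-class indices."""
    return [STAGE_TO_CLASS[stage] for stage in labels]

# Toy hypnogram, one label per 30 s epoch (not real data).
hypnogram = ["Wake", "N1", "N2", "N3", "N2", "REM", "Wake"]
print(consolidate(hypnogram))  # [0, 1, 1, 1, 1, 2, 0]
```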
The study protocol was approved by the Institutional Review Board under three separate approvals: P01-202508-01-023 (24 participants), P01-202506-01-005 (13 participants), and P01-202308-01-041 (13 participants). All participants provided written informed consent prior to data collection.
2.2. Overview of the Proposed Framework
The proposed framework processes raw single-channel BCG epochs through five successive stages, as illustrated in Figure 1. First, each epoch is independently z-score normalized. Second, the normalized signal is embedded into phase space at three time delays (τ ∈ {5, 10, 20}) in parallel. Third, pairwise Euclidean distances are computed for each embedding and min–max normalized to [0, 1]. Fourth, each distance matrix is converted into a soft recurrence map via a learnable threshold ε_k = σ(α_k) × 0.5, and scaled by a learned importance weight w_k = σ(β_k). Fifth, the three weighted maps are fused through a convolutional spatial attention block and a fusion head into a single 224 × 224 image, which is subsequently classified into three sleep stages (Wake, NREM, REM) by a pretrained CNN backbone. The detailed formulation of each stage is provided in Section 2.4.2.
What distinguishes the Learnable RP from conventional transformations is that all internal parameters (threshold logits {α_k}, scale weights {β_k}, and convolutional weights in the attention-fusion block) are jointly optimized end-to-end with the backbone via backpropagation. This design is particularly relevant for BCG, where inter-subject amplitude differences, postural artifacts, and sensor-placement variability render any single fixed parameterization suboptimal.
2.3. Signal Preprocessing
BCG signals were sampled at 100 Hz. Each recording was segmented into non-overlapping 30 s epochs, yielding a signal vector x ∈ ℝ^L with L = 3000 samples per epoch. No additional software-level bandpass filtering was applied, as the acquisition hardware already incorporates an analogue bandpass filter (0.1–30 Hz) that suppresses low-frequency baseline drift and high-frequency noise prior to digitisation. Each epoch was independently z-score normalized to zero mean and unit variance prior to transformation, standardising inter-subject amplitude scaling differences arising from differences in body mass, sleeping posture, and sensor-mattress coupling without discarding any spectral content. For the Learnable RP, z-score normalization is performed within the transformation module itself, ensuring that the normalization step is part of the differentiable computational graph.
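The segmentation and per-epoch normalization can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the small stabilizing constant `eps`, the choice to drop a trailing partial epoch, and the synthetic sinusoid standing in for a BCG trace are all assumptions.

```python
import math

FS = 100            # sampling rate in Hz (from the text)
EPOCH_SEC = 30      # epoch length in seconds (from the text)
L = FS * EPOCH_SEC  # 3000 samples per epoch

def segment_epochs(signal, epoch_len=L):
    """Split a recording into non-overlapping epochs, dropping any trailing remainder."""
    n_epochs = len(signal) // epoch_len
    return [signal[i * epoch_len:(i + 1) * epoch_len] for i in range(n_epochs)]

def zscore(epoch, eps=1e-8):
    """Normalize one epoch to zero mean and unit variance (population std)."""
    mean = sum(epoch) / len(epoch)
    var = sum((v - mean) ** 2 for v in epoch) / len(epoch)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in epoch]  # eps avoids division by zero

# Synthetic 65 s recording: an offset 1.2 Hz sinusoid (not real BCG data).
recording = [5.0 + 2.0 * math.sin(2 * math.pi * 1.2 * n / FS) for n in range(65 * FS)]
epochs = [zscore(e) for e in segment_epochs(recording)]
print(len(epochs), len(epochs[0]))  # 2 3000
```

Because the normalization is applied per epoch, the constant offset and amplitude of the synthetic trace vanish, which mirrors the stated purpose of removing subject-dependent amplitude scaling.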
2.4. 1D-to-2D Transformation Methods
Five transformation strategies were compared. Four of them, namely GAF, MTF, Classical RP, and Modified RP, use fixed, predetermined parameters and serve as baselines. The fifth, the proposed Learnable RP, incorporates trainable parameters that are co-optimized with the classifier.
2.4.1. Baseline Methods: GAF, MTF, Classical RP, and Modified RP
Gramian Angular Field (GAF). GAF encodes temporal correlations by mapping normalized signal values to the angular domain and computing a cosine-based pairwise matrix [
35,
36]. The diagonal captures instantaneous values, while off-diagonal entries encode pairwise temporal dependencies.
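A minimal sketch of a GAF computation is given below. It assumes the summation variant (GASF), in which each entry is cos(φ_i + φ_j); the text does not state whether the summation or difference form was used, and the function name `gasf` is illustrative.

```python
import math

def gasf(x):
    """Gramian Angular Summation Field: G[i][j] = cos(phi_i + phi_j)."""
    lo, hi = min(x), max(x)
    # Rescale to [-1, 1] so that arccos is defined; a constant series maps to 0.
    scaled = [0.0 if hi == lo else 2 * (v - lo) / (hi - lo) - 1 for v in x]
    # Clamp against floating-point overshoot before taking arccos.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

series = [0.0, 1.0, 2.0, 1.0]  # toy input, not real data
G = gasf(series)
# Diagonal entries equal cos(2 * phi_i) = 2 * x_i**2 - 1 for the rescaled value x_i,
# so they depend only on the instantaneous sample.
print([round(G[i][i], 6) for i in range(len(series))])  # [1.0, -1.0, 1.0, -1.0]
```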
Markov Transition Field (MTF). MTF takes a probabilistic approach: the signal is discretized into quantile bins, and a first-order Markov transition matrix is estimated [
35,
37]. Each element of the resulting image corresponds to the transition probability between the quantile bins of the respective time-point pair. Unlike GAF, MTF preserves the dynamic transition structure of the signal.
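The MTF construction can be sketched in the same spirit. The rank-based quantile binning, the number of bins, and the function name `mtf` are assumptions made for illustration; the text does not report the bin count used.

```python
def mtf(x, n_bins=4):
    """Markov Transition Field: M[i][j] = P(bin(x_i) -> bin(x_j))."""
    n = len(x)
    # Quantile binning via ranks: roughly equal numbers of samples per bin.
    order = sorted(range(n), key=lambda i: x[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = min(rank * n_bins // n, n_bins - 1)
    # First-order transition counts between consecutive samples.
    W = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins[:-1], bins[1:]):
        W[a][b] += 1.0
    # Row-normalize counts into probabilities (rows with no transitions stay zero).
    for row in W:
        total = sum(row)
        if total > 0:
            row[:] = [c / total for c in row]
    # Spread the transition probabilities over every pair of time points.
    return [[W[bins[i]][bins[j]] for j in range(n)] for i in range(n)]

# Toy series that alternates between low and high values (not real data).
toy = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
M = mtf(toy, n_bins=2)
print(M[0][1], M[0][2])  # 1.0 0.0
```

In the toy series every low sample is followed by a high one, so the low-to-high transition probability is 1 and the low-to-low probability is 0.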
Classical Recurrence Plot (Classical RP). Recurrence plots were first introduced by Eckmann et al. as a graphical tool for visualizing recurrent states in dynamical systems [40]. Given a time series, the phase-space trajectory is reconstructed via Takens’ time-delay embedding with embedding dimension m and delay τ, producing a set of embedded state vectors. The Classical RP then applies a binary (Heaviside) threshold ε to the pairwise Euclidean distances between these embedded vectors: points whose distance falls below ε are marked as recurrent, while all others are marked as non-recurrent [38]. In this study, m = 3 and τ = 10 were used. The embedding dimension follows from Takens’ embedding theorem (m ≥ 2D_2 + 1) [41]: because the BCG cardiac ejection cycle is well approximated by a limit cycle attractor (D_2 ≈ 1), the minimum sufficient embedding dimension is m = 3. The choice of τ = 10 is motivated by the temporal structure of the BCG waveform: Kim et al. (2016) reported the I–K interval, which encompasses the complete primary ejection complex from the onset of the ascending aortic pressure gradient (I wave) to the peak decay of the descending aortic pressure gradient (K wave), at approximately 158–163 ms [42]. Given m = 3, τ = 10 (100 ms) yields a state-vector span of (m − 1) × τ = 200 ms, which is sufficient to embed this complete ejection complex within a single phase-space point. While this binary representation provides a clean visualization of the system’s recurrence structure, it discards distance magnitude information that could aid downstream classification.
Modified Recurrence Plot (Modified RP). Modified RP retains richer textural detail by replacing the binary Heaviside function with min–max normalization of pairwise distances to [0, 1]. The continuous-valued output preserves gradient information in the distance landscape, which CNNs can exploit more effectively than the binary output of the Classical RP. However, the embedding parameters (m = 3, τ = 10) remain fixed, offering no mechanism for signal-specific adaptation.
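The two recurrence-plot baselines differ only in the final step applied to the distance matrix, which the sketch below makes explicit. The function names, the toy sinusoidal input, and the threshold value `eps=0.5` are illustrative assumptions; the fixed ε used for the Classical RP is not stated in the text.

```python
import math

def embed(x, m=3, tau=10):
    """Time-delay embedding: v_i = (x_i, x_{i+tau}, ..., x_{i+(m-1)*tau})."""
    n_vec = len(x) - (m - 1) * tau
    return [[x[i + j * tau] for j in range(m)] for i in range(n_vec)]

def distance_matrix(vectors):
    """Pairwise Euclidean distances between embedded state vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]

def classical_rp(x, eps, m=3, tau=10):
    """Binary RP: 1 where the distance falls below eps, 0 otherwise."""
    return [[1 if d < eps else 0 for d in row]
            for row in distance_matrix(embed(x, m, tau))]

def modified_rp(x, m=3, tau=10):
    """Continuous RP: distances min-max normalized to [0, 1]."""
    D = distance_matrix(embed(x, m, tau))
    lo = min(map(min, D))
    hi = max(map(max, D))
    scale = (hi - lo) or 1.0  # guard against a constant signal
    return [[(d - lo) / scale for d in row] for row in D]

# Toy input: 3 s of a 1 Hz sinusoid sampled at 100 Hz (not real BCG data).
x = [math.sin(2 * math.pi * n / 100) for n in range(300)]
binary = classical_rp(x, eps=0.5)  # eps = 0.5 is an arbitrary illustrative value
graded = modified_rp(x)
# States one full period apart recur; states half a period apart do not.
print(len(binary), binary[0][100], binary[0][50])  # 280 1 0
```

The binary image keeps only whether two states are closer than ε, whereas the graded image retains how far apart they are, which is the information the Modified RP is intended to preserve.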
2.4.2. Proposed: Learnable Recurrence Plot
The central contribution of this work is a recurrence plot module whose key parameters are learned rather than prescribed. Three design choices set it apart from fixed-parameter alternatives: (a) multi-scale phase-space reconstruction using several time delays in parallel; (b) soft, differentiable thresholding with trainable ε values; and (c) an attention-driven mechanism that fuses the resulting multi-scale maps into a single image.
Multi-scale time-delay embedding. Instead of committing to a single delay value, the signal is embedded at K = 3 different scales: τ ∈ {5, 10, 20}. At a 100 Hz sampling rate, these correspond to delays of 50, 100, and 200 ms, respectively. These delays should be interpreted as complementary embedding offsets that emphasize different local dynamical structures within the BCG waveform, rather than as direct representations of full cardiac or respiratory cycle periods. Specifically, τ = 5 (50 ms) emphasizes short-lag morphological structure within the BCG waveform, including the fine structure of the cardiac ejection complex; τ = 10 (100 ms) captures intermediate temporal dependencies that may reflect short-lag variations in cardiac dynamics [32,33]; and τ = 20 (200 ms) yields a state-vector span of 400 ms, encompassing beat-to-beat interval modulations attributable to respiratory sinus arrhythmia (RSA; 0.15–0.4 Hz) [32,33], a well-established autonomic modulation of cardiac rhythm that decreases systematically from NREM to REM and is suppressed during wakefulness. The relevant temporal scale for each delay is the state-vector span (m − 1) × τ, not τ alone: at m = 3, the spans are 100 ms, 200 ms, and 400 ms for τ = 5, 10, and 20, respectively. The empirical validation of this τ selection is reported in Section 3.4. For each scale k, the standard time-delay embedding is applied with embedding dimension m = 3:

$$\mathbf{v}_i^{(k)} = \left[x_i,\; x_{i+\tau_k},\; x_{i+2\tau_k}\right], \qquad i = 1, \dots, L - (m-1)\tau_k \tag{1}$$
By running these three embeddings in parallel, the module constructs three distinct phase-space trajectories from the same epoch, each emphasizing different local dynamical structures without requiring explicit signal decomposition.
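The parallel embeddings and their state-vector spans can be illustrated with a short sketch. The function names are hypothetical, and the integer ramp used as input simply makes the sample offsets visible; it is not BCG data.

```python
FS = 100           # sampling rate in Hz (from the text)
M = 3              # embedding dimension (from the text)
TAUS = (5, 10, 20) # delays in samples (from the text)

def embed(x, m, tau):
    """Time-delay embedding with dimension m and delay tau (in samples)."""
    n_vec = len(x) - (m - 1) * tau
    return [tuple(x[i + j * tau] for j in range(m)) for i in range(n_vec)]

def multiscale_embed(x, m=M, taus=TAUS):
    """Build one phase-space trajectory per delay from the same epoch."""
    return {tau: embed(x, m, tau) for tau in taus}

def span_ms(m, tau, fs=FS):
    """State-vector span (m - 1) * tau expressed in milliseconds."""
    return (m - 1) * tau * 1000 / fs

epoch = list(range(3000))  # stand-in for one 30 s epoch
for tau, trajectory in multiscale_embed(epoch).items():
    print(tau, len(trajectory), trajectory[0], span_ms(M, tau))
# 5 2990 (0, 5, 10) 100.0
# 10 2980 (0, 10, 20) 200.0
# 20 2960 (0, 20, 40) 400.0
```

Each delay shortens the trajectory by (m − 1) × τ samples, so the three distance matrices have slightly different sizes before any resizing to a common image resolution.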
Differentiable threshold. A major shortcoming of the Classical RP is the need to choose ε manually. In the proposed method, this fixed threshold is replaced with a learned parameter. For each scale k, the pairwise Euclidean distances are computed and min–max normalized to [0, 1]. A sigmoid-parameterized threshold ε_k = σ(α_k) × 0.5 then converts these normalized distances, denoted D̃_ij^(k), into soft recurrence values:

$$R^{(k)}_{ij} = \sigma\!\left(\gamma\left(\varepsilon_k - \tilde{D}^{(k)}_{ij}\right)\right) \tag{2}$$

where α_k is a scalar learnable parameter, σ denotes the sigmoid function, and γ = 10 controls transition sharpness. Constraining ε_k to the interval (0, 0.5) keeps thresholds within a meaningful range. During training, gradients flow back through Equation (2) into α_k, allowing each scale’s threshold to move toward the value that minimizes the training loss.
Scale-wise attention and fusion. Each scale-specific map R^(k) is first multiplied by a learned importance weight w_k = σ(β_k) to form the weighted map R̃^(k) = w_k · R^(k). The three weighted maps are then stacked into a tensor R̃ ∈ ℝ^(K×N×N) and processed by a lightweight convolutional attention block with batch normalization (BN), whose output is passed through a softmax applied along the scale dimension k at each spatial location (i, j), yielding A ∈ ℝ^(K×N×N). The attention map A reweights each scale at every spatial position, and the attended tensor R̃ ⊙ A (element-wise multiplication) is then collapsed into a single-channel output through a fusion head consisting of a 3 × 3 convolution followed by batch normalization and ReLU activation, and a final 1 × 1 convolution with sigmoid activation, i.e., R_final = Sigmoid(Conv_1×1(ReLU(BN(Conv_3×3(R̃ ⊙ A))))). Both the attention and fusion blocks employ 16 intermediate feature channels, keeping the module lightweight to avoid overshadowing the backbone’s role in feature extraction.
End-to-end joint optimization. Every parameter in the Learnable RP module (threshold logits {α_k}, scale weights {β_k}, and all convolutional weights in the attention-fusion block) is updated through standard backpropagation alongside the backbone. A higher learning rate (6 × 10⁻⁴) is assigned to the transformation parameters than to the pretrained backbone (3 × 10⁻⁴), so that the module adapts quickly while the backbone fine-tunes gradually. It should be noted that ‘end-to-end’ here refers specifically to the joint optimization of the RP module parameters, scale weights, attention-fusion convolutional weights, and backbone classifier via a single backpropagation pass; z-score normalization and the fixed time-delay embedding geometry are standard preprocessing steps and are not subject to gradient-based optimization.
The fixed τ geometry does constrain the phase-space reconstruction prior to any learned transformation. However, inter-subject adaptability is provided by the learned components operating on the resulting distance matrices: the per-scale thresholds ε_k adapt the recurrence boundary to the BCG amplitude distribution, and the scale weights w_k = σ(β_k), together with the spatial attention map A, selectively emphasise the temporal scales that are most discriminative for a given cardiac morphology. This is supported by the 5.02–7.24 pp accuracy gain over Modified RP, which uses the same embedding dimension and the τ = 10 delay but non-adaptive parameters, indicating that learned thresholds and attention weights provide an adaptation that fixed parameterisation cannot achieve. Extension to a continuous, learnable τ space is identified as a direction for future work (Section 4.5).
2.5. Backbone Architectures
Transformed images (224 × 224, single channel) were fed to seven pretrained image classification backbones spanning different design paradigms: EfficientNet-B5 and EfficientNet-B0 (compound scaling), ResNet-50 (skip connections), DenseNet-121 (dense connectivity), MobileNetV3-Large (depthwise-separable convolutions), Swin-Tiny (shifted-window self-attention), and ConvNeXt-Tiny (modernized convolutional design). All models were sourced from the timm library and initialized with ImageNet-pretrained weights. The first convolutional layer was modified to accept a single input channel, and the final fully connected layer was replaced with a three-class output head.
2.6. Training Configuration
Cross-entropy loss with label smoothing (s = 0.1) was employed. Because sleep datasets are inherently imbalanced, with NREM epochs typically far outnumbering Wake or REM, class weights were automatically computed from inverse class frequencies in the training fold. The AdamW optimizer was used with an initial learning rate of 3 × 10⁻⁴ for the backbone (6 × 10⁻⁴ for the Learnable RP parameters, as described in Section 2.4.2) and a weight decay of 1 × 10⁻⁴. The learning rate was halved whenever the monitored metric plateaued for five consecutive epochs (ReduceLROnPlateau). Gradient norms were clipped at 5.0, and mixed-precision training (FP16) was enabled throughout. The batch size was set to 256.
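The class weighting and label smoothing can be illustrated on a single sample. This is a simplified sketch under stated assumptions: the weight normalization N / (C × n_c) is one common inverse-frequency convention, scaling the loss by the target class weight is a simplification (deep learning frameworks may combine weights and smoothing differently), and the toy label counts are invented for illustration.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, n_classes=3):
    """w_c = N / (n_classes * count_c); assumes every class occurs at least once."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (n_classes * counts[c]) for c in range(n_classes)]

def smoothed_cross_entropy(logits, target, weights, s=0.1):
    """Cross-entropy for one sample with label smoothing s, scaled by the target's weight."""
    k = len(logits)
    mx = max(logits)  # subtract the maximum for a numerically stable log-softmax
    log_z = mx + math.log(sum(math.exp(v - mx) for v in logits))
    log_p = [v - log_z for v in logits]
    # Smoothed target: (1 - s) on the true class plus s / k spread over all classes.
    q = [(1 - s) * (1.0 if c == target else 0.0) + s / k for c in range(k)]
    return -weights[target] * sum(qc * lp for qc, lp in zip(q, log_p))

# Toy training fold dominated by NREM (class 1); the counts are hypothetical.
train_labels = [1] * 70 + [0] * 15 + [2] * 15
w = inverse_frequency_weights(train_labels)
print([round(v, 3) for v in w])  # [2.222, 0.476, 2.222]
print(round(smoothed_cross_entropy([2.0, 0.5, 0.1], target=0, weights=w), 4))
```

The minority classes receive weights above 1 and NREM a weight below 1, so errors on Wake and REM epochs contribute more to the loss.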
Training was terminated after 15 epochs without improvement. Rather than tracking accuracy alone, a weighted composite early stopping metric that also incorporates F1_macro and Cohen’s κ was monitored, with a minimum improvement threshold δ = 0.001. Assigning the largest weight to F1_macro guards against models that achieve high overall accuracy by neglecting minority classes. Cohen’s κ further penalizes agreement attributable to chance.
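The stopping rule can be sketched as a small helper. The patience of 15 epochs and δ = 0.001 come from the text; the composite weights below are hypothetical placeholders chosen only so that macro F1 receives the largest weight, and the class and function names are illustrative.

```python
# Hypothetical weights: the text states only that macro F1 receives the largest weight.
W_F1, W_ACC, W_KAPPA = 0.5, 0.3, 0.2

def composite(acc, f1_macro, kappa):
    """Weighted combination of validation metrics used as the monitored score."""
    return W_F1 * f1_macro + W_ACC * acc + W_KAPPA * kappa

class EarlyStopper:
    """Stop once the score has failed to improve by more than delta for `patience` epochs."""

    def __init__(self, patience=15, delta=0.001):
        self.patience = patience
        self.delta = delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one validation score and return True when training should stop."""
        if score > self.best + self.delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper()
score = composite(acc=0.70, f1_macro=0.65, kappa=0.50)  # made-up validation metrics
print(round(score, 3), stopper.step(score))  # 0.635 False
```

An improvement smaller than δ counts as no improvement, which prevents noise-level fluctuations from resetting the patience counter.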
2.7. Evaluation Procedure
A 10-fold stratified cross-validation scheme was adopted. Within each outer fold, an inner 5-fold stratified split allocated 80% of the training partition for optimization and 20% for validation, thereby preventing data leakage and yielding robust performance estimates. The following metrics are reported: overall accuracy, macro-averaged F1-score, weighted F1-score, Cohen’s κ, and multi-class Receiver Operating Characteristic-Area Under the Curve (ROC-AUC), all expressed as mean ± standard deviation across the 10 folds. For pairwise comparisons between the Learnable RP and each baseline, paired t-tests were conducted on the per-fold accuracy values. Cohen’s d was computed to quantify practical significance, with effect sizes interpreted following standard thresholds: |d| < 0.2 negligible, 0.2–0.5 small, 0.5–0.8 medium, and >0.8 large. Bonferroni correction was applied to control the family-wise error rate across all 24 comparisons (4 baselines × 6 backbones).
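The paired comparison can be sketched as follows. The sketch assumes Cohen's d is computed on the per-fold differences (mean difference divided by the standard deviation of the differences), which is one common paired-sample convention and may differ from the variant used in the study. The fold accuracies are invented for illustration, and the p-value is omitted because it requires the t distribution with n − 1 degrees of freedom, which the standard library does not provide.

```python
import math
from statistics import mean, stdev

def paired_t_and_d(a, b):
    """Paired t statistic and Cohen's d computed on per-fold differences."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)               # sample standard deviation (n - 1 denominator)
    d = mean(diffs) / sd            # Cohen's d on the differences
    t = d * math.sqrt(len(diffs))   # equals mean / (sd / sqrt(n))
    return t, d

def effect_label(d):
    """Thresholds from the text; a value of exactly 0.8 is treated as large here."""
    ad = abs(d)
    if ad < 0.2:
        return "negligible"
    if ad < 0.5:
        return "small"
    if ad < 0.8:
        return "medium"
    return "large"

# Bonferroni-corrected significance level for 4 baselines x 6 backbones.
bonferroni_alpha = 0.05 / (4 * 6)

# Hypothetical per-fold accuracies (not the study's results).
proposed = [0.74, 0.73, 0.75, 0.72, 0.74]
baseline = [0.68, 0.69, 0.69, 0.67, 0.68]
t_stat, d = paired_t_and_d(proposed, baseline)
print(round(bonferroni_alpha, 5), effect_label(d))  # 0.00208 large
```

Because the denominator is the spread of the fold-wise differences, highly consistent differences across folds produce very large d values even when the mean difference is modest.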
3. Results
3.1. Aggregate Performance
Table 1 summarizes classification performance across all transformation–backbone combinations. A total of 350 experiments (5 methods × 7 backbones × 10 folds) were conducted. ConvNeXt-Tiny was excluded from the aggregate analysis because it failed to converge under any transformation method (see
Section 4.3), yielding 300 runs (5 methods × 6 backbones × 10 folds) for the results reported below.
The Learnable RP achieved the highest performance across all metrics, attaining 73.60% accuracy, 68.87% macro F1-score, a Cohen’s κ of 0.5202, and an ROC-AUC of 0.8353. It outperformed the runner-up (Modified RP) by 5.35 percentage points (pp) in accuracy and 6.53 pp in macro F1-score. Notably, its standard deviation was the lowest among the five methods (3.19%), indicating that the learned parameters not only improve mean performance but also stabilize it across folds and backbones. Classical RP trailed by nearly 15 pp in accuracy, consistent with the expectation that its binary thresholding discards fine-grained distance information critical for distinguishing sleep stages in BCG signals.
Figure 2 presents the aggregate results across all four evaluation metrics. As shown in
Figure 2a, the Learnable RP achieved 73.6% accuracy, exceeding Modified RP (68.2%) by 5.4 pp and Classical RP (58.7%) by 14.9 pp. A consistent pattern is observed for F1-macro (
Figure 2b), where the Learnable RP attained 68.9% compared to 62.3% for Modified RP. The Learnable RP also demonstrated the largest Cohen’s κ (0.520 versus 0.428 for Modified RP;
Figure 2c) and ROC-AUC (0.835 versus 0.790 for Modified RP;
Figure 2d), confirming that the performance gains extend to class-balanced agreement and probabilistic discrimination.
The Learnable RP ranked first for all six backbones. Mean fold-wise accuracy was highest for EfficientNet-B5 (78.91%), followed by DenseNet-121 (74.64%), Swin-Tiny (74.62%), ResNet-50 (72.39%), EfficientNet-B0 (71.43%), and MobileNetV3-Large (69.58%). EfficientNet-B5 thus exceeded MobileNetV3-Large by 9.33 pp, which may reflect the larger model’s superior capacity for extracting fine-textured patterns from recurrence images. The Swin-Tiny transformer backbone reached 74.62%, suggesting that shifted-window attention can also exploit RP-derived textures effectively, though not as well as EfficientNet-B5 in this setting.
3.2. Statistical Validation
All 24 pairwise comparisons (4 baselines × 6 backbones) reached significance at p < 0.001 after Bonferroni correction (α = 0.05/24 = 0.00208), with Cohen’s d ranging from 1.82 to 14.64. Table 2 presents a representative subset ordered by effect size.
Effect sizes were uniformly large (all d > 1.8), with the most extreme values observed against Classical RP on EfficientNet-B5 (d = 14.64) and MTF on ResNet-50 (d = 14.45). We note that in a 10-fold cross-validation on 50 subjects, fold-to-fold performance differences exhibit low variance by construction, which mechanically inflates Cohen’s d; extreme values such as d = 14.64 should not be interpreted as population-level effect size estimates, but as indicators of directional consistency of fold-wise differences within this dataset. The practically meaningful quantities are the accuracy differences themselves: +17.26 pp over Classical RP and +5.02–7.24 pp over Modified RP across backbone combinations. The uniformly large d values nonetheless confirm that the performance advantage of the Learnable RP is consistent in direction across all folds and backbone combinations.
3.3. Best-Case Analysis: Confusion Matrix and ROC Curves
The single best model across all 300 analyzed runs was the Learnable RP paired with EfficientNet-B5 in Fold 8, achieving 80.24% test accuracy.
The confusion matrix (
Figure 3) reveals per-class recall patterns. REM was classified most reliably (85.63%), followed by Wake (78.10%) and NREM (66.81%). The relatively lower NREM recall is primarily attributable to confusion with REM: 28.25% of NREM epochs were misclassified as REM. Wake and REM exhibited moderate mutual confusion (16.26% of Wake epochs misclassified as REM), which has a physiological basis, as both stages feature elevated sympathetic tone relative to NREM, producing overlapping BCG morphologies. Notably, misclassifications between Wake and NREM were comparatively rare (5.64% and 4.94%), suggesting that the model effectively distinguishes the strong physiological contrast between wakefulness and non-REM sleep.
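The per-class recall values quoted above are the diagonal of the row-normalized confusion matrix. The sketch below shows the computation on hypothetical counts and checks that the two rows fully reported in the text sum to 100%; the function name and the count matrix are illustrative, and the REM row is omitted because only its recall is reported.

```python
def per_class_recall(cm):
    """Row-normalize a confusion matrix (rows = true class) and return the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Hypothetical counts (rows: true Wake, NREM, REM; columns: predicted Wake, NREM, REM).
toy_cm = [
    [80, 5, 15],
    [10, 140, 50],
    [4, 8, 88],
]
print(per_class_recall(toy_cm))  # [0.8, 0.7, 0.88]

# Row percentages reported in the text for the best model (Fold 8).
wake_row = [78.10, 5.64, 16.26]  # true Wake predicted as Wake, NREM, REM
nrem_row = [4.94, 66.81, 28.25]  # true NREM predicted as Wake, NREM, REM
print(round(sum(wake_row), 2), round(sum(nrem_row), 2))  # 100.0 100.0
```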
The multi-class ROC curves (
Figure 4) confirmed strong discriminative capacity across all three classes, with all per-class AUC values exceeding 0.83. Wake achieved the highest AUC (0.904), indicating that the model assigns high confidence scores to true Wake epochs with minimal false-positive trade-off. REM followed at 0.857, while NREM yielded 0.840, consistent with the confusion matrix observation that NREM is the most challenging class owing to its overlap with REM. The micro-average AUC of 0.854 indicates robust overall probabilistic discrimination well above the chance level of 0.5.
3.4. Ablation Study
To empirically validate the choice of τ ∈ {5, 10, 20} and to quantify the contribution of the learned transformation parameters, two ablation experiments were conducted using the best-performing backbone (EfficientNet-B5, 5-fold stratified cross-validation). Results are summarised in
Table 3.
Regarding τ selection, the results reveal a consistent pattern centred on the importance of τ = 5. The two configurations that exclude τ = 5 produced the largest accuracy degradations: S1 (τ = {10} only, −0.90 pp, p = 0.025, d = 1.562) and S5 (τ = {10, 20, 40}, −1.30 pp, p = 0.068, d = 1.112). S1 reached statistical significance, while S5 showed a trend-level effect (p = 0.068) that did not reach α = 0.05; the limited power of 5-fold CV (n = 5 paired observations) is consistent with d ≈ 1.1 being detectable as a trend rather than a confirmed effect. Both configurations eliminate the early-ejection temporal scale associated with the I–J interval (~68–75 ms; see Section 4.2). In contrast, all configurations retaining τ = 5, namely S2 (τ = {5, 10}), S4 (τ = {5, 10, 30}), and S6 (τ = {1, 5, 10}), showed no statistically significant difference from the proposed set (all p > 0.20). Although S6 yielded a marginally higher point estimate (+0.47 pp), this difference was not significant (p = 0.205), and τ = 1 (10 ms) falls below the cardiac refractory period (~200 ms), providing no physiological basis for inclusion. Furthermore, the proposed set demonstrates lower fold-to-fold variance (±0.37 pp) compared to the two-scale alternative S2 (±0.44 pp), suggesting that τ = 20 contributes representational stability. Together, these results indicate that τ = 5 is the key component of the multi-scale advantage, and that the three-scale combination τ = {5, 10, 20} is a physiologically motivated and empirically stable design.
To quantify the independent and synergistic contributions of the three proposed components, a second ablation was conducted using five configurations (EfficientNet-B5, 5-fold stratified CV). Results are summarised in
Table 4.
4. Discussion
4.1. Why Learnable Parameters Matter
Making the RP transformation learnable yielded accuracy gains in all 24 baseline–backbone comparisons (Table 2): for every backbone tested, the Learnable RP outperformed all fixed-parameter alternatives, with gains ranging from 5.02 pp (the smallest gain over Modified RP) to 17.26 pp (the largest gain over Classical RP).
The dominant contributor is multi-scale phase-space reconstruction combined with spatial attention fusion (+1.03 pp when comparing C0 and C2): BCG signals encode cardiac and respiratory rhythms at different temporal scales, and a single fixed delay inevitably prioritizes one rhythm over the other. The learnable threshold provides negligible gain in a single-scale setting (−0.06 pp when comparing C0 and C1), but yields a meaningful synergistic contribution within a multi-scale context (+0.76 pp when comparing C2 and C4), accompanied by a reduction in fold-to-fold variance from ±0.85% to ±0.37%. This variance reduction is particularly relevant given the small cohort size (50 subjects), where prediction stability across folds is a practically important property alongside mean accuracy. These results confirm that the three components interact synergistically, and that the 5.02–7.24 pp gap over Modified RP (
Table 2) reflects the combined effect of adaptive thresholding and multi-scale attention operating on a richer structural basis.
These performance patterns are consistent with sleep-stage-specific differences in BCG recurrence structure. During NREM sleep, parasympathetic dominance produces regular beat-to-beat intervals, manifesting as long continuous diagonal lines in the recurrence image [
41]. During REM sleep, intermittent sympathetic activation disrupts this regularity, yielding fragmented diagonal structures. During wakefulness, gross body movements eliminate quasi-periodicity, producing matrices dominated by isolated recurrent points. These stage-dependent textural characteristics are consistent with the confusion matrix pattern observed: the elevated NREM-to-REM confusion rate (28.25%) reflects genuine autonomic overlap between N2 and light REM sleep, whereas Wake-to-NREM and NREM-to-Wake misclassifications were comparatively rare (5.64% and 4.94%, respectively), in accordance with the pronounced physiological contrast between wakefulness and non-REM sleep.
4.2. Physiological Grounding of the Multi-Scale Design
The choice of τ ∈ {5, 10, 20} was motivated by the temporal structure of the BCG ejection complex. Kim et al. (2016) reported the I–J interval (peak ascending aortic pressure gradient, early ejection phase) at approximately 68–75 ms, the J–K interval (descending aortic pressure gradient relaxation) at approximately 88–91 ms, and the full I–K interval (complete primary ejection complex) at approximately 158–163 ms [42]. Given m = 3, the state vector spans (m − 1)τ: τ = 5 (50 ms) yields a span of 100 ms, sensitive to the early ejection phase corresponding to the I–J interval; τ = 10 (100 ms) yields a span of 200 ms, covering the complete I–K ejection complex; and τ = 20 (200 ms) yields a span of 400 ms, capturing post-ejection oscillations and beat-to-beat interval modulations attributable to respiratory sinus arrhythmia (RSA; 0.15–0.4 Hz) [32,33]. RSA magnitude decreases systematically from NREM to REM and is suppressed during wakefulness, providing a physiologically specific basis for the 400 ms temporal scale. Each delay thus provides a structurally distinct phase-space projection of the cardiac cycle, without claiming to directly represent full cardiac or respiratory cycle periods.
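The span arithmetic above can be checked with a short Python sketch. The sampling rate of 100 Hz is an assumption inferred from the stated correspondence between τ = 5 samples and 50 ms; the constant and function names are illustrative.

```python
FS_HZ = 100        # assumed sampling rate, implied by tau = 5 samples <-> 50 ms
M = 3              # embedding dimension stated in the text
TAUS = (5, 10, 20)

def span_ms(tau, m=M, fs_hz=FS_HZ):
    """Temporal extent of one state vector: (m - 1) * tau samples, in ms."""
    return (m - 1) * tau * 1000 / fs_hz

spans = {tau: span_ms(tau) for tau in TAUS}
print(spans)  # {5: 100.0, 10: 200.0, 20: 400.0}

# Reported BCG intervals (ms) from the text, compared against each span
ij_upper_ms, ik_upper_ms = 75, 163
covers_ij = spans[5] >= ij_upper_ms    # 100 ms span contains the I-J interval
covers_ik = spans[10] >= ik_upper_ms   # 200 ms span contains the I-K complex

# RSA band 0.15-0.4 Hz corresponds to periods far longer than 400 ms
rsa_period_s = (1 / 0.4, 1 / 0.15)     # roughly 2.5 s to 6.7 s
print(covers_ij, covers_ik, rsa_period_s)
```

The RSA periods of several seconds show why the 400 ms span is described as sensitive to RSA-related modulation rather than as a representation of a full respiratory cycle.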
Previous BCG-based sleep staging studies have relied on explicit extraction of HRV and RRV features to capture these dynamics [28,29,30,34]. Rather than requiring explicit signal decomposition or peak detection, the Learnable RP encodes temporal structure at timescales consistent with cardiac and respiratory dynamics through gradient-based optimization. Direct validation through feature-level correspondence analysis with extracted HRV or RRV features remains a direction for future work.
4.3. The Role of Backbone Architecture
Although the Learnable RP was the top-performing transformation method for all backbones, absolute accuracy ranged from 69.58% (MobileNetV3-Large) to 78.91% (EfficientNet-B5), a span of 9.33 pp that underscores backbone selection as a consequential design decision. The compound scaling of EfficientNet-B5 yields approximately 30 million parameters, providing greater capacity for building hierarchical features from textured recurrence images. Lighter architectures such as MobileNetV3, while appealing for edge deployment, appear to lack sufficient depth to fully exploit the fine spatial patterns that differentiate sleep stages.
ConvNeXt-Tiny failed to converge under any transformation method and was consequently excluded. To investigate this failure, supplementary trials were conducted using learning rates from 1 × 10⁻³ to 1 × 10⁻⁵ (with and without linear warmup and reduced weight decay of 1 × 10⁻⁵) for all five transformation methods. Convergence failure persisted across all configurations and all transformation methods, indicating that the issue is not specific to the Learnable RP but reflects a general architectural incompatibility. Gradient norm monitoring revealed near-zero gradients in the early depthwise convolutional stages, consistent with the known sensitivity of LayerScale initialization (which scales residual branch outputs by near-zero learnable scalars) to inputs whose pixel statistics deviate substantially from ImageNet distributions. Recurrence plots, bounded to [0, 1] with strong spatial symmetry and diagonal structure, satisfy this condition. This finding is consistent with prior reports that time-series image representations induce architecture-dependent performance [38]. Domain-adaptive initialization strategies for LayerScale-based architectures are noted as a direction for future work.
To quantify the computational overhead of the Learnable RP module, per-sample inference latency was measured on an NVIDIA RTX A6000 with 48 GB GDDR6 (NVIDIA Corporation, Santa Clara, CA, USA), using a batch size of 1 and n = 100 repeated measurements after 20 warmup iterations. The Learnable RP achieved 15.55 ± 0.27 ms, compared to 14.10 ± 0.51 ms for Modified RP, 15.44 ± 2.42 ms for Classical RP, 14.31 ± 0.40 ms for GAF, and 13.95 ± 0.19 ms for MTF. The RP transformation module itself accounts for 1.32 ms (8.5% of total inference time). The total parameter count is the same at the reported precision across all methods (28.35 M), as the Learnable RP module introduces fewer than 600 additional parameters relative to the EfficientNet-B5 backbone. The incremental latency of 1.45 ms over Modified RP is negligible relative to the 30 s epoch granularity of the target application under server-grade GPU conditions. We acknowledge that latency measured on this hardware does not directly characterize embedded deployment scenarios. However, because the module’s footprint is small in both time and parameter count, its contribution to edge-device latency would be expected to scale proportionally with backbone inference time. Optimization strategies such as parameter freezing after training, pre-computation of recurrence representations, and knowledge distillation, noted in Section 4.5, remain the primary path toward real-time embedded deployment.
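The measurement protocol described above (warmup iterations followed by repeated timed runs) can be sketched in Python as follows. The `dummy_inference` workload is a hypothetical CPU stand-in for a model forward pass, and `measure_latency_ms` is an illustrative helper rather than the benchmarking code used in this study; timing on a GPU would additionally require synchronizing the device before each clock reading.

```python
import statistics
import time

def measure_latency_ms(fn, n_warmup=20, n_repeats=100):
    """Return (mean, sample standard deviation) of per-call latency in ms."""
    for _ in range(n_warmup):          # warmup calls are executed but not timed
        fn()
    samples = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)

def dummy_inference():
    """Hypothetical stand-in; a real benchmark would run the model on one epoch."""
    return sum(i * i for i in range(10_000))

mean_ms, std_ms = measure_latency_ms(dummy_inference)
print(f"{mean_ms:.3f} ± {std_ms:.3f} ms")

# Arithmetic on the figures reported in the text
overhead_ms = round(15.55 - 14.10, 2)            # Learnable RP minus Modified RP
module_share_pct = round(100 * 1.32 / 15.55, 1)  # RP module share of total latency
print(overhead_ms, module_share_pct)             # 1.45 8.5
```

The final two lines reproduce the reported overhead of 1.45 ms and the 8.5% module share from the stated latencies.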
4.4. Comparison with Existing BCG-Based Methods
Direct comparison with published BCG sleep staging studies is not appropriate owing to fundamental differences in classification scheme, cohort size, reference labels, and feature extraction. The following figures are provided for contextual orientation only and should not be interpreted as evidence of competitive equivalence. Wang et al. (2025) employed a 5-class scheme on 50 subjects with PSG reference labels and 167 manually engineered HRV/RRV features [28]; Li et al. (2024) achieved 5-class staging on over 9600 subjects with PSG reference labels via cross-modality transfer learning [30]; and Mitsukura et al. (2020) used a 5-class scheme on 22 subjects with PSG reference labels and HRV-based logistic regression [29]. The present study, by contrast, uses a 3-class scheme on 50 subjects with Dreem surrogate labels and no feature engineering. A 3-class problem is structurally easier than a 5-class problem because the merged NREM class subsumes N1/N2/N3 distinctions; any numerical comparison between 3-class and 5-class accuracies is therefore uninformative.
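The effect of class consolidation on apparent accuracy can be illustrated with a toy Python example. The label sequences below are fabricated purely for demonstration and do not come from any dataset; the mapping and helper names are likewise illustrative.

```python
# Mapping from the 5-class scheme to the 3-class scheme described in the text
FIVE_TO_THREE = {"Wake": "Wake", "N1": "NREM", "N2": "NREM",
                 "N3": "NREM", "REM": "REM"}

def consolidate(labels):
    """Merge N1/N2/N3 into a single NREM class."""
    return [FIVE_TO_THREE[stage] for stage in labels]

def agreement(a, b):
    """Fraction of epochs on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy sequences (hypothetical): most errors fall within the NREM sub-stages
reference = ["Wake", "N1", "N2", "N2", "N3", "REM", "N2", "Wake"]
predicted = ["Wake", "N2", "N2", "N3", "N3", "REM", "N1", "N1"]

acc5 = agreement(reference, predicted)
acc3 = agreement(consolidate(reference), consolidate(predicted))
print(acc5, acc3)  # 0.5 0.875
```

The same predictions score 0.5 under the 5-class scheme and 0.875 after consolidation, because errors among N1, N2, and N3 vanish once those stages are merged.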
A notable practical characteristic of the proposed framework is its conceptual simplicity and end-to-end trainability. The pipeline requires no peak detection, no beat-to-beat interval computation, and no handcrafted feature selection: the raw BCG epoch serves as the sole input and a sleep-stage label as the output. For deployment scenarios such as home sleep monitors and consumer wellness devices, where signal-processing expertise may be limited, this streamlined architecture may reduce the barrier to deployment—though its practical utility requires validation in real-world home environments.
4.5. Limitations and Future Work
Several limitations should be acknowledged. First, the present study should be interpreted as a proof-of-concept demonstration rather than a large-scale validation. The cohort of 50 subjects is consistent with prior BCG sleep staging studies (Mitsukura et al. (2020) used 22 subjects [29] and Wang et al. (2025) used 50 subjects [28]) but is insufficient to establish generalizability across diverse populations, sensor configurations, and clinical settings. The fold-to-fold standard deviations in Table 1 (e.g., ±3.19% for the Learnable RP) reflect this variability, and the cross-validated mean accuracy of 78.91% (EfficientNet-B5) is the representative performance estimate; the best-case single-fold result (80.24%, Fold 8) should not be interpreted as typical performance. Multi-site replication with larger cohorts remains the primary direction for future work.
Second, and most critically, reference sleep-stage labels were obtained using the Dreem Headband rather than manually scored polysomnography (PSG). This constitutes a primary limitation of the present work. The Dreem system has been validated against PSG in a 5-class scoring scheme, achieving an overall accuracy of 83.5 ± 6.4% and an F1 score of 83.8 ± 6.3%, compared to 86.4 ± 8.0% and 86.3 ± 7.4% for the average of five expert scorers [22]. The 3-class consolidation used in the present study (merging N1, N2, and N3 into NREM) would be expected to yield higher Dreem–PSG agreement than these 5-class figures, as the most ambiguous inter-class boundaries (N1/N2/N3) are subsumed. Nevertheless, the Dreem system does not replicate gold-standard PSG annotations, and its agreement with expert PSG scoring is lowest at N1/N2-to-REM transitional epochs. Mechanistically, because the per-scale thresholds εₖ and spatial attention weights are optimized against training labels, systematic mislabeling of these transitional epochs directly biases the learned parameters toward recurrence structures associated with ambiguous stage boundaries rather than physiologically distinct states. This label noise is a plausible contributor to the elevated NREM-to-REM confusion rate of 28.25% observed in the confusion matrix. All reported accuracy figures reflect agreement with Dreem-derived labels and should not be interpreted as PSG-level clinical staging performance. Furthermore, because the input modality is BCG, whereas conventional sleep-stage definitions are grounded in EEG/EOG/EMG signals, prospective validation against manually scored PSG is necessary before clinical relevance can be established.
Third, the current three-class scheme merges all non-REM stages; extending the framework to four- or five-class classification (e.g., separating N1/N2 from N3) would enhance clinical relevance.
Fourth, the Learnable RP generates recurrence representations during both training and inference, introducing additional computational overhead. Therefore, further optimization strategies, such as parameter freezing, pre-computation of recurrence representations, or knowledge distillation, should be explored to facilitate real-time embedded deployment.
Fifth, the embedding time delays τ ∈ {5, 10, 20} are fixed across all subjects, constraining the phase-space geometry prior to any learnable transformation. Although the learned per-scale thresholds, scale weights, and spatial attention weights provide inter-subject adaptation within this fixed geometry, inter-subject differences in BCG morphology arising from body habitus, posture, and sensor-mattress coupling may not be fully captured by threshold and weight adaptation alone. Future work includes extending the time-delay representation to a continuous, learnable τ space, integrating temporal sequence models (e.g., LSTM or Transformer) to capture inter-epoch dependencies, leveraging self-supervised learning on large-scale unlabeled BCG data, and conducting prospective validation in real-world home environments.
5. Conclusions
This study presented a learnable recurrence plot (Learnable RP) framework for three-stage sleep classification (Wake, NREM, REM) from single-channel BCG signals. Its three core innovations, namely differentiable per-scale thresholds, multi-scale phase-space reconstruction at physiologically motivated delays (τ = 5, 10, 20), and attention-based spatial fusion, work in concert to produce image representations that adapt to the characteristics of the input signal during end-to-end training.
The framework was evaluated through 10-fold stratified cross-validation across six backbone architectures using 50 overnight BCG recordings. The Learnable RP consistently outperformed four baseline transformation methods (GAF, MTF, Classical RP, Modified RP), attaining a mean accuracy of 73.60%, a Cohen’s κ of 0.520, and an ROC-AUC of 0.835, with the highest backbone-level mean accuracy reaching 78.91% on EfficientNet-B5. All 24 pairwise statistical comparisons were significant at p < 0.001, accompanied by uniformly large effect sizes (Cohen’s d = 1.82–14.64). The best single-fold model achieved 80.24% accuracy with per-class ROC-AUC values of 0.904 (Wake), 0.857 (REM), and 0.840 (NREM). These results demonstrate that replacing fixed transformation parameters with learned ones yields statistically significant and consistent accuracy gains of 5.02–17.26 pp across all 24 pairwise comparisons without any explicit physiological feature engineering, offering a viable path toward simpler and fully end-to-end unobtrusive sleep monitoring systems.