Automatic Sleep Staging Using SleepXLSTM Based on Heterogeneous Representation of Heart Rate Data

Wu, Tianlong; Mao, Zisen; Shi, Luyang; Zhou, Huaren; Xie, Chaohua; Ran, Bowen

doi:10.3390/electronics15030505


by Tianlong Wu ¹, Zisen Mao ¹,*, Luyang Shi ², Huaren Zhou ¹, Chaohua Xie ² and Bowen Ran ²

¹ Basic Department, Army Engineering University of PLA, Nanjing 210007, China

² College of National Defence Engineering, Army Engineering University of PLA, Nanjing 210007, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(3), 505; https://doi.org/10.3390/electronics15030505

Submission received: 21 December 2025 / Revised: 18 January 2026 / Accepted: 21 January 2026 / Published: 23 January 2026

(This article belongs to the Section Artificial Intelligence)


Abstract

Automatic sleep staging technology based on wearable photoplethysmography can provide a non-invasive and continuous solution for large-scale sleep health monitoring. This study accordingly developed a novel cross-scale dynamically coupled extended long short-term memory network (SleepXLSTM) to realize automatic sleep staging based on heart rate signals collected by wearable devices. SleepXLSTM models the relationship between heart rate fluctuations and sleep stage labels by correlating physiological features with clinical semantics using a knowledge graph neural network. Furthermore, an excitation–inhibition dual-effect regulator is applied in an improved multiplicative long short-term memory network along with memory mixing in a scalar long short-term memory network to extract and strengthen the key heart rate timing features while filtering out noise produced by motion artifacts, thereby facilitating subsequent high-precision sleep staging. The benefits and functions of this comprehensive heart rate feature extraction were demonstrated using sleep staging prediction and ablation experiments. The proposed model exhibited a superior accuracy of 91.25% and Cohen’s kappa coefficient of 0.876 compared to an extant state-of-the-art neural network sleep staging model with an accuracy of 69.80% and kappa coefficient of 0.040. On the ISRUC-Sleep dataset, the model achieved an accuracy of 87.51% and F1 score of 0.8760. The dynamic coupling strategy employed by SleepXLSTM for automatic sleep staging using the heterogeneous temporal representation of heart rate data can promote the development of smart wearable devices to provide early warning of sleep disorders and realize cost-effective technical support for sleep health management.

Keywords:

sleep stage; heterogeneous temporal representation; extended long short-term memory network; dual-acting excitation-inhibitory regulator

1. Introduction

The maintenance of sleep homeostasis is a critical regulatory mechanism of human physiological functions, and sleep disorders are associated with a variety of chronic diseases. Many studies have confirmed that a long-term lack of sleep or abnormal sleep structure can lead to sympathetic hyperactivity [1] and metabolic regulation imbalance [2], which can significantly increase the risk of cardiovascular and cerebrovascular diseases [3], diabetes [4], and neurodegenerative diseases [5]. According to statistics from the World Health Organization, approximately 27% of adults worldwide have at least one symptom of a sleep disorder; failure to diagnose these cases will accelerate the progression of chronic disease owing to continuous physiological decompensation [6]. Therefore, the accurate monitoring of sleep quality and early identification of abnormal sleep have become key aspects of preventive medicine and chronic disease management.

Polysomnography (PSG), which is accepted as the gold standard for clinical sleep monitoring, synchronizes the acquisition of multimodal physiological signals, including electroencephalogram (EEG), eye movement (EOG), and electromyography (EMG) data, to realize sleep staging [7]; however, the subject must be fixed in the laboratory environment and connected to equipment using invasive electrodes. Consequently, the monitoring cost is high, subject comfort is low, and the results cannot reflect the daily natural sleep state [8].

Wearable devices (e.g., smart wristbands collecting photoplethysmography data) can provide an alternative method for continuous sleep monitoring at home [9,10,11]. For example, Walch et al. [12] collected raw acceleration and heart rate data from an Apple Watch, then applied different algorithms (logistic regression, K-nearest neighbors, random forest, and a neural network) to classify sleep stages as “Wake,” “Non-Rapid Eye Movement” (NREM) and “Rapid Eye Movement” (REM). The neural network exhibited the best performance, classifying sleep stages with an accuracy of 62.4%. We note that the sleep data collected by wearable devices is personal health information. In the context of the Internet of Medical Things (IoMT), protecting data privacy during transmission and storage is as important as ensuring accuracy [13]. Although traditional machine learning-based methods initially showed potential in wearable sleep monitoring, their performance (62.4% accuracy) limits their practicability. Physiological signals such as heart rate and acceleration are heterogeneous with respect to sleep stages; that is, data of different modalities differ fundamentally in their distributions and dynamics. Heart rate signals are a continuous numerical sequence, and their fluctuations are regulated by the autonomic nervous system. The relationship between heart rate signals and sleep stages is a nonlinear, high-dimensional mapping, whereas clinical sleep staging depends on neurophysiological markers such as EEG signals. This heterogeneity makes it difficult to establish cross-modal feature alignment using the manual feature engineering on which traditional methods rely. In particular, the limited representation learning ability of logistic regression and random forest models cannot capture the complex coupling mechanism between heart rate dynamics and sleep macrostructure.

Because of the rapid development of deep learning theory, end-to-end feature learning provides a new way to handle heterogeneous representation data. Ref. [14] employed a deep learning model to realize automatic sleep staging (“Wake,” “Light Sleep,” “Deep Sleep,” and “REM”) by extracting the instantaneous heart rate from electrocardiograms (ECGs); the algorithm exhibited an accuracy of 77% and a Cohen’s kappa coefficient of 0.66 when trained and tested using the Sleep Heart Health Study dataset, confirming that predictions made by the algorithm replicated the results of previous clinical studies. Ref. [15] trained a deep learning algorithm on heart rate data to realize accurate and precise automatic sleep staging classification into two (“Wake” and “Sleep”), three (“Wake,” “NREM,” and “REM”), or four (“Wake,” “Light Sleep” (N1), “Moderate Sleep” (N2), and “REM”) levels. The accuracy of the two-level model was 87.97%, the kappa coefficient of the three-level model was 0.6025, and the sensitivity and specificity values of the four-level model were 0.3812 and 0.9744, respectively. Critically, this study demonstrated a cost-effective and non-invasive solution that can be deployed to automate sleep classification in the home environment. Furthermore, Ref. [16] proposed a sequence-to-sequence long short-term memory (LSTM) model for automatic sleep staging capable of performing three-level (“Awake,” “NREM,” and “REM”) and four-level (“Awake,” “N1,” “N2,” and “REM”) sleep staging with accuracies of 71–82% and 60–79%, respectively. This method accurately estimated deep sleep using data from wearable devices, exhibiting promise for use in clinical applications that require long-term deep-sleep monitoring.

Although sleep staging provides the core data for evaluating sleep quality and neurophysiological state, the automatic identification of sleep stages has been consistently challenged by their weak correlation with physiological signals [17,18]. Previous studies have attempted to directly input heart rate signals into deep learning models, but have overlooked two key scientific issues: first, there is a nonlinear time-varying coupling relationship between the heart rate and sleep stage [19], which is difficult to capture using traditional end-to-end modeling; second, time-varying noise such as motion artifacts and respiratory interference is common in the heart rate signals collected by wearable devices, and direct modeling based on these signals can easily lead to a spatial offset of features [20]. These shortcomings significantly restrict the practical application of wearable technologies in sleep medicine.

In light of these problems, the knowledge graph neural network (KGNN) may provide a theoretical breakthrough for capturing the implicit correlation between physiological signals and sleep stages by fusing a knowledge graph with a neural network [21]. The principle underlying the application of the KGNN model for automatic sleep staging is to construct a dynamic knowledge topology comprising the physiological signal, regulatory relationship, and sleep event that not only considers information from neighboring nodes to influence target nodes according to the message passing rules [22], but also performs nonlinear entity fitting on the graph structure to realize a model with excellent classification accuracy and robustness [23].

Typically, deep learning models developed for sleep staging employ homogenization analysis to process the time-domain features of heart rate signals [24], but the resulting staging effect is easily influenced by dynamic noise, such as motion artifacts and respiratory interference. The multiplicative LSTM (MLSTM) and scalar LSTM (SLSTM) components included in the extended LSTM (XLSTM) model provide the basic ability to address such noise problems, with the former adopting the dynamic attention mechanism constructed by matrix memory and covariance updating rules [25] and the latter capitalizing on the feature processing efficiency realized by scalar memory and mixed updating [26]. However, the ability of these components to suppress noise remains limited: the attention mechanism in the MLSTM can weaken the weight of other features [27] (asymmetric feature selection) when enhancing key features, and the high-speed updating of the SLSTM may amplify the influence of noise on the model. Therefore, this study proposed an excitation–inhibition dual-effect regulator and implemented secondary feature modulation based on the original attention mechanism to realize a model that can not only dynamically capture key heart rate characteristics, but also actively suppress the interference of time-varying noise such as motion artifacts.

A cross-scale dynamically coupled XLSTM network (SleepXLSTM) was accordingly developed in this study to achieve automatic sleep staging based on heart rate signals, as shown in Figure 1. A KGNN was applied to construct a dynamic knowledge topology, and cross-scale learning of the correlation between heart rate data and sleep stage labels (Wake, N1, N2, Deep Sleep (N3), and REM) was realized through entity and relation embeddings. Furthermore, an improved MLSTM module was applied to extract heart rate features by suppressing the interference of time-varying noise such as motion artifacts. The features obtained by the KGNN and improved MLSTM were input into the SLSTM via linear concatenation to improve feature processing efficiency. Finally, the sleep staging results were output through the fully connected layer of the XLSTM. To the best of our knowledge, this is the first study to apply heterogeneous temporal representations for sleep staging using data collected by a wearable device through a dynamic coupling strategy.

The primary contributions of this study are as follows:

(1): The cross-scale dynamically coupled SleepXLSTM network was developed to enable automatic sleep staging based on a single physiological feature (heart rate signal).
(2): The MLSTM model in the network was improved by introducing a dual-effect excitation–inhibition regulator to suppress the interference of time-varying noise, such as motion artifacts.
(3): The performance advantage of SleepXLSTM over previously proposed state-of-the-art automated sleep staging algorithms was demonstrated using heart rate data.

The remainder of this paper is organized as follows: Section 2 describes the proposed algorithm, Section 3 details the experiments performed in this study, Section 4 presents the experimental results, Section 5 analyzes the obtained results and discusses the validity of the proposed method, and concluding remarks are provided in Section 6.

2. Methods

2.1. Proposed Method

This section describes the proposed deep fusion method for identifying sleep stages based on the heterogeneous features of heart rate signals. The proposed SleepXLSTM model comprises KGNN, SleepMLSTM, and SleepSLSTM components with deep fusion and inference modules. Figure 1 illustrates the workflow of the proposed method.

The SleepXLSTM model contains two modifications to overcome the limitations of the conventional LSTM network: exponential gating and new storage structures. These modifications are realized using an SLSTM with scalar memory, scalar updating, and memory mixing, and an MLSTM with matrix memory and a covariance (outer product) update rule [25]. Note that both the SLSTM and MLSTM enhance the LSTM through exponential gating and can be extended to multiple memory cells. Notably, the combination of multiple SLSTM heads and exponential gating establishes a new memory mixing method in which memory can be mixed across cells within each head. Integrating these new LSTM variants into residual block modules yields blocks that can be residually stacked to construct the XLSTM architecture [26,27,28].

Heart rate data and the corresponding sleep labels were used as input data for the SleepXLSTM model, and entity and relation embeddings were calculated by the KGNN, which output the context vector

o \in R^{B \times h}

. Following the time step calculation by the SleepMLSTM model and the application of the improved attention mechanism within it, the output hidden state

{h_{t}}^{(l)'} \in R^{B \times h}

was fused with the context vector, and the result

[o] [{h_{t}}^{(l)'}] \in R^{B \times h}

was input to the SleepSLSTM module to obtain

{[h_{t}]}_{SleepSLSTM} \in R^{B \times h}

through feature fusion. Finally, the fully connected layers and loss function were employed to obtain the output classification value.

2.2. KGNN Learning

Knowledge graphs can represent structured relationships between entities and have accordingly become a critical subject of research in the fields of cognition and artificial intelligence. The KGNN uses a deep neural network to integrate the topological information and attribute features in graph data, then provides a more refined feature representation of nodes, allowing it to learn the attribute and structural features of entities and relations end to end [21]. This section describes how the entities and relations in the knowledge graph are embedded into a continuous vector space and how information fusion and inference are performed through the neural network structure.

The nodes and edges in the graph structure used in this study are defined as follows. For the input heart rate time-series sliding window

X \in R^{B \times T \times 1}

(where B denotes the batch size, T denotes the number of time steps in the window, and 1 denotes the number of features), we construct a bipartite graph

G = (V, E)

, where the node set

V

consists of two types of nodes: entity nodes

V_{e}

and relationship nodes

V_{r}

. The entity nodes correspond to the current heart rate value of each sample. That is, the entity-node feature of the

i

th sample is the scalar

v_{i} = X [i, T - 1] \in R

, where

T - 1

represents the last time step of the window. The relationship nodes correspond to the physiological correlation between sleep stages and heart rate and are defined as a single relationship type

r \in R^{h}

. Edge set

E

connects each entity node with the relationship node—that is,

E = \{(v_{i}, r) \mid i = 1, \dots, B\}

. Each edge

(v_{i}, r)

represents the correlation between the heart rate value

v_{i}

and sleep stage

r

. This graph structure encodes the domain knowledge, discretizing the continuous heart rate signal into a combination of entities and relationships.

Figure 2 shows the architecture of the KGNN used in this study. First, given the input data

X \in R^{B \times T \times 1}

, the KGNN maps the entities (heart rate data) in the knowledge graph to a high-dimensional vector space by defining a linear layer and setting the entity embedding operation to

Entity_embedding : W_{e} \in R^{h \times 1}, b_{e} \in R^{h}

, where

W_{e}

and

b_{e}

are initialized with the Xavier uniform distribution and h = 200 is the hidden layer size. The Entity_embedding operation preserves the basic characteristics of the entity and provides the basis for subsequent feature fusion. This operation can be expressed as follows:

e = v \times {(W_{e})}^{T} + b_{e}

(1)

where

v

is the heart rate at the last time step of the window, for which

v \in R^{B \times 1}

, and

e

is the value obtained after Entity_embedding, for which

e \in R^{B \times h}

.

Simultaneously, the KGNN maps the relation

f

in the knowledge graph to the same vector space through the embedding layer. This Relation_embedding operation ensures that the entity and relation can be directly operated in the same space and is defined as

r = E [f]

(2)

where

E

expresses the relationship, i.e., the heart rate corresponding to the sleep stage, for which

E \in R^{1 \times h}

, initialized by the

\mathcal{N} (0, 1)

normal distribution; and

r

is the value obtained after Relation_embedding, for which

r \in R^{h}

.

Next, the KGNN performs linear transformation and feature fusion of the Entity_embedding and Relation_embedding vectors through a linear layer defined as follows:

[e] [r] = [e ∥ r^{\oplus B}]

(3)

o = [e] [r] \times {(W_{l})}^{T} + b_{l}

(4)

where

[e ∥ r^{\oplus B}]

denotes the concatenation

e ∥ r

of the entity embedding

e \in R^{B \times h}

and the relation embedding

r \in R^{h}

repeated B times (one copy per sample), which yields the combined embedding

[e] [r] \in R^{B \times 2 h}

;

W_{l}

is the linear layer weight, for which

W_{l} \in R^{h \times 2 h}

,

b_{l}

is the linear layer bias term, for which

b_{l} \in R^{h}

, and the context vector

o \in R^{B \times h}

.

During forward propagation in the KGNN, the entity and relation embedding vectors are concatenated, then fused by a context vector generated using a linear layer transformation. This context vector is subsequently added to the hidden state

h_{t}

of the final SleepMLSTM layer, which affects the final prediction results. This step captures the complex interactions and implicit associations between entities and extracts and reinforces higher-order semantic information in the knowledge graph.
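The forward pass of Equations (1)–(4) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names `linear` and `kgnn_forward`, the toy sizes, and the placeholder random weights are all hypothetical, and the embedding is applied as v × W_e^T so that the shapes agree with Equation (4).

```python
import random

def linear(x, weight, bias):
    """Row-wise x @ weight^T + bias; x is B x n_in, weight is n_out x n_in."""
    return [[sum(w * xi for w, xi in zip(row_w, row_x)) + b
             for row_w, b in zip(weight, bias)]
            for row_x in x]

def kgnn_forward(window, W_e, b_e, rel, W_l, b_l):
    """window: B x T heart-rate values; returns the context vector o (B x h)."""
    v = [[row[-1]] for row in window]      # entity feature: last time step, B x 1
    e = linear(v, W_e, b_e)                # Eq. (1): entity embedding, B x h
    er = [e_row + rel for e_row in e]      # Eq. (3): concat with r repeated B times, B x 2h
    return linear(er, W_l, b_l)            # Eq. (4): context vector, B x h

# Toy sizes (the text uses h = 200); the random values are placeholders only.
rng = random.Random(0)
B, T, h = 4, 30, 3
W_e = [[rng.uniform(-1, 1)] for _ in range(h)]                        # h x 1
b_e = [0.0] * h
rel = [rng.gauss(0, 1) for _ in range(h)]                             # relation embedding r
W_l = [[rng.uniform(-1, 1) for _ in range(2 * h)] for _ in range(h)]  # h x 2h
b_l = [0.0] * h
window = [[rng.uniform(0, 1) for _ in range(T)] for _ in range(B)]
o = kgnn_forward(window, W_e, b_e, rel, W_l, b_l)
```

Because there is a single relation type, the same vector r is appended to every row, which is why the concatenated input has width 2h.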

2.3. SleepMLSTM Learning

The SleepMLSTM model employed in this study improved upon the attention mechanism included in the conventional MLSTM by introducing a novel dual-effect factor for regulating excitatory and inhibitory effects, shown in Figure 3A, into the multilayer SleepMLSTM architecture as shown in Figure 3B. Figure 3C schematically depicts the function of the dual-effect regulator for global excitatory and inhibitory effects; in this regulator, the attention mechanism is used for secondary feature extraction, and its output

\tilde{v}

is fused with the hidden state

{h_{t}}^{(l)}

to obtain the final output

{h_{t}}^{(l)'}

(an analysis of the stability and dynamic characteristics of this network can be found in Sections S2–S4).

(1) Time Step Calculation: Based on the MLSTM structure in the XLSTM [29], this study proposed an improved SleepMLSTM model fusing the MLSTM and KGNN models. First, the input weights of the layer

l

input gate, forget gate, cell state, and output gate are defined as

W_{i}, W_{f}, W_{g}, W_{o} \in R^{h \times d}

, respectively. Initialized with the Xavier uniform distribution [30], the recurrent weights of these gates are given by

R_{i}, R_{f}, R_{g}, R_{o} \in R^{h \times h}

, respectively, and the bias terms are given by

b_{i}, b_{f}, b_{g}, b_{o} \in R^{h}

, respectively, which are zero-initialized as

b \sim constant (0)

. Furthermore, the weights of the excitation and inhibition effects are defined as

w_{e} \in R^{h}

and

w_{i} \in R^{h}

, respectively, both of which are generated through the standard normal distribution.

The input, forget, cell, and output states are respectively defined as follows:

i_{t} = σ (W_{i} x_{t} + R_{i} h_{t - 1} + b_{i})

(5)

f_{t} = σ (W_{f} x_{t} + R_{f} h_{t - 1} + b_{f})

(6)

g_{t} = \tanh (W_{g} x_{t} + R_{g} h_{t - 1} + b_{g})

(7)

o_{t} = σ (W_{o} x_{t} + R_{o} h_{t - 1} + b_{o})

(8)

and the cell and hidden states are given by

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ g_{t}

(9)

h_{t} = o_{t} ⊙ \tanh (c_{t})

(10)

The improved MLSTM processing of time-series data is realized using multiple LSTM layers to transfer

h_{t}

and

g_{t}

layer by layer to introduce shortcut and direct connections in the network, allowing information to skip some layers and directly transfer to subsequent layers, thereby shortening the path of information transmission and improving network efficiency.
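A single time step of Equations (5)–(10) can be written out for one scalar unit (h = d = 1) to make the gate arithmetic concrete. The function name `lstm_step` and the parameter dictionary are hypothetical; this is a sketch of the standard gated update, not the authors' multilayer code.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of Eqs. (5)-(10) for a single scalar unit."""
    i_t = sigmoid(p["W_i"] * x_t + p["R_i"] * h_prev + p["b_i"])    # input gate, Eq. (5)
    f_t = sigmoid(p["W_f"] * x_t + p["R_f"] * h_prev + p["b_f"])    # forget gate, Eq. (6)
    g_t = math.tanh(p["W_g"] * x_t + p["R_g"] * h_prev + p["b_g"])  # candidate input, Eq. (7)
    o_t = sigmoid(p["W_o"] * x_t + p["R_o"] * h_prev + p["b_o"])    # output gate, Eq. (8)
    c_t = f_t * c_prev + i_t * g_t                                  # cell state, Eq. (9)
    h_t = o_t * math.tanh(c_t)                                      # hidden state, Eq. (10)
    return h_t, c_t

# All-zero parameters make every sigmoid gate 0.5 and the candidate 0.
names = ("W_i", "R_i", "b_i", "W_f", "R_f", "b_f",
         "W_g", "R_g", "b_g", "W_o", "R_o", "b_o")
params = {name: 0.0 for name in names}
h_t, c_t = lstm_step(x_t=0.7, h_prev=0.0, c_prev=1.0, p=params)
```

With all parameters at zero the forget gate halves the previous cell state, so `c_t` is 0.5 and `h_t` is 0.5 · tanh(0.5).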

(2) Attention Mechanism: The hidden state of layer

l

is defined as

{h_{T}}^{(l)}

; the query

q^{(l)}

, key

k^{(l)}

, and value

v^{(l)}

vector weights are defined as

W_{q} \in R^{h \times h}

,

W_{k} \in R^{h \times h}

, and

W_{v} \in R^{h \times h}

, respectively; and the excitatory and inhibitory effect weight matrices are defined as

\{W_{e}^{(l)}\}_{l = 1}^{8} \in R^{h \times 1}

and

\{W_{i}^{(l)}\}_{l = 1}^{8} \in R^{h \times 1}

, respectively. The original formula for calculating attention is given in Equations (11)–(13), and the excitation–inhibition regulatory factor is calculated as shown in Equations (14) and (15). That is, we replace the traditional attention with excitatory–inhibitory attention

Attention (q, k, v, W_{e}, W_{i})

.

The

q^{(l)} \in R^{B \times h}

,

k^{(l)} \in R^{B \times h}

and

v^{(l)} \in R^{B \times h}

vectors are computed as

q^{(l)} = {h_{T}}^{(l)} \times {(W_{q})}^{T}

(11)

k^{(l)} = {h_{T}}^{(l)} \times {(W_{k})}^{T}

(12)

v^{(l)} = {h_{T}}^{(l)} \times {(W_{v})}^{T}

(13)

The excitatory

e^{(l)} \in R^{B \times 1}

and inhibitory

i^{(l)} \in R^{B \times 1}

effect weights are calculated as

e^{(l)} = q^{(l)} \times {{W_{e}}^{(l)}}

(14)

i^{(l)} = k^{(l)} \times {{W_{i}}^{(l)}}

(15)

The attention values

\partial \in R^{B \times 1}

are calculated as

\partial = softmax (\frac{e^{(l)} - i^{(l)}}{\sqrt{h}})

(16)

The context vector

\tilde{v} \in R^{B \times h}

is computed for the time step corresponding

{\tilde{v}}^{'} \in R^{B \times T \times h}

as follows:

\tilde{v} = \partial ⊙ v^{(l)}

(17)

Finally, the hidden state

{h_{t}}^{(l)}

and the attended value

\tilde{v}

are combined through a residual connection to give

{h_{t}}^{(l)'} \in R^{B \times T \times h}

as follows:

{h_{t}}^{(l)'} \leftarrow {h_{t}}^{(l)} + {\tilde{v}}^{'}

(18)

The original MLSTM attention mechanism enhances the expressive ability of the XLSTM model [31]. The improved MLSTM in this study first calculates the

q^{(l)}

,

k^{(l)}

and

v^{(l)}

vectors through the attention mechanism, then calculates the excitatory and inhibitory scores

e^{(l)}

and

i^{(l)}

from the query and key using the excitation and inhibition weights, respectively, and obtains the attention weight from their scaled difference. These weights are subsequently applied to the output

{h_{t}}^{(l)}

of the MLSTM layer to enhance or suppress the expression of specific features.
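The excitation–inhibition scoring of Equations (14)–(17) can be sketched as follows. The function name `ei_attention` is hypothetical, and one point is an explicit assumption: the text does not state the axis of the softmax in Equation (16), so the sketch normalizes over the batch axis (a softmax over the size-1 feature axis would return all ones).

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ei_attention(q, k, v, w_e, w_i):
    """q, k, v: B x h lists; w_e, w_i: length-h weight vectors. Returns v_tilde."""
    h = len(w_e)
    e = [dot(row, w_e) for row in q]      # Eq. (14): excitatory score per sample
    i = [dot(row, w_i) for row in k]      # Eq. (15): inhibitory score per sample
    scores = [(e_b - i_b) / math.sqrt(h) for e_b, i_b in zip(e, i)]
    # ASSUMPTION: softmax over the batch axis; the source does not state the axis.
    attn = softmax(scores)                # Eq. (16)
    return [[a * x for x in row] for a, row in zip(attn, v)]   # Eq. (17)

# Equal excitation and inhibition cancel, so both samples get weight 0.5.
q = [[1.0, 0.0], [0.0, 1.0]]
v_tilde = ei_attention(q, q, [[2.0, 4.0], [6.0, 8.0]], [0.3, 0.3], [0.3, 0.3])
```

The subtraction e − i is what lets the regulator suppress a feature: a large inhibitory score lowers the weight that a purely excitatory attention would have assigned.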

2.4. SleepSLSTM Learning

Similarly to the conventional LSTM, the SLSTM can contain multiple memory cells, which enable memory mixing via the recurrent connections

R_{z}

,

R_{i}

,

R_{f}

, and

R_{o}

from the hidden state vector

h

to the memory cell input

z

and the gates

i

,

f

, and

o

, respectively. One novel aspect of memory mixing in the SLSTM is the effect of exponential gating. The SLSTM incorporates memory mixing within each of multiple heads without crossing between heads; together with exponential gating, this establishes a new method for memory mixing [32]. The SleepXLSTM model evaluated in this study passed the SLSTM output through a fully connected layer, a dropout layer, and then a second fully connected layer to output the final sleep staging sequence. The architecture of SleepSLSTM is shown in Figure 4.

(1) Calculation of Gating: The input weights of the middle gate, defined as

W_{z} \in R^{h \times d}

, are initialized with the Xavier uniform distribution, and the bias term

b_{z} \in R^{h}

is generated by the zero-initialization

b \sim constant (0)

.

The input state

i_{t}

, forget state

f_{t}

, and intermediate state

z_{t}

are respectively defined as follows:

i_{t} = \exp (W_{i} x_{t} + R_{i} h_{t - 1} + b_{i})

(19)

f_{t} = σ (W_{f} x_{t} + R_{f} h_{t - 1} + b_{f})

(20)

z_{t} = \tanh (W_{z} x_{t} + R_{z} h_{t - 1} + b_{z})

(21)

(2) Stabilization of State Transitions: The stabilizer state

m_{t} \in R^{B \times h}

is used to compute the stabilized gates

{\tilde{i}}_{t}

and

{\tilde{f}}_{t}

, which in turn update the cell state

c_{t}

, the normalizer state

n_{t} \in R^{B \times h}

, and the hidden state output

h_{t}

as follows:

m_{t} = \max (\log f_{t} + m_{t - 1}, \log i_{t})

(22)

{\tilde{i}}_{t} = \exp (\log i_{t} - m_{t})

(23)

{\tilde{f}}_{t} = \exp (\log f_{t} + m_{t - 1} - m_{t})

(24)

c_{t} = {\tilde{f}}_{t} ⊙ c_{t - 1} + {\tilde{i}}_{t} ⊙ z_{t}

(25)

n_{t} = {\tilde{f}}_{t} ⊙ n_{t - 1} + {\tilde{i}}_{t}

(26)

h_{t} = σ (o_{t}) ⊙ \tanh (c_{t} / n_{t})

(27)

The stability factor

m_{t}

converts the multiplication operation into addition by logarithmic transformation to avoid numerical underflow owing to the continuous gating operation, and the normalized state

n_{t}

is used to eliminate the amplitude drift of the cell state

c_{t}

. The output of the SLSTM in this study is the hidden state

h_{t}

.
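The effect of the stabilizer state is easiest to see in a scalar example that works in log space. The sketch below is illustrative only: the function name `slstm_stabilized_update` is hypothetical, the gate pre-activations are passed in directly, and the normalizer update adds the stabilized input gate, as in the xLSTM formulation.

```python
import math

def log_sigmoid(a):
    # Numerically stable log(sigmoid(a)) for either sign of a.
    return -math.log1p(math.exp(-a)) if a >= 0 else a - math.log1p(math.exp(a))

def slstm_stabilized_update(i_pre, f_pre, z_pre, c_prev, n_prev, m_prev):
    """Scalar sketch of Eqs. (19)-(26); *_pre are pre-activations W x + R h + b."""
    log_i = i_pre                          # i_t = exp(i_pre), so log i_t = i_pre
    log_f = log_sigmoid(f_pre)             # f_t = sigmoid(f_pre)
    z_t = math.tanh(z_pre)                 # Eq. (21)
    m_t = max(log_f + m_prev, log_i)       # Eq. (22)
    i_s = math.exp(log_i - m_t)            # Eq. (23): stabilized input gate, <= 1
    f_s = math.exp(log_f + m_prev - m_t)   # Eq. (24): stabilized forget gate, <= 1
    c_t = f_s * c_prev + i_s * z_t         # Eq. (25)
    n_t = f_s * n_prev + i_s               # normalizer update (xLSTM form, an assumption here)
    return c_t, n_t, m_t

# A pre-activation of 1000 would overflow if exponentiated directly.
c_t, n_t, m_t = slstm_stabilized_update(i_pre=1000.0, f_pre=0.0, z_pre=0.5,
                                        c_prev=0.0, n_prev=0.0, m_prev=0.0)
```

Calling `math.exp(1000.0)` directly raises `OverflowError`, whereas the stabilized gates never exceed 1 because the maximum in Equation (22) is subtracted before exponentiation; the ratio c_t / n_t is unchanged by this rescaling.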

2.5. Deep Fusion and Inference

The improved MLSTM output

{h_{t}}^{(l)} = H_{L} [:, - 1, :] \in R^{B \times h}

, where

L

is the number of improved MLSTM layers, is combined with the KGNN output

o = KGNN (e, r) \in R^{B \times h}

through feature fusion to obtain

[{h_{t}}^{(l)}] [o] = [h_{(KGNN + MLSTM)}] \in R^{B \times h}

, which is then fused with the SLSTM features

[h_{SLSTM}]

to obtain

h_{concat} \in R^{B \times 2 h}

. These fused features are input into the fully connected layer of SleepXLSTM, the loss function is optimized with gradient clipping, and the automatic sleep staging results are output.

(1) Fully Connected Layer: The classification weight matrix of the first fully connected layer is defined as

W_{1} \in R^{256 \times 2 h}

, which is generated through the Kaiming normal distribution

W \sim \mathcal{N} (0, \sqrt{2 / n_{in}})

, where

n_{in}

is the input dimension (fan-in) of the weight matrix and the bias term is

b_{1} \in R^{256}

.

Given the input

h_{concat} = [h_{(KGNN + MLSTM)}] [h_{SLSTM}]

, for which

h_{concat} \in R^{B \times 2 h}

, the output of the first fully connected layer

h_{1} \in R^{B \times 256}

is given by

h_{1} = ReLU (h_{concat} \times {W_{1}}^{T} + b_{1})

(28)

To prevent overfitting,

h_{1}

is expressed as

{h_{1}}^{'} \in R^{B \times 256}

[33] after the dropout layer, where

p = 0.6

, as follows:

{h_{1}}^{'} = h_{1} ⊙ Bernoulli (p)

(29)

Bernoulli (p) = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}

(30)

The classification weight matrix of the second fully connected layer is defined as

W_{2} \in R^{5 \times 256}

, which is initialized with the Xavier uniform distribution, and its output bias term is

b_{2} \in R^{5}

; the second fully connected layer is defined as

o = {h_{1}}^{'} \times {W_{2}}^{T} + b_{2}

(31)

and the output probability value

p

is given by

p = Softmax (o)

(32)

where

o \in R^{B \times 5}

.

(2) Loss Function and Optimization: The weighted cross-entropy loss function [34] is used to alleviate the long-tail distribution bias of the sleep stage data by assigning each class a weight equal to the inverse square root of its sample size. Furthermore, the AdamW optimizer is employed to decouple weight decay from the gradient update, thereby enhancing the generalization performance of the model [35]. Finally, global gradient-norm clipping (threshold

γ = 1.0

) is applied to suppress gradient explosion in the deep recurrent architecture and improve the adaptability of the model to sleep staging tasks while ensuring convergence [36].

The weighted cross-entropy loss is defined as

Γ = - \frac{1}{B} \sum_{i = 1}^{B} W_{c} \log p_{i, c}

(33)

where

W_{c} = 1 / \sqrt{N_{c}}

, in which

N_{c}

is the number of training samples in class c, and

p_{i, c}

is the predicted probability that sample i belongs to its true class, in which

i \in c

(

c

denotes the sleep stage).
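The class weighting and the loss of Equation (33) can be reproduced with the standard library. The helper names `class_weights` and `weighted_cross_entropy` are hypothetical and the label counts are toy values; the sketch divides by the batch size B exactly as the equation does.

```python
import math
from collections import Counter

def class_weights(train_labels):
    """W_c = 1 / sqrt(N_c), with N_c the number of training samples in class c."""
    counts = Counter(train_labels)
    return {c: 1.0 / math.sqrt(n) for c, n in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Eq. (33): probs is B x C (softmax outputs); labels holds the true class per sample."""
    total = sum(weights[y] * math.log(p_row[y]) for p_row, y in zip(probs, labels))
    return -total / len(labels)

# Toy label counts: class 0 appears four times, class 1 once.
weights = class_weights([0, 0, 0, 0, 1])
loss = weighted_cross_entropy([[0.5, 0.5], [0.5, 0.5]], [0, 1], weights)
```

Here the rare class receives weight 1.0 and the frequent class 0.5, so an error on the rare class contributes twice as much to the loss.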

The AdamW optimizer is defined as

θ_{t} = θ_{t - 1} - η ({\hat{m}}_{t} / (\sqrt{{\hat{v}}_{t}} + ε) + λ θ_{t - 1})

(34)

{\hat{m}}_{t} = m_{t} / (1 - {β_{1}}^{t})

(35)

m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}

(36)

{\hat{v}}_{t} = v_{t} / (1 - {β_{2}}^{t})

(37)

v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) {g_{t}}^{2}

(38)

where the weight decay coefficient

λ = 1 \times 10^{- 3}

, momentum values

β_{1} = 0.9

and

β_{2} = 0.999

, learning rate

η = 3 \times 10^{- 4}

, and stability coefficient

ε = 1 \times 10^{- 8}

.
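A single AdamW update for one scalar parameter, using the hyperparameter values listed above, might look as follows. The function name `adamw_step` is hypothetical, and the stability coefficient is added outside the square root, which is the common AdamW formulation and an assumption of this sketch.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-3):
    """One scalar AdamW update following Eqs. (34)-(38); t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first moment, Eq. (36)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment, Eq. (38)
    m_hat = m / (1 - beta1 ** t)                 # bias correction, Eq. (35)
    v_hat = v / (1 - beta2 ** t)                 # bias correction, Eq. (37)
    # ASSUMPTION: eps sits outside the square root, as in the usual AdamW form.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected moments both equal the gradient magnitude, so the adaptive term is approximately 1 and the parameter moves by roughly η(1 + λ).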

Finally, the gradient is clipped such that when the gradient norm

{‖g‖}_{2} > 1.0

, it is rescaled as

g \leftarrow 1.0 \times g / {‖g‖}_{2}

, which guarantees

{‖g‖}_{2} \leq 1.0

.
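The clipping rule can be checked on a small gradient vector. The function name `clip_by_global_norm` is hypothetical; the sketch treats the gradient as one flat list, which is a simplification of a model with many parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)                 # already within the threshold: unchanged

clipped = clip_by_global_norm([3.0, 4.0])      # norm 5 is rescaled to norm 1
unchanged = clip_by_global_norm([0.3, 0.4])    # norm 0.5 is left as is
```

Clipping preserves the direction of the gradient and only shrinks its length, so the update still points the same way.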

3. Experimental Evaluation

3.1. Dataset and Feature Preparation

The public dataset used in this study contained labeled sleep stages (Wake, N1, N2, N3, and REM) based on PSG data as well as acceleration (in g) and heart rate (in bpm, measured via photoplethysmography) data from 31 subjects collected between June 2017 and March 2019 at the University of Michigan [37]. The subjects wore an Apple Watch to collect their dynamic activity patterns (number of steps) for at least a week (7–14 d) before spending a night in a sleep lab, where they slept for 8 h while acceleration (g) and heart rate (bpm) were collected by the Apple Watch and PSG data were collected by laboratory equipment. Each type of data recorded by the Apple Watch was paired with the labeled sleep data based on the PSG results. For a complete description of these methods, see the database description [12]. The normal heart rates for each subject and corresponding sleep stages were used to train the SleepXLSTM model in this study. To further verify the performance of the algorithm and strengthen the results, a second public dataset called ISRUC-Sleep [38] was used. The data were collected from adults, including healthy subjects, subjects with sleep disorders, and subjects affected by sleeping pills. Each data point was randomly selected from the PSG recordings obtained at the Sleep Medicine Center of the University of Coimbra Hospital, Coimbra, Portugal. This dataset consists of data from 100 subjects, each of whom underwent a session during which the heart rate and corresponding sleep stages were recorded. The sleep stage annotations were made by two human experts using visual scoring.

Given the heart rate signal sequence

X = \{x_{1}, x_{2}, \dots, x_{T}\}

, where

x_{t} \in R

represents the heart rate value at time

t

, the goal for this task was to predict the corresponding sleep stage sequence

Y = \{y_{1}, y_{2}, \dots, y_{T}\}

, where

y_{t} \in \{0, 1, \dots, C - 1\}

indexes one of the

C

sleep stage categories. To prevent leakage along the time dimension, a gap-based sequence sampling strategy was adopted; that is, the input sequence was

X^{(i)} = [x_{t_{0}}, x_{t_{0} + 1}, \dots, x_{t_{0} + n - 1}]

, the target label was

y^{(i)} = y_{t_{0} + n}

, and the sampling index was

t_{0} = k \cdot g, k = 0, 1, 2, \dots, ⌊\frac{T - n}{g}⌋

. Here,

n = 30

was the sequence length and

g = 10

was the gap step size.
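The gap-based sampling described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `gap_windows` is hypothetical, indices are zero-based, and the loop stops at the last start index for which the target label at position t0 + n still exists, which is an assumption about how the sequence boundary is handled.

```python
from typing import List, Sequence, Tuple


def gap_windows(
    x: Sequence[float], y: Sequence[int], n: int = 30, g: int = 10
) -> List[Tuple[List[float], int]]:
    """Return (window, target) pairs for start indices t0 = 0, g, 2g, ...

    Each window holds n consecutive heart rate values and the target is the
    sleep stage label of the sample that immediately follows the window.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    samples = []
    # Assumption: t0 + n must be a valid index, so t0 < len(x) - n.
    for t0 in range(0, len(x) - n, g):
        samples.append((list(x[t0:t0 + n]), y[t0 + n]))
    return samples
```

For a recording of 100 samples with n = 30 and g = 10, this yields seven windows starting at 0, 10, ..., 60.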

A min–max linear transformation was applied to eliminate dimensional differences in the heart rate data; the scaling bounds were taken from the training set only to avoid leaking test set information [30]. The original heart rate series

X \in ℝ^{N}

was normalized as follows:

X_{norm}^{(i)} = \frac{X^{(i)} - X_{\min}^{train}}{X_{\max}^{train} - X_{\min}^{train}}

(39)

where

X_{\max}^{train}

and

X_{\min}^{train}

are the extreme values in the training set.
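Equation (39) can be illustrated with a short sketch. The helper names `fit_min_max` and `apply_min_max` are hypothetical; the only point shown is that the bounds are computed on the training data and then reused unchanged for the other splits.

```python
from typing import List, Sequence, Tuple


def fit_min_max(train: Sequence[float]) -> Tuple[float, float]:
    """Return the (min, max) bounds of the training heart rate values."""
    lo, hi = min(train), max(train)
    if hi == lo:
        raise ValueError("training data are constant; scaling is undefined")
    return lo, hi


def apply_min_max(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Scale values with bounds that were fitted on the training set."""
    return [(v - lo) / (hi - lo) for v in values]
```

Because the bounds come from the training set, validation or test values outside the training range map outside [0, 1]; no clipping is applied in this sketch.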

The sleep stages in this study were ordinally coded according to the sleep staging rules defined by the American Academy of Sleep Medicine [39] as follows:

C = \{C_{1}, C_{2}, \dots, C_{K}\}, \quad C_{j} \in \{\mathrm{Wake} = 0, \mathrm{N1} = 1, \mathrm{N2} = 2, \mathrm{N3} = 3, \mathrm{REM} = 5\}

and an injective map of sleep stages was established through LabelEncoder [40] as follows:

ϕ : C \to ℤ_{\geq 0}, \quad ϕ (C_{j}) = j - 1, \quad j = 1, 2, \dots, K

(40)

After encoding, the labels were given by

y^{(i)} \in {0, 1, \dots, K - 1}

and their ordinal relations were preserved.
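The effect of the encoding in Equation (40) can be reproduced without any library. The sketch below mimics the sorted-unique behavior of scikit-learn's `LabelEncoder` in plain Python; the helper name `fit_label_map` is hypothetical.

```python
from typing import Dict, Iterable, List

# Raw stage codes as listed above (REM is coded 5, so the raw codes have a gap).
RAW_STAGE_CODES = {"Wake": 0, "N1": 1, "N2": 2, "N3": 3, "REM": 5}


def fit_label_map(raw_labels: Iterable[int]) -> Dict[int, int]:
    """Map each distinct raw code to its rank among the sorted codes."""
    classes = sorted(set(raw_labels))
    return {code: j for j, code in enumerate(classes)}


def encode(raw_labels: Iterable[int], label_map: Dict[int, int]) -> List[int]:
    """Apply a fitted label map to a sequence of raw codes."""
    return [label_map[code] for code in raw_labels]
```

With all five stages present, the raw codes 0, 1, 2, 3, 5 become 0, 1, 2, 3, 4, so the order of the codes is kept while the gap at 4 is removed.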

A hierarchical multi-stage strategy was used for data partitioning. First, at the subject level, we randomly divided all subjects into a training set (80%), validation set (10%), and test set (10%), ensuring that subjects in the three sets were mutually exclusive. Next, within the training set, we used a Bernoulli sampling process to sample the time series data of each subject independently with sampling probability

p = 0.2

.

For category

C_{j}

and training set subject

s_{i}

, the Bernoulli sampling process is expressed as

D_{train\_val}^{(j, s)} \sim \mathrm{Bernoulli}(p; N_{j, s}),

(41)

where

N_{j, s}

is the number of samples of subject

s_{i}

in category

C_{j}

. The sampled data were used for model monitoring and the early stopping strategy during training, whereas the unsampled data were used for parameter updating (for details, see Section S1).
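The two partitioning steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the use of Python's `random` module, the rounding of split sizes, and the seeds are all hypothetical. Drawing an independent Bernoulli variable per sample is used here as one way to obtain a Bernoulli subsample within every subject and category.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def split_subjects(
    subject_ids: Iterable[int], seed: int = 0,
    train_frac: float = 0.8, val_frac: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle subjects and cut them into disjoint train/val/test groups."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def bernoulli_split(
    samples: Sequence, p: float = 0.2, seed: int = 0
) -> Tuple[list, list]:
    """Send each sample to the monitoring subset with probability p."""
    rng = random.Random(seed)
    monitor, update = [], []
    for sample in samples:
        (monitor if rng.random() < p else update).append(sample)
    return monitor, update
```

With 31 subjects and the rounding used here, the groups contain 25, 3, and 3 subjects; the monitoring subset holds about 20% of each training subject's samples on average.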

3.2. Ablation Experiment Design

Four model configurations (the full model and three ablation models) were designed to systematically evaluate the synergistic effects and independent contributions of the key components of the SleepXLSTM architecture. The full model comprised the complete SleepXLSTM architecture, integrating the KGNN, improved MLSTM, and SLSTM to realize joint modeling of spatiotemporal features and domain knowledge through cascade feature fusion. Ablation Model 1 removed the KGNN module and retained the improved MLSTM and SLSTM to quantify the effect of domain knowledge injection on classification performance. Ablation Model 2 removed the improved MLSTM module while retaining the KGNN and SLSTM to verify the contribution of the graph structural attention mechanism to temporal dependence modeling. Ablation Model 3 removed the SLSTM module while retaining the KGNN and improved MLSTM to evaluate the optimization effect of the stable memory unit on gradient propagation.

The control variable method was applied with consistent hyperparameters (

h = 200

,

l = 8

, and

B = 256

) as well as optimization (AdamW,

η = 3 \times 10^{- 4}

, and

λ = 1 \times 10^{- 3}

) and data division (stratified sampling to assemble a test set accounting for

20 %

of total data) strategies to ensure that any observed differences in performance resulted from architectural changes. The accuracy, macro-F1 score, and class-wise precision and recall were used to evaluate the performance of each ablation model. These experiments were conducted to reveal: (1) the boundary effect of the KGNN in introducing prior knowledge, (2) the functional complementarity of the improved MLSTM and SLSTM in time-series modeling, and (3) the performance gain of the full model compared with the simple superposition of its components.

3.3. Implementation

The experimental platform employed an NVIDIA RTX-3060 GPU with 12 GB VRAM in an Intel Core i5-12490F machine with 32 GB DDR4 and used CUDA 11.8 to accelerate computing.

During the data preprocessing stage, the heart rate time-series data were normalized using min–max processing, and the sleep stage labels were coded using LabelEncoder. The datasets were split into training and testing sets at a 4:1 ratio using stratified sampling to ensure consistent category distribution. The model was trained using an end-to-end optimization strategy employing the AdamW optimizer and a gradient clipping threshold of

τ = 1.0

. The class-weighted cross-entropy loss function was employed, and the weight was calculated dynamically using

W_{c} = 1 / \sqrt{N_{c}}

. Learning rate scheduling adopted the ReduceLROnPlateau strategy based on the accuracy on the validation set (patience = 2, decay factor = 0.5) [41]. The maximum number of training epochs was set to

Epoch_{\max} = 200

and the batch size was fixed at

B = 256

to ensure optimal memory utilization (peak VRAM occupancy of 15.9 GB). Model parameters were initialized from a normal distribution [42] (in the SLSTM and improved MLSTM layers), with orthogonal initialization for the recurrent weights [43].
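Two of the training details above, the class weights and the gradient clipping threshold, reduce to short formulas. The sketch below is library-free and illustrative only: the function names are hypothetical, and clipping is applied to a flat list of gradient values instead of framework tensors.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def class_weights(labels: Iterable[int]) -> Dict[int, float]:
    """Compute W_c = 1 / sqrt(N_c) from the label counts N_c."""
    counts = Counter(labels)
    return {c: 1.0 / math.sqrt(n) for c, n in counts.items()}


def clip_by_norm(grad: Sequence[float], tau: float = 1.0) -> List[float]:
    """Rescale grad so that its L2 norm does not exceed tau."""
    norm = math.sqrt(sum(v * v for v in grad))
    if norm <= tau:
        return list(grad)  # already within the threshold, left unchanged
    return [tau * v / norm for v in grad]
```

The square root softens the weighting: a class with 16 times as many samples receives a weight that is 4 times smaller, not 16 times smaller.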

Ablation Models 1, 2, and 3 were implemented using a modular algorithm structure to ensure that all hyperparameters except those changed to realize the target architecture were consistent.

3.4. Evaluation Metrics

This study employed a stratified hold-out validation strategy to ensure the rigor and reproducibility of the assessment. This strategy divides the data at the subject level, allocating 80% of the subjects to training and dividing the remaining 20% evenly between the validation and test sets, thereby preventing data leakage between individuals. This design is based on two key considerations. First, time series data exhibit significant autocorrelation, so traditional cross-validation produces overlapping time windows and a biased evaluation. Second, hold-out validation is closer to practical clinical application scenarios: the performance of the model on an independent group of subjects reflects its generalization ability and provides an unbiased estimate of its performance.

The global accuracy (ACC) [44], class-weighted F1 score [45] and normalized confusion entropy (NCE) [46] were used as multilevel quantitative metrics to evaluate the classification performance of each ablation model. The ACC was defined as follows:

ACC = \frac{1}{N_{test}} \sum_{i = 1}^{N_{test}} ψ ({\hat{y}}_{i} = y_{i})

(42)

where

ψ ({\hat{y}}_{i} = y_{i})

is the indicator function and

{\hat{y}}_{i}

and

y_{i}

are the predicted and true labels, respectively. The ACC reflects the overall classification accuracy but is sensitive to class imbalance.

The F1 score was calculated by

F_{1}^{weighted} = \sum_{c = 1}^{K} w_{c} \times \frac{2 \times P_{c} \times R_{c}}{P_{c} + R_{c}}

(43)

w_{c} = N_{c} / N_{test}

(44)

where

P_{c} = TP_{c} / (TP_{c} + FP_{c})

,

R_{c} = TP_{c} / (TP_{c} + FN_{c})

, and

N_{c}

denotes the number of test samples of class

c

. Note that the weighting strategy employed in this study assigned higher weights to the majority classes, which is suitable for the long-tailed distribution characteristics of clinical data.

Finally, the NCE was determined as follows:

NCE = - \frac{1}{\log K} \sum_{i = 1}^{K} \sum_{j = 1}^{K} \frac{C_{i j}}{N_{c}^{(i)}} \log \frac{C_{i j}}{N_{c}^{(i)}}

(45)

where

C_{i j}

represents the number of samples for which the true class

i

was predicted to be

j

and the NCE quantifies the degree of confusion in the predictions of the model (

NCE \in [0, 1]

, where 0 indicates no confusion).

These indices comprise an orthogonal evaluation space incorporating two dimensions (classification accuracy and robustness) and satisfy scale invariance through the normalization of the NCE. All calculations were based on nonparametric estimations to avoid distribution assumption bias.
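The three metrics in Equations (42)–(45) can all be computed from a confusion matrix, as in the sketch below. The function names are hypothetical, labels are assumed to be integers from 0 to K − 1, and empty denominators are treated as zero. The entropy term is transcribed literally from Equation (45), with each row normalized by the number of samples of the true class; no additional averaging over rows is assumed.

```python
import math
from typing import List, Sequence

Matrix = List[List[int]]


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int], k: int) -> Matrix:
    """cm[i][j] counts samples of true class i predicted as class j."""
    cm = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm: Matrix) -> float:
    """Equation (42): fraction of samples on the diagonal."""
    total = sum(map(sum, cm))
    return sum(cm[i][i] for i in range(len(cm))) / total


def weighted_f1(cm: Matrix) -> float:
    """Equations (43)-(44): per-class F1 weighted by class frequency."""
    k, total, score = len(cm), sum(map(sum, cm)), 0.0
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[i][c] for i in range(k)) - tp
        fn = sum(cm[c]) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (sum(cm[c]) / total) * f1
    return score


def confusion_entropy(cm: Matrix) -> float:
    """Equation (45) as printed: row-wise entropies summed, divided by log K."""
    h = 0.0
    for row in cm:
        n = sum(row)
        for count in row:
            if count:
                q = count / n
                h -= q * math.log(q)
    return h / math.log(len(cm))
```

A perfectly diagonal confusion matrix gives an entropy of 0, which matches the "no confusion" end of the scale described above.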

4. Results

4.1. SleepXLSTM Performance Analysis (Using the First Dataset)

The performance of the SleepXLSTM model on the multi-class task was systematically evaluated through NCE matrix and training dynamics analyses. These analyses examined how the responses of the model differed across sleep stages and parameter values and, based on the decision boundary stability, feature space interpretability, and training convergence characteristics, explored the potential for model architecture optimization and training strategy adjustment.

The raw heart rate, min–max normalized heart rate [30], true sleep label, and predicted sleep label are shown in Figure 5, the corresponding real and predicted sleep stage NCE matrix is shown in Figure 6, and the classification evaluation is shown in Figure 7. The recognition ability of the SleepXLSTM model differed significantly between the sleep stages. It was highest for the N2 and REM stages, with recall rates reaching 92.65% and 93.58%, respectively, and F1 scores exceeding 0.92; this success was related to the large sample sizes for these two stages (4600 and 2400, respectively). In contrast, the recall rates for the Wake and N1 stages were relatively low at 86.09% and 87.75%, respectively, and there was a slight imbalance between the precision and recall rates for N1 (89.03% vs. 87.75%, respectively), indicating that the model could be further optimized for feature capture during the end stage of sleep. Notably, although the recall rate for N3 reached 88.52%, its precision was only 81.57%, indicating that the model had a tendency to misjudge other stages as deep sleep; this may be related to the limited ability of the model to capture low-frequency features.

The analysis of training dynamics presented in Figure 8 reveals the optimization process employed by the model. As the number of training epochs increased, the cross-entropy loss decreased from an initial value of 1.56 to 0.03 and ACC reached 91.25%, indicating that the network exhibited a strong representation learning ability.

This evaluation indicates that SleepXLSTM exhibited suitable baseline performance in sleep stage identification. The strong ability of the model to recognize major sleep stages (REM and N2) confirms the effectiveness of the deep temporal model, whereas the suboptimal recognition performance for N3 and N1 may indicate that the feature extraction does not sufficiently represent subtle changes in the physiological signals. In future work, two approaches could be adopted to improve performance. (1) Multi-scale feature extraction and local attention mechanisms: multi-resolution convolutional modules could be incorporated into the existing time series modeling to simultaneously extract waveform and rhythm information at different time scales, enhancing the sensitivity of the model to deep sleep features and its ability to discriminate them. (2) Segmented weighted loss functions: the samples of the N3 and N1 stages are few in number and easily confused with adjacent stages, which may mean that the model is unable to sufficiently learn their features during training. We plan to adopt a segmented weighted loss function that assigns higher loss weights to categories N1 and N3, thereby improving the ability of the model to recognize sleep transition stages.

4.2. SleepXLSTM Performance Analysis (Using the Second Dataset)

This study used the second dataset (ISRUC-Sleep) to verify the performance of the algorithm. A confusion matrix can reveal the key patterns of the model in the classification of sleep stages, and the original matrix is shown in Figure 9. Stage 5 (R) has the fewest samples of all the stages, which results in a lower classification probability (0.86) in the normalized confusion matrix. The classification performance for stage 0 (Wake), stage 1 (N1), and stage 2 (N2) is higher, with diagonal values of 0.86, 0.94 and 0.91, respectively, indicating that the model can reliably distinguish between the awake and deep sleep stages. However, there were a small number of misclassifications (probability: 0.08) from stage 3 (N3) to stage 2 (N2), suggesting that the model was not sufficiently sensitive to the boundary features of slow-wave sleep (SWS).

The performance metrics are shown in Figure 10 to further validate the performance of the model. The precision, recall, and F1 score values confirm the classification reliability for stage 0 and stage 2, which is consistent with the high diagonal values in the confusion matrix. The metrics for other stages, such as stages 1 and 5, showed moderate performance, but the low accuracy for stage 1 (72.2%) implies that N1 sleep classification may be affected by noise or sample imbalance. These metrics collectively demonstrate the ability of the SleepXLSTM model to discriminate sleep stages.

4.3. Ablation Experiment Comparison (Using the First Dataset)

The results of the ablation experiments revealed the different contributions of the SleepXLSTM components to the full model performance as well as their synergistic mechanisms. The ablation models exhibited significantly lower ACC values than the full model (91.25%), with Model 3 (SLSTM removed) exhibiting the lowest ACC of 46.00%, demonstrating the critical role of the SLSTM in ensuring gradient stability. Indeed, the NCE matrices in Figure 11 for this model indicate that serious diffuse errors were present in the prediction results for the Awake, N1 and REM stages (e.g., the misclassification rate of the Awake stage as N2 was 46.70%), suggesting that the lack of an SLSTM prevented the model from propagating normal gradients, thereby weakening its ability to discriminate the boundaries between stages. Furthermore, the F1 score for the full model was 0.923, whereas that for Model 2 (improved MLSTM removed) was 0.787, that for Model 1 (KGNN removed) was 0.781, and that for Model 3 was only 0.358. Thus, the dual necessity of including the improved MLSTM to model long-range dependence and the KGNN to characterize the structure of feature space was confirmed.

The comparison of training dynamics in Figure 12 indicates that the removal of components directly affected the optimization trajectory of the model. When the improved MLSTM was absent, the loss increased from 0.03 for the full model to 0.309 for Model 2 and the learning rate decreased from 2.34 × 10⁻⁶ to 1.63 × 10⁻²³, revealing a loss of deep time-series modeling ability that forced the network to rely on extremely low learning rates to maintain parameter stability. A gentle learning rate decay (7.15 × 10⁻¹¹) was observed when the KGNN was absent in Model 1, but the model triggered the early stopping mechanism at epoch #45 and the ACC stalled at 80%, indicating that the removal of the KGNN destroyed the interpretability of feature embedding and caused the model to fall into local extremes. Finally, note that the training of Model 3 failed completely, with the cross-entropy loss decreasing only from 1.57 to 1.38, indicating abnormal gradient propagation owing to the degradation of memory units and reflecting the limited optimization ability of this model.

The functional complementarity between model components is highlighted in the performance comparison shown in Figure 13. The full SleepXLSTM model achieved a synergistic enhancement effect by combining the multiscale feature extraction of the improved MLSTM, gradient path optimization of the SLSTM, and physiological constraint injection of the KGNN. Indeed, its N3 stage recognition F1 score of 84.9% represented an increase of 17.0 and 17.5 percentage points over that for Model 1 (67.9%) and Model 2 (67.4%), respectively. Thus, the prior topological constraints provided by the KGNN and deep temporal modeling ability of the MLSTM significantly alleviated the confusion of features present when a subject enters the deep sleep stage. In addition, the inclusion of the SLSTM module improved the N1 stage recall rate from 12.1% for Model 3 to 87.7% for the full model, an improvement of 75.7 percentage points that highlights the effectiveness of stable memory units when capturing temporal patterns in transitional sleep stages. These findings collectively indicate that SleepXLSTM systematically integrated the three dimensions of feature extraction (MLSTM), optimization process stability (SLSTM), and domain knowledge fusion (KGNN) and confirm that the loss of any single component destroys the architectural balance of the system, leading to model degradation and suboptimal solutions.

4.4. Ablation Experiment Comparison (Using the Second Dataset)

A comprehensive evaluation of the three ablation models was performed. The accuracy, macro F1 score, and recall differed significantly between models, reflecting the contribution of each module to the overall performance, as shown in Figure 14 and Figure 15. Ablation group B (Model 2) achieved the best performance, and its accuracy (70.99%), macro F1 score (0.695), and recall (0.710) were significantly higher than those of the other two groups, indicating that the synergistic effect of the KGNN and SLSTM modules was crucial for modeling temporal dependence. In ablation group A (Model 1), after removing the KGNN module, the accuracy and macro F1 score decreased significantly (59.99% and 0.586, respectively), indicating that the fusion of domain knowledge effectively improves classification robustness. The performance of ablation group C (Model 3) was the lowest (accuracy: 40.99%, macro F1 score: 0.401). In particular, the recall rate decreased substantially, confirming the crucial ability of the SLSTM to optimize gradient propagation and learn long-term dependencies.

The classification performance for the five sleep stages shown in Figure 16 reveals that in stages 0 and 2 (which have the largest sample sizes), Model 2 achieved high accuracies (84.2% and 82.1%, respectively) and F1 scores (0.770 and 0.762), indicating its superior ability to classify the dominant sleep stages. Model 1 achieved moderate F1 scores in these two categories (0.673 and 0.662), whereas Model 3 yielded much lower scores (0.487 and 0.476). For the minority stage 1, the accuracy of all models was low (Model 2: 46.6%, Model 1: 34.8%, Model 3: 19.8%), but Model 2 achieved a relatively balanced F1 score (0.563) with a high recall rate (0.710), suggesting that the KGNN may alleviate class imbalance. For stages 3 and 5, the F1 scores of Model 2 (0.675 and 0.703) were still higher than those of Model 3 (0.371 and 0.403), further indicating that the lack of the SLSTM left the model unable to capture medium- and long-term features. In summary, the removal of the SLSTM degraded performance across all classes, while the absence of the KGNN mainly affected the discrimination accuracy of most classes.

4.5. Comparison with State-of-the-Art Methods

4.5.1. Model Comparison Based on the First Dataset

The SleepXLSTM model proposed in this study exhibited excellent performance in automated sleep staging, with an ACC of 91.25% and an F1 score of 0.8976 (on the first dataset). Therefore, experiments were undertaken in this study to compare its performance with that of various state-of-the-art automatic sleep staging methods using the same parameters and heart rate signals. The results are summarized in Table 1.

Note that all performance comparisons in Table 1 were obtained under consistent experimental conditions. All baseline models employed exactly the same data partitioning strategy, using subject-independent partitioning to ensure that data from the same subjects do not span different sets. All models were trained and tested using the same preprocessing regimen, input features, and evaluation metrics.

The performance improvement achieved by SleepXLSTM stems from three aspects of its architectural design. (1) The KGNN embedding module incorporates prior knowledge of heart rate and sleep stages, providing a strong inductive bias. (2) The multi-layer excitation-inhibition attention mechanism dynamically weights the key nodes in the sequence, enhancing the model’s ability to distinguish sleep stage transitions. (3) Adoption of the numerically stable SLSTM design alleviates the gradient issues in long-term dependency modeling and enhances the model’s ability to represent long-term physiological signals.

To systematically evaluate the performance advantages of the SleepXLSTM model, it was compared with a variety of classical LSTM variant models, as shown in Figure 17. The accuracy results reveal that the SleepXLSTM model achieved an excellent performance of 91.25%, significantly exceeding the accuracies of all other comparison models (the next highest was only 40.41%). The accuracy of traditional LSTM models, including StandardLSTM, BiLSTM, and DeepResidualLSTM, fluctuated between 37.20% and 40.41%, revealing that the improvements of network depth and bidirectional structure achieved limited performance increases. AttentionLSTM and CNNLSTM were optimized by an attention mechanism and convolutional feature extraction, respectively, but their accuracy results were still in the range of 37.76–40.41%, indicating that these improvements did not increase the accuracy of automatic sleep staging. In contrast, SleepXLSTM improved accuracy through the synergistic effect of an excitation-inhibition attention mechanism and SLSTM, which demonstrates the advantages of the model in processing complex physiological signals.

An analysis of the prediction results based on the first dataset showed that traditional deep learning models generally perform poorly on sleep stage classification tasks (Table 1). The accuracy of the CNN, Transformer, and RNN did not exceed 47% on the Awake, NREM, and REM phases. In particular, when classifying the REM phase, the accuracy of all comparison models was below 36%, reflecting the limitations of traditional architectures in capturing the complex transition patterns between sleep stages. In contrast, the SleepXLSTM model achieved more than 87% accuracy across all sleep stages, effectively overcoming the difficulties of traditional models in sleep stage boundary recognition and long-range dependence modeling.

The ACC values obtained for the SleepXLSTM model on the first dataset [37] in all stages indicate that it exhibited a significant accuracy advantage over the other models, with an improvement of 21.45 percentage points over the next-best neural network model. Furthermore, the Cohen’s kappa coefficient was calculated to evaluate the consistency of classification while accounting for the possibility of chance agreement; Figure 18 compares the results. Clearly, the proposed SleepXLSTM model provided far more consistent classification results than the other models.
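For reference, Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance, κ = (p_o − p_e) / (1 − p_e). The sketch below computes it from a confusion matrix; the function name and the error raised for the degenerate case are illustrative choices, not taken from the study.

```python
from typing import List


def cohens_kappa(cm: List[List[int]]) -> float:
    """Cohen's kappa from a square confusion matrix (rows: true, cols: predicted)."""
    k = len(cm)
    total = sum(map(sum, cm))
    # Observed agreement: share of samples on the diagonal.
    p_o = sum(cm[i][i] for i in range(k)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    p_e = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(k)) for i in range(k)
    ) / (total * total)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)
```

A kappa near 0 means that the agreement is close to what the class proportions alone would produce, which is why a model can have moderate accuracy and still a very low kappa on imbalanced data.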

4.5.2. Comparison of the Models Based on ISRUC-Sleep

On the second dataset (ISRUC-Sleep), the accuracy of the model was 87.51%, and its F1 score was 0.8760. To further evaluate the performance of the model proposed in this study, it was compared with other automatic sleep staging methods, and the results are listed in Table 2.

The validation results on the ISRUC-Sleep dataset show that the SleepXLSTM model demonstrated the best generalization ability. Its classification accuracy was 91.51%, 72.22%, and 92.38% for the Awake, NREM, and REM phases, respectively, exceeding the results of all the comparison models listed in Table 2. This result reveals the robustness of the model in cross-dataset scenarios and verifies the effectiveness of the algorithm.

5. Discussion

5.1. Model Performance

The SleepXLSTM model was proposed in this study to extract heart rate signal features for automatic sleep staging. The results of various evaluation experiments demonstrated that the introduction of physiological constraints using the KGNN complemented the effects of the XLSTM components, allowing SleepXLSTM to exhibit advanced automatic sleep staging capabilities.

The experimental results indicate that SleepXLSTM exhibited more advanced performance in terms of ACC, recall, and F1 score than any of its constituent algorithms designed in this study or the conventional XLSTM algorithm. The root of this performance advantage lies in cross-scale dynamic coupling at the network architecture level through the design of the SLSTM gating system, improved MLSTM excitation–inhibition dual-channel regulation, and sparse feature mapping at the fully connected layer.

The internal parameter distribution characteristics and functional implementation mechanism of the MLSTM network can be revealed through an analysis of its input hidden weight matrix, as shown in Figure 19 (left). The weight range was concentrated within [−0.1, +0.1] with a horizontal band-like distribution indicating that the weight values of the same hidden unit exhibited continuous gradient changes for different input dimensions. However, the subtle weight distribution patterns between adjacent hidden units indicate that the network formed a regional preference for feature encoding during the learning process and suggest that a specific hidden unit group tended to produce a systematic response to a continuous interval of input features; this is related to the soft selection of features at the input gate using the Sigmoid function and the generation of candidate memory units using the Tanh function to ensure that adjacent hidden units followed a smooth transition during feature encoding.

In the hidden state weight transition matrix shown in Figure 19 (right), which relates the hidden units of the previous time step to those of the current time step, the weight distribution presents unstructured characteristics with no obvious trends, and its range of [−0.15, +0.15] is wider than that for the input hidden weight matrix. This wider dispersion reveals two key properties of the hidden state updates: (1) there was no evident directional constraint on the connection strength between units, and (2) the state update of each hidden unit was simultaneously affected by multiple predecessor units. Furthermore, the density of this matrix indicates that the model realized nonlinear state evolution through global parameter optimization, which is consistent with the mechanism of coordination between the forget and output gates in the improved MLSTM and SLSTM.

These different matrix distribution characteristics reflect the design goals of the model architecture. The band-like structure of the input weight matrix reflects the spatial continuity requirement of feature encoding, with the local correlation constraint improving feature extraction efficiency. The unstructured distribution of the hidden state weight transition matrix reflects the global coupling requirements of the state evolution and confirms that the complex time-series pattern was modeled using a high-degree-of-freedom parameter space.

Note that the use of the dual-channel design employing excitatory and inhibitory weights is equivalent to introducing a dynamic bias term corresponding to the excitation–inhibition balancing mechanism in neuroscience. As shown in Figure 20, the positive bias of the excitatory weights enhanced the signal-to-noise ratio of feature transmission, whereas the negative bias of the inhibitory weights actively inhibited irrelevant features. Critically, integrating this coupled method into the attention mechanism of the MLSTM provided more refined temporal feature extraction than MLSTM alone.

The weight distributions of the fully connected layers in the proposed network, shown in Figure 21, exhibit multimodal characteristics in which the weights at the end of the classifier present an obvious heavy-tailed distribution. The Gaussian distribution of shallower weights in the model ensures smooth mapping of the initial features, whereas the peaked distribution of deeper weights in the model indicates the establishment of sparse discriminative feature connections in the feature space. Furthermore, the correlation heat map in Figure 22 suggests that the correlation coefficients in the MLSTM incorporating the enhanced excitation–inhibition regulator reached 0.11, 0.22 and 0.07, respectively, which are higher than the correlation coefficients inside the fully connected layer (−0.03, 0.04 and 0.16). In particular, the asymmetric structure of the weight correlation matrix (i.e., the correlation between the hidden state weights and attention in the first modified MLSTM layer is higher than that in the fully connected SLSTM layer) suggests a clear direction for information flow. This cross-module correlation feature confirms the effectiveness of the proposed architecture. Notably, the temporal features extracted by the improved MLSTM and the spatial features screened by the attention mechanism realized complementary enhancement through orthogonal projection rather than simple feature stacking.

At the micro-level, the control of excitation–inhibition weights in the improved MLSTM achieved a dynamic balance of feature transfer. At the meso-level, the multimodal distribution of the fully connected layer constructed a hierarchical feature extraction path. At the macro-level, intermodule correlation ensured the collaborative optimization of multiscale features. This multiscale coupling design can have particularly significant impacts on small-sample time-series classification tasks, such as sleep staging. Furthermore, the attention coupling layer of the MLSTM was improved to reduce the complexity of the model through parameter sharing, and the sparse connection of the fully connected layer prevented overfitting. The dynamic balance of the MLSTM and SLSTM improved the ACC of the model by 26.2% compared with that of the neural network model reported in [12], as shown in Table 1, confirming state-of-the-art automatic sleep staging performance. Furthermore, the SleepXLSTM model demonstrated its superiority over other state-of-the-art models in terms of the kappa coefficient (Figure 17). The observed improvement in multiple performance indicators reflects the ability of SleepXLSTM to analyze sleep stages accurately and reliably using heart rate signals, informing assessments of sleep quality.

5.2. Subject-Wise Performance Variance and Robustness Analyses

To evaluate the generalization ability of the model across different individuals and detect potential overfitting, we conducted subject-level cross-validation and robustness tests, as described by Mu-Jiang-Shan Wang et al. [47]. The data were divided by individuals (based on the first dataset), and the leave-one-subject-out (LOSO) strategy was adopted. For each subject, the data of that subject were used as the test set, while the data of the remaining subjects were used as the training set. The model was repeatedly trained and validated, and the accuracy, precision, recall rate, and F1 score of each subject on the test set are shown in Figure 23. After completing all rounds, the mean, standard deviation, minimum value, and maximum value of each performance indicator were calculated to quantify the inter-subject variance of model performance, as shown in Figure 24. The distribution histogram and box plot of the performance indicators are displayed in Figure 25 to visualize the differences.
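The LOSO procedure and the per-subject summary can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not state which estimator was used.

```python
import statistics
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


def loso_folds(subject_ids: Iterable[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (training subjects, held-out subject), one fold per subject."""
    ids = sorted(set(subject_ids))
    for held_out in ids:
        yield [s for s in ids if s != held_out], held_out


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, standard deviation, minimum, and maximum of per-subject scores."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation (assumed)
        "min": min(scores),
        "max": max(scores),
    }
```

In use, each fold would train a fresh model on the training subjects, evaluate it on the held-out subject, and append the resulting score to the list passed to `summarize`.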

The LOSO cross-validation results in Figure 23 demonstrate that the model maintained a high level of performance consistency across all subjects. Moreover, the summary of the model’s performance indicators in Figure 24 shows high mean values and low standard deviations, indicating limited variance among subjects.

To assess the robustness of the model, Gaussian noise (with a mean of 0 and a standard deviation of 5–10% of the standard deviation of the original data) was added to the input heart rate data during the test phase. The boxplot in Figure 25 intuitively reflects the centralized distribution of the performance metrics, which further indicates the robust generalization ability of the model. The decrease in model performance was used to calculate the robustness coefficient (the ratio of the accuracy rates before and after adding noise; Figure 26).

In the robustness test, the model maintained relatively stable accuracy under the different noise levels, which indicates good fault tolerance, that is, an ability to withstand moderate interference.
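The noise injection and robustness coefficient described above can be sketched as follows. The direction of the ratio (accuracy under noise divided by clean accuracy, so that 1.0 means no degradation) is an assumption, since the text only states that it is a ratio of the two accuracies; the heart rate values and accuracies are placeholders.

```python
import random
from statistics import pstdev


def add_gaussian_noise(signal, noise_fraction, rng):
    """Add zero-mean Gaussian noise whose std is noise_fraction * std(signal)."""
    sigma = noise_fraction * pstdev(signal)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def robustness_coefficient(acc_clean, acc_noisy):
    """Accuracy under noise divided by clean accuracy (1.0 means no degradation)."""
    if acc_clean <= 0:
        raise ValueError("clean accuracy must be positive")
    return acc_noisy / acc_clean


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
heart_rate = [62.0, 64.0, 61.0, 59.0, 58.0, 60.0, 63.0, 65.0]  # hypothetical bpm values
noisy_5 = add_gaussian_noise(heart_rate, 0.05, rng)   # 5% noise level
noisy_10 = add_gaussian_noise(heart_rate, 0.10, rng)  # 10% noise level

# Placeholder accuracies, not results from the study.
coeff = robustness_coefficient(acc_clean=0.90, acc_noisy=0.88)
```

Scaling the noise by the standard deviation of the signal keeps the perturbation comparable across subjects whose heart rates vary over different ranges.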

5.3. Hyperparameter Sensitivity Analysis

To optimize the hyperparameter configuration, we conducted a hyperparameter sensitivity analysis. The hidden layer size, number of network layers, learning rate, weight decay coefficient, and batch size were analyzed in a multilevel experimental scheme. The hidden layer size was varied from 100 to 300 in steps of 50, and the number of network layers was varied from 4 to 12 in steps of 2. The learning rate was evaluated at five levels on a roughly logarithmic scale (1 × 10⁻⁵, 1 × 10⁻⁴, 3 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³), and the weight decay was evaluated at five levels (0, 1 × 10⁻⁴, 1 × 10⁻³, 5 × 10⁻³, 1 × 10⁻²). Five batch sizes were assessed (64, 128, 256, 512, 1024). A two-stage experimental strategy was used. In the first stage, a grid search over the hidden layer size and number of network layers was performed, yielding 25 parameter combinations, with the other parameters fixed at their optimal values. In the second stage, a univariate sensitivity analysis centered on the optimal configuration was performed by changing only one hyperparameter at a time. The model was trained on fixed data splits for 100 epochs per configuration, and performance metrics such as the accuracy and F1 score on the validation and test sets were recorded.
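The two-stage scheme can be enumerated as in the sketch below. The value ranges come from the text; the baseline configuration used for the second stage is a placeholder, because the optimal values are not listed in this section, and the helper name is hypothetical.

```python
from itertools import product

hidden_sizes = list(range(100, 301, 50))  # 100, 150, 200, 250, 300
num_layers = list(range(4, 13, 2))        # 4, 6, 8, 10, 12

# Stage 1: full grid over the two structural hyperparameters (5 x 5 combinations).
stage1_grid = [
    {"hidden_size": h, "num_layers": n} for h, n in product(hidden_sizes, num_layers)
]

# Stage 2: vary one training hyperparameter at a time around a baseline.
levels = {
    "learning_rate": [1e-5, 1e-4, 3e-4, 5e-4, 1e-3],
    "weight_decay": [0.0, 1e-4, 1e-3, 5e-3, 1e-2],
    "batch_size": [64, 128, 256, 512, 1024],
}
# Placeholder baseline; the actual optimal configuration is not given here.
baseline = {"learning_rate": 3e-4, "weight_decay": 1e-4, "batch_size": 128}


def univariate_configs(baseline, levels):
    """Return configurations that differ from the baseline in at most one entry."""
    configs = []
    for name, values in levels.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            configs.append(config)
    return configs


stage2_configs = univariate_configs(baseline, levels)
```

The first stage produces the 25 combinations mentioned in the text, whereas the univariate second stage grows only linearly with the number of levels per hyperparameter.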

Univariate line plots are shown in Figure 27, and the performance elasticity (the ratio of accuracy change to hyperparameter change) for each hyperparameter is plotted in Figure 28 to identify the hyperparameters that have the greatest impact on the model.
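As an illustration of the performance elasticity, the sketch below divides the relative change in accuracy by the relative change in the hyperparameter. Using relative rather than absolute changes is an assumption made here so that hyperparameters on very different scales can be compared; the text only describes the elasticity as the ratio of accuracy change to hyperparameter change, and the numbers are placeholders.

```python
def elasticity(x0, x1, acc0, acc1):
    """Relative change in accuracy divided by relative change in the hyperparameter."""
    if x0 == 0 or acc0 == 0:
        raise ValueError("baseline hyperparameter and accuracy must be non-zero")
    if x1 == x0:
        raise ValueError("the hyperparameter must change between the two runs")
    relative_acc_change = (acc1 - acc0) / acc0
    relative_param_change = (x1 - x0) / x0
    return relative_acc_change / relative_param_change


# Placeholder values: doubling the batch size changes accuracy from 0.90 to 0.91.
batch_size_elasticity = elasticity(x0=128, x1=256, acc0=0.90, acc1=0.91)
```

An elasticity with a magnitude near zero indicates a hyperparameter to which the model is robust. Note that this relative form is undefined for a baseline of zero, such as a weight decay of 0, in which case an absolute-change ratio would be needed instead.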

The results in Figure 28 show that the model is sensitive to changes in the learning rate and in structural parameters such as the hidden layer size, but it is relatively robust to training hyperparameters such as the batch size and weight decay. This analysis supports the robustness of the selected parameter configuration and the rationality of the model design.

5.4. Limitations and Future Work

Although the incorporation of more parameters into SleepXLSTM increased its accuracy, it also increased its complexity; this increase was considered justified by the accuracy with which the model automatically identified sleep stages (Awake, N1, N2, N3, and REM). Notably, the network architecture, which comprises the KGNN, MLSTM, and SLSTM modules, addresses current challenges and establishes a foundation for future improvements. These improvements may include strategies such as network lightweighting to further optimize sleep staging performance while reducing complexity. Furthermore, this study preprocessed the heart rate data using a min–max linear transformation [30] to eliminate dimensional differences. However, recent studies have shown that heart rate variability can be used to enrich data features and thereby realize additional prediction functions (e.g., sleep pressure) [48]. Therefore, future research should adopt richer heart rate signal preprocessing to explore additional prediction functions.
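For reference, a min–max linear transformation maps each value x to (x − min) / (max − min), so that the signal lies in the interval [0, 1]. The sketch below is a generic illustration with hypothetical heart rate values; the handling of a constant signal is a convention chosen here, not one described in the study.

```python
def min_max_normalize(values):
    """Linearly rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant signal: no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


heart_rate = [58.0, 60.0, 65.0, 72.0, 61.0]  # hypothetical bpm values
normalized = min_max_normalize(heart_rate)
```

Because the transformation depends only on the extremes, a single motion-artifact spike can compress the rest of the signal, which is one motivation for the richer preprocessing suggested above.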

There is further room for improvement of the proposed SleepXLSTM model. At present, the model uses a 30-step time window for feature extraction, as this is sufficient to capture short-term and medium-term heart rate fluctuation patterns. However, limitations in modeling the long-term dependence between periods (e.g., the REM to NREM transition) remain. Therefore, future research should introduce a hierarchical temporal attention mechanism [49,50] and/or the WaveNet convolutional structure [51,52] to expand the scope of temporal perception while maintaining computational efficiency; however, a progressive training strategy must be designed to prevent the gradient explosion problem under this approach. In addition, verification of the clinical applicability of the proposed model remains limited because the model can interpret only single-mode ECG data. We plan to simultaneously collect physiological parameters such as ECG, EEG, respiratory, and blood oxygen signals to construct a cross-modal association verification system.
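The 30-step windowing mentioned above can be sketched as a simple sliding window over a heart rate sequence. The stride and the function name are assumptions, since the overlap between consecutive windows is not stated in this section.

```python
def sliding_windows(sequence, window=30, stride=1):
    """Split a sequence into fixed-length windows; incomplete tails are dropped."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(sequence) - window
    return [sequence[i:i + window] for i in range(0, last_start + 1, stride)]


series = list(range(100))  # stand-in for 100 heart rate samples
windows = sliding_windows(series, window=30, stride=1)
```

With a fixed window of 30 steps, dependencies that span more than one window are invisible to the feature extractor, which is the limitation that the hierarchical attention and WaveNet-style extensions are intended to address.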

Furthermore, research on model compression and quantization is necessary to satisfy the embedded deployment requirements of medical devices. We therefore analyzed the complexity of the SleepXLSTM model. The total parameter count includes the elements of all weight matrices and bias vectors. The model size was calculated as the theoretical memory footprint, assuming 32-bit floating-point precision. FLOPs were estimated using a layer-wise theoretical analysis that sums the operations required for a single forward pass. CPU inference time was measured by averaging 100 runs with a batch size of 1 and a sequence length of 30. The analysis reveals that although the current architecture does not directly satisfy the stringent real-time constraints of wearable devices, Configuration 1 (in Table 3) shows clear potential for deployment with further optimization. Therefore, the hierarchical redundancy of the MLSTM and SLSTM could be reduced through model optimization (selective pruning), and mixed-precision quantization could be used to compress the number of model parameters to less than 1/5 of the existing number while maintaining the classification accuracy threshold $\kappa \geq 0.8$. Notably, the dynamic sparse training paradigm [53] and hardware-aware distillation [54] techniques proposed in recent studies may provide new technical paths for deploying high-precision sleep-monitoring models in resource-constrained environments, with the expectation that future models will be both accurate and sufficiently lightweight to be installed in contactless devices.
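The theoretical model size follows directly from the parameter count: each 32-bit floating-point parameter occupies 4 bytes. The sketch below applies this to the parameter counts listed in Table 3, under the assumption that 1 MB is taken as 1024 × 1024 bytes; the function and variable names are illustrative.

```python
BYTES_PER_FLOAT32 = 4


def model_size_mb(num_parameters, bytes_per_param=BYTES_PER_FLOAT32):
    """Theoretical memory footprint in MB, with 1 MB taken as 1024 * 1024 bytes."""
    return num_parameters * bytes_per_param / (1024 * 1024)


# Total parameter counts of the three configurations in Table 3.
table3_parameters = {1: 155_845, 2: 547_717, 3: 3_096_325}
sizes = {cfg: round(model_size_mb(n), 2) for cfg, n in table3_parameters.items()}

# Parameter budget implied by compressing each model to one fifth of its size.
budget = {cfg: n // 5 for cfg, n in table3_parameters.items()}
```

Under this assumption, the computed sizes match the Model Size column of Table 3 (0.59, 2.09, and 11.81 MB), and the one-fifth target for Configuration 1 corresponds to roughly 31,000 parameters.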

In the future, the SleepXLSTM-based automatic sleep staging model could be integrated into a flexible biosensor platform to enable the intelligent upgrade of non-contact wristband devices. By analyzing a single physiological signal, such a device would avoid the physical limitations of traditional contact electrodes and could substantially improve user compliance while aiming to maintain medical-grade monitoring accuracy. Notably, its non-intrusive design could allow the continuous collection of physiological parameters from infants and young children during sleep, mitigating the susceptibility of traditional polysomnography (PSG) to disturbance by limb movements and supporting safe and reliable long-term sleep monitoring for children in critical developmental periods.

From a public health perspective, the widespread adoption of this technology would complement existing sleep health management models. A cloud-based sleep quality assessment platform could dynamically track residents’ sleep parameters and provide early warning of abnormal fluctuations. The long-term sleep data accumulated in this way could provide a scientific basis for individualized sleep intervention plans and, through the analysis of group sleep characteristics, reveal potential correlations among environmental factors, lifestyles, and sleep disorders. This full-chain management model of prevention, monitoring, and intervention is expected to reduce the treatment costs of conditions such as chronic insomnia and circadian rhythm disorders. Importantly, accurate sleep stage identification would enable community medical institutions to use low-cost wearable devices for large-scale sleep disorder screening, which may be especially valuable for the early detection of sleep apnea syndrome (SAS) among the elderly population.

Looking ahead, with continued advances in Internet of Things technology, the next generation of devices may integrate environmental sensor modules to simultaneously monitor external factors that affect sleep, such as light intensity, temperature, and humidity. This multidimensional data analysis capability, combined with the continuous optimization of artificial intelligence algorithms, could drive the evolution of personalized sleep health management systems from simple stage identification to intelligent terminals that offer predictive intervention suggestions.

6. Conclusions

This study proposes and demonstrates a cross-scale dynamically coupled extended long short-term memory network (SleepXLSTM) for automatic sleep staging based on heart rate data collected by wearable devices. The primary conclusions are as follows:

(1): SleepXLSTM constructs a knowledge graph relating heart rate data to sleep labels and integrates an improved MLSTM attention mechanism and an SLSTM model to realize automatic sleep staging (W, N1, N2, N3, and REM) as a function of heart rate signals. The ACC of the full model was 91.25% on the first dataset and 87.51% on the ISRUC-Sleep dataset.
(2): The results of the model ablation experiments confirmed the contributions of the KGNN and the dual-effect excitation–inhibition attention mechanism in the MLSTM to the performance of SleepXLSTM. With the proposed model integrations, the ACC increased by up to 44.55% and by at least 12.55%, effectively improving the classification performance.
(3): The proposed model was shown to be robust and to achieve excellent automatic sleep staging performance based on heart rate signals, with an ACC 21.45 percentage points higher than that of the next-best state-of-the-art automatic sleep staging model.

Automatic sleep staging technology based on wearable devices represents a core approach for personalized health management and can provide a widely applicable solution for sleep staging through the non-invasive monitoring of daily heart rate signals. Future research should focus on optimizing SleepXLSTM for embedded scenarios, thereby helping smart wearable devices advance from basic health monitoring to sleep disorder warning and cost-effective sleep health management in home settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics15030505/s1, Figure S1: Time Series 1; Figure S2: Time Series 2; Figure S3: Time Series 3; Figure S4: Basin of Attraction. Section S1: Supplementary Verification of Strategy Design; Section S2: Proof of Stability; Section S3: Stability proof based on input-output; Section S4: Analysis of dynamic properties.

Author Contributions

Conceptualization, T.W., Z.M. and H.Z.; Methodology, T.W., Z.M., L.S., H.Z. and C.X.; Software, T.W., C.X. and B.R.; Validation, T.W., L.S. and B.R.; Formal analysis, T.W., L.S., H.Z. and B.R.; Investigation, T.W.; Resources, T.W.; Data curation, T.W.; Writing—original draft, T.W.; Writing—review & editing, T.W., L.S., H.Z. and C.X.; Visualization, T.W., Z.M. and C.X.; Supervision, T.W.; Project administration, T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported by the National Natural Science Foundation of China (No. 52208135) and the Natural Science Foundation of Jiangsu Province, China (No. BK20221056).

Data Availability Statement

The datasets used in this article are all publicly available on the Internet and are accompanied by citation instructions as required by the database providers, which can be found in the references.

Acknowledgments

The authors thank Zisen Mao and Luyang Shi for sharing their knowledge of mathematical theory.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gottlieb, D.J.; Redline, S.; Nieto, F.J.; Baldwin, C.M.; Newman, A.B.; Resnick, H.E.; Punjabi, N.M. Association of usual sleep duration with hypertension: The Sleep Heart Health Study. Sleep 2006, 29, 1009–1014.
2. Kim, T.W.; Jeong, J.-H.; Hong, S.-C. The impact of sleep and circadian disturbance on hormones and metabolism. Int. J. Endocrinol. 2015, 2015, 591729.
3. Grandner, M.A.; Alfonso-Miller, P.; Fernandez-Mendoza, J.; Shetty, S.; Shenoy, S.; Combs, D. Sleep: Important considerations for the prevention of cardiovascular disease. Curr. Opin. Cardiol. 2016, 31, 551–565.
4. Barone, M.T.; Menna-Barreto, L. Diabetes and sleep: A complex cause-and-effect relationship. Diabetes Res. Clin. Pract. 2011, 91, 129–137.
5. Abbott, S.M.; Videnovic, A. Chronic sleep disturbance and neural injury: Links to neurodegenerative disease. Nat. Sci. Sleep 2016, 8, 55–61.
6. World Health Organization. World Health Statistics 2016 [OP]: Monitoring Health for the Sustainable Development Goals (SDGs); World Health Organization: Geneva, Switzerland, 2016.
7. Berry, R.B.; Brooks, R.; Gamaldo, C.E.; Harding, S.M.; Marcus, C.; Vaughn, B.V. The AASM manual for the scoring of sleep and associated events. In Rules, Terminology and Technical Specifications; American Academy of Sleep Medicine: Darien, IL, USA, 2012; Volume 176, p. 7.
8. Younes, M.; Raneri, J.; Hanly, P. Staging sleep in polysomnograms: Analysis of inter-scorer variability. J. Clin. Sleep Med. 2016, 12, 885–894.
9. Palotti, J.; Mall, R.; Aupetit, M.; Rueschman, M.; Singh, M.; Sathyanarayana, A.; Taheri, S.; Fernandez-Luque, L. Benchmark on a large cohort for sleep-wake classification with machine learning techniques. npj Digit. Med. 2019, 2, 50.
10. Roberts, D.M.; Schade, M.M.; Mathew, G.M.; Gartenberg, D.; Buxton, O.M. Detecting sleep using heart rate and motion data from multisensor consumer-grade wearables, relative to wrist actigraphy and polysomnography. Sleep 2020, 43, zsaa045.
11. Perez-Pozuelo, I.; Zhai, B.; Palotti, J.; Mall, R.; Aupetit, M.; Garcia-Gomez, J.M.; Taheri, S.; Guan, Y.; Fernandez-Luque, L. The future of sleep health: A data-driven revolution in sleep science and medicine. npj Digit. Med. 2020, 3, 42.
12. Walch, O.; Huang, Y.; Forger, D.; Goldstein, C. Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device. Sleep 2019, 42, zsz180.
13. Almotairi, S.; Addula, S.R.; Alharbi, O.; Alzaid, Z.; Hausawi, Y.M.; Almutairi, J. Personal data protection model in IOMT-blockchain on secured bit-count transmutation data encryption approach. Fusion Pract. Appl. 2024, 16, 152–170.
14. Sridhar, N.; Shoeb, A.; Stephens, P.; Kharbouch, A.; Shimol, D.B.; Burkart, J.; Ghoreyshi, A.; Myers, L. Deep learning for automated sleep staging using instantaneous heart rate. npj Digit. Med. 2020, 3, 106.
15. Pini, N.; Ong, J.L.; Yilmaz, G.; Chee, N.I.; Siting, Z.; Awasthi, A.; Biju, S.; Kishan, K.; Patanaik, A.; Fifer, W.P.; et al. An automated heart rate-based algorithm for sleep stage classification: Validation using conventional polysomnography and an innovative wearable electrocardiogram device. Front. Neurosci. 2022, 16, 974192.
16. Song, T.A.; Chowdhury, S.R.; Malekzadeh, M.; Harrison, S.; Hoge, T.B.; Redline, S.; Stone, K.L.; Saxena, R.; Purcell, S.M.; Dutta, J. AI-Driven sleep staging from actigraphy and heart rate. PLoS ONE 2023, 18, e0285703.
17. Topalidis, P.; Heib, D.P.; Baron, S.; Eigl, E.S.; Hinterberger, A.; Schabus, M. The virtual sleep lab—A novel method for accurate four-class sleep staging using heart-rate variability from low-cost wearables. Sensors 2023, 23, 2390.
18. Imtiaz, S.A. A systematic review of sensing technologies for wearable sleep staging. Sensors 2021, 21, 1562.
19. Ebrahimi, F.; Setarehdan, S.K.; Ayala-Moyeda, J.; Nazeran, H. Automatic sleep staging using empirical mode decomposition, discrete wavelet transform, time-domain, and nonlinear dynamics features of heart rate variability signals. Comput. Methods Programs Biomed. 2013, 112, 47–57.
20. Zhang, Z.; Silva, I.; Wu, D.; Zheng, J.; Wu, H.; Wang, W. Adaptive motion artefact reduction in respiration and ECG signals for wearable healthcare monitoring systems. Med. Biol. Eng. Comput. 2014, 52, 1019–1030.
21. Ye, Z.; Kumar, Y.J.; Sing, G.O.; Song, F.; Wang, J. A comprehensive survey of graph neural networks for knowledge graphs. IEEE Access 2022, 10, 75729–75741.
22. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80.
23. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81.
24. Humayun, A.I.; Ghaffarzadegan, S.; Ansari, M.I.; Feng, Z.; Hasan, T. Towards domain invariant heart sound abnormality detection using learnable filterbanks. IEEE J. Biomed. Health Inform. 2020, 24, 2189–2198.
25. Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended long short-term memory. arXiv 2024, arXiv:2405.04517.
26. Alharthi, M.; Mahmood, A. xLSTMTime: Long-term time series forecasting with xLSTM. AI 2024, 5, 1482–1495.
27. Pöppel, K.; Beck, M.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.K.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended long short-term memory. In Proceedings of the First Workshop on Long-Context Foundation Models@ ICML 2024, Vienna, Austria, 25 July 2024.
28. Schmied, T.; Adler, T.; Patil, V.; Beck, M.; Pöppel, K.; Brandstetter, J.; Klambauer, G.; Pascanu, R.; Hochreiter, S. A large recurrent action model: xLSTM enables fast inference for robotics tasks. arXiv 2024, arXiv:2410.22391.
29. Fan, X.; Tao, C.; Zhao, J. Advanced stock price prediction with xLSTM-based models: Improving long-term forecasting. In 2024 11th International Conference on Soft Computing & Machine Intelligence; IEEE: New York, NY, USA, 2024; pp. 117–123.
30. Datta, L. A survey on activation functions and their relation with Xavier and He normal initialization. arXiv 2020, arXiv:2004.06632.
31. Kraus, M.; Divo, F.; Dhami, D.S.; Kersting, K. xLSTM-Mixer: Multivariate time series forecasting by mixing via scalar memories. arXiv 2024, arXiv:2410.16928.
32. Schmidinger, N.; Schneckenreiter, L.; Seidl, P.; Schimunek, J.; Hoedt, P.J.; Brandstetter, J.; Mayr, A.; Luukkonen, S.; Hochreiter, S.; Klambauer, G. Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences. arXiv 2024, arXiv:2411.04165.
33. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
34. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning; PMLR: London, UK, 2023; pp. 23803–23828.
35. Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards understanding convergence and generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493.
36. Huber, F.; Yushchenko, A.; Stratmann, B.; Steinhage, V. Extreme Gradient boosting for yield estimation compared with deep learning approaches. Comput. Electron. Agric. 2022, 202, 107346.
37. Walch, O. Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography. PhysioNet 2019. Available online: https://physionet.org/content/sleep-accel/1.0.0/ (accessed on 28 June 2025).
38. Khalighi, S.; Sousa, T.; Santos, J.M.; Nunes, U. ISRUC-Sleep: A comprehensive public dataset for sleep researchers. Comput. Methods Programs Biomed. 2016, 124, 180–192.
39. Schulz, H. Rethinking sleep analysis: Comment on the AASM manual for the scoring of sleep and associated events. J. Clin. Sleep Med. 2008, 4, 99–103.
40. Bisong, E. Introduction to Scikit-learn. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: Berkeley, CA, USA, 2019; pp. 215–229.
41. Thakur, A.; Gupta, M.; Sinha, D.K.; Mishra, K.K.; Venkatesan, V.K.; Guluwadi, S. Transformative breast cancer diagnosis using CNNs with optimized ReduceLROnPlateau and Early stopping enhancements. Int. J. Comput. Intell. Syst. 2024, 17, 14.
42. Lu, L.; Shin, Y.; Su, Y.; Karniadakis, G.E. Dying ReLU and initialization: Theory and numerical examples. arXiv 2019, arXiv:1903.06733.
43. Hu, W.; Xiao, L.; Pennington, J. Provable benefit of orthogonal initialization in optimizing deep linear networks. arXiv 2020, arXiv:2001.05992.
44. Stehman, S.V.; Foody, G.M. Accuracy assessment. In The SAGE Handbook of Remote Sensing; SAGE: London, UK, 2009; pp. 297–309.
45. Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 79–91.
46. Delgado, R.; Núñez-González, J.D. Enhancing Confusion Entropy (CEN) for binary and multiclass classification. PLoS ONE 2019, 14, e0210264.
47. Wang, M.-J.-S.; Xiang, D.; Hsieh, S.-Y. G-good-neighbor diagnosability under the modified comparison model for multiprocessor systems. Theor. Comput. Sci. 2025, 1028, 115027.
48. Haque, Y.; Zawad, R.S.; Rony, C.S.A.; Al Banna, H.; Ghosh, T.; Kaiser, M.S.; Mahmud, M. State-of-the-art of stress prediction from heart rate variability using artificial intelligence. Cogn. Comput. 2024, 16, 455–481.
49. Li, Z.L.; Yu, J.; Zhang, X.L.; Xu, L.Y.; Jin, B.G. A Multi-Hierarchical attention-based prediction method on time series with spatio-temporal context among variables. Phys. A Stat. Mech. Its Appl. 2022, 602, 127664.
50. Shao, P.; He, J.; Li, G.; Zhang, D.; Tao, J. Hierarchical graph attention network for temporal knowledge graph reasoning. Neurocomputing 2023, 550, 126390.
51. Oh, S.L.; Jahmunah, V.; Ooi, C.P.; Tan, R.S.; Ciaccio, E.J.; Yamakawa, T.; Tanabe, M.; Kobayashi, M.; Acharya, U.R. Classification of heart sound signals using a novel deep WaveNet model. Comput. Methods Programs Biomed. 2020, 196, 105604.
52. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph WaveNet for deep spatial-temporal graph modeling. arXiv 2019, arXiv:1906.00121.
53. Sokar, G.; Mocanu, E.; Mocanu, D.C.; Pechenizkiy, M.; Stone, P. Dynamic sparse training for deep reinforcement learning. arXiv 2021, arXiv:2106.04217.
54. Boix-Adsera, E. Towards a theory of model distillation. arXiv 2024, arXiv:2403.09053.

Figure 1. Flow chart for automatic sleep staging using SleepXLSTM.

Figure 2. Architecture of the proposed KGNN. Notes: “Graph structure” refers to the different heart rates corresponding to different sleep stages, “Text structure” refers to the heart rate values corresponding to the sleep stage values, “Entity” denotes the Entity_embedding operation, and “Relation” denotes the Relation_embedding operation.

Figure 3. Architecture of the proposed SleepMLSTM. Notes: “E” denotes excitatory, $e^{(l)}$ denotes the excitatory effect, “R” denotes inhibitory, $i^{(l)}$ denotes the inhibitory effect, and “E-R Connect” denotes the sum of the excitatory and inhibitory effects.

Figure 4. Architecture of SleepSLSTM. Notes: The candidate gate includes intermediate state $z_{t}$.

Figure 5. Heart rate signal, normalized heart rate signal, true sleep label, and predicted sleep label.

Figure 6. NCE matrix. Notes: “Proportion” denotes the NCE value and “Count” denotes the number of samples.

Figure 7. SleepXLSTM classification evaluation.

Figure 8. SleepSLSTM training dynamics.

Figure 9. NCE matrix based on the ISRUC-Sleep dataset. Notes: “Proportion” denotes the NCE value.

Figure 10. SleepXLSTM classification evaluation.

Figure 11. Comparison of model NCE matrices.

Figure 12. Comparison of Model 1, 2, and 3 training dynamics.

Figure 13. Comparison of model classification performance.

Figure 14. Comparison of accuracy (based on the ISRUC-Sleep dataset).

Figure 15. Comparison between recall and F1 score (based on the ISRUC-Sleep dataset).

Figure 16. Performance of the ablation model for sleep stage classification.

Figure 17. Classification performance of different LSTM variant models.

Figure 18. Comparison of Cohen’s kappa coefficients for different state-of-the-art models.

Figure 19. Input hidden weight matrix (right) and hidden state weight transition matrix (left). Definitions: “Hidden Units” denotes the hidden unit index, “Previous Hidden” denotes the previous hidden unit index and “Current Hidden” denotes the current hidden unit index.

Figure 20. Excitatory (left) and inhibitory (right) effects of the improved MLSTM.

Figure 21. Distribution of weights and bias terms in different network components.

Figure 22. Correlation of the weight and bias terms between different networks. FC: fully connected layer.

Figure 23. Performance metrics across subjects based on the first dataset (LOSO Cross-Validation).

Figure 24. Summary statistics of each performance indicator.

Figure 25. The variance distribution of data for each indicator. ACC: accuracy; Pre: precision.

Figure 26. Robustness analysis under Gaussian noise.

Figure 27. Sensitivity analysis of each hyperparameter (based on the first dataset).

Figure 28. Elasticity coefficients calculated for each hyperparameter (based on the first dataset).

Table 1. Sleep staging performance of different state-of-the-art methods.

Method	Awake ACC	NREM ACC	REM ACC	Best ACC
Logistic regression [18]	0.6	0.452	0.453	0.698
KNN [18]	0.6	0.402	0.402	0.671
Random forest [18]	0.6	0.434	0.434	0.676
Neural network [18]	0.6	0.454	0.454	0.698
CNN	0.4	0.454	0.331	0.406
Transformer	0.4	0.468	0.352	0.422
RNN	0.4	0.423	0.288	0.387
SVM	0.3	0.392	0.254	0.365
SleepXLSTM	0.9	0.875	0.960	0.960

Table 2. Sleep staging performance of different state-of-the-art methods.

Method	Awake ACC	NREM ACC	REM ACC	Best ACC
Logistic regression	0.4	0.401	0.231	0.377
KNN	0.4	0.412	0.254	0.387
Random forest	0.4	0.427	0.434	0.676
Neural network	0.4	0.446	0.312	0.423
CNN	0.4	0.468	0.335	0.445
Transformer	0.5	0.468	0.352	0.422
RNN	0.4	0.435	0.299	0.412
SVM	0.4	0.418	0.268	0.392
SleepXLSTM	0.9	0.722	0.923	0.875

Table 3. Complexity assessment of the SleepXLSTM model under different parameters.

Configuration	Hidden Size	Num Layers	Total Parameters	Model Size (MB)	FLOPs (Millions)	CPU Inference Time (ms)
1	64	2	155,845	0.59	4.54	51.341
2	128	2	547,717	2.09	17.97	52.461
3	256	3	3,096,325	11.81	119.11	82.45

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
