1. Introduction
The maintenance of sleep homeostasis is a critical regulatory mechanism of human physiological functions, and sleep disorders are associated with a variety of chronic diseases. Many studies have confirmed that a long-term lack of sleep or abnormal sleep structure can lead to sympathetic hyperactivity [1] and metabolic regulation imbalance [2], which can significantly increase the risk of cardiovascular and cerebrovascular diseases [3], diabetes [4], and neurodegenerative diseases [5]. According to statistics from the World Health Organization, approximately 27% of adults worldwide have at least one symptom of a sleep disorder; failure to diagnose these cases can accelerate the progression of chronic disease owing to continuous physiological decompensation [6]. Therefore, the accurate monitoring of sleep quality and early identification of abnormal sleep have become key aspects of preventive medicine and chronic disease management.
Polysomnography (PSG), which is accepted as the gold standard for clinical sleep monitoring, synchronously acquires multimodal physiological signals, including electroencephalogram (EEG), electrooculogram (EOG), and electromyogram (EMG) data, to realize sleep staging [7]; however, the subject must remain in a laboratory environment and be connected to the equipment through attached electrodes. Consequently, the monitoring cost is high, subject comfort is low, and the results cannot reflect the daily natural sleep state [8].
Wearable devices (e.g., smart wristbands collecting photoplethysmography data) can provide an alternative method for continuous sleep monitoring at home [9,10,11]. For example, Walch et al. [12] collected raw acceleration and heart rate data from an Apple Watch, then applied different algorithms (logistic regression, k-nearest neighbors, random forest, and a neural network) to classify sleep stages as “Wake,” “Non-Rapid Eye Movement” (NREM), and “Rapid Eye Movement” (REM). The neural network exhibited the best performance, classifying sleep stages with an accuracy of 62.4%. We note that the sleep data collected by wearable devices are personal health information. In the context of the Internet of Medical Things (IoMT), addressing the issue of data privacy protection during data transmission and storage is as important as ensuring accuracy [13]. Although traditional machine learning-based methods initially showed potential in wearable sleep monitoring, their performance (62.4% accuracy) limits their practicability. The characteristics of physiological signals (such as heart rate and acceleration) and sleep stages are heterogeneous; that is, data of different modalities differ fundamentally in their distribution morphology and dynamic patterns. Heart rate signals form a continuous numerical sequence whose fluctuations are regulated by the autonomic nervous system. The relationship between heart rate signals and sleep stages is a nonlinear, high-dimensional mapping, and stage determination conventionally depends on neurophysiological markers such as EEG signals. This heterogeneity makes it difficult to establish cross-modal feature alignment using manual feature engineering, on which traditional methods rely. In particular, the limited representation learning ability of logistic regression and random forest models cannot capture the complex coupling mechanism between heart rate dynamics and sleep macrostructure.
Because of the rapid development of deep learning theory, end-to-end feature learning provides a new way to handle heterogeneous representation data. Ref. [14] employed a deep learning model to realize automatic sleep staging (“Wake,” “Light Sleep,” “Deep Sleep,” and “REM”) by extracting the instantaneous heart rate from electrocardiograms (ECGs); the algorithm exhibited an accuracy of 77% and a Cohen’s kappa coefficient of 0.66 when trained and tested using the Sleep Heart Health Study dataset, confirming that predictions made by the algorithm replicated the results of previous clinical studies. Ref. [15] trained a deep learning algorithm on heart rate data to realize automatic sleep staging classification into two (“Wake” and “Sleep”), three (“Wake,” “NREM,” and “REM”), or four (“Wake,” “Light Sleep” (N1), “Moderate Sleep” (N2), and “REM”) levels. The accuracy of the two-level model was 87.97%, the kappa coefficient of the three-level model was 0.6025, and the sensitivity and specificity values of the four-level model were 0.3812 and 0.9744, respectively. Critically, this study demonstrated a cost-effective and non-invasive solution that can be deployed to automate sleep classification in the home environment. Furthermore, Ref. [16] proposed a sequence-to-sequence long short-term memory (LSTM) model for automatic sleep staging capable of performing three-level (“Awake,” “NREM,” and “REM”) and four-level (“Awake,” “N1,” “N2,” and “REM”) sleep staging with accuracies of 71–82% and 60–79%, respectively. This method accurately estimated deep sleep using data from wearable devices, exhibiting promise for use in clinical applications that require long-term deep-sleep monitoring.
Although sleep staging provides the core data for evaluating sleep quality and neurophysiological state, the automatic identification of sleep stages has been consistently challenged by their weak correlation with physiological signals [17,18]. Previous studies have attempted to directly input heart rate signals into deep learning models, but have overlooked two key scientific issues: first, there is a nonlinear time-varying coupling relationship between the heart rate and sleep stage [19], which is difficult to capture using traditional end-to-end modeling; second, time-varying noise such as motion artifacts and respiratory interference is common in the heart rate signals collected by wearable devices, and direct modeling based on these signals can easily lead to a spatial offset of features [20]. These issues significantly restrict the practical application of wearable technologies in sleep medicine.
In light of these problems, the knowledge graph neural network (KGNN) may provide a theoretical breakthrough for capturing the implicit correlation between physiological signals and sleep stages by fusing a knowledge graph with a neural network [21]. The principle underlying the application of the KGNN model for automatic sleep staging is to construct a dynamic knowledge topology comprising the physiological signal, regulatory relationship, and sleep event. This topology not only allows information from neighboring nodes to influence target nodes according to the message passing rules [22], but also performs nonlinear entity fitting on the graph structure to realize a model with excellent classification accuracy and robustness [23].
Typically, deep learning models developed for sleep staging employ homogenization analysis to process the time-domain features of heart rate signals [24], but the resulting staging performance is easily influenced by dynamic noise, such as motion artifacts and respiratory interference. The matrix LSTM (MLSTM) and scalar LSTM (SLSTM) components included in the extended LSTM (XLSTM) model provide the basic ability to address such noise problems, with the former adopting a dynamic attention mechanism constructed from matrix memory and covariance updating rules [25] and the latter capitalizing on the feature processing efficiency realized by scalar memory and mixed updating [26]. However, the ability of these components to suppress noise remains limited: when enhancing key features, the attention mechanism in the MLSTM can weaken the weights of other features [27] (asymmetric feature selection), and the high-speed updating of the SLSTM may amplify the influence of noise on the model. Therefore, this study proposed an excitation–inhibition dual-effect regulator and implemented secondary feature modulation based on the original attention mechanism to realize a model that can not only dynamically capture key heart rate characteristics, but also actively suppress the interference of time-varying noise such as motion artifacts.
A cross-scale dynamically coupled XLSTM network (SleepXLSTM) was accordingly developed in this study to achieve automatic sleep staging based on heart rate signals, as shown in Figure 1. A KGNN was applied to construct a dynamic knowledge topology, and cross-scale learning of the correlation between heart rate data and sleep stage labels (Wake, N1, N2, Deep Sleep (N3), and REM) was realized through entity and relation embeddings. Furthermore, an improved MLSTM module was applied to extract heart rate features by suppressing the interference of time-varying noise such as motion artifacts. The features obtained by the KGNN and improved MLSTM were input into the SLSTM via linear concatenation to improve feature processing efficiency. Finally, the sleep staging results were output through the fully connected layer of the XLSTM. To the best of our knowledge, this is the first study to apply heterogeneous temporal representations for sleep staging using data collected by a wearable device through a dynamic coupling strategy.
The primary contributions of this study are as follows:
- (1) The cross-scale dynamically coupled SleepXLSTM network was developed to enable automatic sleep staging based on a single physiological feature (heart rate signal).
- (2) The MLSTM model in the network was improved by introducing a dual-effect excitation–inhibition regulator to suppress the interference of time-varying noise, such as motion artifacts.
- (3) The performance advantage of SleepXLSTM over previously proposed state-of-the-art automated sleep staging algorithms was demonstrated using heart rate data.
The remainder of this paper is organized as follows: Section 2 describes the proposed algorithm, Section 3 details the experiments performed in this study, Section 4 presents the experimental results, Section 5 analyzes the obtained results and discusses the validity of the proposed method, and concluding remarks are provided in Section 6.
2. Methods
2.1. Proposed Method
This section describes the proposed deep fusion method for identifying sleep stages based on the heterogeneous features of heart rate signals. The proposed SleepXLSTM model comprises KGNN, SleepMLSTM, and SleepSLSTM components with deep fusion and inference modules. Figure 1 illustrates the workflow of the proposed method.
The SleepXLSTM model contains two modifications to overcome the limitations of the conventional LSTM network: exponential gating and new storage structures. These modifications are realized using an SLSTM with a mixture of scalar updating and storage and an MLSTM with matrix storage and covariance (outer product) updating rules [25]. Note that both the SLSTM and MLSTM enhance the XLSTM through exponential gating and can be extended to multiple memory cells. Notably, the combination of multiple SLSTM heads and exponential gating establishes a new memory mixing method in which memory can be mixed across cells within each head. The integration of these new LSTM variants into the residual block module results in blocks that can be residually stacked to construct the XLSTM architecture [26,27,28].
Heart rate data and the corresponding sleep labels were used as input data for the SleepXLSTM model, and entity and relation embeddings were calculated by the KGNN, which output a context vector. Following the time step calculation by the SleepMLSTM model and the application of the improved attention mechanism within it, the output hidden state was input to the SleepSLSTM module, where the fused representation was obtained through feature fusion. Finally, the fully connected layers and loss function were employed to obtain the output classification value.
2.2. KGNN Learning
Knowledge graphs can represent structured relationships between entities and have accordingly become a critical subject of research in the fields of cognition and artificial intelligence. The KGNN uses a deep neural network to integrate the topological information and attribute features in graph data, then provides a more refined feature representation of nodes, allowing it to learn the attribute and structural features of entities and relations end to end [21]. This section describes how the entities and relations in the knowledge graph are embedded into a continuous vector space and how information fusion and inference are performed through the neural network structure.
The nodes and edges in the graph structure used in this study are defined as follows. For the input heart rate time-series sliding window of shape B × T × 1 (where B denotes the batch size, T denotes the number of time steps, and 1 denotes the number of features), we construct a bipartite graph whose node set consists of two types of nodes: entity nodes and relationship nodes. The entity nodes correspond to the current heart rate value of each sample; that is, the entity-node feature of each sample is the scalar heart rate value at the last time step of the window. The relationship nodes correspond to the physiological correlation between sleep stages and heart rate and are defined as a single relationship type. The edge set connects each entity node with the relationship node, and each edge represents the correlation between the heart rate value and the sleep stage. This graph structure encodes the domain knowledge, discretizing the continuous heart rate signal into a combination of entities and relationships.
Figure 2 shows the architecture of the KGNN used in this study. First, given the input data, the KGNN maps the entities (heart rate data) in the knowledge graph to a high-dimensional vector space through a linear layer whose weight and bias are drawn from uniform distributions, where h = 200 is the hidden layer size. This Entity_embedding operation preserves the basic characteristics of the entity and provides the basis for subsequent feature fusion. Its input is the heart rate over the previous T time steps, and its output is the embedded entity vector.
Simultaneously, the KGNN maps the relation in the knowledge graph to the same vector space through the embedding layer. This Relation_embedding operation ensures that the entity and relation can be operated on directly in the same space. The relation, i.e., the correspondence between the heart rate and the sleep stage, is represented by an embedding initialized from the normal distribution, and the operation outputs the embedded relation vector.
Next, the KGNN performs linear transformation and feature fusion of the Entity_embedding and Relation_embedding vectors through a linear layer. The two embedding vectors are first spliced (concatenated), and the combination embedding is obtained by repeating the calculation over multiple iterations. The linear layer, defined by a weight matrix and a bias term, then maps the combination embedding to the context vector.
During forward propagation in the KGNN, the entity and relation embedding vectors are concatenated, then fused by a context vector generated using a linear layer transformation. This context vector is subsequently added to the hidden state of the final SleepMLSTM layer, which affects the final prediction results. This step captures the complex interactions and implicit associations between entities and extracts and reinforces higher-order semantic information in the knowledge graph.
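The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names (`linear`, `init_uniform`, `kgnn_context`), the small hidden size, the initialization bound, the zero biases, and the single fusion pass are all hypothetical choices made for readability (the text uses h = 200).

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one input vector x (lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def init_uniform(rows, cols, rng, bound=0.1):
    """Hypothetical uniform initialisation in [-bound, bound]."""
    return [[rng.uniform(-bound, bound) for _ in range(cols)]
            for _ in range(rows)]

def kgnn_context(heart_rate_last, h=4, seed=0):
    """Return a context vector of size h for one sample.

    heart_rate_last is the (normalised) heart rate at the last time step
    of the window, i.e. the entity-node feature described in the text.
    """
    rng = random.Random(seed)
    # Entity embedding: linear layer mapping the scalar heart rate to R^h.
    w_ent = init_uniform(h, 1, rng)
    b_ent = [0.0] * h  # assumption: zero bias
    entity = linear([heart_rate_last], w_ent, b_ent)
    # Relation embedding: one vector for the single relation type,
    # drawn here from a standard normal distribution.
    relation = [rng.gauss(0.0, 1.0) for _ in range(h)]
    # Fusion: concatenate both embeddings and project back to R^h.
    w_fuse = init_uniform(h, 2 * h, rng)
    b_fuse = [0.0] * h  # assumption: zero bias
    return linear(entity + relation, w_fuse, b_fuse)

context = kgnn_context(0.42)
print(len(context))
```

In a trained model the weights and the relation vector would be learned parameters rather than values redrawn on every call; they are created inside the function here only to keep the sketch self-contained.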
2.3. SleepMLSTM Learning
The SleepMLSTM model employed in this study improved upon the attention mechanism included in the conventional MLSTM by introducing a novel dual-effect factor for regulating excitatory and inhibitory effects, shown in Figure 3A, into the multilayer SleepMLSTM architecture shown in Figure 3B. Figure 3C schematically depicts the function of the dual-effect regulator for global excitatory and inhibitory effects; in this regulator, the attention mechanism is used for secondary feature extraction, and its output is fused with the hidden state to obtain the final output. An analysis of the dynamic characteristics of this network, which bears on its stability, can be found in Sections S2–S4.
(1) Time Step Calculation: Based on the MLSTM structure in the XLSTM [29], this study proposed an improved SleepMLSTM model fusing the MLSTM and KGNN models. Each layer has input weights for the input gate, forget gate, cell state, and output gate; uniformly distributed weights [30]; recurrent (cyclic) weights for the same gate responses; and bias terms generated through zero initialization. Furthermore, separate weights are defined for the excitation and inhibition effects, both of which are generated from the standard normal distribution.
The input, forget, cell, and output states are computed at each time step from these weights, followed by the hidden state. The improved MLSTM processes time-series data using multiple LSTM layers that transfer their states layer by layer and introduce shortcut (direct) connections in the network, allowing information to skip some layers and transfer directly to subsequent layers, thereby shortening the path of information transmission and improving network efficiency.
(2) Attention Mechanism: The attention mechanism operates on the hidden state of each layer using query, key, and value weight matrices together with excitatory and inhibitory effect weight matrices. The original formula for calculating attention is given in Equations (11)–(13), and the excitation–inhibition regulatory factor is calculated as shown in Equations (14) and (15). That is, we replace the traditional attention with excitatory–inhibitory attention.
The query, key, and value vectors are computed first, followed by the excitatory and inhibitory effect weights and the attention values. The context vector is then computed for the corresponding time step. Finally, the hidden state and the attended value are joined through a residual connection.
The original MLSTM attention mechanism enhances the expressive ability of the XLSTM model [31]. The improved MLSTM in this study calculates the query, key, and value vectors through the attention mechanism, then calculates the attention score and weight using the excitation and inhibition weights, respectively. These weights are subsequently applied to the output of the MLSTM layer to enhance or suppress the expression of specific features.
2.4. SleepSLSTM Learning
Similarly to the conventional LSTM, the SLSTM can contain multiple memory cells, which enables memory mixing via recurrent connections from the hidden state vector to the memory cell input and to the input, forget, and output gates. One novel aspect of memory mixing in the SLSTM is the effect of exponential gating. The SLSTM incorporates memory mixing within each of multiple heads without crossing between heads; together with exponential gating, this establishes a new method for memory mixing [32]. The SleepXLSTM model evaluated in this study incorporated the SLSTM by passing data through a fully connected layer, a dropout layer, then a second fully connected layer to output the final sleep staging sequence. The architecture of SleepSLSTM is shown in Figure 4.
(1) Calculation of Gating: The input weights of the intermediate gate are drawn from a uniform distribution, and the bias term is generated by zero initialization. The input, forget, and intermediate states are then computed from these parameters.
(2) Stabilization of State Transitions: A stabilization factor is used to stabilize the input and forget gates, after which the cell state is updated, the normalizer state is computed, and the hidden state is output. The stabilization factor converts the multiplication operation into addition through a logarithmic transformation to avoid numerical overflow and underflow caused by repeated gating operations, and the normalizer state is used to eliminate the amplitude drift of the cell state. The output of the SLSTM in this study is the hidden state.
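The stabilization step can be illustrated with a scalar sketch. The code below follows the stabilized exponential gating of the published xLSTM formulation (a running maximum in log space, plus a normalizer state); whether SleepSLSTM uses exactly these update rules is an assumption, and the function name `slstm_step` and its argument names are hypothetical. The gate pre-activations are taken as given, i.e., already computed from the inputs, recurrent weights, and biases.

```python
import math

def slstm_step(i_pre, f_pre, z_pre, o_pre, c_prev, n_prev, m_prev):
    """One stabilised sLSTM step for a single scalar memory cell.

    i_pre, f_pre, z_pre, o_pre: gate pre-activations.
    c_prev, n_prev, m_prev: previous cell, normaliser, and stabiliser states.
    """
    log_i = i_pre  # input gate is exp(i_pre)
    log_f = f_pre  # assumption: the forget gate is also exponential
    # Stabiliser state: running maximum of the gate exponents (log space).
    m = max(log_f + m_prev, log_i)
    # Stabilised gates: products of exponentials become sums of exponents.
    i_gate = math.exp(log_i - m)
    f_gate = math.exp(log_f + m_prev - m)
    z = math.tanh(z_pre)                  # intermediate (candidate) state
    o = 1.0 / (1.0 + math.exp(-o_pre))    # sigmoid output gate
    c = f_gate * c_prev + i_gate * z      # cell state update
    n = f_gate * n_prev + i_gate          # normaliser state update
    h = o * (c / n)                       # normalised hidden state
    return h, c, n, m

# A very large input pre-activation does not overflow, because only the
# difference to the running maximum is ever exponentiated.
h, c, n, m = slstm_step(1000.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0)
print(h, n, m)
```

Because the cell state and the normalizer are scaled by the same factor, their ratio (and hence the hidden state) is unchanged by the stabilization; only the intermediate magnitudes are kept in a safe range.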
2.5. Deep Fusion and Inference
The output of the final improved MLSTM layer is combined with the KGNN output through feature fusion, and the resulting features are fused again to obtain the final representation. These fused features are input into the fully connected layer of SleepXLSTM, the loss function is optimized with gradient clipping, and the automatic sleep staging results are output.
(1) Fully Connected Layer: The classification weight matrix of the first fully connected layer is generated through the Kaiming normal distribution, and the layer includes a bias term. Given the fused input features, the first fully connected layer applies this linear transformation. To prevent overfitting, its output is then passed through a dropout layer [33]. The classification weight matrix of the second fully connected layer is generated through the uniform distribution, and the layer has its own output bias term; the output probability value for each sleep stage is obtained from the output of the second fully connected layer.
(2) Loss Function and Optimization: The weighted cross-entropy loss function [34] is used to alleviate the long-tail distribution bias of the sleep stage data via adaptive weight allocation based on the inverse square root of the class sample size. Furthermore, the AdamW optimizer is employed to decouple the weight decay and gradient update paths, thereby enhancing the generalization performance of the model [35]. Finally, a global gradient norm constraint is added to suppress gradient explosion in the deep recurrent architecture and improve the adaptability of the model to sleep staging tasks while ensuring convergence [36].
The weighted cross-entropy loss is computed over the training samples from the predicted probability of the true sleep stage of each sample, weighted by the corresponding class weight. The AdamW optimizer is configured with a weight decay index, two momentum values, a learning rate, and a stability coefficient. Finally, the gradient is clipped such that when its global norm exceeds the threshold, it is rescaled to the threshold.
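The two mechanisms named above, class weights based on the inverse square root of the class sample size and clipping by the global gradient norm, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the loss is averaged over the samples (one of several possible normalizations), and the clipping threshold of 1.0 in the usage lines is an arbitrary example value, not the value used in the study.

```python
import math

def class_weights(class_counts):
    """Weight each class by the inverse square root of its sample count."""
    return {label: 1.0 / math.sqrt(count)
            for label, count in class_counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Mean weighted cross-entropy.

    probs: one dict per sample mapping each class to its predicted probability.
    labels: true class of each sample.
    weights: class weights, e.g. from class_weights().
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)  # assumption: average over samples

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

w = class_weights({"N2": 400, "REM": 100})
loss = weighted_cross_entropy(
    [{"N2": 0.7, "REM": 0.3}, {"N2": 0.4, "REM": 0.6}], ["N2", "REM"], w)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(w, loss, clipped)
```

With these weights, a class that is four times rarer receives twice the weight, which is a gentler correction than weighting by the plain inverse frequency.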
3. Experimental Evaluation
3.1. Dataset and Feature Preparation
The public dataset used in this study contained labeled sleep stages (Wake, N1, N2, N3, and REM) based on PSG data [37] as well as acceleration (in g) and heart rate (in bpm, measured via photoplethysmography) data from 31 subjects collected between June 2017 and March 2019 at the University of Michigan [37]. The subjects wore an Apple Watch to collect their dynamic activity patterns (number of steps) for at least a week (7–14 d) before spending a night in a sleep laboratory, where they slept for 8 h while acceleration (g) and heart rate (bpm) were collected by the Apple Watch and PSG data were collected by laboratory equipment. Each type of data recorded by the Apple Watch was paired with the labeled sleep data based on the PSG results. For a complete description of these methods, see the database description [12]. The normal heart rates for each subject and the corresponding sleep stages were used to train the SleepXLSTM model in this study. To further verify the performance of the algorithm, a second public dataset, ISRUC-Sleep [38], was used. Its data were collected from adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. The recordings were randomly selected from the PSG recordings obtained at the Sleep Medicine Center of the University of Coimbra Hospital, Coimbra, Portugal. This dataset consists of data from 100 subjects, each of whom underwent one recording session during which the heart rate and corresponding sleep stages were recorded. The sleep stage annotations were made by two human experts using visual scoring.
Given the heart rate signal sequence, in which each element represents the heart rate value at a given time, the goal for this task was to predict the corresponding sleep stage sequence, in which each element belongs to one of the sleep stage categories. To prevent leakage along the time dimension, a gap-based sequence sampling strategy was adopted, in which each input sequence has a fixed sequence length and the sampling indices are separated by a gap step size.
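A gap-based sampler of this kind can be sketched as below. The exact index convention is not recoverable from the text, so the sketch makes two explicit assumptions: window k starts at index k × gap, and the label of a window is the sleep stage at its last time step. The function name `gap_windows` and the example values are hypothetical.

```python
def gap_windows(heart_rate, stages, seq_len, gap):
    """Return a list of (window, label) pairs.

    Assumptions for this sketch: window k starts at index k * gap, and its
    label is the sleep stage at the last time step of the window.
    """
    if len(heart_rate) != len(stages):
        raise ValueError("heart_rate and stages must have the same length")
    pairs = []
    for start in range(0, len(heart_rate) - seq_len + 1, gap):
        end = start + seq_len
        pairs.append((heart_rate[start:end], stages[end - 1]))
    return pairs

hr = [60, 61, 62, 63, 64, 65, 66]
st = ["W", "W", "N2", "N2", "N2", "REM", "REM"]
for window, label in gap_windows(hr, st, seq_len=3, gap=2):
    print(window, label)
```

Under these assumptions, choosing a gap at least as large as the sequence length yields windows that share no time steps, which is the strictest way to keep neighboring samples from overlapping.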
A min–max linear transformation was applied to eliminate dimensional differences in the heart rate data while avoiding leakage of test-set information [30]. The original heart rate series was normalized by subtracting the minimum value and dividing by the range (the maximum minus the minimum), where the minimum and maximum are the extreme values in the training set.
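As a small sketch of this normalization (function names are hypothetical), the extremes are fitted on the training values only and then reused unchanged for any other split:

```python
def fit_min_max(train_values):
    """Return the (min, max) of the training heart rate values."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    """Scale values with extremes (lo, hi) fitted on the training set."""
    span = hi - lo
    if span == 0:
        raise ValueError("training values are constant; cannot scale")
    return [(v - lo) / span for v in values]

lo, hi = fit_min_max([50, 70, 90])               # training set only
print(min_max_scale([50, 70, 90, 100], lo, hi))  # e.g. held-out values
```

Because the extremes come from the training set, held-out values can fall slightly outside [0, 1]; that is the expected consequence of not looking at the test data when fitting the scaler.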
The sleep stages in this study were ordinally coded according to the sleep staging rules defined by the American Academy of Sleep Medicine [39], and an injective map from sleep stages to integer labels was established through LabelEncoder [40]. After encoding, the ordinal relations of the labels were preserved.
A hierarchical multi-stage strategy was used for data partitioning. First, at the subject level, we randomly divided all subjects into a training set (80%), validation set (10%), and test set (10%), ensuring that subjects in the three sets were mutually exclusive. Next, within the training set, we used a Bernoulli sampling process to sample the time-series data of each subject independently with a fixed sampling probability. The process was applied separately for each sleep stage category and each training-set subject, over the samples of that subject in that category. The sampled data were used for model monitoring and the early stopping strategy during training, whereas the unsampled data were used for parameter updating (for details, see Section S1).
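The two partitioning stages can be sketched as follows. The function names, the seeds, and the sampling probability of 0.1 in the usage lines are hypothetical; the sampling probability used in the study is not stated in the surviving text.

```python
import random

def split_subjects(subject_ids, seed=0, train=0.8, val=0.1):
    """Subject-level split into mutually exclusive train/val/test lists."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

def bernoulli_holdout(samples, p, seed=0):
    """Independently send each sample to the monitoring set with probability p.

    Returns (monitoring, updating): sampled items are used for monitoring
    and early stopping, the rest for parameter updates.
    """
    rng = random.Random(seed)
    monitoring, updating = [], []
    for sample in samples:
        (monitoring if rng.random() < p else updating).append(sample)
    return monitoring, updating

train_ids, val_ids, test_ids = split_subjects(range(31))
monitoring, updating = bernoulli_holdout(list(range(200)), p=0.1)
print(len(train_ids), len(val_ids), len(test_ids))
print(len(monitoring), len(updating))
```

In the procedure described above, `bernoulli_holdout` would be called once per training subject and sleep stage category, so that every category of every subject contributes to both the monitoring and the updating data.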
3.2. Ablation Experiment Design
Four ablation experiments were designed to systematically evaluate the synergistic effects and independent contributions of the key components of the SleepXLSTM architecture. These experiments considered the full model to comprise the complete SleepXLSTM architecture integrating the KGNN, improved MLSTM, and SLSTM to realize joint modeling of spatiotemporal features and domain knowledge through cascade feature fusion. Ablation Model 1 removed the KGNN module and retained the improved MLSTM and SLSTM to quantify the effect of domain knowledge injection on classification performance. Ablation Model 2 removed the improved MLSTM module while retaining the KGNN and SLSTM to verify the contribution of the graph structural attention mechanism to temporal dependence modeling. Ablation Model 3 removed the SLSTM module while retaining the KGNN and improved MLSTM to evaluate the optimization effect of the stable memory unit on gradient propagation.
The control variable method was applied with consistent hyperparameters, optimization settings (AdamW), and data division strategies (stratified sampling to assemble the test set) to ensure that any observed differences in performance resulted from architectural changes. The evaluation considered the accuracy, macro-F1 score, and class-wise precision and recall metrics to comprehensively evaluate the performance of each ablation model. These experiments were conducted to reveal: (1) the marginal effect of the KGNN in introducing prior knowledge, (2) the functional complementarity of the improved MLSTM and the SLSTM in time-series modeling, and (3) the performance gain of the full model compared with the simple superposition of several components.
3.3. Implementation
The experimental platform employed an NVIDIA RTX-3060 GPU with 12 GB VRAM in an Intel Core i5-12490F machine with 32 GB DDR4 and used CUDA 11.8 to accelerate computing.
During the data preprocessing stage, the heart rate time-series data were normalized using min–max processing, and the sleep stage labels were coded using LabelEncoder. The datasets were split into training and testing sets at a 4:1 ratio using stratified sampling to ensure consistent category distribution. The model was trained using an end-to-end optimization strategy employing the AdamW optimizer with gradient clipping. The class-weighted cross-entropy loss function was employed, with the weights calculated dynamically from the class sample sizes. Learning rate scheduling adopted the ReduceLROnPlateau strategy based on the accuracy of the validation set (patience = 2, decay factor = 0.5) [41]. The number of training epochs and the batch size were fixed to ensure optimal memory utilization (peak VRAM occupancy of 15.9 GB). Model parameter initialization followed a normal distribution [42] (in the SLSTM and improved MLSTM layers) with orthogonal initialization for the recurrent weights [43].
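The learning rate schedule can be illustrated with a simplified stand-in for a plateau-based scheduler, using the patience (2) and decay factor (0.5) given above. The class below is a hypothetical, stripped-down sketch: it omits options such as improvement thresholds, cooldown, and a minimum learning rate that full library schedulers provide, and the initial learning rate in the usage lines is an example value.

```python
class PlateauScheduler:
    """Halve the learning rate when validation accuracy stops improving."""

    def __init__(self, lr, patience=2, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy and return the new rate."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Decay only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

scheduler = PlateauScheduler(lr=1e-3)
for acc in [0.50, 0.60, 0.60, 0.60, 0.60]:
    print(acc, scheduler.step(acc))
```

In this trace the rate stays at its initial value while accuracy improves or has stalled for at most two epochs, and is halved on the third consecutive epoch without improvement.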
Ablation Models 1, 2, and 3 were implemented using a modular algorithm structure to ensure that all hyperparameters except those changed to realize the target architecture were consistent.
3.4. Evaluation Metrics
This study employed a stratified hold-out validation strategy to ensure the rigor and reproducibility of the assessment. This strategy divides the data at the subject level, allocating 80% of the subjects to training and distributing the remaining 20% evenly between the validation set and the test set, thereby preventing data leakage between individuals. This design is based on two key considerations. First, time-series data exhibit significant autocorrelation, and the traditional cross-validation method leads to overlapping time windows, resulting in evaluation bias. Second, hold-out validation is more in line with practical clinical application scenarios: the performance of the model on an independent group of subjects reflects its generalization ability, providing an unbiased estimate of the model’s performance.
The global accuracy (ACC) [44], class-weighted F1 score [45], and normalized confusion entropy (NCE) [46] were used as multilevel quantitative metrics to evaluate the classification performance of each ablation model. The ACC was defined as the proportion of test samples for which the predicted label equals the true label, evaluated with an indicator function over the pairs of predicted and true labels; it reflects the overall classification accuracy but is sensitive to class imbalance.
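The F1 score was calculated as the weighted average of the per-class F1 scores, where each per-class score is the harmonic mean of the precision and recall of that class and the weight is proportional to the number of test samples of the class. Note that the weighting strategy employed in this study assigned higher weights to majority classes, which is suitable for the long-tailed distribution characteristics of clinical data.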
Finally, the NCE was determined from the confusion matrix, each entry of which represents the number of samples of a given true class that were predicted as a given class; it quantifies the degree of confusion in the misjudgments of the model, with 0 indicating no confusion.
These indices comprise an orthogonal evaluation space incorporating two dimensions (classification accuracy and robustness) and strictly satisfy scale invariance (NCE normalization). All calculations were based on nonparametric estimations to avoid distribution assumption bias.
5. Discussion
5.1. Model Performance
The SleepXLSTM model was proposed in this study to extract heart rate signal features for automatic sleep staging. The results of various evaluation experiments demonstrated that the introduction of physiological constraints using the KGNN complemented the effects of the XLSTM components, allowing SleepXLSTM to exhibit advanced automatic sleep staging capabilities.
The experimental results indicate that SleepXLSTM exhibited better performance in terms of ACC, recall, and F1 score than any of the ablated variants designed in this study or the conventional XLSTM algorithm. The root of this performance advantage lies in cross-scale dynamic coupling at the network architecture level through the design of the SLSTM gating system, improved MLSTM excitation–inhibition dual-channel regulation, and sparse feature mapping at the fully connected layer.
The internal parameter distribution characteristics and functional implementation mechanism of the MLSTM network can be revealed through an analysis of its input-to-hidden weight matrix, as shown in Figure 19 (left). The weight range was concentrated within [−0.1, +0.1] with a horizontal band-like distribution, indicating that the weight values of the same hidden unit exhibited continuous gradient changes across input dimensions. However, the subtle weight distribution patterns between adjacent hidden units indicate that the network formed a regional preference for feature encoding during the learning process and suggest that a specific hidden unit group tended to produce a systematic response to a continuous interval of input features. This is related to the soft selection of features at the input gate using the Sigmoid function and the generation of candidate memory units using the Tanh function, which ensure that adjacent hidden units follow a smooth transition during feature encoding.
In the hidden state weight transition matrix shown in Figure 19 (right), which relates the hidden units of the previous time step to those of the current time step, the weight distribution presents unstructured characteristics with no obvious trends, and its range of [−0.15, +0.15] is wider than that for the input hidden weight matrix. This wider dispersion reveals two key properties of the hidden state updates: (1) there was no evident directional constraint on the connection strength between units, and (2) the state update of each hidden unit was simultaneously affected by multiple predecessor units. Furthermore, the density of this matrix indicates that the model realized nonlinear state evolution through global parameter optimization, which is consistent with the mechanism of coordination between the forget and output gates in the improved MLSTM and SLSTM.
These different matrix distribution characteristics reflect the design goals of the model architecture. The band-like structure of the input weight matrix reflects the spatial continuity requirement of feature encoding, with the local correlation constraint improving feature extraction efficiency. The unstructured distribution of the hidden state weight transition matrix reflects the global coupling requirements of the state evolution and confirms that the complex time-series pattern was modeled using a high-degree-of-freedom parameter space.
Note that the use of the dual-channel design employing excitatory and inhibitory weights is equivalent to introducing a dynamic bias term corresponding to the excitation–inhibition balancing mechanism in neuroscience. As shown in Figure 20, the positive bias of the excitatory weights enhanced the signal-to-noise ratio of feature transmission, whereas the negative bias of the inhibitory weights actively inhibited irrelevant features. Critically, integrating this coupled method into the attention mechanism of the MLSTM provided more refined temporal feature extraction than MLSTM alone.
The weight distributions of the fully connected layers in the proposed network, shown in Figure 21, exhibit multimodal characteristics in which the weights at the end of the classifier present an obvious heavy-tailed distribution. The Gaussian distribution of shallower weights in the model ensures smooth mapping of the initial features, whereas the peaked distribution of deeper weights in the model indicates the establishment of sparse discriminative feature connections in the feature space. Furthermore, the correlation heat map in Figure 22 suggests that the correlation coefficients in the MLSTM incorporating the enhanced excitation–inhibition regulator reached 0.11, 0.22 and 0.07, respectively, which are higher than the correlation coefficients inside the fully connected layer (−0.03, 0.04 and 0.16). In particular, the asymmetric structure of the weight correlation matrix (i.e., the correlation between the hidden state weights and attention in the first modified MLSTM layer is higher than that in the fully connected SLSTM layer) suggests a clear direction for information flow. This cross-module correlation feature confirms the effectiveness of the proposed architecture. Notably, the temporal features extracted by the improved MLSTM and the spatial features screened by the attention mechanism realized complementary enhancement through orthogonal projection rather than simple feature stacking.
At the micro-level, the control of excitation–inhibition weights in the improved MLSTM achieved a dynamic balance of feature transfer. At the meso-level, the multimodal distribution of the fully connected layer constructed a hierarchical feature extraction path. At the macro-level, intermodule correlation ensured the collaborative optimization of multiscale features. This multiscale coupling design can have particularly significant impacts on small-sample time-series classification tasks, such as sleep staging. Furthermore, the attention coupling layer of the MLSTM was improved to reduce the complexity of the model through parameter sharing, and the sparse connection of the fully connected layer prevented overfitting. The dynamic balance of the MLSTM and SLSTM improved the ACC of the model by 26.2% compared with that of the neural network model reported by Walch et al. [12], as shown in Table 1, confirming state-of-the-art automatic sleep staging performance. Furthermore, the SleepXLSTM model outperformed the other state-of-the-art models compared in Figure 17 in terms of the kappa coefficient. The observed improvement in multiple performance indicators reflects the ability of SleepXLSTM to analyze sleep stages accurately and reliably using heart rate signals, informing assessments of sleep quality.
5.2. Subject-Wise Performance Variance and Robustness Analyses
To evaluate the generalization ability of the model across different individuals and detect potential overfitting, we conducted subject-level cross-validation and robustness tests, as described by Mu-Jiang-Shan Wang et al. [47]. The data were divided by individual (based on the first dataset), and the leave-one-subject-out (LOSO) strategy was adopted. For each subject, the data of that subject were used as the test set, while the data of the remaining subjects were used as the training set. The model was repeatedly trained and validated, and the accuracy, precision, recall, and F1 score of each subject on the test set are shown in Figure 23. After completing all rounds, the mean, standard deviation, minimum value, and maximum value of each performance indicator were calculated to quantify the inter-subject variance of model performance, as shown in Figure 24. The distribution histogram and box plot of the performance indicators are displayed in Figure 25 to visualize the differences.
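The LOSO procedure can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's implementation; the helper names `loso_splits` and `summarize`, the subject identifiers, and the per-subject accuracies are hypothetical, and the population standard deviation is an assumption because the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one per subject."""
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


def summarize(per_subject_scores):
    """Mean, standard deviation, minimum, and maximum of a metric across subjects."""
    values = list(per_subject_scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population standard deviation
        "min": min(values),
        "max": max(values),
    }


# Each sample (e.g., one heart rate window) is tagged with the subject it came from.
subject_ids = ["s1", "s1", "s2", "s3", "s3"]
for held_out, train_idx, test_idx in loso_splits(subject_ids):
    print(held_out, train_idx, test_idx)

# Placeholder accuracies standing in for the result of training on each split.
print(summarize({"s1": 0.8, "s2": 0.9, "s3": 1.0}))
```

Splitting by subject rather than by sample keeps every window of the held-out subject out of training, so the reported variance reflects differences between individuals.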
The LOSO cross-validation results in Figure 23 demonstrate that the model maintained a high level of performance consistency across all subjects. Moreover, the summary of the model’s performance indicators in Figure 24 shows high mean values and low standard deviations, indicating limited variance among subjects.
To assess the robustness of the model, Gaussian noise (with a mean of 0 and a standard deviation of 5–10% of the standard deviation of the original data) was added to the input heart rate data during the test phase. The boxplot in Figure 25 reflects the concentrated distribution of the performance metrics, which further indicates the robust generalization ability of the model. The decrease in model performance was used to calculate the robustness coefficient (the ratio of the accuracy rates before and after noise was added; Figure 26).
In the robustness test, the model maintained relatively stable accuracy under different noise levels, which shows its good fault tolerance, that is, its ability to withstand interference.
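The noise test can be illustrated with the following minimal sketch. The function names are hypothetical, the heart rate values are toy data, and the direction of the ratio (noisy accuracy divided by clean accuracy, so that 1.0 means no degradation) is an assumption, because the text does not specify which accuracy is the numerator.

```python
import random
from statistics import pstdev


def add_gaussian_noise(signal, relative_std, seed=0):
    """Add zero-mean Gaussian noise whose standard deviation is a fraction of the signal's."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sigma = relative_std * pstdev(signal)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def robustness_coefficient(acc_clean, acc_noisy):
    """Assumed definition: accuracy with noise divided by accuracy without noise."""
    return acc_noisy / acc_clean


heart_rate = [60, 62, 65, 70, 68, 64]  # toy values in beats per minute
for level in (0.05, 0.10):             # 5% and 10% of the original standard deviation
    noisy = add_gaussian_noise(heart_rate, level)
    print(level, [round(v, 2) for v in noisy])

# Placeholder accuracies; in practice these come from evaluating the trained model.
print(robustness_coefficient(acc_clean=0.90, acc_noisy=0.81))
```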
5.3. Hyperparameter Sensitivity Analysis
To optimize the hyperparameter configuration, we conducted a hyperparameter sensitivity analysis. The hidden layer size, number of network layers, learning rate, weight decay coefficient, and batch size were analyzed in a multilevel experimental scheme. The hidden layer size was varied in steps of 50 from 100 to 300. The number of network layers was varied in steps of 2 from 4 to 12. The learning rate was evaluated at five levels using a log scale (1 × 10⁻⁵, 1 × 10⁻⁴, 3 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³) and the weight decay was evaluated at five levels (0, 1 × 10⁻⁴, 1 × 10⁻³, 5 × 10⁻³, 1 × 10⁻²). Five batch sizes were assessed (64, 128, 256, 512, 1024). A two-stage experimental strategy was used. In the first stage, a grid search over the hidden layer size and number of network layers was performed, generating 25 parameter combinations while the other parameters were fixed at their optimal values. In the second stage, a univariate sensitivity analysis centered on the optimal configuration was performed while changing only one hyperparameter at a time. The model was trained on fixed data splits for 100 epochs per configuration, and performance metrics such as accuracy and F1 score for the validation and test sets were recorded.
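The two-stage scheme can be sketched as follows. The value lists come from the description above, whereas the function names and the `best` configuration are hypothetical placeholders rather than the optimum reported in the study.

```python
from itertools import product

# Search ranges as described in the text.
SEARCH_SPACE = {
    "hidden_size": list(range(100, 301, 50)),  # 100, 150, 200, 250, 300
    "num_layers": list(range(4, 13, 2)),       # 4, 6, 8, 10, 12
    "learning_rate": [1e-5, 1e-4, 3e-4, 5e-4, 1e-3],
    "weight_decay": [0.0, 1e-4, 1e-3, 5e-3, 1e-2],
    "batch_size": [64, 128, 256, 512, 1024],
}


def stage_one_grid(fixed):
    """Stage 1: full grid over hidden size and depth, other parameters held fixed."""
    pairs = product(SEARCH_SPACE["hidden_size"], SEARCH_SPACE["num_layers"])
    return [{**fixed, "hidden_size": h, "num_layers": n} for h, n in pairs]


def stage_two_sweeps(best):
    """Stage 2: vary one hyperparameter at a time around the best configuration."""
    return [{**best, name: value}
            for name, values in SEARCH_SPACE.items()
            for value in values]


# Hypothetical centre point, used only to make the sketch runnable.
best = {"hidden_size": 200, "num_layers": 8, "learning_rate": 3e-4,
        "weight_decay": 1e-4, "batch_size": 256}

print(len(stage_one_grid(best)))    # 5 hidden sizes x 5 depths
print(len(stage_two_sweeps(best)))  # 5 hyperparameters x 5 levels each
```

Each returned dictionary would then be passed to a training routine for 100 epochs on the fixed data splits.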
Univariate line plots are shown in Figure 27, and the performance elasticity (the ratio of accuracy change to hyperparameter change) for each hyperparameter is plotted in Figure 28 to identify the hyperparameters that have the greatest impact on the model.
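One way to compute such an elasticity is sketched below. The text does not state whether absolute or relative changes were used; this sketch assumes relative changes with respect to a baseline configuration, which makes values comparable across hyperparameters with different units. The function name and the numbers are hypothetical.

```python
def elasticity(acc_base, acc_new, value_base, value_new):
    """Relative accuracy change divided by relative hyperparameter change (assumed form)."""
    if acc_base == 0 or value_base == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    if value_new == value_base:
        raise ValueError("hyperparameter value must differ from the baseline")
    rel_acc_change = (acc_new - acc_base) / acc_base
    rel_value_change = (value_new - value_base) / value_base
    return rel_acc_change / rel_value_change


# Placeholder numbers: doubling the learning rate raises accuracy from 0.80 to 0.84.
print(elasticity(acc_base=0.80, acc_new=0.84, value_base=1e-4, value_new=2e-4))
```

A larger absolute value indicates that accuracy reacts more strongly to a given proportional change in that hyperparameter.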
The results in Figure 28 show that the model is sensitive to changes in the learning rate and hidden layer size but relatively robust to changes in batch size and weight decay. This analysis verifies the robustness of the selected parameter configuration and the rationality of the model design.
5.4. Limitations and Future Work
Although the incorporation of more parameters into SleepXLSTM increased its accuracy, this also increased its complexity; this increase was considered justified by the accuracy with which the model automatically identified sleep stages (Awake, N1, N2, N3 and REM). Notably, the network architecture, which comprises the KGNN, MLSTM and SLSTM modules, addresses current challenges and establishes a foundation for future improvements. These improvements may include strategies such as network lightweighting to further optimize sleep staging performance while reducing complexity. Furthermore, note that this study preprocessed heart rate data using min–max linear transformation [30] to eliminate dimensional differences. However, recent studies have shown that heart rate variability can be used to enrich data features and thereby realize additional prediction functions (e.g., sleep pressure) [48]. Therefore, future research should adopt richer heart rate signal preprocessing to explore additional prediction functions.
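For reference, a min–max linear transformation can be sketched as follows. The target range [0, 1] and the handling of a constant signal are assumptions, as the text does not specify them; the function name and heart rate values are hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale values to [0, 1] (assumed target range)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant signal maps to zeros to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([50, 75, 100]))  # toy heart rate values in beats per minute
```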
There is further room for improvement of the proposed SleepXLSTM model. At present, the model uses a 30-step time window for feature extraction as this is sufficient to effectively capture short-term and medium-term heart rate fluctuation patterns. However, limitations in modeling the long-term dependence between periods (e.g., the REM to NREM transition) remain. Therefore, future research should introduce a hierarchical temporal attention mechanism [49,50] and/or the WaveNet convolutional structure [51,52] to expand the scope of temporal perception while maintaining computational efficiency; however, a progressive training strategy must be designed to prevent the gradient explosion problem under this approach. In addition, verification of the clinical applicability of the proposed model remains limited because the model interprets only single-mode ECG data. Future research is planned to simultaneously collect physiological parameters such as ECG, EEG, respiratory, and blood oxygen signals to construct a cross-modal association verification system.
Furthermore, research on model compression and quantization is necessary to satisfy the embedded deployment requirements of medical devices. We analyzed the complexity of the SleepXLSTM model. The total parameter count includes the elements in all weight matrices and bias vectors. The model size was calculated based on the theoretical memory footprint, assuming 32-bit floating-point precision. FLOPs were estimated using a layer-wise theoretical analysis that sums the operations required for a single forward pass. CPU inference time was measured by averaging 100 runs with a batch size of 1 and a sequence length of 30. The analysis reveals that while the current architecture does not directly satisfy the stringent real-time constraints of wearable devices, Configuration 1 (in Table 3) shows clear potential for deployment with further optimization. Therefore, the hierarchical redundancy of the MLSTM and SLSTM could be reduced through model optimization (selective pruning), and mixed-precision quantization could be used to compress the number of model parameters to less than 1/5 of the existing parameters while maintaining the classification accuracy threshold. Notably, the dynamic sparse training paradigm [53] and hardware-aware distillation [54] techniques proposed in recent studies may provide new technical paths for deploying high-precision sleep-monitoring models in resource-constrained environments, with the expectation that future models will function accurately and be sufficiently lightweight to be installed in contactless devices.
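The complexity measures described above can be illustrated with a minimal sketch. The parameter formula is for a standard LSTM layer with one bias vector per gate, used here only as a stand-in because the SLSTM and MLSTM layer definitions are not reproduced in this section; all function names and layer sizes are hypothetical.

```python
import time


def lstm_layer_params(input_size, hidden_size):
    """Parameters of one standard LSTM layer: four gates, each with
    input weights, recurrent weights, and one bias vector (assumed layout)."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def model_size_bytes(num_params, bytes_per_param=4):
    """Theoretical memory footprint; 4 bytes per parameter for 32-bit floats."""
    return num_params * bytes_per_param


def average_latency(fn, runs=100):
    """Mean wall-clock time of fn() over a number of runs, in seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


params = lstm_layer_params(input_size=1, hidden_size=100)  # hypothetical layer
print(params, model_size_bytes(params) / 1024, "KiB")

# The lambda stands in for one forward pass with batch size 1 and sequence length 30.
print(average_latency(lambda: sum(range(30)), runs=100))
```

Under these assumptions, reducing the parameter count to one fifth reduces the 32-bit footprint by the same factor, and lower-precision storage would reduce it further.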
The automatic sleep staging model based on SleepXLSTM can in the future be integrated into a flexible biosensor platform, realizing an intelligent upgrade of non-contact wristband devices. By analyzing a single physiological signal, such a device overcomes the physical limitations of traditional contact electrodes, significantly enhancing user compliance while maintaining medical-grade monitoring accuracy. Notably, its non-intrusive design can continuously collect physiological parameters of infants and young children during sleep, overcoming the susceptibility of traditional polysomnography (PSG) to disturbance by limb movements and establishing a safe and reliable long-term sleep monitoring system for children in their critical development period.
From a public health perspective, the widespread adoption of this technology will complement existing sleep health management models. A cloud-based sleep quality assessment platform would enable dynamic tracking of residents’ sleep parameters and early warning of abnormal fluctuations. The accumulated long-term sleep data not only provide a scientific basis for individualized sleep intervention plans but can also, through the analysis of group sleep characteristics, reveal potential correlations between environmental factors, lifestyles and sleep disorders. This “prevention-monitoring-intervention” full-chain management model is expected to reduce the treatment costs of conditions such as chronic insomnia and circadian rhythm disorders. Importantly, this highly sensitive identification capability enables community medical institutions to use low-cost wearable devices for large-scale screening of sleep disorders. In particular, it shows significant application value in the early detection of sleep apnea syndrome (SAS) among the elderly population.
Looking ahead, with continuing advances in Internet of Things technology, the new generation of devices may integrate environmental sensor modules to simultaneously monitor external factors that affect sleep, such as light intensity, temperature and humidity. This multi-dimensional data analysis capability, combined with the continuous optimization of artificial intelligence algorithms, will drive the evolution of personalized sleep health management systems from simple stage identification to intelligent terminals that offer predictive intervention suggestions.