1. Introduction
Sleep is a vital physiological process necessary for maintaining both physical and mental health. Scientific evaluation of sleep structure plays a key role in diagnosing sleep disorders, assessing disease risk, and guiding clinical treatments [1]. Traditionally, sleep staging is conducted by experts through manual analysis of polysomnography (PSG) recordings, following the standards set by the American Academy of Sleep Medicine (AASM). This process involves interpreting multi-channel and multi-modal signals such as electroencephalogram (EEG), electrooculogram (EOG), and electromyogram (EMG). However, this manual approach is time-consuming, labor-intensive, and inherently subjective [2], limiting its scalability in large-scale health screenings and routine clinical practice.
The burgeoning adoption of wearable health technologies and the accelerated growth of home-based medical monitoring have created pressing needs for resource-efficient sleep staging frameworks capable of automated real-time analysis [3,4]. Contemporary methodologies employ advanced computational approaches to derive discriminative features from neurophysiological recordings (e.g., EEG), facilitating precise algorithmic categorization of sleep architecture into five distinct states (Wake, N1-N3, REM) [5,6]. This domain has witnessed transformative advancements following the integration of deep neural architectures [7,8].
Initial computational solutions for sleep stage classification predominantly utilized manually engineered features derived through temporal, spectral, or time-frequency analyses, subsequently processed by classical machine learning paradigms including support vector machines and ensemble decision trees [9,10]. These conventional approaches nevertheless demonstrated constrained cross-domain adaptability due to the inherent non-stationarity and inter-individual variability characteristic of EEG manifestations during sleep [11]. The introduction of convolutional neural networks (CNNs) to this task around 2016 revolutionized feature abstraction capabilities [12,13]. Seminal work by Supratak et al. [7] introduced DeepSleepNet, which pioneered the synergistic integration of CNNs with bidirectional recurrent networks (BiLSTM), achieving unprecedented classification performance. Subsequent research has focused on architectural refinements of convolutional operators and temporal dependency modeling mechanisms to better accommodate the electrophysiological heterogeneity present in polysomnographic data [8,12,13].
The progressive deepening of neural network architectures has precipitated a corresponding surge in parametric complexity and computational overhead, creating significant barriers for resource-constrained implementations on mobile and embedded health monitoring platforms. In response to these constraints, the research community has investigated multiple optimization approaches: (1) compact convolutional topologies (e.g., depth-wise separable architectures [14], fire modules [15]), (2) network sparsification and precision reduction techniques [16], and (3) teacher-student knowledge distillation [17]. While effective for model compression, these methods frequently impair the hierarchical temporal representation capacity, ultimately diminishing model fidelity and cross-domain adaptability.
Clinical polysomnographic analysis reveals that EEG manifestations contain inherently multi-resolution temporal signatures (exemplified by delta-wave predominance in N3 sleep versus theta/gamma interplay during REM), necessitating sophisticated multi-scale feature aggregation for optimal staging performance [18,19]. Temporal convolutional architectures employing dilated kernels have demonstrated particular efficacy in establishing extended contextual awareness across divergent time scales [20]. Seminal implementations include Chambon et al.'s [13] stacked dilation framework for automated feature abstraction and Tsinalis et al.'s [12] systematic investigation of dilation coefficient impacts on stage-specific pattern recognition.
Contemporary TCN implementations nevertheless exhibit two critical limitations: (1) predetermined dilation configurations lacking input-adaptive flexibility [13], and (2) suboptimal local-global feature synthesis at transient sleep stage boundaries. Recent innovations attempt to mitigate these through hybrid multi-branch designs (e.g., MS-TCN's parallelized multi-rate pathways [20]) or gated feature modulation mechanisms [21]. While MS-TCN [20] demonstrates improved multi-scale processing through static branch concatenation, it remains constrained by fixed fusion coefficients that prevent dynamic, context-dependent feature reweighting during temporal evolution.
Attention mechanisms have also been widely adopted in sequence modeling tasks, significantly enhancing discriminative capability by focusing on salient input features [22,23]. In automatic sleep staging, channel attention mechanisms such as Squeeze-and-Excitation (SE) blocks [24] and multi-head self-attention (MHA) modules [23] have increased sensitivity to critical features. Moreover, combining attention with dilated convolutions enables adaptive multi-scale feature aggregation [25].
Furthermore, gating mechanisms dynamically regulate information flow among parallel branches. While units such as the Gated Linear Unit (GLU) [26] and gated residual networks [27] are often task-specific, few studies have designed gating mechanisms explicitly for the dynamic fusion of different dilation rates. For example, Chen et al. [28] combined gating with multi-scale branches, but without channel attention or end-to-end adaptive feature selection. Effectively integrating parallel multi-dilated branches, channel attention, and adaptive gating for dynamic multi-scale feature fusion thus remains an open challenge.
Recently, research has increasingly focused on developing efficient, lightweight, and highly generalizable models for sleep staging. Notable examples include AttnSleep [29], which uses multi-head self-attention for long-range dependency modeling, and MS-TCN [20] and NAMRTNet [30], which combine multi-scale TCN structures with channel attention. However, some of these methods involve complex architectures with large parameter counts, leading to high deployment costs, while others rely on fixed fusion weights and cannot dynamically adapt branch contributions to input data.
With the growing adoption of single-channel EEG in portable sleep monitoring, there is a pressing need for models that combine high accuracy, compact size, strong generalization, and efficient deployment [31,32]. To address these challenges, we propose ADG-SleepNet, a novel lightweight neural network architecture that integrates parallel multi-dilated convolutional branches, an adaptive gating mechanism, and channel attention. Our model extracts multi-scale temporal features in parallel and adaptively allocates branch importance based on input signals, while channel attention further enhances feature selection. This streamlined design enables significant reductions in parameter count and computational cost, supporting efficient deployment in resource-constrained environments. Experimental results demonstrate that ADG-SleepNet achieves state-of-the-art performance on public datasets, matching or exceeding leading lightweight models [7,32] and exhibiting strong generalization and application potential [5].
2. Materials and Methods
This section provides a comprehensive description of the materials and methodologies employed in this study. We first introduce the datasets used and the preprocessing procedures applied to the raw EEG signals. Subsequently, we detail the architecture and core components of the proposed ADG-SleepNet model.
2.1. Dataset and Preprocessing
To ensure the scientific rigor and broad applicability of model evaluation, this study utilizes the internationally recognized Sleep-EDF Database Expanded [33,34]. Specifically, two subsets of this database are employed for systematic model training, validation, and generalization testing. In addition, all raw EEG signals undergo preprocessing steps, including signal selection and the removal of irrelevant phases, prior to being input into the model. These procedures are essential for maximizing data quality and enhancing the stability of model training. The following sections provide a detailed description of the dataset composition and the corresponding signal preprocessing workflow.
2.1.1. Dataset Description
This study utilized two subsets from the Sleep-EDF Database Expanded to evaluate the performance and generalization ability of the proposed model. Sleep-EDF-20 consists of overnight EEG recordings from 20 healthy adults, with each subject undergoing two consecutive nights of monitoring and approximately eight hours of EEG data collected per night. This subset was used primarily for model training and cross-validation. In contrast, Sleep-EDF-78 includes single-night or multi-night EEG recordings from 78 different subjects, serving as a benchmark for assessing the cross-subject generalization capability of the model.
All EEG signals were recorded from the Fpz–Cz channel at a sampling rate of 100 Hz. Detailed statistics regarding the number of subjects and the class-wise distribution of sleep stages are presented in Table 1. To ensure robust and reliable evaluation, a 20-fold cross-validation strategy was adopted: in each iteration, one fold was reserved for testing while the remaining folds were used for model training and validation [33].
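As an illustration of this evaluation protocol, the sketch below performs a subject-wise 20-fold split with scikit-learn (not the authors' released code; variable names are hypothetical):

# Illustrative sketch: subject-wise 20-fold cross-validation as described above.
import numpy as np
from sklearn.model_selection import KFold

subject_ids = np.arange(20)                        # e.g., the 20 Sleep-EDF-20 subjects
kf = KFold(n_splits=20, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kf.split(subject_ids)):
    train_subjects = subject_ids[train_idx]        # 19 folds: training + validation
    test_subjects = subject_ids[test_idx]          # 1 fold held out for testing
    # ... load the epochs of each subject group, train the model, evaluate ...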
Figure 1 presents EEG signal samples corresponding to the five sleep stages from the original dataset. Each stage is represented by a single 30 s segment, serving as an individual sample that characterizes the typical EEG activity of that stage. The signals from different stages show clearly distinguishable differences in both frequency and amplitude. The Wake (W) stage is characterized by low-amplitude mixed-frequency signals. In contrast, the N1 stage begins to exhibit more pronounced theta waves, accompanied by changes in signal amplitude. The N2 stage is marked by sleep spindles and K-complexes, with the signal showing regular oscillatory patterns. The N3 stage, representing slow-wave sleep, is distinguished by high-amplitude delta waves, while the REM stage displays EEG signals resembling those of the wakefulness stage, with low-amplitude mixed-frequency patterns and rapid eye movements.
To visually assess the separability of traditional time-frequency statistical features across different datasets, we first extracted the average power in four frequency bands (0.5–4 Hz, 4–8 Hz, 8–13 Hz, and 13–30 Hz) for each 30 s segment, along with the mean, standard deviation, and peak-to-peak value of the channel signal, resulting in a total of seven handcrafted features. The feature sets of all samples from the Sleep-EDF-20 and Sleep-EDF-78 subsets were then input into the t-SNE algorithm for dimensionality reduction and projection, as shown in Figure 2. The samples from different sleep stages in both datasets are highly mixed in the two-dimensional plane and do not form distinct clusters, indicating that these time-frequency statistics alone are insufficient for distinguishing the stages in a low-dimensional space. This highlights the necessity of deep models for learning higher-order nonlinear representations.
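The following sketch illustrates this handcrafted-feature extraction and t-SNE projection (a minimal illustration; the spectral-estimation settings are assumptions, since only the band limits and the three time-domain statistics are specified above):

# Illustrative sketch: seven handcrafted features per 30 s epoch, then t-SNE.
import numpy as np
from scipy.signal import welch
from sklearn.manifold import TSNE

FS = 100                                            # sampling rate (Hz)
BANDS = [(0.5, 4), (4, 8), (8, 13), (13, 30)]       # delta, theta, alpha, beta

def handcrafted_features(epoch):
    """Four band powers plus mean, standard deviation, and peak-to-peak value."""
    freqs, psd = welch(epoch, fs=FS, nperseg=256)
    band_powers = [psd[(freqs >= lo) & (freqs < hi)].mean() for lo, hi in BANDS]
    return np.array(band_powers + [epoch.mean(), epoch.std(), np.ptp(epoch)])

def tsne_projection(epochs):
    feats = np.stack([handcrafted_features(e) for e in epochs])   # (n_epochs, 7)
    return TSNE(n_components=2, random_state=0).fit_transform(feats)  # (n_epochs, 2)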
2.1.2. Signal Preprocessing
The preprocessing procedure in this study first extracts EEG signals from specified channels in the raw PSG files and verifies that their sampling rate and recording duration are consistent with the sleep stage information in the annotation files. Then, based on the start time and duration of each segment from the annotation file, the sleep stages (W, N1, N2, N3, REM) and labels such as “MOVE” and “UNK” are mapped to numerical values using a predefined mapping dictionary, and the corresponding labels are assigned to signal segments based on 30 s epochs. After synchronizing the signals with the labels, the first and last non-“W” (wake) epoch indices are located, and 60 epochs (i.e., 30 min) are extended forward and backward to retain the transition data before and after sleep onset. Only the signals and labels within this interval are kept, while long wake segments at both ends are discarded. Subsequently, all epochs labeled as “MOVE” and “UNK” are further removed, resulting in high-quality data that only includes the five standard sleep stages. Finally, the processed signal matrix, label vector, sampling rate, original start time, total duration, and the number of original and filtered epochs, along with other metadata, are saved for subsequent analysis and model training.
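A minimal sketch of the trimming and filtering logic described above is given below (assuming the recording has already been epoched into 30 s segments aligned with integer labels; the mapping dictionary shown is hypothetical):

# Illustrative sketch of the sleep-period trimming and label filtering step.
import numpy as np

LABEL_MAP = {"W": 0, "N1": 1, "N2": 2, "N3": 3, "REM": 4, "MOVE": 5, "UNK": 6}

def trim_and_filter(signals, labels, context_epochs=60):
    """Keep 30 min before/after the sleep period and drop MOVE/UNK epochs."""
    non_wake = np.where(labels != LABEL_MAP["W"])[0]
    start = max(non_wake[0] - context_epochs, 0)                 # 30 min before sleep onset
    end = min(non_wake[-1] + context_epochs, len(labels) - 1)    # 30 min after final sleep epoch
    signals, labels = signals[start:end + 1], labels[start:end + 1]
    keep = labels <= LABEL_MAP["REM"]                            # retain only the five stages
    return signals[keep], labels[keep]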
Figure 3 shows a comparison of the sleep stage distribution before and after preprocessing in a segment of the dataset.
2.2. Methods
ADG-SleepNet is an end-to-end network designed for sleep stage classification, aiming to strike a balance between model performance and lightweight deployment. The overall architecture of ADG-SleepNet is depicted in Figure 4, which illustrates how the network extracts multi-scale temporal features from raw EEG signals through a series of 1D convolutional layers and pooling operations. Subsequently, the Multi-Dilated Gated TCN module is employed: its parallel multi-dilation convolutional branches (PMDC Branches) submodule applies multiple depth-wise separable convolutional branches in parallel, each using a different dilation rate, and is integrated with the adaptive weighted dilation module (AWDM) submodule, which combines channel attention with an adaptive gating mechanism to dynamically select the most discriminative channels while adaptively adjusting the weight of each dilation branch based on the input signal, thereby achieving effective fusion of short-term high-frequency and long-term low-frequency features. Finally, the fused features are processed through a global average pooling layer and a fully connected classification layer to yield the predicted probabilities for the five sleep stages (W, N1, N2, N3, and REM). This end-to-end framework not only achieves accuracy comparable to or surpassing mainstream methods but also significantly optimizes parameter efficiency and computational cost, demonstrating strong potential for practical deployment.
2.2.1. Multi-Scale Encoder
In sleep staging, EEG signals exhibit distinct temporal and spectral characteristics across different sleep stages. For instance, light sleep stages (N1 and N2) are typically characterized by short-duration, high-frequency theta and alpha waves, while deep sleep (N3) is dominated by long-duration, low-frequency delta waves. To effectively capture these varying characteristics, the proposed module employs a two-stage convolutional encoding strategy that integrates multi-scale convolutional designs to efficiently extract temporal features from EEG signals.
As shown in Figure 5, the multi-scale encoder module consists of a two-stage convolutional encoder designed to capture both long-term low-frequency features through large-kernel convolutions and short-term high-frequency features via small-kernel convolutions. The raw EEG signal is first processed by a large-kernel convolutional layer with a kernel size of 50 and a stride of 6. This configuration increases the receptive field, enabling the extraction of coarse-grained, long-term low-frequency features associated with deep sleep stages. The convolution operation is formulated as follows:

$H_{\mathrm{L}} = W_{\mathrm{L}} * X + b_{\mathrm{L}}$

Here, $X$ denotes the input signal, $W_{\mathrm{L}}$ represents the convolutional kernel, and $b_{\mathrm{L}}$ is the bias term. The output of this operation is subsequently passed through a ReLU activation function:

$A_{\mathrm{L}} = \mathrm{ReLU}(H_{\mathrm{L}}) = \max(0, H_{\mathrm{L}})$
Subsequently, the signal processed by the large-kernel convolution is subjected to a max pooling operation with a window size of 8 and a stride of 8, which further downsamples the signal and smooths out noise. Additionally, a dropout layer with a rate of 0.5 is applied to prevent overfitting:

$P = \mathrm{Dropout}_{0.5}\big(\mathrm{MaxPool}_{8,8}(A_{\mathrm{L}})\big)$
Subsequently, the second stage is initiated, in which the input signals undergo further refinement through three consecutive convolutional layers with small kernel sizes (each with a kernel size of 8 and a stride of 1). The design of these three small-kernel convolutional layers enables the effective capture of rapidly changing high-frequency details in EEG signals, particularly the wave features characteristic of light sleep stages. The small-kernel convolution operations are defined as follows:

$H_j = W_j * A_{j-1} + b_j, \quad j = 1, 2, 3, \quad A_0 = P$

After each small-kernel convolutional layer, the output is processed by a ReLU activation function:

$A_j = \mathrm{ReLU}(H_j) = \max(0, H_j)$
The signals following the small-kernel convolutions are subsequently subjected to a max pooling operation (with a pooling window size of 4 and a stride of 4) for dimensionality reduction, which further extracts high-frequency features and enhances the model's generalization capability:

$X_{\mathrm{enc}} = \mathrm{MaxPool}_{4,4}(A_3)$
At the final stage of this module, the output features are prepared for the subsequent Multi-Dilated Gated TCN module. In this manner, the multi-scale encoder demonstrates exceptional capability in extracting features at multiple scales, thereby providing richer representations for the downstream network layers and significantly enhancing the model's classification performance. The architecture illustrated in Figure 5 clearly delineates this multi-scale convolutional process, enabling the network to effectively capture both low-frequency (deep sleep) and high-frequency (light sleep) signal characteristics, thus facilitating accurate classification of complex sleep stages.
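A Keras-style sketch of this two-stage encoder is given below (a minimal illustration; the filter counts are assumptions, as only kernel sizes, strides, pooling windows, and the dropout rate are specified above):

# Illustrative sketch of the multi-scale encoder (Figure 5).
from tensorflow.keras import layers

def multi_scale_encoder(x, filters=64):
    # Stage 1: large-kernel convolution for coarse, long-term low-frequency features
    x = layers.Conv1D(filters, kernel_size=50, strides=6, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling1D(pool_size=8, strides=8)(x)
    x = layers.Dropout(0.5)(x)
    # Stage 2: three small-kernel convolutions for short-term high-frequency details
    for _ in range(3):
        x = layers.Conv1D(filters, kernel_size=8, strides=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return layers.MaxPooling1D(pool_size=4, strides=4)(x)   # X_enc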
2.2.2. PMDC Branches
In temporal data modeling, capturing dependencies across multiple time scales is essential. To tackle this, we propose a method using multi-dilated convolutional branches, where multiple dilated convolutions are applied in parallel to extract features at different scales. This approach enhances the network’s ability to capture temporal dependencies across various time spans.
Traditional convolutional neural networks (CNNs) rely on a fixed receptive field, limiting their ability to model long-range temporal dependencies. While dilated convolutions expand the receptive field without increasing computational cost or stride, a single dilation rate may still be insufficient for complex temporal data. We therefore introduce multi-dilated convolutional branches, in which multiple convolutional branches, each with a different dilation rate, are employed in parallel. Each branch extracts temporal features at a specific scale, equipping the model with strong perceptual capabilities across a broad range of time scales.
In sleep EEG signals, short-term high-frequency oscillations (lasting several hundred milliseconds) often coexist with cross-epoch low-frequency δ waves (lasting from several seconds to over ten seconds). Convolutions with a single receptive field struggle to capture both simultaneously. To this end, as illustrated in Figure 6, we design four parallel depth-wise separable dilated convolutional branches with dilation rates of $d_i \in \{1, 2, 4, 8\}$, thereby covering multi-scale temporal dependencies.
Let the feature map output from the previous stage be denoted as:

$X_{\mathrm{enc}} \in \mathbb{R}^{l \times c}$

where l denotes the temporal length and c represents the number of channels. The output of the i-th branch is denoted as:

$Z_i = \mathrm{DWSepConv}_{k,\, d_i}(X_{\mathrm{enc}}) \in \mathbb{R}^{l \times f}, \quad i = 1, \ldots, 4$

Here, the number of output channels is set to f = 128 to ensure compatibility with the subsequent classifier; the convolutional kernel length is set to k = 3, which strikes a balance between local feature extraction and computational efficiency; and the dilation rate $d_i$ determines the receptive field:

$\mathrm{RF}_i = (k - 1)\, d_i + 1$

Accordingly, the receptive fields of the four branches are 3, 5, 9, and 17 time steps, respectively, allowing for the parallel capture of key signal fluctuations from the millisecond to the second scale. The rationale for employing depth-wise separable convolutions lies in their ability to decouple spatial and pointwise convolutions, thereby reducing the parameter count and computational cost of standard convolutions from the order of $O(k \cdot c \cdot f)$ to $O(k \cdot c + c \cdot f)$. This design significantly decreases the model size and accelerates inference speed, while preserving the advantages of the multi-branch structure. Ultimately, the outputs from each branch are fused with the outputs of the subsequently described adaptive weighted dilation-rate module.
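A corresponding sketch of the four parallel branches is shown below (an illustration rather than the authors' implementation; it assumes Keras' SeparableConv1D with dilation support in the installed TensorFlow version):

# Illustrative sketch of the PMDC branches: four parallel depth-wise separable
# dilated convolutions with dilation rates 1, 2, 4, and 8.
from tensorflow.keras import layers

def pmdc_branches(x_enc, dilation_rates=(1, 2, 4, 8), filters=128, kernel_size=3):
    branches = []
    for d in dilation_rates:
        z = layers.SeparableConv1D(filters, kernel_size, padding="same",
                                   dilation_rate=d)(x_enc)   # receptive field (k-1)*d + 1
        z = layers.BatchNormalization()(z)
        z = layers.Activation("relu")(z)
        branches.append(z)
    return branches   # list of four (B, T, 128) tensors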
2.2.3. AWDM
Capturing dependencies across different time scales is crucial for improving the model’s expressive power and accuracy in temporal data modeling. Conventional convolutional neural networks (CNNs) typically rely on fixed receptive fields, which are effective for capturing local information but limited when processing temporal data with multi-scale dependencies. To address this, we propose the adaptive weighted dilation-rate module. Building on multi-dilated convolutional parallel branches, this module integrates an adaptive gating mechanism with channel attention to dynamically assign weights to each dilation-rate branch, enhancing the model’s ability to capture features at multiple temporal scales.
The core idea of this module is to first extract multi-scale features using dilated convolutions from the parallel branches, and then adjust the weights of each branch through the adaptive gating mechanism. This enables the network to dynamically prioritize branches based on the input signal’s global characteristics. Additionally, the channel attention mechanism refines the focus on key features, selectively strengthening the contribution of important channels within multi-scale temporal representations. This approach allows the model to effectively adapt to both short-term fluctuations and long-term dependencies, capturing a broad range of temporal information. In the adaptive dilation rate gating mechanism, the input signal is first processed through global average pooling to obtain a global feature vector for each sample. This global feature vector is then passed through a fully connected layer, which calculates the gating weights for each branch. These weights reflect the importance of each branch for the current task. Specifically, the weight coefficients learned by this fully connected layer are dynamically adjusted based on the characteristics of the input signal, thereby automatically assigning appropriate weighted coefficients to each dilation rate branch.
The AWDM structure is shown in Figure 7. The output feature tensor of the encoder is $X_{\mathrm{enc}} \in \mathbb{R}^{B \times T \times C}$, where B is the batch size, T is the number of time steps, and C is the number of channels. AWDM first performs global average pooling over the time dimension for each sample, resulting in the compressed sample-level channel description vector:

$g = \frac{1}{T} \sum_{t=1}^{T} X_{\mathrm{enc}}[:, t, :] \in \mathbb{R}^{B \times C}$
The pooled feature represents the overall activation level of each sample, serving as the foundation for attention allocation.
As shown in Figure 7, the AWDM adopts a parallel dual-branch structure: the left side is the channel attention path, and the right side is the adaptive gating path. Both paths take the global pooling result $g$ as input and generate dilation-rate weights through independent fully connected networks.

The attention vector is obtained by applying a dense layer and a Softmax activation to the global pooled descriptor:

$\alpha = \mathrm{Softmax}(W_a\, g + b_a) \in \mathbb{R}^{B \times D}$

Here, $D$ is the number of dilation branches. The gating branch employs an independent set of weights to compute the control distribution via Softmax:

$\gamma = \mathrm{Softmax}(W_g\, g + b_g) \in \mathbb{R}^{B \times D}$
This mechanism further enhances the network’s ability to focus on critical temporal features by emphasizing those channels that are most important for the task. For example, in certain scenarios, the features of some channels may contain richer high-frequency fluctuation information, whereas others may capture low-frequency dependencies. The channel attention mechanism enables the network to adaptively assign appropriate weights to these different channels, ensuring that the contribution of important channels is reinforced when capturing dependencies at different temporal scales.
By combining the gating weights $\gamma$ and the attention weights $\alpha$, the final weighting coefficient $\omega$ for each dilation-rate branch is obtained, which is calculated as follows:

$\omega = \alpha \odot \gamma$

where $\odot$ denotes the element-wise (Hadamard) product.
The resulting features are subsequently fed into the underlying Weighted Summation module for further computation.
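The two weight paths can be sketched as follows (a minimal Keras-style illustration under the description above, in which both paths output one weight per dilation branch; layer sizes beyond the dense-plus-Softmax structure are assumptions):

# Illustrative sketch of the AWDM weight computation.
from tensorflow.keras import layers

def awdm_weights(x_enc, num_branches=4):
    g = layers.GlobalAveragePooling1D()(x_enc)                    # (B, C): pool over time
    attn = layers.Dense(num_branches, activation="softmax")(g)    # channel attention path
    gate = layers.Dense(num_branches, activation="softmax")(g)    # adaptive gating path
    return layers.Multiply()([attn, gate])                        # (B, D): Hadamard product ω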
2.2.4. Weighted Summation
In the previous section, we introduced the adaptive weighted dilation-rate module (AWDM), which produces attention and gating vectors through two parallel branches. These vectors are fused via an element-wise Hadamard product to generate the final weighting coefficients, which modulate the contribution of each dilation-rate branch during the feature fusion process.
This section details how the outputs from different dilation-rate branches are integrated using the learned weights to form a unified multi-scale feature representation.
Let the output of the d-th dilated-convolution branch be denoted as $Z_d \in \mathbb{R}^{B \times T \times C}$, where B, T, and C represent the batch size, temporal length, and number of channels, respectively. To enable dimension alignment for branch-wise weighting, we stack the outputs of all branches along an additional branch axis and reshape the sample-wise weights produced by AWDM to have singleton time and channel dimensions, so that they broadcast over these axes and align with the features. This yields element-wise compatible shapes and allows direct Hadamard multiplication. Consequently, for the b-th sample, the weighted output of branch d is given by:

$\tilde{Z}_{b,d} = \omega_{b,d} \cdot Z_{b,d}$

The final fused representation is obtained by summing the weighted branch outputs along the branch axis, thereby reducing the stacked tensor back to $\mathbb{R}^{B \times T \times C}$; this operation preserves the temporal and channel resolutions and performs only sample-wise adaptive reweighting of the contributions from different dilation rates:

$\tilde{Z}_{b} = \sum_{d=1}^{D} \omega_{b,d}\, Z_{b,d}$
This adaptive fusion mechanism enables the model to dynamically emphasize different temporal receptive fields based on the input, thereby capturing both short-term and long-term temporal dependencies more effectively.
To further enhance feature representation and gradient flow, a residual connection is introduced, as illustrated in Figure 4. Specifically, the encoder output $X_{\mathrm{enc}}$ is first projected to match the dimensionality of $\tilde{Z}$ using a one-dimensional convolution:

$R = \mathrm{Conv1D}_{1 \times 1}(X_{\mathrm{enc}})$

The residual-enhanced feature is then obtained via element-wise addition:

$Y = \tilde{Z} + R$
This operation preserves the original structural information while enabling deep integration with the multi-scale temporal features, thus improving both the expressive power and training stability of the model. Finally, the fused feature is passed through a ReLU activation and fed into the downstream classifier for final prediction. The seamless integration of adaptive fusion and residual connection allows the model to effectively aggregate multi-scale temporal information and adapt to various types of temporal dependencies, ultimately enhancing classification performance and generalization capability.
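The fusion and residual steps can be sketched as follows (an illustration built on the tensors from the previous sketches; TensorFlow broadcasting over the reshaped weights is assumed, and the projection width must match the branch channel count):

# Illustrative sketch of the weighted summation and residual connection.
import tensorflow as tf
from tensorflow.keras import layers

def weighted_sum_with_residual(branches, omega, x_enc, filters=128):
    z_stack = tf.stack(branches, axis=-1)                     # (B, T, C, D)
    w = tf.reshape(omega, (-1, 1, 1, len(branches)))          # (B, 1, 1, D), broadcastable
    z_fused = tf.reduce_sum(z_stack * w, axis=-1)             # (B, T, C)
    residual = layers.Conv1D(filters, kernel_size=1)(x_enc)   # project encoder output to C
    return layers.Activation("relu")(z_fused + residual)      # Y = ReLU(Z~ + Conv1x1(X_enc))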
2.2.5. Classification Head
In this model, the input to the Classification Head module is first subjected to global average pooling along the temporal dimension. The purpose of this operation is to compress the temporal dimension and retain the global information of each feature channel, thereby reducing the computational complexity of subsequent layers and enhancing the expressive power of the features.
Subsequently, the globally pooled features are passed through a fully connected layer, whose output dimension is set to 5, corresponding to the five sleep stages (W, N1, N2, N3, and REM). This fully connected layer is immediately followed by a Softmax activation function, which transforms the predicted scores (logits) for each class into a probability distribution. The Softmax function ensures that the predicted probability for each class lies within the range [0, 1], and that the sum of the probabilities across all classes equals 1.
In this way, the model outputs the predicted probability for each sleep stage. Let the output be $\hat{p} = (\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_5)$, where each $\hat{p}_i$ denotes the predicted probability for the corresponding sleep stage. The final predicted sleep stage is determined by selecting the class with the highest probability, $\hat{y} = \arg\max_i \hat{p}_i$.
Through this process, ADG-SleepNet achieves end-to-end multi-scale dynamic fusion and sleep stage prediction. This framework integrates multi-level temporal feature extraction and adaptive weighting mechanisms, enabling the model to effectively recognize and classify different sleep stages, thereby providing precise support for sleep research and health monitoring.
Algorithm 1 shows the pseudocode of the ADG-SleepNet network architecture.
Algorithm 1 Pseudocode for ADG-SleepNet
Input: Time steps T0 = 3000, Batch size B, Channels C0 = 1, Classes K = 5, Branches D = 4
Output: Trained/inference model

function BUILD_ADG_SLEEPNET(T0, B, C0, K, D)
    ip ← Input(shape = (T0, C0))
    Encoder block:
        H1 ← Conv1D(k = 50, s = 6)(ip) → BN → ReLU
        H1 ← MaxPool1D(win = 8, s = 8) → Dropout(0.5) (H1)
        H1 ← 3 × [Conv1D(k = 8, s = 1) → BN → ReLU] (H1)
        X_enc ← MaxPool1D(win = 4, s = 4)(H1)
    PMDC block (parallel):
        dilations ← [1, 2, 4, 8]
        for d in dilations do
            Z[d] ← DWSepConv1D(k = 3, dilation = d, out = 128)(X_enc) → BN → ReLU
        end for
    AWDM block (channel attention ∥ adaptive gating):
        g ← GAP_t(X_enc)
        ω ← ChannelAttention(g) ⊙ AdaptiveGating(g)
    Weighted summation:
        Z_stack ← stack({Z[d]}_{d ∈ dilations}, axis = branch)
        ω′ ← reshape(ω → B × 1 × 1 × D)
        Z~ ← Σ_{d ∈ dilations} Z_stack[:, :, :, d] * ω′[:, :, :, d]
    Residual and activation:
        Y ← ReLU(Z~ + Conv1×1(X_enc))
    Output:
        h ← GAP_t(Y)
        probs ← Softmax(Dense(K)(h))
        model ← Model(ip, probs)
        return model
end function
2.3. Experimental Setup
To systematically evaluate the effectiveness and robustness of ADG-SleepNet in the context of automatic sleep staging, this study strictly adheres to mainstream experimental protocols for single-channel EEG-based classification, with careful consideration of data segmentation, training procedures, model hyperparameters, and optimization strategies.
Specifically, experiments are conducted on both the Sleep-EDF-20 and Sleep-EDF-78 subsets, employing a unified 20-fold cross-validation scheme [35]. In each experimental round, one of the 20 folds is used as an independent test set, while the remaining 19 folds serve for model training and validation, thereby maximizing the assessment of model generalization and result robustness. To further mitigate overfitting, 10% of the training set is randomly selected as a validation set, enabling real-time monitoring of model performance on unseen data and facilitating adaptive early stopping based on validation loss dynamics [36], which enhances both training efficiency and generalization capability.
During the model training process, the raw EEG signals are first segmented into 30 s temporal epochs (each containing 3000 samples) and fed into the multi-scale encoder module to extract local feature sequences. These feature sequences are then input into subsequent modules for modeling.
To enhance the model’s generalization ability, lightweight data augmentation strategies, such as segment permutation and time reversal, are applied during the training phase. The Adam optimizer is used with an initial learning rate of and a decay factor of . The learning rate is dynamically adjusted based on the validation set performance during training. To prevent gradient explosion, all gradients are clipped with a maximum norm of 5.0. Additionally, L2 regularization is applied to the convolutional layer weights with a weight decay coefficient of to further improve the model’s generalization.
Training is conducted for a maximum of 150 epochs with an early stopping window set to 30 epochs: if there is no significant improvement in the validation set metrics within 30 consecutive epochs, training is automatically terminated. All experiments are conducted with a mini-batch size of 15, balancing training efficiency and model performance. Throughout the training and testing processes, multiple metrics, including overall accuracy, macro-average F1 score, per-class F1 scores, and model parameters, are continuously monitored and recorded for final model evaluation and comparison. All experimental procedures and hyperparameter settings are kept consistent across different datasets, without any manual intervention, ensuring fairness and reproducibility of the evaluation.
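A hedged Keras sketch of this training configuration is shown below (the learning rate and decay schedule are placeholders, since their exact values are not reproduced above; `model`, `x_train`, and `y_train` are assumed to be defined):

# Illustrative training setup matching Section 2.3 (placeholder learning-rate values).
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

optimizer = Adam(learning_rate=1e-3, clipnorm=5.0)   # lr is an assumed placeholder; clipnorm per text
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    EarlyStopping(monitor="val_loss", patience=30, restore_best_weights=True),  # 30-epoch window
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10),             # assumed schedule
]
model.fit(x_train, y_train, validation_split=0.1,
          batch_size=15, epochs=150, callbacks=callbacks)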
2.4. Evaluation Metrics
To comprehensively and objectively evaluate the practical performance of the proposed model on sleep stage classification tasks, we designed and adopted a suite of evaluation metrics addressing overall classification accuracy, per-stage discriminative capability, and model architectural complexity. By conducting a detailed assessment across multiple dimensions, we aim to thoroughly capture the model’s real-world applicability and its comparative advantages and limitations relative to existing methods. The specific metrics and their corresponding formulations are as follows:
First, Accuracy (Acc) serves as the most straightforward indicator of model performance, reflecting the overall proportion of correctly classified samples. It is defined as follows:

$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$
In this context, TP denotes true positives, TN denotes true negatives, FP denotes false positives, and FN denotes false negatives. Acc provides an intuitive measure of the model’s performance on the overall task; however, in cases of class imbalance, Acc may be influenced by the dominance of certain classes, thus necessitating a more comprehensive analysis through the incorporation of additional metrics.
Furthermore, precision (Pre) and recall (Rec) evaluate the model's performance in terms of its accuracy when predicting positive cases and the proportion of actual positive samples correctly identified by the model, respectively. The formula for Pre is as follows:

$\mathrm{Pre} = \frac{TP}{TP + FP}$

The formula for Rec is as follows:

$\mathrm{Rec} = \frac{TP}{TP + FN}$
Pre emphasizes the accurate exclusion of negative samples, while Rec focuses on the model’s ability to identify as many positive samples as possible. In many practical scenarios, these two metrics often exhibit a trade-off, where improving one may lead to a reduction in the other. Therefore, a balanced approach is required for their effective use.
To comprehensively consider both Pre and Rec, we adopt the F1 Score (F1), which is the harmonic mean of these two metrics. The formula is as follows:

$F1 = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}$
The F1 provides a reasonable balance between Precision and Recall, making the evaluation results more comprehensive.
In addition, the Macro F1 Score (MF1) effectively addresses class imbalance by calculating the F1 for each class and taking the arithmetic mean of all class scores. The formula is as follows:

$\mathrm{MF1} = \frac{1}{N} \sum_{i=1}^{N} F1_i$

Here, N represents the number of classes, and $F1_i$ is the F1 for the i-th class. The MF1 gives equal weight to the performance of each class, ensuring that the model's performance across all classes is evaluated fairly.
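These metrics can be computed directly with scikit-learn, as in the following sketch (assuming `y_true` and `y_pred` hold integer stage labels, 0 = W through 4 = REM, for the test epochs):

# Illustrative evaluation sketch for the reported metrics.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

acc = accuracy_score(y_true, y_pred)                       # overall accuracy
per_class_f1 = f1_score(y_true, y_pred, average=None)      # F1 per sleep stage
mf1 = f1_score(y_true, y_pred, average="macro")            # macro-averaged F1 (MF1)
cm = confusion_matrix(y_true, y_pred, normalize="true")    # row-normalized confusion matrix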
By using these metrics in combination, we are able to assess the model’s performance comprehensively from multiple perspectives and gain deeper insights into its behavior across scenarios. These results not only help identify the strengths and weaknesses of the model but also provide guidance for future optimization.
3. Results
This section begins by presenting the overall performance of ADG-SleepNet on both the Sleep-EDF-20 and Sleep-EDF-78 subsets. Next, the proposed method is comprehensively benchmarked against several recent state-of-the-art approaches to highlight its relative advantages. To further elucidate the effectiveness of each component, ablation experiments are conducted. Lastly, confusion matrix analyses are employed to provide an in-depth examination of the model's strengths and limitations across different sleep stages.
3.1. Model Performance
To systematically evaluate the overall performance of the proposed ADG-SleepNet model on automatic sleep staging, rigorous experiments were conducted on two public datasets: Sleep-EDF-20 and Sleep-EDF-78. All experiments employed 20-fold cross-validation to ensure the robustness and generalizability of the results.
Specifically, on the Sleep-EDF-20 dataset, ADG-SleepNet achieved an average Acc of 87.1% and MF1 score of 84.0%. These results indicate that the model is capable of accurately distinguishing between different sleep stages, reaching a state-of-the-art level of classification performance among comparable approaches.
Furthermore, on the larger and more complex Sleep-EDF-78 dataset, ADG-SleepNet consistently demonstrated strong generalization capabilities, achieving an average Acc of 85.1% and MF1 of 81.1%. These findings further illustrate that ADG-SleepNet can not only accommodate inter-subject physiological variability, but also effectively handle distributional shifts and challenges posed by large-scale data.
ADG-SleepNet requires only 0.011 GFLOPs and 0.49 M parameters, with an inference latency of 1.84 ms on an NVIDIA V100 (32 GB) GPU. These results demonstrate that the model not only maintains high computational efficiency but also achieves a significantly lightweight design while preserving accuracy.
3.2. Comparison with Other Methods
To comprehensively evaluate the effectiveness of the ADG-SleepNet model in automatic sleep staging, we conducted comparisons with various state-of-the-art and recent lightweight deep learning models on the Sleep-EDF-20 and Sleep-EDF-78 datasets, as shown in Table 2. The benchmark models include representative methods in the sleep staging domain, such as SleepEEGNet, DeepSleepNet-lite, IITNet, AttnSleep, L-SeqSleepNet, SeriesSleepNet, TinySleepNet, SleepTransformer, EEGSNet, FFTCCN, as well as recently proposed models like MultiSEss, ZleepAnlystNet, and S4Sleep.
In terms of overall performance, ADG-SleepNet attained an Acc of 87.1% and an MF1 of 84.0% on the Sleep-EDF-20 dataset, surpassing most mainstream models by a margin of 1–3 percentage points. These results highlight the model's superior consistency and reliability in classifying different sleep stages. Compared to established models such as AttnSleep and SleepEEGNet, ADG-SleepNet consistently achieved leading or near-leading results in both Acc and MF1. Specifically, AttnSleep reported an Acc of 85.6% and an MF1 of 80.9%, while SleepEEGNet achieved 84.3% Acc and 79.7% MF1. Other lightweight competitors, including DeepSleepNet-lite, IITNet, and L-SeqSleepNet, demonstrated comparatively lower overall performance. In comparison to more recent models, ADG-SleepNet also outperformed MultiSEss, which achieved an Acc of 83.8% and an MF1 of 79.0%, ZleepAnlystNet, with an Acc of 84.1% and an MF1 of 82.9%, and S4Sleep, which attained an Acc of 84.4% and an MF1 of 80.7%. These findings collectively indicate that ADG-SleepNet delivers breakthrough results without depending on large or complex model architectures.
Remarkably, for the most challenging N1 stage in the Sleep-EDF-20 dataset, ADG-SleepNet achieved an F1 score of 58.4%, considerably outperforming leading mainstream methods. By comparison, the F1 scores of representative models such as AttnSleep, SleepEEGNet, and L-SeqSleepNet for the N1 stage typically ranged from 44% to 53%. This clearly underscores the strong capacity of ADG-SleepNet to handle highly imbalanced samples and complex feature entanglement in sleep staging. Furthermore, the model’s overall Acc and MF1 on the Sleep-EDF-20 test set (87.1% and 84.0%, respectively) not only reinforce its effectiveness in terms of improving recognition of the N1 stage but also demonstrate robust generalization for the global sleep staging task.
Nevertheless, certain limitations persist in recognizing the N1 stage. The F1 score for N1 remains notably lower than those for W and N2, suggesting that misclassification still occurs under severe class imbalance and overlapping physiological features. Moreover, the model’s ability to delineate transitional states (e.g., the N1–N2 boundary) requires further enhancement. Future research will focus on improving the representation of rare stages and incorporating more multimodal and temporal contextual information to bolster robustness and generalization in complex real-world scenarios.
It is also worth emphasizing that ADG-SleepNet achieves outstanding model compression. With only 0.49 million parameters, it ranks among the most compact models in its class. For comparison, SleepEEGNet and AttnSleep comprise 2.60 M and 4.54 M parameters, respectively, while even other lightweight models such as DeepSleepNet-lite and TinySleepNet exceed 0.6 M parameters. This significant reduction demonstrates the lightweight architectural advantage of ADG-SleepNet, substantially lowering storage and computational costs without compromising staging accuracy, thereby making it highly suitable for deployment on mobile and embedded platforms.
In summary, ADG-SleepNet not only consistently surpasses existing models in overall and stage-wise accuracy and F1 scores but also offers advantages in model complexity and practical deployment. In particular, its superior performance in the challenging N1 stage provides strong support for clinical applications in sleep medicine and portable monitoring devices. Compared with current mainstream lightweight models, ADG-SleepNet achieves a better balance between performance and efficiency, fully showcasing the advancements and practical significance of the proposed method.
3.3. Ablation Study
To further assess the contribution of each key component in ADG-SleepNet to the overall performance, we conducted ablation experiments by progressively removing the adaptive gating (AG) and channel attention (CA) modules, and evaluated their impact on the Sleep-EDF-20 dataset. The detailed results are presented in Table 3.
As shown by the experimental results, the complete model, incorporating both the AG and CA modules, achieved the highest performance, with an Acc of 87.1% and an MF1 of 84.0%. When the adaptive gating module (–AG) was removed, the model's performance degraded significantly, with Acc and MF1 dropping to 85.3% and 79.1%, respectively, highlighting the crucial role of adaptive gating in feature selection and information flow. Removing only the channel attention module (–CA) resulted in Acc and MF1 values of 86.0% and 79.8%, respectively, which are also lower than those of the full model but higher than those observed when the AG module was removed. As shown in Figure 8, the average attention weights of the top 3 channels increase continuously during training and gradually stabilize. This indicates that the channel attention mechanism can dynamically focus on key channels, significantly enhancing their feature contributions. These results suggest that the channel attention module effectively models inter-channel feature dependencies, thereby improving the overall performance of the model. Notably, this enhancement is evident in both Acc and MF1, with a particularly marked improvement in MF1.
Moreover, when both the AG and CA modules were removed simultaneously (–AG–CA), the model's performance declined further, with Acc reduced to 84.1% and MF1 to 78.8%. These results not only demonstrate the complementary relationship between the two modules but also confirm that the superior performance of ADG-SleepNet stems from the collaborative enhancement of feature representation provided by multiple mechanisms.
3.4. Confusion Matrix Analysis
To gain a comprehensive understanding of the discriminative capabilities of ADG-SleepNet across different sleep stages, as well as its specific misclassification tendencies, Figure 9 displays the confusion matrix for the model on the Sleep-EDF-20 test set. The results indicate that ADG-SleepNet achieves notably high recognition rates along the main diagonal, underscoring its effectiveness in correctly classifying the majority of samples across various sleep stages. However, the distribution of off-diagonal elements highlights the inherent challenges in sleep staging and offers valuable insights into the detailed classification behaviors of the model.
In particular, the model demonstrates exceptional performance in identifying the Wake (W), N2, and REM stages, with main diagonal accuracies reaching 93.5%, 92.1%, and 87.2%, respectively. These results reflect the model’s robust ability to accurately detect wakefulness, typical non-REM, and REM sleep periods. For the N3 stage, the main diagonal accuracy is slightly lower at 88.8%, but still remains at a competitive level, indicating strong recognition of deep sleep patterns. By contrast, the performance for the N1 stage is more modest, with a main diagonal accuracy of 58.4%—the lowest among all stages. This observation is consistent with established clinical findings, as N1 is both underrepresented and characterized by highly overlapping features with adjacent stages, thus presenting a persistent challenge in automated sleep staging.
Further analysis of the confusion matrix reveals that misclassifications of N1 predominantly occur as W, N2, and REM, with respective rates of 14.0%, 11.8%, and 8.4%. This pattern underscores the difficulties in discriminating transitional states, particularly between N1 and its neighboring stages, largely due to the ambiguous physiological nature of N1 and the considerable overlap in EEG characteristics. Additionally, there is a notable degree of mutual misclassification between N2 and N3 (5.5% of N2 predicted as N3, and 5.4% of N3 predicted as N2), suggesting that further refinement is needed to better distinguish between light and deep NREM sleep stages. It is also observed that a proportion of REM samples are misclassified as N2 (7.5%), highlighting the partial feature overlap that exists between certain sleep stages within complex sleep architectures.
In summary, ADG-SleepNet achieves high classification accuracy across the major sleep stages, thereby promoting overall staging consistency and reliability. Importantly, despite the substantial challenges associated with N1, the model still demonstrates commendable recognition capability in this stage. Compared to existing state-of-the-art baselines, the confusion matrix of ADG-SleepNet exhibits a more balanced distribution, with fewer extreme misclassification cases, further supporting its stability and practical application potential. Nevertheless, the discrimination between N1 and its neighboring stages remains an area for improvement. Future research will aim to address these challenges by incorporating multimodal feature integration and exploring finer-grained temporal context modeling, thereby further enhancing recognition accuracy for complex transitional states.
4. Discussion
The proposed ADG-SleepNet demonstrates significant performance improvements in single-channel EEG-based sleep staging, notably achieving higher staging accuracy and macro-averaged F1 scores than mainstream lightweight models, even with an extremely compact parameter count. Based on the aforementioned experimental results, the model’s performance can be further discussed from several perspectives.
Firstly, ADG-SleepNet achieves outstanding overall performance on both the Sleep-EDF-20 and Sleep-EDF-78 benchmark datasets. Specifically, the model attains an accuracy of 87.1% and a macro-averaged F1 score of 84.0% on Sleep-EDF-20, and 85.1% accuracy with 81.1% macro-F1 on Sleep-EDF-78. These results not only surpass most lightweight neural network models but also demonstrate notable advantages in challenging stages such as N1. Analyses of the confusion matrix and stage-wise F1 scores further validate the high recognition rates of the proposed model for the major sleep stages (W, N2, N3, and REM) and confirm a marked improvement in its discrimination ability for the N1 stage. It is noteworthy that classification of the N1 stage has long been recognized as a challenging task within the field. On the Sleep-EDF-20 and Sleep-EDF-78 datasets, several studies have reported F1 scores for N1 typically ranging between 44% and 53% (e.g., 52.2% for SleepEEGNet, 44.4% for DeepSleepNet-lite, and 47.9% for AttnSleep). Our model's N1 F1 score of 58.4% therefore represents a clear improvement on a stage that has long offered limited room for progress.
Secondly, the architectural innovations of the model serve as the core driving force behind its performance improvements. Ablation study results demonstrate that the adaptive gating mechanism and channel attention module play a critical role in optimizing the overall performance. The synergistic effect of these components not only enhances the ability to extract multi-time-scale features but also improves the interaction efficiency between feature channels, thereby significantly boosting the model’s adaptability to complex EEG signals. Compared to the MultiSEss model, which achieves an Acc of 83.8% and an MF1 score of 73.4% on the Sleep-EDF-20 dataset, ADG-SleepNet clearly exhibits superior performance in overall classification accuracy and the identification of complex stages, despite the former performing well on certain stages. A deeper analysis of the dynamic fusion weights reveals that the model can dynamically adjust the focus range of the multi-dilation branches based on the sleep stages, demonstrating a powerful multi-scale modeling capability and dynamic representation ability. In contrast, while the S4Sleep model shows more balanced F1 scores across multiple stages, particularly in the REM stage with an F1 score of 87.7%, its overall Acc (84.4%) and MF1 (80.7%) are still slightly inferior to ADG-SleepNet’s accuracy of 87.1% and macro F1 score of 84.0%. This is particularly evident in the fact that ADG-SleepNet exhibits higher precision and robustness when handling complex sleep stages, such as N1 and N3.
Moreover, ADG-SleepNet achieves high performance while compressing the parameter count to only 0.49 M, which is substantially lower than that of most comparable methods. This lightweight design lays both theoretical and practical foundations for real-time deployment of sleep staging models on wearable and edge devices. The success of this approach indicates that high-quality sleep stage recognition can be realized under resource-constrained conditions through rational architectural innovations and feature fusion mechanisms.
Nevertheless, this study has certain limitations. The model has been primarily validated on single-channel EEG datasets, and its adaptability to multi-channel signals, cross-device data, and complex clinical scenarios warrants further investigation. Additionally, although the model achieves improved performance on difficult stages such as N1, there is still room to enhance its robustness under conditions of extreme class imbalance or limited sample availability. Future work may further incorporate domain-specific priors, physiological feature augmentation, and multimodal data fusion to continually optimize both the performance and applicability of the model.
5. Conclusions
In this study, we proposed an efficient and lightweight network, ADG-SleepNet, for single-channel EEG-based sleep staging, and systematically validated its effectiveness on two authoritative public datasets: Sleep-EDF-20 and Sleep-EDF-78. The proposed model, with only 0.49 million parameters, achieved overall staging accuracies exceeding 85% and macro-averaged F1 scores above 80%, consistently outperforming current mainstream lightweight models—particularly demonstrating substantial advantages in the challenging N1 stage.
ADG-SleepNet integrates multiple architectural innovations, including multi-dilation parallel branches, adaptive gating, and channel attention mechanisms, to fully exploit the multi-scale and temporal features inherent in sleep EEG signals. Ablation studies and stage-wise analyses further confirm the effectiveness and synergistic contributions of these key modules. Meanwhile, the extremely low parameter complexity of the model provides a solid foundation for its real-time deployment in wearable devices, mobile platforms, and edge computing scenarios.
Overall, ADG-SleepNet achieves an excellent balance between model efficiency and staging accuracy, offering new insights into the development of approaches to lightweight automatic sleep staging. Future work will focus on extending the applicability of the model to more complex scenarios, such as multi-channel data, cross-device environments, and multimodal sleep monitoring, thereby further advancing the practical implementation and innovative application of intelligent sleep health monitoring technologies.