1. Introduction
Sleep is a fundamental aspect of human life, and ensuring high-quality sleep is essential for overall health. The significance of sleep becomes evident when considering that individuals spend approximately one-third of their lives in this vital physiological process [
1]. Sleep architecture is composed of distinct sleep stages, each marked by specific physiological changes. Rechtschaffen and Kales (1968) originally divided sleep into seven categories using the R&K method: wakefulness (W), four stages of non-REM (NREM) sleep (S1 to S4), rapid eye movement (REM) sleep, and movement time (MT) [
2]. Subsequently, in 2007, the American Academy of Sleep Medicine (AASM) revised the sleep scoring manual. Under the AASM guidelines, stages S3 and S4 were merged into a single stage, labeled N3, owing to their similar characteristics [
3]. Inadequate sleep quality in older adults is linked to numerous health problems, including obesity, diabetes, heart disease, mood disorders, impaired immune function, and elevated mortality risk [
4]. Research reveals that one-third of Canadian men experience sleep deprivation [
5]. Similarly, a study found that over one-third of American adults consistently face challenges in obtaining sufficient sleep [
6]. The early identification of sleep pattern changes can help prevent sleep disorders from worsening. Although polysomnography (PSG) is the gold standard for diagnosing sleep conditions, it is expensive, time-consuming [
7,
8,
9,
10], and generates extensive amounts of data, making the manual scoring of cyclic alternating pattern (CAP) cycles impractical and error-prone [
11]. To address these challenges, researchers have been developing automated methods to assist sleep experts in detecting sleep stages and disorders. This is particularly relevant for aging populations, who exhibit a higher prevalence of sleep disorders and comorbidities, underscoring the need for robust multimodal approaches.
Previous studies mainly used single-modality EEG signals to classify sleep stages [
12,
13,
14,
15,
16,
17,
18,
19] and disorders [
20], highlighting its simplicity, reduced model complexity, and suitability for extended home-based sleep monitoring. However, accurately identifying distinct sleep stages using only EEG signals is challenging as certain stages, such as REM, N1, and N2, exhibit somewhat similar EEG patterns [
21]. Thus, in addition to EEG brain activities, trained sleep experts also examine eye movements (EOG) and muscle activity levels (EMG) when annotating a 30 s PSG epoch [
3]. These additional signals play a crucial role in identifying certain sleep stages and diagnosing sleep disorders [
22,
23]. Furthermore, EOG and EMG have also been proven to be useful additional sources, complementing EEG in other multimodal automatic systems for sleep staging [
24,
25,
26,
27,
28,
29,
30] and disorder detection [
31,
32,
33,
34]. Therefore, this study uses multimodal signals (EEG, EOG, and EMG) to better align with the manual approach used by sleep experts, aiming to improve accuracy in classifying sleep stages and diagnosing sleep disorders for broader clinical adoption.
To achieve better results, some recent deep learning studies have utilized time–frequency images of physiological signals to classify sleep stages [
16,
17,
24,
26,
35,
36] and disorders [
32,
34]. To transform time-domain signals into time–frequency images, some studies used short-time Fourier transform (STFT) [
16,
17,
24,
26,
32,
34,
35], while others employed continuous wavelet transform (CWT) [
36]. These studies demonstrate the effectiveness of deep learning models in image-based sleep stage classification and disorder detection. In addition, time–frequency images are essential for sleep staging and disorder classification as they capture both time- and frequency-domain features, and they serve as a high-level representation of raw signals [
35]. While CWT is a valuable alternative to STFT, we opt for STFT in this study due to its real-time efficiency and widespread industry adoption [
34].
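The signal-to-spectrogram step described above can be sketched as follows; note that the sampling rate, window length, and overlap used here are illustrative placeholders, not the settings of this study:

```python
import numpy as np
from scipy.signal import stft

def epoch_to_spectrogram(signal_1d, fs=100, nperseg=128, noverlap=64):
    """Convert a 30 s single-channel epoch into a min-max-normalized
    log-magnitude spectrogram ready to be rendered as an image."""
    f, t, Zxx = stft(signal_1d, fs=fs, nperseg=nperseg, noverlap=noverlap)
    power_db = 20 * np.log10(np.abs(Zxx) + 1e-8)  # log scale; epsilon avoids log(0)
    # Normalize to [0, 1]; an RGB colormap can then be applied per pixel.
    norm = (power_db - power_db.min()) / (power_db.max() - power_db.min())
    return f, t, norm

# A synthetic 30 s epoch sampled at 100 Hz: a 10 Hz (alpha-band) sine wave.
fs = 100
x = np.sin(2 * np.pi * 10 * np.arange(30 * fs) / fs)
f, t, spec = epoch_to_spectrogram(x, fs=fs)
```

In a multimodal setup, this transform would be applied independently to each 30 s EEG, EOG, and EMG epoch, yielding one image per channel.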
Recent research studies on sleep scoring have widely investigated traditional machine learning algorithms, such as SVM [
12,
37], RF [
13], and KNN [
38]. Building on this foundation, Rui et al. (2019) proposed a multi-modality approach for automatic sleep staging using EEG, EOG, EMG, and ECG signals. Features selected using the ReliefF algorithm were classified with a random forest model, achieving an accuracy of 86.24% [
30]. Sharma et al. (2021) focused on detecting sleep disorders by using optimal triplet half-band filter bank (THFB) wavelet-based features from two EEG channels. The extracted Hjorth parameters were then fed into an ensemble boosted trees classifier, achieving an accuracy of 91.3% [
20]. In a subsequent study, Sharma et al. (2022) improved accuracy to 94.3% by applying a biorthogonal wavelet filter bank to EOG and EMG signals and using ensemble bagged trees [
33]. However, these approaches rely heavily on hand-crafted features and domain expertise, and often suffer from the curse of dimensionality, leading to potential information loss during data reduction [
9]. To address these challenges, recent studies are increasingly turning to deep learning, which automatically extracts features and performs well on large sleep datasets for sleep stage and disorder detection.
Deep learning methods, including CNNs, RNNs, transformers, and their combinations, have gained popularity for sleep stage classification and the diagnosis of sleep disorders. Several studies have employed CNNs [
24,
32,
34,
39,
40] for these tasks. Cheng et al. (2023) proposed a distributed multimodal and multilabel decision-making system (MML-DMS) based on VGG16 CNN architectures for the automatic identification of sleep stages and sleep disorders [
34]. They used EEG, ECG, and EMG recordings, computing spectrograms for each channel. These images were then fed into separate VGG16 CNN models, and the concatenated probability vectors from all classifiers were passed through a shallow perceptron neural network for final classification. The proposed approach achieved an average classification accuracy of 94.34% for sleep staging and 99.09% for sleep disorder detection [
34]. Overall, standalone CNN-based methods effectively extract features from sleep data without the need for manual feature selection. However, they fall short in modeling the transitional relationships among intra-epoch features, which carry subtle yet clinically relevant temporal patterns, and thus cannot fully exploit the temporal complexity present even within a single epoch.
Given that sleep follows a sequential pattern, several studies have employed recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to capture temporal dependencies in sleep data [
14,
26]. These studies suggest that accounting for the relationships between sleep epochs improves sleep staging performance. However, RNNs increase model complexity, are difficult to train in parallel, and are susceptible to vanishing or exploding gradients during backpropagation. Recent studies have been inspired by [
41] and adopted attention mechanism-based transformer encoder models [
17,
35]. In line with this, standalone transformer-based models are advantageous as they can be trained in parallel and are less complex than RNNs. They also excel at capturing global features but struggle to extract fine-grained local features, as emphasized in [
15]. Combining CNNs with transformers therefore lets the model first extract local features effectively; the resulting feature maps are then fused column-wise so that the transformer can capture global dependencies across those features within a single epoch, boosting performance in sleep staging and disorder diagnosis.
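A minimal numpy sketch of this idea is shown below: three hypothetical per-channel CNN feature maps are fused column-wise, then passed through scaled dot-product self-attention, the core operation of a transformer encoder. All dimensions and weights here are illustrative, not the actual model configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of fused
    feature columns X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (seq_len, seq_len)
    return scores @ V

# Hypothetical per-channel CNN feature maps: (seq_len, d_ch) each.
seq_len, d_ch = 16, 32
eeg, eog, emg = (rng.standard_normal((seq_len, d_ch)) for _ in range(3))
X = np.concatenate([eeg, eog, emg], axis=1)  # column-wise fusion -> (16, 96)
d_model = X.shape[1]
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

A multi-head variant would split `d_model` into several subspaces, run this attention in each, and concatenate the results; the single-head form above is kept for brevity.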
Hybrid models combining CNNs with RNNs or LSTM architectures have been used in the current literature [
16,
25,
42,
43,
44]. For instance, Li et al. (2022) introduced a sleep stage classification method using EEG spectrograms [
16]. Their model, EEGSNet, used CNNs to extract time and frequency features from the spectrogram and two Bi-LSTMs to capture transition patterns between features from adjacent epochs, facilitating accurate sleep stage classification. Their method achieved the highest accuracy of 94.17% and an F1-score of 70.16% for the N1 stage. Similarly, Almutairi et al. (2021) proposed three architectures for detecting obstructive sleep apnea (OSA) from ECG signals [
43]: CNN, CNN + LSTM, and CNN + GRU. Using consecutive R-R intervals and QRS complex amplitudes as inputs, their results showed that the CNN with LSTM outperformed the other models, achieving an average classification accuracy of 89.11% for OSA detection. These studies, which combine the features of CNNs and LSTMs, have enhanced machine-based sleep scoring and sleep disorder diagnosis, bringing their performance closer to that of human scoring. However, owing to their inherently sequential processing, such models still face challenges with training efficiency and parallelization.
Hybrid models that combine CNNs with transformers have gained attention in recent studies for sleep stage classification [
1,
15,
18,
19]. For instance, Yao et al. (2023) proposed a sleep stage classification approach that leverages single-channel EEG, employing four convolutional filter layers for feature extraction and transformers to model temporal variations, achieving a testing accuracy of 80% [
19]. These studies have achieved promising results using single-channel EEG signals. However, integrating electromyography (EMG) and electrooculography (EOG) signals alongside EEG provides additional informative features. For example, the eye movements recorded by EOG are typically more frequent in stage W and REM but are less common in NREM stages [
22,
23]. This makes EOG features effective at distinguishing NREM stages from W and REM. Also, stage W is characterized by the highest muscle activity, whereas REM shows the lowest or absent EMG activity [
22,
23]. This makes EMG features useful for distinguishing between stage W and stage REM. Moreover, muscles are typically paralyzed during REM sleep; in REM sleep behavior disorder (RBD), however, this atonia is absent [
22]. This makes EMG features useful for distinguishing between RBD and healthy individuals or those with other disorders such as narcolepsy and PLMD [
45]. This highlights the potential for developing CNN with transformer architectures that integrate EEG, EMG, and EOG signals by leveraging their distinct characteristics to enhance classification performance not only in sleep staging but also in sleep disorder diagnosis.
In this study, we propose a multimodal time-domain-signal-to-time–frequency-image conversion pipeline, in which 30 s epochs of raw EEG, EOG, and EMG signals are individually transformed into spectrograms using the short-time Fourier transform (STFT) and then converted into RGB spectrograms as input for deep learning models. We also propose three architectures for classifying sleep stages and sleep disorders: (1) CNNs, (2) CNNs with a Bi-LSTM layer, and (3) CNNs with a transformer encoder. In each method, independent CNN layers with identical parameters (CNN architecture modules) extract unique features from each signal individually. The extracted features from all channels are then fused column-wise using a feature fusion block, which plays a key role in sleep stage and disorder classification. In Method-1, the fused features are classified directly. In Method-2, they are processed through a Bi-LSTM layer, while in Method-3, they are processed through a transformer encoder with a multi-head attention mechanism. Experiments were conducted using K-fold cross-validation to evaluate and compare the three architectures against advanced state-of-the-art methods. The CNN with transformer encoder achieved the best performance, with the highest average classification accuracy in detecting sleep stages and disorders on the CAP sleep dataset.
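The K-fold evaluation protocol mentioned above can be sketched as follows; the fold count and random seed here are illustrative, not the experimental settings:

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation:
    the data are shuffled once, split into k disjoint folds, and each
    fold serves as the test set exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: 5 folds over 100 epochs (indices only; models trained per split).
splits = list(kfold_indices(100, k=5))
```

Per-fold metrics (accuracy, F1, kappa) would then be averaged across splits to obtain the reported figures.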
Overall, the main contributions, features, and advantages of our proposed model can be summarized as follows:
We adopt three novel architectures: (1) CNNs, (2) CNNs with Bi-LSTM, and (3) CNNs with a transformer encoder utilizing a multi-head attention mechanism. Each method extracts local features from RGB spectrogram images of EEG, EOG, and EMG signals separately, followed by a column-wise feature fusion block to capture intra-epoch information. A residual connection is also applied to preserve the characteristics of the original joint feature maps and prevent vanishing gradients.
To the best of our knowledge, this is the first study to investigate a CNN-with-transformer model architecture that is both accurate and robust for classifying five sleep stages, particularly stage N1, and six sleep disorders using RGB spectrogram images of EEG, EOG, and EMG signals from both patient and non-patient subjects.
We use modified L2 regularization to add a penalty term to the categorical cross-entropy (CCE) loss to prevent overfitting. Moreover, our robust method classifies sleep stages and disorders in the elderly and could serve as a stepping stone for future research and a potential alternative to questionnaire-based diagnostic tools.
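As an illustration of an L2-penalized CCE loss: the exact form of the modified L2 term is not reproduced here, so this is a generic sketch of categorical cross-entropy plus an L2 weight penalty, with a hypothetical regularization strength λ.

```python
import numpy as np

def cce_l2_loss(y_true, y_prob, weights, lam=1e-4, eps=1e-12):
    """Categorical cross-entropy plus an L2 penalty on model weights.
    y_true: one-hot labels (N, C); y_prob: predicted probabilities (N, C);
    weights: list of weight arrays; lam: hypothetical regularization strength."""
    cce = -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))
    l2 = lam * sum(np.sum(w ** 2) for w in weights)
    return cce + l2

# Two samples, three classes; zero weights isolate the plain CCE term.
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_prob = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
loss_plain = cce_l2_loss(y_true, y_prob, weights=[np.zeros((2, 2))])
```

The penalty shrinks large weights during training, discouraging the model from fitting noise in the spectrograms.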
The rest of this article is organized as follows:
Section 2 describes the dataset, the proposed methods, and their processing steps.
Section 3 presents the ablation study and experimental results.
Section 4 provides a comparison, discussion of the results, and future directions. Finally,
Section 5 concludes the paper with a summary.
4. Comparison and Discussion
The results of the ablation experiments are summarized in the bar graphs shown in
Figure 6 and
Figure 7, where each bar corresponds to a different classification approach tested in our experiments for both sleep stages and disorders. The multichannel feature fusion CNN with multi-head attention outperforms all the other methods—including C4-A1 + ROC-LOC CNN, C4-A1 + EMG1-EMG2 CNN, ROC-LOC + EMG1-EMG2 CNN, C4-A1 + ROC-LOC + EMG1-EMG2 CNN Method-1 (Concatenation), and C4-A1 + ROC-LOC + EMG1-EMG2 Method-2 (CNN+Bi-LSTM)—in terms of both per-class metrics and overall metrics.
The addition of ROC-LOC and EMG1-EMG2 to C4-A1 led to clear sleep stage improvements in the overall performance metrics. According to the per-class F1-scores, in both Method-1 and Method-3, the contributions of ROC-LOC and EMG1-EMG2 features were particularly significant for enhancing the classification performance of stage N1 and REM; specifically, in Method-1, improvements of 4.7% and 5.5% were observed for N1 and REM, respectively, while Method-3 showed increases of 14.9% for N1 and 11.5% for REM. These results are expected, as stage N1 is not only associated with alpha and high-amplitude theta brain waves but also characterized by slow eye movements and moderate muscle tone. Similarly, REM sleep is distinguished by rapid eye movements in various directions and relaxed muscle activity, making the ROC-LOC and EMG1-EMG2 features particularly informative for its detection [
25]. Misclassifications between N3 and N1, W, or REM are nearly zero. This is a good indication of model performance, because N3 represents deep sleep, which is physiologically distinct from REM sleep, light sleep (N1), and wakefulness (W). Additionally, the other stages also showed slight but consistent improvements in their classification from Method-1 to Method-3, confirming the correlation and complementarity of features across the EEG, EOG, and EMG channels, in a manner similar to how sleep experts manually assess sleep data [
35].
As shown above, the features across EEG, EOG, and EMG help in the detection of the REM stage, which is essential for diagnosing sleep disorders, particularly narcolepsy and RBD [
25]. Adding ROC-LOC and EMG1-EMG2 to C4-A1 improved the classification of narcolepsy and RBD. According to the per-class F1-scores, in Method-1, improvements of 1.4% and 1.1% were observed for narcolepsy and RBD, respectively, while Method-3 showed increases of 3.1% for narcolepsy and 2.2% for RBD. These results are expected due to the physiological traits of the disorders, as narcolepsy is associated with abnormal transitions into REM sleep. RBD, on the other hand, is characterized by abnormal muscle activity during REM sleep. The inclusion of EOG and EMG modalities enhances the model’s ability to capture the characteristic features of these disorders. Other disorders also showed consistent classification improvements from Method-1 to Method-3, highlighting the complementary value of multimodal features, akin to expert sleep scoring.
Lastly, we aim to compare the consistency of our results with previous state-of-the-art methods.
Table 11 and
Table 12 present a comparative analysis of our CNN with transformer method against existing state-of-the-art approaches for sleep staging and sleep disorder classification, respectively. It can be observed from
Table 11 and
Table 12 that the CNN with transformer method outperformed previous state-of-the-art methods. This improvement stems from the CNN modules’ strong single-channel feature extraction, the effective fusion of multichannel features, and the multi-head attention mechanism, which captures intra-epoch temporal dynamics and contextual dependencies among the joint features, enabling our method to effectively mine useful information within the joint single-epoch features.
Our CNN with transformer method outperforms the best-performing study [
34] by 0.86% and 0.21% for sleep stages and sleep disorders, respectively. In addition, our method has a smaller model footprint and lower computational costs than [
34], which used VGG16 CNN structures. Our model achieves superior performance compared with other state-of-the-art approaches, especially in the classification of N1, a stage known for its classification difficulty. The CNN with transformer method has several advantages over existing studies; notably, it demonstrates superior performance in terms of MF1 and MGM for both sleep stages and sleep disorders, underscoring its robustness to class imbalance across all data categories. The highest Cohen’s kappa values, 93.6% for sleep staging and 99.1% for disorder classification, achieved by our model demonstrate almost perfect agreement with human sleep experts, the highest level of concordance as defined by Landis and Koch. Our model shows strong potential, delivering performance comparable to or exceeding all previously reported state-of-the-art results, as shown in
Table 11 and
Table 12, and can be integrated into clinical care settings.
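The Cohen’s kappa reported above can be computed from a confusion matrix as follows; the matrix in the example is synthetic, for illustration only.

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix
    (rows: expert labels, columns: model predictions)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_observed = np.trace(confusion) / total  # raw agreement rate
    # Expected chance agreement from the row/column marginals.
    p_expected = (confusion.sum(axis=1) @ confusion.sum(axis=0)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Synthetic two-class example: 90% raw agreement, 50% chance agreement.
kappa = cohens_kappa([[45, 5], [5, 45]])
```

Under the Landis and Koch scale, values above 0.80 are interpreted as almost perfect agreement, the band into which the 93.6% and 99.1% figures fall.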
Regarding future work, some datasets contain more than 136 channels shared across all subjects [
56]. For such datasets, we plan to apply channel selection techniques to identify a minimal set of informative channels necessary for automated sleep scoring and sleep disorder diagnosis, thereby enhancing the feasibility of clinical application. We will also explore other modalities, such as ECG and respiratory signals, for the effective detection of other disorders, such as SDB, in datasets with a sufficient number of SDB subjects. Furthermore, we plan to investigate inter-epoch contextual dependencies and explore other advanced transformer architectures to further enhance model performance. Finally, to enhance model generalization while reducing computational costs and improving processing speed, we will apply our approach to additional datasets, including the SHHS, MASS, MIT-BIH, Apnea-ECG Database, and St. Vincent’s University Hospital/University College Dublin Sleep Apnea Database (UCDDB).