1. Introduction
Sleep is an essential human activity that occupies one-third of people’s lives. Long periods of unhealthy sleep can lead to various diseases [1,2]. Medical experts assess five components of sleep health: duration, continuity, timing, alertness, and quality [3]. Most of these indicators can be obtained via polysomnography (PSG) analysis. The acquisition and analysis process of PSG is as follows. First, multiple sensors placed on the patient record physiological signals—producing an electroencephalogram (EEG), electrooculogram (EOG), electrocardiogram (ECG), and electromyogram (EMG)—during sleep. Second, these signals are split into 30-s epochs that are classified by sleep stage: wake (W), rapid eye movement (REM), non-REM stage 1 (N1), non-REM stage 2 (N2), non-REM stage 3 (N3), and non-REM stage 4 (N4), as defined by the Rechtschaffen and Kales Manual (R&K) [4]; or with stage N4 merged into stage N3, as defined by the American Academy of Sleep Medicine Manual (AASM) [5]. Third, the scorer notes spontaneous arousals, cardiac arrhythmias, and respiratory events. In this process, the second step is both crucial and time-consuming [6]. It requires that an experienced medical expert observe each PSG epoch, look for its characteristic features, and assign it to the correct sleep stage.
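The epoch segmentation in the second step is mechanical and easy to automate. A minimal sketch is shown below; the 100 Hz sampling rate and the function name are illustrative assumptions, not details from the datasets discussed here:

```python
import numpy as np

def split_into_epochs(signal, fs, epoch_s=30):
    """Split a 1-D recording into non-overlapping 30-s epochs."""
    samples = int(epoch_s * fs)
    n = len(signal) // samples            # drop any trailing partial epoch
    return signal[:n * samples].reshape(n, samples)

# e.g. 5 minutes of single-channel EEG sampled at 100 Hz
eeg = np.random.randn(5 * 60 * 100)
epochs = split_into_epochs(eeg, fs=100)
print(epochs.shape)  # (10, 3000)
```

Each row of the result is one 30-s epoch, the unit that a human scorer (or a classifier) labels with a sleep stage.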
Figure 1 shows some examples. This labor-intensive process limits the efficiency of PSG analysis. With the extensive application of machine learning methods in biomedicine [7,8,9,10,11], many researchers have proposed machine learning-based algorithms for computer-aided, or even fully automated, sleep stage classification [12,13,14,15].
In recent years, automated sleep stage classification research has focused on two machine learning approaches [6]: traditional machine learning methods and deep learning-based methods. Traditional methods combine manually chosen representative signal features with machine learning models to classify sleep stages. For example, Liang et al. [16] first proposed multiscale entropy as a signal feature and employed an autoregressive model for classification. Tsinalis et al. [17] extracted 557 time-frequency-domain features from EEG signals as input to a stacked sparse autoencoder model, achieving 78.9% accuracy on the sleep-edf [18] database. Hassan et al. [19] decomposed the signal into several sub-bands using the tunable-Q wavelet transform and then classified the statistical characteristics of the sub-bands with a bootstrap-aggregating model. Jiang et al. [20] divided sleep stage classification into three steps: feature extraction based on multimodal decomposition, classification using a random forest, and result refinement based on sleep stage transition rules using a hidden Markov model. The refinement step proved particularly effective at improving the classification accuracy of stage N1.
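Such hand-crafted pipelines typically start from simple spectral descriptors of each epoch. The sketch below illustrates the general idea with EEG band powers; the band definitions and the function are illustrative, not taken from the cited works:

```python
import numpy as np

def band_power(epoch, fs, band):
    """Mean spectral power of `epoch` inside a frequency band (Hz)."""
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(epoch)) ** 2 / len(epoch)
    lo, hi = band
    return float(psd[(freqs >= lo) & (freqs < hi)].mean())

# Classic EEG bands, used as hand-crafted features for a downstream classifier
bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
epoch = np.random.randn(3000)                 # one 30-s epoch at 100 Hz
features = [band_power(epoch, 100, b) for b in bands.values()]
```

A feature vector like `features` would then be fed to a model such as a random forest or bagging classifier.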
In deep learning models, feature extraction is performed automatically by a deep neural network [21,22], enabling end-to-end automated sleep stage classification. Deep learning-based methods mainly use convolutional neural networks (CNNs) [23], recurrent neural networks (RNNs), or a combination of the two. CNNs have a strong capacity to learn shift-invariant features and have already achieved great success in computer vision; ResNet, in particular, is a powerful architecture for image classification. Andreotti et al. [24] first employed a modified 34-layer ResNet for automatic sleep stage classification. Yildirim et al. [25] developed a one-dimensional CNN that used raw PSG signals as input and achieved 91% accuracy on the sleep-edf dataset. Phan et al. [26] proposed a two-dimensional CNN-based model: it obtains a spectral map via a short-time Fourier transform of the raw PSG and then classifies the map much as a natural image would be. However, labeling an epoch, whether under the R&K or the AASM guideline, sometimes requires combining its data with information from the previous and following epochs. RNNs are often used for problems such as this one that involve temporal information. Among RNN variants, long short-term memory (LSTM) [27] is the most widely used and can competently handle long-term temporal dependence. Michielli et al. [28] used a two-level LSTM structure to classify EEG signals, which effectively improved classification performance on stage N1. Supratak et al. [29] first proposed combining a CNN and an LSTM: their model used a CNN module to extract epoch-wise features and a bidirectional LSTM to extract sequence features for classifying epochs.
In this study, we propose a neural network model based on a CNN and an attention mechanism [30] for automated sleep stage classification, using a single-channel raw EEG signal. The main contributions of this work are as follows:
A neural network based on convolution and an attention mechanism is built. The network uses a CNN to extract local signal features and multilayer attention networks to learn intra- and inter-epoch features. Recurrent architectures are entirely absent from our model.
To handle the unbalanced dataset, the proposed method uses a weighted loss function during training to improve model performance on minority classes.
The model outperforms other methods on the sleep-edf and sleep-edfx datasets under various training/testing set partitioning schemes, without changing the model’s structure or any of its parameters.
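The weighted loss in the second contribution can be sketched as a class-weighted cross-entropy. The inverse-frequency weighting and the stage counts below are illustrative assumptions, not the exact scheme or statistics of this work:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Cross-entropy in which each sample is weighted by its class weight."""
    w = class_weights[labels]
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(np.sum(w * nll) / np.sum(w))

# Hypothetical epoch counts for stages W, N1, N2, N3, REM: weights are set
# inversely proportional to class frequency, so a rare stage such as N1
# contributes more per epoch to the loss than the dominant stage N2.
counts = np.array([8000.0, 2800.0, 17800.0, 5700.0, 7700.0])
class_weights = counts.sum() / (len(counts) * counts)
```

With uniform weights this reduces to the ordinary mean cross-entropy, so the weighting is a strict generalization of the standard loss.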
4. Discussion
In recent years, many automated sleep stage classification methods based on deep neural networks have used CNNs for feature extraction and vanilla RNNs or LSTMs to capture temporal information. These strategies have significantly improved sleep stage classification accuracy. In this study, we used sliding windows of the raw signal as input, a CNN combined with multiple attention layers as the epoch feature extractor, and multiple attention layers instead of an RNN structure to capture the temporal dependency between epochs. Our method achieved better overall classification accuracy and better performance on minority categories than several state-of-the-art methods. In the feature extraction stage, the CNN module extracts the features of each signal window well. As the attention weight visualization shows, the attention block learns to allocate attention according to the importance of each signal window: when an epoch has prominent characteristics, the model pays more attention to the significant areas, and when the signal windows within an epoch are relatively similar, attention is spread evenly across the epoch. The module validity analysis shows that the multiple attention layers effectively process temporal information from multiple epoch inputs, and that the weighted loss function balances the model’s performance on the majority and minority stages.
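The attention behavior described above can be illustrated with a minimal scaled dot-product self-attention over window features; the shapes and the single-head self-attention form are an illustrative sketch, not the exact architecture of our model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d)) V: each output is a weighted sum of V rows."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # one row of weights per query
    return weights @ V, weights

# 10 signal-window features of dimension 64 attending over each other
x = np.random.randn(10, 64)
out, w = scaled_dot_product_attention(x, x, x)
print(w.sum(axis=1))   # each row of attention weights sums to 1
```

Because each row of `w` is a probability distribution over windows, visualizing it directly shows which windows the model emphasizes, which is how the attention weight visualization in the results is interpreted.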
In the future, we plan the following work. First, to more accurately evaluate the general performance of automatic sleep stage classification in actual clinical applications, the model should be tested on additional independent external data, and a transfer learning strategy should be applied to improve the generalization of the deep learning model. Second, during manual scoring, human experts combine EEG, EOG, EMG, and other signals to make a comprehensive judgment; however, deep learning-based methods that directly use multichannel data as input have not effectively improved classification accuracy. We therefore plan to apply the attention mechanism employed in this study across multiple channels to improve the model’s classification performance.