Multi-Layer Graph Attention Network for Sleep Stage Classification Based on EEG

Graph neural networks have been successfully applied to sleep stage classification, but challenges remain: (1) how to effectively utilize the epoch information of adjacent EEG channels, given their different interaction effects; (2) how to extract the most representative features from the confusable transitional information of easily confused stages; and (3) how to improve the classification accuracy of sleep stages compared with existing models. To address these shortcomings, we propose a multi-layer graph attention network (MGANet). Node-level attention prompts the graph attention convolution and the GRU to focus on and differentiate the interactions between channels in the time-frequency and spatial domains, respectively. A multi-head spatial-temporal mechanism balances the channel weights and dynamically adjusts channel features, and the multi-layer graph attention network accurately expresses spatial sleep information. Moreover, stage-level attention is applied to easily confused sleep stages, which effectively mitigates the limitations of graph convolutional networks on large-scale sleep graphs. The accuracy, MF1, and Kappa reached 0.825, 0.814, and 0.775 on the ISRUC dataset and 0.873, 0.801, and 0.827 on the SHHS dataset, respectively, showing that MGANet outperforms state-of-the-art baselines.


Introduction
Sleep is crucial for maintaining normal cognitive function. Long-term sleep deprivation causes cognitive disorders, such as learning and memory impairment [1,2]. Sleep stages are complex, dynamic biological processes that repeat periodically. Polysomnography (PSG) contains physiological signals such as the electroencephalogram (EEG) and electrooculogram (EOG) [3]. In a clinical setting, the transition rules between sleep stages help experts accurately classify sleep stages within the frameworks determined by the American Academy of Sleep Medicine (AASM) and the Rechtschaffen and Kales (R&K) standards [4,5], which divide human sleep into five classes: wake (W), rapid eye movement (REM), non-REM1 (N1), non-REM2 (N2), and non-REM3 (N3) [6,7].
Automatic sleep staging has wide practical value in clinical medicine. Researchers have used traditional machine learning algorithms based on time- and frequency-domain features, such as random forest (RF) [8] and support vector machines (SVM) [9], for sleep stage classification. In recent years, deep learning, with its powerful representational ability, has been widely used to classify sleep stages [10-13]. However, convolutional neural networks have some disadvantages for sleep staging [14]. Generally speaking, there are three types of information fusion in EEG signal processing: spatial information fusion, temporal information fusion, and joint spatial-temporal information fusion [15]. The latter not only fuses the relationship dynamics of EEG signals across different brain regions, such as correlation, but also computes time-frequency domain features at different times. Traditional CNN architectures and 2D time-frequency signal processing methods handle the temporal information fusion of EEG signals well. Unfortunately, the fixed convolution kernel cannot maintain translation invariance for non-Euclidean signals or directly capture effective node information, and therefore cannot fuse rich spatial information. Jia et al. [16] considered using a graph structure to model the functional connectivity and physical proximity of EEG brain regions.
Furthermore, brain regions are located in non-Euclidean space, and different brain regions control various functions of the human body [17]. For example, the brainstem maintains circulation and respiration, while the telencephalon dominates higher cortical activities, such as smell, hearing, and other somatic sensations, as well as body movement. Research has determined that the functional structure of the human brain is strongly influenced by regional anatomical relationships, and that the degree of interaction between regions varies [18]. When a GCN updates the current node representation, all adjacent nodes in the brain region are assigned static importance weights. As a result, neighboring nodes closely related to the current node's functions may be assigned overly large or small weights, causing the importance of node features to be incorrectly amplified or ignored [19].
Inspired by these recent works, we propose a multi-layer graph attention sleep stage classification network (MGANet) to distinguish sleep stages in the graph structure; the framework is shown in Figure 1. During training, the influence weights of neighboring channels in the EEG spatial and temporal domains are calculated, and the constructed graph structure reflects the different interactions of functional connectivity between brain regions. The multi-layer GATs use stacked, layered spatial-temporal characteristics to capture information and reduce the number of graph model parameters. The model applies node-level attention and stage-level attention to sleep classification, dynamically establishing the interactions between channels and combining an RNN and a GAT to extract the most representative spatial and temporal features. In general, the sleep staging model based on a multi-layer graph attention network has the following characteristics: (1) A node-level attention mechanism effectively calculates and dynamically updates the importance of adjacent channels to the current channel, while considering the spatial locations and functional connections of brain regions. (2) The parallel operation of MGANet's multi-layer graph attention convolution improves computational efficiency: when calculating output features and executing attention operations, all nodes and adjacent edges run in parallel. (3) A transitional stage estimator accurately classifies sleep stages by expressing the features of transition-stage epochs through a stage-level attention vector.
Experimental results show that MGANet achieves state-of-the-art performance in sleep staging. The remaining sections of this paper are organized as follows. Section 2 introduces the related work. Section 3 describes the materials and our method. Section 4 outlines the results and discussions, and Section 5 summarizes future work.

Related Work
In early studies, manual sleep stage classification by human sleep experts was tedious and time-consuming, with highly subjective results [20,21]. Traditional sleep staging methods use machine learning, such as RF and SVM. Tsinalis et al. [13] manually extracted sleep spectral features and used stacked sparse autoencoders for ensemble learning. Such methods require manual feature extraction and a large amount of prior knowledge, and have gradually been replaced by deep learning algorithms that extract features automatically.
Subsequently, a single-channel sleep staging model with a bidirectional long short-term memory (BiLSTM) network was designed by Supratak et al. [10]. Later, Chambon et al. [11] used a hybrid neural network to extract time-invariant and time-contextual features. Phan et al. [22] proposed a multi-variable, multi-modal algorithm, which automatically extracted sleep features from multi-modal physiological signals with auxiliary classification of sleep rules. Chriskos et al. [23] treated sleep staging as a sequence-to-sequence task, and Jia et al. [24] established a CNN-based sleep staging framework to extract synchronous features of cerebral cortex interaction. Eldele et al. [25] proposed a model consisting of feature extraction modules based on a multi-resolution convolutional neural network (MRCNN) and adaptive feature recalibration (AFR). A method utilizing deep residual networks in mixed-cohort environments was proposed by Alexander et al. [26].
In addition, graph convolutional networks, which inherit from CNNs, are mainly divided into spectral-based and spatial-based graph convolutional networks [19]. Jia et al. [16] used a graph structure to represent brain spatial connections for the first time, adaptively learning the internal connections between different channels through an adjacency matrix. Subsequently, they proposed a multi-view spatial-temporal graph convolutional network with domain generalization, which constructed two brain views by utilizing the functional connectivity and physical proximity of brain regions [27].
In this paper, we propose a new general framework, which represents and fuses the deep semantic features of the input EEG data, and effectively learns the significant features and temporal dependence across epochs. The experimental results show that MGANet is superior to the most advanced baseline. In future work, it could be expanded to emotion recognition, speech processing, non-stationary signal analysis, and other fields [28][29][30].

Materials and Methods
MGANet constructs a new graph structure to describe the brain's spatial position and temporal contextual information, the architecture of which is shown in Figure 1. We introduce the networks used to construct sleep graph attention networks, and directly outline their theoretical and practical benefits, as well as limitations, compared to previous work.

Preliminaries
Sleep stage. Recordings are divided into 30 s epochs, each belonging to one of the five classes. Graph attention sleep network. This is defined as an undirected graph $G_A = (N, C, A)$, where $N$, $C$, and $A$ denote the vertices, edges, and adjacency matrix of the GAT, respectively, mapping the brain's electrode channels, the connections between channels, and the adjacency relationships between channels. The number of EEG channels is defined as $|N| = n$, and the input sequence is defined as
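As a concrete illustration of this graph definition, the sketch below builds the adjacency matrix $A$ for a hypothetical 10-channel montage and derives the symmetric normalized propagation operator mentioned later in Section 3. The channel names and edge list are illustrative assumptions, not the paper's actual connectivity.

```python
import numpy as np

# Hypothetical 10-channel EEG montage: G_A = (N, C, A) maps electrodes to
# nodes and the connections between channels to edges.
channels = ["F3", "F4", "C3", "C4", "T3", "T4", "O1", "O2", "Fz", "Cz"]
n = len(channels)

# Illustrative undirected edge list based on rough physical proximity.
edges = [(0, 2), (1, 3), (2, 4), (3, 5), (2, 6), (3, 7),
         (8, 9), (0, 8), (1, 8), (2, 9), (3, 9)]

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0          # undirected graph: symmetric adjacency

A_hat = A + np.eye(n)                # self-loops keep each node's own features
deg = A_hat.sum(axis=1)
# symmetric normalized propagation operator D^{-1/2} (A + I) D^{-1/2}
A_norm = A_hat / np.sqrt(np.outer(deg, deg))
```

Multiplying node features by `A_norm` then mixes each channel's features with those of its physical neighbors, which is the propagation step the graph attention layers refine with learned weights.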

Multi-Head Graph Attention Learning
Firstly, we construct the human brain as a graph, with electrodes corresponding to the nodes of the graph and the connections between channels corresponding to the edges of the graph. In addition, the sleep features of each node are taken as its dynamic features, and appropriate weights are assigned to adjacent nodes, corresponding to the influence level of neighbors on the current channel.
To adaptively capture spatial location information through the brain position nodes, the input is a sequence of node features. MGANet uses a parallel attention operation to assign importance weightings to different channels in the same brain region according to the influence of the interactions between nodes, using the spatial-temporal features of adjacent channels. Specifically, a fixed number of neighborhood edges is selected around each electrode vertex, and a learnable weight matrix $W_G$ of sleep spatial features is obtained. The attention coefficient, i.e., the importance of the neighbor nodes $N_k$, $k = 1, 2, \cdots, m$, to the center node $N_c$, is calculated from the attention factor $a$ and $W_G$ through a shared attention mechanism, where $m$ is the number of first-order adjacent channels of $N_c$.
To make the coefficients easily comparable across different nodes, the masked self-attention vector $\alpha$ of $N_c$ and its neighbors $N_k$ is obtained through softmax normalization, with the LeakyReLU nonlinear function (negative slope $\beta = 0.2$) applied to activate the attention score:

$$ e_{N_c,N_k} = \mathrm{LeakyReLU}\!\left(\vec{a}^{\,T}\left[W\vec{f}_{N_c} \,\|\, W\vec{f}_{N_k}\right]\right), \qquad \alpha_{N_c,N_k} = \frac{\exp\!\left(e_{N_c,N_k}\right)}{\sum_{j=1}^{m}\exp\!\left(e_{N_c,N_j}\right)}, $$

where $\vec{a}^{\,T}$ is the transposed attention vector of the attention mechanism network and $W\vec{f}_{N_c} \,\|\, W\vec{f}_{N_k}$ is the concatenated feature pair to be weighted. To allocate the importance of the different sleep features contained in each channel to adjacent nodes, a multi-head attention mechanism is introduced: the weighted sums of the features of the neighbors of $N_c$ are averaged over the $L$ attention heads, and the final spatial features $\vec{f}_s$ are obtained:

$$ \vec{f}_s = \sigma\!\left(\frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{m}\alpha^{(l)}_{N_c,N_k}\, W^{(l)} \vec{f}_{N_k}\right), $$

where $\sigma$ is the sigmoid activation function, $\alpha^{(l)}_{N_c,N_k}$ is the $l$-th normalized attention coefficient, and $W^{(l)}$ is the $l$-th attention weight matrix. MGANet uses the fixed symmetric normalized Laplacian operator widely used in graph convolution as the propagation mode to dynamically learn the spatial structure through the adjacency matrix of the current channel, and introduces a multi-head attention mechanism to match and update the features of the adjacent nodes that significantly influence each node. The model framework is shown in Figure 2.
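The multi-head attention computation above can be sketched numerically. This is a minimal NumPy illustration of a standard GAT-style head with masked softmax and head averaging; all sizes, the fully connected toy graph, and the random weights are assumptions for demonstration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, beta=0.2):          # negative slope beta = 0.2, as in the text
    return np.where(x > 0, x, beta * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_head(f, A, W, a):
    """One head: alpha_ck = softmax_k(LeakyReLU(a^T [W f_c || W f_k]))."""
    h = f @ W                          # shared linear map W_G applied to features
    out = np.zeros_like(h)
    for c in range(f.shape[0]):
        nbrs = np.flatnonzero(A[c])    # first-order neighborhood of node c
        e = np.array([leaky_relu(a @ np.concatenate([h[c], h[k]])) for k in nbrs])
        alpha = softmax(e)             # masked softmax over neighbors only
        out[c] = (alpha[:, None] * h[nbrs]).sum(axis=0)
    return out

n, d_in, d_out, L = 5, 8, 4, 4         # toy sizes: 5 channels, L = 4 heads
f = rng.normal(size=(n, d_in))         # node (channel) features
A = np.ones((n, n))                    # fully connected toy graph with self-loops

heads = [gat_head(f, A, rng.normal(size=(d_in, d_out)),
                  rng.normal(size=2 * d_out)) for _ in range(L)]
f_s = 1.0 / (1.0 + np.exp(-np.mean(heads, axis=0)))  # sigma(average of L heads)
```

Each head learns its own $W^{(l)}$ and $\vec{a}^{(l)}$; averaging the heads before the sigmoid corresponds to the final-layer aggregation described above.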


Spatial Domain Attention
The spatial attention mechanism is introduced into MGANet to automatically update the attention coefficient and extract the dynamic, valuable spatial information of the signals. It is defined as follows:

$$ S = P_s \cdot \sigma\!\left(\left(\vec{f}\, Q_1\right) Q_2 \left(Q_3\, \vec{f}\right)^{T} + Q_s\right), $$

where $P_s, Q_s, Q_1, Q_2, Q_3$ are adjustable parameters and $\sigma$ is the sigmoid activation. The attention matrix $S$ is then softmax normalized as follows:

$$ A_S(i,j) = \frac{\exp\!\left(S_{ij}\right)}{\sum_{j=1}^{n}\exp\!\left(S_{ij}\right)}, $$

where $A_S(i, j)$ represents the correlation degree of sleep nodes $i$ and $j$ in the spatial domain.
After the spatial attention mechanism, the spatial features $\vec{f}_{A_s}$ are updated as follows:

$$ \vec{f}_{A_s} = A_S \cdot \vec{f}_s. $$
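A numerical sketch of this spatial attention step follows, modeled on the ASTGCN-style formulation the parameter list suggests. The parameter shapes and random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n, d = 10, 16                      # toy sizes: 10 channels, 16-dim features
X = rng.normal(size=(n, d))        # node features f
Q1 = rng.normal(size=(d,))         # adjustable parameters (shapes assumed)
Q2 = rng.normal(size=(d, d))
Q3 = rng.normal(size=(d,))
Ps = rng.normal(size=(n, n))
Qs = rng.normal(size=(n, n))

# pairwise channel scores S, then row-wise softmax normalization
S = Ps * sigmoid((X * Q1) @ Q2 @ (X * Q3).T + Qs)
A_s = softmax_rows(S)              # A_s[i, j]: spatial correlation of nodes i, j
f_As = A_s @ X                     # attention-weighted update of spatial features
```

Row-normalizing makes each channel's updated feature a convex combination of all channels' features, weighted by the learned pairwise correlations.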

Temporal Node-Level Graph Attention Sleep Network
EEG is a typical non-stationary time signal with strong time dependence between adjacent sleep stages. A gated recurrent unit (GRU) is combined with MGANet to capture the temporal dependencies and spectral information of the current node, and an attention mechanism is used to capture the dynamic temporal information between sleep stages.

The RNN represents the input epoch sequence as $\{x_1, x_2, \cdots, x_n\}$, recursively combining the time-domain features extracted at the previous moment with those of the current moment, thereby dynamically updating the output. The time-frequency domain features of the current node and the nonlinear function combining the two information sources are denoted by $\vec{f}_t$ and $r(\cdot,\cdot)$, respectively. The GRU update procedure is as follows:

$$ z_t = \sigma\!\left(W_z\left[h_{t-1}, x_t\right]\right), \quad r_t = \sigma\!\left(W_r\left[h_{t-1}, x_t\right]\right), \quad \tilde{h}_t = \tanh\!\left(W_h\left[r_t \odot h_{t-1}, x_t\right]\right), \quad h_t = \left(1 - z_t\right)\odot h_{t-1} + z_t \odot \tilde{h}_t. $$

The GRU network unit structure is shown in Figure 3.
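The recursive update can be sketched as a minimal GRU cell run over a toy epoch sequence. The standard GRU gate equations are used; all sizes and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h = 6, 5                    # toy feature / hidden sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# standard GRU cell parameters (random for illustration, untrained)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(d_h, d_x + d_h)) for _ in range(3))

def gru_step(h_prev, x):
    xh = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ xh)                                     # update gate
    r = sigmoid(Wr @ xh)                                     # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                    # blend old and new

# run over a toy epoch sequence, recursively combining past and present
seq = rng.normal(size=(10, d_x))
h = np.zeros(d_h)
for x in seq:
    h = gru_step(h, x)
f_t = h                            # temporal feature of the current node
```

The final hidden state `f_t` plays the role of the temporal feature $\vec{f}_t$ that the temporal attention mechanism later reweights.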


Temporal Domain Attention
To better extract the temporal- and frequency-domain sleep features, a temporal attention mechanism is added to the RNN temporal sleep network, as follows:

$$ E = P_v \cdot \sigma\!\left(\left(\vec{f}_t^{\,(r-1)} P_1\right) P_2 \left(P_3\, \vec{f}_t^{\,(r-1)}\right)^{T} + Q_v\right), $$

where $P_v, Q_v, P_1, P_2, P_3$ are the adjustable parameters of the model and $\vec{f}_t^{\,(r-1)}$ is the input feature vector after the convolution of the $r$-th layer. The final time-domain feature expression is obtained after softmax normalization, as follows:

$$ A_T(i,j) = \frac{\exp\!\left(E_{ij}\right)}{\sum_{j=1}^{n}\exp\!\left(E_{ij}\right)}, $$

where $A_T(i, j)$ represents the temporal correlation degree of sleep nodes $i$ and $j$. After the temporal attention mechanism, the model updates the features to $\vec{f}_{A_t}$ as follows:

$$ \vec{f}_{A_t} = A_T \cdot \vec{f}_t. $$

The calculation of the MGANet network layer is presented in pseudo-code in Appendix A.3.

Stage-Level Attention Transitional Estimator
The immediate task by the transitional stage estimator (TSE) mainly comprises two parts: spatial graph features and temporal contextual relationships. It is challenging to classify transitional epochs because of the easily confused features of multiple stages. Therefore, we extracted the sleep features of the transition stage through EEG-learnable hidden relationships, and explicitly proposed an attention-guided transitional stage estimator.
TSE firstly aggregates the spatial and temporal features as follows:

$$ \vec{f} = \vec{f}_{A_s} \,\|\, \vec{f}_{A_t}. $$

Then the attention vector is represented as the features of a sleep transition stage at each epoch, and the softmax function is used to calculate the sleep stage probability distribution $P_f$ as follows:

$$ P_f = \mathrm{softmax}\!\left(F_C\!\left(G\!\left(\vec{f}\right)\right)\right), $$

where $F_C$ indicates the fully connected layer used for decision-making and $G$ indicates the global average pooling operation. $P_f$ represents the confidence of each sleep stage, and the class probabilities jointly carry the unknown stage information. As latent feature information, the sleep stage probability is passed through a linear or non-linear function, such as the sigmoid, to represent its attention vector as follows:

$$ \vec{v} = \mathrm{sigmoid}\!\left(P_f\right). $$

Finally, the class probabilities and their attention vector are combined by the Hadamard product as follows:

$$ \vec{f}^{\,*} = P_f \odot \vec{v}. $$

The obtained $\vec{f}^{\,*}$ contains epoch features and transitional fusion information. The TSE's network structure is shown in Figure 4.
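The TSE pipeline above can be sketched end to end. Concatenation for the aggregation step, the toy sizes, and the random fully connected weights are assumptions for illustration; the five outputs correspond to the five sleep stages.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_nodes, d, n_stages = 10, 16, 5        # toy sizes; 5 stages (W, N1, N2, N3, REM)
f_As = rng.normal(size=(n_nodes, d))    # spatial graph features
f_At = rng.normal(size=(n_nodes, d))    # temporal features

f = np.concatenate([f_As, f_At], axis=1)   # aggregate spatial + temporal
g = f.mean(axis=0)                         # G: global average pooling over nodes
W_fc = rng.normal(scale=0.1, size=(n_stages, g.size))  # F_C decision layer
P_f = softmax(W_fc @ g)                    # stage probability distribution

att = sigmoid(P_f)                         # stage-level attention vector
f_star = P_f * att                         # Hadamard product fuses both
```

Because the sigmoid is monotonic, the Hadamard product sharpens confident stages while suppressing the easily confused, low-confidence ones, which is the intended effect on transitional epochs.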



Spatial Multi-Layer Graph Attention Convolution
We extend the definition of spatial graph attention convolution to nodes with multiple channels to effectively solve the problem of overly large or small weightings caused by the uneven importance distribution of adjacent nodes. After multi-head graph attention sleep feature learning, MGANet extracts the sleep-related spatial features $\vec{f}_s$; the spatial multi-layer graph architecture is shown in Figure 5.


Classification Performance
Our model is verified on the ISRUC [31,32] and SHHS [33,34] datasets. The obtained model performance is shown in Table 1. The dataset introduction and baseline settings are given in Appendices A.1 and A.2. Table 1 shows the performance comparison against several baselines. The proposed model performs best on both datasets: the accuracy, MF1, and Kappa were 0.825, 0.814, and 0.775 on the ISRUC dataset, and 0.873, 0.801, and 0.827 on the SHHS dataset, respectively. We bold the optimal result and underline the second-best result.
After comparing the deep learning methods combining CNN and RNN, we found that they cannot completely characterize the spatial location information of EEG, and that their extracted features do not include the connections between brain regions. As shown in Table 1, the performance of traditional methods using CNN and RNN needs improvement. Although existing graph convolutional networks can fully extract the temporal context and spatial features, their spatial feature extraction does not weight each adjacent feature, so some features become excessively important or are neglected, creating differences between the extracted features and the real signal expression. The comparative test shows that the proposed MGANet can extract the spatial location information of EEG signals and fully considers the importance of the interactions between adjacent channels, achieving state-of-the-art performance.
For each sleep stage, our model achieved the highest F1-score in the Wake, N1, and N2 stages, and the second highest in the N3 and REM stages. Thus, for the stages most commonly confused during transitions, namely Wake and N1, the proposed TSE achieves the most advanced level, with N1 performing 2% better than the second-best model.

Number of Parameters and Training Duration
We compared the number of training parameters and the per-epoch training duration of several advanced models and MGANet in Table 2. As can be seen from Table 2, the number of parameters of the proposed model is significantly smaller than that of the other models, and its training duration is also shorter. Although the model establishes a complex graph channel structure, its parameter count is 1.5 × 10^5, which is on the same order of magnitude as other state-of-the-art methods. Furthermore, the number of parameters affects the complexity and robustness of the model; the experiments show that the proposed model achieves better results at the same parameter magnitude.
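A parameter budget of this kind can be tallied layer by layer. The sketch below shows the accounting style only; the layer shapes are hypothetical and do not reproduce the paper's reported 1.5 × 10^5 count.

```python
import numpy as np

# Rough per-layer parameter tally (illustrative shapes, not the paper's):
layers = {
    "gat_heads": 8 * (64 * 32 + 2 * 32),  # 8 heads: W (64x32) + attention vec a (2*32)
    "gru":       3 * 32 * (32 + 32),      # 3 gates, hidden 32, input 32
    "tse_fc":    64 * 5 + 5,              # decision FC to 5 stages (+ bias)
}
total = sum(layers.values())
order = int(np.floor(np.log10(total)))    # order of magnitude of the count
```

Comparing `order` across models is what "the same order of magnitude" claims in Table 2 amount to; halving hidden sizes roughly quarters the dominant matrix terms.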

Hypnograms
To visualize the classification performance of the proposed model, we compared the sleep classification results of MGANet with hypnograms manually scored by sleep experts, as shown in Figure 6. The differences between MGANet and expert manual scoring occur mostly in the N1 and REM stages. Although the proposed model effectively improved the classification accuracy of the N1 stage, the class imbalance of the N1 stage still causes its accuracy to drop. Figure 6 shows that, over 800 epochs, MGANet correctly classifies most sleep stages and has high accuracy in each transition stage. In the first 20 epochs, the model mistakenly confuses the N1 and W stages, which is consistent with the human sleep pattern, because humans usually cycle repeatedly between waking and light sleep during this period. Around epoch 330, the model confuses REM with N2, a period of dreaming. Some incorrect results are understandable because of the confusable nature of the transition stages, which mix the characteristics of multiple sleep stages. In fact, sleep experts also score these stages inaccurately when scoring manually. Most errors are misjudgments of N2 as N1, or REM as N2, which are also widespread inconsistencies in the other baseline models.


Ablation Experiments Results
In order to compare the contribution of each module of the proposed model, we designed the following ablation experiments. The variable settings and experimental results are shown in Table 3 and Figure 7. The baseline is the spatial multi-layer graph attention convolution module; variable A is the RNN convolution, variable B is the spatial-temporal attention, and variable C is TSE. Table 3 shows that the baseline alone, which extracts only the non-Euclidean spatial features of sleep, performs poorly, with an accuracy of only 80.9%. After the RNN is added, the model extracts time-dependent features well, and performance improves slightly to 81.5%. However, when spatial attention alone is added to the baseline, performance deteriorates compared to the baseline; this may be because time-domain features are still missing and the number of network layers is small. We bold the optimal result and underline the second-best result.

Figure 7 shows eight confusion matrices obtained from the ablation experiments, demonstrating the effectiveness of the proposed model in improving the classification of transition stages. In the past decades, many researchers have discussed the problem of stage-class imbalance [35] in sleep datasets, which seriously affects the classification performance of models. The transitional N1 stage accounts for only about 5-10% of a sleep cycle, while REM accounts for about 20-25% [36]. Dataset class imbalance tends to exacerbate the confusion in the transition stages. For this reason, traditional methods use oversampling and undersampling to expand or reduce the amount of data to achieve balance, but such methods can easily cause overfitting [37]. In addition, some researchers have used a weighted cross-entropy loss function to balance the data: when the model performance is poor, learning speeds up; conversely, the learning rate decreases, and model performance cannot be further improved.

Because the transitional stages of sleep mix the characteristics of multiple stages, sleep staging is difficult. The current experiments show that model performance improves significantly after TSE is added, which also proves helpful for the accurate classification of the confused stages. Finally, the results show that MGANet with variables A, B, and C added simultaneously performs best; its accuracy, F1-score, and kappa were 82.5, 81.4, and 77.5, respectively. Suboptimal results can be obtained using the RNN and transition stage estimation methods.

Heads Numbers
We also investigated the number of heads of the proposed multi-head attention, evaluating the accuracy and F1-score for h = 2, 4, 8, 16, and 20 to find the optimal number of heads, as shown in Figure 8. h = 8 achieved the highest accuracy and F1-score, 82.5 and 81.4, respectively. In Figure 8, the number of heads of MGANet represents the criterion for distinguishing the features of center nodes: target features are extracted from multiple angles to avoid feature redundancy or overly narrow features. When the number of attention heads is small, such as h = 2 or h = 4, the accuracy of the model is only 80.3% and 80.7%, respectively, because each head aggregates too few features, rendering the multi-head attention mechanism ineffective. When the number of heads is larger, such as h = 16 and h = 20, performance gradually decreases, which may be caused by the increase in aggregated features leading to over-fitting. The number of heads determines the influence of adjacent channels on the weighting of the current channel, and selecting an appropriate number achieves the best model performance.


Training Speed
We compared the training speed of the proposed model with STGCN [17] and MSTGCN [27] over 80 epochs; the experimental results are shown in Figure 9. The proposed model converges faster than the others, and its loss is closer to 0. Its loss decreases fastest during the first 10 epochs and reaches the lowest value of the three methods. Based on Figure 9, we consider that the multi-head graph attention convolution used by MGANet aggregates features, computes the corresponding weights, and feeds the most refined features into training; its running time and efficiency are therefore better than those of the other methods.

Attention Channel Visualization
We visualized the graph that MGANet constructs from nodes and edges over the brain, as shown in Figure 10. A graph structure is built on the human brain: MGANet relates the current node to its neighbors to establish adjacency, extracts dynamic spatial features, and displays the schematic diagram of multi-orientation construction. Figure 10 shows that multiple neighbor nodes are aggregated through adjacent edges carrying information to form the feature representation of the current node. For a central node, the non-Euclidean distances and weights of different neighbor nodes differ significantly, so their mutual influence must be modeled explicitly. This dynamic feature update is implemented naturally in MGANet.
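The neighbor aggregation described above can be illustrated with a minimal attention-weighted aggregation over a hypothetical channel adjacency. Similarity-based coefficients stand in for the network's learned ones:

```python
import numpy as np

def attention_aggregate(x, adj):
    """Form each central node's representation from its neighbors.

    x   : (n, d) channel features
    adj : (n, n) binary adjacency; self-loops keep each node's own feature.
    Attention coefficients here are a softmax over raw feature similarity,
    standing in for the learned coefficients in the real network.
    """
    adj = adj + np.eye(len(adj))                     # add self-loops
    scores = x @ x.T                                 # pairwise similarity
    scores = np.where(adj > 0, scores, -np.inf)      # mask non-neighbors
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over neighbors
    return alpha @ x                                 # weighted aggregation

x = np.random.default_rng(1).standard_normal((4, 8))   # 4 channels, 8-dim features
adj = np.array([[0, 1, 1, 0],                          # hypothetical channel graph
                [1, 0, 1, 0],
                [1, 1, 0, 0],
                [0, 0, 0, 0]])                         # channel 3 has no neighbors
out = attention_aggregate(x, adj)
```

Masking by the adjacency matrix ensures that only genuinely adjacent channels influence a node: an isolated node (channel 3 above) simply keeps its own feature.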


Conclusions
In this paper, we proposed a novel multi-layer spatial graph attention convolutional network for sleep staging. The proposed method uses graph attention convolution to process non-Euclidean sequences in the spatial dimension and an RNN to process spectral information and temporal dependence in the time dimension, while the attention mechanism leads the model to assign the most appropriate weights to adjacent channels. The proposed transitional stage estimator adapts the model to the transition stages of sleep and addresses the poor classification accuracy caused by their confusion. The proposed MGANet not only outperforms prior work but also achieves state-of-the-art performance on ISRUC-S3 and SHHS.
Most EEG research has been carried out in a task-driven manner and is therefore largely limited by the availability of data labels. Labeled datasets are difficult and expensive to obtain; hence, the proposed model cannot yet generalize well and is not currently suitable for clinical diagnosis. In future research, we will build semi-supervised and unsupervised models that use easily obtained unlabeled data to classify sleep stages and improve generalization ability.

Acknowledgments:
The authors wish to thank the editor and the anonymous reviewers for their valuable suggestions for improving this paper. This work was supported in part by the National Natural Science Foundation of China (No. 61673222) and the Jiangsu Postgraduate Practical Innovation Program (No. SJCX22_0333).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Sleep Heart Health Study (SHHS): SHHS is a multicenter cohort study of the cardiovascular and other consequences of sleep-disordered breathing. Subjects had a variety of medical conditions, including pulmonary, cardiovascular, and coronary heart disease. Our study used the second part of the SHHS, comprising two EEG channels, two EOG channels, and one EMG channel recorded over 2651 nights. To minimize the impact of these disorders, we selected subjects considered regular sleepers (i.e., apnea-hypopnea index (AHI) < 5) based on previous studies [23]. In the end, we selected 329 subjects, 52.8% of whom were female.
ISRUC-S3: ISRUC-S3 contains 126 recordings of 118 unique subjects recorded between 2009 and 2013 at the Center for Sleep Medicine, University Hospital, Coimbra, Portugal. The third subset, which we used, contains recordings of physiological signals from ten healthy subjects, nine male and one female. Sleep stages and sleep events in each PSG were scored by two technicians according to the AASM guidelines, dividing each recording into 30 s epochs. The ISRUC dataset contains six EEG channels, two EOG channels, three EMG channels, and one ECG channel. For both datasets, we applied the following preprocessing steps: (1) we excluded any unknown stages that did not belong to any sleep stage; (2) we merged stages N3 and N4 into one stage (N3) according to the AASM criteria; (3) we kept only 30 min of wakefulness before and after sleep to increase the focus on sleep stages.
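The three preprocessing steps can be sketched on a per-epoch label sequence. The label strings and the 'UNK' marker below are hypothetical, as the datasets use their own codes:

```python
def preprocess(labels, wake_epochs=60):
    """Apply the three steps above to a list of per-30 s-epoch stage labels.

    (1) drop unknown stages, (2) merge N4 into N3 per AASM,
    (3) keep at most 30 min of wake (60 epochs of 30 s) around sleep.
    """
    labels = [s for s in labels if s != 'UNK']               # step 1
    labels = ['N3' if s == 'N4' else s for s in labels]      # step 2
    first = next(i for i, s in enumerate(labels) if s != 'W')
    last = len(labels) - 1 - next(
        i for i, s in enumerate(reversed(labels)) if s != 'W')
    return labels[max(0, first - wake_epochs):last + 1 + wake_epochs]  # step 3

# A toy night: long wake runs flank a short sleep segment.
night = ['W'] * 100 + ['N1', 'N4', 'UNK', 'N2', 'REM'] + ['W'] * 100
epochs = preprocess(night)
```

After preprocessing, at most 60 wake epochs remain on each side of the sleep segment, and the N4 and unknown labels are gone.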
The experimental configuration adopted in this paper is as follows: the hardware environment is an Intel Xeon Silver 4116 CPU, 16 GB of memory, and an NVIDIA RTX 3060 Ti GPU; the software environment is Windows 10 and Python 3.8, with TensorFlow 1.14.0 as the deep learning framework. The proposed model was compared with six baseline models; for fairness of comparison, data preprocessing and evaluation settings were kept consistent with the baselines. Specifically, the experiments were evaluated on multiple channels using 10-fold cross-validation, with 20% of the training set held out as the test set. The parameter settings used in the experiments are listed in Table A1. We compared the proposed model with the other advanced models in terms of accuracy, F1-score, and Kappa. The baselines include [25]: a multi-resolution CNN with adaptive feature recalibration combined with a multi-head self-attention mechanism for sleep classification; Alexander et al., 2021 [26]: sleep classification in a mixed-cohort setting using a deep residual network; and Jia et al., 2021 [27]: a multi-view spatial-temporal graph convolutional network with domain generalization for sleep stage classification.
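The evaluation protocol can be sketched as a generic subject-level 10-fold split. This is a sketch of the stated protocol, not the authors' exact partitioning code:

```python
import numpy as np

def kfold_splits(n_subjects, k=10, seed=0):
    """Yield (train, test) subject-index arrays for k-fold cross-validation.

    Subjects are shuffled once, split into k folds, and each fold serves
    as the held-out test set exactly once.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_subjects), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

splits = list(kfold_splits(20, k=10))   # e.g. 20 subjects, 10 folds
```

Splitting at the subject level (rather than at the epoch level) prevents epochs from the same recording from leaking between the training and test sets.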