A Siamese Network-Based Method for Improving the Performance of Sleep Staging with Single-Channel EEG

Sleep staging is of critical significance to the diagnosis of sleep disorders, and the electroencephalogram (EEG), which is used for monitoring brain activity, is commonly employed in sleep staging. In this paper, we propose a novel Siamese network-based method for improving the performance of sleep staging models that use single-channel EEG. Our proposed method consists of a Siamese network architecture and a redesigned loss with distance metrics. Two encoders are used in the Siamese network to generate latent features of the EEG epochs, and the contrastive loss, which is also a distance metric, is used to measure the similarity or difference between EEG epochs from the same or different sleep stages. We evaluated our method on single-channel EEGs from different channels (Fpz-Cz and F4-EOG (Left)) from two public datasets, SleepEDF and MASS-SS3, and achieved an overall accuracy, MF1 and Cohen’s kappa coefficient of 85.2%, 78.3% and 0.79 on SleepEDF and 87.2%, 82.1% and 0.81 on MASS-SS3. The results show that our method can significantly improve the performance of sleep staging models and outperform the state-of-the-art sleep staging methods. The performance of our method also confirms that the features captured by Siamese networks and distance metrics are useful for sleep staging.


Introduction
The importance of sleep to the physical and mental health of humans cannot be overstated [1]. A growing number of people suffer from sleep disorders such as insomnia, sleep apnea syndrome and narcolepsy [2]. In order to measure sleep quality and evaluate sleep-related diseases, sleep monitoring and analysis are becoming increasingly relevant [3].
The quality of sleep is evaluated by sleep experts using polysomnograms (PSGs), which include electroencephalograms (EEGs), electrooculograms (EOGs), electromyograms (EMGs), and electrocardiograms (ECGs). To evaluate sleep quality, sleep staging segments a PSG into 30-second epochs and then categorizes the epochs into sleep stages according to a sleep scoring standard, either that of the American Academy of Sleep Medicine (AASM) [4] or that of Rechtschaffen and Kales (R&K) [5]. The sleep stages are divided into awake (W), non-rapid eye movement (NREM) and rapid eye movement (REM). The NREM stage is further divided into stages S1, S2, S3 and S4 according to R&K and into stages N1, N2 and N3 according to AASM. For a long time, sleep staging relied on sleep experts to perform this labor-intensive and time-consuming task manually [6].
Many studies have focused on using machine learning models, such as support vector machines (SVMs) [7] and random forests (RFs) [8], to automatically classify EEG epochs into their corresponding stages. These methods require carefully designed, non-redundant and representative features, which are extracted manually using algorithms such as empirical mode decomposition [9] and wavelet transform [10]. The features, instead of the raw EEG signals, are then sent to the machine learning models for classification. However, since the features are designed based on specific datasets, these methods may not generalize to different sleep datasets.
With the recent advancement of deep learning, various researchers have attempted to develop deep learning techniques based on EEG signals to aid in classifying sleep stages automatically. The existing deep learning models for sleep staging are primarily composed of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) or attention mechanisms [11]. The general approach is to use CNNs to extract time and frequency information and other time-invariant features, followed by RNNs to learn temporal dependencies such as stage transition rules. The authors of [12] constructed a model using two CNNs with different filter sizes and two layers of bidirectional LSTMs. The architecture proposed in [13] begins with multi-resolution CNNs, followed by a temporal context encoder with multi-head attention.
The shortcomings of manually designed features can be effectively overcome by deep learning methods. However, we find that patterns representing differences or similarities between two epochs, which are overlooked by existing deep learning models, can be extracted by Siamese networks. Figure 1 shows EEG epochs in the time domain and STFTs of the signal. It can be seen from the figure that there are significant differences in EEG epochs between different sleep stages in both the time domain and the frequency domain. To extract these features, we designed a new architecture. Siamese networks consist of two identical branches that share parameters and take a pair of inputs (here, a pair of sleep epochs); they have been widely used and have achieved great success in natural language processing (NLP) [14] and object detection [15]. In the field of sleep staging, Siamese networks can encode similarity and dissimilarity patterns between two epochs. The features extracted by Siamese networks are then sent to sleep staging models to improve the performance of classification models. In order to achieve better performance in sleep staging, we propose a method based on Siamese networks, contrastive loss, Euclidean distance and max-sliced Wasserstein distance [16,17] for automatic sleep staging using single-channel EEG.
In this research, we provide a novel approach to enhance the performance of sleep staging models. The primary contributions of our study are as follows: (1) We propose a Siamese network architecture and implement Siamese convolutional neural networks (Siamese CNNs) and Siamese autoencoders (Siamese AEs). The Siamese network architecture is used to extract features of similarity between two different epochs of EEG. (2) To train sleep staging models with a Siamese network, we developed a new loss function. The new loss function incorporates metrics for traditional sleep staging models such as cross entropy loss and distance metrics for measuring the distance between two distributions. (3) We evaluated our model on two public datasets, SleepEDF and MASS-SS3, and compared our results with baseline sleep staging models and some SOTA sleep staging methods. The results show that our encoder can greatly improve the accuracy of the baseline model and outperform the state-of-the-art models in sleep staging.
The rest of this paper is organized as follows. Section 2 shows the datasets and explains the methods by illustrating the architecture of Siamese AEs and Siamese CNNs, the loss function and the training process of our model. Section 3 shows the experimental results and the comparison against the baseline models. Section 4 draws the conclusion of our work.
Biomedicines 2023, 11, x FOR PEER REVIEW

Datasets
We used two different EEG channels from two public datasets, the Montreal Archive of Sleep Studies (MASS) [18] and SleepEDF [19], respectively.
SleepEDF was obtained from PhysioBank and contains two subsets, Sleep Cassette (SC) and Sleep Telemetry (ST). The SC subset comes from a study of age effects on sleep in healthy participants, while the ST subset comes from research on the effects of temazepam on sleep. Each PSG in this dataset contains two EEG channels (Fpz-Cz and Pz-Oz) with a sampling rate of 100 Hz, one chin EMG channel, one EOG channel and one oro-nasal respiration signal. Here, Fpz-Cz and Pz-Oz represent the positions of the electrodes. Each 30-second epoch of recordings was labeled by sleep experts based on the R&K standard. In our experiments, we utilized the Fpz-Cz EEG channel in the SC subset to train and evaluate our model. Following previous work [20,21,22,23], we merged the R&K stages S3 and S4 into a single N3 stage and excluded epochs labeled MOVEMENT and UNKNOWN.
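The relabeling step described above can be sketched as a simple lookup. This is a minimal illustration, not the authors' preprocessing code: the label strings (`"S1"`, `"MOVEMENT"`, and so on) and the helper name `relabel_epochs` are assumptions, since the actual annotation strings depend on how the dataset files are read.

```python
# Hypothetical R&K label names; real annotation strings depend on the dataset files.
RK_TO_AASM = {
    "W": "W",
    "S1": "N1",
    "S2": "N2",
    "S3": "N3",
    "S4": "N3",  # S3 and S4 are merged into a single N3 stage
    "REM": "REM",
}
EXCLUDED = {"MOVEMENT", "UNKNOWN"}


def relabel_epochs(rk_labels):
    """Return (kept_indices, aasm_labels) after merging stages and dropping excluded epochs."""
    kept, mapped = [], []
    for index, label in enumerate(rk_labels):
        if label in EXCLUDED:
            continue  # epochs scored as movement or unknown are discarded
        kept.append(index)
        mapped.append(RK_TO_AASM[label])
    return kept, mapped


kept, mapped = relabel_epochs(["W", "S1", "MOVEMENT", "S3", "S4", "REM", "UNKNOWN"])
```

The returned indices make it possible to drop the matching 30-second signal segments alongside their labels.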
The MASS database has five subsets, SS1-SS5, arranged according to their study and acquisition protocols; the subset we utilized was SS3. Compared with SleepEDF, MASS-SS3 contains many more EEG channels: each recording in SS3 is composed of 20 EEG channels, 3 EMG channels, 2 EOG channels and one ECG channel. The EEG and EOG recordings share a sampling rate of 256 Hz. Both were pre-processed with a 60 Hz notch filter and then band-pass filtered at 0.30-100 Hz (EEG) and 0.10-100 Hz (EOG). The recordings were classified by sleep specialists into one of five stages, in accordance with the AASM standard. We selected the F4-EOG (Left) channel to train and evaluate our model. Here, F4-EOG (Left) indicates that the electrodes are placed close to the outer lower canthus of the left eye and the hairline.
Both the SleepEDF and MASS datasets contain long periods of stage W in the recordings. However, too many or too few W stages in the training set may lead to a class imbalance problem. To avoid this issue and emphasize sleep data rather than wake data, we only used 30 min of wake periods before and after the sleep process. Table 1 shows the number of EEG epochs of each sleep stage in the two datasets.
In SleepEDF, there were 20 participants in total, with 10 males and 10 females. The ages of participants ranged from 25 to 34, with a mean of 28.7 and a standard deviation of 2.9. There were more subjects in MASS-SS3, containing 28 males and 34 females, and a wider age range of 20 to 69. The mean and standard deviation of the ages were 42.5 and 18.9, respectively.
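The wake-trimming rule (keeping only 30 min of wake before and after the sleep period) can be sketched as follows. This is an illustrative sketch only: the function name `trim_wake`, its parameters, and the synthetic label sequence are assumptions and are not taken from the original implementation.

```python
def trim_wake(labels, epoch_sec=30, keep_min=30, wake="W"):
    """Return (start, end) slice bounds that keep at most `keep_min` minutes
    of wake before the first and after the last non-wake epoch."""
    sleep_idx = [i for i, label in enumerate(labels) if label != wake]
    if not sleep_idx:
        return 0, 0  # recording contains no sleep at all
    margin = keep_min * 60 // epoch_sec  # 30 min -> 60 epochs of 30 s
    start = max(0, sleep_idx[0] - margin)
    end = min(len(labels), sleep_idx[-1] + 1 + margin)
    return start, end


# Synthetic hypnogram: 200 wake epochs, 200 sleep epochs, 150 wake epochs.
labels = ["W"] * 200 + ["N1", "N2", "N3", "REM"] * 50 + ["W"] * 150
start, end = trim_wake(labels)
trimmed = labels[start:end]
```

The same slice bounds would be applied to the signal epochs so that signals and labels stay aligned.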

Siamese Networks
We proposed two Siamese networks with different architectures. One was Siamese CNNs; the other was Siamese AEs. Figure 2a,b shows the architecture of Siamese CNNs and Siamese AEs respectively.
For Siamese CNNs, we employed two-branched CNNs to encode the input pair of EEG epochs, as shown in Figure 2a. According to previous studies [12,24], convolution with small filter sizes is better for extracting temporal information. Convolution with a larger filter size is better for extracting frequency information from EEG. Moreover, different filter sizes can also capture features from various frequency bands [13]. Therefore, the filter (kernel) sizes of the two CNN branches were different.
A pair of EEG signals with the same or different labels was the input to our Siamese CNNs. The details of EEG data pairs will be further explained in Section 2.4. Each Conv1D block was composed of 1D-convolution, batch normalization [25] and rectified linear unit activation (ReLU). The kernel size, number of filters and stride size are shown in Figure 2a. The MaxPooling block shows the pooling size and the stride of the pooling layer, which downsamples the input with a max operation. The dropout layer sets neurons in the neural network to zero with a probability of 0.5.
For Siamese AEs, the encoder consists of convolutional layers and the decoder is formed of transposed convolutional layers. Transposed convolution, by mapping smaller feature maps into larger ones, is an upsampling technique, rather than the reverse of convolution. The encoder creates a latent representation EN_L of the input EEG, which is the output of the encoder. The decoder reconstructs the input EEG. DE_L is the output of the decoder.
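The remark that transposed convolution upsamples without inverting convolution can be illustrated with a small single-channel example. This is a plain-Python sketch under assumed toy values (kernel `[0.5, 0.5]`, stride 2, no padding); it is not the layer configuration used in the paper, which is given in the figure.

```python
def conv1d(x, kernel, stride):
    """Valid 1D convolution (cross-correlation form), no padding."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return [sum(x[i * stride + j] * kernel[j] for j in range(k)) for i in range(n_out)]


def conv_transpose1d(z, kernel, stride):
    """1D transposed convolution: each input value scatters a scaled copy of the kernel."""
    k = len(kernel)
    out = [0.0] * ((len(z) - 1) * stride + k)
    for i, value in enumerate(z):
        for j in range(k):
            out[i * stride + j] += value * kernel[j]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.5, 0.5]
z = conv1d(x, kernel, stride=2)                # downsampled to length 4
x_up = conv_transpose1d(z, kernel, stride=2)   # upsampled back to length 8
```

Here `x_up` has the same length as `x` but different values, which is why the decoder has to be trained to reconstruct the input rather than simply undoing the encoder.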

Loss Functions for Measuring Similarity
We propose a novel loss function to measure the performance of the classification model as well as the Siamese network. As the Siamese network consists of two neural networks with shared parameters and architecture, which map a pair of inputs into latent features, it is trained with a distance-based similarity measure known as the contrastive loss [26]. The distance should be as small as possible if the input pair is obtained from the same sleep stage. In contrast, the distance should be as large as possible when the input pair comes from different sleep stages.
We define the distance function $D_W$ between the inputs $x_1$ and $x_2$ as the Euclidean distance between the outputs of a branch of the Siamese network $G_W$:
$$D_W(x_1, x_2) = \lVert G_W(x_1) - G_W(x_2) \rVert_2 \tag{1}$$
The contrastive loss of the Siamese network is shown in Equation (2):
$$CT_{siam} = \sum_{i} L\left(W, (y, x_1, x_2)^i\right) \tag{2}$$
The contrastive loss is calculated by adding up the distance metric of EEG data pairs, where $(y, x_1, x_2)^i$ is the i-th data pair. The distance metric of a single data pair, in the standard contrastive form of [26], is shown in Equation (3):
$$L\left(W, (y, x_1, x_2)^i\right) = y \, \frac{1}{2} \left(D_W^i\right)^2 + (1 - y) \, \frac{1}{2} \left\{\max\left(0, m - D_W^i\right)\right\}^2 \tag{3}$$
In Equation (3), $y$ is the label of the EEG data pair, which represents whether the two EEG epochs belong to the same sleep stage: $y$ equals 1 if they do and 0 if they do not. The hyperparameter $m$ ($m > 0$) is a threshold that measures the dissimilarity of the data pair and is determined by experiments.
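As a worked example of the contrastive loss described above, the sketch below uses the standard contrastive form with the label convention of this paper (y = 1 for the same stage, y = 0 for different stages). The factor 1/2 and the helper names are assumptions; the encoder outputs are toy two-dimensional vectors rather than real latent features.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two encoder outputs."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def pair_loss(y, d, m):
    """y = 1 (same stage): penalize the distance itself.
    y = 0 (different stages): penalize only distances below the margin m."""
    return y * 0.5 * d ** 2 + (1 - y) * 0.5 * max(0.0, m - d) ** 2


def contrastive_loss(pairs, m):
    """pairs: iterable of (y, g1, g2), where g1 and g2 are the two branch outputs."""
    return sum(pair_loss(y, euclidean(g1, g2), m) for y, g1, g2 in pairs)


pairs = [
    (1, [0.0, 0.0], [3.0, 4.0]),  # same stage, distance 5: 0.5 * 25 = 12.5
    (0, [0.0, 0.0], [3.0, 4.0]),  # different stages, margin 7: 0.5 * (7 - 5)^2 = 2.0
    (0, [0.0, 0.0], [6.0, 8.0]),  # different stages, distance 10 exceeds the margin: 0.0
]
total = contrastive_loss(pairs, m=7.0)
```

Pairs from different stages stop contributing once their distance exceeds the margin, so the margin must be chosen on the scale of the encoder outputs.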
The loss function of our baseline sleep staging model is multi-class cross-entropy (CE), which is defined in Equation (4):
$$CE = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_i^k \log\left(\hat{y}_i^k\right) \tag{4}$$
Here, $N$ is the number of EEG epochs and $K$ is the number of classes. $y_i^k$ is the real label of the i-th EEG epoch for class $k$. Similarly, $\hat{y}_i^k$ is the predicted probability of the i-th EEG epoch for class $k$.
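A small numerical example may help make the multi-class cross-entropy concrete. The sketch below sums over epochs and classes; deep learning frameworks commonly also divide by the number of epochs, so the lack of averaging here is an assumption, as are the toy probabilities.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Multi-class cross-entropy summed over epochs.
    y_true: one-hot rows; y_prob: rows of predicted class probabilities."""
    total = 0.0
    for truth, prob in zip(y_true, y_prob):
        for y_k, p_k in zip(truth, prob):
            if y_k:
                total -= y_k * math.log(max(p_k, eps))  # clamp to avoid log(0)
    return total


# Two epochs, five classes (W, N1, N2, N3, REM); values are illustrative only.
y_true = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
y_prob = [[0.7, 0.1, 0.1, 0.05, 0.05], [0.1, 0.1, 0.6, 0.1, 0.1]]
ce = cross_entropy(y_true, y_prob)
```

Only the probability assigned to the true class of each epoch contributes, so `ce` equals `-(log 0.7 + log 0.6)` for this input.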
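For Siamese CNNs, the loss is $Loss_{CNN} = CE_{model} + CT_{siam}(u, v)$, where $(u, v)$ is the input pair of EEG epochs. Considering the encoding and reconstructing characteristics of autoencoders, the loss of our Siamese AEs has an additional component compared with Siamese CNNs, which is the distance metric between the input EEG epoch and the output of the decoder. This distance metric is composed of the max-sliced Wasserstein distance (max-W distance) and the Euclidean distance. The max-W distance is calculated from Equation (5):
$$\text{max-}W(u, v) = \max_{\omega \in \Omega} W\left(u^{\omega}, v^{\omega}\right) \tag{5}$$
Here, $\Omega$ represents the set containing all directions on the unit sphere, $u$ and $v$ are the input distributions, $\omega$ is an element in $\Omega$, $u^{\omega}$ and $v^{\omega}$ are the projections of $u$ and $v$ onto the direction $\omega$, and $W$ is the one-dimensional Wasserstein distance. So, the distance metric we propose is $D(u, v) = \lVert u - v \rVert_2 + \text{max-}W(u, v)$, where $\lVert u - v \rVert_2$ represents the Euclidean distance between $u$ and $v$. The loss of our Siamese AEs is shown in Equation (6):
$$Loss_{AE} = CE_{model} + CT_{siam}(EN_1, EN_2) + D(\text{EEG epoch}, DE\_L) \tag{6}$$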
In Equation (6), the EEG epoch is the input of the Siamese AEs, DE_L is the output of the decoder, and $EN_1$ and $EN_2$ are the latent features of the input EEG data pair encoded by the encoder.
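The reconstruction distance (Euclidean distance plus max-sliced Wasserstein distance) can be sketched in plain Python. Several choices below are assumptions made only for illustration: the inputs are treated as equal-sized sets of points, the maximum over directions is approximated by random search over unit vectors, and the one-dimensional distance is the order-1 Wasserstein distance. The paper does not specify these details.

```python
import math
import random


def wasserstein_1d(a, b):
    """Order-1 Wasserstein distance between two equal-size 1-D empirical distributions."""
    return sum(abs(p - q) for p, q in zip(sorted(a), sorted(b))) / len(a)


def random_direction(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def project(points, direction):
    """Project each point onto the given direction."""
    return [sum(c * w for c, w in zip(p, direction)) for p in points]


def max_sliced_wasserstein(u, v, n_directions=200, seed=0):
    """Approximate the max over directions by random search (an assumption)."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_directions):
        w = random_direction(len(u[0]), rng)
        best = max(best, wasserstein_1d(project(u, w), project(v, w)))
    return best


def combined_distance(u, v):
    """Euclidean distance plus (approximate) max-sliced Wasserstein distance."""
    eucl = math.sqrt(sum((a - b) ** 2 for p, q in zip(u, v) for a, b in zip(p, q)))
    return eucl + max_sliced_wasserstein(u, v)


u = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [3.0, 0.0], [2.0, 1.0]]  # u shifted by 2 along the first axis
msw = max_sliced_wasserstein(u, v)
dist = combined_distance(u, v)
```

For this shifted example the exact max-sliced distance is 2, attained along the first axis, so the random search returns a value slightly below 2. In practice the maximizing direction is often found by gradient ascent rather than random search.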

Training and Evaluation Process
There are two stages in our training process: pretraining and finetuning. The baseline sleep staging model is formed of a set of CNNs and a sequence residual learning block [27] with bidirectional long short-term memory networks (bi-LSTMs), which are an extension of long short-term memory (LSTM) [28,29]. The whole model (sleep staging model and Siamese network) is shown in Figure 3.
The input of Siamese networks is a pair of EEG epochs. The EEG epoch data pairs are generated by randomly selecting data pairs with the same or different labels from the training set. The label of a data pair depends on the labels of its two EEG epochs: if they belong to the same sleep stage, the label of the data pair is 1; otherwise, the label is 0.
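The pair-generation rule can be sketched as follows. This is a minimal illustration: uniform random sampling of two distinct epochs, the fixed seed, and the function name `make_pairs` are assumptions, and the sketch does not balance same-stage and different-stage pairs, which a real pipeline might choose to do.

```python
import random


def make_pairs(labels, n_pairs, seed=0):
    """Return a list of (i, j, y), where y = 1 if epochs i and j share a sleep stage, else 0."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(labels)), 2)  # two distinct training epochs
        pairs.append((i, j, 1 if labels[i] == labels[j] else 0))
    return pairs


# Toy training labels; real labels would come from the training folds.
labels = ["W", "N1", "N2", "N2", "N3", "REM", "N2", "W"]
pairs = make_pairs(labels, n_pairs=10)
```

Each tuple indexes the two epochs fed to the two branches, and the third element is the pair label used by the contrastive loss.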
During pretraining, as shown in Figure 3, a pair of EEG epochs is first fed into the Siamese AEs. The encoder of our Siamese AEs maps the inputs into latent features EN_L. Then, each 30 s EEG epoch is encoded by one branch of the Siamese AEs, and the resulting latent representation is concatenated with the original EEG epoch. The concatenated data are then fed into the sleep staging model. The basic model in the pretraining stage is the CNN block, and the output of the decoder, DE_L, is used for the calculation of the loss function. Similarly, for Siamese CNNs, the latent feature vector to be concatenated is the output of the encoder. Since the loss function of Siamese CNNs does not require reconstructing the original input, Siamese CNNs have no decoder.
The sleep staging model used during finetuning builds on the pretrained one. The output of the CNN block, feature_CNN, is fed into a sequence residual learning block. The left branch of the sequence residual block is formed of bi-LSTMs with a hidden size of 512/512 and the right branch is a fully connected layer. The parameters of the Siamese network, the CNN block and the sequence residual block are updated simultaneously with different learning rates. The learning rate of the sequence residual block is larger than that of the other blocks, since the latter have already been pretrained in the former stage.
During testing and evaluation, we used one branch of the Siamese network, since the parameters of the two branches are shared. We applied k-fold cross-validation to evaluate our Siamese network. For the two datasets involved in our experiments, k equals 20 and 31 for SleepEDF and MASS-SS3, respectively.
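The fold counts above are consistent with subject-wise splitting: with 20 SleepEDF participants and k = 20, each fold would hold out one subject, and with 62 MASS-SS3 subjects and k = 31, each fold would hold out two. The sketch below assumes this subject-wise scheme, which the text does not state explicitly; the function name and the contiguous assignment of subjects to folds are also assumptions.

```python
def subject_folds(subject_ids, k):
    """Split subject ids into k contiguous test folds of near-equal size.
    Returns a list of (train_subjects, test_subjects) tuples."""
    n = len(subject_ids)
    folds = []
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        test = subject_ids[start:start + size]
        train = subject_ids[:start] + subject_ids[start + size:]
        folds.append((train, test))
        start += size
    return folds


sleepedf_folds = subject_folds(list(range(20)), k=20)  # one held-out subject per fold
mass_folds = subject_folds(list(range(62)), k=31)      # two held-out subjects per fold
```

Splitting by subject rather than by epoch keeps epochs from the same night out of both the training and test sets of a fold.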

Experimental Setup and Evaluation Metrics
We implemented our model with the PyTorch 1.10.0 toolkit and trained the model on a GeForce RTX 3090 GPU. We utilized the Adam optimizer [30] with an initial learning rate of $10^{-3}$ and a weight decay of $10^{-3}$ in the first training stage. The Siamese network and the CNN block of the classification model are pretrained for 100 epochs. Then, in the finetuning stage, the model is trained for 200 epochs and the learning rate of the pretrained layers is reduced to $10^{-6}$. The learning rate of the sequence residual learning block is the same as in pretraining, which is $10^{-3}$. The model is trained with a batch size of 32 during pretraining. In the finetuning stage, the batch size is set to 16 and the sequence length is set to 25. The threshold $m$ in the contrastive loss function is set to 7400.
We evaluate the performance of our method using precision (PR), recall (RE), per-class F1-score (F1), overall accuracy (ACC), macro-averaging F1-score (MF1) and Cohen's kappa coefficient ($\kappa$) [31]. The per-class metrics $PR_c$, $RE_c$ and $F1_c$ are calculated as follows:
$$PR_c = \frac{TP_c}{TP_c + FP_c}$$
$$RE_c = \frac{TP_c}{TP_c + FN_c}$$
$$F1_c = \frac{2 \times PR_c \times RE_c}{PR_c + RE_c}$$
Here, $TP_c$, $FP_c$, $TN_c$ and $FN_c$ refer to the true positives, false positives, true negatives and false negatives of class $c$. When calculating per-class metrics, the positive class is the current class, and the negative class is defined as the combination of the other classes.
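The per-class metrics can be computed directly from a confusion matrix whose rows are expert labels and whose columns are model predictions. The sketch below uses a made-up 3-class matrix purely for illustration; the numbers are not results from this study.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class.
    cm[r][c] counts epochs with true class r that were predicted as class c."""
    n_classes = len(cm)
    metrics = []
    for c in range(n_classes):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n_classes)) - tp  # predicted c, truly another class
        fn = sum(cm[c]) - tp                               # truly c, predicted another class
        pr = tp / (tp + fp) if tp + fp else 0.0
        re = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
        metrics.append((pr, re, f1))
    return metrics


# Hypothetical 3-class confusion matrix (rows: experts, columns: model).
cm = [[8, 2, 0],
      [1, 6, 3],
      [1, 0, 9]]
metrics = per_class_metrics(cm)
```

For the first class, both precision and recall are 8/10, so its F1-score is also 0.8.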
The overall metrics ACC, MF1 and kappa ($\kappa$) are defined as follows:
$$ACC = \frac{\sum_{c=1}^{C} TP_c}{N}$$
$$MF1 = \frac{\sum_{c=1}^{C} F1_c}{C}$$
$$\kappa = \frac{ACC - p_e}{1 - p_e}, \quad p_e = \frac{\sum_{c=1}^{C} (TP_c + FP_c)(TP_c + FN_c)}{N^2}$$
In the equations above, $C$ is the number of classes, which equals 5 according to the AASM manual, $N$ is the number of epochs in the test set, and $p_e$ is the agreement expected by chance.
Since we evaluated our method using k-fold cross-validation, we summed the results of all folds to calculate the confusion matrices. Each row represents the number of EEG epochs classified by humans, and each column indicates the number of EEG epochs classified by our model. We show the per-class performance (PR, RE, F1) of our models in the last three columns of the table.
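The overall metrics can likewise be derived from a single confusion matrix (for example, the sum of the per-fold matrices). The sketch below uses the usual definition of Cohen's kappa from observed and chance agreement; the 3-class matrix is made up for illustration and is not a result from this study.

```python
def overall_metrics(cm):
    """Return (accuracy, macro F1, Cohen's kappa) for a confusion matrix
    with rows as expert labels and columns as model predictions."""
    n_classes = len(cm)
    n = sum(sum(row) for row in cm)
    tp = [cm[c][c] for c in range(n_classes)]
    row_tot = [sum(cm[c]) for c in range(n_classes)]                               # TP + FN
    col_tot = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]  # TP + FP
    acc = sum(tp) / n
    # F1 = 2TP / (2TP + FP + FN), which equals 2TP / (row total + column total)
    f1s = [2 * tp[c] / (row_tot[c] + col_tot[c]) if row_tot[c] + col_tot[c] else 0.0
           for c in range(n_classes)]
    mf1 = sum(f1s) / n_classes
    p_e = sum(r * q for r, q in zip(row_tot, col_tot)) / n ** 2  # chance agreement
    kappa = (acc - p_e) / (1 - p_e)
    return acc, mf1, kappa


# Hypothetical 3-class confusion matrix (rows: experts, columns: model).
cm = [[8, 2, 0],
      [1, 6, 3],
      [1, 0, 9]]
acc, mf1, kappa = overall_metrics(cm)
```

For this matrix, 23 of 30 epochs lie on the diagonal and the chance agreement is 1/3, giving a kappa of 0.65.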

Classification Performance
In view of the better overall performance of Siamese AEs, we further evaluated Siamese AEs on the MASS-SS3 dataset. It can be seen from the confusion matrices that our model works well on both datasets, except for stage N1. The precisions of the other stages are all around 85%, while the precision of stage N1 is around 55% on SleepEDF and around 65% on MASS-SS3. This lower performance on stage N1, which is also seen in existing sleep staging models, results from the small number of N1 epochs. Comparing the performance of Siamese CNNs and Siamese AEs on SleepEDF, Siamese AEs perform better on stages N2, N3 and REM, and Siamese CNNs perform better on stages W and N1. In comparison to SleepEDF, the model achieves a better result on the MASS-SS3 dataset.

Comparison with State-of-the-Art Models
We compared our model with other sleep staging methods and discuss the performance of our Siamese CNNs and Siamese AEs. Table 5 shows the results of our method compared with other state-of-the-art methods. Our baseline sleep staging network is DeepSleepNet (DSN); therefore, we compare our results with DSN first. Our Siamese CNNs and Siamese AEs both improved the performance of DSN, with increases in overall accuracy of 3% and 3.3%, respectively. The MF1 and kappa of our method were also higher than those of DSN on SleepEDF. To further verify that our method is effective for sleep staging, we evaluated our model on MASS-SS3 and compared the results with DSN. There was a 1% improvement in the results. Moreover, our method outperforms other SOTA sleep staging methods. Our Siamese AEs achieved the highest overall metrics among the sleep staging methods mentioned above.
Comparing our Siamese CNNs and Siamese AEs, the overall accuracy of Siamese AEs is higher than that of Siamese CNNs on SleepEDF, but their training time is longer. Siamese AEs have an extra component in the loss function; this extra constraint makes the computation more complex but results in higher performance.

Conclusions
A Siamese network architecture is proposed in this study for encoding the similarity between two EEG epochs of the same sleep stage and the differences between two EEG epochs of different sleep stages. These features are extracted with the loss function we present, which is composed of distance metrics and cross entropy. Distance metrics are used to measure the distance between the distributions of two EEG epochs, and cross entropy is used to classify sleep stages. We also implemented Siamese CNNs and Siamese AEs and evaluated our method on SleepEDF and MASS-SS3. In our experiments, we used the DeepSleepNet (DSN) sleep staging model as a baseline. The results showed that the proposed Siamese architecture not only significantly improved the performance of DSN, but also outperformed the SOTA methods in sleep staging. In our future work, we will focus on applying our method to sequence-to-sequence models to further improve the performance.