A Deep Transfer Learning Framework for Sleep Stage Classification with Single-Channel EEG Signals

The polysomnogram (PSG) is the gold standard for evaluating sleep quality and disorders. Attempts to automate this process have been hampered by the complexity of the PSG signals and heterogeneity among subjects and recording hardware. Most of the existing methods for automatic sleep stage scoring rely on hand-engineered features that require prior knowledge of sleep analysis. This paper presents an end-to-end deep transfer learning framework for automatic feature extraction and sleep stage scoring based on a single-channel EEG. The proposed framework was evaluated over the three primary signals recommended by the American Academy of Sleep Medicine (C4-M1, F4-M1, O2-M1) from two data sets that have different properties and are recorded with different hardware. Different Time–Frequency (TF) imaging approaches were evaluated to generate TF representations for the 30 s EEG sleep epochs, eliminating the need for complex EEG signal pre-processing or manual feature extraction. Several training and detection scenarios were investigated using transfer learning of convolutional neural networks (CNN) and combined with recurrent neural networks. Generating TF images from continuous wavelet transform along with a deep transfer architecture composed of a pre-trained GoogLeNet CNN followed by a bidirectional long short-term memory (BiLSTM) network showed the best scoring performance among all tested scenarios. Using 20-fold cross-validation applied on the C4-M1 channel, the proposed framework achieved an average per-class accuracy of 91.2%, sensitivity of 77%, specificity of 94.1%, and precision of 75.9%. Our results demonstrate that without changing the model architecture and the training algorithm, our model can be applied to different single-channel EEGs from different data sets.
Most importantly, the proposed system receives a single EEG epoch as an input at a time and produces a single corresponding output label, making it suitable for real-time monitoring outside sleep labs as well as for helping sleep lab specialists arrive at more accurate diagnoses.


Introduction
Sleep is central to human health and a healthy lifestyle. The health consequences of insufficient sleep, abnormal sleep patterns or de-synchronized circadian rhythms can be emotional, cognitive, or somatic. The disruption of normal sleep patterns has also been linked to obesity and neuro-degenerative diseases, as well as cardiac diseases [1,2]. The central diagnostic tool and the gold standard in the evaluation of sleep quality and disorders is the polysomnogram (PSG) or the overnight sleep study, during which several physiological signals are simultaneously monitored and collected, including electroencephalogram (EEG), electrooculography (EOG), electromyography (EMG), electrocardiography (ECG), blood oxygenation, airflow, and respiratory effort [3].
As defined by the American Academy of Sleep Medicine (AASM) [3], sleep is categorized into four stages: the stage of Rapid Eye Movement (REM) sleep and three stages of non-REM sleep (N1, N2, and N3). The main contributions of this study are as follows:
• We develop a deep transfer learning (DTL) framework for automated scoring of sleep stages using a single EEG channel only. This eliminates the need to establish and train a deep neural network from scratch by using a pre-trained deep neural network architecture that has been previously trained on sufficient labeled data in a different context.
• We investigate the proposed DTL framework with different detection and training scenarios. These include CNN and CNN-RNN architectures, as well as different choices for generating imaging data inputs from the corresponding EEG sleep epochs.
• Finally, we evaluate the DTL model on three different single-channel EEG inputs. We perform a thorough comparison between the performance achieved over each of the signals considered using different detection and training scenarios.
The structure of this research paper is as follows. Section 2 discusses previous studies in automatic sleep stage scoring. The data sets used in this research are described in Section 3. The proposed deep transfer learning (DTL) framework and the study's evaluation measures are covered in Section 4. Results for the suggested framework are discussed in Section 5, and they are further examined and analyzed in Section 6. Finally, Section 7 outlines this paper's conclusions.

Related Work
Many previous studies considered automated detection of sleep stages using single-channel EEG signals. Nevertheless, the majority of these studies rely on feature engineering methods and classical machine learning algorithms to classify the different sleep stages based on hand-crafted EEG features. For example, [16,31,32] used support vector machines (SVM), [17,33] used random forests, and [18,29] used recurrent neural networks. Other studies evaluated more than one classifier [11] or used more complex methods such as Dendrogram SVM (DSVM) [7] and bootstrap aggregating [33-35]. Although these methods have reported reasonable performance, they carry several serious limitations, including the need for sufficient domain expertise and prior knowledge of sleep analysis, as well as significant time and effort to carefully develop expert-defined features. More importantly, hand-crafted features are highly dependent on the characteristics of the available data sets. Consequently, classical machine learning methods have a limited ability to generalize to large patient populations due to the heterogeneity among subjects and recording devices.
This study eliminates the need for complex EEG pre-processing algorithms or the need for human-engineered features, which are required to perform scoring with classical classification methods. The proposed approach advances the state-of-the-art by developing an end-to-end deep learning framework for automatic feature extraction and detection for sleep stages using a single EEG channel. In this regard, few similar studies exist on the use of deep learning instead of conventional algorithms for automated sleep stage scoring. In particular, refs. [2,5,10,27,36,37] presented different end-to-end deep learning approaches for automatic sleep stage scoring using single EEG channel inputs. Although [29] considered a deep recurrent network structure, they used expert-defined features and so cannot be considered an end-to-end approach. There are other deep learning methods that considered multiple PSG signals together with EEG as inputs to their models [26,28,38]. For example, EEG and EOG were considered in [28] and together with EMG in [38]. Additionally, ref. [26] considered six EEG channels simultaneously as inputs to their model. Yet, a major limitation in multi-channel deep learning-based studies is that they can only be implemented in dedicated sleep centers, compared to single EEG algorithms that can also be applied in home or low-resource settings.
The present study provides two aspects of improvement compared to previous end-to-end deep learning approaches for automatic sleep scoring with single EEG channels. First, the proposed system considers a one-to-one classification scheme, in contrast to other studies that used many-to-one [2,10,27,36] or (less frequently) many-to-many [5] classification schemes. The proposed classification system receives a single PSG epoch as an input at a time and produces a single corresponding output label for the sleep stage. This is a much more efficient classification scheme than the many-to-one scheme, which augments the classification of the target epoch by combining it with surrounding epochs, or the many-to-many scheme, which maps an input sequence of multiple epochs to the sequence of their corresponding target labels. Although many-to-one and many-to-many classification models can improve the overall performance by taking into account the temporal dependencies between PSG epochs, these approaches suffer from modeling ambiguity and high computational overhead. More importantly, using these schemes poses a major limitation for the development of online and real-time sleep monitoring applications.
The second aspect of improvement in the proposed study compared to previous similar studies is the extensive use of deep transfer learning in building the proposed framework, eliminating the computational overhead required to set up and adequately train a deep learning scoring system from scratch. Among previous similar deep learning studies [2,5,10,27,36,37], the study of [2] was the only one that used transfer learning of a pre-trained CNN applied to a small single data set and evaluated over a single EEG source.
In this study, a comprehensive end-to-end deep transfer learning framework was developed for automated scoring of sleep stages using a single EEG channel. The proposed framework employs an efficient one-to-one classification scheme and extensively uses transfer learning in several training and detection scenarios that are comprehensively evaluated over three EEG signals from two data sets that have different properties and are recorded with different hardware.

Data Sets
In this study, we leverage two PSG data sets from two major health centers in Germany and the USA. Each of the polysomnography devices used to collect sleep data from the two health centers included electrodes for electroencephalography (EEG), electrooculography (EOG), electrocardiography (ECG), and submental and tibial electromyography (EMG). The EEG electrodes for DS-1 are reusable gold cup electrodes made by GVB-geliMED (Bad Segeberg, Germany); they are either unbranded or GRASS brand. The EEG electrodes for DS-2 are NATUS GRASS brand gold-plated electrodes made by BESDATA (Shenzhen, China). EEG electrodes were placed using the 10-20 system [3]. The recommended primary EEG channels for sleep scoring according to the AASM, namely F4-M1, C4-M1, and O2-M1, were considered in this study. Sleep scoring was carried out by expert clinicians according to the recommendations of the AASM [3]. For each 30 s epoch in the PSG data file of each patient, one of five possible stages is scored by an expert clinician: W, N1, N2, N3, or REM. A complete sleep cycle starts with stage W and ends in stage REM.
The PSG studies are standard sleep studies that were conducted on adult subjects in both sleep centers to determine the presence of sleep disorders and possible treatment options. The two data sets were used in previous research to analyze and detect sleep apneic events using respiratory PSG signals [25,39].
The EEG data in DS-1 were sampled at 128 Hz, whereas the EEG data in DS-2 were sampled at 256 Hz. Thus, the EEG data in DS-1 were re-sampled at 256 Hz so that all data have the same sample rate, which translates to a vector with a length of 7680 samples for each 30 s EEG sleep epoch. Furthermore, the EEG data were filtered using a zero-phase digital bandpass filter in the range of 0.3-35 Hz, which includes the frequency bands of interest for scoring sleep stages. Table 1 summarizes the detailed distribution of the different sleep stages in the data sets. In total, 72,496 sleep epochs were found across the 81 subjects in this study. The data set was divided randomly such that 90% of the sleep epochs were used for training the different deep learning scenarios, while the other 10% were used for evaluating the performance of these models in sleep stage classification.
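These preprocessing steps (resampling to a common 256 Hz rate and segmentation into 30 s epochs of 7680 samples) can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's actual pipeline: linear interpolation stands in for a proper polyphase resampler, and the zero-phase 0.3-35 Hz bandpass filter is omitted.

```python
import numpy as np

FS_TARGET = 256                     # common sampling rate after resampling (Hz)
EPOCH_SEC = 30                      # AASM epoch length (s)
EPOCH_LEN = FS_TARGET * EPOCH_SEC   # 7680 samples per sleep epoch

def resample_to_target(x, fs_orig, fs_target=FS_TARGET):
    """Resample a 1-D EEG signal via linear interpolation (illustrative
    stand-in for a proper polyphase or FFT-based resampler)."""
    n_out = int(round(len(x) * fs_target / fs_orig))
    t_orig = np.arange(len(x)) / fs_orig
    t_out = np.arange(n_out) / fs_target
    return np.interp(t_out, t_orig, x)

def segment_epochs(x, fs=FS_TARGET, epoch_sec=EPOCH_SEC):
    """Split a recording into non-overlapping 30 s epochs, dropping the tail."""
    step = fs * epoch_sec
    n_epochs = len(x) // step
    return x[: n_epochs * step].reshape(n_epochs, step)

# Example: 2 minutes of synthetic DS-1-style data recorded at 128 Hz
raw = np.random.default_rng(0).standard_normal(128 * 120)
resampled = resample_to_target(raw, fs_orig=128)
epochs = segment_epochs(resampled)
print(epochs.shape)  # (4, 7680)
```

Each row of the resulting array is one 7680-sample epoch, ready for time-frequency imaging.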

Time-Frequency (TF) Imaging
We used time-frequency imaging to convert raw EEG data into images. Time-frequency (TF) imaging is a technique that allows examining both the time and frequency content of a signal simultaneously, using various time-frequency representations. This tool is considered in this research for analyzing EEG signals since different time-frequency patterns are specific to different sleep stages. Time-frequency images can be obtained using different methods. In this study, we selected two popular techniques, the Fourier-Based Synchrosqueezing Transform (FSST) and the Continuous Wavelet Transform (CWT), to generate TF images for the sleep EEG epochs.

Fourier-Based Synchrosqueezing Transform (FSST)
The first technique is the FSST, which converts the EEG signal into a time-frequency domain signal. In the time-frequency plane, this approach effectively represents multi-component signals in a condensed manner [40]. It can identify specific time-localized signal components and examine their frequency and variational behavior. This transform, a member of the family of time-frequency reassignment methods (RM), operates on the time-frequency domain of the Short-Time Fourier Transform (STFT) by reassigning the coefficients in scale or frequency. RM techniques sharpen spectrograms by moving components to a neighboring ridge, commonly referred to as the energy distribution's center of gravity. This reassignment produces sparse and sharpened time-frequency representations, which makes it simpler to discern between EEG epochs at different stages of sleep. Other time-frequency representations can also use reassignment, provided that the coefficients are reassigned in both the time and frequency directions [41]. To maintain causality, only the frequency component of the FSST's coefficients is redistributed.
Many physiological signals can be expressed as a superposition of amplitude-modulated and frequency-modulated modes. For time-frequency analysis, the FSST decomposes an EEG sleep epoch y(t) as a multi-component signal consisting of K oscillatory components defined by [40,41]:

y(t) = ∑_{k=1}^{K} A_k(t) e^{2πiφ_k(t)},

where A_k(t) is the instantaneous amplitude and φ′_k(t) (the derivative of the phase) is the instantaneous frequency of component k. For a weak frequency modulation between components, there exists a small value ε such that |φ″_k(t)| ≤ ε. This requires the amplitude to be differentiable and the phase to be twice differentiable. The adjacent components are assumed to be well-separated in frequency by a distance d: for a Gaussian window g of size γ_g, the frequency bandwidth of g is Δ = √(2 log 2)/γ_g, and accordingly the minimum distance between adjacent components is d = 2Δ.
The FSST T_f(ω, t) is based on modifying the coefficients of the Short-Time Fourier Transform (STFT) V_f(η, t) from (η, t) to (ω̂_f(η, t), t) as described by the synchrosqueezing operator:

T_f(ω, t) = (1/g(0)) ∫ V_f(η, t) δ(ω − ω̂_f(η, t)) dη,

where g(0) is the value of the sliding window g(t) at time 0, δ is the Dirac delta function, and ω̂_f(η, t) is the instantaneous frequency estimate defined by:

ω̂_f(η, t) = (1/2π) ℑ(∂_t V_f(η, t) / V_f(η, t)).

The instantaneous frequency can be approximated by ω̂_f(η, t) when |V_f(η, t)| > 0. Moreover, from the FSST, we can obtain the complex-valued bivariate image T_f(ω, t) for each EEG sleep epoch.

Continuous Wavelet Transform (CWT)
CWT is a useful method for representing time series at various resolutions. Using the CWT, a time series can be transformed mathematically into a different feature space in order to be employed for feature extraction in the time-frequency domain [41]. The wavelet transform is produced by performing a mathematical inner product operation between the signal and a collection of wavelets. This collection is a wavelet family created by scaling and translating the mother wavelet ψ(t), which can be represented as:

ψ_{s,τ}(t) = (1/√s) ψ((t − τ)/s),

where τ is a translation parameter and s is a scale parameter inversely related to frequency. The CWT of an EEG sleep epoch y(t) can be obtained by a convolution operation with a complex conjugate, mathematically defined as follows:

W_y(s, τ) = (1/√s) ∫ y(t) ψ*((t − τ)/s) dt,

where ψ*(·) denotes the complex conjugate of ψ(·). This operation decomposes the EEG sleep epoch y(t) into a series of wavelet coefficients with the wavelet family as the basis functions. As the above equations show, the family wavelets have two parameters: s and τ. Through the convolution operation, the sleep epoch segment y(t) is transformed by the family wavelets and projected onto the two-dimensional (2D) time and scale plane [41]. Accordingly, successive one-dimensional EEG sleep segments are converted into TF images.
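As an illustration of the scalogram computation, the sketch below hand-rolls a complex Morlet CWT in NumPy and applies it to a synthetic 10 Hz tone. The wavelet choice (Morlet with center frequency w0 = 6), the scale-to-frequency mapping, and the test signal are all assumptions for demonstration; they are not the paper's actual implementation, which would typically use a library CWT.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (correction term neglected)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2)

def cwt_scalogram(y, fs, freqs, w0=6.0):
    """|CWT| of y at the requested pseudo-frequencies, using the standard
    Morlet relation s = w0 * fs / (2 * pi * f) between scale and frequency."""
    n = len(y)
    out = np.empty((len(freqs), n))
    for i, f in enumerate(freqs):
        s = w0 * fs / (2 * np.pi * f)       # scale for pseudo-frequency f
        m = int(min(10 * s, n))             # truncated wavelet support
        t = (np.arange(m) - m // 2) / s
        psi = np.conj(morlet(t, w0)) / np.sqrt(s)
        out[i] = np.abs(np.convolve(y, psi, mode="same"))
    return out

fs = 256
t = np.arange(fs * 2) / fs                  # 2 s synthetic signal
y = np.sin(2 * np.pi * 10 * t)              # 10 Hz tone (alpha band)
freqs = np.arange(1.0, 33.0)                # 1-32 Hz, roughly the sleep bands
scalo = cwt_scalogram(y, fs, freqs)
peak = freqs[np.argmax(scalo.mean(axis=1))]
print(peak)  # energy concentrates near 10 Hz
```

The resulting matrix, rendered as an image, is the scalogram that serves as the CNN input.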

Convolutional Neural Networks (CNNs)
A CNN is a feedforward neural network with a deep structure and is one of the representative algorithms of deep learning. CNNs are widely used for image tasks, and feeding raw 2D images into CNN architectures is standard practice in deep learning applications for computer vision and signal processing [42-46]. Thus, in this study we used TF images generated from raw EEG sleep epochs as inputs to the CNN.
Deep CNNs have the ability to automatically and efficiently learn hierarchical features from input images, such that the higher-level layers' features are more abstract than the lower layers'. Convolution layers (filtering), pooling layers (subsampling) with a type of nonlinearity applied before or after pooling, and lastly fully-connected layer(s) are the types of layers that typically make up a CNN. Convolution blocks, which are created by combining convolution and pooling layers, are often stacked to create a deep architecture. In classification tasks, a softmax (multinomial logistic regression) layer is commonly added to CNNs with a size that is equal to the number of target classes. CNNs are trained using iterative optimization with the backpropagation algorithm. The most common optimization method in the literature is stochastic gradient descent (SGD). More details about the definition and computational processes in CNNs are introduced in [47].
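To make the layer types above concrete, the following minimal NumPy sketch chains the three building blocks named in the text: convolution (filtering), max pooling (subsampling) with a ReLU nonlinearity, and a fully connected softmax output sized for five classes. All shapes and filter values are illustrative only and unrelated to GoogLeNet.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution, single channel, stride 1 (the filtering layer)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(x, k=2):
    """Non-overlapping k x k max pooling (the subsampling layer)."""
    h, w = x.shape
    return x[: h - h % k, : w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img = rng.standard_normal((28, 28))                               # toy input image
feat = np.maximum(conv2d(img, rng.standard_normal((3, 3))), 0.0)  # conv + ReLU
pooled = maxpool2d(feat)                                          # 26x26 -> 13x13
W = rng.standard_normal((5, pooled.size))                         # fully connected layer
probs = softmax(W @ pooled.ravel())                               # five sleep-stage classes
print(pooled.shape, round(float(probs.sum()), 6))                 # (13, 13) 1.0
```

Stacking many such convolution blocks, with learned rather than random weights, yields the deep hierarchical feature extractor described above.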

Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are a distinct class of neural networks that excel at handling time-series data and are well-suited for sequential information, as opposed to ordinary feedforward networks that take each input value of a signal into account independently [24].
However, due to the vanishing and exploding gradient problem [48], traditional RNNs are unable to capture long-range dependencies. This issue is addressed by the Long Short-Term Memory (LSTM) network, an extended RNN equipped with a gating mechanism that regulates the flow of information and can extract deeper contextual information from time series. Because we conducted a retrospective analysis of the PSG recordings in this study, we were able to apply a Bidirectional LSTM (BiLSTM) variant. Each BiLSTM layer consists of a causal LSTM layer and an anticausal counterpart. Figure 1 depicts a single causal LSTM unit that processes the time series forward in time. This unit can be formally described as follows:

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f), (6)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i), (7)
C_t = f_t ∘ C_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c), (8)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o), (9)
h_t = o_t ∘ σ_h(C_t), (10)

where Equations (6)-(10) represent the forget gate, input gate, cell state update, output gate, and output function of an LSTM unit with input vector x_t, respectively, while C_{t−1} is the state and h_{t−1} is the output of the preceding LSTM unit. For each gate * ∈ {f, i, c, o}, b_* is the bias term, W_* is the input weight matrix, and U_* is the recurrent weight matrix; these are the training parameters that are updated during the network learning process. The operator ∘ in Equations (6)-(10) is the Hadamard product. σ_c and σ_h are hyperbolic tangent activation functions, while σ_g is the sigmoid activation function. To process the time series backwards in time, the BiLSTM has an anticausal (reverse) LSTM in addition to the forward LSTM. The anticausal LSTM is very similar to the causal LSTM but operates in reverse time order. Thus, Equations (6)-(10) can be used after replacing W_*, U_*, and b_* with W′_*, U′_*, and b′_*, respectively, as well as replacing h_{t−1} and C_{t−1} with h′_{t+1} and C′_{t+1}, respectively.
In order to capture bidirectional long-term relationships between time steps of the time series, the outputs of the forward and reverse LSTMs are concatenated in the final step.
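The gating equations above can be written out directly. The sketch below implements one causal LSTM step in NumPy with randomly initialized parameters; the input and hidden dimensions are illustrative assumptions, and a comment notes where a BiLSTM would add the reverse pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One causal LSTM step following Equations (6)-(10):
    forget gate, input gate, cell state update, output gate, output."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # (6) forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # (7) input gate
    c = f * c_prev + i * np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # (8) cell state
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # (9) output gate
    h = o * np.tanh(c)                                        # (10) output (Hadamard products)
    return h, c

rng = np.random.default_rng(1)
nx, nh = 4, 8                                   # toy input / hidden sizes
p = {}
for g in "fico":
    p[f"W{g}"] = rng.standard_normal((nh, nx))  # input weight matrices
    p[f"U{g}"] = rng.standard_normal((nh, nh))  # recurrent weight matrices
    p[f"b{g}"] = np.zeros(nh)                   # bias terms

# Forward pass over a toy sequence; a BiLSTM runs a second LSTM with its own
# parameters backwards in time and concatenates both hidden states per step.
h = c = np.zeros(nh)
for x_t in rng.standard_normal((5, nx)):
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # (8,)
```

Because the output gate is sigmoid-bounded and the cell state passes through tanh, each hidden activation stays strictly inside (−1, 1).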

Transfer Learning and Fine Tuning Strategy
In learning theory, Transfer Learning (TL) is an approach that applies the knowledge learned from one task to a new but relevant task in order to enhance learning performance on the new task. The domain of the original task is called the source domain (D_s), while the domain of the new task is called the target domain (D_t) [49]. By initializing the target model with parameters transferred from a pre-trained model, TL aids the training of the target model. Furthermore, TL is a very effective method for quickly building and evaluating deep learning models when training a deep architecture from scratch is impractical. Complex deep neural networks frequently have a large number of weights, which are randomly initialized prior to training and then iteratively updated based on labeled data and a loss function. This iterative procedure is extremely time-consuming when performed over all labeled data from scratch. Additionally, in cases with limited training data, deep architectures are prone to overfitting. Instead of creating and training a CNN from scratch, TL offers a simple solution that reuses a pre-trained deep CNN that was already trained on another data set.
Several studies have shown the effectiveness of transfer learning with pre-trained models in medical imaging applications [50] as well as in fault detection applications for mechanical systems [51,52]. In this study, we leveraged GoogLeNet as a pre-trained CNN to investigate knowledge transfer from natural images to time-frequency (TF) images of sleep epochs in EEG data.

A Framework for Automatic Sleep Staging Using Deep Transfer Learning
The proposed framework is based on deep transfer learning, where time-frequency images of single-channel EEG sleep epochs are used as the input. TL based on a pre-trained model helps improve the performance of the deep model. This study proposes an end-to-end deep learning framework that automatically learns features and recognizes sleep stages from single-channel EEG signals. Our framework is composed of three stages: time-frequency imaging, pre-trained model building with fine-tuning, and finally model application.
Three EEG signal channels acquired by PSG were evaluated in this study, namely C4-M1, F4-M1, and O2-M1, which are the primary signals recommended by the AASM for sleep scoring. Data from each of these signals were segmented into 30 s sleep epochs. The EEG epochs were then transformed from the time domain to the time-frequency domain, forming a set of time-frequency images that were utilized as the input to the pre-trained model. In this study, two methods were evaluated for time-frequency imaging: FSST and CWT.
The pre-trained CNN model used in this paper is GoogLeNet, a deep convolutional network created by Google. It is 22 layers deep and was originally trained on the ImageNet data set [53] to classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. This network originally achieved accurate classification performance on the ImageNet data set and was chosen for this study since it provides a good tradeoff between classification accuracy and computational complexity [54]. While the pre-trained model was trained on the ImageNet data set, the target data set was the time-frequency images of EEG sleep epochs. As shown in Figures 2 and 3, we considered two deep transfer architectures for automatic scoring of sleep stages. The first one is a CNN-based transfer learning approach. In this case, the feature-extraction layers of the pre-trained model were transferred and a dropout layer was added before the fully connected layer. Finally, the final layer of the pre-trained CNN model was replaced with a softmax output layer whose size is dictated by the number of target sleep stages (five classes). Weights of the new output layer were initialized randomly. During the training process, weights of the trainable layers were updated to minimize errors between predicted labels and the true ones. After enough epochs, the designed model was fine-tuned and the deep CNN-based architecture together with all of the parameters were saved. The second architecture, illustrated in Figure 3, is a CNN-RNN-based transfer learning approach. First, the feature-extraction layers of the pre-trained CNN model were transferred. Then, a recurrent neural network was added to form a CNN-RNN deep architecture. The recurrent network part of the CNN-RNN is composed of two BiLSTM layers, each followed by a dropout layer to avoid overfitting.
Finally, a softmax output layer with five nodes was added to enable classification between the five possible sleep stages. In order to convert successive time-frequency images to sequences of feature vectors, restore the sequence structure, and reshape the output to vector sequences, we used a sequence unfolding layer and a flatten layer prior to the RNN part of the network. During the training process, weights of the trainable layers were updated to minimize errors between predicted scores and the corresponding clinical annotations. After enough epochs, the designed model was fine-tuned and the deep CNN-RNN-based architecture together with all corresponding parameters were saved.

This process was performed on the EEG training segments (90% of the data) to build and fine-tune each of the proposed deep transfer approaches, evaluating each of the time-frequency imaging approaches on each of the three EEG signal channels investigated in this study. The testing data set (the 10% of held-out EEG data) was then used to validate the ability of the proposed framework to perform automated scoring of sleep stages and to compare performance across different signal inputs, time-frequency imaging methods, and modeling scenarios. To avoid excessive computational loads and processing times, we did not initially over-sample the training data (despite the class imbalance between sleep stages) so that all of the different options could be evaluated in a reasonable time. Subsequently, we applied over-sampling to train the detection scenario that showed the best results. Finally, we used 20-fold cross-validation over the input signal that showed the best scoring performance among the three input signals considered in this study in order to report a comprehensive evaluation of the proposed framework.
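The 20-fold cross-validation partitioning can be sketched as follows. This is a simplified epoch-level split for illustration only; subject-wise grouping and any stratification used in the actual study are not shown.

```python
import numpy as np

def kfold_indices(n, k=20, seed=0):
    """Shuffle epoch indices and yield (train, test) arrays for each of k folds."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

n_epochs = 72496                            # total sleep epochs in this study
splits = list(kfold_indices(n_epochs, k=20))
train, test = splits[0]
print(len(splits), len(train) + len(test))  # 20 72496
```

Each fold holds out roughly 5% of the epochs for testing, and every epoch is tested exactly once across the 20 folds.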

Evaluation Metrics
Recognizing the classical high imbalance problem in sleep scoring data, the proposed framework was evaluated for both per-class performance and for the overall performance to ensure a comprehensive evaluation of the proposed methods.
Per-class metrics computed in this study include sensitivity, precision, F1-score, specificity, and accuracy. The per-class metrics were computed by considering a single class as the positive class and all other classes combined as the negative class. Thus, for each of the five classes of interest, a one-versus-all classification problem was considered to compute true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Accordingly, the per-class metrics can be computed from the following equations:

Sn_c = TP_c / (TP_c + FN_c),
Pr_c = TP_c / (TP_c + FP_c),
F1_c = 2 · Pr_c · Sn_c / (Pr_c + Sn_c),
Sp_c = TN_c / (TN_c + FP_c),
ACC_c = (TP_c + TN_c) / (TP_c + TN_c + FP_c + FN_c),

where Sn_c, Pr_c, F1_c, Sp_c, and ACC_c are the per-class sensitivity, precision, F1-score, specificity, and accuracy of class c ∈ {1, 2, ..., C}, respectively, and C = 5 is the number of sleep stages. For the overall classification metrics, we considered overall accuracy (ACC), macro-average F1 (MF1), overall sensitivity (Sn), and overall specificity (Sp). These metrics can be mathematically expressed as follows:

ACC = (∑_{c=1}^{C} TP_c) / N,
MF1 = (1/C) ∑_{c=1}^{C} F1_c,
Sn = (1/C) ∑_{c=1}^{C} Sn_c,
Sp = (1/C) ∑_{c=1}^{C} Sp_c,

where TP_c is the number of true positives of class c ∈ {1, 2, ..., C} and N is the total number of test epochs. To compare results across different detection scenarios and different EEG signal channels, per-class sensitivity (Sn_c) is reported, as well as the per-class F1-score (F1_c), which provides a comprehensive snapshot of the per-class performance by considering the sensitivity/precision tradeoff. Additionally, the overall performance across different detection scenarios and different EEG signal channels was compared using the accuracy (ACC), macro-average F1 (MF1), overall sensitivity (Sn), and overall specificity (Sp). Finally, the detailed performance of the best-performing scenarios was thoroughly analyzed using all per-class metrics: sensitivity (Sn_c), specificity (Sp_c), precision (Pr_c), F1-score (F1_c), and per-class accuracy (ACC_c).
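All of these metrics can be computed directly from a confusion matrix. The sketch below uses a made-up 5-class confusion matrix purely for illustration; the values bear no relation to the paper's reported results.

```python
import numpy as np

def per_class_metrics(conf):
    """One-versus-all metrics from a C x C confusion matrix
    (rows = true class, columns = predicted class)."""
    n = conf.sum()
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as c but belonging elsewhere
    fn = conf.sum(axis=1) - tp          # belonging to c but predicted otherwise
    tn = n - tp - fp - fn
    sn = tp / (tp + fn)                 # per-class sensitivity Sn_c
    pr = tp / (tp + fp)                 # per-class precision Pr_c
    f1 = 2 * pr * sn / (pr + sn)        # per-class F1-score F1_c
    sp = tn / (tn + fp)                 # per-class specificity Sp_c
    acc = (tp + tn) / n                 # per-class accuracy ACC_c
    return sn, pr, f1, sp, acc

# Made-up 5-class confusion matrix (order: W, N1, N2, N3, REM)
conf = np.array([
    [50,  2,  3,  0,  1],
    [ 4, 20,  5,  0,  3],
    [ 2,  3, 80,  4,  1],
    [ 0,  0,  5, 30,  0],
    [ 1,  2,  2,  0, 40],
])
sn, pr, f1, sp, acc = per_class_metrics(conf)
overall_acc = np.diag(conf).sum() / conf.sum()  # overall accuracy ACC
mf1 = f1.mean()                                 # macro-average F1 (MF1)
print(round(float(overall_acc), 3))  # 0.853
```

Note how N1, the minority class in this toy example, drags down the macro-average F1 even when overall accuracy looks strong, which is exactly why both kinds of metrics are reported.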

TF Imaging Data
First, the EEG signals were converted into two types of TF image representations: the FSST power spectrum and the CWT scalogram. Each of these representations was obtained for consecutive 30 s EEG epochs. The TF images were then re-scaled to a size of 224 × 224 × 3 as required by the GoogLeNet CNN model. Finally, the processed images were divided into two parts: the training data set and the testing data set. The training data set was used to train and fine-tune the network weights of the pre-trained model, whereas the testing data set was only used to verify the performance of the deep model and was not used during the training process. The processed TF images obtained with the CWT and FSST methods for each of the sleep stages are shown in Figure 4a,b, respectively. Table 2 compares the overall performance of the different CNN-based transfer learning networks in automatic scoring of sleep stages. The CNN-based transfer learning architecture was evaluated on the three EEG channels considered in this study (C4-M1, F4-M1, and O2-M1) using the two TF representations (CWT and FSST). Each model was evaluated for overall accuracy (ACC), macro F1-score (MF1), overall sensitivity (Sn), and specificity (Sp), as well as per-class sensitivity (Sn_c) and per-class F1-score (F1_c).

Performance of Deep CNN Transfer Learning Networks
The results in Table 2 show that the CNN test performance obtained with the CWT-TF image representation is significantly higher for the C4-M1 and F4-M1 signals than with the FSST-TF method applied to these signals. The CNN transfer learning model built using EEG data from the O2-M1 channel showed, with both TF representation methods, a lower ability to correctly identify sleep stages compared to the performance obtained with C4-M1 and F4-M1.
The highest classification results for the CNN transfer learning approach were achieved using CWT-TF image representations obtained from the C4-M1 EEG channel. The detailed per-class test performance for this best-performing scenario is shown in Table 3. Results are shown for three trials with their mean and standard deviation. The CNN transfer learning model achieved high per-class specificity and accuracy in all classes, a much higher sensitivity in the W and N2 classes compared to the other classes, and generally high precision, except for class N1, which also had the lowest per-class sensitivity. The bold rows in the tables show the best-performing scenarios, which are further discussed in the text. Table 4 summarizes the performance of the CNN-RNN-based transfer learning approach in automatic scoring of sleep stages, comparing the three EEG sources along with the two TF representation methods implemented in this study. Inspecting this table shows that the CNN-RNN transfer learning models with the CWT-TF image representation achieved higher overall performance in sleep scoring than those obtained with the FSST-TF image representation. Similar to the CNN transfer learning modeling framework, the C4-M1 and F4-M1 EEG channels showed an overall improved performance in detecting sleep stages with the CNN-RNN transfer learning approach compared to the detections obtained from O2-M1.

Performance of Deep CNN-RNN Transfer Learning Networks
Furthermore, Tables 2 and 4 can be used to compare the test results obtained with the CNN and CNN-RNN transfer learning approaches. It can be clearly seen that the CNN-RNN transfer learning scenario provided an overall improved detection performance compared to the CNN transfer learning scenario, as illustrated by all of the listed performance metrics. Combining the CNN and RNN networks provides the ability to extract features present in the EEG spectrograms while preserving the temporal relationships present in the EEG data.
The highest classification results for the CNN-RNN transfer learning approach were achieved with CWT-TF image representations obtained from the C4-M1 EEG channel. The detailed per-class test performance for this best-performing scenario is shown in Table 5. To ensure the robustness of the proposed approach, the results are shown for three trials with their mean and standard deviation. The CNN-RNN transfer learning model achieved significantly higher per-class sensitivity and precision than the CNN transfer learning model for data from the same EEG channel and the same time-frequency (TF) imaging approach. Despite this improvement in per-class sensitivity, the lowest per-class sensitivity was still obtained with class N1.
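The per-class accuracy, sensitivity, specificity, and precision reported in the tables are standard one-vs-rest quantities derived from the multi-class confusion matrix. As a reference for how such figures are computed (an illustrative sketch, not code from the study), a minimal NumPy implementation:

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest metrics from a confusion matrix.

    cm[i, j] counts epochs of true stage i predicted as stage j
    (e.g., rows/columns ordered W, N1, N2, N3, REM).
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)                 # correctly scored epochs per stage
    fn = cm.sum(axis=1) - tp         # missed epochs of that stage
    fp = cm.sum(axis=0) - tp         # other stages mislabeled as it
    tn = total - tp - fn - fp        # everything else
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
        "accuracy":    (tp + tn) / total,
    }
```

Averaging each vector over the five stages yields the per-class averages quoted in the abstract.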

Performance of Deep CNN-RNN Transfer Learning Networks with Oversampled Training Data
Our results indicate improved performance in automated sleep scoring with the CNN-RNN transfer learning approach over the CNN transfer learning approach. Moreover, using time-frequency images as inputs to the deep transfer learning framework yielded better performance with the continuous wavelet transform than with the Fourier-based synchrosqueezing transform. Accordingly, the best detection approach is the CNN-RNN transfer learning modeling framework using the continuous wavelet transform for TF-image representation (CWT-TF).
Next, we considered oversampling to overcome the imbalanced distribution of the five sleep stages in our data set. We retrained the best detection scenario of the proposed CNN-RNN approach using oversampled CWT-TF training images while still evaluating the model on the same (originally sampled) testing data, in order to isolate the effect of training with oversampled images relative to the performance obtained with the original training samples in Table 4. This was repeated over the three EEG channels considered in this study, and the results are summarized in Table 6. As expected, training the CNN-RNN modeling framework with oversampled CWT-TF images improved the overall performance over all of the listed EEG channels and provided more consistent classification performance across the five sleep stages. In particular, oversampling significantly increased the performance for the N1, N3, and REM stages. Similar to what was observed in Table 4, Table 6 still indicates improved scoring performance for the C4-M1 and F4-M1 channel inputs compared to O2-M1. Likewise, the best-performing scenario in Table 6 is the CNN-RNN transfer learning model trained with oversampled CWT-TF training images from the C4-M1 EEG channel.
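The oversampling step can be sketched as random replication of minority-stage epochs until every stage matches the majority count; the paper does not specify its exact resampling routine, so the function below is a plain random-oversampling stand-in applied to the training split only:

```python
import numpy as np

def oversample(images, labels, seed=0):
    """Randomly replicate minority-class samples until all classes
    reach the majority-class count. Apply to TRAINING data only;
    the test set keeps its original class distribution."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(labels))]          # all original samples
    for c, n in zip(classes, counts):
        if n < target:                        # duplicate minority epochs
            idx = np.flatnonzero(labels == c)
            keep.append(rng.choice(idx, size=target - n, replace=True))
    idx = np.concatenate(keep)
    return images[idx], labels[idx]
```

Because duplicates are drawn with replacement, rare stages such as N1 appear as often as W or N2 during training, without touching the held-out evaluation data.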
Finally, to assess the generalizability of the proposed framework over unseen testing data, the best-performing scenario was comprehensively evaluated using 20-fold cross-validation. To avoid excessive computational load, this approach was applied only to the best-performing scenario in Table 6. EEG data epochs from C4-M1 were randomly divided into 20 folds, where 19 of the 20 folds were used for building the model, which was then evaluated on the remaining fold. The process was repeated twenty times; each time, the model was built with oversampled data from the training folds and evaluated on the original samples of the held-out fold. Table 7 reports the 20-fold cross-validation per-class performance of the proposed CNN-RNN modeling framework applied to oversampled CWT-TF training images obtained from C4-M1. The detailed per-class results in Table 7 show excellent average per-class detection performance over the 20 test folds along with a small standard deviation across these folds, indicating excellent potential for the proposed model to generalize over unseen EEG data.
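The fold construction described above amounts to a single random shuffle of the epoch indices followed by twenty disjoint train/test splits; a minimal sketch (index generation only, oversampling then applied per training split):

```python
import numpy as np

def kfold_indices(n_epochs, k=20, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.
    Each epoch appears in exactly one test fold."""
    idx = np.random.default_rng(seed).permutation(n_epochs)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Within each iteration, only the `train` indices would be oversampled before model fitting, while `test` keeps the original class distribution, matching the evaluation protocol used for Table 7.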

Discussion
This study presents a comprehensive Deep Transfer Learning (DTL) framework for automated scoring of sleep stages using a single EEG channel. Two deep learning architectures were investigated: the first uses a pre-trained CNN only, whereas the second uses a CNN-RNN architecture in which a BiLSTM network follows the pre-trained CNN. Experiments showed improved performance of the CNN-RNN detection scenario compared to the CNN scenario. This improvement arises because the BiLSTM network considers temporal dependencies and extracts temporal features in the EEG data. The cyclic occurrence of sleep stages allows the BiLSTM network to enhance the overall performance in detecting sleep stages.
To generate the image inputs required by the proposed DTL framework, two popular time-frequency (TF) imaging approaches were applied to the EEG sleep epochs: the Fourier-based Synchrosqueezing Transform (FSST) and the Continuous Wavelet Transform (CWT) were employed to generate TF representations of successive EEG epochs. Our results show improved performance in detecting sleep stages with the TF representations obtained by the CWT compared to those obtained by the FSST. This improvement was consistently observed regardless of the EEG channel source and across all of the deep learning architectures considered, demonstrating that the CWT provided better localization and a more concentrated representation of the different frequency components present in the EEG sleep epochs.
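The CWT step turns each one-dimensional 30 s EEG epoch into a 2-D scalogram image (scales × time) that the pre-trained CNN consumes. The study uses the analytic Morse wavelet; the sketch below substitutes a simple Morlet wavelet implemented with plain NumPy (an assumption, purely to illustrate the signal-to-image step, not the authors' pipeline):

```python
import numpy as np

def morlet(n, scale, w0=6.0):
    """Complex Morlet wavelet sampled at n points for a given scale."""
    t = np.arange(-(n // 2), n // 2) / scale
    return np.exp(1j * w0 * t) * np.exp(-t**2 / 2) / np.sqrt(scale)

def cwt_tf_image(signal, scales):
    """Magnitude scalogram: one row of wavelet-filter output per scale."""
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        wav = morlet(min(10 * int(s), len(signal)), s)
        out[i] = np.abs(np.convolve(signal, wav, mode="same"))
    return out
```

In practice the resulting magnitude map is rescaled and resized to the input resolution the pre-trained CNN expects (224 × 224 RGB for GoogLeNet) before being passed through the frozen feature-extraction layers.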
The proposed framework was evaluated on the three primary EEG channels recommended by the AASM for scoring sleep epochs in PSG studies (C4-M1, F4-M1, and O2-M1). Our results show that, without changing the model architecture or the training algorithm, the proposed modeling framework can be applied to different EEG signals. Nevertheless, the detection performance achieved with C4-M1 and F4-M1 was significantly better than that achieved with O2-M1. Importantly, the study demonstrated that the proposed framework works with EEG data sets from two different data centers with different recording properties and different EEG sampling rates. These results demonstrate that the presented framework generalizes well over different hardware settings and different single-channel EEGs.
Furthermore, the proposed framework leverages Deep Transfer Learning (DTL) as an efficient tool for rapid development and evaluation, eliminating the computational time, effort, and complexity required to set up and sufficiently train and fine-tune a deep learning scoring system from scratch. In particular, this study used the existing feature extraction layers of the pre-trained GoogLeNet CNN model for automatic feature extraction from the time-frequency representations of the EEG epochs, while the temporal features were automatically extracted through the RNN part (the BiLSTM network). Future work will focus on modifying or adding feature extraction layers in the pre-trained CNN model in order to evaluate the effect on the overall performance in discriminating between TF images that belong to different sleep stages. Table 8 compares the performance of the proposed DTL framework with state-of-the-art deep learning methods. The proposed DTL system, composed of a pre-trained CNN followed by an RNN, achieved performance comparable to the studies in Table 8. However, it should be pointed out that these studies vary with respect to the method used for handling the classic class-imbalance problem in sleep scoring data sets. For example, subsampling approaches were used to generate randomly class-balanced data [10,36] and to trim data from the most represented classes [37], which might affect the performance in real-case scenarios.
In our analysis, we used the complete data set with the actual imbalanced class distributions. The DTL system composed of a pre-trained CNN followed by an RNN was trained using both original EEG samples and oversampled data from the same EEG channel, and both scenarios were evaluated on held-out (un-augmented) EEG data. As displayed in Table 8, using oversampled EEG data to train the DTL CNN-RNN framework significantly improved the detection performance for the less-represented stages N1, N3, and REM in our data set, although it also had a smaller negative effect on the detection performance of the majority classes W and N2. In practice, stage N1 is known as the most challenging stage to detect since it is a transition stage between wakefulness and sleep. This stage typically accounts for only 2–5% of the total duration of a standard sleep cycle and can be misinterpreted as Wake, N2, or REM (even by an expert) [37,55], as further evidenced by the low detection performance achieved for this stage in the previous studies listed in Table 8. Interestingly, the proposed approach shows a significant improvement in the ability to detect stage N1. Future research is necessary to develop novel methods that address class imbalance in order to further improve the overall performance of the proposed framework.
Even though our results are encouraging, there are some limitations to our study. We considered only the analytic Morse wavelet as the mother wavelet when generating the CWT-TF representations, which maintained consistency across all of the CWT experiments carried out in this study. We also considered only one pre-trained CNN for transfer learning in our experiments. Future work may perform a comprehensive analysis of the proposed CNN-BiLSTM model with different wavelet families and different pre-trained CNN networks to improve performance. We also plan to study the effect of designing an ensemble of different pre-trained CNNs to improve the overall performance in sleep scoring.

Conclusions
We propose an end-to-end deep transfer learning framework for automated scoring of sleep stages based on single-channel EEG signals without the need for any human-engineered features. Two time-frequency imaging approaches were investigated to obtain the time-frequency representations of the EEG sleep epochs. Additionally, several detection and training scenarios were comprehensively evaluated, including CNN versus CNN-RNN architectures. The best detection results were obtained using a deep architecture composed of the GoogLeNet CNN followed by a BiLSTM network that operates on time-frequency images generated with the continuous wavelet transform applied to the EEG sleep epochs. Furthermore, the proposed system is set up and trained using transfer learning, eliminating the computational overhead and experience required to set up and sufficiently train a deep learning scoring system from scratch.
Our results demonstrate that the proposed modeling framework is able to automatically learn features and score sleep stages in three different single-channel EEGs obtained from two completely different data sets. The proposed CNN-BiLSTM system achieved promising performance while using a one-to-one classification scheme, making it suitable for online and real time monitoring applications.
Future efforts will focus on improving the proposed system in order to be applied to single-channel EEG sources recorded by wearable devices.