Automated Detection of Sleep Stages Using Deep Learning Techniques: A Systematic Review of the Last Decade (2010–2020)

Sleep is vital for one’s general well-being, but it is often neglected, which has led to an increase in sleep disorders worldwide. Indicators of sleep disorders, such as sleep interruptions, extreme daytime drowsiness, or snoring, can be detected with sleep analysis. However, sleep analysis relies on visual inspection of recordings by experts and is susceptible to inter- and intra-observer variabilities. One way to overcome these limitations is to support experts with a programmed diagnostic tool (PDT) based on artificial intelligence for timely detection of sleep disturbances. Artificial intelligence technology, such as deep learning (DL), ensures that data are fully utilized with little to no information loss during training. This paper provides a comprehensive review of 36 studies, published between March 2013 and August 2020, which employed DL models to analyze overnight polysomnogram (PSG) recordings for the classification of sleep stages. Our analysis shows that more than half of the studies employed convolutional neural networks (CNNs) on electroencephalography (EEG) recordings for sleep stage classification and achieved high performance. Our study also underscores that CNN models, particularly one-dimensional CNN models, are advantageous in yielding higher classification accuracies. More importantly, we noticed that EEG alone is not sufficient to achieve robust classification results. Future automated detection systems should consider other PSG recordings, such as electrooculogram (EOG) and electromyogram (EMG) signals, along with input from human experts, to achieve the required sleep stage classification robustness. Hence, for DL methods to be fully realized as a practical PDT for sleep stage scoring in clinical applications, the inclusion of other PSG recordings, besides EEG recordings, is necessary.
In this respect, our report includes methods published in the last decade, underscoring the use of DL models with other PSG recordings, for scoring of sleep stages.


Introduction
Sleep is crucial for the maintenance and regulation of various biological functions at a molecular level [1], which helps humans restore physical and mental wellbeing and maintain proper brain function during the day [2]. There are two primary types of sleep: non-rapid eye movement (NREM) and rapid eye movement (REM) sleep. NREM sleep comprises four stages, after which it continues into the REM sleep stage. NREM and REM sleep stages are connected and cyclically alternate throughout the sleep process, wherein unbalanced cycling or the absence of sleep stages gives rise to sleep disorders [3]. Unfortunately, sleep disorders, which lead to poor sleep quality, are often neglected [4]. Stranges et al. [4] highlighted that sleep-related problems are a looming global health issue. In their study, datasets from the World Health Organization (WHO) and the International Network for the Demographic Evaluation of Populations and Their Health (INDEPTH) were used to investigate the prevalence of sleep problems in low-income countries. It was reported that 16.6% of the adult population, which amounts to approximately 150 million people, have sleep problems, and current trends indicate that this figure will increase to 260 million by 2030.
To date, it is mandatory that sleep stage scoring be done manually by human experts [5,6]. However, human experts have a limited capacity to handle slow changes in background electroencephalography (EEG) and to learn the different rules for scoring sleep stages across various polysomnogram (PSG) recordings [6]. Furthermore, evaluations by human experts are prone to inter- and intra-observer variabilities that can negatively affect the quality of sleep stage scoring [7]. Other important factors affecting sleep stage scoring are patient convenience and the cost of diagnosis. A sleep lab is a highly controlled environment that requires dedicated facilities and highly trained personnel. Hence, sleep labs tend to be located in urban centers, and patients must travel there to spend one or multiple nights in the facility. These factors make sleep labs inconvenient for patients, and the cost per diagnosis is high. Other diagnostic methods, such as portable monitoring devices for sleep stages, exhibit some advantages, such as enhanced patient access, low cost, and user-friendliness. However, these advantages are outweighed by several disadvantages, such as diagnostic limitations, device failure, reliability concerns, and underestimation of the apnea/hypopnea index, amongst others [8]. Improving this situation requires a fundamental change in the sleep stage scoring process: we need machines to take over the labor carried out by human experts. This can only be done with systems that understand sleep stages in much the same way as human experts do. Deep learning (DL) is hailed as a method to mechanize knowledge work, such as sleep stage scoring. However, before we adopt this technology, it is prudent to investigate both the capabilities and the limitations of current DL methods.
This paper aims to capture both the capabilities and the limitations of current DL methods in sleep stage classification. It is intended to provide cohesive, in-depth information for experts to consolidate and extend their knowledge in the field. This knowledge might also be of interest to policy makers and healthcare administrators, because DL technologies are going to shape future sleep stage scoring systems. This review summarizes the various DL models employed in the last 10 years and their performance as sleep stage classification systems. This information is valuable for those who plan to use established techniques to address a related problem. This review will also help establish the distinctiveness of a study, because any claim of novelty requires an overview of established methods. In this paper, we focused the review process on DL techniques, because during our practice in the field we found that, among the various artificial intelligence technologies, DL is the most suitable to be developed into a decision support tool for sleep stage scoring. In the foreseeable future, studies on the topic will likely either employ DL or reference DL-based techniques as a point of comparison.
To support our claim that DL technology will benefit sleep stage scoring, we have structured the remainder of the manuscript as follows. Sections 3 and 4 describe programmed diagnostic tools (PDTs) and various DL tools, respectively. Section 5 describes the guidelines for sleep stage classification and the publicly available databases with sleep recordings to train and evaluate DL models. Section 6 discusses the key findings of automated sleep stage classification studies based on different DL models. In Section 7, we elaborate on the potential future direction of sleep analysis. Section 8 concludes the paper by highlighting our review findings, which includes a discussion of the best DL models and PSG signals employed for automated sleep stage classification.

Medical Background
The discovery of obstructive sleep apnea (OSA) in 1965 is lauded as the greatest progress in the history of sleep medicine [9]. For many years, OSA was regarded as an occasional closure of the upper airway; thus, early treatments, such as tracheostomy, focused primarily on reducing the airway obstruction [10]. However, recent studies show that OSA is linked to the risk of cardiovascular disease and death [11], emphasizing the need to consider other factors in treatment options.
The emergence of sleep disorders, such as insomnia, OSA, and various other sleep-related disorders, further contributes to poor sleep quality [12,13]. Some sleep disorders manifest themselves in sleep interruptions, such as early morning awakening and the lack or absence of restful sleep [14]. In OSA, this is more severe, with symptoms such as extreme daytime drowsiness, snoring, and repeated interruptions of the respiratory airflow during sleep, which stem from the collapse of the upper airway in the throat [15]. These in turn affect cardiovascular physiology, causing cardiovascular diseases such as stroke, angina, and heart failure [16]. OSA has also been linked to higher morbidity and mortality rates [17] and low quality of life scores [18]. Some studies have also shown that a lack of sleep increases fatigue during the day, which decreases the performance of individuals at work and threatens their occupational safety [19,20].
Overnight polysomnogram (PSG) is currently the "gold standard" to measure multiple physiological parameters of sleep, and it is used to score sleep stages [6]. These recordings include electroencephalograms (EEGs), electrooculograms (EOGs), electromyograms (EMGs), electrocardiograms (ECGs), respiratory effort, airflow, and blood oxygenation [5]. According to the American Academy of Sleep Medicine guidelines [21], sleep should be scored by segmenting these PSG recordings into 30-s fragments, also known as epochs. Each epoch is then scored and categorized based on the sleep stage whose characteristics appear most often. For example, an epoch with one characteristic of Sleep Stage 1 and two characteristics of Sleep Stage 2 will be classified as Sleep Stage 2.
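The 30-s segmentation and majority-rule scoring described above can be sketched in a few lines of NumPy. The function names and the fixed 100 Hz example are our own illustrative assumptions, not taken from any reviewed study:

```python
import numpy as np

def segment_into_epochs(signal, fs, epoch_s=30):
    """Split a 1-D PSG recording into non-overlapping 30-s epochs.

    signal : 1-D array of samples; fs : sampling rate in Hz.
    Trailing samples that do not fill a whole epoch are discarded.
    """
    samples_per_epoch = fs * epoch_s
    n_epochs = len(signal) // samples_per_epoch
    return signal[:n_epochs * samples_per_epoch].reshape(n_epochs, samples_per_epoch)

def score_epoch(stage_characteristics):
    """Assign the stage whose characteristics appear most often in the epoch,
    mirroring the majority rule described above."""
    stages, counts = np.unique(stage_characteristics, return_counts=True)
    return stages[np.argmax(counts)]
```

For a 10-min recording sampled at 100 Hz, `segment_into_epochs` yields 20 epochs of 3000 samples each, and `score_epoch(["S1", "S2", "S2"])` returns `"S2"`, matching the worked example above.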
Technology has been shown to help improve convenience and bring down healthcare costs. In this case, such technology could take the form of a cost-effective programmed diagnostic tool (PDT) for automated detection of sleep disorders. Some studies have demonstrated the ability of PDTs to perform as well as experts in sleep stage scoring [22][23][24], and even to outperform the experts in the detection of microstructures of sleep, such as arousals and the cyclic alternating pattern. These highlight PDT versatility and the potential to first augment (and perhaps replace) human experts in the analysis of sleep recordings [6,25]. Traditional machine learning techniques rely on explicit feature engineering, wherein handcrafted features are extracted from the PSG recordings and fed to classifiers, such as the support vector machine or random forest, to classify the recordings into the respective sleep stages. However, this explicit feature engineering involves converting PSG recordings to a low-dimensional vector, which can result in information loss [27]. Furthermore, the aforementioned machine learning techniques are not ideally suited to handle high-dimensional data, because they lack the depth to capture relevant relationships between the covariates in large volumes of data. As a result, standard machine learning is limited in its ability to reliably classify PSG recordings with high precision and accuracy.
On the other hand, DL models can be trained with raw PSG recordings without the need for information reduction [27]. DL techniques train models to make sense of the data on their own by extracting features automatically from the PSG signals; the extracted knowledge then determines the inference. When a large data volume is available, DL models often outperform machine learning models because they can utilize all the available information to make accurate predictions [29]. In addition, DL models learn features that are useful for inference and neglect those that are not. Hence, DL techniques are more suitable than traditional machine learning techniques when dealing with high-dimensional PSG signals and are thus considered the state-of-the-art methods for automated sleep stage classification.

DL Models
In contrast to other programming languages (such as MATLAB), the Python programming language offers a plethora of libraries that facilitate the development of DL models. Figure 1 shows that over 75% of the automated sleep stage classification studies employed DL tools from the Python ecosystem (TensorFlow, Theano, Keras, PyTorch, Lasagne). Keras is a high-level Application Programming Interface (API) for Python that can use either TensorFlow or Theano as its backend. The programming process in TensorFlow and Theano is thus simplified, such that the cognitive load for humans building a DL model is significantly reduced.

Convolutional Neural Network (CNN)
The first CNN was created to mimic the human visual system. In the visual cortex, simple and complex cortical cells break down visual information into simpler representations, so that it is easier for the brain to perceive and classify an image [30,31]. A typical CNN model comprises three main layers: convolutional, pooling, and fully connected, as depicted in Figure 2. In this model, the input data are first broken down by learned filters at the convolutional layer to extract important features, and feature maps are created as outputs. These feature maps contain different kinds of information about the data characteristics [32]. The pooling layer follows immediately after the convolutional layer and is responsible for reducing the feature map dimension. By doing so, the feature map complexity is reduced, and the visual information is broken down further. Another desired effect of this architecture is the reduction of overfitting [33]. Multiple convolutional and pooling layers can be incorporated in a model to make it "deeper" and increase its ability to recognize and classify complex images. After a series of convolutional and pooling layers, the resulting feature map is flattened into a single vector before it is fed to the fully connected layers, which establish connections between output and input via learnable weights [32]. The superiority of this architecture was first demonstrated in image recognition and classification by Krizhevsky et al. [34], whose proposed CNN model won the image classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012).
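The convolution and pooling operations described above can be illustrated with a minimal NumPy sketch. The filter below is fixed for illustration, whereas a real CNN learns its filter weights; like most DL frameworks, the sketch computes cross-correlation rather than a flipped-kernel convolution:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """'Valid'-mode 1-D convolution: slide a filter over the signal and
    return the resulting feature map (no padding, stride 1)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(fm, size=2):
    """Downsample a feature map by keeping the maximum in each window,
    reducing its dimension as the pooling layer does."""
    n = len(fm) // size
    return fm[:n * size].reshape(n, size).max(axis=1)
```

For example, applying the difference-like filter `[1, 0, -1]` to the ramp `[1, 2, 3, 4, 5, 6]` gives the length-4 feature map `[-2, -2, -2, -2]`, and pooling with `size=2` halves it to `[-2, -2]`.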

Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)
Both RNN and LSTM models are designed for sequential data processing, such as speech [35,36], text [37], and handwriting recognition [38]; these models attempt to recognize patterns in a sequence. Previously, Kumar et al. [39] proposed an LSTM model to analyze EEG signals for brain-computer interface (BCI) systems. Their model achieved low misclassification rates of 3.09% and 2.07% on two publicly available BCI datasets. A recent study by Kim et al. [40] demonstrated the effectiveness of LSTMs and RNNs in analyzing biosignals. The LSTM-based deep RNN model proposed in their study achieved exemplary performance: 100% accuracy, precision, and sensitivity in personal authentication based on ECG signals. This makes these models suitable for classifying biosignals such as PSG recordings, which exhibit distinct patterns in different sleep stages.
It should be noted that very early versions of RNNs were incapable of learning long-term dependencies, because these RNNs were unable to form connections between old and new data when a large information gap existed between them [41]. This resulted in a phenomenon known as the vanishing gradient problem, where the error signals vanished during backpropagation, eventually leading to a model breakdown. Hochreiter and Schmidhuber [42] developed the LSTM to solve the vanishing gradient problem, but Gers et al. [43] showed that the LSTM was still not able to efficiently learn sequences that were very long or continuous. The reason for this failure was that the internal values of the memory cell in LSTM models grew without bound under a continuous input stream, even though the LSTM model was programmed to reset itself when faced with this kind of problem. As a remedy, "forget gates" were introduced to LSTMs; these remove data that are no longer relevant from the memory cells, thereby forgetting and resetting information in the memory cells at appropriate times [43]. Useful information, on the other hand, is continuously backpropagated, allowing these models to memorize relevant information and recognize patterns in long-term dependencies. The architecture of the LSTM model is shown in Figure 3. However, high computational complexity is the downside of RNNs and LSTMs, and they need a large memory bandwidth to train [44,45]. As such, hardware designers often experience difficulty when dealing with RNN or LSTM models, because these models demand large amounts of memory and compute, which limits their scalability. Therefore, reducing the computational complexity of RNN and LSTM models must be considered for real-time or mobile applications.
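The gating mechanism described above can be made concrete with a single LSTM time step in NumPy. The weight matrix and bias below are placeholders that a trained model would learn; the point is how the forget gate f scales down the previous cell state, which is how stale information is discarded:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step with a forget gate.

    W maps the concatenation [x, h_prev] to the stacked gate
    pre-activations [f, i, o, g]; W and b are illustrative placeholders."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = len(c_prev)
    f = sigmoid(z[0:n])          # forget gate: discards stale cell content
    i = sigmoid(z[n:2 * n])      # input gate: admits new information
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate cell update
    c = f * c_prev + i * g       # new cell state, kept bounded by the gates
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

With all-zero weights, every gate evaluates to 0.5, so the cell state is simply halved at each step; in a trained model, f close to 0 performs the "reset at the appropriate time" that the forget gate was introduced for.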

Autoencoders (AEs)
Rumelhart et al. [46] were the first to propose autoencoders (AEs), a DL technique specialized in dimensionality reduction and denoising of data. The key component in an autoencoder is the latent (hidden) representation, h, as shown in Figure 4. It plays the role of a bottleneck, which retains only those features that are necessary to reconstruct the input data at the AE output [47]. This latent representation (the features) is often used in data classification tasks. In this respect, AEs and CNNs share a common characteristic: both attempt to extract and learn only the important features. While CNNs use convolutional layers to concentrate the extracted features and improve recognition ability, AEs use the latent representation h to compress the data received from the encoder unit. This step retains salient features and removes irrelevant data. Thus, the operation of AEs includes denoising and reducing the data dimension while simultaneously extracting features. This reduces the computational complexity, which in certain classification tasks makes AE models easier to train [48]. However, AEs also have disadvantages, such as poor data compressibility and the inability to train a model effectively for certain tasks. For instance, the latent representation h will fail to capture salient features if errors are present in the encoder unit [47]: h cannot get rid of errors; instead, it will compute the average of the input data rather than retain salient features for the decoder unit. Since the goal of AEs is to reconstruct the input data, as shown in Figure 4, the encoder units of AEs have to ensure that there are minimal errors before feeding the input data into the latent representation, h.
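A toy linear autoencoder illustrates the bottleneck idea: inputs that lie on a one-dimensional subspace of R^3 pass through a one-unit latent representation h unchanged, while any component orthogonal to that subspace is discarded, which is the denoising effect described above. The chosen direction vector is an illustrative assumption, not a learned encoder:

```python
import numpy as np

# Unit vector defining the (assumed) salient one-dimensional structure.
direction = np.array([1.0, 2.0, 2.0]) / 3.0

def encode(x):
    """Project the input onto the bottleneck: a single latent value h."""
    return np.array([direction @ x])

def decode(h):
    """Reconstruct the input from the latent representation h."""
    return h[0] * direction
```

An input along `direction` is reconstructed exactly, while adding a vector orthogonal to `direction` (noise) leaves the reconstruction unchanged: the bottleneck keeps the salient component and drops the rest.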

Hybrid Models
Hybrid models are based on either CNN-RNN or CNN-LSTM models. The idea of creating such hybrids is to combine the advantages associated with both CNNs and RNN/LSTMs, in terms of feature extraction and pattern recognition ability in sequential data [49,50]. In these hybrid models, the convolutional layers are at the frontline of the model to extract important features from PSG signals, while RNN or LSTM layers would attempt to recognize patterns in feature maps received from the convolutional layers.
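The hybrid layout can be sketched as a convolutional front end that reduces each epoch to a feature, followed by a toy recurrent update that carries state across the epoch sequence. The fixed kernel and the scalar recurrence below are illustrative stand-ins for learned layers:

```python
import numpy as np

def hybrid_forward(epochs, kernel):
    """Toy CNN-RNN hybrid: a convolutional front end extracts one feature
    per epoch, and a scalar recurrence carries state across epochs."""
    k = len(kernel)
    h = 0.0
    outputs = []
    for ep in epochs:
        # Convolutional front end: feature map, then global max pooling.
        fm = np.array([np.dot(ep[i:i + k], kernel)
                       for i in range(len(ep) - k + 1)])
        feat = fm.max()
        # Recurrent update: current feature combined with previous state.
        h = np.tanh(0.5 * h + feat)
        outputs.append(h)
    return np.array(outputs)
```

Note how the output for a quiet epoch following an active one is nonzero: the recurrent state carries context across epochs, which is exactly what the RNN/LSTM layers contribute on top of the CNN features.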

Different Stages of Sleep
According to Rechtschaffen and Kales (R and K) [51], humans can experience six discrete stages during sleep: wakefulness (W), rapid eye movement (REM) sleep, and four stages of non-REM (NREM) sleep (S1 to S4) [52]. Based on the sleep electroencephalogram (EEG) characteristics, W occurs when the brain is most active, which is reflected in high-frequency alpha rhythms. In NREM sleep, these alpha rhythms eventually diminish upon entering S1, wherein the theta rhythm dominates instead. In S2, sleep spindles and the occasional K-complex waveform appear; the K-complex waveform usually lasts for approximately 1 to 2 s. S3 sleep occurs when low-frequency delta rhythms appear intermittently, and they eventually dominate in S4 sleep. Finally, REM sleep usually follows S4 sleep. In REM sleep, theta rhythms resurface, but unlike in S1 sleep, they are accompanied by EEG flattening [52]. Following the guidelines of the American Academy of Sleep Medicine (AASM), the S3 and S4 sleep stages can be merged into one sleep stage, S3, because of the similarity in their characteristics [21]. Since the delta rhythms are the slowest EEG waves, the S3 and S4 sleep stages are known as slow wave sleep (SWS) or deep sleep. Thus, most sleep classification studies are based on five sleep stages, W, S1, S2, S3, and REM, instead of six (Figure 5).
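Merging the R and K stages S3 and S4 into the single AASM-style stage S3 amounts to a simple label remapping; the dictionary below is our own sketch of that rule:

```python
# R and K six-stage labels mapped to the five AASM-style classes used by
# most of the reviewed studies: S3 and S4 are merged into one S3 (SWS).
RK_TO_AASM = {"W": "W", "REM": "REM", "S1": "S1",
              "S2": "S2", "S3": "S3", "S4": "S3"}

def merge_stages(labels):
    """Convert a sequence of R and K stage labels to five-class labels."""
    return [RK_TO_AASM[s] for s in labels]
```

For example, `merge_stages(["W", "S4", "S3", "REM"])` yields `["W", "S3", "S3", "REM"]`.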

Sleep Databases
Eight main sleep databases have been used for automated sleep stage classification. Five of the databases are free to download from PhysioNet [53], namely the Sleep-EDF [54], the expanded Sleep-EDF [54], the St. Vincent's University Hospital/University College Dublin Sleep Apnea Database (UCD) [53], the Sleep Heart Health Study (SHHS) [55,56], and the Massachusetts Institute of Technology-Beth Israel Hospital (MIT-BIH) [57] databases. The ISRUC-Sleep datasets [58] can be downloaded from the official website. Permission is required to obtain the sleep datasets from the Montreal Archive of Sleep Studies (MASS) [59].
The PSG recordings in most of the sleep databases are scored according to the R and K rules [51], wherein scoring is based on wakefulness, NREM sleep, and REM sleep; NREM sleep is then subdivided into four stages (S1 to S4). Exceptions are ISRUC and MASS, which follow the AASM guideline and partition the recordings into five sleep stages instead of six [21].

DL Techniques Used in Automatic Sleep Stage Classification
The development of a programmed diagnostic tool (PDT) for automatic sleep stage classification using DL techniques is shown in Figure 6. First, the PSG recordings have to be pre-processed for standardization or normalization. Depending on the requirements and architecture of the proposed DL model, additional steps to convert the PSG recordings into the right input format may be required; for example, converting one-dimensional (1D) signals into a two-dimensional (2D) format to train 2D-CNN models. Subsequently, the pre-processed signals are split into training, validation, and testing sets. The training set is used to train the model, the validation set to fine-tune it, and the testing set to evaluate its performance. A well-trained model can accurately classify PSG recordings into the five sleep stages. The reviewed studies are grouped by database: Sleep-EDF (Table 1), expanded Sleep-EDF (Table 2), MASS (Table 3), and MIT-BIH and SHHS (Table 4); studies that used the remaining two sleep databases (ISRUC and UCD) and private datasets are listed in Table 5. With the exception of three studies [60][61][62], which classified sleep into four stages, all automated sleep stage classification studies in Tables 1-5 followed the AASM guidelines [21] and classified sleep into five stages. In studies with sleep databases following the R and K rules [51] (i.e., Sleep-EDF, expanded Sleep-EDF, UCD, SHHS, and MIT-BIH), the S3 and S4 stages were often combined manually before pre-processing the PSG signals. Figure 8 shows the number of times PSG recordings such as EEG, EOG, EMG, and ECG signals were used in the sleep stage classification studies of Tables 1-5. It is not surprising that the EEG signal was the most popular input for DL models, since the characteristic waves and descriptions of the sleep stages are often based on EEG characteristics (i.e., alpha, theta, and delta waves; Figure 5).
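The train/validation/test split in the pipeline above can be sketched as follows. The split fractions are assumptions (the reviewed studies vary), and in practice epochs from the same subject should be kept in the same partition to avoid leakage:

```python
import numpy as np

def split_epochs(X, y, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle pre-processed epochs and split them into training,
    validation, and test sets (fractions are illustrative)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```

With 100 labeled epochs and the default fractions, this yields 70 training, 10 validation, and 20 test epochs, with every epoch landing in exactly one partition.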
Of the 36 studies, a mixture of signals (electrooculogram (EOG), electromyogram (EMG), and electroencephalography (EEG)) was employed 14 times, while EEG signals alone were used 28 times. Only a small fraction (five studies) employed ECG or EOG time series.
Nonetheless, other signals within the PSG recordings are indispensable, because they provide additional information on biological aspects of sleep that may not be manifested in EEG recordings. Since REM sleep is characterized by the movement of the eyes and a loss of muscle tone in the body core, EOG and EMG signals may provide key information to separate the REM sleep stage from the other stages. It has been shown that some REM sleep stages can be overlooked with single-channel EEG input [27]. Therefore, a combination of signals comprising EOG, EMG, and EEG is second in terms of frequency of use after single-channel EEG inputs (Figure 8).
Although ECG is an important sleep parameter [96], it is not common to use raw ECG signals as a direct input for DL models. As seen in Table 4, heart rate variability (HRV) parameters derived from ECG signals were used to train the DL models instead. Only three studies employed HRV parameters, and these studies classified sleep into four stages instead of five: wakefulness (W), light sleep (S1 and S2), deep sleep (S3 and S4), and REM sleep. Li et al. [60] proposed a 3-layer CNN model that used a cardiorespiratory coupling (CRC) spectrogram derived from ECG and HRV. Besides alterations in physiological signals, sleep is accompanied by changes in other body systems in some individuals, such as the cardiovascular system [97], respiration [98], and blood flow in the brain [99]; the CRC thus picks up the cardiovascular and respiratory changes. Their model achieved an overall accuracy of 65.9% and 75.4% for SHHS and MIT-BIH, respectively, as seen in Table 4. Tripathy et al. [61] combined EEG and HRV features as input to an AE model; during testing, the model achieved an overall accuracy of 73.7%. Radha et al. [62] published the only study based on ECG signals from a private dataset, collected as part of the European Union SIESTA project [100], as shown in Table 5. Likewise, they converted the ECG signals into HRV and used the HRV features to train an LSTM model, which achieved an accuracy of 77.0%.
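Deriving HRV parameters from an ECG starts from the detected R-peak times. The sketch below computes the RR intervals and two standard time-domain HRV features, SDNN and RMSSD, of the kind fed to DL models in place of the raw ECG; the exact feature sets used in [60][61][62] differ:

```python
import numpy as np

def rr_intervals(r_peak_times):
    """RR intervals (in seconds) from detected R-peak times."""
    return np.diff(r_peak_times)

def hrv_time_features(rr):
    """Two standard time-domain HRV parameters (in seconds):
    SDNN measures overall RR variability, RMSSD beat-to-beat variability."""
    sdnn = np.std(rr)
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))
    return sdnn, rmssd
```

For R-peaks at 0.0, 0.8, 1.7, and 2.7 s, the RR intervals are 0.8, 0.9, and 1.0 s, giving an RMSSD of 0.1 s.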

Discussion
Even though CNNs are primarily used in image classification, they can also be successfully applied to 1D PSG recordings. Most of the automated sleep stage classification studies rely on the CNN approach (Figure 9a). However, in order to use a 2D convolutional layer, the 1D input signals need to be converted into 2D images so that the layer can read the data. There are various 1D-to-2D transformation methods, such as spectrograms [101], time-frequency representations established via the Hilbert-Huang transform [86], or bispectrum algorithms [102]. To date, eight studies have included 2D convolutional layers in their models [60,74,81,84,86,90,93,95]. However, the conversion of 1D signals to 2D signal representations should be carried out with caution, due to the potential loss of useful information during the conversion step [103]. One-dimensional (1D) CNNs were specifically designed to process 1D signals [104]. Unlike traditional 2D-CNNs, which require the input data to be in a matrix format, 1D-CNNs can run on a simple array, significantly reducing the computational complexity. In addition, 2D-CNNs require a deeper model architecture to learn from 1D signals, whereas a 1D-CNN can learn 1D signals with a shallow architecture. This means that training 1D-CNN models on 1D signals is simpler, faster, and therefore more efficient. It also highlights the 1D-CNN models' compatibility with near real-time processing and deployment in mobile applications, which can potentially be used to track and recognize sleep patterns at home [104]. The popularity of 1D-CNNs for the analysis of 1D PSG signals is demonstrated in this review: almost all of the studies that proposed CNN-based models employed 1D convolutional layers, as seen in Figure 9b. Furthermore, studies that proposed 1D-CNN models reported higher performance than those with 2D-CNNs. The highest accuracy score obtained by a 1D-CNN model was 96% [71].
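A common 1D-to-2D conversion is the short-time Fourier transform spectrogram. The NumPy sketch below turns a 30-s epoch into a time-frequency image suitable for a 2D-CNN; the window and hop lengths are illustrative, and the windowing itself is one source of the information loss noted above:

```python
import numpy as np

def spectrogram_2d(x, win=256, hop=128):
    """Convert a 1-D epoch into a 2-D time-frequency image via a
    short-time Fourier transform with a Hann window.
    Returns an array of shape (frequency bins, time frames)."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

For a 30-s epoch sampled at 100 Hz (3000 samples), this yields a 129 x 22 image, and a 10 Hz test tone concentrates its energy near frequency bin 26 (10 Hz x 256 / 100 Hz), as expected.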
Conversely, none of the eight studies that proposed 2D-CNN models surpassed an accuracy score of 90%.
Two studies included a 3D convolutional layer in their proposed CNN models. Phan et al. [73] used 3D filters to process a combination of signals, namely EEG, EOG, and EMG. Similar to the signal pre-processing for a 2D-CNN, these three signals were converted to 2D time-frequency images before being arranged as a 3D input. As a result, a higher accuracy was obtained with the combination of input signals than with a single-channel input. Jadhav et al. [80] converted their EEG input into 2D continuous wavelet transform (CWT) images. The CWT images came in three color channels, red-green-blue (RGB), which provided the third input dimension. Hence, the convolutional layer in their proposed model had 3D filters to read the CWT images in terms of width, height, and color.
From Table 4, it is evident that only Tripathy et al. [61] proposed AE models for automated sleep stage classification. In the study by Zhang et al. [87], AE was used solely as a dimensionality reduction tool to pre-process the EEG time-frequency distribution.

Proposed CNN-Based Models
From Figure 10, it can be observed that the performance of the CNN-based models has improved over the years. To date, the model by Zhang et al. [71] achieved the best overall accuracy (96%) for sleep stage classification. However, this accuracy was achieved using a private clinical dataset. When they evaluated the performance of their model using the Sleep-EDF dataset, an overall accuracy score of 86.4% was achieved. This was lower than the accuracy obtained by Zhu et al. [63] (93.7%), who used single-channel EEG signals from the same database. The unique feature of the model proposed by Zhu et al. was the attention mechanism that they incorporated into the CNN's learning framework. This attention mechanism improved the feature extraction performance of the model through intra- and inter-epoch feature learning. However, the same model achieved a lower accuracy of 82.8% when it was tested on EEG signals from the expanded Sleep-EDF database. Another unique CNN model was proposed by Cui et al. [91], wherein fine-grained methods were used to assist the CNN model in finding the best time segment in the EEG signals. Fine-grained methods construct time series from the EEG signals: if the time window in the fine-grained method is set to 3, every 3 time-steps along the EEG signals are combined into one time segment, hence reducing the complexity of the EEG signals. Their proposed CNN was shallow, with only 7 layers, including 2 convolutional layers; yet, they were able to achieve a high overall accuracy of 92%.
A more versatile and consistent CNN model was proposed by Yildirim et al. [65]: a 19-layer 1D-CNN model with 10 convolutional layers. It achieved an accuracy higher than 90% on both the Sleep-EDF dataset and its expanded version, as seen in Figure 10. The peak accuracy (91.2%) was achieved when a mixture of PSG signals (EEG and EOG) was used as input, but when single-channel EOG signals were used, the accuracy of the model decreased to below 90% (88.8% and 89.8%) (Figure 11).
Both Zhu et al. [63] and Cui et al. [91] showed that CNN models with a small number of layers can achieve a high classification performance of sleep stages by improving the feature extraction ability through additional tools, such as attention mechanism or fine-graining. On the other hand, Yildirim et al. [65] showed that a deeper CNN model can achieve high accuracy classifications across different inputs of PSG recordings (EEG, EOG, EEG + EOG).

Proposed RNN/LSTM-Based Models
Contrary to CNN models, very few automated sleep stage classification studies have used RNN/LSTM models. The best performance was observed in a study by Hsu et al. [66] in 2013, wherein a 4-layer RNN model was used. They adopted the structure of an Elman network and successfully classified the various sleep stages with an overall accuracy score of 87.2%, as seen in Figure 12. A similar accuracy of 86.7% was achieved by Michielli et al. [67], who proposed a cascaded RNN network with 2 LSTM units. Two other studies explored a mixture of signals to train RNN-based models. Dong et al. [83] proposed a mixed neural network combining a multi-layer perceptron with an LSTM; the proposed final model achieved accuracies of 85.9% and 83.4% using F4-EOG and Fp2-EOG inputs, respectively. Both F4 and Fp2 are single-channel EEG signals recorded at different electrode placements; hence, the F4-EOG and Fp2-EOG inputs were considered a mixture of signals (EEG + EOG). Subsequently, Phan et al. [85] proposed an end-to-end hierarchical RNN model, known as SeqSleepNet, which consisted of an attention-based recurrent model and filter bank layers. A short-time Fourier transform was used to convert multiple PSG recordings (EEG, EOG, and EMG) into power spectra, which were then used to train the proposed model. The model achieved a high accuracy score of 87.1%, the highest amongst the RNN/LSTM-based models.

Proposed Hybrid Models
There are limited studies on the employment of hybrid models. The best-performing hybrid model for EEG signals was proposed by Seo et al. [70]. Their IITNet model consisted of CNN layers and two bidirectional LSTM layers. The CNN layers were responsible for extracting representative features in each epoch and producing sequential feature maps, which were then analyzed by the bidirectional LSTM layers to capture temporal sleep stage information [72]. Figure 13 shows that this model achieved an overall accuracy score of 86.7% on the SHHS database, compared to 83.9% on the Sleep-EDF database. On the other hand, Mousavi et al. [69] proposed SleepEEGNet, a CNN-RNN model with bidirectional RNN units. The difference between this model and Seo et al.'s model [70] lay in the architecture of the bidirectional RNN units, which resembled AEs in the former model. On the same Sleep-EDF database, this model achieved a higher overall accuracy score of 84.3%, as seen in Figure 13. However, the proposed model achieved a lower accuracy score of 80.0% with EEG signals from the Expanded Sleep-EDF database.

Figure 13. Performance of proposed hybrid models using EEG signals or a mixture of signals. Various sleep datasets are represented by different colors and the types of signals are described in the bar chart. * Summary statistics: different datasets used to build hybrid models using EEG/PSG signals. * The accuracy scores in Figures 9-12 are based on AASM guidelines and pertain to five-class classification [21].
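The division of labour in these hybrids, with a CNN summarizing each 30-s epoch and a bidirectional recurrent layer adding context from neighbouring epochs, can be made concrete with a deliberately simplified sketch. Everything here (summary statistics standing in for learned CNN features, an exponential-smoothing recurrence standing in for an LSTM) is an assumption for illustration of the data flow only.

```python
def epoch_features(epoch):
    """Stand-in for a CNN encoder: a tiny per-epoch feature vector."""
    mean = sum(epoch) / len(epoch)
    energy = sum(x * x for x in epoch) / len(epoch)
    return [mean, energy]

def recur(seq, state=0.0):
    """Stand-in for an LSTM: exponential smoothing carries context forward."""
    out = []
    for vec in seq:
        state = 0.5 * state + 0.5 * sum(vec)
        out.append(state)
    return out

def bidirectional(seq):
    """Combine forward and backward passes, as bidirectional LSTMs do."""
    fwd = recur(seq)
    bwd = list(reversed(recur(list(reversed(seq)))))
    return [a + b for a, b in zip(fwd, bwd)]

# A synthetic "night": six 30-s epochs, each reduced to a feature vector.
night = [[float(i % 3)] * 10 for i in range(6)]
per_epoch = [epoch_features(e) for e in night]
context = bidirectional(per_epoch)
print(len(context))  # one context value per scored epoch
```

In IITNet and SleepEEGNet the same pattern holds, except that both stages are learned end to end and the per-epoch output is a softmax over the five sleep stages.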
When a mixture of PSG signals was taken into consideration, the CNN-RNN-based model proposed by Biswal et al. [94] outperformed all other models. However, their study used a private sleep database with recordings from the Massachusetts General Hospital (MGH) sleep laboratory to train and test the model. They also trained their model on the MGH dataset and then tested it on the SHHS dataset; with that setup, they obtained an overall accuracy of 77.7%. This score is lower than that obtained by the CNN-based model of Fernández-Varela et al. [89], who employed a mixture of signals from the SHHS database to train and test their model.
The number of studies employing DL methods for automated sleep stage classification has increased in the past few years (Figure 14); most of these studies incorporated CNN models. CNN models became popular after the publication by Krizhevsky et al. [34]. However, the number of studies relying on this architecture started to decline after 2018. This implies that CNN-based approaches have perhaps reached their peak performance in classifying sleep stages. The decline is also likely due to the best CNN-based architectures now being less competitive compared with other DL models. Conversely, research on RNN/LSTM and hybrid models remained stagnant from 2018 to 2019, and no further studies were carried out using AE models after 2018. This suggests that more attention should be paid to improving the performance of these models, particularly RNN/LSTM and hybrid models, in automated sleep stage classification. When assessing polysomnographic recordings, clinical experts rely on a combination of EEG, EOG, and EMG signals before they determine the sleep stage for each sleep phase [63]. To be on par with clinical experts, an ideal DL model should effectively classify sleep stages based on a mixture of signals. At present, the majority of automated sleep stage classification studies demonstrate high performance with a single EEG channel, but only a small fraction (25%) evaluated the performance of their approaches on a mixture of signals (Figure 8).
In summary, research on a mixture of PSG signals should be the main focus in the automated detection of sleep stages. RNN/LSTM and hybrid models have yet to reach their peak performance in classifying sleep stages. Further research evaluating these models across different databases could decrease the bias in these methods and help identify architectures capable of processing mixed PSG signals. A model with an optimal architecture could then be employed in various applications and platforms, such as mobile, point-of-care monitoring devices.
This review underscores the following key points:
1. Numerous studies (15 from Figure 10) employed CNN models with EEG signals, and CNN models are effective in recognizing characteristic features of sleep EEG.
2. Most studies (60% from Figure 8) used EEG signals and achieved high classification accuracy.
3. EEG signals were mainly used in studies that explored a mixture of PSG signals. In other words, EEG could serve as a reference signal when considering a mixture of PSG signals to train and evaluate newly proposed models.
The limitations of this review are as follows:
1. It is difficult to compare the various models and identify the best-performing approach, because the majority of studies used data from only one sleep database to train and test their models.
2. There is a lack of studies utilizing other PSG recordings, such as EOG, EMG, or ECG signals. Studies that did use these recordings did not perform as well as those using only EEG signals, which limits the implementation of these recordings in real-world applications for automated sleep stage classification.
Future Work
We anticipate that more public databases with large data records will become available for studies on sleep stage classification and the development of more accurate approaches. When constructing a DL model, the ability to implement it in cloud-based processing systems should be considered (Figure 15). Extracted EEG or PSG signals could then be sent to the model for processing in the cloud, and the results of the analysis returned to the clinician. The analysis could be performed on-line or off-line, with the goal of reducing the workload of experts who would otherwise have to review hours of recordings manually. Subsequently, verified results could be sent to the patient's mobile phone. With the DL model deployed in the cloud, any device with internet access, such as a mobile application, could use it anytime and anywhere. For example, Patanaik et al. [105] developed a cloud-based framework that classifies sleep stages from real-time data with 90% accuracy. Their model performed better than expert scorers (82%). This type of framework is suitable for wearable sleep devices, such as the Kokoon and Dreem headbands, which read real-time data [105]. Furthermore, future studies should focus on using various EEG signals to detect sleep stages. Currently, the most common electrode placements for recording EEG signals are Fpz-Cz, Pz-Oz, C4-A1, and C3-A2.
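The cloud round-trip in Figure 15, in which a device uploads a recorded epoch, a hosted model scores it, and the result is returned, can be sketched as a pair of functions. The JSON message shape and the two-class energy rule below are invented placeholders; a deployed system would run a trained DL model behind an authenticated web API.

```python
import json

def classify_epoch(samples):
    """Placeholder classifier standing in for the cloud-hosted DL model."""
    energy = sum(x * x for x in samples) / len(samples)
    return "Wake" if energy > 0.5 else "N2"  # illustrative two-class rule

def cloud_service(request_json):
    """Server side: decode the uploaded epoch, score it, return JSON."""
    payload = json.loads(request_json)
    stage = classify_epoch(payload["samples"])
    return json.dumps({"epoch_id": payload["epoch_id"], "stage": stage})

# Client side: a wearable posts one 30-s epoch and receives the scored stage.
request = json.dumps({"epoch_id": 17, "samples": [0.1] * 3000})
response = json.loads(cloud_service(request))
print(response["stage"])  # -> N2
```

Keeping the model server-side means it can be retrained and redeployed without updating the devices, which is what makes the anytime, anywhere access described above practical.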
Another area of development and research is the microstructure of sleep, such as the Cyclic Alternating Pattern (CAP) and arousals. Their duration is shorter than a half-minute epoch; hence, they are often undetected by humans in conventional sleep stage scoring [6]. Since quantification of microstructures by humans has poor reproducibility and is prone to errors, it would be desirable to develop automated approaches to address this deficiency [6]. However, the CAP sleep database (CAPSLPDB) is currently the only database that provides PSG recordings for the development of CAP detection tools [106]. Hence, more public databases designed for the assessment of sleep microstructure should be made available to allow the advancement of PDTs in the detection of CAP.

Conclusions
Sleep disorders are a pressing global issue, and the most dangerous among them is obstructive sleep apnea, which can lead to cardiovascular diseases if left untreated. Hence, efficient and accurate diagnostic tools are required for early intervention. In this work, we reviewed 36 studies that employed programmed diagnostic tools with DL models as the backbone, analyzing overnight polysomnogram recordings to classify sleep stages. Presently, CNN models offer higher performance in classifying sleep stages, especially with EEG signals. Hence, they are consistently favored by researchers over other machine learning models and physiological signals. Moreover, employing 1D-CNN models is advantageous, because they yield high classification results on EEG signals. However, EEG signals alone may not be sufficient to achieve robust classification. To achieve robustness and high accuracy, one could develop a system that combines automated processing with human expertise in the interpretation of EEG, EOG, and EMG signals when classifying sleep stages. Therefore, in this review, we highlighted that future studies should focus on classifying sleep stages using all, or a combination, of these signals. Furthermore, other DL models, such as RNN/LSTM and hybrid models, should be explored further, as their full potential has yet to be realized. Future studies could also address the compatibility and applicability of DL models in mobile and real-time applications. Lastly, more research into developing DL models that detect sleep microstructures is required, as these are often missed in sleep stage scoring.

Conflicts of Interest:
The authors declare that they have no conflict of interest.