1. Introduction
Sleep apnea is defined by the American Academy of Sleep Medicine (AASM) [1] as a sleep-related disorder characterized by breathing difficulties during sleep. The Apnea Hypopnea Index (AHI), which indicates the number of apnea and hypopnea events per hour of sleep, is considered the most relevant metric for diagnosing the existence and severity of the disorder. The disorder is highly prevalent, with a global estimate of 200 million people affected [2]. Four percent of adult men and two percent of adult women have the disorder, making it more common in men than in women [3]. However, among patients with moderate to severe sleep apnea, 93% of middle-aged women and 82% of middle-aged men were undiagnosed [4]. Sleep apnea can also affect the juvenile population, as verified by Gislason and Benediktsdóttir [5], who estimated a prevalence of three percent in pre-school children. Sleep apnea is related to ischemic heart disease, cardiovascular dysfunction, and stroke [6], as well as to daytime sleepiness [7], and it can be associated with the development of type 2 diabetes [8]. In some cases, the drowsiness caused by poor sleep can lead to traffic accidents [6].
Full night polysomnography (PSG), performed in a sleep laboratory, is considered the gold standard for sleep apnea diagnosis [1]. PSG involves recording a minimum of eleven channels of physiological signals collected from different sensors, including the electroencephalogram (EEG), electrooculogram (EOG), electromyogram (EMG) and electrocardiogram (ECG), which allows accurate results to be achieved [9]. However, it is considered uncomfortable (due to the large number of wires and sensors connected to the subject’s body), expensive, and unavailable to a large part of the world’s population [10]. In addition, the analysis process is time-consuming and labor-intensive [11], and is therefore prone to errors. Medical facilities commonly have a small number of professionals capable of diagnosing sleep apnea [12,13], leading to long waiting lists [14].
Various methods have been proposed in the literature to address these issues, and most of them include two steps: handcrafting a set of relevant features, and developing a suitable classifier to provide an automatic diagnosis. These methods employ classifiers such as k-nearest neighbor (kNN) [15,16], support vector machine (SVM) [2,16,17], fuzzy logic [18,19], neural networks [16,20,21], and linear discriminant analysis (LDA) [16,22]. However, these approaches have two main issues. The first is the practically unlimited number of feature combinations that can be chosen, compounded by the fact that combining two or more independent features, each chosen as the best, does not guarantee a better feature set [23]. This problem can be mitigated with proper feature selection methods, and multiple algorithms have been presented in the literature: statistical estimation [9]; minimum redundancy maximum relevance (mRMR) [16]; wrapper approaches such as sequential forward selection (SFS) [15,16,24]; principal component analysis (PCA) [25]; and the genetic algorithm (GA) [21]. The second issue is the need for considerable knowledge of the specific field to create relevant features. Both issues can be addressed by using deep neural networks, which automatically generate features by finding patterns in the input signal from the sensor.
Previous reviews have been performed in the field of sleep apnea detection, analyzing devices for home detection of obstructive sleep apnea (OSA) [26], classification methods based on respiratory and oximetry signals [26], different detection approaches [27], and detection and treatment methods [28]. However, no previous review has assessed the current development of methods for detecting sleep apnea using deep learning. In addition, recent publications show a significant accuracy improvement when using deep networks instead of shallow networks. Therefore, the main focus of this review is the analysis of such works, assessing the performance of the presented methods to provide in-depth knowledge about the applicability of deep learning to the detection of sleep apnea.
A systematic review is performed using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) approach. The employed review method is presented in Section 2. The analysis of the employed signals or sensors and databases is presented in Section 3, while Section 4 presents a discussion regarding the usability and necessity of pre-processing the data. Section 5 provides a detailed explanation of the classifiers mentioned throughout the review. The common performance indicators are discussed in Section 6. The key question of this review, how to implement deep learning for sleep apnea, and the comparison between different techniques are addressed in Section 7. The discussion and conclusions are presented in the final section (Section 8), with indications of the limitations and possibilities for future research on the analyzed topic. The abbreviations used are listed in Appendix A.
2. Materials and Methods
The review covered the period between 2008 and 2018 and followed the PRISMA style. A systematic search was conducted on Web of Science, IEEE Xplore, PubMed, ScienceDirect, and arXiv. The selected search keywords were (“sleep apnea” OR “sleep apnoea”), to cover the different spellings of the word apnea, combined through the AND operator with each of the following terms: “unsupervised feature learning”; “semi-supervised learning”; “deep belief net”; “CNN”; “convolution neural network”; “autoencoder”; “deep learning”; “recurrent neural network”; “RNN”; “long short-term memory”; “LSTM”. A total of 255 articles were found, specifically: 93 on Web of Science; 77 on PubMed; 51 on IEEE Xplore; 25 on ScienceDirect; 9 on arXiv. A total of 116 duplicate articles were removed from the list.
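The keyword combination and the article counts described above can be sketched as follows. This is only an illustration: the exact query syntax accepted by each database is an assumption, and the names `build_queries`, `HITS` and `DUPLICATES` are hypothetical helpers introduced here.

```python
# Terms quoted from the search description. The boolean syntax produced
# below is an assumption; each database may require its own query format.
DISORDER_TERMS = ["sleep apnea", "sleep apnoea"]
METHOD_TERMS = [
    "unsupervised feature learning", "semi-supervised learning",
    "deep belief net", "CNN", "convolution neural network", "autoencoder",
    "deep learning", "recurrent neural network", "RNN",
    "long short-term memory", "LSTM",
]

# Number of articles found per database and duplicates, as reported in the text.
HITS = {"Web of Science": 93, "PubMed": 77, "IEEE Xplore": 51,
        "ScienceDirect": 25, "arXiv": 9}
DUPLICATES = 116


def build_queries(disorder_terms, method_terms):
    """Return one boolean query string per method term."""
    disorder_clause = "(" + " OR ".join(f'"{t}"' for t in disorder_terms) + ")"
    return [f'{disorder_clause} AND "{term}"' for term in method_terms]


queries = build_queries(DISORDER_TERMS, METHOD_TERMS)
total_hits = sum(HITS.values())           # articles found in all databases
unique_articles = total_hits - DUPLICATES  # articles left after deduplication

if __name__ == "__main__":
    for query in queries:
        print(query)
    print(total_hits, unique_articles)
```

With the reported counts, the 255 articles found reduce to 139 after the duplicates are removed.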
The title and abstract of each article were analyzed and 19 were selected as relevant to the topic. The inclusion criteria considered the keywords apnea and deep network. The main exclusion criterion was non-English articles. Works that were not explicitly developed for sleep apnea detection, but could be adapted for that purpose, were also excluded. Two papers were added due to their relevance although they did not appear in the search, and two were removed despite appearing in the search. A relevant article, found by analyzing the references of the already selected articles, was included despite not appearing in the search engines. Therefore, a total of 21 articles were selected for this review. The flow chart of the search strategy is presented in Figure 1, with n indicating the number of articles.
The last decade was chosen for this work since most of the articles (20 articles) were published in 2017 (five articles) and 2018 (15 articles). Only one was published in 2008. Thus, the number of published articles tripled within one year, highlighting the importance of this topic and the need for a review to consolidate the developed approaches and point out new research lines.
A word cloud, presented in Figure 2a, was created from the original titles of the articles. The critical features of the implemented deep networks were difficult to identify from it because of synonymous words, abbreviations and acronyms for the same term, and articles and prepositions that carry no information. Therefore, a modified text, with acronyms and without the connecting words and the most frequent words of Figure 2a, was used to produce the word cloud presented in Figure 2b. Words such as using, every form of detect, classification, sleep, apnea, and events were also removed. In addition to the keywords searched for this review, a validation of the keyword selection of the papers is presented in Figure 3. From this modification and exclusion of the original text, it was possible to verify that most of the works use ECG (electrocardiography) sensors as the source signal. CNN (convolutional neural network) and LSTM (long short-term memory) were the most commonly mentioned classifiers.
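The title clean-up that precedes the second word cloud can be illustrated with a minimal sketch. The titles, the stop-word list and the acronym map below are hypothetical examples written for this illustration; they are not the data or the tool used to produce the figures.

```python
import re
from collections import Counter

# Hypothetical phrase-to-acronym map (lowercase, applied after lowercasing).
ACRONYMS = {
    "convolutional neural network": "cnn",
    "convolution neural network": "cnn",
    "recurrent neural network": "rnn",
    "long short-term memory": "lstm",
    "electrocardiogram": "ecg",
    "electrocardiography": "ecg",
}
KNOWN_ACRONYMS = set(ACRONYMS.values())

# Connecting words plus the uninformative frequent words named in the text.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "with", "from",
              "to", "using", "classification", "sleep", "apnea", "events"}

# Hypothetical titles, used only to exercise the functions below.
EXAMPLE_TITLES = [
    "Sleep apnea detection using a convolutional neural network and ECG",
    "Detecting apnea events from electrocardiogram signals with long short-term memory",
    "A CNN for classification of sleep apnea from ECG",
]


def clean_title(title):
    """Lowercase, merge synonyms into acronyms and drop uninformative words."""
    text = title.lower()
    for phrase, acronym in ACRONYMS.items():
        text = text.replace(phrase, acronym)
    words = []
    for token in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text):
        # Drop stop words and every form of "detect".
        if token in STOP_WORDS or token.startswith("detect"):
            continue
        words.append(token.upper() if token in KNOWN_ACRONYMS else token)
    return words


def count_title_words(titles):
    """Word frequencies over all cleaned titles (the input of a word cloud)."""
    counts = Counter()
    for title in titles:
        counts.update(clean_title(title))
    return counts


if __name__ == "__main__":
    print(count_title_words(EXAMPLE_TITLES).most_common())
```

The resulting frequency table is what a word-cloud tool would take as input; merging spelled-out terms into one acronym is what prevents the same concept from being split across several small entries.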
6. Performance Indicators
Multiple metrics can be used to assess the performance of the classification. The most common parameters shared among all the works are calculated by considering the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values. These parameters can be expressed as defined by Baratloo et al. [87], where TP is the number of cases correctly identified with the disorder (patients or apnea events), FP is the number of cases incorrectly identified with the disorder, TN is the number of cases correctly identified as normal (healthy or non-apnea) and FN is the number of cases incorrectly identified as normal. However, an interchangeable definition of TP and FP was used in some of the reviewed works [43,46]. It is possible to define the accuracy (Acc), specificity (Spc), precision or positive predictive value (PPV) and recall or sensitivity (Sen) as:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Spc} = \frac{TN}{TN + FP}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP}$$

$$\mathrm{Sen} = \frac{TP}{TP + FN}$$
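A minimal sketch of how these four indicators are computed from the confusion counts, using their standard definitions. The class and function names are hypothetical, and the counts in the usage example are made-up numbers chosen only to exercise the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary apnea / non-apnea classifier."""
    tp: int  # apnea cases correctly identified as apnea
    tn: int  # normal cases correctly identified as normal
    fp: int  # normal cases incorrectly identified as apnea
    fn: int  # apnea cases incorrectly identified as normal


def _ratio(numerator, denominator):
    """Return the ratio, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def accuracy(c):
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def specificity(c):
    return _ratio(c.tn, c.tn + c.fp)


def precision(c):
    """Precision, also called positive predictive value (PPV)."""
    return _ratio(c.tp, c.tp + c.fp)


def sensitivity(c):
    """Sensitivity, also called recall."""
    return _ratio(c.tp, c.tp + c.fn)


if __name__ == "__main__":
    counts = ConfusionCounts(tp=40, tn=45, fp=5, fn=10)  # made-up example
    print(accuracy(counts), specificity(counts),
          precision(counts), sensitivity(counts))
```

Returning NaN for an empty denominator is a design choice of this sketch; it makes an undefined metric (for example, precision when nothing is predicted positive) visible instead of silently reporting zero.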
For binary classifiers (models with only two possible outputs), recall has the same definition as Sen. However, these metrics can be strongly affected by imbalanced classes in the dataset. Other metrics are used to address this issue, such as a combined objective (CO) and the area under the receiver operating characteristic curve (AUC). The receiver operating characteristic curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) for different thresholds of the classifier [88]. The area under this curve is then calculated to determine the AUC value. An alternative metric is the $F_1$ score, given by:

$$F_1 = 2 \cdot \frac{\mathrm{PPV} \cdot \mathrm{Sen}}{\mathrm{PPV} + \mathrm{Sen}}$$
A weighted proportion can be introduced to the $F_1$ score, producing a weighted $F_1$ score that is defined in terms of the class index, the total number, and the number of classes.
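The AUC and the F1 score discussed above can be illustrated with a short sketch. Two assumptions are made here and should be checked against the reviewed works: the AUC is obtained with the trapezoidal rule over the ROC points, and the weighted F1 uses the common support-weighted form, in which each class contributes its F1 multiplied by its share of the samples. All function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC by sweeping thresholds over the scores (trapezoidal rule).

    labels: 1 for apnea, 0 for normal; scores: higher means more apnea-like.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to build a ROC curve")
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs, starting at the origin
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def f1_score(ppv, sen):
    """Harmonic mean of precision (PPV) and sensitivity (Sen)."""
    return 2.0 * ppv * sen / (ppv + sen) if (ppv + sen) else 0.0


def weighted_f1(per_class_f1, per_class_support):
    """Support-weighted F1 (assumed form): sum over classes of (n_i / N) * F1_i."""
    total = sum(per_class_support)
    if total == 0:
        raise ValueError("at least one sample is required")
    return sum(n / total * f for f, n in zip(per_class_f1, per_class_support))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(f1_score(0.5, 1.0))
    print(weighted_f1([0.8, 0.5], [90, 10]))
```

Weighting by class share pulls the aggregate toward the majority class, so an unweighted (macro) average is the variant to look at when the minority class matters most.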
Class imbalance can also be addressed by down-sampling or up-sampling. Balanced bootstrapping has also been proposed and used [55]. Comprehensive reviews of learning from imbalanced datasets [89], of handling the imbalance problem [90], and of the techniques used in deep learning [91] are available in the literature.
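Random down-sampling and up-sampling can be sketched as below. This is a generic illustration, not the procedure of any reviewed work; `resample_balanced` and its parameters are hypothetical names, and the sketch is meant to be applied to training data.

```python
import random


def resample_balanced(samples, labels, mode="down", seed=0):
    """Balance the classes by random down-sampling or up-sampling.

    mode="down": every class is reduced to the size of the smallest class.
    mode="up":   every class is grown, by sampling with replacement,
                 to the size of the largest class.
    """
    if mode not in ("down", "up"):
        raise ValueError("mode must be 'down' or 'up'")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    sizes = [len(items) for items in by_class.values()]
    target = min(sizes) if mode == "down" else max(sizes)
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if mode == "down":
            chosen = rng.sample(items, target)
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * len(chosen))
    return out_samples, out_labels


if __name__ == "__main__":
    x = list(range(10))
    y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 2 apnea vs 8 normal segments
    print(resample_balanced(x, y, mode="down"))
    print(resample_balanced(x, y, mode="up"))
```

Down-sampling discards majority-class examples, while up-sampling repeats minority-class examples; which trade-off is acceptable depends on the dataset size.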
8. Discussion and Conclusions
This systematic literature review has synthesized and summarized the published deep classification methods for sleep apnea detection. From the selected 21 studies, the main findings are provided below.
It was verified that a significant number of papers were published in the last two years, indicating a strong interest of the research community in this topic. The comparison between deep networks and the choice of their parameters are still matters of ongoing research. In addition, which sensor or signal is best for apnea detection remains an open question.
The ECG sensor-based signal was the most commonly used, which could be justified, as indicated by Mendonça et al. [27], by the fact that ECG signals provided the highest global classification accuracy for a single source sensor. However, sleep apnea is directly related to respiration. Thus, this higher accuracy with ECG signals could be due to the use of public datasets that are less affected by noise [27]. Among the works based on a single sensor, Pathinarupothi et al. [33] achieved their best results using the SpO2 signal rather than the IHR from ECG. Therefore, ECG signals do not universally perform better. However, a direct comparison of the performance obtained with different signals across different works is not fair in this review because the works use different classifiers and different databases.
It was verified that using more than one signal from sensors improves the predictive capability of the models, as reported by Haidar et al. [53]. This is understandable because the gold standard of sleep apnea tests uses several signals. However, the main research goal of most of the works is to achieve a respectable result using fewer sensors.
Most of the works with deep networks outperformed the shallow networks, except for the work of T. Kim et al. [52], in which a deep network performed slightly worse than the shallow network. However, they used a deep network with human-engineered features. In a similar work [57], where features were also used with a deep network (MHLNN), the deep network outperformed classical machine learning techniques. Therefore, the result of T. Kim et al. [52] may be due to the feature selection process or to the hyperparameter choice of the deep network.
CNN was the most commonly used classifier, and approaches based on both CNN1D and CNN2D were presented. It was not possible to indicate which is the best type of network since the testing conditions were different in all works. However, McCloskey et al. [59] compared both and verified that 2-D spectrogram images of the nasal airflow performed better than the raw 1-D signal with CNN. A similar conclusion was attained by Biswal et al. [49], where RCNN with a spectrogram representation achieved a higher accuracy. Analyzing the three works of Urtnasan et al. using CNN1D [43,46] and RNN [45], in which the data were collected from the same hospital, it was possible to verify that RNN outperformed CNN. However, more research is needed to reach a definitive conclusion. The same type of conclusion can be reached by analyzing the works that employed LSTM and GRU.
Hyperparameter optimization is also a problem in deep network implementation. Some works [43,46,47] have verified that blindly increasing the number of layers or the number of neurons in the hidden layers did not increase the performance. Most of the works chose the hyperparameters with an educated guess or by trial and error. Others used a predefined search space and tried to find the best solution within it [43,46,47]. A possible alternative solution was presented by Falco et al. [36], where an evolutionary algorithm (EA) was used to choose the hyperparameters.
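Searching a predefined hyperparameter space can be sketched as an exhaustive grid search. The search space and the scoring function below are placeholders invented for this illustration: `toy_validation_score` stands in for training a network and measuring its validation performance, and its values are not results from any reviewed work.

```python
from itertools import product


def grid_search(search_space, evaluate):
    """Evaluate every configuration in the grid and return the best one."""
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_score(config):
    """Placeholder score (hypothetical): peaks at 3 layers and 64 units.

    In practice this function would train the network described by
    `config` and return its score on a validation set.
    """
    return (1.0
            - 0.05 * abs(config["layers"] - 3)
            - 0.001 * abs(config["units"] - 64))


# Hypothetical predefined search space.
SEARCH_SPACE = {"layers": [1, 2, 3, 4, 5], "units": [32, 64, 128]}

if __name__ == "__main__":
    print(grid_search(SEARCH_SPACE, toy_validation_score))
```

The cost of a grid grows multiplicatively with each added hyperparameter, which is one reason why guided searches such as an EA are attractive when each evaluation requires training a deep network.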
For performance assessment, the dominant methodologies were hold-out and cross-validation. Hold-out does not test the whole dataset. It is understandable that, due to the long simulation time and the assumption that a significant number of examples produces the same effect, many authors do not choose cross-validation when using deep learning. On the other hand, cross-validation of event-based apnea detection techniques is frequently used without ensuring subject independence (or this information was not specifically mentioned in the paper), which is essential to assess the generalization capability of the model. Some authors used dataset balancing methods or specific parameters to solve the class imbalance problem. For some works it was also not clear whether the test dataset was balanced or not; balancing the test dataset should be avoided since it changes the natural distribution of the data and, consequently, compromises the assessment of the generalization of the model. To have a fair test, a form of cross-validation with subject independence could be suggested as a good choice for future research.
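Subject-independent cross-validation can be sketched as follows: folds are built over subjects rather than over individual events, so that no subject contributes to both the training and the test data of a fold. The function name and the round-robin assignment of subjects to folds are assumptions of this sketch.

```python
def subject_independent_folds(subject_ids, n_folds):
    """Split event indices into folds so that subjects never overlap.

    subject_ids: one subject identifier per event (or segment).
    Returns a list of (train_indices, test_indices) pairs.
    """
    subjects = sorted(set(subject_ids))
    if n_folds < 2 or n_folds > len(subjects):
        raise ValueError("n_folds must be between 2 and the number of subjects")
    # Round-robin assignment of whole subjects to folds (an assumption;
    # a stratified or randomized assignment could be used instead).
    fold_of_subject = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = []
    for k in range(n_folds):
        test = [i for i, s in enumerate(subject_ids) if fold_of_subject[s] == k]
        train = [i for i, s in enumerate(subject_ids) if fold_of_subject[s] != k]
        folds.append((train, test))
    return folds


if __name__ == "__main__":
    ids = ["a", "a", "b", "b", "b", "c", "d", "d"]  # hypothetical subjects
    for train, test in subject_independent_folds(ids, n_folds=2):
        print("train:", train, "test:", test)
```

Because each subject contributes a different number of events, the folds produced this way are generally unequal in size; this is the price of keeping the test subjects unseen during training.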
There are two main classification strategies: event-by-event and global classification. Most of the works concentrated on event-by-event classification, and eight works used global classification considering OSA severity classification. However, it is possible to obtain a global classification from event-by-event classification methods by using a threshold approach, as indicated by Pathinarupothi et al. [33]. This observation is considered extremely relevant for further research since it will allow the methods to be used for clinical diagnosis.
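A minimal sketch of deriving a global classification from event-by-event outputs with thresholds. It rests on several assumptions that are not taken from the cited work: the recording is split into fixed-length segments (one minute here), the number of segments predicted as apnea per hour is used as a proxy for the AHI, and the commonly used adult severity cut-offs of 5, 15 and 30 events per hour are applied. The function names are hypothetical.

```python
def estimate_ahi(segment_predictions, segment_minutes=1.0):
    """Approximate the AHI from per-segment predictions (1 = apnea, 0 = normal).

    Counting positive segments is only a proxy: one segment may contain
    several events, and one event may span several segments.
    """
    if not segment_predictions:
        raise ValueError("at least one segment is required")
    hours = len(segment_predictions) * segment_minutes / 60.0
    return sum(segment_predictions) / hours


def severity_from_ahi(ahi):
    """Map an AHI value to a severity class using the 5/15/30 cut-offs."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    # Hypothetical 8-hour recording: 80 of 480 one-minute segments flagged.
    predictions = [1] * 80 + [0] * 400
    ahi = estimate_ahi(predictions)
    print(ahi, severity_from_ahi(ahi))
```

A binary global decision (apnea or normal subject) follows from the same estimate by keeping a single threshold instead of three.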