Heuristic Optimization of Deep and Shallow Classifiers: An Application for Electroencephalogram Cyclic Alternating Pattern Detection

Methodologies for automatic non-rapid eye movement and cyclic alternating pattern analysis were proposed to examine the signal from one electroencephalogram monopolar derivation for the A phase, cyclic alternating pattern cycles, and cyclic alternating pattern rate assessments. A population composed of subjects free of neurological disorders and subjects diagnosed with sleep-disordered breathing was studied. Parallel classifications were performed for non-rapid eye movement and A phase estimations, examining a one-dimension convolutional neural network (fed with the electroencephalogram signal), a long short-term memory (fed with the electroencephalogram signal or with proposed features), and a feed-forward neural network (fed with proposed features), along with a finite state machine for the cyclic alternating pattern cycle scoring. Two hyper-parameter tuning algorithms were developed to optimize the classifiers. The model with long short-term memory fed with proposed features was found to be the best, with accuracy and area under the receiver operating characteristic curve of 83% and 0.88, respectively, for the A phase classification, while for the non-rapid eye movement estimation, the results were 88% and 0.95, respectively. The cyclic alternating pattern cycle classification accuracy was 79% for the same model, while the cyclic alternating pattern rate percentage error was 22%.


Introduction
Sleep is a complex cyclical process that is usually examined using sleep-related metrics attained from signals recorded by polysomnography (PSG). This examination is considered the gold standard for sleep analysis. The scoring rules, defined by the American Academy of Sleep Medicine (AASM) manuals, assign to each thirty-second epoch (standardized scoring epoch) either the stage wake, Rapid Eye Movement (REM), or one of the Non-REM (NREM) stages [1].
The electroencephalogram (EEG) signals are used as a reference to define the sleep structure, which is composed of macrostructure and microstructure. The macrostructure is a stepwise profile characterized by repetitive NREM and REM cycles, according to the prevalent EEG activity, while transient and phasic events, shown in the brain's electrical activity, define the microstructure [2]. The paradigm composed of epochs lasting one second was employed to score the microstructure events since they have a shorter duration than the standardized scoring epoch [3].
The Cyclic Alternating Pattern (CAP) concept was proposed by Terzano et al. [4] to examine the microstructure of the NREM sleep by evaluating the sequences of transient electrocortical events, which are different from the EEG background activity. Specifically, the CAP is composed of an activation phase (A phase), characterized by a sequence of transient EEG variations, directly followed by a quiescent phase (B phase), denoting the intermittent recovery of background activity. Each phase can only be considered a valid CAP phase if its duration ranges from two to sixty seconds [4].
Several studies have been carried out to understand the role of CAP in the sleep process. It was proposed that CAP is significantly related to the creation, consolidation, and disruption of the sleep macrostructure [5]. CAP was identified as an EEG marker of sleep instability [6], functioning as a measure of the brain's effort to preserve sleep [7], thus, working as a sleep quality marker. Temporal relation between behavioral activities, autonomic functions, and CAP was observed [8]. Consequently, CAP was found to be correlated with the occurrence of several disorders, such as sleep apnea [9]. These works advocate the relevance of the CAP as a sleep quality marker. However, a large amount of information is generated during a full night EEG recording. Thus, manual scoring all the CAP events is unpractical, and misclassifications are likely to occur. As a result, the specialist agreement, when analyzing the same EEG signals, ranges from 69% to 78% (getting closer to the lower bound as the number of specialists involved in the analysis increases) [10,11]. Therefore, the development of automatic CAP detection algorithms is desirable and consubstantiates the necessity of this study. The main goal of the developed work is to create an automatic classifier for CAP assessment, which can be used to predict sleep quality.
Each A phase can be divided into three subtypes according to the amplitude and spectral characteristics of the EEG signal [4]. Several works have proposed automatic methods for classifying these subtypes [12,13]. Although these subtypes provide relevant information regarding the sleep process, for the sleep quality examination, the most relevant information is in the occurrence or not of CAP cycles to calculate the CAP rate (total CAP duration to the total NREM sleep duration ratio [4]). This metric is the most widely used microstructural parameter for clinical purposes [8]. It has the advantage of being characterized by a low night-to-night intraindividual variability, thus, allowing the appraisal of the quality of sleep by knowing the subject's age (a CAP rate higher than the average for the subject's age can be linked to poorer sleep quality [8]).
Most state-of-the-art works performed the A phase detection by feeding features, created by a feature creation process, to the classification procedure, which is either composed of tuned thresholds or a machine learning classifier. However, the feature creation process requires significant domain-specific knowledge. It is becoming considerably challenging to discover a new set of features that can achieve a higher performance than the methods reported, in the state-of-the-art. It is also relevant to note that combining two or more features does not ensure performance improvement, and the features usually need to be sorted to find the most relevant [14]. These difficulties can be resolved by using a deep learning classifier that automatically learns the relevant patterns directly from the input signal. These classifiers were identified in this work as Automatic Feature Creation (AFC) models.
Nonetheless, important patterns can only be found if there is enough data to train the classifier. Therefore, CAP analysis can become considerably challenging since the classification is based on a second by second evaluation with few data points. For this reason, a novel approach was followed in this work, evaluating consecutive overlapping windows which fed a One-Dimension Convolutional Neural Network (1D-CNN) that can exploit spatially local correlations in the signal by enforcing a local connectivity pattern amongst neurons of adjacent layers. Consequently, the 1D-CNN has the inherent capability to fuse the feature extraction (automatically identifying the distinctive patterns related to the A phases) and the classification processes into a single adaptive learning model. These models are relatively simple to train (compared to the large deep learning classifiers) and have minimal computational complexity, while attaining state-of-the-art performance levels of the complex deep learning models [15]. On the other hand, CAP was found to have temporal dependencies that can be identified by a recurrent neural network [16], such as the Long Short-Term Memory (LSTM). These classifiers were also previously found to be suitable for signal analysis [17]. Therefore, both 1D-CNN and LSTM were examined in this work.
It was reported by Mendonça et al. [18] that the deep learning models have difficulties recognizing the relevant patterns for two of the three subtypes, which compose the A phases, suggesting the need for examining feature-based methods in this work. Specifically, the LSTM was examined since it was identified as a suitable classifier for feature-based analysis with temporal dependencies [19]. The Feed-Forward Neural Network (FFNN) was also tested as it was identified in the state-of-the-art as possibly the best conventional classifier for A phase estimation, working as a benchmark for the other examined classifiers [20].
As a result, two approaches were followed in this work to perform both the A phase and NREM assessment. The first involved the AFC methodology, where the classifier performed the classification by evaluating the EEG signal without having an explicit feature creation. This work aims to assess if a model based on a machine learning classifier is suitable for CAP and sleep quality assessment and to identify if either AFC or feature-based models are the most appropriate to perform the CAP examination.
The main novelties of this article are: • Presentation of a novel algorithm for optimizing the structure of deep learning models (code is publicly available). The optimization of deep learning models' structure is a challenging task. As a result, there is a need for simple algorithms that can allow users to develop new models without requiring a detailed optimization procedure; • Proposal for a fully automatic sleep stability analysis based on CAP, which provides the A phase, CAP cycle, and CAP rate assessments. To the authors' best knowledge, this is the first time a single algorithm provides all these metrics with such high accuracy; • For CAP analysis, the performance of the machine learning models, using features, and deep learning models, with automatic feature extraction, was compared. To the authors' best knowledge, this is the first time this examination was carried out.
This work has the following organization: evaluation of the state-of-the-art in Section 2; presentation of the materials and methods in Section 3; performance assessment of the developed algorithms in Section 4; discussion of the results in Section 5; conclusions of the work in Section 6.

State-of-the-Art
Several works have proposed methods for the A phase detection, where the approach of considering each epoch as either "A" or "not-A" is common, leading to a binary classification problem. A technique to describe the sleep microstructure was proposed by Barcaro et al. [21], computing five band descriptors (one descriptor for each of the EEG characteristic bands), which provides a normalized measure of how much the amplitude in a particular frequency band differs from the background. A tuned threshold was then employed to perform the classification. Largo et al. [22] evaluated the signal's power of five frequency bands (by calculating the fast discrete wavelet transform) and analyzed two moving averages to identify the occurrence of A phases, classified by comparing with a threshold. Niknazar et al. [23] proposed a classification method that performed a similarity analysis between reference windows (from a database) and the windowed signal presented to the algorithm.
Mariani et al. [24] examined the five band descriptors, Hjorth descriptors, and differential variance (of the EEG signal), performing the classification with tuned thresholds. It was observed that differential variance attained the best performance. These features were also evaluated by Mariani et al. [20,25,26], classifying with a Support Vector Machine (SVM) with a Gaussian kernel, an FFNN, and a Linear Discriminant Analysis (LDA), respectively. Other classifiers were tested in the third work, however, LDA attained the highest accuracy. A method based on variable windows was also proposed by Mariani et al. [27], using three discriminant functions (one for each A phase subtype), which were then combined for the final score. Auto-covariance, Shannon entropy, Teager Energy Operator (TEO), and frequency-domain features (chosen by a sequential forward selection method) were evaluated by Mendonça et al. [28], and multiple classifiers were tested. Best results were attained using an FFNN.
A deep learning approach was proposed by Mostafa et al. [29], classifying two-second segments of the EEG signal with a Deeply-Stacked Auto Encoder (DSAE). A similar approach was employed by Mendonça et al. [16], feeding the EEG signal to an LSTM. Hartmann and Baumert [19] have also used an LSTM to perform the classification, fed with entropy-based features, TEO, differential variance, and frequency-based features.
Two approaches were found in the state-of-the-art for the CAP cycle assessment. The first, employed by Mostafa et al. [29], fed the output of the A phase classifier to an FFNN. The second, used by Mendonça et al. [16,28], provided the A phase classification's output to a Finite State Machine (FSM) to apply the CAP cycle scoring rules [4].

Materials and Methods
AFC and feature-based approaches were developed for the CAP analysis. This was accomplished using the methodology presented in Figure 1 (developed in Python 3 using TensorFlow). The proposed algorithm for the AFC methods is composed of seven steps, starting by pre-processing the input signal, which was then segmented to create either the overlapping windows (for the 1D-CNN) or the time steps data (for LSTM). These were then fed to the classification procedures composed of two parallel classifiers. Each one-second epoch was classified as either "A" or "not-A" by one classifier, and as "NREM" or "not-NREM" by the other classifier. Afterward, a correction procedure was employed, in the post-processing step, to reduce the misclassifications by correcting the isolated "A" or "not-A" classifications and reclassifying the "A" as "not-A" when the NREM classifier indicates a "not-NREM" epoch. The estimation of a sleep quality metric (CAP rate) was performed in the final step. A similar approach was employed for the feature-based methods. However, a new step was included, for the feature creation, between the pre-processing and the data segmentation.

Studied Population
Recordings from nine females and ten males, fifteen free of neurological disorders and four with sleep-disordered breathing, were selected from the Physionet CAP Sleep Database [4,30]. The recordings were performed at the Sleep Disorders Center of the Ospedale Maggiore of Parma. The evaluation was implemented with the EEG monopolar derivation (C3-A2 or C4-A1) signals, which were considered essential for CAP scoring [4]. The relevant characteristics of the population are presented in Table 1. The annotations regarding the sleep macrostructure and the occurrence of the A phases were provided by expert neurologists. The CAP cycles were identified by applying the scoring rules defined by Terzano et al. [4] to the annotated A phases. The total number of examined epochs (each with one second of EEG data) was 592,641.

Pre-Processing Resampling Procedure
The sampling frequency of the records ranges from 100 Hz to 512 Hz. Hence, a resampling procedure was applied to all signals to create a uniform database. Specifically, the records were resampled by decimation [31] at the lowest sampling frequency; therefore, 59,264,100 sample points were examined. A constant reduction factor was used for the sampling rate, r, and a standard lowpass filter (Chebyshev type I filter with order eight, a passband ripple of 0.05 dB, and normalized cutoff frequency of 0.8/r [32]) was used to down-sample the signal and avoid aliasing. This filter was selected as it is recommended by the standard Python, MATLAB, and R libraries to perform decimation, and also as the Chebyshev type I filters have small transition bands and roll off fast (good properties for decimation). Afterward, the resampling procedure selected each rth point from the filtered signal to produce the resampled signal, which was then standardized (subtract the mean and divide the result by the standard deviation) to reduce the effect of systematic signal variations [33]).
Several studies recommended the removal of artifacts related to movements and the cardiac field to improve the classifier's performance [13,34]. However, some events can be labeled as an artifact and yet be related to the occurrence of an A phase intended to be detected. The proper removal of eye movement and cardiac field artifacts requires both electrocardiogram and electrooculogram signals, making the algorithm more complex and less suitable for hardware implementation. For these reasons, no artifact removal procedure was employed.

Pre-Processing Segmentation Procedure
A segmentation process was employed to create the epochs. Each epoch corresponds to one label of the dataset, which defines the second (epoch's duration) as either "A" or "not-A" for the A phase classification, and as either "NREM" or "not-NREM" (not-NREM includes REM and wake periods) for the NREM classification. However, each epoch contains 100 sample points (signal resampled at 100 Hz), which may not be enough for the 1D-CNN classifiers to find the relevant patterns. Therefore, overlapping windows were examined for this classifier to evaluate if additional information can improve the classification's performance. Three approaches were tested for the overlapping, considering either the first, the central, or the last 100 samples of the window as the ones corresponding to the epoch's label. Therefore, the first scenario overlaps on the right, the second on the right and left, and the third on the left.
For the LSTM classifier, the time steps concept was employed where the features fed each LSTM block, and were either from an epoch of the pre-processed input signal or from the features created in the feature creation procedure. The number of time steps and the number of hidden units that produced the LSTM layer's outputs are parameters that require tuning.

Feature Creation
A feature creation procedure was used for the feature-based methods, and three categories of features were examined. The first category was composed of features produced from symbolic dynamics, performing segmentation analysis, and one amplitude variation metric. These features identify the abrupt variations in the signal's amplitude that occur during the A phases. The symbolic dynamics transformed the input signal into a sequence of symbols by examining several thresholds for the signal's amplitude, which are multiples of the signal's standard deviation, σ. A total of nine thresholds were used since it was previously identified as a suitable number for the A phase examination [18]. Thus, for each sample point of the epoch (which is composed of 100 sample points), the algorithm evaluated if the point's amplitude is lower than either −5 × σ, −4 × σ, −3 × σ, −2 × σ, −σ, 2 × σ, 3 × σ, 4 × σ, emitting the symbol 1, 2, 3, 4, 5, 6, 7, 8, 9, respectively. The number of each emitted symbol (for each input window) was then considered as the value for the feature. As an example, if symbol 1 was emitted 10 times, symbol 2 was emitted 15 times, symbol 3 was emitted 5 times, symbol 4 was emitted 10 times, symbol 5 was emitted 50, and symbol 6 was emitted 10 times, then the value for the features A 1 to A 9 are 10, 15, 5, 10, 50, 10, 0, 0, 0, respectively.
An amplitude variation metric was also examined by calculating the variation in the current epoch's, E, maximum amplitude with respect to the previous two epoch's maximum amplitudes by where max is the operation that searches for the maximum value. This feature was used since it can possibly designate the onset epoch of an A phase as, by definition, the amplitude of the epoch must be 2/3 higher than the previous two epochs [4]. The second category of features examined the Power Spectral Density (PSD) of the five characteristic EEG frequency bands, specifically, Delta (PSD D ), Theta (PSD T ), Alpha (PSD A ), Sigma (PSD S ), and Beta (PSD B ). These features were employed as they were previously identified as relevant for A phase analysis as the A phases are composed of characteristic frequency patterns on these bands [28]. The PSD was calculated using the Welch's method with the Hanning window, H, and an overlap, ϕ, of 50% for a given frequency, ζ, by [35].
where P is the number of points in the evaluated segment, M is the examined segment's length, floor is the floor function, and x is the input signal. The last category of features combined the concepts from the two previous categories by calculating the ratio of the maximum (max) amplitude's value of the epoch to the assessed PSD of the epoch for each evaluated EEG frequency band by denoting as APSD D for the delta band, APSD T for the theta band, APSD A for the alpha band, APSD S for the sigma band, and APSD B for the beta band. These ratio based features were considered since they combined the information of both time and frequency, which are relevant for the A phase assessment as the activation phases are composed of phasic and transient activities [4]. The relevance of the features for the A phase classification was assessed by the Minimal-Redundancy-Maximal-Relevance (mRMR) algorithm, which is a classifier-independent method [36]. This algorithm assessed the maximal statistical dependency criterion considering the mutual information θ, which for two discrete variables, I and J, is defined as The maximum dependency on the target class, ρ, was assessed individually by evaluating the dependence of the selected features ψ τ (for τ = 1, 2, . . . , L) through [36].
The evaluation of the minimum (min) redundancy lessens the issue of large dependency among the selected features and was performed by The algorithm ranked the features by simultaneously estimating D and R through the operation.
The features were ordered by the mRMR ranking from most to less relevant. The optimal number of features was identified by testing the 20 possible feature sets, where the first was composed of only the feature identified as the most relevant, the second by the two features identified as the most relevant, and so on, up to the last set, which was composed of all features. The features that composed the set that attained the highest performance for the considered reference performance metric were selected for the performance examination.

Classification
Three machine learning classifiers were tested to perform the A phase and NREM detection. The FFNN is a conventional shallow neural network composed of one input layer, one hidden layer, and one output layer. Each neuron of the network applies an activation function, Γ, that considers the bias, B, the number of connections, C, and their weight, W, through [37].
The hyperbolic tangent function was selected to be the activation function, defined as [37].
The soft-max function was used as the activation function of the output layer to provide a probabilistic score according to the probability distribution χ for the input g over the Q possible results through [37].
For the CNN classifier, the model with one dimension was selected since it can identify relevant patterns from challenging one-dimensional biomedical signals, using a small number of neurons and hidden layers [16,[38][39][40]. The small networks are easier to train and implement, requiring less computational resources to develop the algorithm [14].
The 1D-CNN was composed of a sequence of three main groups of layers. First was the input layer, followed by groups of convolution and pooling layers, and classification layers formed the last group. The transformation of the inputs was performed by convolution operations, , on the convolution layers by [37].
where n is the number of kernels (K), and X are the inputs. These layers allowed the recognition of the most relevant patterns present on the physiologically driven signal for the desired classification. The Rectified Linear Unit (ReLU) was employed as the activation function that supports these layers' complex pattern learning since it can provide a good classification performance while diminishing the vanishing gradient problem [37]. The ReLU is defined as [37].
The data dimensionality was reduced by employing a subsampling layer after the convolution layer. For this purpose, a max-pooling operation was used, mapping a subregion to its maximum value. This layer regulates the networks' complexity and reduces overfitting, which improves the generalization capability [37].
Fully connected (dense) layers were used at the end of the network to improve the learning ability of the nonlinear parameter and perform the classification [37]. Specifically, two dense layers were employed. The first was located between the last subsampling layer and the output layer to map the data (using the ReLU as the activation function). The output layer applied the soft-max function (providing a probabilistic score for each class).
All memory cells of the LSTM are controlled by three gates at each time step z. For the input signal x z , the input (i) and output (o) gates control the flow of activations through [41].
where the sigmoid function was used as activation function, defined as [37].
Ω are the recurrence weights, and h is the hidden state given by using the hyperbolic tangent function as the activation function, and s is the cell state, defined as where activation function was the same as Equation (16) and f is the forget gate given by Both s and f used the sigmoid function as the activation function. A dense layer was employed as the output layer, applying the soft-max function. The output class of all classifiers was given by the highest score through a max operation.

Post-Processing Procedure and CAP Assessment
A correction procedure was applied in the post-processing to correct misclassifications and was composed of two stages. Considering the shortest possible A phase lasts two seconds, and that binary classification provided an output for every second, thus, an output class bounded by two opposite classes (isolated classification) was treated as an error. Hence, in the first stage, a succession of 101 was corrected to 111 and 010 to 000. The NREM classification was then used in the second stage. Taking into consideration that CAP is only defined in the NREM sleep, consequently, if the A phase classifier referenced an epoch as "A" when the NREM classifier indicated as "not-NREM", then the "A" was reclassified as "not-A".
The correction procedure's outputs fed an FSM, which implements the CAP scoring rules [4] to assess the CAP cycles. CAP rate was the estimated sleep quality metric and it was calculated by dividing the total number of epochs classified as CAP (output of the FSM) by the total number of epochs classified as NREM.

Performance Assessment and Optimization of the Classifiers
The performance of the developed algorithms was measured by considering the Accuracy (Acc), Sensitivity (Sen), and Specificity (Spe), defined as [42].
where t p , t n , f p , and f n are the true positives (for the A phase assessment, it reflects the number of epochs related to an activation phase correctly identified, while for the NREM classification, it indicates the number of epochs related to the NREM periods correctly recognized), true negatives (for the A phase classification it indicates the number of epochs related to the "not-A" class correctly recognized, while for the NREM assessment it indicates the number of epochs related to the "not-NREM" class correctly identified), false positives, and false negatives, respectively. The diagnostic aptitude of the classifiers was assessed by the Area Under the receiver operating characteristic Curve (AUC) [43]. The Significance of the results was determined according to the Wilcoxon rank sum test (left-tailed), displaying the p-value when comparing the results against the FFNN (standard model used as a benchmark), evaluating how significant performance improvements are. The statistical analysis was performed considering a significance level of 0.05. The FSM performed the CAP cycles classification, hence, no probabilistic output was created, and the AUC was not computed. However, the CAP rate error and the CAP rate percentage error were assessed as predictive metrics of the overall capability of the model to estimate the CAP rate, and these metrics were calculated by CAP rate percentage error = abs(CAP rate error) where CAP P is the CAP rate predicted by the developed method, CAP a is CAP rate assessed by the database labels, and abs is the absolute value function.
The classifiers' hyper-parameters optimization was empirically performed by a search methodology, selecting the configuration which attained the highest AUC (considered reference performance metric). Random Sub-sampling Validation (RSV) was employed for the optimization procedure, randomly choosing ten subjects to compose the training set and nine for the validation set, ensuring subject independence of the sets. Each validation procedure was repeated ten times to achieve statistically significant results. Error optimization for all classifiers was performed by the Adam algorithm [44] (learning rate of 0.001 and batch size of 1024) to allow a fair comparison of the results. An early stopping procedure was used to reduce the simulation time and avoid overfitting the classifier. The training procedure was stopped (before the end of the maximum number of training cycles, defined as 50) if no relevant improvement in the AUC (improvement lower than 1%) of the validation set was reached within five consecutive epochs.
A complete grid search optimization approach for all hyper-parameters of the classifiers is not computationally feasible. Therefore, only the most relevant parameters were tuned for each classifier. For the FFNN optimization, the number of neurons employed for the hidden layer was varied from 100 to 400, in steps of 100. On the other hand, the Heuristic Oriented Search Algorithm (HOSA) employed in this work follows the concepts presented by Mendonça et al. [40] and Mostafa et al. [45], for the LSTM or 1D-CNN optimization to assess the most relevant architecture for the classifiers by considering a heuristic search for the parameters considered to be the most relevant for the examined models.
Yamashita et al. [46] identified the most important hyper-parameters to tune a 1D-CNN, where the dominant parameters are the kernel size, number of kernels, and the number of layers [47]. Hence, the performed search concentrated on these hyper-parameters. Considering that the kernel size will define the extent of the features that will be identified and that each sample point of the segmented windows has relevant information, thus, a kernel size of two with a unitary stride was chosen. The optimal number of kernels was identified by starting with a value of eight, which was successively increased by a factor of two without changing the remaining parameters [48].
The overlapping duration, O, of the segmented windows was iteratively changed (testing the three scenarios of overlapping, A p , where the database label corresponds to either the first, central, or last second of data from the overlapping window W) for each tested combination of the relevant hyper-parameters. The algorithm started without overlapping, and the duration of overlapping was increased in steps of four seconds up to a maximum window, O max , of 35 s (the upper limit was empirically found to be above the saturation point for the best A phase AUC). The searching procedure was improved by using the group of layers concept (GofLayer) [40], where each group was composed of one convolution layer, followed by one subsampling layer, and a 10% dropout was applied at the output of the group. A downsample of factor two was applied in the subsampling layers with the chosen stride and filter size of two. These values are frequently used for 1D-CNN as they can reduce the dimensionality of the data while maintaining the highest excitations from the convolutional feature maps [47]. The employed algorithm for optimization is presented in Table 2 and starts with a network composed of: one input layer (I pt ) where the input data (named Data) was fed; one group of layers; two dense output layers (D e ). The number of GofLayer, G, was iteratively incremented until the maximum value G max (chosen to be four). The number of kernels K of the convolution layer, for the first GofLayer, was 16, and the maximum limit was 128, using a step 2 M where M start ≤ M ≤ M max (M start and M max were four and seven, respectively). The subsequent GofLayer were introduced, with either the same or twice (increment of the multiplier, MUL max , of two) the number of kernels of the previous group of layers (leading to linear growth in the number of simulations). The value for the number of neurons of the first dense layer (D e ), N, started at 50 (N start ), and was incremented in steps of 50 (N step ) until the maximum value of 150 (N max ) was reached. This recurrent process occurred until no relevant improvement in the AUC (considering the minimum threshold, t r , increase of 1%) was attained, signifying that the best network, Net, was found. For the LSTM-based classifier, the input layer, I p , was followed by either a Bidirectional LSTM (BLSTM) or an LSTM layer. The subsequent evaluated layers were chosen to be equal to the first recurrent layer, and the last recurrent layer of the tested architecture could be followed by a dense layer. The number of recurrent layers, Gr, was increased one by one until reaching five, the chosen maximum number (Gr max ), or was stopped earlier if no significant improvement in the AUC was attained (examining a minimum threshold increase of 1%) when comparing with the model with Gr-1 layers. The number of time steps, T, employed by the recurrent layers was varied from five (T start ) to 35 (T max ) in steps of ten (T step ).
The number of hidden units, Nh, used for the recurrent layers (the same Nh was used for all recurrent layers for the models with a cascade of LSTM layers) was varied from 100 (Nhstart) to 400 (Nh max ) in steps of 100 (Nh step ). The D e weights were initialized with a normal distribution and the number of hidden units was chosen to be either half (applying the floor function to round the result), the same, or twice the number of hidden units that were employed by the previous recurrent layer. It was empirically observed that the use of more than 35 time steps or more than 400 hidden units did not lead to a significant increase in performance. A 10% dropout was employed between the last recurrent layer and the first dense layer to reduce the possibility of overfitting. Table 2 presents the algorithm's pseudo-code.
The NREM classifier was only tuned for the best overlapping scenario (for the 1D-CNN) or best number of time steps (for the LSTM) identified for the A phase classifier since the NREM classifier was fed with the same input as the A phase classification, and is only intended to be used in the correction procedure and for the sleep quality metric estimation. No balancing operation (oversampling the minority class or undersampling the majority class so that all classes have the same number of samples) was implemented in any training or testing dataset since it can alter the expected distribution of the data. Conversely, it was observed that the classifier's performance could be significantly improved by using costsensitive learning (applying a greater cost when misclassifying an element of the minority class compared to an event from the majority class). This observation is particularly relevant for the A phase classification since it is sturdily unbalanced (significantly more "not-A" than "A" events). Hence, this approach was used to develop the classifiers [49].
The performance of the algorithms (whose classifier's hyper-parameters were previously selected) was evaluated by the Leave One Out (LOO) method as it can provide less biased results for classifiers with few samples [50]. A total of 19 evaluation cycles were performed, each repeated 50 times to attain statistically significant results, considering the average of the performance metrics of the repetitions as the result of the evaluation cycle. For each cycle, the testing dataset was composed of data from one subject (each subject was only once selected to create the testing dataset). The data from the remaining subjects were used to compose the training dataset, hence ensuring subject independent results.

Experimental Evaluation
Three main examination steps were performed for the experimental procedure. The first and second comprised the development of the AFC and feature-based classifiers, respectively, while the third evaluated the performance of the models for the A phase, NREM, CAP cycle, and sleep quality metric estimations.

Development of the AFC Classifiers
For the 1D-CNN, the number of examined combinations was 2136. Each network was simulated ten times. Thus, the total number of examined classifiers (using RSV) was 21,360. The simulation time required for optimization was significantly reduced by using HOSA when compared with an extensive grid search, which would have required testing all possible combinations of parameters for each classifier while attaining a classifier with good performance. An extensive grid search analysis (exhaustive search) would have resulted in an unreasonable number of simulations as the total number of possible combi-nations can easily lead to millions of network structures being tested, which would not be computationally viable, and most likely would not considerably improve the performance compared to the attained classifier (using the proposed methodology). The developed algorithm can optimize a network with the size of the 1D-CNN employed in this work, in one to two weeks, depending on the complexity and the number of parameters to be tested. These results are considerably fast, even when compared to other heuristic-based optimization algorithms, such as genetic algorithms, that can require multiple months to finish the simulations (for a network of similar size to the 1D-CNN examined in this work) [14].
The optimal 1D-CNN structure for the A phase classification (identified by the HOSA) was composed of 64 kernels in the first convolution layer, 128 kernels in the second convolution layer, and 100 weights in the first dense layer. For the NREM classification, the classifier was composed of 32 kernels in the first convolution layer and 64 kernels in the second convolution layer, using the same number of weights in the first dense layer. Therefore, it was concluded that the best performance was attained using two groups of layers (GofLayer). A similar result was previously reported by Mostafa et al. [45], where it was observed that the use of two clustered layers (composed of one convolution layer, followed by batch normalization and a pooling layer) led to the highest improvement in the considered performance metric.
It was observed that the second scenario for overlapping (epoch's label refers to the central 100 sample points) attained the best AUC, which was considerably better than the other two scenarios, with the optimal window length of 19 s. This result is likely linked to the average A phase duration, found to be around 13 s [4]; hence, extending the window length too much can possibly introduce excessive information from the background activity, leading to misclassifications.
These results suggest that there is a strong temporal dependency for the A phases as introducing more information to the classifier significantly improved the classification capability. The low performance of the first scenario was associated with misclassifications of the onset boundary. Such a scenario occurred when the current epoch (data points related to the label) was "not-A" and the following epochs were "A" (and the sampling points of these epochs were present on the segmented window), leading the classifier to classify the current "not-A" as "A". This effect was lessened in the third scenario even though the converse effect occurred, related to the A phase offset boundary detection when the current epoch under classification was "A", and the sampling points associated with a subsequent "not-A" epochs were present in the segmented window. Consequently, it led the classifier to wrongly classify the current epoch as "not-A".
Both onset and offset misclassification issues occurred in the second scenario. However, these were diminished as the classifier has contextual information from the previous and next epochs. It was also observed, for all scenarios, that the proper detection of the offset boundary was challenging, occurring several misclassifications towards the end of the longer A phases where the classifier oscillated between "A" and "not-A". This effect was previously reported by Terzano et al. [4], indicating that the A phases can display ambiguous limits due to inconsistent voltage changes in the EEG signal. Nonetheless, postprocessing lessened this problem (if two consecutive A phases are separated by an interval shorter than two seconds, then they should be combined in a single A phase). However, these oscillations were still the most notable reason for the misclassifications. It was also observed that increasing the window length beyond 31 s (having 30 s of overlapping) was counterproductive as further information led to misclassifications.
For the LSTM-based classifier, it was noticed that the best structure found by the HOSA was composed of an LSTM layer with 100 hidden units using 25 time steps, followed by a dense layer with 50 hidden units. The cascade LSTM architecture led to a lower AUC, and the use of BLSTM instead of LSTM in the recurrent layer had an AUC increase of less than 1%. Therefore, the LSTM was preferred rather than BLSTM since it attained a better complexity to performance ratio. A total of 256 network architectures were examined, and each test was repeated ten times using RSV. Therefore, the total number of evaluated classifiers was 2560. It was observed that the proper offset detection was again the primary source of misclassifications, although the increase in the number of time steps allowed the model to lessen this problem. However, the use of more than 25 time steps led to a lower AUC, possibly suggesting that the model could not extract more relevant information from the input data and started to overfit. The best network's architecture for the NREM classification using the 25 time steps was composed of one LSTM layer followed by one dense layer, with 300 and 150 hidden units, respectively. It was observed that the best performance was reached when using only one recurrent layer and these results agree with the findings reported by Yadav et al. [51], which have observed that a model with one LSTM layer outperformed models with cascade recurrent layers.
The learning curves of the classifiers are presented in Figure 2. It was observed that both classifiers could possibly improve the performance if more data were available in the database, and LSTM would perhaps benefit more from the additional data (the slope of the LSTM linear tendency line is higher than the 1D-CNN linear tendency line). On the other hand, the performance of both classifiers is similar when 100% of the data was used for the model's development, thus, the performed comparative analysis regarding which classifier is more suitable for the intended classification is fairer.

Development of the Feature-Based Classifiers
The relevance of the features for the A phase classification was assessed by the mRMR algorithm (each simulation of the presented results was repeated 50 times to attain statically significant results), and the ordered sequence (from most to less relevant) was: PSD D ; A; PSD S ; PSD T ; APSD B ; PSD A ; APSD D ; APSD S ; APSD A ; PSD B ; APSD T ; A 3 ; A 7 ; A 2 ; A 9 ; A 8 ; A 4 ; A 1 ; A 6 ; A 5 . The PSD D and A features were expected to be the most relevant since 61% of the database's A phases belong to the A1 subtype that is characterized by highvoltage slow waves, where delta waves are the most prevalent. On the other hand, the A2 subtypes compose 21% of the database labels and have a mixture of high-voltage slow waves with low-amplitude fast rhythms, whereas the A3 subtypes have a predominance of low-amplitude fast rhythms [4]. Therefore, it was anticipated that the frequency-based features would be more relevant for the A phase assessment. However, the amplitude-based features are still important to detect the high-voltage waves.
For the FFNN optimization, each tested value for the number of hidden units was examined for all 20 feature sets ordered by the mRMR algorithm. It was observed that the best performance was attained using the 14 most relevant features with 400 hidden units for both A phase and NREM classifications (using RSV for the performance assessment, repeating each simulation ten times). The structure of the LSTM-based classifiers previously identified as the best for the A phase or NREM classification was employed for the feature-based classification to allow a fairer comparison of the results, and the best performance was attained using the 12 most relevant features.
The learning curves are depicted in Figure 3. Similar to the AFC models, the inclusion of more data could possibly improve the performance of the classifiers. However, the variation in performance of the LSTM is likely to be significantly lower for the featurebased methods. It was also observed that the LSTM has a significantly higher AUC than the FFNN, suggesting that the performance of the LSTM-based classifier is expected to be superior.

Performance Evaluation
The performance of the tuned classifiers was assessed using LOO, repeating each simulation 50 times to provide reliable estimates of the models' performance, being the results presented in Table 3 (the Appendix A tables present the results for all subjects, where subjects 1-15 are free of neurological disorders, while subjects 16-19 were diagnosed with sleep-disordered breathing). Regarding the A phase estimation, by examining the table's results, it is possible to conclude that the FFNN-based classifier attained the lowest Acc, Spe, and AUC while the feature-based LSTM reached the best performance for all performance metrics, having significant improvements when comparing against the FFNN in eight of the eleven studied metrics. On the other hand, the AFC LSTM attained the most unbalanced results (largest difference between Sen and Spe), suggesting that the AFC classifier could not find patterns in the data that are as relevant as the ones present in the used features. The AFC classifier based on the 1D-CNN surpassed the AFC classifier based on the LSTM for the A phase assessment. However, the opposite occurred in the NREM classification, where the AFC classifier based on the LSTM performed better. The FFNN was the worst classifier for the NREM assessment, while the feature-based LSTM was the best. For the CAP assessment, it was observed that the model which used the AFC classifier based on LSTM attained a better Acc and Sen than the classification based on the 1D-CNN, which reached the highest Spe of all models. The FFNN-based model had the lowest Acc and Sen.
It was observed that the lowest CAP rate percentage error was attained by the model based on the AFC LSTM, while the FFNN-based model had the worst performance. On the other hand, the 1D-CNN and the model with the LSTM fed with features reached a similar average value, although the 1D-CNN results have a larger variation. Figures 4 and 5 depict the normalized CAP rate error and the CAP rate percentage error boxplots, respectively.  It is possible to forecast the sleep quality by knowing the subject's predicted CAP rate, considering that a higher CAP rate most likely designates a poor sleep quality. In contrast, the reverse probably means good sleep quality [8]. If the subject's age is known, then sleep quality guess can conceivably be performed by comparing the predicted CAP rate against what is the average CAP rate for the subject's age, considering that a higher value denotes poor sleep quality and a lower value designates good sleep quality. By following this simplistic approximation, the accuracy of the sleep quality prediction (by comparing with the estimate based on the CAP rate from the dataset) for the 1D-CNN, AFC LSTM, FNN, and feature-based LSTM was 74%, 79%, 68%, and 90%, respectively.

Discussion
By evaluating the attained results, it is possible to conclude that the use of features leads to the best performance. However, if more data were available, probably, the AFC classifiers would significantly improve the results (as it is visible in Figure 2). A similar conclusion can be attained for the CAP cycle's assessment. The achieved results are emphasized by the difficulties associated with CAP analysis, as reported by Mendez et al. [12], which predicted that the CAP phase assessment could be affected by up to 25% of subjectivity and ambiguity. Another relevant factor is the specialist agreement for CAP analysis, examining the same EEG signals, which ranges from 69% to 78% (getting closer to the lower bound as the number of specialists involved in the analysis increases) [10,11]. Hence the performance of the proposed algorithms is either in the agreement range or slightly superior to the upper bound, advocating the viability of the algorithms for clinical applications.
The Acc of the CAP cycles classification was lower than the A phase classification, and it was verified that this was due to three factors. The first was the misclassification around the A phase's offset boundary (oscillation between "A" and "not-A" at the end of the longer A phases), which led the FSM to either overestimate or underestimate the CAP cycles. The second factor was the occurrence of several A phase misclassifications during long "not-A" periods. These mainly occurred during periods of significant variation in the EEG signal, usually lasting more than three seconds and are separated by less than 60 s, leading the FSM to classify these events as a CAP cycle. This second problem was sturdier in the AFC classifiers, possibly suggesting the low Sen. The last factor was the high impact that the NREM classification had in the CAP assessment, where the lower performance created several f p , and f n , which led the FSM to either overestimate or underestimate the CAP cycle duration and also affected the CAP rate estimation. It was also observed that the subjects suffering from sleep-disordered breathing were the most challenging to be assessed, conceivably due to the low number of subjects present in the database (when compared with the number of subjects free of neurological disorders) and due to the dynamics of the EEG signal, which are likely to be different for these subjects (with possible variation in the prevalence of each A phase subtype).
A summary of the results reported by the state-of-the-art, which had performed binary A phase classification is presented in Table 4. Most of the works, which attained a similar accuracy to the proposed work, examined a significantly smaller population for the development of the models. Specifically, Largo et al. [22] tested 12 subjects, considering one hour of data for each subject, Niknazar et al. [23] examined six subjects, Mariani et al. [20,25] evaluated four subjects, and Mariani et al. [26] studied eight subjects. Hartmann and Baumert [19] examined 15 subjects and reached a similar performance as the best model examined in this work. Mariani et al. [27] attained a higher Acc while using a similar population, but with a significantly lower Sen. However, both subjects free of neurological disorders and subjects suffering from sleep-disordered breathing were considered in this work, while the other state-of-the-art results with similar performance have only considered subjects free of neurological disorders. It is also relevant to notice that CAP analysis is characterized by a strong unbalance between the number of "A" and "not-A" events (approximately 90% of the database annotations refer to "not-A" events) [18]. Hence a variation in the Spe has a greater impact in the Acc than a variation in the Sen. As a result, a model with a high Spe and low Sen will have a high Acc. This effect can be understood by examining the average metric, which was around 81% for all the best performance works, suggesting that the focus should be on attaining balance results to improve the clinical applicability. Furthermore, even though the traditional methods based on thresholds can achieve a considerable performance with low complexity algorithms, the studies that have examined these methods usually consider a low number of subjects and frequently evaluate only a part of the full-nigh EEG signal. These methods will likely be problematic to generalize to a broader population since the thresholds need to be tuned for the examined population.
A comparative analysis was not implemented for the developed NREM classifiers since no other work was found performing a second by second NREM assessment (the standard defined by the AASM is to use an epoch of 30 s). Nonetheless, the accuracy reported by state-of-the-art works for NREM classification, considering a 30 s epoch, ranges from 72% to 98%, depending on the number of classes considered [52]. Hence the developed work is within the range while using a challenging approach of classifying every second.
A total of three works were found in the state-of-the-art performing the CAP cycles assessment. Mostafa et al. [29] applied an FFNN for the classification, reporting an Acc of 62%, while Mendonça et al. [16,28] employed an FSM and reported an Acc of 79%, using a feature-based method for the A phase classification, and 76%, when an LSTM classified the A phase. By comparing the results attained in this work, it was concluded that Mendonça et al. [28] reached the same Acc as the developed feature-based LSTM method, while the other works reported a lower performance. However, it is important to bear in mind the higher number of subjects examined in this work. When comparing the AFC-based classifiers, the developed method based on LSTM reached a higher performance for the CAP cycle assessment than Mendonça et al. [16].
By examining the normalized CAP rate error presented in Figure 4 it is possible to conclude that subject 17 (subject with sleep-disordered breathing) has the larger normalized error for the models based on 1D-CNN, AFC LSTM, and FFNN, possibly due to the low CAP accuracy of the models for this subject. For the model based on LSTM fed with features, subject 8 was the most challenging, leading to the higher CAP rate error, possibly due to the low A phase accuracy, which led the FSM to overestimate the CAP cycles duration. By examining the CAP rate percentage error, it was observed that the model based on the AFC LSTM has the best results (lowest average value), followed by the model based on LSTM fed with features. The FFNN-based model has the highest average value. By inspecting the boxplots of the CAP rate percentage error presented in Figure 5 it is notorious that the model based on the AFC LSTM has the lowest variation in the results, suggesting that this model is the most suitable for the CAP rate examination. These results are likely to be related to the performance for the CAP assessment since the model based on the AFC LSTM has the most balanced results, with an accuracy that is similar to the best results attained by the model based on LSTM fed with features.
Only the work reported by Mariani et al. [27] was found in the state-of-the-art performing the CAP rate appraisal. The reported CAP rate percentage error was 17%. The same value was attained by using the model based on the AFC LSTM. Nevertheless, Mariani et al. [27] evaluated only subjects free of neurological disorders, while in this work, subjects diagnosed with sleep-disordered breathing were also examined. Thus, there is a larger variation in the dataset's CAP rate (most sleep-disordered breathing subjects have a higher CAP rate).

Conclusions
Two approaches for automatic CAP analysis were developed, estimating the occurrence of the A phases, the CAP cycles, and the CAP rate. The first was based on AFC, where the classifiers automatically identify the relevant patterns from the input data, while the second comprised the use of features created by a feature creation procedure that extract relevant information from the input data to feed the classifiers. It was observed that the feature-based LSTM attained the best performance, although the results for the A phase assessment reached by the 1D-CNN were similar. The performance for the CAP cycle assessment achieved by the feature-based LSTM and the AFC LSTM was similar. These results suggest that the low Sen of the AFC LSTM for the A phase estimation (related to the overestimation and underestimation of the A phase duration) has not affected the CAP cycle assessment. It is also likely that the inclusion of more data could improve the AFC models' performance, possibly surpassing the feature-based LSTM results.
The proposed methods perform the analysis by evaluating the signal from only one EEG monopolar derivation without requiring any manual manipulation of the signal or the removal of artifacts. It was observed that the A phase classification performance was similar to the best state-of-the-art algorithms. A second by second based NREM classification was also proposed, which was used in the correction procedure and for the sleep quality metric estimation. The CAP rate error was found to be low, supporting the diagnostic capability of the algorithms for sleep quality estimation. It is important to highlight that the attained results are considerably good when considering the challenges of the bioengineering fields, as the results have even surpassed the specialist agreement when analyzing the same EEG signals, advocating the relevance of the work.
The next steps in this research are to further validate the developed algorithm in a larger dataset and examine the A phase subtypes to reach a deeper understanding of the CAP events, which can lead to a reduction in misclassifications.

Conflicts of Interest:
The authors declare no conflict of interest.