Multiple Time Series Fusion Based on LSTM: An Application to CAP A Phase Classification Using EEG

The Cyclic Alternating Pattern (CAP) is a periodic activity detected in the electroencephalogram (EEG) signals. This pattern was identified as a marker of unstable sleep with several possible clinical applications; however, there is a need to develop automatic methodologies to facilitate real-world applications based on CAP assessment. Therefore, a deep learning-based EEG channels’ feature level fusion was proposed in this work and employed for the CAP A phase classification. Two optimization algorithms optimized the channel selection, fusion, and classification procedures. The developed methodologies were evaluated by fusing the information from multiple EEG channels for patients with nocturnal frontal lobe epilepsy and patients without neurological disorders. Results showed that both optimization algorithms selected a comparable structure with similar feature level fusion, consisting of three electroencephalogram channels (Fp2–F4, C4–A1, F4–C4), which is in line with the CAP protocol to ensure multiple channels’ arousals for CAP detection. Moreover, the two optimized models reached an area under the receiver operating characteristic curve of 0.82, with average accuracy ranging from 77% to 79%, a result in the upper range of the specialist agreement and best state-of-the-art works, despite a challenging dataset. The proposed methodology also has the advantage of providing a fully automatic analysis without requiring any manual procedure. Ultimately, the models were revealed to be noise-resistant and resilient to multiple channel loss, being thus suitable for real-world application.


Introduction
Information fusion technologies enable the combination of information from multiple sources in order to unify and process data.These technologies can thus transform the information from different sources into a representation that provides effective support for automatic analysis [1].In essence, there are two fundamental methods to process the information from multiple sources.The first, known as centralized fusion, employs a fusion center to receive and process all the information from the different sources, while in the second, known as distributed fusion, each source provides a local estimation from its measured data to the fusion node which then performs the fusion.The first method can attain optimal performance.However, the second has a higher robustness, a relevant characteristic especially when biomedical sensors, such as electroencephalogram (EEG), are used since these can be easily contaminated with noise or can lose contact [2].
Information fusion was applied with success in numerous fields [3], among these, body sensors' analysis attained significant developments with revolutionary applications in health-care and fitness examination [4].The fusion of information from multiple sources allows the reduction of noise effects, improves the robustness against interference, and reduces ambiguity and uncertainty, seeing that the use of an individual source of information is often not sufficient to provide a reliable examination.
The hierarchy of information fusion can be divided in three main levels.First is the data level fusion techniques, such as Kalman filter and averaging methods, operating at the lowest level of abstraction to combine raw data from multiple sources [5].The second performs the fusion at the feature level, where feature sets extracted from different data sources are combined to create a new feature vector.The last one is carried out at the decision level and deals with the selection (or creation) of an hypothesis from the set of hypotheses, and is usually performed by fuzzy logic, Bayesian inference, classical inference, or heuristic-based schemes (such as majority voting) [4].The data level and feature level fusion are generally done before classification or any hypothesis selection or creation about the data.Afterwards, the decision level fusion is done.
Cyclic Alternating Pattern (CAP) is characterized by sequences of transient electrocortical events in the brain, divergent from the background activity.The CAP concept can be used to examine the sleep microstructure during the non-rapid eye movement sleep and is composed of an initial phase of brain activation, named A phase, followed by a period of return to the background activity, denoted B phase.Both phases must have a duration between two and 60 seconds to be considered valid.Two or more successive CAP cycles define a CAP sequence [6] [7] [8].
The activity in the brain is identified using different signals such as the ones coming from the EEG channels [5].In this view, A phase classification is a suitable problem for fusion based approaches.Therefore, it was hypothesized in this work that the fusion of multiple EEG channels could provide more relevant information for the automatic A phase classification when compared to single-channel models.In other words, the main goal of this work is to develop an automatic classifier for the A phase assessment based on the signals from multiple EEG channels.
CAP has shown to be related to formation, consolidation, and disruption of the sleep macrostructure, working as a measure of the brain's effort to maintain sleep [8] [9] [10].
It was also acknowledged as an EEG marker of sleep instability and a temporal relationship between the CAP, behavioral activities, and autonomic functions was observed [10].Hence, the CAP was found to be linked with the incidence of several sleep disorders including insomnia [11], Nocturnal Frontal Lobe Epilepsy (NFLE) [12], sleep apnea [13], periodic limb movements [14], and idiopathic generalized epilepsy [15].Therefore, the employment of CAP analysis by the sleep centers can lead to significant advances in the diagnosis and characterization of sleep quality.However, the introduction of CAP analysis as a regular clinical practice faces some obstacles, namely (a) the time required for manually scoring a whole night polysomnography (the gold standard for sleep analysis [16]), due to the large amount of information produced during whole night EEG recording, (b) combining different information from different sensors or channels, (c) the need for qualified personnel to perform the manual scoring, and (d) the fair inter-scorer specialist agreement, that varies from 69% to 78% [17].
Therefore, manual scoring is considerably problematic, as the process is unpractical and prone to misclassifications.For these reasons, the development of algorithms for automatic CAP analysis with information fusion is desirable, supporting, thus, the necessity for this study.
It was also observed that CAP is a global EEG phenomenon that comprises extensive cortical areas, suggesting that the A phases could be visible on all EEG channels [6].However, the state of the art works which proposed methodologies for automatic A phase analysis perform the examination using only one EEG channel (usually with one monopolar derivation).Although this approach can lead to less complex models, it is also reductive and restrictive since a large amount of information coming from the other channels is discarded, disregarding at the same time the fact that the A phase activity can occur over multiple cortical areas.
In addition, most of the methods proposed in the state of the art for A phase detection employ classification with features created by the researchers.Nevertheless, significant domain-specific knowledge is required for the feature creation process and it is becoming increasingly challenging to discern a new set of features that can reach a better performance than the methods already reported in the state of the art.Also there is the need for feature sorting which does not guarantee a performance improvement [18] [19].These complications can be surpassed by a deep learning model, which can automatically learn the relevant patterns from the input signal.However, a significant gap in the state of the art regarding deep learning applications for CAP analysis was identified.
CAP phases have a strong temporal dependency that can be captured by recurrent neural networks, e.g.Long Short-Term Memory (LSTM) [20], and the activity can be measured in different EEG channels.Therefore, a novel approach was followed in this work where the information from multiple EEG channels was fused by a proposed deep learning channel fusion methodology, composed of LSTM, concatenation, and fully connected (dense) layers.
On the other hand, the structure and/or hyperparameters of a deep learning classifier are usually selected through an experimental search (usually a grid search), which performs an exhaustive evaluation of multiple combinations of parameters.However, this approach requires a significant amount of time and computational resources, which can be impracticable for deep learning models [21].Two heuristic based algorithms, namely, Genetic Algorithms (GA) and Particle Swarm Optimization (PSO), were used in this work as an alternative to the grid search approach to find the optimal structure, number of channels, and hyperparameters of the models [18].Therefore, two models were developed to perform the channel fusion of EEG channels for the CAP A phase assessment, one was tuned by a GA and the other by the PSO.It is also intended to study the optimization algorithms characteristics to determine which can lead to the best performance.
The key novelties of this work can be summarized as follows: -Proposal of a novel method for information fusion based on a deep learning model which is responsible for extracting the features, performing the feature level fusion, and performing the classification.The structure of the classifier was tuned by the optimization algorithm hence, all the fusion and classification procedures were optimized and performed automatically by the deep learning model which learned the relevant patterns directly from the data.
-Independent evaluation of two optimization algorithms for finding the optimal structure of a deep learning classifier.The optimization of deep learning models is a well-known difficulty in the machine learning field since the simulations are usually slow, hence, there is a need to study suitable algorithms to haste this process.
-Combined examination of subjects free from neurological disorders and subjects with a sleep related disorder using information (the signal) from multiple EEG channels to assess the CAP A phases.The state of the art standard is to only examine one channel for the analysis, which is contrary to the specification of the CAP protocol where it specifies that the analysis should preferably be carried out over multiple channels.
-Development of systems tolerant to noise (until a signal to noise ratio of 0 dB) and able to handle the loss of 66% of the information, i.e. loss of two channels.
It is important to highlight here that the CAP A phase assessment was used as an example of the application of the proposed fusion of multiple time series.This means that the suggested approach was developed to be generic, and thus be applied to other research and industry applications.
The paper has the following organization: an overview of the state of the art for CAP A phase analysis is presented in Section 2; the employed materials and methods are presented in Section 3; the model's performance is evaluated in Section 4; a discussion of the obtained results is carried out in Section 5; the paper is concluded in Section 6.

Overview of the state of the art for CAP A phase analysis
A literature review was conducted based on the PRISMA style, covering papers published until June 2021.The search was conducted using the IEEE Xplore, PubMed, Web of Science, and cited literature in the included articles.The keywords used in the search were "cyclic alternating pattern" and "CAP AND A phase AND EEG".The inclusion criterion was the presentation of a method for the A phase detection while the exclusion criterion was the absence of a performance metric that can advocate the capability of the model.The search was performed on articles written in English.The search yielded 2420 and 239 results for the first and second keywords string, respectively.After removing the duplicated and the non-relevant articles, the total of selected articles were 26, published from 2002 to 2021.
Two main research lines were found in the state of the art regarding the CAP analysis.The first comprises the evaluation of the A phase subtypes [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] while the second performs the A phase detection for the CAP cycle estimation, usually considering each epoch as either "A" or "not-A", leading to a binary classification problem.Although the A phase subtypes can provide significant information about the sleep process, this information is not required for the sleep stability assessment based on CAP cycle analysis since it only considers the occurrence of an activation and not the subtype.Therefore, the research line followed by this work comprises the A phase binary classification.A total of 16 articles were found performing this analysis.Largo et al. [32] [33] examined the power of five EEG frequency bands (delta, theta, alpha, sigma, and beta).The fast discrete wavelet transform was applied to the signal of one EEG channel, and evaluated two moving averages with a short and a long duration.
The relationship between the averages was named as activity index and it was used to detect the occurrence of A phases.The classification model was tuned by a GA, and the A phases assessment was performed by comparing with a threshold.Niknazar et al. [34] employed a classification method based on a similarity analysis between the windowed input signal and reference windows from a database, evaluating the signal from one EEG monopolar derivation (C4-A1 or C3-A2).Barcaro et al. [35] employed a technique to describe the sleep microstructure by computing band descriptors, one for each of the five EEG characteristic bands evaluated in the F4-C4 channel, to measure how much the amplitude of a frequency band differs from the background activity.
Afterwards, a tuned threshold was used to implement the classification.
Mariani et al. [36] also employed a threshold based classification, evaluating the differential variance of the EEG signal from one monopolar derivation (C4-A1 or C3-A2), Hjorth descriptors (activity and mobility, computed in the low delta and high delta bands), and the band descriptors.The highest accuracy was attained by the first feature.
These features were also used by Mariani et al. [37], [38], [39], and [40].A Forward Neural Network (FFNN) was used in the first work [37] examining the signal from one monopolar derivation (C4-A1 or C3-A2), while a Support Vector Machine with Gaussian kernel was employed in the second [38], evaluating the F4-C4 channel.Four classifiers were tested in the third work [39] and the Linear Discriminant Analysis (LDA) attained the highest accuracy, when evaluating the EEG signal from one monopolar derivation (C4-A1 or C3-A2).A variable window was employed in the fourth work [40] for the feature creation process using an EEG monopolar derivation (C4-A1 or C3-A2), then the features were fed to a different discriminant function for each A phase subtype.Afterwards, the outputs were combined for the A phase classification.
Mendonça et al. [41] and [42] evaluated the Shannon entropy, log-energy entropy, Teager Energy Operator (TEO), auto-covariance, standard deviation, and power spectral density of the five characteristic EEG frequency bands.LDA was employed in the first work [41] to perform the classification while multiple classifiers were examined in the second work [42].It was concluded that FFNN attained the best accuracy.In both works, the signal from one EEG monopolar derivation (C4-A1 or C3-A2) was examined.
Sharma et al. [43] applied a five level wavelet decomposition to produce six subbands and four features were extracted for each sub-band, specifically, the three Hjorth parameters (activity, mobility, and complexity) and the wavelet entropy.These features were then fed to an ensemble of bagged tree classifier.The examination was performed by extracting features from two channels, one monopolar (C4-A1) and one bipolar (F4-C4) EEG derivations.
Mostafa et al. [44] proposed a deep learning method to classify each two second window with a deeply-stacked auto encoder fed with the signal from one EEG monopolar derivation (C4-A1 or C3-A2).Hartmann and Baumert [45] examined the same EEG derivation, feeding entropy and frequency based features, differential variance, and TEO to an LSTM to the classification procedure.Mendonça et al. [20] and [46] also employed an LSTM.However, the classifier was directly fed with the EEG signal from one monopolar derivation (C4-A1 or C3-A2).
From the state of the art analysis, it was concluded that tunable thresholds were frequently used at the beginning of the automatic CAP analysis which then were later

Materials and methods
The developed model estimates the CAP A phases, in a second-by-second assessment, by examining the pre-processed signals from multiple EEG channels.
Those signals were fused by the deep learning classifier which performed the automatic feature extraction and classification.Specifically, distributed fusion was employed in this work since it is suitable when the sources of information come from similar sensors [47].Each EEG channel was fed to one LSTM which was used to extract features from each signal.Afterwards, the extracted features were concatenated by the fusion node to produce the fused feature vector (feature level fusion [4]) employed to perform the A phase classification.The classifier's output was then post-processed to diminish the misclassification, and the model's performance was assessed.Two optimization algorithms were examined to tune the model with the goal of finding the optimal number of channels, number of time

Studied population
Recordings from the CAP Sleep Database [6] [48] were selected to develop the model.This database is publicly available and has annotations provided by sleep experts regarding the A phase occurrence and duration.Relevant information for the CAP analysis is present in both EEG bipolar and monopolar derivations since the CAP is a global EEG phenomenon, comprising broad cortical areas [6].As reported by Mariani et al. [38], CAP analysis is usually performed using only the signal from one monopolar derivation (either C4-A1 or C3-A2).However, such methodology is prone to have a large number of false positives (identified A phases) as many activations correspond to changes in amplitude and/or frequency on the central lead but are regular EEG rhythms on the others.Therefore, CAP scoring should be performed by scoring multiple channels [38].
In that view, the goal is to use as many derivations as possible while keeping the model's complexity feasible to be used in the current available hardware.It was observed that the state of the art works examined either the F4-C4 channel or one monopolar derivation (C4-A1 or C3-A2).Nevertheless, Terzano et al. [6] indicated that all bipolar derivations can properly detect the A phases, hence, the Fp2-F4 was also examined in this work.Hence, the three examined deviations are Fp2-F4, F4-C4, and C4-A1.
Eight subjects Free of Neurological Disorder (FND) were chosen from the dataset since these were the ones that have the examined channels available.To provide a broader representation of the general population, eight more subjects having the same three EEG deviations with a sleep related disorder (also available in the dataset) were included.NFLE was chosen to be the studied disorder since the epileptic manifestations are likely to act as a subcontinuous "internal noise" which can induce a substantial growth of all CAP related parameters, reflecting the degree of sleep instability [12].According to our best knowledge, no state of the art work examined a combination of normal subjects and subjects with NFLE in the task of automatic classification of CAP A phases.
Therefore, the considered population was composed of eight normal subjects (reference for normal sleep quality) and eight subjects prone to have poor sleep quality.
A population of 16 subjects (eleven females and five males) was found to be either equal or higher than the works available in the state of the art performing the CAP A phase analysis.
The average total sleep time of the studied population was 463.97 minutes, ranging from 370.5 to 553.5 minutes, with a standard deviation of 54.21 minutes.The average subject's age was 32.88 years old, ranging from 16 to 67 years old, with a standard deviation of 11.43 years old.The number of one second epochs related to the occurrence of an A phase was 67118 and the total number of one second epochs was 518723.

Classification and channel fusion
The information fusion concept was employed to combine data from multiple EEG channels [4].This fusion was performed at the feature level where the multiple feature vectors were combined to form the joint feature vector from which the classification was performed.The features were automatically created for each EEG channel by feeding the pre-processed signal to the designated LSTM layer for the channel.
Each LSTM layer is composed of memory cells which sequentially process the input and preserve their hidden state through time [49].Each cell is controlled by three gates.
The input gate (I) defines the flow of activations into the cell while the output gate (O) controls the flow of activations to the remaining network.The forget gate (F) is responsible for adaptively resetting the cell's state.For the time step t and cell c, these operations are defined as [50]   () where σ is the sigmoid function given by σ (α) = 1/(1+e -α ), x (t) is the input vector, U are the input weights, W are the recurrence weights, and b are the bias.The network's output, h, is given by [50] ℎ where tanh is the hyperbolic tangent function calculated as tanh(α)=2σ(2α)-1, and s An LSTM layer can examine the data sequence in only one direction (conventional LSTM model) or in two directions, denoted as Bidirectional LSTM (BLSTM).
Although the BLSTM models use more parameters when compared to the conventional LSTM models, it is likely that these models can find more relevant patterns on the fed data.At the end, the softmax function, given by softmax(α)=e α /∑ e α   , was used by a fully connected layer to normalized the output.At the end, binary classification output was obtained by applying the max operation.

Optimization procedure
Two optimization algorithms were studied to find the best structure of the classifier for the A phase assessment, evaluating an encoding array.These stochastic algorithms were used in this work as an alternative to the conventional grid search, which is considered unfeasible especially when many parameters have to be tuned [18].The pseudocodes for GA and PSO are presented in algorithms 2 and 3, respectively, where the goal is to find the solution which maximizes the Performance Metric (PM).
The GA was selected since it is one of the most commonly used algorithms for complex design optimization problems, using Darwinian principles of biological evolution [18].On the other hand, PSO methodology is based on information sharing, such as occurs in nature in the flocks of birds and schools of fish, and this algorithm was selected since it has large flexibility and is capable of finding the globally best solution in complex (possibly multimodal) search spaces [51].
Algorithm 2. Pseudocode for the GA variant used in this work.

21:
For each bit b in q do: Mutation

22:
If rand() ≤ mprob (g) do: Elitism 25: 26: Evaluate the chromosomes in Q Algorithm 3. Pseudocode for the PSO algorithm variant used in this work.

13:
For each particle k in S (i) do:

14:
For each bit b in k do: If rand() < σ(vk,b (i) ) do: For each particle k in S (i) do:

Genetic algorithm
GA is a type of metaheuristic algorithm that has previously shown to be capable of finding an improved solution over time by replicating the best solutions from generation to generation and producing offspring from these solutions [52].
For this work, the algorithm was initialized with a random individual generation, using mutation and crossover operators over a defined number of generations to reach a solution, which optimized the model to a given metric.
Coded chromosomes were employed to characterize the population P = [p1, p2, …, pz], where z is the size of each generation, g.Each p was decoded using a decoding table (see section 3.5), and the quality of the solution (fitness assessment) was assessed by the selected PM.
The algorithm stopped if the maximum number of generations, G, or if the patience value, Pa, (number of consecutive generations that the algorithm did not produce an improved solution) reached the maximum patience, Pamax.The initial population of P was randomly generated and then sorted according to the performance of each chromosome.Afterwards, a new cycle started for the creation of the offspring population, Q, with size z.According to the crossover probability, each new member of the offspring population, q, was created either by a two-point crossover operation between two different elements randomly chosen from P, or by cloning the most fitted element selected from a tournament of two.In the two-point crossover operation, each crossover produced one offspring, and each of the elements of P can be chosen to participate in a tournament of two, implementing the no-replacement tournament selection [53].The approach chooses the most fitted element of each tournament to produce the cross-over without allowing the same chromosome to be the winner of the two tournaments, since the tournaments are repeated until two different elements of P are selected.This two-point crossover approach was adopted because it was reported that it outperforms other conventional crossover operations [54].It is important to note here that in both cases, all the chromosomes have an equal probability of being picked for a tournament, i.e.,

2𝑧 − 1 𝑧(𝑧 − 1)
, if and only if,  ≥ 3 The most fitted elements will, however, have a higher probability of being selected in each tournament, and consequently, used for crossover or cloning.
A mutation operation (that performs the logical not operation) was applied to all elements of the chromosome of each q according to the mutation probability, mprob.
Therefore, the estimated number of mutations on a given iteration g is given by where Nbits is the number of bits used to encode the problem.The implemented methodology for the GA follows the convention of starting with high exploration (using a high mprob) and then progressively changing into exploitation (decreasing mprob every five generations).It is worth noting that if both mutation and crossover rates are too high, then the GA will head toward random search, while the opposite leads to a hill climbing algorithm.Hence the gradual change from exploration to exploitation is more suitable [55].
The two best p of each generation were considered elites ensuring that they were moved to the next generation.Subsequently, the performance of each q was assessed and stored.P (without the two elites) and Q were combined and sorted according to the performance scores (attained PM by the model defined by the chromosome), from most to least fitted, and the best z -2 members were chosen to compose the new P.
Afterwards, the two elites were introduced in P which was then sorted from most to least fitted (according to the performance scores).Subsequently, a new generation started and the process was repeated until either g was equal to G or Pa was equal to Pamax.

Particle swarm optimization
PSO is a population-based stochastic optimization algorithm that uses agents (called particles), organized in a swarm (S), in order to search for the optimal solution(s) in a (possibly complex) search space.Each particle p, in its turn, is a candidate solution for the optimization problem at hand.
The algorithm was initially proposed in 1995 by R. Eberhart and J. Kennedy [56][57].These authors suggested a collective search strategy in which particles consider the best position found by the other particles (in other words, the social information) and its individual best position (also known as the cognitive information) in order to explore the search space and converge to the optimal solution(s).
In short, PSO can be described in three main steps: (i) initialize the swarm by randomly positioning the particles in the search space; until a stopping criterion is met: (ii) compute, for each particle, its new velocity (v) and position (x), and (iii) for each particle, when a better solution is found, update the cognitive and social position information.
It is important to note here that the social position information is shared using information links between particles.These information links allow particles to be fully connected, and thus share information with every particle in the swarm or create neighbors of particles where the knowledge is restricted to the particles that belong to the same neighborhood.
In order to optimize the structure and hyperparameters of the deep learning classifier used in this work, a discrete binary PSO [58] variant was used.The velocity of a particle, at every iteration i and dimension d, was thus updated as follows   (+1) =  ()   () +  1  1 () (  () −   () ) +  2  2 () (  () −   () ) (7) where ω is the inertia weight parameter [59], c1 and c2 the cognitive and social weight respectively, and r1 and r2 two uniformly distributed pseudorandom numbers.Finally, p is the personal best position found by the particle and l the best position found by the neighboring particles.After computing the velocity of the particles, the position of each particle is changed according to where rand() denotes a pseudorandom number drawn from a uniform distribution on the interval [0, 1] and σ the sigmoid function.
The particles were organized in a ring topology, where each particle only shares information with the two immediately adjacent neighborhoods.The rationale behind the choice of this topology has to do with the fact that in a ring topology, the social information flows slowly, which simultaneously slows down the convergence speed.
This behavior is particularly important in multi-modal complex optimization problems like the one presented in this paper.Having a low convergence rate improves the algorithm's exploration capabilities, prevents the premature convergence of the algorithm and, therefore, reduces the susceptibility of PSO to getting trapped in a local minimum [60] [61].The inertia weight parameter (ω), on the other hand, was updated following a negative non-linear time-varying approach.

Performance metrics and validation methodology
The performance in the experimental results was assessed by the Accuracy (Acc), Sensitivity (Sen), and Specificity (Spe) of the predictions against the ground truth (database labels) by [62]  =  +   +  +  +  (9) where TP is the number of instances of class "A" classified as class "A", TN is the number of instances of class "not-A" classified as class "not-A", FP is the number of instances of class "not-A" classified as class "A", and FN is the number of instances of class "A" classified as class "not-A".The diagnostic ability of the algorithm was evaluated by the Area Under the receiver operating characteristic Curve (AUC) [63], considering that the positive class was "A".
The normalized diversity of the population or particles at each generation or iteration (distance-based measure) was computed as [55] [64] where L is the length of the chromosome or particle, z is the number of chromosomes or particles, and Ham is the Hamming distance, given by the number of positions where the bits of the two chromosomes differ.
Taking into consideration that the optimization procedure is considerably time consuming, Two-Fold Cross-Validation (TFCV) was used to find the optimized solution with a cold start of the classifier in each run.TFCV was performed by dividing the subjects into two datasets (ensuring subject independent datasets by using the data from each subject exclusively in only one of the datasets).The AUC of the two TFCV cycles was averaged to find the mean AUC considered as the PM for the model under examination.The Adam algorithm [65] was used for training since it was found to be the most suited for the CAP analysis based on LSTM [20].Cost-sensitive learning was employed to deal with the strong data unbalancement (instead of using a balancing operation that can alter the expected distribution of the data) since for some subjects more than 80% of the epochs can refer to the "not-A" class.
When the best structure of the classifier was found, the Leave One Out (LOO) method was used to assess the performance of that model, with a cold start (the classifier weighs were randomly initialized to not perform retraining) of the classifier in each run.This method was employed as it can provide less biased results when a low number of samples is available [66].Hence, a total of 16 evaluation cycles were executed.The training set, employed for each cycle, was composed of data from 15 subjects and the data from the left out subject composed the testing set.Each subject was only chosen once to compose the testing set.

Implementation
A resampling procedure was applied to attain a uniform database, since the sampling frequency of the records varies between 100 Hz and 512 Hz.All signals were resampled at the lowest sampling frequency by decimation [67].A constant reduction factor was employed for the sampling rate, s, and a standard lowpass filter (Chebyshev type I filter with order eight, normalized cutoff frequency of 0.8/s, and passband ripple of 0.05 dB) was used to avoid aliasing and downsample the signal.Thus, a resampling process chooses each s th point from the filtered signal to generate the resampled signal.This signal was then standardized, by subtracting the mean and dividing the result by the standard deviation, with the goal of reducing the effect of systematic variations in the signal [68].
The removal of artifacts related to cardiac field and eye movements during sleep was recommended by several studies as an approach that can marginally improve the performance of the classifier [27] [69].Nevertheless, the accurate removal of these artifacts requires, at least, the electrooculogram and electrocardiogram signals, leading to a further complex model.Therefore, these artifacts were not removed.
Epoch's duration (D) was selected to be one second, which is in line with the standard duration for CAP analysis, and it corresponded to the database labels.Since the signals were resampled at 100 Hz, the input dimension was 100 for each time step.
For this work, AUC was selected as the PM since it can provide an estimate of the diagnostic ability of the algorithm, without being significantly affected by class unbalance.For both studied optimization algorithms, the used learning rate was 0.001 and the batch size was 1024.The optimal classification threshold for the test dataset of the LOO examination was identified by finding the optimal cut off point of the receiver operating characteristic curve estimated on the training dataset.
Four activation functions were assessed by the optimization algorithm to introduce nonlinearities in the network: tanh; sigmoid; Rectified Linear Unit (ReLU); Scaled Exponential Linear Unit (SELU).
An encoding array, presented in Table 1, was employed to perform the optimization search.A total of 15 coded chromosomes or particles, each composed of 15 bits, were employed to characterize the population (P elements) at each generation or iteration, g, by using the decoding indicated in Table 1.

Table 1.
Encoding array examined by the optimization algorithms.For GA, the quality of the solution (fitness assessment) for each element of the population was assessed by the average AUC (employed optimization metric since it reveals the diagnostic ability of the model) estimated by TFCV.The values of G and M were chosen to be 20 and 15, respectively.The crossover probability was 90%.The initial mutation probability was 20% and the value was decreased 30% every five generations until the minimum of 1% was reached.The GA parameters were selected to be in line with the ones employed by Largo et al. [32], reported as suitable for CAP analysis using a GA.

Number Locus
To allow a fair comparison with GA, a total of 15 particles were employed with PSO (with the same encoding array defined in Table 1), besides keeping the fitness assessment and the stopping criterion as defined previously for GA.Concerning the specific PSO parameters, c1 was set to 0.6 and c2 to 0.3 in order to lead to the convergence of PSO considering the inertia values [70].The initial and final values of ω were defined to be 0.9 and 0.4, respectively [71] [72].In order to have the same rate of change as the mutation operation in the GA, the value of ω was decreased by 9% every five generations until the minimum value of 0.4 was reached.
An overview of the implemented model is presented in Fig. 2. Since binary classification was employed, an epoch was considered as misclassified when the predicted label was bounded by two opposite classifications, denoting an isolated classification.Therefore, in the post-processing, a sequence of 010 was corrected to 000 and 101 to 111.

Fig. 2.
Overview of the implemented model fusing the signal of three EEG channels, using a dense layer to transform the LSTM and concatenation layers outputs.

Experimental results
The algorithms were developed in Python 3 using TensorFlow's libraries to implement the classifier, running in NVIDIA's GeForce GTX 1080 Ti graphics processing unit.The first step was the search for the best structure of the classifier, performed by the optimization algorithms using TFCV.For the classifiers whose structure was found to be the best by the optimization algorithms, a second performance assessment was carried out by LOO (with a cold start of the classifier in each run).

Optimization of the classifier
The optimal parameters found by the optimization algorithms are presented in Table 2. Figures 3 and 4 present the AUC variation and the diversity of the chromosomes or particles through the evaluated generations or iterations, respectively.The simulation time was 1058067 seconds (12.25 days) and 859373 seconds (9.95 days) for the GA and PSO algorithms, respectively.A total of 300 different networks were simulated by GA while PSO simulated 255 different networks.It is important to notice that if a full grid search methodology was employed, the total number of examined networks would be 28672 which is computationally infeasible.
It was observed in Table 2 that both optimization algorithms identified a similar optimal structure, using the three EEG channels, a single BLSTM layer for each channel with the same shape, and employing the dense layers (one after each BLSTM layer and one after the concatenation layer).On the other hand, the chosen number of time steps was 25 for PSO, which was relatively higher when compared to GA which was 10, with a 10% lower dropout.The selected size and activation function for the dense layer was also different.The total number of trainable parameters was 934202 and 723602 for GA and PSO, respectively.
PSO was able to find the best solution at the second iteration, early stopping at iteration 16 (see Fig. 3).This could mean, however, that PSO converged prematurely, getting trapped into that local optimum.Nevertheless, it was significantly faster than GA, which reached the best solution at generation 15.PSO also maintained a higher diversity in the population (see Fig. 4).These results were expected as PSO is prone to converge faster while GA maintains the cycle of offspring creation which has the tendency to progressively decrease the diversity of the population.

Performance assessment
The results obtained by the LOO method using the optimal configurations found by GA and PSO are presented in Tables 3, with the 16 subjects; with only the 8 subjects FND; with only the 8 subjects who have NFLE.Fig. 5 depicts the AUC for each of the subject (subjects 1 to 8 are FND while subjects 9 to 16 have NFLE).
By examining the results from Table 3, when the 16 subjects were used, it is possible to conclude that the configuration found by PSO reached an Acc and Spe which are approximately 3% and 4% better than the configuration found by GA, respectively.
However, the results are less balanced when compared to the configuration found by GA which attained a Sen almost 5% higher.Nevertheless, the AUC of both configurations was approximately the same (82%), indicating that the performance of the two models is equivalent and that both optimization algorithms identified suitable configurations for this analysis.Another relevant aspect, highlighted in Fig. 5, is the variation of the performance according to the subjects, demonstrating that the models have an average absolute difference of 1%, and both are capable of working with subjects FND and subjects with NFLE, advocating the feasibility of the proposed model for clinical applications.
When comparing the LOO results (in Table 3) of the models using only the eight subjects FND or only the eight subjects which have NFLE against the LOO results with the 16 subjects, it is possible to observe that a superior performance for most performance metrics was reached when using LOO with the 16 subjects.These results were expected since the models were optimized to find the best solution when taking into consideration a population with both subjects FND and subjects with NFLE.
Therefore, the proposed model has the key advantage of being capable of working with both a population FND and a population with sleep disorders (in this case with NFLE).

Robustness evaluation
To evaluate the robustness of the proposed fusion method, two different tests were performed.The first examined the effect of losing the information from one or two channels, while the second evaluated the effect of introducing noise in the input signals.
All results were attained by training the model with the three channels and without introducing noise, and then the models were tested with missing the channels or with the introduction of noise.
The results of the first test were attained by using LOO on the full population (16 subjects), covering all possible scenarios, and are presented in Fig. 6.For the scenario where no channels were lost indicating three (all) working channels, one channel was lost (indicated as two working channels in the figure) and two channels were lost (indicated as one working channel in the figure).The lost channel is replaced by one of the working channels or channel; as for two working channels it can be replaced by either one of them, where for one working channel all three channels' inputs are replaced by the remaining channel.By evaluating the results from Fig. 6 it is possible to conclude that losing one channel does not considerably change the AUC.Losing two channels (worst case scenario) decreased the AUC median less than 3% for both models, advocating the robustness of the models.
To evaluate the effect of having noise in the input signals, all EEG channels were contaminated with Additive White Gaussian Noise (AWGN) with varied Signal to Noise Ratio (SNR) from -20 to 20 dB (range considered as suitable for this type of analysis [73]).The results are presented in Fig. 7 where it is visible that the model whose structure was selected by GA is less affected by noise than the structure selected by PSO, conceivably due to the larger number of time steps used by the structure selected by PSO (15 time steps more than the structure selected GA), which means that more noise will affect the model.Nevertheless, both models maintained a good performance until the SNR was 0 dB, which is a value considerably lower than the usual SNR of EEG sensors [74].Therefore, the proposed solution is also resistant to the introduction of noise in the input channels.

Discussion
A comparison between the results reported by the previous state of the art works and the results attained in this work is presented in given that an A phase can only be scored if it is visible in all EEG channels.The relevance of using multiple channels is even more emphasized in this work as both optimization algorithms selected as the best solution the use of three EEG channels.
Contrary to what was done in the developed models, most of the state of the art works have performed a manual removal of the wake or rapid eye movement periods [36] [39], which can boost the performance of the classifier, however it leads to a methodology which is not suitable to be implemented in a fully automatic scoring algorithm.Additionally, several state of the art works have further removed the epochs not related to the CAP phase events, further lessening the fully automatic applicability of the model [43].
For biomedical applications, it is important to have a balanced performance to provide a reliable clinical diagnosis.Taking into consideration the significant unbalance that characterizes CAP analysis (considerably more events related to "not-A" than "A"), it is not possible to focus the performance assessment only in the Acc since without reporting the Sen and Spe it is not possible to assess if the performance is balanced or not.For this reason, the AUC is preferable, although this metric was not reported by the majority of the state of the art works and, therefore, the mean metric was proposed in this analysis as an alternative to check how balanced the results are.By considering this metric, it is possible to conclude that the best state of the art works, which have included sleep disorder patients in the analysis, is the same as attained in this work (76%).However, Mendonça et al. [20] [41] have examined patients with sleep disordered breathing while subjects with NFLE were examined in this work.Sharma et al. [43] also evaluated subjects with NFLE but attained a lower Acc, highlighting how difficult it is to examine subjects with this disorder.
It is also important to notice that some state of the art works used a threshold based approach instead of a machine learning classifier [34] [35], which is likely to be difficult to generalize to a broader population.The works based on the manual creation of features to be fed to a classifier also have the disadvantage of requiring significant domain knowledge that hampers the researcher work [18].Moreover, that methodology usually requires a feature selection procedure to determine the subset of features that are more relevant for the examined problem.On the other hand, the deep learning approach employed in this work automatically creates features, and can be further improved as more data is available, making the model more suitable for large scale examinations.The proposed model automatically extracts features by identifying patterns in time from the input time series, using a deep learning classifier.However, one of the most challenging aspects of using deep learning models is the need to optimize the structure and hyperparameters.To address these problems, two optimization algorithms were examined as an efficient alternative to the traditional grid search approach to optimize.
As a result, it was observed that the optimal structure for the classifier identified by the two optimization algorithms was similar and selected the input with three EEG signals, denoting the importance of using multiple channels to properly detect the CAP A phases.
It was observed that the obtained performance is in the upper range of the best state of the art works, although a significantly more challenging methodology and subjects data were employed in this work, examining a population composed of subjects FND and subjects with NFLE, using a fully automatic analysis instead of requiring to manually isolate the non-rapid eye movement sleep epochs as it is done in most of the state of the art works.It was also observed that the models are resilient to noise and channel failure, making them even more suitable for real-world clinical applications.
It is relevant to notice that the proposed architecture is flexible enough to be altered to include more layers (for example, a combination of convolution layer followed by an LSTM layer instead of only the LSTM layer) or to change the current layers (for example, change the LSTM to a gated recurrent unit).
Three main paths were identified as future work in this research.The first is to further validate the proposed methodology to include more channels in the analysis.The second one is to add different sensors to the fusion model.The last one is to implement a similar methodology to other research and industry applications.
replaced by the features based machine learning classification.The most recent articles employ deep learning methods, pointing to the occurrence of a new trend where the feature creation process is performed automatically by the classifier.Another relevant aspect is that only the monopolar derivations and the F4-C4 channel were previously studied by state of the art works.The summary of the state of the art analysis is presented in Fig 1.

Fig. 1 .
Fig. 1.Overview of the state of the art examination.

Algorithm 1 . 1 : 2 : 6 : 7 : 5 T ← O 15 : 16 : 19 : 1 20: 2 21:
steps for the classification's recurrence, and structure of the classifier.The pseudocode of the developed model is presented in Algorithm 1.The code developed for this work was made open-source, being publicly available in a GitHub repository (https://github.com/Dntfreitas/GA_PSO_DEEP_LEARNING).Pseudocode for the experimental procedure.Input: EEGchannels ← {Fp2-F4, C4-A1, F4-C4} For each ch in EEGchannels do: For each opt in optimizers do: While opt's termination criteria is not met do: O0, O1, …, Oep − 1], where Oep is the output of the ep th epoch of the deep learning model Post-processing ► Sec.3.For each output out = 0, …, ep − 1 do: If out > 0 and out < length(O) do: Tout = majority(Oout − 1,Oout,Oout + 1) Find the best structure for GA and PSO for the deep learning model, using PM as reference ► Sec.4.Performance comparison between GA and PSO ► Sec.4.Robustness evaluation for noise and channel failure ► Sec.4.3 (t) is the cell's internal state, updated by   () =   ()   (−1) +   () ℎ (∑  ,   ()  + ∑  , ℎ  (−1)  +   ) Each LSTM cell receives a time step of data with duration D, composed of I input points.The type of LSTM, the number of channels, n, number of time steps, T, and number of LSTM layers (stacked if more than one) were chosen by the optimization algorithm.Each cell has multiple hidden units and the total number of hidden units, H, of the last cell defines the output of the LSTM layer (the epoch's data fed to the last cell corresponds to the database label for the current evaluated epoch).When two LSTM layers were stacked, the sequence of vectors of the first layer was returned to the second layer whose last cells' outputs defined the output.The LSTM layers' outputs the features h1, h2, …, hn that were automatically crafted from each input channel.These features were then transformed to f = [f1 [h1(1), h1(2),…, h1(H)], f2 [h2(1), h2(2), …, h2(H)], …, fn [hn(1), hn(2), …, hn(H)]] by the concatenation layer, where f1, f2, …, fn are either the outputs of the LSTM (h1, h2, …, hn), or are the dense layers' transformations of the LSTM outputs, according to the decision of the optimization algorithm.These channels were fused, at the feature level, by the concatenation layer which merges all the features into a sequence f, i.e. the input of the fusion node is the set of features h and the output is f.If a dense layer was used to transform the LSTM layer's outputs then, a second dense layer (with the same configuration as the first dense layer) was used to transform the concatenation layer's output.

Fig. 3 .
Fig. 3. Variation of the AUC of the best solution found by the optimization algorithms.

Fig. 4 .
Fig. 4. Diversity of the chromosomes or particles over the optimization algorithms'

Fig. 5 .
Fig. 5. AUC estimation using LOO for the models optimized by GA (BLSTM+GA) and

Fig. 6 .Fig. 7 .
Fig. 6.Violin plots of the results attained by LOO when all three channels are available

Table 2 .
Optimal configurations found by the optimization algorithms.

Table 3 .
Performance attained by the LOO method for the best models identified by the optimization algorithms.Results are presented in the format "mean ± standard deviation

Table 4 .
[38]xamining the table, it is clear that the previous works which have only examined FNB subjects attained the best performance, highlighting the difficulties associated with the assessment of subjects with sleep disorders.Although the use of sleep disorder subjects made the classification process more challenging, the produced results can be better generalized for clinical applications.Another relevant factor is the average number of examined subjects by the state of the art works, which is 12, while 18 were examined in this work, emphasizing the viability of the achieved results.It is also important to highlight here the examination of multiple channels considering that, apart from Sharma et al.[43]who evaluated two EEG channels, all state of the art works examined only one EEG channel, which is contrary to the recommendation to score CAP utilizing multiple channels[38],

Table 4 .
Comparative analysis between results reported by the state of the art works and the results attained in this work with subjects FND and Sleep Disorder Patients (SDP).A novel methodology to fuse time series signals at the feature level is proposed in this work, and it was evaluated in a challenging real-world scenario of CAP A phase classification.However, this methodology can be used in other contexts when it is intended to fuse information from multiple time series for classification or regression.