Automated Multiclass Classification of Spontaneous EEG Activity in Alzheimer’s Disease and Mild Cognitive Impairment

The discrimination of early Alzheimer’s disease (AD) and its prodromal form (i.e., mild cognitive impairment, MCI) from cognitively healthy control (HC) subjects is crucial, since treatment is more effective in the early stages of dementia. The aim of our study is to evaluate the usefulness of a methodology based on electroencephalography (EEG) to detect AD and MCI. EEG rhythms were recorded from 37 AD patients, 37 MCI subjects and 37 HC subjects. Artifact-free trials were analyzed by means of several spectral and nonlinear features: relative power in the conventional frequency bands, median frequency, individual alpha frequency, spectral entropy, Lempel–Ziv complexity, central tendency measure, sample entropy, fuzzy entropy, and auto-mutual information. Relevance and redundancy analyses were also conducted through the fast correlation-based filter (FCBF) to derive an optimal set of features. The selected features were used to train three different models aimed at classifying the trials: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and multi-layer perceptron artificial neural network (MLP). Afterwards, each subject was automatically allocated to a particular group by applying a trial-based majority vote procedure. After feature extraction, the FCBF method selected the optimal set of features: individual alpha frequency, relative power in the delta frequency band, and sample entropy. Using this set of features, MLP showed the highest diagnostic performance in determining whether a subject is not healthy (sensitivity of 82.35% and positive predictive value of 84.85% for the HC vs. all classification task) and whether a subject does not suffer from AD (specificity of 79.41% and negative predictive value of 84.38% for the AD vs. all comparison). Our findings suggest that our methodology can help physicians to discriminate AD, MCI and HC.


Introduction
Dementia due to Alzheimer's disease (AD) is a progressive neurodegenerative disorder associated with cognitive, behavioral and functional alterations. AD prevalence increases exponentially with age, from 1% in people between 60 and 64 years up to 38% in people over 85 years [1]. Since Institute on Aging and Alzheimer's Association (NIA-AA) criteria, whereas HC were elderly subjects without a cognitive impairment and with no history of neurological or psychiatric disorder [23]. Inclusion and exclusion criteria for each group can be found in our previous study [20].
All participants and patients' caregivers were informed about the research background and the study protocol. Moreover, all of them gave their written informed consent to be included in the study. The Ethics Committee at the Río Hortega University Hospital (Valladolid, Spain) endorsed the study protocol, according to The Code of Ethics of the World Medical Association (Declaration of Helsinki).

EEG Recording
Five minutes of spontaneous EEG activity were recorded using a 19-channel EEG system (XLTEK®, Natus Medical, Pleasanton, CA, USA). Specifically, EEG activity was acquired from Fp1, Fp2, Fz, F3, F4, F7, F8, Cz, C3, C4, T3, T4, T5, T6, Pz, P3, P4, O1, and O2, at a sampling frequency of 200 Hz. Subjects were asked to stay in a relaxed state, awake, and with closed eyes during EEG acquisition. During the recording procedure, EEG traces were visually monitored in real time, and muscle activity was identified to avoid high-frequency noise. Additionally, independent component analysis (ICA) was performed to minimize the presence of oculographic, cardiographic, and myographic artifacts [7]. Afterwards, EEG signals were digitally filtered using a finite impulse response filter designed with a Hamming window between 1 and 70 Hz and a notch filter to remove the power line frequency interference (50 Hz, Butterworth filter). Finally, an experienced technician selected 5-s artifact-free epochs by visual inspection.
We randomly divided our EEG database into training and test sets. The training set was formed by: 20 AD patients (45.85 ± 8.36 trials per subject, mean ± standard deviation, SD), 20 MCI subjects (46.85 ± 10.68 trials per subject) and 20 HC subjects (45.60 ± 7.93 trials per subject). The recordings not selected for the training set were assigned to the test set: 17 AD patients (44.53 ± 10.10 trials per subject), 17 MCI subjects (49.82 ± 8.29 trials per subject) and 17 HC subjects (44.24 ± 7.81 trials per subject). No statistically significant differences were found in age (p-value > 0.05, Kruskal-Wallis test) and gender (p-value > 0.05, chi-squared test) among AD, MCI, and HC groups. Table 1 shows relevant socio-demographic and clinical data for each group.
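The subject-level random split described above can be sketched as follows. This is a minimal illustration, assuming hypothetical subject identifiers and a fixed random seed; the function name `split_subjects` is ours and does not come from the study. The split is done per subject (not per trial), so that all trials of a subject fall in the same set.

```python
import random

def split_subjects(subjects_by_group, n_train, seed=0):
    """Randomly assign n_train subjects per group to training; the rest go to test."""
    rng = random.Random(seed)
    train, test = {}, {}
    for group, subjects in subjects_by_group.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        train[group] = sorted(shuffled[:n_train])
        test[group] = sorted(shuffled[n_train:])
    return train, test

# Hypothetical identifiers: 37 subjects per group, as in the study.
cohort = {g: [f"{g}{i:02d}" for i in range(1, 38)] for g in ("AD", "MCI", "HC")}
train, test = split_subjects(cohort, n_train=20)
print({g: (len(train[g]), len(test[g])) for g in cohort})
```

With 37 subjects per group and 20 assigned to training, 17 subjects per group remain for the test set, which matches the group sizes reported above.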

Methods
The methodology followed in this study is represented in Figure 1. After EEG recording and data pre-processing, both spectral and nonlinear features were computed. Then, FCBF was applied to the training set to automatically select an optimal set of features. Finally, three different multiclass classification approaches (LDA, QDA, and MLP) were adopted to assign a group to each trial and subject.

Figure 1. Block diagram of the steps followed in the EEG analysis: data collection, pre-processing, feature extraction, feature selection and classification.

Spectral Analysis
A typical approach to characterize electromagnetic brain recordings is based on the analysis of their spectral content [24–26]. Spectral parameters are based on the normalized power spectral density in the frequency band of interest (PSDn). In this regard, the following spectral parameters have been calculated from the PSDn: relative power (RP), median frequency (MF), individual alpha frequency (IAF), and spectral entropy (SE).

• RP represents the relative contribution of different frequency components to the global power spectrum. RP is more appropriate than absolute power to analyze EEG data, as RP provides thresholds that are independent of the measurement equipment and lower inter-subject variability [27]. RP is obtained by summing the contribution of the desired spectral components:

$$\mathrm{RP}(f_1, f_2) = \sum_{f=f_1}^{f_2} \mathrm{PSD}_n(f),$$

where $f_1$ and $f_2$ are the low and the high cut-off frequencies of each band, respectively. In this study, RP was calculated in the conventional EEG frequency bands: delta (δ, 1-4 Hz), theta (θ, 4-8 Hz), alpha (α, 8-13 Hz), beta-1 (β1, 13-19 Hz), beta-2 (β2, 19-30 Hz) and gamma (γ, 30-70 Hz).

• MF offers an alternative way to quantify the spectral changes of the EEG, and it is a simple index that summarizes the whole spectral content of the PSDn. MF is defined as the frequency that comprises 50% of the PSDn power:

$$\sum_{f=1\,\mathrm{Hz}}^{\mathrm{MF}} \mathrm{PSD}_n(f) = 0.5 \sum_{f=1\,\mathrm{Hz}}^{70\,\mathrm{Hz}} \mathrm{PSD}_n(f).$$

Previous studies suggested that MF provides a better performance for the characterization of brain activity than mean frequency, whose original definition is based on the computation of the spectral centroid [28].

• IAF evaluates the frequency at which the maximum alpha power is reached. Alpha oscillations are dominant in the EEG of resting normal subjects, with the exception of irregular activity in the delta band and lower frequencies. Consequently, the PSD displays a peak around the alpha band. The IAF estimation in the present work is based on the calculation of the MF in the extended alpha band (4-15 Hz), as previous EEG studies on AD recommended [29]:

$$\sum_{f=4\,\mathrm{Hz}}^{\mathrm{IAF}} \mathrm{PSD}_n(f) = 0.5 \sum_{f=4\,\mathrm{Hz}}^{15\,\mathrm{Hz}} \mathrm{PSD}_n(f).$$

• SE estimates the signal irregularity in terms of the flatness of the power spectrum [30]. On the one hand, a uniform power spectrum with a broad spectral content (e.g., a highly irregular signal like white noise) provides a high SE value. On the other hand, a narrow power spectrum with only a few spectral components (e.g., a highly predictable signal like a sum of sinusoids) yields a low SE value. SE is computed as the Shannon entropy of the PSDn:

$$\mathrm{SE} = -\sum_{f=1\,\mathrm{Hz}}^{70\,\mathrm{Hz}} \mathrm{PSD}_n(f) \, \log\left[\mathrm{PSD}_n(f)\right].$$
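The four spectral parameters can be illustrated with a short sketch that starts from an already estimated power spectrum. The function names, the half-open band convention (f1 ≤ f < f2), the unnormalized form of SE and the toy spectrum are assumptions made for illustration; they are not taken from the study, and the PSD estimation itself is not covered here.

```python
import math

def normalize_psd(freqs, psd, f_lo=1.0, f_hi=70.0):
    """Keep the bins in [f_lo, f_hi] and scale them so that they sum to 1 (PSDn)."""
    band = [(f, p) for f, p in zip(freqs, psd) if f_lo <= f <= f_hi]
    total = sum(p for _, p in band)
    return [f for f, _ in band], [p / total for _, p in band]

def relative_power(freqs, psdn, f1, f2):
    """Sum of PSDn over f1 <= f < f2 (half-open, so adjacent bands do not overlap)."""
    return sum(p for f, p in zip(freqs, psdn) if f1 <= f < f2)

def median_frequency(freqs, psdn, f_lo=1.0, f_hi=70.0):
    """Lowest frequency at which the cumulative power reaches 50% of the power in [f_lo, f_hi]."""
    band = [(f, p) for f, p in zip(freqs, psdn) if f_lo <= f <= f_hi]
    half = 0.5 * sum(p for _, p in band)
    cumulative = 0.0
    for f, p in band:
        cumulative += p
        if cumulative >= half:
            return f
    return band[-1][0]

def spectral_entropy(psdn):
    """Shannon entropy of the normalized spectrum (natural logarithm, not rescaled)."""
    return -sum(p * math.log(p) for p in psdn if p > 0)

# Toy spectrum at 1 Hz resolution: flat background with a strong peak at 10 Hz.
freqs = [float(f) for f in range(1, 71)]
psd = [50.0 if f == 10.0 else 1.0 for f in freqs]
freqs_n, psdn = normalize_psd(freqs, psd)

rp_alpha = relative_power(freqs_n, psdn, 8.0, 13.0)   # alpha band, 8-13 Hz
mf = median_frequency(freqs_n, psdn)                   # MF over 1-70 Hz
iaf = median_frequency(freqs_n, psdn, 4.0, 15.0)       # IAF as MF over 4-15 Hz
se = spectral_entropy(psdn)
print(rp_alpha, mf, iaf, se)
```

In this toy case the IAF falls on the 10 Hz peak, whereas the MF over the whole band is pulled slightly higher by the flat background. Dividing SE by the logarithm of the number of bins is a common way to rescale it to [0, 1], should a bounded index be preferred.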

Entropy 2018, 20, 35
Nonlinear Analysis
Alterations caused by AD and MCI also modify the complexity, variability and irregularity of the EEG activity [9,12,31–34]. Hence, to complement the spectral analysis, five global nonlinear methods were also calculated: Lempel-Ziv complexity (LZC), central tendency measure (CTM), sample entropy (SampEn), fuzzy entropy (FuzzyEn), and auto-mutual information (AMI).

• LZC estimates the complexity of a finite sequence of symbols. LZC analysis is based on a coarse-graining of measurements. Therefore, the EEG signal must first be transformed into a finite symbol string. In this study, we used the simplest possible way: a binary sequence conversion (zeros and ones). By comparison with a threshold $T_d$, the original signal samples are converted into a 0-1 sequence $P = s(1), s(2), \ldots, s(N)$, with $s(i)$ defined by:

$$s(i) = \begin{cases} 0 & \text{if } x(i) < T_d, \\ 1 & \text{otherwise.} \end{cases}$$

The threshold $T_d$ is estimated as the median value of the signal amplitude in each channel, because the median is robust to outliers. The string P is then scanned from left to right and a complexity counter c(N) is increased by one every time a new subsequence of consecutive characters is encountered in the scanning process. In order to obtain a complexity measure that is independent of the sequence length, c(N) should be normalized. For a binary conversion, the upper bound of c(N) is given by $b(N) = N/\log_2(N)$ and c(N) can be normalized via b(N):

$$C(N) = \frac{c(N)}{b(N)}.$$

LZC values are normalized between 0 and 1, with higher LZC values for more complex time series. The detailed algorithm for the LZC measure can be found in [35].

• CTM quantifies the variability of a given time series on the basis of its first-order differences. For CTM calculation, scatter plots of first differences of the data are drawn. The value of CTM is computed as the proportion of points in the plot that fall within a radius ρ, which must be specified [36]. For a time series with N samples, N − 2 is the total number of points in the scatter plot obtained by representing $[x(i+2) - x(i+1)]$ versus $[x(i+1) - x(i)]$. Subsequently, the CTM of the time series can be computed as:

$$\mathrm{CTM} = \frac{1}{N-2} \sum_{i=1}^{N-2} \delta(d_i),$$

where

$$\delta(d_i) = \begin{cases} 1 & \text{if } \left\{ [x(i+2) - x(i+1)]^2 + [x(i+1) - x(i)]^2 \right\}^{1/2} < \rho, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, CTM ranges between 0 and 1, with higher values corresponding to points more concentrated around the center of the plot (i.e., corresponding to a lower degree of variability).
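The two measures above can be sketched in a few lines. This is only an illustration under stated assumptions: the phrase counter follows the exhaustive Lempel-Ziv (1976) parsing with a simple substring search (not the optimized algorithm referenced in [35]), samples equal to the median are mapped to 1, and all function names are ours.

```python
import math
import statistics

def binarize(x):
    """Convert a signal into a 0-1 sequence using its median as threshold T_d."""
    t_d = statistics.median(x)
    return [0 if v < t_d else 1 for v in x]

def lz_complexity(binary):
    """Number of phrases c(N) in the Lempel-Ziv (1976) parsing of a 0-1 sequence."""
    s = "".join(str(b) for b in binary)
    n = len(s)
    i, c = 0, 0
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from the text seen so far.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        c += 1
        i += length
    return c

def normalized_lzc(x):
    """c(N) divided by the bound b(N) = N / log2(N)."""
    n = len(x)
    return lz_complexity(binarize(x)) * math.log2(n) / n

def ctm(x, rho):
    """Proportion of points of the first-difference scatter plot within radius rho."""
    n = len(x)
    inside = 0
    for i in range(n - 2):
        d_next = x[i + 2] - x[i + 1]
        d_curr = x[i + 1] - x[i]
        if math.hypot(d_next, d_curr) < rho:
            inside += 1
    return inside / (n - 2)

print(lz_complexity("0001101001000101"))   # parsed as 0|001|10|100|1000|101
print(normalized_lzc(list(range(16))))
print(ctm([0, 1] * 10, rho=1.0), ctm([0, 1] * 10, rho=2.0))
```

The alternating series illustrates the role of ρ: every point of its scatter plot lies at distance √2 from the origin, so none is counted for ρ = 1 and all are counted for ρ = 2.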

• SampEn is an embedding entropy used to quantify irregularity. It can be applied to short and relatively noisy time series [37]. To compute SampEn, two input parameters should be specified: a run length m and a tolerance window r. SampEn is the negative natural logarithm of the conditional probability that two sequences similar for m points remain similar at the next point, within a tolerance r, excluding self-matches [37]. Thus, SampEn assigns a nonnegative number to a time series, with larger values corresponding to greater signal irregularity. For a time series of N points, $X(n) = \{x(1), x(2), \ldots, x(N)\}$, vectors of length m are formed as $X_m(i) = \{x(i), x(i+1), \ldots, x(i+m-1)\}$, with $i = 1, \ldots, N-m$. The distances among vectors are calculated as the maximum absolute distance between their corresponding scalar elements. $B_i$ is the number of vectors $X_m(j)$, with $j \neq i$, whose distance to $X_m(i)$ is less than r. The counting number of different vectors is calculated and normalized as [37]:

$$B^m(r) = \frac{1}{N-m} \sum_{i=1}^{N-m} \frac{B_i}{N-m-1}.$$

Repeating the process for vectors of length m + 1, $B^{m+1}(r)$ can be obtained and SampEn can be defined as:

$$\mathrm{SampEn}(m, r, N) = -\ln\left[\frac{B^{m+1}(r)}{B^m(r)}\right].$$

• FuzzyEn provides information about how a signal fluctuates with time by comparing the time series with a delayed version of itself [38]. As with SampEn, higher FuzzyEn values are associated with more irregular time series. To compute FuzzyEn, three parameters must be fixed. The first parameter, m, is the length of the vectors to be compared, as in SampEn. The other ones, r and n, are the width and the gradient of the boundary of the exponential function, respectively [38]. Given a time series $X(n) = \{x(1), x(2), \ldots, x(N)\}$, the FuzzyEn algorithm reads as follows:

1. Compose N − m + 1 vectors of length m such that:

$$X_i^m = \{x(i), x(i+1), \ldots, x(i+m-1)\} - x_0(i),$$

where $x_0(i)$ is given by:

$$x_0(i) = \frac{1}{m} \sum_{j=0}^{m-1} x(i+j).$$

2. Compute the distance, $d_{ij}^m$, between each two vectors, $X_i^m$ and $X_j^m$, as the maximum absolute difference of their corresponding scalar components. Given n and r, calculate the similarity degree, $D_{ij}^m$, between $X_i^m$ and $X_j^m$ through a fuzzy function $\mu(d_{ij}^m, n, r)$:

$$D_{ij}^m = \mu(d_{ij}^m, n, r) = \exp\left(-\frac{(d_{ij}^m)^n}{r}\right).$$

3. Define the function $\varphi^m$ as:

$$\varphi^m(n, r) = \frac{1}{N-m} \sum_{i=1}^{N-m} \left[ \frac{1}{N-m-1} \sum_{j=1, j \neq i}^{N-m} D_{ij}^m \right],$$

so that FuzzyEn is obtained as:

$$\mathrm{FuzzyEn}(m, n, r) = \ln\left[\varphi^m(n, r)\right] - \ln\left[\varphi^{m+1}(n, r)\right].$$

• AMI is the particularization of mutual information applied to time-delayed versions of the same sequence. Mutual information is a metric derived from Shannon's information theory to estimate the information gain from observations of one random event on another [31]. AMI estimates, on average, the degree to which a time-delayed version of a signal can be predicted from the original one. Thus, more predictable (and accordingly more regular) time series lead to higher AMI values. The AMI between X(n) and X(n + k) is [31]:

$$\mathrm{AMI}(k) = \sum_{X(n), X(n+k)} P_{XX_k}[X(n), X(n+k)] \, \log_2 \frac{P_{XX_k}[X(n), X(n+k)]}{P_X[X(n)] \, P_{X_k}[X(n+k)]},$$

where $P_X[X(n)]$ is the probability density for the measurement X(n), $P_{X_k}[X(n+k)]$ is the corresponding density for X(n + k), and $P_{XX_k}[X(n), X(n+k)]$ is the joint probability density for the measurements of X(n) and X(n + k). In this study, the AMI was estimated over a time delay from 0 to 0.5 s and was then normalized, so that AMI(k = 0) = 1.
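A compact sketch of the two entropy measures follows. It is a direct, quadratic-time illustration, assuming that r is passed as an absolute tolerance (e.g., 0.1 times the SD of the series, computed by the caller), that the same N − m starting positions are used for lengths m and m + 1, and that a strict inequality defines a match; the function names are ours. AMI is not covered by this sketch.

```python
import math

def _max_abs_distance(a, b):
    """Maximum absolute difference between corresponding elements."""
    return max(abs(u - v) for u, v in zip(a, b))

def sample_entropy(x, m, r):
    """SampEn(m, r): -ln of the ratio of (m+1)-point to m-point matches, no self-matches."""
    n = len(x)

    def count_matches(length):
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if _max_abs_distance(templates[i], templates[j]) < r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    # Undefined (raises) when no matches are found; longer series or larger r are needed.
    return -math.log(a / b)

def fuzzy_entropy(x, m, r, n_exp):
    """FuzzyEn(m, n, r) with the exponential membership exp(-(d**n)/r)."""
    n = len(x)

    def phi(length):
        vectors = []
        for i in range(n - m):
            window = x[i:i + length]
            baseline = sum(window) / length      # x0(i): local mean removed
            vectors.append([v - baseline for v in window])
        k = len(vectors)
        total = 0.0
        for i in range(k):
            for j in range(k):
                if i != j:
                    d = _max_abs_distance(vectors[i], vectors[j])
                    total += math.exp(-(d ** n_exp) / r)
        return total / (k * (k - 1))

    return math.log(phi(m)) - math.log(phi(m + 1))

periodic = [0.0, 1.0] * 20
print(sample_entropy(periodic, m=1, r=0.5))          # perfectly regular series
print(fuzzy_entropy([1.0] * 30, m=2, r=0.2, n_exp=2))
```

Because the normalization factors are identical for both vector lengths in this convention, the pair counts can be used directly in the SampEn ratio.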

Feature Selection: Fast-Correlation-Based Filter
The aforementioned characterization of the EEG may lead to the extraction of several features that provide similar information about the brain dynamics in AD, MCI, and HC. Consequently, a feature selection stage was also included. In our study, FCBF was used to discard those redundant features that share more information with the other ones than with the variable that defines the group membership. FCBF is based on symmetrical uncertainty (SU), which is a normalized quantification of the information gain between each feature and the group membership variables [15]. It consists of two steps: relevance and redundancy analyses of the features.

• In the first step, a relevance analysis of the features is done. Thus, SU between each feature $X_i$ and the group membership Y is computed as follows:

$$SU(X_i, Y) = 2 \, \frac{H(X_i) - H(X_i|Y)}{H(X_i) + H(Y)}, \quad i = 1, 2, \ldots, I,$$

where H(·) is Shannon's entropy, $H(X_i|Y)$ is Shannon's entropy of $X_i$ conditioned on Y, and I is the number of features extracted (in our study, I = 14 features). SU is normalized to the range [0, 1]: a value of SU = 1 indicates that knowing one variable makes it possible to completely predict the other, whereas a value of SU = 0 indicates that the two variables are independent. Then, the features are ranked by relevance: the higher the value of SU, the more relevant the feature.

• The second step is a redundancy analysis used to discard redundant features. SU between each pair of features, $SU(X_i, X_j)$, is sequentially estimated beginning from the first-ranked ones. If $X_i$ shares more information with $X_j$ than with the corresponding group Y, i.e., $SU(X_i, X_j) \geq SU(X_i, Y)$ (with $X_i$ being more highly ranked than $X_j$), the feature j is discarded due to redundancy and it is not considered in subsequent comparisons. The optimal features are those not discarded when the algorithm ends.
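The two steps can be sketched as follows for features that have already been discretized (how continuous EEG features are binned is not specified here and is left as an assumption). The redundancy test is written exactly as stated above, comparing against the relevance of the higher-ranked feature; note that the original FCBF formulation by Yu and Liu compares against the relevance of the lower-ranked feature, SU(X_j, Y), instead. Function names and the toy data are ours.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    h_joint = entropy(list(zip(x, y)))
    info_gain = hx + hy - h_joint        # equals H(X) - H(X|Y)
    return 2.0 * info_gain / (hx + hy)

def fcbf(features, y):
    """features: dict name -> discrete values. Returns the retained names in rank order."""
    relevance = {name: symmetrical_uncertainty(col, y) for name, col in features.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    selected = []
    for name in ranked:
        # Discard the candidate if a retained, higher-ranked feature is redundant with it.
        redundant = any(
            symmetrical_uncertainty(features[kept], features[name]) >= relevance[kept]
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected

# Toy data: f2 duplicates f1, whereas f3 is unrelated to both f1 and the labels.
labels = [0, 0, 1, 1, 2, 2]
toy = {
    "f1": [0, 0, 1, 1, 2, 2],
    "f2": [0, 0, 1, 1, 2, 2],
    "f3": [0, 1, 0, 1, 0, 1],
}
print(fcbf(toy, labels))
```

In the toy case the duplicate f2 is removed, while f3 is retained because it is not redundant with f1; this sketch applies no minimum-relevance threshold, so even weakly relevant features survive unless they are redundant.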

Classification Approach
The described AD-MCI-HC diagnosis problem corresponds to a pattern classification task. Specifically, it can be modeled as a three-class classification problem. Bayesian decision theory establishes the rule to make such a decision to minimize the probability of misclassification [39]. We have implemented LDA, QDA, and MLP models to ensure that our conclusions take into account a variety of classification methodologies. In this study, we classify trials using each trained model, and, then, every subject is classified by means of a majority vote of all its trials [22].
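The trial-based majority vote can be illustrated with a small sketch. The input layout (pairs of subject identifier and predicted label) and the tie-breaking behaviour (the label seen first among tied labels wins) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def majority_vote(trial_predictions):
    """Map each subject to the label predicted for most of his/her trials.

    trial_predictions: iterable of (subject_id, predicted_label), one per trial.
    """
    votes = defaultdict(Counter)
    for subject, label in trial_predictions:
        votes[subject][label] += 1
    return {subject: counts.most_common(1)[0][0] for subject, counts in votes.items()}

# Hypothetical trial-level outputs for two subjects.
trials = [
    ("S01", "AD"), ("S01", "MCI"), ("S01", "AD"),
    ("S02", "HC"), ("S02", "HC"), ("S02", "MCI"), ("S02", "HC"),
]
print(majority_vote(trials))
```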

Linear and Quadratic Discriminant Analysis (LDA and QDA)
LDA takes an input vector and assigns it to one out of the K classes using linear hyperplanes as decision surfaces [40]. This classifier assumes that different classes generate data based on different Gaussian distributions, whose parameters are estimated with the fitting function during the training. In order to predict the classes of new data, the trained model finds the class with the smallest misclassification cost assuming that the covariance matrices of each class are identical (homoscedasticity) [40].
QDA is a classification approach closely related to LDA. However, it does not assume that the covariance matrices of all classes are identical, and it therefore establishes a quadratic decision boundary between classes in the feature space [40].
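The difference between the two discriminant analyses can be made concrete with a short NumPy sketch of Gaussian class-conditional models. It assumes equal misclassification costs, so the predicted class is the one with the highest posterior; the function names and the synthetic data are illustrative and do not reproduce the study's features.

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate priors, means, per-class covariances and the pooled covariance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    covs = {c: np.atleast_2d(np.cov(X[y == c], rowvar=False)) for c in classes}
    # Pooled covariance (shared by all classes under the LDA assumption).
    pooled = sum((np.sum(y == c) - 1) * covs[c] for c in classes) / (len(y) - len(classes))
    return classes, priors, means, covs, pooled

def predict(X, model, quadratic=False):
    """Assign each row of X to the class with the highest Gaussian log-posterior."""
    classes, priors, means, covs, pooled = model
    X = np.atleast_2d(np.asarray(X, dtype=float))
    scores = []
    for c in classes:
        cov = covs[c] if quadratic else pooled   # QDA: class-specific; LDA: shared
        diff = X - means[c]
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        log_det = np.linalg.slogdet(cov)[1]
        scores.append(np.log(priors[c]) - 0.5 * maha - 0.5 * log_det)
    return classes[np.argmax(np.array(scores), axis=0)]

# Synthetic, well-separated two-feature data for three groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in centers])
y = np.repeat(["AD", "MCI", "HC"], 30)

model = fit_gaussian_classes(X, y)
lda_pred = predict(X, model, quadratic=False)
qda_pred = predict(X, model, quadratic=True)
print((lda_pred == y).mean(), (qda_pred == y).mean())
```

With a shared covariance the quadratic terms cancel between classes and the boundaries are linear; with class-specific covariances they do not, which yields the quadratic boundaries of QDA.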

Multi-Layer Perceptron Artificial Neural Network (MLP)
MLP is an artificial neural network that maps an input vector onto a set of output variables using a nonlinear function controlled by a vector of adjustable parameters. The use of neural networks for classification problems has some advantages. In particular, no prior assumptions about the distribution of the data are required, since neural network algorithms adjust themselves to the environment by means of the training or learning process. Thus, complex relationships can be modeled by these algorithms [41].
An MLP consists of three or more layers (an input and an output layer with one or more hidden layers) of neurons, with each layer fully connected to the next one. In our study, we have evaluated MLP networks with a single hidden layer of neurons, since networks with this architecture are capable of universal approximation [42]. MLP utilizes backpropagation in conjunction with an optimization method, such as gradient descent, with the aim of finding appropriate weights to connect neurons to each other. Backpropagation is based on the definition of a suitable error function, which is minimized by updating the weights in the network [39].
In order to predict the classes for new data, the trained MLP model provides the posterior probability of belonging to each class. A three-class classification problem involves the use of three output neurons, one neuron per group. In our study, the number of neurons in the hidden layer (n_h) and a regularization parameter (u) were optimized on the training set by cross-validation, leaving all trials of one subject out in every iteration. This procedure was carried out 30 times to minimize the effect of network random initialization and then the results were averaged [43]. The NETLAB toolbox was used to implement the neural network classifier [44].
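The subject-wise cross-validation scheme can be sketched as a fold generator. It only illustrates how trials are partitioned; model training is omitted, and the function name and the toy identifiers are assumptions.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, val_idx), one fold per subject.

    subject_ids: list with the subject identifier of every trial.
    All trials of the held-out subject go to the validation fold.
    """
    for held_out in sorted(set(subject_ids)):
        val_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, val_idx

# Hypothetical trial-to-subject assignment.
trial_subjects = ["S01", "S01", "S02", "S03", "S03", "S03"]
folds = list(leave_one_subject_out(trial_subjects))
for held_out, train_idx, val_idx in folds:
    print(held_out, train_idx, val_idx)
```

Holding out whole subjects, rather than individual trials, prevents trials from the same recording from appearing on both sides of a fold.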

Statistical Analysis
The three-class diagnostic ability of the models was assessed in terms of accuracy (Acc, overall percentage of subjects correctly classified) and Cohen's kappa (κ). κ measures the agreement between predicted and observed classes, discounting the agreement expected by chance [45]. On the other hand, the performance of the models for the HC vs. all and AD vs. all comparisons was described by sensitivity (Se, percentage of positive subjects correctly classified), specificity (Sp, percentage of negative subjects correctly classified), Acc, positive predictive value (PPV, proportion of positive estimations of the models that are true positive results) and negative predictive value (NPV, proportion of negative estimations of the models that are true negative results).
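These definitions translate directly into code. In the sketch below, the function names are ours; the example counts correspond to the MLP model in the HC vs. all task (non-healthy as the positive class) and are taken from, or derived from, figures quoted elsewhere in the paper: 28 of 34 non-healthy subjects detected, 33 subjects predicted as non-healthy, and 17 HC subjects in the test set. The small confusion matrices used for κ are purely hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Se, Sp, Acc, PPV and NPV from the four cells of a 2x2 confusion matrix."""
    return {
        "Se": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + fn + fp + tn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (n * n)
    return (observed - expected) / (1 - expected)

# HC vs. all, MLP: tp = 28, fn = 34 - 28, fp = 33 - 28, tn = 17 - fp (derived counts).
hc_vs_all = binary_metrics(tp=28, fn=6, fp=5, tn=12)
print({name: round(100 * value, 2) for name, value in hc_vs_all.items()})

# Hypothetical matrices: perfect agreement and chance-level agreement.
print(cohen_kappa([[2, 0], [0, 2]]), cohen_kappa([[1, 1], [1, 1]]))
```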

Results
According to the proposed methods, we calculated 14 features from each EEG channel: nine spectral features, namely RP(δ) (where RP(δ) represents the RP value for the δ band), RP(θ), RP(α), RP(β1), RP(β2), RP(γ), MF, IAF, and SE, and five features derived from the nonlinear methods, namely LZC, CTM, SampEn, FuzzyEn, and AMI. The results were obtained based on all the artifact-free trials within the five-minute period of recording. Results from all EEG channels were averaged in order to obtain one value per trial for each method.

Training Set
In order to select the optimal value of the different input parameters of each feature, only the training set was used. The optimal value for ρ (CTM) was obtained by evaluating the range ρ ∈ [0.01, 0.5] (step = 0.005). Values of ρ < 0.01 were not considered, since they led to a CTM value close to 0 for every subject, whereas values of ρ > 0.5 were also discarded, since they led to CTM values equal to 1 regardless of the group. For both SampEn and FuzzyEn, the optimal values of m and r were obtained by evaluating all the combinations for m = 1, 2 and r ∈ (0.1·SD, 0.25·SD) (step = 0.05), where SD is the standard deviation of the time series [38,46]. In the case of FuzzyEn, values of n = 1, 2, 3 were also evaluated to obtain its optimal value [38]. We chose those configurations (ρ = 0.075 for CTM; m = 1 and r = 0.1·SD for SampEn; and m = 1, r = 0.1·SD, and n = 3 for FuzzyEn) for which the corresponding CTM, SampEn, and FuzzyEn values showed the lowest p-value (Kruskal-Wallis test) among the three groups. Table 2 summarizes the averaged results for each group, taking into account only the training set.
After feature extraction, FCBF was applied to the training set. The final FCBF optimal set was composed of three features: two spectral measures (IAF and RP(δ)) and a nonlinear one (SampEn).
The MLP model was obtained according to the optimal values for n_h and u. Both were optimized by cross-validation, leaving all trials of one subject out in every iteration. For each value of u between 0 and 100 (step = 5), we varied the number of neurons in the hidden layer from 1 to 20 (step = 1) in order to compute the κ value. This procedure was carried out 30 times to minimize the effect of network random initialization. Then, the κ values were averaged [43]. The optimal values (highest κ for trials) were u = 45 and 11 neurons in the hidden layer, as Figure 2 shows. On the other hand, since LDA and QDA models have no tuning parameters to be optimized, these were trained using all trials in the training set.
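The hyperparameter search over the regularization parameter and the hidden-layer size can be outlined as a plain grid search. The callback `evaluate_kappa(u, n_h, seed)` is hypothetical: it stands for training an MLP with the given settings and random initialization and returning the cross-validated κ for trials. The dummy scorer used below exists only to make the sketch runnable.

```python
from statistics import mean

def select_hyperparameters(evaluate_kappa, n_repeats=30):
    """Average kappa over repeated runs for every (u, n_h) pair and return the best pair."""
    grid = [(u, n_h) for u in range(0, 101, 5) for n_h in range(1, 21)]
    averaged = {
        (u, n_h): mean(evaluate_kappa(u, n_h, seed) for seed in range(n_repeats))
        for (u, n_h) in grid
    }
    best = max(averaged, key=averaged.get)
    return best, averaged

def dummy_kappa(u, n_h, seed):
    """Stand-in scorer peaking at u = 45 and n_h = 11 (not a trained model)."""
    return -((u - 45) ** 2 + (n_h - 11) ** 2) / 1000.0

best, averaged = select_hyperparameters(dummy_kappa)
print(best, len(averaged))
```

The grid contains 21 values of u and 20 hidden-layer sizes, i.e., 420 configurations, each evaluated 30 times in the procedure described above.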

Test Set
Once the models were trained, their diagnostic ability was evaluated using only the test set. The overall accuracy of the models in the three-class classification task was 58.82% with LDA, 60.78% with QDA, and 62.75% with MLP. Additionally, we obtained κ values of 0.3824 with LDA, 0.4118 with QDA and 0.4412 with MLP. These results show that MLP outperformed the discriminant analysis classifiers. Table 3 displays the confusion matrices of each model, i.e., the model class estimation for each subject versus their actual group. As expected, the three models had greater difficulty classifying MCI trials and subjects, as MCI is an intermediate state between HC and AD.

Table 3. Confusion matrices of each model: trials and subjects' classification in the test set.

Table 4 shows Se, Sp, Acc, PPV and NPV for each method for HC vs. all and AD vs. all, derived from the confusion matrices. MLP showed the highest diagnostic performance when determining whether a subject is not healthy (HC vs. all classification task: Se = 82.35% and PPV = 84.85%). Furthermore, the network showed the highest diagnostic capability when determining whether a subject does not suffer from AD (AD vs. all comparison: Sp = 79.41% and NPV = 84.38%). LDA and QDA showed similar tendencies, although reaching lower diagnostic performance than MLP, as Table 4 shows.
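As a consistency check, the reported κ values for subjects can be related to the reported accuracies. The derivation below assumes only that the test set is balanced (17 subjects per group, 51 in total), in which case the agreement expected by chance, $p_e$, equals 1/3 regardless of how the predictions are distributed; $n_{c\cdot}$ and $n_{\cdot c}$ denote the number of actual and predicted subjects in class c, and $p_o$ is the observed accuracy. The subject counts 30, 31 and 32 are inferred from the reported percentages.

$$p_e = \sum_{c} \frac{n_{c\cdot}}{51} \cdot \frac{n_{\cdot c}}{51} = \frac{17}{51} \sum_{c} \frac{n_{\cdot c}}{51} = \frac{1}{3}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{p_o - 1/3}{2/3}.$$

$$\text{LDA: } p_o = \tfrac{30}{51} \Rightarrow \kappa = \tfrac{13}{34} \approx 0.3824; \quad \text{QDA: } p_o = \tfrac{31}{51} \Rightarrow \kappa = \tfrac{14}{34} \approx 0.4118; \quad \text{MLP: } p_o = \tfrac{32}{51} \Rightarrow \kappa = \tfrac{15}{34} \approx 0.4412.$$

These values agree with the accuracies of 58.82%, 60.78% and 62.75% and with the κ values reported for the three models.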

Spectral and Nonlinear Characterization of AD and MCI
Our spectral results suggested that AD and MCI elicit a slowing of spontaneous EEG activity. Further inspection of RP values revealed that AD patients reached higher RP values in low frequency bands (θ) and lower RP values in high frequency bands (β1, β2 and γ) than HC subjects. For the MCI group, a slight slowing of neural oscillations was found in comparison with HC. This increase of slow rhythms in spontaneous EEG activity was also observed by means of MF and IAF. Both spectral parameters were lower for AD patients than for MCI and HC subjects. These findings confirm the trend reported in previous studies: AD and MCI are accompanied by a progressive slow-down of the EEG [24,25]. Finally, our SE results showed changes in the frequency distribution of the power spectrum. However, the physiological explanations for all of these alterations are not clear. The most widespread hypothesis is that a significant cerebral cholinergic deficit underlies cognitive symptoms, such as memory loss. A loss of cholinergic innervation of the neocortex might play a critical role in the EEG slowing associated with AD [24]. Analogously, the slowing of neural oscillations in AD could also be due to the loss of the neurotransmitter acetylcholine, since the cholinergic system modulates spontaneous cortical activity at low frequencies [26].
Regarding the nonlinear parameters that quantify the complexity and irregularity of EEG recordings, our findings showed lower LZC, SampEn and FuzzyEn values and higher AMI values for AD patients than for HC subjects. For these measures, MCI subjects showed intermediate values between AD and HC. Previous EEG studies also reported a loss of complexity and irregularity associated with early AD and MCI by means of nonlinear measures [9,12,31–34]. Additionally, CTM values were higher in AD patients and lower in HC subjects. This result suggests a decrease in variability in AD, as Abásolo et al. previously reported [12]. Taking into account the different nature of the nonlinear parameters, our results showed that the brain activity of AD patients is less complex, more regular and less variable than that of MCI and HC subjects. These changes can be associated with both loss of information content and alterations in information processing at the cerebral cortex [47]. The decrease of EEG complexity can also be due to the loss of neurons or synapses, since they are associated with the complex dynamical processing within the brain neural networks [33].

Towards a Screening Protocol of AD
Previous studies explored several EEG features for AD and MCI discrimination from HC, focusing on binary discrimination problems (AD vs. HC, MCI vs. HC and AD vs. MCI) [16–20]. To the best of our knowledge, only one study performed a three-way classification, although via binary classifiers [21]. McBride et al. reached an accuracy value of 85.42% when comparing HC vs. all and 83.33% for AD vs. all (eyes closed resting condition) [21]. Although their results are slightly higher than ours (78.43% and 76.47% for both comparisons, respectively), several advantages of our methodology should be noted. Firstly, their database was composed of only 47 subjects, in contrast to the 111 subjects recruited for our study. This data limitation also led the authors to validate their proposal through a leave-one-out cross-validation procedure instead of using a hold-out approach (training and test sets). As they obtained a different model for each iteration, the inclusion of new subjects would imply changes in every iteration of cross-validation. However, once our model is trained, the subsequent runtime to apply it to new data is trivial. New data can be classified simply by feeding the trained model with their standardized version, which simplifies the screening protocol.
In contrast to the above-mentioned studies, our single MLP model can be used not only for the three-class classification task but also for binary assessments of healthy vs. cognitively impaired subjects. As derived from Tables 3 and 4, it detected AD or MCI in 28 out of the 34 non-healthy subjects (82.35% Se), with a positive post-test probability of 84.85% (28 subjects correctly classified out of 33 subjects predicted as AD or MCI), and predicted only two out of 17 AD patients as HC. In addition, the same model was able to rule out AD in 27 out of the 34 subjects not suffering from it (79.41% Sp), including 15 out of the 17 HC (88.24%). These results highlight the clinical usefulness of our proposal, which might be expressed as a screening strategy similar to:

1. If the MLP model predicts AD, recommend beginning a treatment, since most probably (89.47%, 17 out of 19 subjects) the patient suffers from AD or MCI.

2. If the MLP model predicts HC, do not treat the patient, since most probably (88.89%, 16 out of 18 subjects) he/she does not suffer from AD; consider a regular evaluation of the subject if symptoms persist, in order to minimize the number of missed AD and MCI subjects.

3. If the MLP model predicts MCI, conduct a regular evaluation of the patient, since doubts arise about the cognitive status of the subject.
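The three rules above amount to a lookup from the subject-level MLP prediction to a suggested action. The sketch below is only a schematic restatement of the proposed strategy; the function name and the wording of the returned strings are ours, and it is not intended as clinical software.

```python
def screening_recommendation(predicted_group):
    """Map the subject-level MLP prediction to the action proposed in the screening strategy."""
    recommendations = {
        "AD": "recommend beginning a treatment",
        "HC": "do not treat; re-evaluate if symptoms persist",
        "MCI": "conduct a regular evaluation",
    }
    if predicted_group not in recommendations:
        raise ValueError(f"unknown group: {predicted_group!r}")
    return recommendations[predicted_group]

for group in ("AD", "HC", "MCI"):
    print(group, "->", screening_recommendation(group))
```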

Limitations and Future Research Lines
Although we showed the usefulness of our proposal, some limitations need to be addressed. Although we used a large data sample to train and validate the models (5122 trials), the trials were obtained from only 111 subjects. Hence, analyzing more recordings from different subjects would enhance the generalization ability of our results. Moreover, taking into account the heterogeneity of MCI, it would be useful to characterize different subtypes and to conduct a longitudinal analysis to characterize subjects with stable MCI and those who progress to AD. Finally, only three classification approaches (LDA, QDA, and MLP) have been used in this study. In future research, the usefulness of other advanced classification methods, such as spiking neural networks and support vector machines, should be evaluated.

Conclusions
To sum up, our results show that both AD and MCI elicit changes in the EEG background activity: a slowing of EEG rhythms, alterations in the frequency distribution of the power spectrum, a complexity loss, a regularity increase and a variability decrease. Our proposal has shown that spectral and nonlinear features allow us to characterize the brain abnormalities associated with AD and MCI. In addition, we have shown the high diagnostic ability of different three-class models trained with this EEG information, particularly when predicting AD and HC status. These results highlight the usefulness of our proposal to help physicians classify AD, MCI and HC from EEG data.