Applying Deep Learning on a Few EEG Electrodes during Resting State Reveals Depressive States: A Data Driven Study

The growing number of depressive people and the overload in primary care services make it necessary to identify depressive states with easily accessible biomarkers such as mobile electroencephalography (EEG). Some studies have addressed this issue by collecting and analyzing EEG resting state in search of appropriate features and classification methods. Traditionally, EEG resting-state classification methods for depression were mainly based on linear features or a combination of linear and non-linear features. We hypothesize that participants with ongoing depressive states differ from controls in complex patterns of brain dynamics that can be captured in EEG resting-state data using only nonlinear measures on a few electrodes, making it possible to develop cheap and wearable devices that could even be monitored through smartphones. To validate such a perspective, a resting-state EEG study was conducted with 50 participants, half with depressive state (DEP) and half controls (CTL). A data-driven approach was applied to select the most appropriate time window and electrodes for the EEG analyses, as suggested by Giacometti et al., as well as the most efficient nonlinear features and classifiers, to distinguish between CTL and DEP participants. Nonlinear features showing temporo-spatial and spectral complexity were selected. The results confirmed that computing nonlinear features from a few selected electrodes over a 15 s time window is sufficient to classify DEP and CTL participants accurately. Finally, after training and testing the classifier internally, the trained machine was applied to EEG resting-state data (CTL and DEP) from a publicly available database, validating the classifier's capacity for generalization to data from different equipment, populations, and environments, and obtaining an accuracy near 100%.


Introduction
Depression is a common mental disorder that comprises primary symptoms, such as depressed mood or anhedonia and loss of interest or pleasure, and secondary symptoms, such as appetite or weight changes, sleep difficulties (insomnia or hypersomnia), psychomotor agitation or retardation, fatigue or loss of energy, diminished ability to think or concentrate, feelings of worthlessness or excessive guilt, and suicidality [1]. It is estimated that 3.8% of the global population is affected to some degree by depressive states at some moment of their lives, including 5% of adults and 5.7% of adults older than 60 years. Globally, 280 million people suffer from depression, and it is one of the priority conditions covered by WHO's Mental Health Gap Action Programme (mhGAP) [2]. The Global Burden of Diseases, Injuries and Risk Factors Study showed that depression accounted for 34.1 million years lived with disability (YLD), ranking as the fifth-largest cause of YLD [3].

In the first step, different nonlinear features were tested:
• Higuchi's fractal dimension (Higuchi): an approximation of the box-counting dimension used in the fractal analysis of real-valued time series, which has been used for more than 20 years in neurophysiological domains [14].
• Spectral entropy (SpectEnt): describes the complexity of a system by applying the standard formula for entropy to the power spectral density of EEG data; it quantifies the irregularity of EEG data [15].
• Singular value decomposition entropy (SVDEnt): characterizes the information content or regularity of a signal depending on the number of vectors attributed to the process [16].
• Sample entropy (SampEnt): a modification of approximate entropy, used for assessing the complexity of physiological time-series signals [17].
• Detrended fluctuation analysis (DFA): a method, rooted in stochastic processes, chaos theory, and time-series analysis, for determining the statistical self-affinity of a signal; it is useful for analyzing time series that appear to be long-memory processes [18].
• Permutation entropy (PermEnt): quantifies the complexity of a dynamic system by capturing the order relations between values of a time series and extracting a probability distribution of the ordinal patterns [19].
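These measures are implemented in open-source packages such as "antropy", used later in this study. As a minimal illustration of how one such feature is computed, here is a NumPy-only sketch of permutation entropy; the function name and default parameters are our own choices for illustration, not a specific package API:

```python
import math
import numpy as np

def permutation_entropy(x, order=3, delay=1, normalize=True):
    """Permutation entropy of a 1-D signal (Bandt-Pompe ordinal patterns)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (order - 1) * delay
    # Each row is one embedded window of the time series
    windows = np.array([x[i:i + (order - 1) * delay + 1:delay] for i in range(n)])
    # argsort maps each window to its ordinal pattern
    patterns = windows.argsort(axis=1)
    # Empirical probability of each distinct ordinal pattern
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    pe = -np.sum(p * np.log2(p))
    if normalize:
        # Divide by the maximum possible entropy, log2(order!)
        pe /= np.log2(math.factorial(order))
    return pe
```

A monotonically increasing signal produces a single ordinal pattern and therefore zero entropy, while white noise approaches the normalized maximum of 1.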
In the second step, different classifiers were also tested:
• Logistic regression (LR): a baseline dichotomic classifier from machine learning (ML).
• Convolutional neural network (CNN): a deep learning (DL) algorithm mainly used in image analysis due to its similarity to the architecture of biological vision.
• Long short-term memory (LSTM): a DL algorithm mainly used in sequential data analysis, such as natural language processing or time-series analysis.
• CNN + GRU (CNNGRU): a DL algorithm suggested by Liu et al. [11] that combines a CNN as a feature extractor with a gated recurrent unit (GRU), an improvement on the LSTM.
In the third step, after fixing the environments with the previous steps' results, the selected model was tuned, trained, and validated with the data of our own experiment.
Finally, in the fourth step, the machine trained with our own dataset, resulting from such architecture and methodological approach, was tested with external data from a different lab, population, and environment to verify the generality of this approach.

Materials and Methods
This study protocol was approved by the Human Research Ethics Committee of the University of La Laguna, Tenerife, Spain, with number CEIBA 2021-3100, to protect the participants' rights according to the Declaration of Helsinki, and was accepted by the board of the University Institute of Neuroscience (IUNE) of the Universidad de La Laguna.
To select the participants for the two groups, an online pre-screening was carried out through different faculties of the Universidad de La Laguna. The Beck Depression Inventory II (BDI, online version) [20] was chosen as the measure of depressive symptoms. The BDI consists of 21 items assessing the severity of symptoms: minimal levels range from 0 to 13, mild depression from 14 to 19, moderate from 20 to 28, and severe from 29 upwards. Control participants were selected among those with BDI scores lower than 13, and depressive participants among those with scores higher than 20; as it was difficult to find students with severe depression for our study, we took as depressive anyone in the moderate-to-severe range. The control group (n = 24) had a mean BDI of 3.87 (SD = 2.32).

We used a Neuroscan system with a 64-electrode Easycap to acquire EEG data at a sampling rate of 500 Hz. Two electrodes placed above and below the left eye were used to monitor blinks and vertical eye movements (VEO).
The participants were asked to remove jewelry and piercings. Then they signed an informed consent form and received instructions to ensure their understanding of the EEG procedure and task. The participants were asked to stay calm and keep their eyes on a fixation point while we recorded EEG for 3 min for the open-eyes resting-state condition. Afterwards, they were asked to close their eyes and stay as calm as possible for 3 min while we recorded the EEG closed-eyes resting-state condition (EC). Impedances were kept less than 10 kΩ during the whole experiment. For this analysis, we used only time segments of the EC condition, as it seemed to be the most accepted while classifying depression [6]. After this, the participants completed the Edinburgh handedness inventory [21].

Preprocessing
One participant of the DEP group was discarded due to data corruption. Therefore, the total number of participants submitted to the analyses was 49 (25 DEP and 24 CTL).
The collected individual EEG data were preprocessed automatically with MNE-Python [22] using the artifact subspace reconstruction (ASR) algorithm [23], implemented with Python's "asrpy" package. An average reference was used, and a band-pass filter between 0.1 Hz and 120 Hz was applied. Furthermore, a 50 Hz notch filter and a low-pass filter attenuated frequencies above the cut-off (90 Hz), and new files were created for each participant from the filtered data.
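The full MNE/ASR pipeline depends on the recording setup, but the filtering stage alone can be sketched with SciPy, collapsing the band-pass and the 90 Hz low-pass ceiling into a single 0.1-90 Hz band-pass plus the 50 Hz notch. The filter orders and the function name are illustrative assumptions:

```python
import numpy as np
from scipy import signal

def filter_eeg(eeg, sf=500.0):
    """Zero-phase band-pass (0.1-90 Hz) plus a 50 Hz notch for line noise.
    eeg: array of shape (..., n_samples) sampled at sf Hz."""
    # 4th-order Butterworth band-pass in second-order sections for stability
    sos = signal.butter(4, [0.1, 90.0], btype="bandpass", fs=sf, output="sos")
    out = signal.sosfiltfilt(sos, eeg, axis=-1)
    # Narrow notch centered on the 50 Hz power-line frequency
    b, a = signal.iirnotch(50.0, Q=30.0, fs=sf)
    return signal.filtfilt(b, a, out, axis=-1)
```

Applying the filter to a 10 Hz + 50 Hz test mixture should strongly attenuate the 50 Hz component while leaving the 10 Hz component essentially intact.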

Time-Window Span Analysis
From each preprocessed participant file, time windows ranging from 1 s to 17 s in steps of 2 s were cropped after the trigger that identified the EC condition. These segments were resampled to 200 Hz to reduce processing time. An auxiliary miscellaneous channel we named 'Y' was added to the data structure to keep the participant class, using 0 to represent DEP and 1 to represent CTL.
All the segments were equalized on channels, concatenated, and normalized with scikit-learn's Normalizer using the L2 norm. The whole dataset was then split into (x, y) vectors of 200 samples each (200 was the sampling frequency); x kept the voltage values of each electrode and y the type of participant (CTL or DEP). Higuchi's fractal dimension (Higuchi) and sample entropy (SampEnt) were calculated over all the electrodes of each (x, y) vector, mapping them to a new (X, Y) pair where X contained the two non-linear features averaged over all the electrodes and Y kept the type of participant.
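A sketch of this segmentation and L2 normalization step, assuming scikit-learn's Normalizer; the helper name and array shapes are illustrative, not the study's actual code:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

def make_vectors(data, label, sf=200):
    """Split continuous multichannel EEG into 1 s (x, y) vectors and L2-normalize.
    data: array (n_channels, n_samples); label: 0 = DEP, 1 = CTL."""
    n_ch, n_samples = data.shape
    n_seg = n_samples // sf
    # Drop the trailing partial second, then cut each channel into 1 s chunks
    segs = data[:, :n_seg * sf].reshape(n_ch, n_seg, sf)
    # One row per second, concatenating all channels
    X = segs.transpose(1, 0, 2).reshape(n_seg, n_ch * sf)
    # Rescale every row to unit L2 norm
    X = Normalizer(norm="l2").fit_transform(X)
    y = np.full(n_seg, label)
    return X, y
```

Each resulting row has unit Euclidean length, which keeps amplitude differences between recordings from dominating the later feature extraction.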
A 10-fold cross-validation test harness was applied through StratifiedKFold from the scikit-learn Python package [24] to an LR classifier on standardized features (M = 0, SD = 1), with the (X, Y) vectors split randomly each time into X_train and X_test. The score results were kept for later analysis. This procedure was repeated for each of the time-window spans (1, 3, 5, 7, 9, 11, 13, 15, and 17 s), and their accuracy scores were kept.
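The harness can be sketched with scikit-learn as follows; the synthetic two-feature data stand in for the Higuchi/SampEnt vectors, and the StandardScaler step reflects the (M = 0, SD = 1) standardization:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (X, y) vectors: two non-linear features per
# segment, with a modest group difference built in for illustration.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),   # CTL-like segments
               rng.normal(1.0, 1.0, (100, 2))])  # DEP-like segments
y = np.array([1] * 100 + [0] * 100)

# Standardize features (M = 0, SD = 1) before the LR classifier
clf = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
```

StratifiedKFold preserves the CTL/DEP proportion in every fold, so each accuracy score is computed on a balanced held-out split.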

Key Electrodes Identification
Once the time-window span was fixed, (x, y) vectors were constructed for each of Desikan's brain areas associated with electrodes [13]. Such (x, y) vectors were then processed to obtain new (X, Y) vectors keeping Higuchi and SampEnt in X and the type of participant in Y. The same 10-fold cross-validation procedure was applied, and the accuracy scores were kept for later analysis. The entire procedure was repeated for each of the 32 electrodes, grouping the associated Desikan's brain areas.

Second Step: Exploring Features and Classifiers
In dynamic systems, entropy is associated with the rate of information production as a measure of the uncertainty linked to random variables, so the more information we have the less uncertainty we get. By measuring entropies on EEG data, when we seek accurate classification of complex dynamical processes, we are trying to reduce uncertainty. Some features built in this way can have shared knowledge, measured by mutual information. Therefore, the ranking of features through mutual information is a common way to reduce dimensionality when selecting features.

Features Extraction and Selection
A group of six non-linear features was extracted with the "antropy" Python package, developed by Raphael Vallat [25]. The selected features were obtained from the time-window span and electrodes provided by the previous steps. A feature ranking was applied using mutual information; this method measures the reduction in uncertainty about the target when a feature's value is known, so higher values represent a stronger connection between the feature and the target.
The SelectKBest function from scikit-learn was used. Once ranked, the features were added one by one and retested while the accuracy of the LR classifier kept improving. In total, six combinations were tested. The same K-fold test harness was applied as previously described, keeping the scores for further analysis.
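A sketch of this mutual-information ranking with scikit-learn, using a hypothetical six-column feature matrix in which only the first column carries information about the target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical feature matrix: rows = segments, columns = the six candidate
# non-linear features; here only column 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

# Estimate mutual information between each feature and the target
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]  # most informative feature first

# SelectKBest keeps the k features with the highest MI scores
top3 = SelectKBest(mutual_info_classif, k=3).fit(X, y).get_support()
```

Features would then be added to the classifier one by one, following `ranking`, while accuracy keeps improving.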

Testing Different Classifiers
Once the features were fixed, a comparison between machine learning (logistic regression, support vector machine) and deep learning algorithms (multilayer perceptron, convolutional neural network, long short-term memory, CNNGRU) was performed to check which approach gave the best classification scores. Machine learning approaches were based on the scikit-learn Python packages, while deep learning approaches were based on the Keras Python packages [26].
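A sketch of such a comparison for the machine learning side, using scikit-learn (the deep learning counterparts were built in Keras and are omitted here); the data and model settings are illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic six-feature data with a group difference, for illustration only
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (80, 6)), rng.normal(1, 1, (80, 6))])
y = np.repeat([1, 0], 80)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM-RBF": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}
# Mean 10-fold cross-validated accuracy per classifier
results = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
```

The same cross-validation folds and accuracy metric are applied to every model, so the resulting scores are directly comparable.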

Third Step: Tuning, Training, and Validating with the Experimental Data

Tuning and Training the Selected Model
The selected models were tuned and trained with the parameters fixed in the previous steps. The data were divided into 80% for training and 20% for validation. Once the models were fixed, their classification capacity was tested with the data collected at the lab; that is, the same files that were used to develop the models were then verified through the classifiers.

Fourth Step: Validating the Models with External Data

The success obtained with our own data could represent a situation wherein the models learned to classify some particular noise that correlated with the type of participant. Therefore, we applied the models to an external public database containing an EEG resting-state dataset of participants who were tested with the same BDI questionnaire as our experimental sample. The external sample was identified in PRED-CT's EEG resting-state depression and controls data [27]. The selected external participants were CTL (n = 75, M = 1.73, SD = 1.65) and DEP (n = 30, M = 25.10, SD = 3.19); a t-test between CTL and DEP gave t (183) = −49.12, p < 0.001. We selected DEP participants with BDI scores higher than 25, since this was the cut-off point for our experimental DEP participants with whom the classifier was trained.
The files of the external participants, with the same ranges of BDI scores, were preprocessed with the same methods as described previously: they were preprocessed automatically using the ASR algorithm, average-referenced, band-pass filtered (0.1-120 Hz), notch-filtered at 50 Hz, and low-pass filtered to attenuate frequencies above the cut-off (90 Hz). Then, a 15 s time window was cropped from each file after the trigger that identified the EC condition. These segments were resampled to 200 Hz, and an auxiliary miscellaneous channel named 'Y' was added to keep the class of the associated group (CTL = 1, DEP = 0). Keeping only the three selected electrodes (AFz, FC2, F2), the resulting (X, Y) vectors were fed to the locally trained models to test their performance through accuracy, precision, recall, and F1 scores.
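These four scores can be computed with scikit-learn; the labels below are hypothetical and serve only to show the calls:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for six external test segments (CTL = 1, DEP = 0)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)    # 5 of 6 segments correct
prec = precision_score(y_true, y_pred)  # all predicted CTL are truly CTL
rec = recall_score(y_true, y_pred)      # 3 of the 4 true CTL recovered
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```

Reporting all four together matters here because the external sample is imbalanced (75 CTL vs. 30 DEP), so accuracy alone could hide poor performance on the minority group.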

First Step: Exploring the Time-Window Span and Key Electrodes

Time-Window Span Analysis
As shown in Figure 1, a time-window span of 15 s was appropriate to keep enough data from each participant file when using the LR classifier with Higuchi and SampEnt as basic non-linear features. Paired t-tests were applied around the selected 15 s point; the comparison between 15 s and 9 s was so significantly different that shorter windows were not included.

Identifying Key Electrodes
To detect which group of electrodes was most accurate in classifying control/depressive participants, we applied an exploratory loop following Desikan's brain structural areas mapped onto associated electrodes, as shown in Giacometti et al. [13]. Fifteen seconds of the closed-eyes resting state were cropped from each participant file and processed by extracting two non-linear features (Higuchi and SampEnt) to feed a five-fold cross-validated logistic regression (LR) classification. This was systematically applied to the electrodes of the associated regions as an exploratory baseline (see Table 1).

Statistical comparisons with paired t-tests were applied to the validation scores of the four selected brain areas: the lateral orbitofrontal cortex (LOFC), caudal anterior cingulate cortex (CACC), precuneus (PREC), and superior frontal gyrus (SFG). The results, illustrated in Figure 2, showed significant differences between CACC and PREC (t (48) = 49.25, p < 0.0001) and between CACC and SFG (t (48) = 31.49, p < 0.0001) but not between CACC and LOFC (t (48) = 0.028, p = 0.977). Therefore, the caudal anterior cingulate area and its associated electrodes (FC2, AFz, F2), with high mean accuracy (M = 0.987, SD = 0.0006), were kept for further analyses.

Second Step: Exploring Features and Classifiers

Features Extraction and Selection
Figure 3 shows the distributions by group (CTL on the left and DEP on the right), after applying a MinMaxScaler, of the non-linear features: permutation entropy (PerEnt), sample entropy (SampEnt), singular value decomposition entropy (SVDEnt), detrended fluctuation analysis (DFA), spectral entropy (SpectEnt), and Higuchi's fractal dimension (Higuchi).

Figure 2. Accuracy comparison between brain areas associated with the electrodes at the lateral orbitofrontal cortex (LOFC), caudal anterior cingulate cortex (CACC), precuneus (PREC), and superior frontal gyrus (SFG). p-value annotation legend: ns: non-significant; **** p < 0.0001.

According to their distributional values, the ranking of the features was calculated using mutual information before testing their capacity to classify the groups. Figure 4 shows a comparison of their importance. Next, six combinations of non-linear features (COMB0 to COMB5) were tested and compared, beginning with the feature of highest importance and adding a new feature according to rank at each combination. To assess the relative accuracy of the combinations, a series of t-tests was performed between the obtained observations.
The comparison, illustrated in Figure 5, indicates the superior accuracy of COMB5, which is significant with respect to COMB0, COMB1, and COMB2 (p < 0.001), and non-significant with respect to COMB3 and COMB4 (p > 0.05).


Testing and Comparing Different Classifiers
Having fixed the time-window span at 15 s, the CACC brain area with its associated electrodes (FC2, AFz, F2), and the nonlinear features (PerEnt, SampEnt, SVDEnt, DFA, SpectEnt, Higuchi), a 10-fold cross-validation test harness was applied to ML and DL classifiers: logistic regression (LR), support vector machine (SVM) with radial basis function (RBF) kernel, multilayer perceptron (MLP), convolutional neural network (CNN), long short-term memory (LSTM), and the recently proposed CNN + GRU (CNNGRU) classifier by Liu et al. [11]. Accuracy was selected as the metric score. The t-test comparisons between pairs of classifiers in accuracy led to the results illustrated in Figure 6. As can be seen, the highest accuracy was obtained for the LSTM and the CNNGRU, which did not differ significantly. However, the LSTM showed better performance than the CNN (p < 0.01) and than LR, MLP, and SVM (p < 0.0001). The CNNGRU showed better performance than the CNN (p < 0.0001) and than LR, MLP, and SVM (p < 0.0001). LSTM and CNNGRU outperformed the other classifiers, so we decided to keep both; the first represents a more classical approach, and CNNGRU is a promising one.


Third Step: Tuning, Training, and Validating with the Experimental Data
The tuned LSTM model consisted of four stacked LSTM layers with a batch normalization layer before the final dense layer with sigmoid activation and an L2 kernel regularizer. The tuned CNNGRU model had a 1D convolutional layer, a max-pooling layer, and a GRU layer, then a flatten layer prior to the final dense layer with sigmoid activation. Both models used binary cross-entropy as the loss function, the Adam optimizer, one hundred training epochs, and a batch size of 32.
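Under the description above, the two tuned models might be sketched in Keras as follows; the layer widths, input shape, and regularization strength are illustrative assumptions, not the paper's tuned values:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

N_TIMESTEPS, N_FEATURES = 15, 6  # assumed input shape for illustration

# Tuned LSTM: four stacked LSTM layers, batch norm, L2-regularized sigmoid output
lstm_model = keras.Sequential([
    layers.Input(shape=(N_TIMESTEPS, N_FEATURES)),
    layers.LSTM(32, return_sequences=True),
    layers.LSTM(32, return_sequences=True),
    layers.LSTM(32, return_sequences=True),
    layers.LSTM(32),
    layers.BatchNormalization(),
    layers.Dense(1, activation="sigmoid", kernel_regularizer=regularizers.l2(1e-3)),
])

# Tuned CNNGRU: Conv1D -> max-pooling -> GRU -> flatten -> sigmoid output
cnngru_model = keras.Sequential([
    layers.Input(shape=(N_TIMESTEPS, N_FEATURES)),
    layers.Conv1D(32, kernel_size=3, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=2),
    layers.GRU(32, return_sequences=True),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])

# Both models share the loss, optimizer, and metric described in the text
for m in (lstm_model, cnngru_model):
    m.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```

The sigmoid output yields a probability of the CTL class, matching the binary cross-entropy loss used for training.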
A checkpointer was used to save the best weights so that the models could be reconstructed for predictions without retraining. The data processed in the previous steps were split, with 80% for training and 20% for validation. The graph of the loss function values as training and validation progress over the training epochs (Figure 7) is used to detect overfitting and underfitting: in underfitting, the validation curve stays above the training curve, while in overfitting, the validation curve, after matching the training curve at some point, starts to rise while the training curve remains low. In Figure 7, we see that, in the LSTM case (left), the validation curve converges to the training curve after about 80 training epochs, while in the CNNGRU model (right), both curves match each other, so neither overfitting nor underfitting affects the trained machines. Once the trained models were fixed, the experimental files were preprocessed following the processing pipeline selected in the previous steps and fed to the classifiers to assess classification performance.

Figure 8 (left) shows the receiver operating characteristic (ROC) curve of the LSTM model after applying the validation test and (right) the ROC curve of the CNNGRU model with the same data; in both cases, the local classification gave 100% accuracy with an area under the curve (AUC) of 1.
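The ROC curve and its AUC can be obtained with scikit-learn; the four scores below are hypothetical model outputs used only to illustrate the computation:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical predicted probabilities for four test segments
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# False-positive and true-positive rates at every decision threshold
fpr, tpr, _ = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)  # area under the ROC curve
```

An AUC of 1 means that every positive segment is scored above every negative one, which is what both models achieved on the local validation data.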

Fourth Step: Validating the Model with External Data
The success of the models in classifying our experimental participants in step three, with an area under the curve (AUC) of 1.0, could suggest that the models learned some artificial noise that correlated with our environment and/or participants. For this reason, we were compelled to test the trained models with the aforementioned external resting-state database.
To validate this approach, we downloaded a database from PRED-CT, the EEG resting-state depression and controls data [27]. Figure 9 (left) shows the ROC and AUC obtained after processing and feeding the external data into the LSTM model, and (right) the same with the CNNGRU model. This confirmed our hypothesis that a four-step methodological design focusing on only non-linear features, a few electrodes, and a short time-window span is enough to obtain a good classifier that could help in detecting a depressive state.

The models obtained high performance both in the classification of data obtained in the laboratory and in the classification of data originating from a public database (Table 2).


Discussion
This study demonstrated that applying a deep-learning-based classifier to just three electrodes during 15 s of closed-eyes EEG resting-state data, collected from a sample of college students, allows them to be accurately classified as depressed or non-depressed individuals. Once internally trained, the classifier successfully classified depressive and control participants with an accuracy of 100%. Moreover, it was successfully applied to detect depression in unseen EEG resting-state data from a publicly available source, with a comparable accuracy of 99%.
To achieve this performance, a data-driven methodological approach was followed, consisting of training and comparing various machine learning and deep learning classifiers using a selection of only three electrodes, a 15 s epoch, and a combination of six non-linear features. It turned out that the LSTM model and the new CNNGRU model were the most accurate. Both the LSTM and CNNGRU models retain a memory of the previous biosignal history, so they are especially appropriate in the analysis of time series wherein the hypotheses are related to time-extended signals, as is the case of the EEG resting state [27]. Although our study was data-driven, the fact that LSTM and CNNGRU were the best classifiers for depression suggests that the use of non-linear features in deep learning methods captures the long-standing and complex non-linear dynamics of the brain, serving as a potential diagnostic tool to detect abnormal EEG signatures associated with depression.
Although the performance of both models on the same data is quite similar, the CNNGRU model is in some respects better, since it is less complex, having one less layer than the LSTM model, and also requiring a shorter training time than LSTM. Nevertheless, overall, the method developed in this article offers higher accuracy than the CNNGRU reported by Liu et al. [11], mainly due to the specific selection of features and electrodes. They needed 16 electrodes to classify depressive people with an accuracy of 89.63%, while the current methodology reached almost perfect accuracy with a selection of only three electrodes and a few non-linear features. These methodological choices are crucial, as the CNNGRU model also improved its performance with our own data, using our selected electrodes, features, and procedures.
The group of three electrodes selected in this study (FC2, AFz, F2) have been associated with the activity of the caudal anterior cingulate cortex, according to Desikan's brain structural areas mapped on Giacometti's electrodes mapping algorithm [13]. This is congruent with the literature reporting that the symptoms of depression are associated with anatomical and functional impairments in the caudal anterior cingulate cortex [28,29]. Note, however, that the anatomical parcellation methods and the assignment of groups of electrodes to brain regions are primarily descriptive. To provide more accurate neuroanatomical and functional data on depression, neuroimaging methods or source estimation algorithms for high-density EEG set-ups are necessary. The minimal approach of three-electrodes EEG, used in this study, is appropriate for searching for efficient machine learning classification algorithms but not for accurate source estimation.
With our data-driven method, we found that the choice of a 15-s EEG epoch to train the algorithm was appropriate to detect depressive states. This fits well with the proposal that measures of functional connectivity and network organization of the brain tend to stabilize around 12 s [30]. On the other hand, we can assume that the non-linear features employed in our classification model are especially sensitive to events occurring over large time windows, which might index functional connectivity. For this reason, techniques with high temporal resolution, such as EEG and MEG, are potentially useful to capture the signatures of brain dynamics associated with depression and other mental health conditions [31].
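To make the idea of a window-sensitive non-linear feature concrete, sample entropy is a commonly used complexity measure for EEG; a lower value indicates a more regular signal. The implementation below is a generic sketch of the standard definition (Chebyshev distance, self-matches excluded), not necessarily one of the six features selected in this study.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy: -log(A/B), where B counts m-length template matches
    and A counts (m+1)-length matches within tolerance r (Chebyshev)."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)
    n = len(x)

    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(n - length)])
        count = 0
        for i in range(len(templates)):
            d = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += np.sum(d < r)
        return count

    B = count_matches(m)
    A = count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

t = np.linspace(0, 10 * np.pi, 500)
rng = np.random.default_rng(0)
se_regular = sample_entropy(np.sin(t))            # periodic: low entropy
se_noise = sample_entropy(rng.standard_normal(500))  # white noise: high entropy
```

As expected, the periodic signal yields a much lower sample entropy than white noise, illustrating how such measures separate regular from complex dynamics over an extended window.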
However, it is possible that efficient classifiers of structurally based brain diseases, such as delayed cerebral ischemia and seizures after subarachnoid hemorrhage [33], require a combination of linear features [32], while classifiers for functionally based conditions (such as depressive states) rely on a combination of non-linear features [8]. From this point of view, depression appears to be a complex functional condition that cannot be accurately identified through simple linear features of the EEG.
This study has some limitations, which may be addressed in future research. It was conducted with a population of young university students, half of whom showed depressive states according to a self-report questionnaire, although none of them had been diagnosed with depression, so the results cannot be generalized to other populations and depressive conditions. However, given the simplicity of the EEG resting-state protocol and the efficiency of the LSTM and CNNGRU as classification tools reported herein, the current methodology could be applied in the future to broader, more heterogeneous populations, including groups of different ages and clinically diagnosed patients. The classification algorithms and methods developed herein provide objective measures of brain dynamics, which can be used as a complementary diagnostic tool in clinics. They could be trained to distinguish among different types of depression, depression levels, and even anxiety and/or other comorbid disorders. In addition, they could be used for follow-up, as objective guidance on treatment outcomes, by applying the tool to depressed patients before and after treatment to assess their improvement beyond subjective impressions. Finally, this study raises the possibility of creating a portable, easy-to-use three-electrode EEG device, possibly linked to smartphones, which could be used in primary care services to screen for depressive cases, facilitating prevention.

Conclusions
This study demonstrates that, using only non-linear features and a few electrodes, we can train an algorithm to accurately classify young participants as depressed or nondepressed. Furthermore, the trained classifiers generalized their performance to external untrained databases. Deep learning approaches outperformed classical machine learning ones, achieving higher classification accuracy. Classifiers based on EEG data with few electrodes, like the ones used in this study, could potentially be implemented as brain-computer interfaces (BCIs) to be easily employed as a complementary tool to support diagnosis.

Conflicts of Interest:
The authors declare no conflict of interest.