Ensemble Approach for Detection of Depression Using EEG Features

Depression is a public health issue that severely affects one's well-being and can cause negative social and economic effects to society. To raise awareness of these problems, this research aims at determining whether the long-lasting effects of depression can be detected from electroencephalographic (EEG) signals. The article contains an accuracy comparison for SVM, LDA, NB, kNN, and D3 binary classifiers, which were trained using linear (relative band power, alpha power variability, spectral asymmetry index) and nonlinear (Higuchi fractal dimension, Lempel–Ziv complexity, detrended fluctuation analysis) EEG features. The age- and gender-matched dataset consisted of 10 healthy subjects and 10 subjects diagnosed with depression at some point in their lifetime. Most of the proposed feature selection and classifier combinations achieved accuracy in the range of 80% to 95%, and all the models were evaluated using 10-fold cross-validation. The results showed that the aforementioned EEG features used in classifying ongoing depression also work for classifying the long-lasting effects of depression.


Introduction
Depression is a major public health problem, creating a significant burden throughout the world. The World Health Organization (WHO) has predicted depression to be one of the most common causes of work disability [1]. According to disability-adjusted life-years or illness, depression ranks first in many European countries [2,3]. The largest aggregate study of the prevalence of mental disorders in the European population shows that clinically significant depression has been experienced by an average of 6.9% of the population in a 12-month period [2].
Depression is a mental disorder characterised by a pathologically low mood with a negative, pessimistic assessment of oneself, one's position in the surrounding reality, and one's future. Depression causes emotional, psychological, and physical suffering, which lead to a decrease in the patient's quality of life, family, work, and social adaptation, and often to disability. However, the worst consequence of depression is the increased risk of committing suicide.
Currently, the most common way to diagnose depression is an interview conducted by a medical professional. In many cases, the interview is accompanied by a clinical questionnaire assessed by a medical doctor, such as the Hamilton Depression Rating Scale (HAM-D), the self-reported Emotional State Questionnaire (EST-Q) [4], or the Mini-Mental State Examination (MMSE) [5], to establish the diagnostic criteria. Other questionnaires, such as the Beck Depression Inventory (BDI) [6] and the Hamilton Depression Rating Scale (HDRS) [7], are also used for screening purposes.
Besides subjective clinical questionnaires, the brain activity of the patients can be monitored objectively by applying various imaging modalities such as computed tomography (CT), functional magnetic resonance imaging (fMRI), and electroencephalography (EEG). Out of these techniques, EEG stands out as the simplest and most cost-effective. Hence, detecting mental states and disorders by using various EEG feature representations, such as methods based on fast Fourier transform (FFT), discrete wavelet transform (DWT), power spectral analysis (PSA), and others [8][9][10][11][12][13], is an actively researched field showing promising results. Various advanced machine learning algorithms have been utilised to analyse different modalities of such data and to introduce automated assessment of depression [13][14][15][16][17][18][19].
This paper reports the classification results obtained by using various linear and nonlinear features and provides a general insight into the feature calculation. The main contribution of the paper is the feature selection and the best-performing feature combinations. This article also describes several classifier configurations that improve the classification accuracy.

Related Work
According to de Aguiar Neto et al. [20], absolute and relative band powers and various other linear and also nonlinear features described in this section have been recognised as promising biomarkers for characterizing a depressed brain.
The absolute band power (ABP) and relative band power (RBP) of EEG signals have been analysed with separate three-way multivariate analyses of variance (MANOVA), which showed that the RBP was greater in depressed patients than in controls at all electrode locations, while the ABP was increased at some of the electrode locations [21].
The use of alpha power variability (APV) and relative gamma power (RGP) was proposed by Bachmann et al. [8]. While APV indicates the power and frequency variations in the alpha band, RGP characterises the high-frequency components. The differences between the depressed and control groups appeared statistically significant in a number of EEG channels, leading to a linear regression classification accuracy of 81%.
The spectral asymmetry index (SASI) indicates the relative asymmetry between higher and lower frequency bands. According to Hinrikus et al. [22], SASI values differed significantly in all channels between healthy and depressed patients. Single EEG channel analysis has already shown positive results in the detection of depression [8,23].
The nonlinear Higuchi's fractal dimension (HFD) calculates the fractal dimension of a signal in the time domain [24]. Bachmann et al. [25] applied the HFD method to EEG signals and evaluated it using a two-tailed, two-sample Student's t-test with unequal variance to find whether a statistical difference existed between depressed and healthy subjects. The alterations were statistically significant in all the EEG channels and indicated 94% of the subjects as depressive in the depressive group, while HFD indicated 76% of the subjects as non-depressive in the control group.
The nonlinear Lempel-Ziv complexity (LZC), introduced by Lempel and Ziv [26], measures the complexity of a signal and has been successfully used on EEG signals for the detection of different mental states [27,28]. EEG data from severe Alzheimer's disease patients showed a loss of complexity over a wide range of time scales, indicating a destruction of nonlinear structures in brain dynamics [29][30][31].
Detrended fluctuation analysis (DFA) [32], which indicates long-time correlations of the signal, was applied to evaluate EEG signals and revealed a statistically significant difference between healthy and depressive subjects [33]. In addition, linear discriminant analysis (LDA) reached a classification accuracy of 70.6%, and by combining DFA and the SASI, classification accuracy increased to 91.2% [23].
A comprehensive study by Bachmann et al. [8] showed the diagnostic potential for linear (SASI, APV, RGP) and nonlinear (HFD, DFA, LZC) features to classify depression. Single-channel classification with logistic regression achieved an accuracy of 81% using APV or RGP measures. The combination of two linear measures, the SASI and RGP, reached an accuracy of 88%, and by combining linear and nonlinear measures, a classification accuracy of 92% was achieved [8].

EEG Recording Procedure
The Cadwell Easy II EEG (Kennewick, WA, USA) measurement equipment was used for EEG recordings with 18 channels (reference Cz), which were placed on the subject's head according to the international 10-20 electrode position classification system, as shown in Figure 1. During the recordings, the subjects were lying in a relaxed position with their eyes closed. EEG signals within the frequency band of 3-48 Hz were used for further processing. The sample rate was kept at 400 Hz for linear methods, while the downsampled signals with a sample rate of 200 Hz were used for nonlinear methods, due to the high computational load. The 20 min-long EEG recording was segmented into 10 s segments, and an experienced EEG specialist marked the first 30 artefact-free segments (5 min in total) by visual inspection, for the subsequent feature calculation. The gathering of questionnaires and EEG recordings was carried out by Tallinn University of Technology (TalTech), in accordance with the Declaration of Helsinki, and the process was formally approved by the Tallinn Medical Research Ethics Committee. All participants signed a written informed consent. The dataset itself was provided to the authors by Tallinn University of Technology under a legal agreement for research purposes. (Information about obtaining the dataset can be requested by contacting M. Bachmann at maie.bachmann@taltech.ee.)

Dataset
The recorded dataset consisted of the EEG signals from 20 subjects, who were selected for further analyses from 55 subjects, who regularly visited the occupational health doctor. The dataset consisted of 14 females and 6 males within the age range of 24-60 y. Half of the subjects selected had been diagnosed with depression at some point in their lives (referred to as depressed subjects for simplicity), while the healthy control group had never had a depression diagnosis. In addition, the healthy control group was chosen considering their low HAM-D and EST-Q scores, to ensure they did not exhibit any signs of depression or other mental disorders (see Table 1). All subjects were gender matched, and the subject age for healthy controls was chosen to be as close as possible to the age of depressed subjects.

Hamilton Depression Rating Scale
The HAM-D is the most widely used clinician-administered depression assessment scale. Although the rating scale has been criticised for use in clinical practice, in this study, it was used as additional information for selecting healthy subjects. In situations where more than one healthy subject was a match candidate for a depressive subject, the one with the lowest HAM-D score was chosen. The mean HAM-D score among the healthy subjects was 3.1, which falls within the 0-7 range indicating no depression, while the mean score of 9.3 for the depressive subjects corresponds to mild depression.

Emotional State Questionnaire
The Emotional State Questionnaire (EST-Q) [34] was originally compiled for use by the lecturers of the psychiatric clinic of the University of Tartu in Estonia. The self-assessed questionnaire consists of 28 statements assessing the major depressive and anxiety disorders and their associated symptoms during the last month. The questionnaire consists of 3 basic scales and 3 additional scales. Major scales include the depression (DEP), general anxiety (AUR), and panic agoraphobia subscales (PAF). Additional subscales include social anxiety (SAR), asthenia (AST), and insomnia, which was not used. The scale's total score can be used as an overall indicator of the severity of emotional symptoms. The EST-Q was used in the current study for selecting healthy subjects. The subscale values of all the selected subjects were below the threshold for the given condition, except for 2 healthy subjects, whose asthenia subscale was greater than 6. Other threshold values can be found in Table 1. If the scale value is greater than the listed threshold, then the subject has the given condition.

Features
EEG brain signals are nonlinear by nature and linked to particular brain activity, which can be analysed through various linear and nonlinear signal-processing methods.

Alpha Power Variability
The alpha band signal (8-12 Hz) was obtained by a band-pass filter. Next, the APV was calculated for the artefact-free 10 s segments in three steps. First, the alpha band signal power in time window T for N = 4000 samples was calculated as:

$$W = \frac{1}{N}\sum_{r=1}^{N} V(r)^2,$$

where $V(r)$ is the amplitude of the alpha band signal in a sample $r$ and $N$ is the number of samples in the time window $T$. Second, the segment powers were averaged over the 30 segments (5 min) to obtain $W_0$. Afterwards, APV was calculated as:

$$\mathrm{APV} = \frac{\sigma}{W_0},$$

where $W_0$ is the value of alpha band power averaged over 5 min and $\sigma$ is the standard deviation of the power over those segments.
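The APV steps can be illustrated with a minimal sketch. It assumes that the segment power is the mean squared amplitude and that APV is the ratio of the standard deviation of the segment powers to their mean; the function names and the choice of the population standard deviation are assumptions made for illustration, not details taken from the study.

```python
import statistics


def segment_power(segment):
    """Mean squared amplitude W of one alpha-band segment V(1..N)."""
    return sum(v * v for v in segment) / len(segment)


def alpha_power_variability(segments):
    """Spread of the segment powers relative to their mean W_0."""
    powers = [segment_power(seg) for seg in segments]
    w0 = statistics.fmean(powers)      # alpha power averaged over all segments
    sigma = statistics.pstdev(powers)  # population std: an assumption
    return sigma / w0


# Two toy "segments" whose powers are 1 and 4 (real input: 30 segments of 4000 samples)
toy_segments = [[1.0, -1.0], [2.0, -2.0]]
print(alpha_power_variability(toy_segments))
```

In practice the function would receive the 30 artefact-free, band-pass-filtered segments of one channel.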

Spectral Asymmetry Index
The SASI evaluates the power in higher and lower frequencies and was calculated as the relative difference between the higher and the lower EEG frequency band power. The balance of the powers characterises the EEG spectral asymmetry [22]. The powers $W_{l,mn}$ and $W_{h,mn}$ in the lower and higher frequency bands were calculated from the power spectral density on either side of $F_c$, where $F_c$ is the central frequency of the EEG spectrum maximum in the alpha band and was calculated for each person individually. The SASI in channel $m$ for a subject $n$ was calculated as:

$$\mathrm{SASI}_{mn} = \frac{W_{h,mn} - W_{l,mn}}{W_{h,mn} + W_{l,mn}}.$$
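A short sketch of the SASI calculation follows, assuming that the index is the difference of the higher- and lower-band powers divided by their sum. The power spectral density is taken as given, and the band edges around the alpha peak used in the usage lines are illustrative placeholders, not values stated in this section.

```python
def band_power(freqs, psd, f_low, f_high):
    """Sum of PSD bins whose frequency lies in [f_low, f_high]."""
    return sum(p for f, p in zip(freqs, psd) if f_low <= f <= f_high)


def alpha_peak_frequency(freqs, psd, alpha_band=(8.0, 12.0)):
    """Frequency of the spectral maximum inside the alpha band (F_c)."""
    candidates = [(p, f) for f, p in zip(freqs, psd)
                  if alpha_band[0] <= f <= alpha_band[1]]
    return max(candidates)[1]


def sasi(freqs, psd, low_band, high_band):
    """Relative difference between higher- and lower-band power."""
    w_low = band_power(freqs, psd, *low_band)
    w_high = band_power(freqs, psd, *high_band)
    return (w_high - w_low) / (w_high + w_low)


# Toy spectrum: flat PSD with a single alpha peak at 10 Hz
freqs = [float(f) for f in range(1, 41)]
psd = [1.0] * len(freqs)
psd[9] = 5.0
fc = alpha_peak_frequency(freqs, psd)
# Band edges relative to fc are hypothetical placeholders
print(sasi(freqs, psd, low_band=(fc - 6.0, fc - 2.0), high_band=(fc + 2.0, fc + 26.0)))
```

By construction the index lies between -1 and 1, with positive values indicating more power above the alpha peak than below it.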

Nonlinear Features
Nonlinear methods are used to capture the chaotic behaviour in EEG signals, which occurs due to the underlying physiological activity occurring in the brain [35]. To describe the brain activity of the subjects, we used the Higuchi fractal dimension (HFD), Lempel-Ziv complexity (LZC), and detrended fluctuation analysis (DFA).

Higuchi Fractal Dimension
The fractal dimension provides a measure of the complexity of a time series such as EEG. The values of the HFD for each electrode were calculated according to Higuchi [24] with the parameter $k_{\max} = 8$.
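For orientation, the following is a minimal pure-Python sketch of Higuchi's algorithm as it is commonly stated: curve lengths are computed for delays k = 1..k_max and the dimension is the slope of ln L(k) against ln(1/k). The helper names are illustrative, and the sketch makes no claim to match the exact implementation used in the study.

```python
import math
import random


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def higuchi_fd(x, k_max=8):
    """Higuchi fractal dimension of the sequence x (x must not be constant)."""
    n = len(x)
    log_inv_k, log_length = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):                 # start offset of each sub-series
            n_steps = (n - 1 - m) // k     # number of increments available
            if n_steps < 1:
                continue
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                       for i in range(1, n_steps + 1))
            # Normalised curve length for this offset and delay
            lengths.append(dist * (n - 1) / (n_steps * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        log_length.append(math.log(sum(lengths) / len(lengths)))
    return least_squares_slope(log_inv_k, log_length)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
print(higuchi_fd([float(i) for i in range(200)]))  # smooth ramp, close to 1
print(higuchi_fd(noise))                           # white noise, close to 2
```

A straight line and white noise mark the two ends of the range, which is why they are convenient sanity checks.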

Lempel-Ziv Complexity
The complexity of the signal can be quantified by the LZC [36], describing the spatiotemporal activity patterns in high-dimensional nonlinear systems. This can reveal the regularity and randomness in EEG signals. For LZC calculation, each signal segment was converted into a binary sequence $s(n)$ as follows:

$$s(n) = \begin{cases} 0, & x(n) < m \\ 1, & x(n) \ge m \end{cases}$$

where $x(n)$ is the signal segment, $n$ is the segment's sample index from 1 to $N$ (segment length), and $m$ is the threshold value. The binary sequence $s(n)$ was scanned from left to right, counting the number of different patterns. The complexity value $c(n)$ was increased every time a new pattern was encountered. LZC values were calculated as follows:

$$\mathrm{LZC} = \frac{c(n)}{b(N)},$$

where $b(N)$ is the upper bound of $c(n)$:

$$b(N) = \frac{N}{\log_2 N},$$

which was used to normalise LZC values to avoid variations due to segment length.
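The scanning procedure can be sketched as below. The sketch assumes the Lempel-Ziv (1976) exhaustive parsing, the median as the default threshold m, and the normalisation b(N) = N / log2(N); the text does not state the threshold, so the median is an assumption, and the function names are illustrative.

```python
import math
import statistics


def binarise(x, threshold=None):
    """Map samples to '1' if x(n) >= threshold, else '0' (median by default)."""
    if threshold is None:
        threshold = statistics.median(x)  # assumed threshold choice
    return "".join("1" if v >= threshold else "0" for v in x)


def lz76_phrase_count(s):
    """Number of new patterns found when scanning s from left to right."""
    n = len(s)
    i, count = 0, 0
    while i < n:
        length = 1
        # Grow the candidate pattern while it already occurs earlier in s
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lempel_ziv_complexity(x, threshold=None):
    """Pattern count normalised by the upper bound N / log2(N)."""
    s = binarise(x, threshold)
    n = len(s)
    return lz76_phrase_count(s) / (n / math.log2(n))


print(binarise([1, 5, 3, 7]))
print(lz76_phrase_count("0001101001000101"))
```

The 16-symbol string in the usage line is parsed into the patterns 0, 001, 10, 100, 1000 and 101.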

Detrended Fluctuation Analysis
DFA is applied to evaluate the presence and persistence of long-range correlations in time in EEG signals. It has been discovered that the resting EEG of healthy subjects exhibits persistent long-range correlation over time [33]. DFA was calculated in the time domain according to the steps described by Peng et al. [32].
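As a rough illustration of the time-domain procedure, the sketch below follows the usual DFA recipe: integrate the mean-removed signal, split the profile into windows, remove a linear trend from each window, and fit the slope of log F(n) against log n. The window sizes and function names are assumptions chosen for the example, not parameters reported in the study.

```python
import math
import random


def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / den
    return slope, mean_y - slope * mean_x


def dfa_exponent(x, scales=(16, 32, 64, 128, 256)):
    """Scaling exponent of the detrended fluctuation function F(n)."""
    mean = sum(x) / len(x)
    profile, total = [], 0.0
    for v in x:                      # cumulative sum of the mean-removed signal
        total += v - mean
        profile.append(total)
    log_n, log_f = [], []
    for n in scales:
        t = list(range(n))
        sq_sum, count = 0.0, 0
        for start in range(0, len(profile) - n + 1, n):
            window = profile[start:start + n]
            slope, intercept = linear_fit(t, window)
            # Squared residuals around the local linear trend
            sq_sum += sum((y - (slope * ti + intercept)) ** 2
                          for ti, y in zip(t, window))
            count += n
        log_n.append(math.log(n))
        log_f.append(math.log(math.sqrt(sq_sum / count)))
    return linear_fit(log_n, log_f)[0]


rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(4096)]
print(dfa_exponent(white))  # uncorrelated noise gives an exponent near 0.5
```

Exponents above 0.5 indicate persistent long-range correlations, which is the property the text refers to.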
All methods were evaluated using 10-fold cross-validation; in addition, to keep the training data as balanced as possible, each fold had an equal number of healthy and depressed subjects. In the case of predictions for the weighted and boosted ensemble, the training set in each fold underwent an additional nine-iteration procedure (see Figure 2) to obtain prediction results for all samples in the training fold. Afterwards, the weights W of the classifier votes were fit according to the results in the training set. Similarly, AdaBoost used predicted class results from the training set to calculate weights for each of the classifiers in the ensemble.
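The fold construction can be sketched as follows. This is one reading of the description, in which every fold receives the same number of subjects from each class and each training set is re-split by holding out one of its nine folds in turn; Figure 2 is not reproduced here, so the inner loop and the function names are assumptions.

```python
import random


def balanced_folds(labels, n_folds, seed=0):
    """Distribute sample indices over folds, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def nested_splits(folds):
    """Yield (train, test, inner) where inner holds the 9 inner (fit, held_out) pairs."""
    for k, test in enumerate(folds):
        training_folds = folds[:k] + folds[k + 1:]
        inner = []
        for j, held_out in enumerate(training_folds):
            rest = training_folds[:j] + training_folds[j + 1:]
            inner.append(([i for f in rest for i in f], held_out))
        train = [i for f in training_folds for i in f]
        yield train, test, inner


labels = [0] * 10 + [1] * 10          # 10 healthy and 10 depressed subjects
folds = balanced_folds(labels, n_folds=10)
for train, test, inner in nested_splits(folds):
    print(test, len(train), len(inner))
```

With 20 subjects, each outer test fold contains one subject per class, and the inner loop yields out-of-sample predictions for all 18 training subjects.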

Feature Selection
It is known that cognitive disorders can introduce observable change in measured EEG recordings. Depending on the feature calculations used, each brain region might have a statistically significant difference when compared to cognitively normal patients' brains. Therefore, to select the most relevant electrode locations, we used feature subset selection methods that were applied in a preprocessing step before machine learning algorithms were applied. In particular, we used the F-test, which is widely used for showing a statistical significance between two classes, and ReliefF, which is a rank-based feature selector.

Univariate Feature Ranking Using F-Tests
The univariate feature ranking algorithm helps to understand the significance of each feature by examining the importance of each predictor individually using an F-test. Each F-test tests the hypothesis that the response values grouped by predictor variable values are drawn from populations with the same mean against the alternative hypothesis that the population means are not all the same [37].
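A minimal sketch of such a ranking is given below, assuming a one-way ANOVA F statistic computed per feature with the class label as the grouping variable. It ranks by the F statistic directly rather than by a p-value; for a fixed number of samples and classes the two orderings coincide. The function names are illustrative.

```python
import math


def f_statistic(values, labels):
    """One-way ANOVA F statistic of one feature grouped by class label."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, g = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {lab: sum(vs) / len(vs) for lab, vs in groups.items()}
    between = sum(len(vs) * (means[lab] - grand_mean) ** 2
                  for lab, vs in groups.items())
    within = sum((v - means[lab]) ** 2
                 for lab, vs in groups.items() for v in vs)
    if within == 0.0:
        return math.inf if between > 0.0 else 0.0
    return (between / (g - 1)) / (within / (n - g))


def rank_features(X, y):
    """Feature indices sorted from most to least class-separating."""
    n_features = len(X[0])
    scores = [f_statistic([row[f] for row in X], y) for f in range(n_features)]
    return sorted(range(n_features), key=lambda f: scores[f], reverse=True)


# Toy data: feature 0 separates the classes, feature 1 does not
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [1.1, 2.0]]
y = [0, 0, 1, 1]
print(rank_features(X, y))
```

In the setting of this study, each column would hold one electrode's value of a given feature type and each row one subject.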

ReliefF
The base algorithm Relief, created by Kira and Rendell [38], is an inductive learning system that was initially developed for binary classification problems with discrete and numerical features. The algorithm penalises the predictors that give different values to neighbours of the same class and rewards predictors that give different values to neighbours of different classes. ReliefF, which is an extended version of the Relief algorithm, was developed by Kononenko et al. [39], who proposed the L1 distance for finding near-hit and near-miss instances.
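The reward and penalty rule can be made concrete with a small sketch of the basic Relief update for two classes, using the L1 distance and a single nearest hit and nearest miss per sample. It deliberately omits the k-neighbour averaging and multi-class handling of full ReliefF; the range normalisation and the function name are assumptions made for the example.

```python
def relief_weights(X, y):
    """Basic Relief feature weights (binary labels, one nearest hit and miss)."""
    n, n_features = len(X), len(X[0])
    spans = []
    for f in range(n_features):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)  # avoid division by zero

    def diff(f, a, b):
        """Range-normalised difference of feature f between samples a and b."""
        return abs(X[a][f] - X[b][f]) / spans[f]

    def l1(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    weights = [0.0] * n_features
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: l1(i, j))
        miss = min(misses, key=lambda j: l1(i, j))
        for f in range(n_features):
            # Reward separation from the near miss, penalise spread within the class
            weights[f] += (diff(f, i, miss) - diff(f, i, hit)) / n
    return weights


# Toy data: feature 0 follows the class, feature 1 is unrelated to it
X = [[0.0, 0.1], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [0, 0, 1, 1]
print(relief_weights(X, y))
```

Features are then ranked by descending weight, so feature 0 would be selected first in this toy case. The sketch requires at least two samples per class.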

Machine Learning Algorithms
The supervised learning algorithms used in this study have been widely used in various EEG classification tasks according to survey papers published by Lakshmi et al. [40] and other articles, which describe the use of the following algorithms for binary classification:
• Support vector machine (SVM) [41] with the radial basis function (RBF) kernel;
• Linear discriminant analysis (LDA) [42] with the diagonal covariance matrix for each class;
• Naive Bayes (NB) [43];
• K-nearest neighbours (kNN) [44] with 4 neighbours;
• Decision tree (D3) [43].
In addition to individually evaluating the results for the listed classifiers and feature types, an ensemble approach was also implemented, where classifiers trained on all 9 feature types vote to predict the class label.

Ensemble Methods
The implemented ensemble [45] combined the classifier votes either by majority voting, where all weights are equal, or by weighted voting, where weights are set according to classifier test set accuracy, which was obtained by the procedure shown in Figure 2. The ensemble assigns Label to a given sample according to the following equation:

$$\mathrm{Label} = \sum_{i=1}^{m} w_i d_i,$$

where $m$ indicates the number of classifiers, $w_i$ is the classifier weight, and $d_i$ is the classifier decision. The class label is then decided by thresholding this weighted sum. As a third ensemble method, we chose adaptive boosting (AdaBoost) [46], to see if it was possible to find a better weight combination than with majority and weighted voting. The aim of AdaBoost is to convert a set of weak classifiers into a strong classifier.
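A weighted vote of this kind can be sketched in a few lines. The sketch assumes decisions encoded as 0 (healthy) or 1 (depressed) and a threshold at half of the total weight, with ties resolved towards class 0; the encoding, the threshold, and the tie rule are assumptions, since the decision rule itself is not spelled out here.

```python
def weighted_vote(decisions, weights=None):
    """Combine 0/1 classifier decisions; equal weights give majority voting."""
    if weights is None:
        weights = [1.0] * len(decisions)
    score = sum(w * d for w, d in zip(weights, decisions))
    # Class 1 wins only if it holds more than half of the total weight
    return 1 if score > 0.5 * sum(weights) else 0


def accuracy(predictions, truth):
    """Fraction of correct predictions, usable as a classifier weight."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Majority voting: two of three classifiers vote for class 1
print(weighted_vote([1, 1, 0]))
# Weighted voting: one accurate classifier outweighs two weaker ones
print(weighted_vote([1, 0, 0], weights=[0.9, 0.3, 0.3]))
```

With nine voters and equal weights, as in the ensembles described in this paper, a tie cannot occur.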

Results and Discussion
The baseline accuracy was established by individually evaluating all the feature types. As shown in Table 2, the classifiers reached acceptable accuracy, with the HFD and LZC exceeding 80% with at least one of the classifiers. For other feature types, selecting all electrodes from a feature type did not guarantee the best classification results. For some of the feature types, it is shown that only a few electrodes provided statistically relevant information and the remaining electrodes could be considered as not relevant or redundant [8]. A brute-force approach can be used to check all feature combinations to find which feature sets perform better than others, but this would be a time-consuming process. Therefore, the most relevant features were determined according to the feature ranking provided by the F-tests and the ReliefF algorithm.
The selected feature evaluation started with the most relevant feature, and in each iteration, the next-less-relevant feature was added to the feature set used in classification. The ranking of the features was provided by the feature-selection algorithms. Each iteration underwent 10-fold cross-validation. The optimal feature set was selected according to the highest root-mean-squared (RMS) value calculated from the accuracy of all five classifiers for each feature type. Figure 3 shows an example of feature selection according to the described procedure, where electrodes {O2, O1} were selected as the best option for the B_rbp feature type, as the highest RMS value was at the O1 electrode. Similarly, the procedure was repeated for all feature types to obtain the best-performing features shown in Table 3. As a limitation, all proposed feature combinations had to be classified; therefore, it can be considered a computationally heavy process when EEG with more electrodes or large datasets are used. Comparing the baseline results (Table 2) with the selected-feature classification results in Tables 4 and 5, it can be observed that, on average, the selected features based on the F-test ranking outperformed the baseline results, and ReliefF had the best overall classification results. In addition, to reduce the effect of subject order in the dataset, the obtained classification results represent the mean results of 100 iterations where the subject location in the training and testing set was randomised.
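The incremental selection can be sketched as a search over prefixes of the ranked feature list. The `evaluate` callback stands in for the cross-validated accuracies of the five classifiers and is a hypothetical placeholder; the accuracy values in the usage lines are made-up toy numbers, not results from this study.

```python
import math


def rms(values):
    """Root-mean-squared value of a list of accuracies."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def best_prefix(ranked_features, evaluate):
    """Grow the feature set in ranking order and keep the best-scoring prefix."""
    best_set, best_score = [], -1.0
    for size in range(1, len(ranked_features) + 1):
        subset = ranked_features[:size]
        score = rms(evaluate(subset))  # one accuracy per classifier
        if score > best_score:
            best_set, best_score = subset, score
    return best_set, best_score


# Toy stand-in for cross-validation: accuracies depend only on subset size
toy_accuracies = {1: [0.7, 0.7], 2: [0.9, 0.8], 3: [0.6, 0.9]}
ranked = ["O2", "O1", "P3"]
print(best_prefix(ranked, lambda subset: toy_accuracies[len(subset)]))
```

Because only prefixes of the ranking are tried, the number of evaluations grows linearly with the number of electrodes instead of exponentially as in a brute-force search.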
A more robust solution can be achieved using an ensemble approach where many weak classifiers contribute to the predicted class by voting. Each result shown in Table 6 was the result of combining nine classifiers of the same type. The features used in each feature type were selected according to Table 3. The ensemble approach further improved the results when F-tests and ReliefF feature selection algorithms were used. On average, the ReliefF classification results outperformed ensembles whose features were selected according to F-tests. The use of AdaBoost for classifier weight selection in most of the cases did not significantly improve the results compared to the majority and weighted voting ensemble. Due to the nature of AdaBoost, during the weight calculation process, the algorithm can reach optimal weights using only a few of the classifiers and ignore the rest, which can hinder the robustness of the ensemble.
Instead of focusing on the classification of feature types individually, combined features were also evaluated. Table 7 clearly shows the benefit of feature selection when compared to using all 162 features. For the most part, the classification results for features selected based on F-tests and ReliefF were higher than the baseline results for selecting all features. In addition, feature selection from all features gave promising results, especially when using only the top-ranked features based on ReliefF. Features used in Table 7 (last row) were selected according to the same procedure used for feature types.

Conclusions
This study showed the results for linear (RBP, APV, SASI) and nonlinear (HFD, LZC, DFA) EEG features in various combinations for classification of long-lasting effects of depression.
The described feature types and classification methods (RBF SVM, LDA, NB, kNN, D3) were used to classify 20 age- and gender-matched subjects. The 10 healthy subjects and the 10 subjects who had had depression were classified with 82.55% accuracy with the HFD using D3 and 80.70% with the LZC using the RBF SVM binary classifier. The results improved when feature-selection algorithms such as univariate feature ranking using F-tests and ReliefF were used, which improved the classification accuracy up to 91.5%. In addition, the ensemble setup with majority voting reached 93.30% using the NB classifier. The results also suggest that electrodes A_rbp.O1, A_rbp.O2, and B_rbp.O2 selected from all available features according to ReliefF were sufficient to classify the subjects with 80-95% accuracy. The best combination, which achieved high accuracy among all classifiers, was an ensemble using ReliefF-selected features with equally weighted predictions for all feature types. The study shows that EEG features used in classifying patients with depression at the time of the recording can also be used to measure and classify the long-lasting effects of depression.
The obtained results give reasonable justification for further gathering of EEG data according to the currently used protocol to measure the long-lasting effects of depression. As future work, we are planning to raise funding for a large-scale study and further test the proposed approach with the aim of using it in assisted diagnostics.