Performance Evaluation of Epileptic Seizure Prediction Using Time, Frequency, and Time–Frequency Domain Measures

The prediction of epileptic seizures is crucial to give patients early warning and enable effective intervention. Several features have been explored to predict seizure onset from electroencephalography signals, which are typically non-stationary, dynamic, and vary from person to person. In previous literature, the same features have been applied to all patients, implicitly assuming they contribute equally. Therefore, in this paper, we analyze the impact of patient-specific combinations of features and channels from the time, frequency, and time-frequency domains on prediction performance. Based on the minimal-redundancy-maximal-relevance criterion, the proposed framework uses a sequential forward selection approach to individually find the optimal features and channels. The trained models discriminate pre-ictal from inter-ictal electroencephalography with a sensitivity of 90.2% and a false prediction rate of 0.096/h. We also compare the classification accuracy obtained by the optimal features, by several features summarized from the optimal features, and by the complete set of features from the three domains. The results indicate that feature-channel selection is patient-specific. Furthermore, a detailed list of optimal features and summarized features is provided for reference by researchers working with the corresponding database.


Introduction
Epilepsy is the fourth most common neurological disorder, with approximately 50 million patients worldwide, and it affects people of all ages [1]. Many pre-onset symptoms are not visible to observers; thus, family and friends may fail to witness them in time [2]. It is an irrefutable fact that a method to predict the occurrence of an epileptic seizure would significantly improve the possibility of treatment.
Electroencephalography (EEG) can be used to study changes in brain activity, which is commonly applied in epilepsy research [3]. Epilepsy studies distinguish two types of EEG signals, based on the way they are recorded, namely intracranial EEG (iEEG) and scalp EEG (sEEG). The iEEG is obtained by invasive electrodes, while sEEG is recorded through electrodes attached to specific locations on the scalp. Since sEEG is obtained in a non-invasive way, it has the advantages of easy access and greater safety.
There are four states in the EEG recording of a patient with epilepsy, namely inter-ictal, pre-ictal, ictal, and post-ictal (as shown in Figure 1). The seizure prediction problem involves designing models to distinguish the pre-ictal state from the inter-ictal state, while seizure detection focuses on discriminating the ictal state. Although a seizure detection model can detect seizures once they occur, it cannot provide the early warning needed for monitoring and treatment.
Nowadays, machine learning is the leading technology for epilepsy prediction, and feature extraction is one of its key procedures for improving classification performance. Previous work has mainly used one or a few features, or focused on the improvement of a certain feature.
Even when multiple features are used, the same features are extracted from different patients. However, due to the nonlinear and non-stationary nature of EEG signals and the differences in seizure types between individuals, a small, fixed set of features may not suit all patients. Hence, the aim of this study is to examine whether a patient-specific feature design principle can achieve relatively high improvement rates. To address these issues, we extract features of EEG signals from 18 channels in the time, frequency, and time-frequency domains. We then use a sequential forward selection method to find the optimal feature-channels for each patient based on the output of minimal-redundancy-maximal-relevance (mRMR). Beyond that, several features statistically summarized from the optimal features, as well as all features extracted from these three domains, are also explored to discriminate between pre-ictal and inter-ictal states. In addition, the models trained with the optimal features perform well on extra undefined pre-ictal window data for the majority of patients. The contributions of this study are as follows:

1. We verify the importance of patient-specific features in seizure prediction.

2. We comprehensively summarize the features of the time, frequency, and time-frequency domains and their interpretations in predicting seizures using EEG signals.

3. The optimal features can provide guidance for studying each patient separately, and the several features summarized from them have implications for the general design principle of an epileptic prediction system.
The remainder of the paper is organized as follows. Section 2 provides the details of our proposed method. Section 3 presents the results of this method, discussions of the results, and comparisons with related work. Finally, the paper is concluded in Section 4.

Materials and Methodology
The framework suggested in the present study consists of four stages: preprocessing, feature extraction, feature ranking, and classification. The overall implementation process of this methodology is depicted in Figure 2.

EEG Data
The CHB-MIT Scalp EEG database [4] used in this work includes data from 22 pediatric patients with intractable seizures at the Children's Hospital Boston. EEG signals were sampled at 256 samples per second and usually contain 23 channels, but in some cases contain 24 or 26 channels. The start and end times of seizures, judged by clinical experts, are stated in annotation files.
Each case contains between 9 and 42 recordings (edf files) from a single patient. All recordings were grouped into 23 cases. In particular, case chb21 was obtained from the same patient 1.5 years after obtaining chb01. For convenience, "patient" and "case" in this work have the same meaning. The detailed information on these 23 cases is listed in Table 1.
As shown in Figure 1, seizure prediction is used to distinguish pre-ictal and inter-ictal states. The pre-ictal interval needs to be defined in a seizure prediction study; nevertheless, its length is still controversial. Mormann [2] holds that, when univariate measures are used to classify the EEG signal, changes appear 30 min before onset. Taking into account that the activity of EEG signals varies from patient to patient, we use 15 min as the pre-ictal interval. Other principles are as follows:

1. When a second seizure occurs within 15 min after the end of the first seizure, the data of the latter seizure are discarded.

2. The previous consecutive recording is used to supplement data that do not cover the full 15 min.

3. Any previous recording with a gap larger than 5 s from the current recording is marked as non-consecutive.
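The rules above can be sketched as a small helper that keeps or discards pre-ictal windows. The function below is our own illustration, not code from the paper: it handles Rule 1 explicitly and simply clips at the recording start instead of implementing the cross-file supplementing of Rules 2 and 3.

```python
# Sketch of the pre-ictal labeling rules, assuming seizures are given as
# (start, end) times in seconds on one continuous recording timeline.
# Function and variable names are illustrative, not from the paper.

PRE_ICTAL_SEC = 15 * 60  # 15-minute pre-ictal interval


def preictal_windows(seizures):
    """Return (win_start, win_end) pre-ictal windows per the paper's rules.

    Rule 1: if a seizure starts within 15 min of the previous seizure's
    end, the latter seizure is discarded.
    """
    kept = []
    windows = []
    for start, end in sorted(seizures):
        if kept and start - kept[-1][1] < PRE_ICTAL_SEC:
            continue  # Rule 1: too close to the previous seizure -> discard
        kept.append((start, end))
        # Rules 2-3 would extend the window into a previous consecutive
        # recording; here we simply clip at time 0 for brevity.
        windows.append((max(0.0, start - PRE_ICTAL_SEC), start))
    return windows
```

For example, two seizures 440 s apart yield only one pre-ictal window, since the second seizure is discarded by Rule 1.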
To eliminate low-frequency activity and high-frequency noise and obtain quasi-stationary signals, a 5th-order Butterworth filter band-passing between 0.5 and 30 Hz is employed.
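Such a filter can be built with SciPy; zero-phase `sosfiltfilt` is used here, which is a common choice but not stated in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 256  # CHB-MIT sampling rate, samples per second


def bandpass_05_30(x, fs=FS):
    """5th-order Butterworth band-pass between 0.5 and 30 Hz."""
    sos = butter(5, [0.5, 30.0], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)


# Example: a 10 Hz component passes, a 60 Hz component is attenuated.
t = np.arange(0, 4, 1 / FS)
x = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 60 * t)
y = bandpass_05_30(x)
```

The second-order-sections form is used for numerical stability at this filter order.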

Feature Extraction
Feeding raw EEG signals directly into the classifier may adversely affect output quality. Thus, we extract features from the preprocessed EEG signals. Several features have been shown to discriminate changes in EEG signals. These features can be categorized into the time domain, frequency domain, and time-frequency domain. A variety of the most common features from these three domains is extracted in this study. The extracted features are summarized in Table 2, and are described and explained with mathematical expressions in Appendix A.

Time domain analysis uses the EEG signal directly, that is, it analyzes the electrode voltage amplitude with reference to the time series. Time domain analysis can show how the signal changes over time, which is commonly used in other fields [5], and it reflects the physical meaning of the signal straightforwardly. In terms of time domain features, the basic statistical characteristics (e.g., mean value, skewness, and kurtosis) are computed, as well as energy-related, entropy-related, and randomness-related features, Hjorth parameters, and the numbers of zero-crossings and local extrema. In addition, line length can reflect changes in signal amplitude and frequency, and fluctuations in the time series can be studied by detrended fluctuation analysis.
Frequency domain analysis is the analysis of the signal with reference to frequency, while the spectral information of EEG is used. It can reflect the distribution of frequency components or signal power in the whole frequency range. Research in other fields [6] often benefits from frequency domain analysis. Power Spectral Density (PSD) is the measure of a signal's power content versus frequency. The fast Fourier transform is applied using Welch's method [7] to obtain the PSD of signals. The energy and entropy of PSD, intensity weighted mean frequency and bandwidth, edge frequency, and peak frequency are extracted in the frequency domain.
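As a concrete illustration, the PSD of a test rhythm can be estimated with SciPy's implementation of Welch's method; the segment length chosen here (`nperseg=256`, giving 1 Hz resolution at 256 Hz) is our own assumption, not a parameter reported in the paper.

```python
import numpy as np
from scipy.signal import welch

fs = 256
t = np.arange(0, 8, 1 / fs)
x = np.sin(2 * np.pi * 12 * t)  # a 12 Hz test rhythm

# Welch's method: average periodograms of overlapping windowed segments.
freqs, psd = welch(x, fs=fs, nperseg=256)

peak_freq = freqs[np.argmax(psd)]  # expected near 12 Hz
```

The frequency-domain features of the paper (energy-PSD, IWMF, edge and peak frequency, etc.) would then be computed from `freqs` and `psd`.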
Time-frequency domain analysis, which is also applied in other fields [8,9], can obtain the relationship between frequency and time. Time-frequency analysis is necessary because the frequency characteristics of EEG signals change constantly. The wavelet transform is known to capture both frequency and location in signals. Delta (0.5-3 Hz), Theta (4-7 Hz), Alpha (8-13 Hz), and Beta (14-30 Hz) are four basic patterns of EEG signals. Following the discrete wavelet transform method and these EEG patterns, we use a 3-level filter bank (as shown in Figure 3) with the Daubechies 1 (db1) mother wavelet function to decompose each raw EEG signal. Finally, four different frequency sub-bands, consisting of detail coefficients and approximation coefficients, are formed. The basic statistics, energy-based, randomness-based, and line length features introduced in the time domain are also used in the time-frequency domain; the difference is that these features are extracted from the four sub-bands.

The selection of EEG signal channels is still an open research topic. In order to capture the characteristics of the signals on all channels, we extract these 67 features from every channel. Having calculated the measurements using the above parameters, an 18 × 67 feature matrix is obtained. After flattening this feature matrix, we acquire a 1206-dimensional vector (called the channel-based feature vector or feature-channel vector) for every sample.
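Since db1 is the Haar wavelet, one analysis level of such a filter bank can be sketched in a few lines of NumPy (PyWavelets would be the usual library choice). This is a minimal sketch that ignores boundary handling for odd-length signals; note also that the resulting sub-band edges depend on the input sampling rate.

```python
import numpy as np


def haar_step(x):
    """One level of the db1 (Haar) analysis filter bank.

    Returns (approximation, detail) coefficients, each half the length
    of the (even-length) input, as in a two-channel filter bank with
    downsampling by 2.
    """
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)  # low-pass branch
    detail = (even - odd) / np.sqrt(2)  # high-pass branch
    return approx, detail


def haar_wavedec(x, levels=3):
    """3-level decomposition: returns [cA3, cD3, cD2, cD1]."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return [approx] + details[::-1]
```

Because the Haar transform is orthonormal, the total energy of the coefficients equals the energy of the input signal, which is a convenient sanity check.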

Feature Ranking
The minimal-redundancy-maximal-relevance (mRMR) algorithm [10] selects features based not only on the dependency between features and sample categories, but also on the relationships between the features themselves. In this research, we reorder the variables in the extracted channel-based feature vector according to the mRMR criterion. Suppose there is a dataset D with N samples and M (=1206) features. The purpose of feature ranking is to generate a new ordered set S from the M-dimensional set F = {f_i, i = 1, . . . , M}.
The Max-Dependency criterion uses mutual information as the dependency of features on the target class c to select features. Since Max-Dependency is hard to implement, the mean of all mutual information values between the individual features f_i and the class c, Max-Relevance, is used as an approximation of Max-Dependency:

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c)$$

where I(·) is the mutual information function, f_i is the i-th feature in F, and c is the class label vector. The mutual information of two random variables x and y is defined by their probability density functions p(x), p(y) and the joint probability density function p(x, y):

$$I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$$

Considering dependency alone can leave rich redundancy, so the relationships between features should also be taken into account. The Min-Redundancy condition, the mean of all mutual information values between every two features in S, can be added to select mutually exclusive features. It has the following form:

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j)$$

There are two selection schemes, the mutual information quotient (MIQ) and the mutual information difference (MID), that combine D and R for discrete data. We denote the combining operator Φ(D, R) and use MIQ in this study. The final optimization is:

$$\max \Phi(D, R), \quad \Phi = D / R \tag{4}$$

An incremental search method is used to obtain a near-optimal solution. When selecting the m-th promising feature from the set {F − S_{m−1}}, Formula (4) can be written as:

$$\max_{f_j \in F - S_{m-1}} \left[ I(f_j; c) \Big/ \frac{1}{m-1} \sum_{f_i \in S_{m-1}} I(f_j; f_i) \right]$$
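A minimal sketch of the incremental MIQ selection follows, using a simple histogram-based mutual information estimate; the number of bins and the tiny denominator guard are our own choices, not from the paper.

```python
import numpy as np


def mutual_info(a, b, bins=8):
    """Histogram-based mutual information estimate for two 1-D arrays."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())


def mrmr_miq(X, y, k):
    """Rank k features by the MIQ scheme: relevance / mean redundancy."""
    n_feat = X.shape[1]
    relevance = np.array([mutual_info(X[:, j], y) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]  # start from the most relevant
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            # mean MI between candidate j and already selected features
            redundancy = np.mean(
                [mutual_info(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] / (redundancy + 1e-12)  # MIQ criterion
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

On synthetic data with one class-informative feature and one noise feature, the informative feature is ranked first.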

Classification
Support vector machines are classical classifiers most commonly applied in the seizure prediction problem. In this work, we combine a sequential forward selection approach and Support Vector Machine (SVM) to optimize feature subsets and train models.
SVM finds an optimal hyperplane by maximizing the distance between two categories [11]. Given a training set T = {(x_i, y_i) | i = 1, . . . , N} with N samples of n-dimensional vectors, where x_i ∈ R^n, y_i ∈ {+1, −1}, the problem of maximizing the margin can be written as a dual problem:

$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C$$

where C is the penalty term that acts as an inverse regularization parameter and K(·) is the kernel function. The radial basis function (RBF) kernel used in this study is defined on two samples x and z as:

$$K(x, z) = \exp(-\gamma \|x - z\|^2)$$

where γ is the kernel coefficient. After solving this problem, the optimal b* can be obtained using the optimal α*. Then, the decision function for classifying a sample x becomes:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x) + b^* \right)$$

Since hyperparameters directly control the behavior of the SVM algorithm, choosing them plays a crucial role in the success of the classifier. Two hyperparameters (C and γ) need to be tuned in our classifier. We use a grid-search method to find the optimal hyperparameter combination. Grid-search exhaustively searches over specified parameter values, which is crude but effective.
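The grid-search over C and γ can be reproduced with scikit-learn; the toy data and the grid values below are illustrative assumptions, not the grids used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for pre-ictal (+1) and inter-ictal (-1) feature vectors.
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(2, 1, (40, 5))])
y = np.array([1] * 40 + [-1] * 40)

# Exhaustive grid-search over the two RBF-SVM hyperparameters C and gamma.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

best_C = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
```

`GridSearchCV` performs an inner cross-validation for each (C, γ) pair and retains the best-scoring combination.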
Various combinations of the first n feature-channels are tried. The one with the highest model accuracy is regarded as the optimal subset for a patient. The whole algorithm for finding the optimal combination is shown in Algorithm 1.

Performance Evaluation
Cross-validation (CV) is a model validation technique for assessing how the results of a model will generalize to an independent dataset. To obtain a reliable outcome, a leave-one-out CV method is used. Compared with k-fold cross-validation, leave-one-out is more practical and deterministic.

Algorithm 1 Finding optimal feature-channels using a sequential forward selection approach

Require: The reordered feature-channel set S = {s_i, i = 1, . . . , N};
Ensure: The optimal feature-channel subset S*;
  Initialize the model accuracy Acc ⇐ 0;
  for each n ∈ [1, N] do
    Feed the feature subset {s_1, . . . , s_n} to the SVM classifier;
    Tune the parameters of the SVM using grid-search;
    Use leave-one-out CV to obtain the model accuracy acc;
    if acc > Acc then
      Acc ⇐ acc;
      S* ⇐ {s_1, . . . , s_n};
    end if
  end for

In leave-one-out CV, one observation contains the pre-ictal and inter-ictal samples corresponding to one seizure. Suppose there are P observations for a certain case, where the value of P depends on the number of seizures (used) in Table 1. Leave-one-out uses one observation as the validation set and the remaining P − 1 observations as the training set. This is repeated over all observations to obtain P validation results. The average of these results is defined as the performance of the model.
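Algorithm 1 can be sketched as follows. To keep the example self-contained we substitute a nearest-centroid classifier for the grid-searched SVM, so the scores are illustrative only; all function names are our own.

```python
import numpy as np


def centroid_predict(X_tr, y_tr, X_va):
    """Tiny stand-in classifier: assign each sample to the nearest class centroid."""
    classes = np.unique(y_tr)
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = ((X_va[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]


def loo_accuracy(obs):
    """Leave-one-observation-out CV; `obs` is a list of (X, y) per seizure."""
    accs = []
    for i in range(len(obs)):
        X_tr = np.vstack([X for j, (X, _) in enumerate(obs) if j != i])
        y_tr = np.concatenate([y for j, (_, y) in enumerate(obs) if j != i])
        X_va, y_va = obs[i]
        accs.append(np.mean(centroid_predict(X_tr, y_tr, X_va) == y_va))
    return float(np.mean(accs))


def sfs_optimal_prefix(obs, n_max):
    """Algorithm 1: grow the mRMR-ranked prefix, keep the best-scoring n."""
    best_acc, best_n = 0.0, 0
    for n in range(1, n_max + 1):
        sub = [(X[:, :n], y) for X, y in obs]  # first n ranked feature-channels
        acc = loo_accuracy(sub)
        if acc > best_acc:
            best_acc, best_n = acc, n
    return best_n, best_acc
```

The strict `>` comparison keeps the smallest prefix among ties, mirroring the pseudocode above.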
Seven metrics are applied to measure performance. Accuracy is defined as the percentage of correctly classified samples among the total, which mirrors the overall performance of the model on balanced data. The false prediction rate (FPR) is defined as the proportion of the inter-ictal period that is wrongly predicted as pre-ictal, per hour. Sensitivity (SEN) is the rate of correctly predicted pre-ictal samples. Moreover, Area Under Curve (AUC), F1, and kappa are used to measure classification ability. Cost is the average prediction time spent on a sample, including preprocessing, feature extraction, and classification. All experiments are implemented in Python 3.6 on a server with six 3.7 GHz Intel Core (TM) CPUs running Ubuntu 16.04.
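Accuracy, sensitivity, and FPR can be computed from confusion counts. The conversion below assumes 5-s samples (as in Table 5) and reads the FPR definition as false predictions per inter-ictal hour, which is one plausible interpretation rather than the paper's exact formula.

```python
def prediction_metrics(tp, fn, tn, fp, segment_sec=5.0):
    """Accuracy, sensitivity, and FPR from confusion counts.

    `segment_sec` is the duration of one sample; FPR is reported as
    false pre-ictal predictions per hour of inter-ictal data (our own
    reading of the paper's definition).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)              # SEN: correct pre-ictal rate
    interictal_hours = (tn + fp) * segment_sec / 3600.0
    fpr = fp / interictal_hours               # false predictions per hour
    return accuracy, sensitivity, fpr
```

For example, 720 inter-ictal 5-s samples span exactly one hour, so a single false positive yields an FPR of 1.0/h.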

Results and Discussion
This section presents the results of the proposed feature design method applied to the scalp EEG data, and compares it with other design principles. Then, visualization of required features and the generalization capacity of models are studied.

Performance of Different Numbers of Feature-Channels
As shown in Algorithm 1, combinations with different numbers of feature-channels are fed into the classifier to predict seizures in a patient-specific approach. In order to study the prediction performance of different n on different patients, five cases are randomly selected for detailed analysis. Figure 4a visualizes the mRMR scores of these five cases, which are chb01, chb02, chb11, chb19, and chb21, from top to bottom. The vertical axis represents the 18 channels, and the horizontal axis represents the 67 features. A brighter color indicates that the feature-channel is more valuable to this patient. The accuracies of the five models with n from 1 to 200 are presented in Figure 4b. The selection order of the first n ranked feature-channels is based on the mRMR score.

As can be seen from Figure 4, finding a specific feature-channel subset for each patient is crucial to seizure prediction. The effects of the same feature and channel on different patients are diverse. In addition, the number of required feature-channels differs from patient to patient. According to the accuracy curve of each patient, as features are added, the model gradually reaches its optimum. However, as features increase further, model accuracy declines and tends to flatten. Hence, in order to highlight the difference, only accuracies with n from 1 to 200 are displayed. The selection of the feature subset is based on the evaluation of the mRMR algorithm. Therefore, the feature-channel added to the feature subset first has the maximal relevance to the category. For case chb11, the model accuracy is 74.4% when only nonlinear energy-FP2-F4 is used. When the fourth feature-channel is added, the accuracy of the model decreases from 87.8% to 87.7%. This kind of fluctuation is very common in the training and evaluation of some machine learning algorithms (e.g., SVM), for which the model is not guaranteed to achieve the best performance.
However, this fluctuation does not affect the overall trend. Once the feature subset is optimal (for chb11, the first six feature-channels), additionally added feature-channels are redundant. In particular, case chb01 only needs local extrema-FP1-F3 to achieve optimal performance. New redundant feature-channels reduce the accuracy of the model. Finally, when using the first 10, 1, 6, 36, and 17 feature-channels, respectively, the optimal models are obtained for cases chb01, chb02, chb11, chb19, and chb21.
The model trained with these optimal feature-channels is considered the best model for the patient. After evaluating 23 patients with 146 seizures in the CHB-MIT sEEG database, the number of items in the optimal feature-channel combination and the classifier parameters of the best model are presented in Table 3. The complete list of optimal feature-channel combinations is described in Appendix B. The features and channels that contribute most to seizure prediction in each patient are provided for future research. We can see that the number of optimal feature-channels and their combination vary greatly from patient to patient.

Comparison of Different Feature Design Principles
To examine whether the proposed prediction method is significantly different from the general feature design principle in previous works, comparison and discussion between optimal feature-channel subset, summarized feature subset, and complete feature-channel set are presented.
The summarized feature subset is the combination of the top 11 features (regardless of channel) with the most occurrences in the optimal feature-channel combinations in Table A1. The statistics of a total of 62 features are presented in Figure 5, in which the features with a sky-blue bar are regarded as summarized features. These 11 features are extracted from 18 channels for all patients. Then, the 198-D feature vectors are fed into the classifier in a patient-specific manner.
The complete channel-based feature set is the combination of all feature-channels extracted from the time, frequency, and time-frequency domains; the 1206-D feature vectors are used to train the SVM classifier. Figure 6 presents the comparison of classification accuracy obtained with these three feature sets. The optimal feature-channel subset gives an overall accuracy of 90.4%, while the summarized feature subset and complete feature set achieve accuracies of 84.8% and 85.0%, respectively.
Other metrics are also applied to measure the performance of these three feature sets (as shown in Table 4). Models trained using the optimal feature subsets reach a sensitivity of 90.2% with an FPR of 0.096/h. In summary, the overall performance of each patient model obtained using the optimal feature-channel subset is higher than that of the other two. Comparing the summarized feature subset and the complete feature set, even though a large number of features is used in the complete set, the gain in performance is negligible. Features in the complete set may share similar contributions, making them redundant in the training process. Especially in chb19, there is a significant difference between the performance of these two feature sets. This also indicates that only specific features can effectively discriminate the signal states in some patients.

Figure 6. Comparison between the classification accuracy obtained using optimal, summarized, and complete feature sets.

The method of iteratively selecting the optimal features for each patient still suffers from high time complexity. Therefore, where acceptable performance suffices, the 11 summarized features (line length-beta [12], local extrema [13], Higuchi FD-beta, zero-crossings, line length-alpha, line length-theta, Higuchi FD [14], SVDEn [15], peak frequency [16], Hurst exponent [15,17], and Higuchi FD-alpha), which are commonly applied in seizure prediction, can still be used as a guide for general features across all patients.

Visualization of Optimal Feature-Channels
To investigate how the optimal feature-channels contribute to the classification, we project the optimal feature-channel vectors of cases chb01, chb20, chb23, and chb14 onto a 2D plane using the t-SNE [18] algorithm. These cases are selected for their various n, which are 10, 26, 37, and 55, respectively. Figure 7 shows the dimension reduction results of the optimal features of these four cases. The blue dots represent pre-ictal samples, and the orange ones represent inter-ictal samples. It can be seen that the samples represented by the optimal feature-channels are well separated into two categories. This indicates that these features may also suit classifiers other than SVM.

Evaluation of Generalization Ability
In order to verify the generalization ability of models, the optimal models are evaluated on the extended pre-ictal data. Many studies have found that characteristic changes in EEG signals occur within 30 min before onset [2,19]. Since the data 15 min before onset are used to train and evaluate the models, we use the same criteria in Section 2.2 to collect the EEG signals from 15 to 30 min before onset as the extended pre-ictal data. These data are divided into three datasets at 5 min intervals, which are −30 to −25 min dataset, −25 to −20 min dataset, and −20 to −15 min dataset.
The evaluation of models trained with optimal feature-channels on these three datasets is presented in Table 5.

Table 5. Evaluation results of extended pre-ictal data using optimal models. The extended data from 15 to 30 min before onset are divided into three datasets with the same number of seizures: the −30 to −25 min dataset, the −25 to −20 min dataset, and the −20 to −15 min dataset. # of samples: number of 5-s segments used for prediction. SEN: percentage of samples correctly predicted as pre-ictal. Cost: average time to predict a sample. AVG SEN: average sensitivity over the three datasets for each case.

Since the pre-ictal state exists within 30 min before onset, most models can also accurately predict data that were not involved in training. The prediction time of our models is much shorter than 5 s, so they can easily be applied in real-time prediction.
However, some patient models (e.g., chb21) have negative results on these data. The possible reason is that the signal 15 min before onset of this patient is very different from the signal 15 to 30 min before onset. This suggests that the pre-ictal window varies greatly from patient to patient.
Generally, we expect that the closer to onset, the more obvious the characteristics become. However, many models (e.g., chb03 and chb07) have better prediction results on the −30 to −25 min dataset than on the −25 to −20 min one. Therefore, a detailed analysis of pre-ictal timing may be worth examining in future studies.

Comparison to Prior Works
A comparison of the proposed framework with other methods for pre-ictal/inter-ictal classification is presented in Table 6. The comparison focuses on studies that, like ours, were evaluated on the CHB-MIT database. From a methods perspective, a similar workflow is used, with each work using different sets of features and classifiers. Models trained with optimal features show promising performance with high sensitivity and low FPR.

Conclusions
In this study, mRMR is employed to determine the quality of each channel-based feature from the time, frequency, and time-frequency domains. Based on this, we use the prediction accuracy of an SVM combined with a sequential forward selection method to determine the optimal subset of features for each case. The models trained with optimal features achieve an overall sensitivity of 90.2% and an FPR of 0.096/h in the classification of pre-ictal and inter-ictal states on the CHB-MIT database. In addition, a comparison of the optimal features with the summarized feature subset and the complete set of features from the three domains shows that finding an optimal feature-channel subset for each patient is an important step in seizure prediction. A valuable future research direction is a method that combines feature quality measurement with classification to save training time.

Conflicts of Interest:
The authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Appendix A. Detailed Description of Features
Appendix A.1. Time Domain Features

In this section, the features expressed in italics are briefly described and explained with mathematical expressions.

• Basic statistics: Basic statistics features have been frequently used to distinguish the pre-ictal pattern from the inter-ictal pattern. Mean, skewness, kurtosis, peak-to-peak amplitude, and coefficient of variation are used in this work. Skewness, counting the asymmetry of the signal distribution, is calculated as:

$$\text{Skew}(X) = \frac{E[(X - \mu)^3]}{\sigma^3}$$

where µ is the mean value and σ is the standard deviation of the signal X. Kurtosis can be used to measure the steepness of a signal as follows:

$$\text{Kurt}(X) = \frac{E[(X - \mu)^4]}{\sigma^4}$$

Peak-to-peak amplitude quantifies the range of the signal. Coefficient of variation is a normalized measure of signal dispersion, defined as the ratio of the standard deviation to the average value.

• Energy related: The energy is the sum of squares of the signal. The average power is the mean square of the signal, i.e., the average energy. The root mean square is the square root of the average power. The last energy-related feature is nonlinear energy, originally presented in [25]. This feature can track the instantaneous frequency of the signal effectively, and its output is given by:

$$\psi(x_i) = x_i^2 - x_{i-1} x_{i+1}$$

• Line length: Line length, derived from Katz's fractal dimension, was first proposed by [26]. It increases as the amplitude or frequency of the signal increases. The normalized line length can be represented as:

$$LL = \frac{1}{N - 1} \sum_{i=1}^{N - 1} |x_{i+1} - x_i|$$

• Entropy based: Entropy, as a measure of uncertainty and disorder in data, has been verified in much signal processing research. Approximate entropy (ApEn), sample entropy (SampEn), and singular value decomposition entropy (SVDEn) are the entropy features used in this work. ApEn [27] can be used to quantify the regularity of the signal. The signal X is cut into subsequences u_m(i) for {i | 1 ≤ i ≤ n − m + 1} using m as the sliding window. Subsequence i is represented by u_i = {x_i, x_{i+1}, . . . , x_{i+m−1}}.
Then, the ApEn is defined as:

$$ApEn(m, r) = \frac{1}{n-m+1} \sum_{i=1}^{n-m+1} \ln C_i^m(r) - \frac{1}{n-m} \sum_{i=1}^{n-m} \ln C_i^{m+1}(r)$$

where C_i^m(r) is the proportion of subsequences whose distance from subsequence u_i is less than the tolerance value r = 0.2 · std(X). It can be calculated by:

$$C_i^m(r) = \frac{1}{n-m+1} \sum_{j=1}^{n-m+1} H\big(r - d(u_i, u_j)\big)$$

where H(t) is the Heaviside step function and d(·,·) is the distance function. SampEn [28] is based on concepts similar to ApEn, but the distance is only calculated against other subsequences (self-matches are excluded), so the B_i^m(r) corresponding to C_i^m(r) is:

$$B_i^m(r) = \frac{1}{n-m-1} \sum_{j=1, j \ne i}^{n-m} H\big(r - d(u_i, u_j)\big)$$

In addition, it performs the logarithmic operation in the final entropy calculation. Therefore, SampEn is defined by:

$$SampEn(m, r) = -\ln \frac{\sum_i B_i^{m+1}(r)}{\sum_i B_i^m(r)}$$

Smaller values indicate more self-similar and regular signals, while larger values characterize higher complexity.

SVDEn [29] measures the dimensionality of the signal. It uses the embedding matrix Y, which can be written as:

$$Y = [y_1, y_2, \ldots, y_{n-(m-1)j}]^T, \quad y_i = [x_i, x_{i+j}, \ldots, x_{i+(m-1)j}]$$

where m is the embedding dimension and j is the time delay. Then, this entropy is described as follows:

$$SVDEn = -\sum_{i=1}^{M} \bar{\sigma}_i \log_2 \bar{\sigma}_i$$

where M is the number of singular values of the embedding matrix Y and σ̄_i is the i-th normalized singular value. Since SVDEn is calculated on all channels, its value can indicate a pattern of the pre-ictal signal in both space and time.

• Hurst exponent: The Hurst exponent can evaluate the predictability of a time series, and it has been shown that the epileptic brain is long-term anticorrelated [30]. Rescaled range analysis is a commonly used calculation method for time series. The cumulative deviate series can be defined as follows:

$$z_t = \sum_{i=1}^{t} (x_i - \mu), \quad t = 1, \ldots, n$$

Then, compute the range:

$$R(n) = \max(z_1, \ldots, z_n) - \min(z_1, \ldots, z_n)$$

In addition, the standard deviation is:

$$S(n) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

The Hurst exponent H complies with the following rule:

$$E\left[\frac{R(n)}{S(n)}\right] \propto n^H$$

• Higuchi fractal dimension: Following Higuchi's method, new time series are constructed from X with delay k and initial time m. The average length of the new time series u_i is computed as:

$$L_m(k) = \frac{1}{k} \left( \sum_{i=1}^{\lfloor (n-m)/k \rfloor} |x_{m+ik} - x_{m+(i-1)k}| \right) \frac{n-1}{\lfloor (n-m)/k \rfloor \, k}$$

Thus, the total average length over all new series can be written as:

$$L(k) = \frac{1}{k} \sum_{m=1}^{k} L_m(k)$$

Finally, the Higuchi fractal dimension is the slope of the linear regression between ln L(k) and ln(1/k).

• Hjorth parameters: The three Hjorth parameters proposed by [32] can together characterize the EEG signal in terms of amplitude, time scale, and complexity.
Hjorth parameters are calculated on the raw signal series X, the first derivative of the series X′, and the second derivative X″. The derivatives are obtained as differences, namely:

$$x'_i = x_{i+1} - x_i, \quad x''_i = x'_{i+1} - x'_i$$

The first parameter, Hjorth activity, is the variance σ²_X of the signal X. The second parameter, Hjorth mobility, can be expressed as:

$$\text{Mobility}(X) = \sqrt{\frac{\sigma^2_{X'}}{\sigma^2_X}}$$

The third parameter, called Hjorth complexity, is defined as:

$$\text{Complexity}(X) = \frac{\text{Mobility}(X')}{\text{Mobility}(X)}$$

• Detrended fluctuation analysis (DFA): DFA [33] is another long-range correlation analysis method, similar to rescaled range analysis. Construct a new series using the mean of X, u_i = Σ_{j=1}^{i} (x_j − µ), then segment it into k subintervals using the length list v = {v_1, v_2, . . . , v_k}. In each subinterval, the data are fitted by polynomial regression to obtain the function û_i, and the mean-squared residual is found:

$$F(v) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (u_i - \hat{u}_i)^2}$$

Finally, DFA is the slope of the 1D least-squares regression between ln v and ln F.

• Number of zero-crossings: A zero-crossing, a popular feature, is a point where the sign of the signal amplitude changes. The number of zero-crossings is the count of such points in a signal series. It indirectly reflects the change of signal frequency: when this number is large, the signal contains relatively high-frequency components.

• Number of local extrema: Local extrema consist of local maxima and minima. The local maxima, the so-called peaks, are obtained by a simple comparison of neighboring values; local minima are found in the same way. The number of local extrema is also an indirect measurement of signal frequency, similar to the number of zero-crossings.
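A few of the time domain features above can be sketched directly in NumPy; the function names are our own, and the difference-based Hjorth estimates follow the derivative-as-difference convention described above.

```python
import numpy as np


def line_length(x):
    """Mean absolute successive difference (normalized line length)."""
    return np.abs(np.diff(x)).mean()


def num_zero_crossings(x):
    """Count sign changes of the signal amplitude."""
    return int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))


def num_local_extrema(x):
    """Count samples strictly greater or smaller than both neighbors."""
    inner, left, right = x[1:-1], x[:-2], x[2:]
    maxima = (inner > left) & (inner > right)
    minima = (inner < left) & (inner < right)
    return int(maxima.sum() + minima.sum())


def hjorth(x):
    """Hjorth activity, mobility, and complexity via first differences."""
    dx, ddx = np.diff(x), np.diff(np.diff(x))
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / np.var(x))
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity
```

On a pure sinusoid, mobility is approximately the angular frequency per sample and complexity is approximately 1, which makes these functions easy to sanity-check.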

Appendix A.2. Frequency Domain Features
Power Spectral Density describes how the power of a signal is distributed over frequency. We use P = {p_1, p_2, . . . , p_n} to denote the signal's PSD over the frequency range from 1 to n, and use P̄ to denote the PSD normalized by the total power.

• Energy-PSD: Similar to the energy in the time domain, energy-PSD is the sum of the squares of the PSD.

• Intensity weighted mean frequency (IWMF): IWMF, also known as the mean frequency, is the weighted mean of the frequencies present in the normalized PSD estimation of the signal series. It is defined as:

$$\text{IWMF} = \sum_{i=1}^{n} i \, \bar{p}_i$$

• Intensity weighted bandwidth (IWBW): IWBW, also called the standard deviation frequency, is a measure of the width of the normalized PSD expressed as a standard deviation, and is defined as:

$$\text{IWBW} = \sqrt{\sum_{i=1}^{n} (i - \text{IWMF})^2 \, \bar{p}_i}$$

• Spectral edge frequency (SEF): We use SEF50 [34], the median frequency, defined as the minimum frequency below which 50% of the total spectral power up to a reference frequency f_ref is contained:

$$\text{SEF50} = \min \left\{ f \;\Big|\; \sum_{i=1}^{f} p_i \ge 0.5 \sum_{i=1}^{f_{\text{ref}}} p_i \right\}$$

• Spectral entropy: The entropy of the power spectrum [35] can determine the EEG's irregularity. Spectral entropy is calculated on the normalized PSD using the classical entropy [36]. Thus, it is defined as:

$$H_{\text{spec}} = -\sum_{i=1}^{n} \bar{p}_i \log_2 \bar{p}_i$$

• Peak frequency: Peak frequency [37], also called dominant frequency, is the frequency of the peak that has the largest average power in its full-width-half-maximum (FWHM) band. The FWHM band is defined by two frequencies, on the rising and falling slopes, respectively, whose amplitudes are equal to half of the peak's amplitude. This feature can find the most prominent rhythmic component of the signal.
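Given a PSD estimate, IWMF, IWBW, and spectral entropy can be computed in a few lines. This is a sketch with our own function names; `freqs` and `psd` would typically come from a Welch estimate.

```python
import numpy as np


def spectral_features(freqs, psd):
    """IWMF, IWBW, and spectral entropy computed from a PSD estimate."""
    p = psd / psd.sum()                      # normalized PSD
    iwmf = np.sum(freqs * p)                 # intensity weighted mean frequency
    iwbw = np.sqrt(np.sum((freqs - iwmf) ** 2 * p))
    nz = p > 0                               # avoid log2(0)
    spec_ent = -np.sum(p[nz] * np.log2(p[nz]))
    return float(iwmf), float(iwbw), float(spec_ent)
```

A PSD concentrated at a single frequency gives that frequency as the IWMF, zero bandwidth, and zero entropy, which matches the definitions above.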