Manifold Feature Fusion with Dynamical Feature Selection for Cross-Subject Emotion Recognition

Affective computing systems can decode cortical activities to facilitate emotional human–computer interaction. However, individual differences in neurophysiological responses among users of a brain–computer interface make it difficult to design a generic emotion recognizer that adapts to a novel individual, posing an obstacle to cross-subject emotion recognition (ER). To tackle this issue, in this study we propose a novel feature selection method, manifold feature fusion and dynamical feature selection (MF-DFS), under the transfer learning principle to determine generalizable features that are stably sensitive to emotional variations. The MF-DFS framework combines local geometrical information feature selection, domain-adaptation-based manifold learning, and dynamical feature selection to enhance the accuracy of the ER system. Based on three public databases, DEAP, MAHNOB-HCI and SEED, the performance of the MF-DFS is validated under the leave-one-subject-out paradigm with two types of electroencephalography (EEG) features. With three emotional classes defined for each affective dimension, the MF-DFS-based ER classifier achieves accuracies of 0.50/0.48 (DEAP) and 0.46/0.50 (MAHNOB-HCI) for the arousal and valence dimensions, respectively. For the SEED database, it achieves 0.40 for the valence dimension. These accuracies are significantly superior to those of several classical feature selection methods on multiple machine learning models.


Introduction
Human emotions play an important role in conveying information in human-computer interaction and can indirectly reflect anxiety, stress and abilities in cognition, communication and decision-making. With the wide application of machine learning methods in human-centered systems, emotion recognition has attracted much attention. Specifically, identified human emotions can be used as feedback to provide better service in medical care devices, recommender systems, and information retrieval engines [1]. This enhances user experience and satisfaction and leads to harmonious interactions between human and machine agents. Affective computing techniques play an important role in a wide range of applications in the domain of human-machine interaction. Thakur et al. designed a framework for human behavior monitoring [2]. Based on data of human activities of daily living, multimodal components of user interactions were analyzed. The human behavioral patterns and their relationships with the dynamic contextual and spatial features of the environment functionalities were found to be significant. Machot et al. designed an affective computing module for driver support with a linear support vector machine (SVM), where the accuracy of subject-specific emotion recognition (90.97%) is much higher than that under the cross-subject condition (64.82%) [11]. This implies that a classifier cannot effectively transfer knowledge about the EEG data distribution among multiple users [12]. Such individual differences limit the generality and applicability of the extracted and selected EEG features.
To tackle this obstacle, we focus on designing the feature selection and fusion approach for transferring knowledge among multiple users' EEG in the cross-subject ER system. Transfer learning uses prior knowledge and concepts in the source domain to apply to a target domain by adjusting the machine model to match the novel data distribution [13]. Transfer learning is widely used in image classification [14], fault diagnosis [15], data mining and knowledge discovery systems [16]. In particular, it facilitates a cross-subject ER system to predict correct emotional labels with insufficient EEG instances from a single subject since all user's data can be exploited to train a generalizable model. To this end, we incorporate geodesic flow kernel (GFK) [17] to achieve unsupervised domain adaptation by sampling points in the estimated manifold [18].
In this study, we propose a cross-subject ER framework based on a novel EEG feature selection method termed manifold feature fusion and dynamical feature selection (MF-DFS). In the MF-DFS model, neighborhood component analysis (NCA) is first applied to reduce the dimensionality of the EEG features. Then, the GFK is used to map the fused EEG features to the Grassmann manifold space. Specifically, we propose a dynamical feature selection method (DFS) combined with the random forest (RF) to determine the most relevant features and improve the generalization capacity of the ER model. This paper is structured as follows. Section 2 briefly reviews related works on methods of EEG-based emotion recognition. In Section 3, we describe the EEG databases and the approaches for feature selection and machine learning based classification. A comparison of the emotion classification performance is shown in Section 4. In Section 5, we discuss the main findings and point out the limitations of the present study. Finally, Section 6 concludes the study and lists our future work.

Related Works
We employ the DEAP, MAHNOB-HCI and SEED databases to validate the proposed ER framework. The DEAP was built by Koelstra et al. [19] and is used to study human emotion variations based on multi-modal physiological data. In previous studies, recognition accuracies of about 70.00% were achieved for the valence and arousal dimensions on the DEAP [20,21]. When deep neural networks are employed, the accuracies of arousal and valence are 61.25% and 62.50%, respectively. For convolutional neural networks and the RF, the accuracies are 88.49%/87.44% and 59.22%/55.70% for the arousal/valence dimensions [22][23][24]. Among these studies, most classifiers were designed under the subject-dependent paradigm, where training and testing data were drawn from the same subject. Under this context, bispectral analysis with the support vector machine (SVM) was developed and accuracies of 72.50%/73.30% for arousal/valence were achieved [25]. Plataniotis et al. [26] adopted two types of semi-supervised deep learning approaches, stacked denoising autoencoders and deep belief networks, as feature extractors. The accuracies of the valence and arousal dimensions are 88.33% and 88.59%, respectively.
The MAHNOB-HCI database was collected by Soleymani et al., where the EEG and physiological signals from the peripheral nervous system were available to indicate affective states of subjects [27]. Yan et al. proposed a modified common spatial pattern extractor and combined it with a channel selection method, which achieved an average accuracy of 94.13% on the MAHNOB-HCI [28]. In our previous work, a locally robust feature selection method was proposed to find an EEG feature subset with local generalization ability among a group of subjects [29]. For two emotion classes under the cross-subject paradigm, average classification accuracies on the arousal and valence dimensions of the MAHNOB-HCI database are 67.00% and 70.00%, respectively. Tan et al. used a short-term ER framework based on a spiking neural network with spatio-temporal EEG patterns [30]. They segmented EEG signals and extracted their short-term changes to avoid handcrafted feature engineering. The average accuracies of the valence and arousal dimensions of the MAHNOB-HCI database are 72.12% and 79.39%, respectively.
The SEED database was built by the BCMI laboratory. Three (positive, negative and neutral) target emotions for each physiological data clip were induced when the subject was watching movie clips. Lu and Zheng built an EEG-based ER system by training deep belief networks (DBN). It is reported that the average recognition accuracy (86.65%) of four selected channels is higher than that (86.08%) of the 62 EEG channels [31]. Wang et al. proposed electrode-frequency distribution maps with short-time Fourier transform for the EEG feature extraction and applied it on the SEED database [32]. Residual block based deep convolutional neural network is used as the base classifier and the accuracy is 90.59%, which is 4.51% higher than the baseline [31]. Lu et al. developed a cross-subject ER system with the dynamic entropy model learning framework [33]. The average recognition accuracy of negative and positive emotions in the SEED database was 85.11%.
In recent works, machine learning approaches have also been validated on other EEG databases. Katsigiannis et al. built the DREAMER database [34], which possesses multi-modal data from the EEG and electrocardiogram (ECG). Based on a support vector machine binary classifier, the accuracy for the valence dimension reached 62.49% with the EEG modality, while the fusion of the EEG and ECG features provided the highest accuracy for the arousal dimension (62.32%). Baldo et al. [35] proposed a model for predicting consumers' affective states towards novel products based on EEG signals. They recorded EEG data of 40 participants while viewing different shoes on a computer screen. Consumers' preference, i.e., like or dislike, for each pair of shoes with different fashions could be classified. Murugappan et al. [36] applied the k-nearest neighbor and probabilistic neural network to recognize emotional states of 12 participants towards different brand advertisement videos. By extracting power spectral density, spectral energy and centroid features of the EEG, an accuracy of 96.62% was achieved. Abadi's lab presented DECAF [37], a multimodal database for decoding user physiological responses to affective multimedia content, in which brain activity was scanned by magnetoencephalogram sensors. They used a linear support vector machine and achieved mean accuracies of 57.9% and 51.25% over 30 participants using leave-one-subject-out cross-validation on the arousal and valence dimensions, respectively.
In general, the design of the ER system can be categorized into two schemes, namely subject-specific and cross-subject. Although the accuracy of the latter is usually lower than that of the former, there is no doubt that the cross-subject ER system requires fewer EEG instances from a specific individual. Previous studies indicate that the differences in neuro-physiological responses between individuals make the cross-subject ER task more difficult than the subject-specific condition. To this end, we focus on designing the cross-subject EEG feature selection method and emotion classifiers and validate them on the previously mentioned databases.
By briefly reviewing recently reported works, we notice that the individual difference of the EEG distribution significantly impairs the generalization capability of machine learning classifiers [26][27][28]. The reason is that the psychophysiological process induced by a specific affective stimulus varies among different people [20][21][22][23]. A promising solution is to extract or select stable EEG features that are invariant across individuals [17][18][19][24]. In [17], the variational mode decomposition was used to discover these invariant spatial EEG features from raw signals. In [24], a locally robust feature selection method was used to quantify the consistency of the feature distribution of the same affective state among all BCI users. Encouraged by these works, in the present study we first apply the neighborhood component analysis and geodesic flow kernel to map EEG features from source and target domains to a Grassmann manifold space. Under this context, the source and target domains are built from different people. Therefore, consistent feature representations can be learned based on domain adaptation and knowledge transfer. We also develop a dynamical feature selection module to adaptively locate EEG features sensitive to emotional variations. The selected features can be properly adjusted when the training EEG data of a specific subject with an abnormal feature distribution are employed.

Materials and Methods
In this section, the employed EEG databases are first introduced. Then, we provide the procedures for EEG data preprocessing and assignment of the emotion labels. The steps of feature extraction are also described in detail. Finally, the mathematical formulation of the manifold feature fusion and dynamical feature selection method is described. Two manifold learning techniques (the NCA and GFK) for feature fusion are reviewed, and the detailed steps of the proposed DFS feature selection method are described. The framework of the proposed ER system is illustrated in Figure 1.


Database Descriptions
The DEAP dataset contains EEG and peripheral physiological signals recorded from 32 channels. While the EEG signals were recorded, 40 selected music videos (1 min each) were viewed by 32 volunteers (16 males). The participants were asked to complete a self-assessment manikin questionnaire after watching each video. At the end of a trial, arousal, valence, dominance, liking and familiarity scales were rated by the volunteers within a range of 1-9. All the collected EEG signals are downsampled to 128 Hz. In this study, only the EEG signals are used for building the ER system.
The MAHNOB-HCI, a dataset of 30 volunteers, recorded physiological signals of electrocardiogram, EEG (with 32 channels), respiratory amplitude and skin temperature.
The EEG modality is used in building the ER system. Since the EEG data of six subjects are incomplete, only 24 participants' data are available. Each participant rated arousal, valence, dominance, and predictability scales from 1 to 9 for 20 selected music videos. The recorded EEG is downsampled to 128 Hz. We extract a 60-s EEG segment from each trial for further analysis; the first 5 s of signal and the signal after 65 s of each trial are removed.
For the SEED database, 15 subjects participated in three trials of data acquisition experiments with an interval of 1 week between two consecutive trials. Each subject watched fifteen selected clips of Chinese films to stimulate target emotions on the valence scale, i.e., positive, neutral and negative. Each clip lasted approximately 4 min. The EEG data were simultaneously recorded with 62 channels, of which 32 channels are selected for further analysis. The selected 32 channels are the same as those used in the DEAP and the MAHNOB-HCI. The raw EEG signal was recorded at 1000 Hz and then downsampled to 200 Hz.

EEG Data Preprocessing
During EEG acquisition, the signals can be contaminated by ocular, muscular, and movement artifacts. Low-pass and high-pass filters are selectively employed to remove muscular noise or ocular disturbance. Unlike the SEED and DEAP databases, the EEG in the MAHNOB-HCI needs to be re-referenced. Since movement and respiration artifacts are observed in the MAHNOB-HCI, a high-pass filter with a cutoff frequency of 3 Hz is applied. For the DEAP, a band-pass filter with cutoff frequencies of 4 and 45 Hz is implemented. For the SEED, a high-pass filter (3 Hz) is first used to remove the motion and respiratory artifacts; then, a low-pass filter (45 Hz) is used to eliminate high-frequency noise. Table 1 shows the detailed filter settings for the three databases. To generate sufficient training instances for the ER classifier, we divide each trial's EEG data into four non-overlapping segments for all three databases. In total, 5120 (160 per subject), 1920 (80 per subject) and 2700 (180 per subject) EEG segments are obtained from the DEAP, MAHNOB-HCI and SEED databases, respectively. Since supervised machine learning models are used for building the ER system, we assign target emotion classes to each EEG segment as follows; the emotion class for each trial is fixed. For the DEAP and MAHNOB-HCI databases, three target classes are assigned to each trial according to the subjective ratings on the arousal and valence dimensions: a rating higher than 5.5 indicates the high class, a value between 3.5 and 5.5 the neutral class, and a value lower than 3.5 the low class. Hence, for the arousal or valence dimension, three target emotions are defined and assigned to each trial of the EEG data. For the SEED database, all trials have already been categorized into three emotional classes (positive, neutral, negative).
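As a minimal sketch of this preprocessing pipeline (the filter order, the use of zero-phase filtering, and the helper names are our own illustrative choices, not taken from the paper), the DEAP-style filtering, segmentation and rating-to-class mapping could look like:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_trial(eeg, fs=128.0, band=(4.0, 45.0), n_segments=4):
    """Band-pass filter one trial (channels x samples) and split it into
    non-overlapping segments, mirroring the DEAP settings in the text."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=1)       # zero-phase filtering
    seg_len = filtered.shape[1] // n_segments
    return [filtered[:, i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def rating_to_class(rating, low=3.5, high=5.5):
    """Map a 1-9 self-assessment rating to low / neutral / high classes."""
    if rating > high:
        return "high"
    if rating < low:
        return "low"
    return "neutral"
```

The same thresholds (3.5 and 5.5) apply to both the arousal and valence ratings of the DEAP and MAHNOB-HCI trials.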

Feature Extraction
The preprocessed EEG signals of each segment are then transformed to a feature vector aiming at sensitively indicating emotion variations. In this study, we incorporate 364 classical EEG features (CL) [38] and 128 differential entropy (DE) features [39] from the signals of 32 EEG channels.

Classical EEG Features
For all three EEG databases, 364 CL features (204 frequency-domain and 160 time-domain features) are extracted, as shown in Table 2. Considering the asymmetry of the left and right hemispheres, power differences between the left and right hemispheres of the scalp in four frequency bands (theta: 4-8 Hz, alpha: 8-14 Hz, beta: 14-31 Hz, gamma: 31-45 Hz) were extracted. The power features of the same four bands of the 32 channels were calculated by the fast Fourier transform. Power ratios of specific channels and frequency bands were also calculated. In addition, we extracted five sets of time-domain features, as shown in Table 2.
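The band-power part of this feature set can be sketched with a simple FFT periodogram; the estimator, the channel pairings and the function names below are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 45)}

def band_powers(segment, fs=128.0):
    """Per-channel spectral power in the four bands via a periodogram."""
    n = segment.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(segment, axis=1)) ** 2 / n
    return {name: psd[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
            for name, (lo, hi) in BANDS.items()}

def hemispheric_asymmetry(powers, left_idx, right_idx):
    """Power differences between symmetric left/right channel pairs."""
    return {name: p[left_idx] - p[right_idx] for name, p in powers.items()}
```

A Welch estimate (averaged periodograms) would be a lower-variance alternative to the single-window FFT used here.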

Differential Entropies
The DE is an extension of the Shannon entropy and has been widely applied for building EEG-based ER systems [40]. In this study, we compute the DE for each classical band as

h(X) = −∫ f(x) ln f(x) dx, (1)

where f(x) is the probability density function of the EEG time series X in a specific frequency band after band-pass filtering. According to the Kolmogorov-Smirnov statistic [41], the filtered EEG signals follow a Gaussian distribution N(μ, σ²). Therefore, Equation (1) can be computed approximately from the variance of the filtered signal as

h(X) = (1/2) ln(2πeσ²). (2)
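Under the Gaussian assumption, Equation (2) reduces the DE of a band to a function of the variance of the band-filtered signal. A minimal sketch (the filter order is an assumed choice):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def differential_entropy(x, fs, band):
    """DE of a band-limited EEG series under the Gaussian assumption:
    h(X) = 0.5 * ln(2 * pi * e * sigma^2)  (Equation (2))."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, x)                 # band-pass filtered series
    return 0.5 * np.log(2 * np.pi * np.e * np.var(filtered))
```

Because the DE depends only on the log-variance, rescaling the signal by a factor c shifts the DE by ln c, which makes it robust to amplitude offsets between recording sessions.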

Manifold Feature Fusion and Dynamical Feature Selection
In this section, the proposed MF-DFS framework for fusing and selecting EEG features for building the cross-subject ER system is described in detail. We first briefly review the NCA and GFK feature fusion methods. Then, the details of the proposed DFS feature selection method are shown.

Neighborhood Component Analysis
The NCA [42] is a non-parametric method for selecting features with the goal of maximizing the prediction accuracy of the classifier. It learns a Mahalanobis-type distance between training instances and linearly transforms them into a subspace such that the average cross-validation classification accuracy is maximized. The motivation for applying the NCA lies in two aspects. First, the NCA employs a stochastic 1-nearest neighbor (1-NN) classifier to examine whether the predicted class is consistent with the target class of the EEG features. Compared to unsupervised principal component analysis, the NCA adopts a supervised learning principle that exploits label information to increase inter-class distinguishability. Moreover, the comparison between the predicted and target classes is based on leave-one-out cross validation. Compared to an artificial neural network with empirical risk minimization, it better controls overfitting when performing metric learning of the distance weights. It should be noted that the functionality of the 1-NN classifier is to evaluate the distance between each pair of instances belonging to the same class. This encourages the low-dimensional features to be embedded with a large inter-class margin, which differs from a k-NN classifier used directly for classification. In this study, the number of folds for the cross validation is 15.
A multi-class training set T of n samples can be defined as T = {(X_i, y_i), i = 1, 2, ..., n}, where X_i ∈ R^d are feature vectors, y_i ∈ {1, ..., C} are the class labels, and C is the number of classes. The aim is to learn a classifier that generates a prediction f(X) close to the true label y of X. To select the optimal feature subset, we define D_w as the weighted distance between samples X_i and X_j,

D_w(X_i, X_j) = Σ_{r=1}^{d} w_r² |x_ir − x_jr|, (3)

where w_r is the feature weight. In this scheme, a reference point is randomly chosen as the nearest neighbor of a new point X: the probability that a point X_j is picked from T as the reference point for X increases as X_j gets closer to X, as measured by the distance D_w. The leave-one-out paradigm is then applied to evaluate the classifier's performance. That is, the predicted label of X_i is generated by the classifier trained on the dataset T^(−i), which excludes the instance (X_i, y_i) from the training set T. The probability that a point X_j is picked as the reference point for X_i is

p_ij = k(D_w(X_i, X_j)) / Σ_{j'≠i} k(D_w(X_i, X_j')). (4)

In Equation (4), the kernel function k(z) = exp(−z/σ) achieves a large value when D_w(X_i, X_j) decreases. The kernel width σ influences the probability of a training sample being selected as the reference. The probability p_i that the classifier correctly classifies X_i using the training dataset T is

p_i = Σ_{j≠i} y_ij p_ij. (5)
In Equation (5), y_ij can be elicited as y_ij = 1 if y_i = y_j and y_ij = 0 otherwise.
Thus, the average probability of correct classification is derived as

F(w) = Σ_{i=1}^{n} p_i. (6)

The goal of the NCA is to maximize F to improve the classification accuracy. To reduce overfitting, a regularization parameter λ > 0 is introduced to balance F and the summation of the squared weights [43]. In this study, the kernel width σ is simply set to 1. The objective function can then be generalized as

F(w) = Σ_{i=1}^{n} p_i − λ Σ_{r=1}^{d} w_r². (7)

To find the proper value of λ and determine the dimension of the selected features, the leave-one-subject-out procedure is applied again as follows:
1. Partition the EEG feature data into K subsets, where each subset contains the EEG data of one subject;
2. For each fold, train an NCA model on K − 1 subsets and validate the trained model on the remaining subset;
3. Return the classification loss, defined as the mean square error, for the current fold;
4. Repeat steps (2)-(3) over all folds and find the lowest loss and the corresponding optimal value of λ;
5. Perform NCA feature selection according to the optimal λ.
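For a fixed weight vector w, the quantities above can be evaluated directly. The sketch below computes the regularized NCA objective from Equations (3)-(7); the actual NCA additionally optimizes w (e.g., by gradient ascent), which is omitted here:

```python
import numpy as np

def nca_objective(X, y, w, sigma=1.0, lam=0.01):
    """Regularized NCA objective F(w) = sum_i p_i - lambda * sum_r w_r^2
    evaluated for a fixed weight vector w (Equations (3)-(7))."""
    diff = np.abs(X[:, None, :] - X[None, :, :])     # |x_ir - x_jr|, n x n x d
    D = (diff * w ** 2).sum(axis=2)                  # weighted distance D_w
    K = np.exp(-D / sigma)                           # kernel k(z) = exp(-z/sigma)
    np.fill_diagonal(K, 0.0)                         # leave-one-out: j != i
    p = K / K.sum(axis=1, keepdims=True)             # p_ij
    same = (y[:, None] == y[None, :]).astype(float)  # y_ij
    p_i = (p * same).sum(axis=1)                     # p_i
    return p_i.sum() - lam * (w ** 2).sum()          # regularized F(w)
```

A weight vector that emphasizes a class-discriminative feature should yield a larger objective than the all-zero weights, which collapse every p_ij to the uniform 1/(n − 1).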

Geodesic Flow Kernel
The geodesic is defined as the shortest local path between two points in the feature space. To find a geodesic, the source and target domains are mapped to a Grassmann manifold space [17], as shown in Figure 2. Given two points projected on the Grassmann manifold, the kernel method is used to select all the geodesic points from the source to the target domain with seamless migration. There are three steps to build a GFK model.

(1) Obtain the optimal dimension of the subspaces.
The GFK adopts the subspace disagreement measure (SDM) to find the intrinsic dimension of the subspace. Given two datasets S and T, principal component analysis (PCA) is applied to obtain the subspaces P_S and P_T, respectively. Then, a dataset S + T is created by combining S and T, and the PCA is applied again to derive the subspace P_{S+T}. The SDM is defined as

D(d) = 0.5 [sin α_d + sin β_d], (8)

where α_d (or β_d) represents the dth principal angle between P_S and P_{S+T} (or P_T and P_{S+T}) [44]. The only hyper-parameter that needs to be tuned is the dimensionality of the subspaces d. The value of d is minimized under the constraint D(d) = 1; the constraint ensures that the dth basis vector of P_S or P_T is orthogonal to that of P_{S+T}.
A larger value of d is preferred to contain more information from the fused features.
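The SDM-based choice of d can be sketched with SciPy's principal-angle routine; the PCA construction and the hypothetical `sdm` helper below are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import subspace_angles, svd

def pca_basis(X, d):
    """Top-d principal directions (columns) of the centered data X (n x D)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = svd(Xc, full_matrices=False)
    return Vt[:d].T

def sdm(XS, XT, d):
    """Subspace disagreement measure D(d) = 0.5 * (sin(alpha_d) + sin(beta_d)),
    where alpha_d, beta_d are the d-th principal angles between the source
    (resp. target) subspace and the subspace of the combined data."""
    PS, PT = pca_basis(XS, d), pca_basis(XT, d)
    PST = pca_basis(np.vstack([XS, XT]), d)
    alpha = np.sort(subspace_angles(PS, PST))[-1]   # largest = d-th angle
    beta = np.sort(subspace_angles(PT, PST))[-1]
    return 0.5 * (np.sin(alpha) + np.sin(beta))
```

In practice one would evaluate `sdm` over increasing d and keep the smallest d at which D(d) reaches 1.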
After implementing the PCA, all d-dimensional subspaces are embedded into a manifold H, where the terms S and T represent the subspaces of the source and target domains, respectively. H can be regarded as the set of all d-dimensional subspaces: every possible d-dimensional subspace corresponds to a point on H. Thus, a geodesic between two points forms a path between two subspaces.
Suppose that the subspaces of the source and target domains are connected by a geodesic mapping function Φ parameterized by t in the interval [0, 1], located at the two poles Φ(0) = P_S and Φ(1) = P_T of the manifold space. Let R_S ∈ R^{D×(D−d)} denote the orthogonal complement to P_S, i.e., R_S^T P_S = 0. For a point t within [0, 1], the corresponding mapping function Φ(t) can be computed as [44,45]

Φ(t) = P_S U_1 Γ(t) − R_S U_2 Σ(t), (11)

where U_1 and Γ are elicited from the singular value decomposition (SVD) P_S^T P_T = U_1 Γ V^T, while U_2 and Σ are computed via R_S^T P_T = −U_2 Σ V^T. The ith diagonal element of Γ(t) can be represented as cos(tθ_i) while that of Σ(t) is sin(tθ_i), where θ_i denotes the ith principal angle between P_S and P_T, with 0 ≤ θ_1 ≤ θ_2 ≤ ... ≤ θ_d ≤ π/2. For two vectors x_i and x_j, their projections over the whole flow Φ(t) can be represented as infinite-dimensional vectors z_i^∞ and z_j^∞. The inner product of z_i^∞ and z_j^∞ defines the geodesic flow kernel,

⟨z_i^∞, z_j^∞⟩ = ∫_0^1 (Φ(t)^T x_i)^T (Φ(t)^T x_j) dt = x_i^T G x_j, (12)

where G ∈ R^{D×D} can be calculated by the following closed form [6]

G = [P_S U_1, R_S U_2] [Λ_1 Λ_2; Λ_2 Λ_3] [U_1^T P_S^T; U_2^T R_S^T], (13)

in which Λ_1, Λ_2 and Λ_3 are diagonal matrices with elements λ_1i = 1 + sin(2θ_i)/(2θ_i), λ_2i = (cos(2θ_i) − 1)/(2θ_i) and λ_3i = 1 − sin(2θ_i)/(2θ_i). Thus, through the GFK mapping of Equation (13), the source domain features are transformed onto the Grassmann manifold, and the mapping matrix √G can be computed from the SVD of G.
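A sketch of the closed-form computation of G in Equation (13), following the construction above (the numerical safeguards for near-zero principal angles are our own additions):

```python
import numpy as np
from scipy.linalg import svd, null_space

def gfk_matrix(PS, PT):
    """Closed-form geodesic flow kernel G from orthonormal source/target
    subspace bases PS, PT of shape (D, d)."""
    RS = null_space(PS.T)                   # orthogonal complement: RS^T PS = 0
    A = PS.T @ PT                           # d x d
    B = RS.T @ PT                           # (D-d) x d
    U1, gamma, Vt = svd(A)                  # A = U1 diag(cos theta) V^T
    theta = np.maximum(np.arccos(np.clip(gamma, 0.0, 1.0)), 1e-8)
    # Columns of B V are orthogonal with norms sin(theta), so U2 follows from
    # R_S^T P_T = -U2 diag(sin theta) V^T:
    U2 = -(B @ Vt.T) / np.sin(theta)
    l1 = 1.0 + np.sin(2 * theta) / (2 * theta)
    l2 = (np.cos(2 * theta) - 1.0) / (2 * theta)
    l3 = 1.0 - np.sin(2 * theta) / (2 * theta)
    Omega = np.hstack([PS @ U1, RS @ U2])   # D x 2d
    Lam = np.block([[np.diag(l1), np.diag(l2)],
                    [np.diag(l2), np.diag(l3)]])
    return Omega @ Lam @ Omega.T            # symmetric PSD, D x D
```

For identical subspaces all principal angles vanish and G reduces to 2 P_S P_S^T, which serves as a quick sanity check of an implementation.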

Dynamical Feature Selection and Performance Evaluation of the MF-DFS
The aim of feature selection is to find a relevant feature subset with lower dimensionality and less noise. In this work, we propose a novel DFS method to find the EEG variables most informative about variations of the emotions. The DFS is developed based on the recursive feature elimination (RFE) approach. The RFE was proposed by Guyon et al. and was originally used for the gene selection task [46]. The RFE is a wrapper-based feature selection method and adopts a sequential backward elimination strategy.
The aim of the DFS is to reduce the differences between individuals and achieve a more consistent probability distribution across the source and target domains. It is noted that the DFS must be implemented with a predefined emotion classifier, which is used to generate the weights of the fused features. The input feature matrix X_DFS of the DFS is calculated as

X_DFS = X_NCA √G, (14)

where X_NCA (whose rows are instances) is elicited by the NCA-based feature selection and √G is derived based on Equation (13).
The procedure for implementing the DFS is summarized as follows.

1. Perform the leave-one-subject-out training and testing procedure;
2. Select a CL or DE feature set from a database with N subjects and compute the corresponding feature matrix X_NCA;
3. Define a testing set, where the EEG data are drawn from a specific subject;
4. Train a predefined emotion classifier with the learning algorithm L on the remaining N − 1 subjects' EEG data; the dimension of the EEG feature vector is denoted as n;
5. Rank the features according to the feature weights of the trained classifier;
6. Remove the feature with the lowest weight and update the feature matrix;
7. Retrain the SVM classifier based on the current feature matrix and update the weights;
8. Repeat steps (6)-(7) until all features have been removed;
9. Generate a feature ranking according to the order of the feature removal: the first (or last) removed feature receives the lowest (or highest) rank;
10. Given the classifier, compute n classification accuracies: the 1st accuracy is obtained when only the top-ranked feature is used to train the classifier, the 2nd when the top two features are used, and the nth when all features are used;
11. Determine the optimal feature combination as the one corresponding to the highest accuracy in step (10);
12. Repeat steps (3)-(11) for all testing subjects.
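The inner rank-remove-retrain loop of this procedure amounts to recursive feature elimination; a minimal stand-in using scikit-learn's RFE with a linear SVM on synthetic data (random placeholders for the fused EEG feature matrix, not the paper's actual features):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for X_DFS: 200 instances, 20 features, 5 informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Eliminate one feature per step until a single feature remains, so that
# ranking_ records the full removal order (1 = last removed, most important).
selector = RFE(SVC(kernel="linear"), n_features_to_select=1, step=1)
selector.fit(X, y)
ranking = selector.ranking_
```

With `step=1` down to one feature, `ranking_` is a permutation of 1..n, i.e., exactly the removal-order ranking described in the steps above.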
The generalization capability of the proposed MF-DFS model is validated with the leave-one-subject-out cross validation paradigm combined with specific machine learning classifiers. Take the DEAP database, with EEG recorded from 32 subjects, as an example. The EEG features are first fused and selected by the MF-DFS model and then divided into 32 subsets; each subset contains the dimensionality-reduced EEG data of one individual. The machine learning classifier for the arousal/valence classes is trained on 31 subsets and validated on the remaining subset, so each subset is validated once. After all subsets are validated over 32 rounds of this training and testing procedure, the average classification accuracy is computed as the performance of the MF-DFS method. For the MAHNOB-HCI and SEED databases, 24 and 15 subsets are built according to the number of individuals, respectively.
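The leave-one-subject-out protocol maps directly onto scikit-learn's LeaveOneGroupOut, where each subject's segments form one group. The sketch below uses random placeholder features and a reduced subject count for speed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Random placeholders for the MF-DFS output; with DEAP the sizes would be
# 32 subjects x 160 segments each.
rng = np.random.default_rng(0)
n_subjects, n_per_subject, n_features = 8, 20, 10
X = rng.normal(size=(n_subjects * n_per_subject, n_features))
y = rng.integers(0, 3, size=n_subjects * n_per_subject)   # three classes
groups = np.repeat(np.arange(n_subjects), n_per_subject)  # subject labels

# Each fold trains on N-1 subjects and tests on the held-out subject.
scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                         X, y, groups=groups, cv=LeaveOneGroupOut())
mean_accuracy = scores.mean()   # one accuracy per held-out subject, averaged
```

The number of folds equals the number of subjects, matching the 32/24/15 rounds described for the three databases.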

NCA Model Selection
To achieve the optimal performance of the proposed MF-DFS method, the hyperparameter λ of the NCA is carefully determined based on the leave-one-subject-out cross validation. For each feature set (CL or DE), the average loss (mean square error, MSE) over all folds is computed. Note that the number of features selected by the NCA across different feature sets varies within an interval of 27-38. The optimal λ with the smallest MSE is applied in the NCA model. The variations of the MSE vs. the corresponding values of λ for all feature sets and emotional dimensions are shown in Figure 3. Taking Figure 3a as an example, the best loss of 0.34 is achieved at a λ value of 0.0037. In Figure 3b, the corresponding weights of the DE feature set of the DEAP database are shown. We can observe that most weight values are zeros, which identifies an irrelevant feature subset. Thus, the number of the selected features and the optimal value of λ can be simultaneously determined for each feature set of all databases according to Figure 3. It should be noted that we also adopt a threshold to control the number of the selected features when most of the weights are zeros or non-zeros.
The feature importance with respect to variations of the emotion can be interpreted by the NCA weights shown in Figure 3. By averaging weight values over all EEG channels and databases, the beta (0.1694) and gamma (0.1622) bands possess higher importance than the theta (0.0502) and alpha (0.0699) bands with respect to the valence dimension. For arousal variations, similar observations hold, with averaged weights of 0.0175, 0.0447, 0.1516 and 0.1546 for the theta, alpha, beta and gamma bands, respectively. By sorting the weights in descending order, the most important channels for valence in the beta band are Pz and O1, implying an increased cortical response in the parietal and occipital regions. The most important channels for valence in the gamma band are Fp1 and F7, which indicates an increased cortical activity in the left frontal region. For the arousal dimension, the most important features in the beta band are T7 and Fp2, while those in the gamma band are Fp1 and F8. In conclusion, the features sensitive to emotion variations are identified from the beta and gamma power in the central parietal, left occipital, frontal and left temporal regions of the scalp.

Feature Selection Performance with Different Classifiers
To further validate the performance of the MF-DFS-based ER systems, five classifiers, random forest (RF), adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) and decision tree (DT), are applied. All hyperparameters of the classifiers have been carefully selected and are listed in Table 3. The selected hyperparameters are fixed in all experimental cases. In Tables 4-6, we compare the accuracy of the CL and DE feature sets under five feature selection methods, i.e., Chi-squared-based feature selection (CSBS), mutual information-based feature selection (MI), ridge regression-based feature selection (RR), extremely random forest (ERF), and the proposed DFS. All accuracies are computed in the inter-subject manner based on the leave-one-subject-out paradigm. Training data of all classifiers are processed with the NCA and GFK; therefore, the last column of each table shows the results of the MF-DFS. In total, there are 25 combinations of feature selection methods and classifiers. For the DEAP and MAHNOB-HCI databases, the arousal and valence dimensions of the emotions are recognized. For the SEED database, only the valence dimension is evaluated since the arousal targets are unavailable.
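A reduced sketch of this feature selection × classifier grid can be assembled with scikit-learn counterparts of four of the selectors and four of the classifiers; the DFS itself, XGBoost and the NCA/GFK preprocessing are omitted here, and the data, the number of selected features (k = 10) and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.feature_selection import (SelectFromModel, SelectKBest, chi2,
                                       mutual_info_classif)
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the EEG feature sets.
X, y = make_classification(n_samples=120, n_features=40, n_informative=8,
                           n_classes=3, random_state=0)

selectors = {
    "CSBS": SelectKBest(chi2, k=10),              # chi2 needs non-negative X
    "MI":   SelectKBest(mutual_info_classif, k=10),
    "RR":   SelectFromModel(RidgeClassifier(), max_features=10,
                            threshold=-np.inf),
    "ERF":  SelectFromModel(ExtraTreesClassifier(random_state=0),
                            max_features=10, threshold=-np.inf),
}
classifiers = {
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# One pipeline per (selector, classifier) pair; the scaler keeps chi2 valid.
scores = {}
for s_name, sel in selectors.items():
    for c_name, clf in classifiers.items():
        pipe = make_pipeline(MinMaxScaler(), sel, clf)
        scores[(s_name, c_name)] = pipe.fit(X, y).score(X, y)
```

In the paper's full grid each score would instead come from leave-one-subject-out evaluation rather than training accuracy.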
From the tables, it is shown that the classification accuracies of the DFS combined with all five classifiers are significantly higher than those of the other four feature selection methods for all three databases. Moreover, it can be found that the DFS combined with the RF possesses the highest average classification accuracy (0.4236). In addition, comparing the two feature sets across the three databases, the CL feature set has a higher average classification accuracy than the DE feature set for both arousal and valence dimensions. The average accuracy of the CL feature set on the DEAP database is 0.4470, and that of the DE feature set is 0.4381. For the MAHNOB-HCI database, the average accuracies of the two feature sets are 0.4178 and 0.4062. For the SEED database, they are 0.3465 and 0.3450. It implies that the emotions in the DEAP database possess higher distinguishability.
In Tables 4-6, the five classifiers were validated with different feature selection and fusion techniques. For the DEAP database, the AdaBoost combined with the MF-DFS achieves the optimal performance on both the classical and differential entropy features for the arousal and valence dimensions. The improvement, averaged over all cases, is approximately 0.2-1.5% against the other combinations in Table 4. For the MAHNOB-HCI database, the RF combined with the MF-DFS is superior to the other cases with an accuracy improvement of 1-3.7%. For the SEED database, an improvement of 0.6-1.8% with the DT is observed. Overall, the RF model outperforms the other classifiers averaged over all three tables. The potential reason lies in two aspects. First, the RF employs a group of member classifiers to build a classification committee by majority voting, with only one hyperparameter to tune, i.e., the number of member classifiers. With a proper number of member classifiers, its fitting capability can be superior to that of the DT with only a single classification model. Second, the RF applies random sampling simultaneously to training instances and input EEG features, which differs from the AdaBoost and the classical ensemble method that adopt instance sampling only. The training subset can be built with the bootstrap approach using a lower feature dimension, which potentially reduces the overfitting of the member decision trees.
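The sampling difference noted above can be illustrated with scikit-learn: a random forest resamples both training instances (bootstrap) and candidate features at each split, while a plain bagging ensemble of trees resamples instances only. Data and hyperparameters here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Synthetic 3-class data standing in for the EEG features.
X, y = make_classification(n_samples=200, n_features=30, n_informative=6,
                           n_classes=3, random_state=0)

# Random forest: bootstrap over instances AND a random feature subset
# (max_features="sqrt") considered at every split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)

# Classical bagging of decision trees (the default base estimator):
# bootstrap over instances only, all features available at each split.
bag = BaggingClassifier(n_estimators=100, bootstrap=True,
                        random_state=0).fit(X, y)

rf_acc, bag_acc = rf.score(X, y), bag.score(X, y)
```

The extra feature-level randomness decorrelates the member trees, which is the overfitting-reduction argument made above.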

Statistical Test of Feature Selection Performance
In Figures 4-6, we show the accuracy distributions of the RF classifier combined with the five feature selection methods for the three databases. In Figure 4, the MF-DFS achieves the optimal accuracy on both the CL and DE feature sets. For the CL feature set, the MF-DFS combined with the RF classifier achieves a recognition accuracy of 0.48 and 0.50 for the valence and arousal dimensions, respectively. For the DE feature set, the corresponding accuracies are also 0.48 and 0.50. In Figure 5, for the MAHNOB-HCI database, the accuracy of the CL feature set is 0.54/0.48 (valence/arousal) and that of the DE feature set is 0.51/0.47 (valence/arousal). In Figure 6, for the SEED database, the valence accuracies of the CL and DE feature sets are both 0.40. It indicates that the proposed MF-DFS method achieves a higher median accuracy than the other feature selection methods. It should be noted that the accuracies of the MF-DFS have a larger variance for the arousal dimension of the DEAP database, implying that the cross-subject classification fails on specific individuals.
In Figure 7, the results of the paired t-test are shown to examine whether the improvement of the MF-DFS over the other feature selection methods is significant. To achieve a fair comparison, the ER classifier is fixed as the AdaBoost. It can be observed that the MF-DFS significantly outperforms the remaining four feature selection methods with p < 0.05 for the valence dimension and for the CL feature set of the arousal dimension. The difference between the DFS and the other four feature selection methods is insignificant for the DE feature set of the arousal dimension. In Figures 8 and 9, a significant improvement of the MF-DFS is observed for the MAHNOB-HCI and SEED databases across all cases and feature sets. Note: The highest accuracy in each row is shown in boldface. The terms CL and DE denote the classical and differential entropy feature sets, respectively. The standard deviation is listed in brackets.
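The paired t-test itself can be sketched as follows, with synthetic per-subject accuracy vectors standing in for the leave-one-subject-out results of the MF-DFS and one competing feature selection method under the same AdaBoost classifier.

```python
import numpy as np
from scipy import stats

# Synthetic per-subject LOSO accuracies; the MF-DFS vector is constructed
# to exceed the competitor by ~0.03 on average, purely for illustration.
rng = np.random.default_rng(1)
acc_mfdfs = rng.normal(0.48, 0.05, size=32)          # e.g. 32 DEAP subjects
acc_other = acc_mfdfs - rng.normal(0.03, 0.01, size=32)

# Paired test: the same subjects are evaluated under both methods,
# so the differences (not the raw accuracies) carry the signal.
t_stat, p_value = stats.ttest_rel(acc_mfdfs, acc_other)
significant = p_value < 0.05
```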

Performance Comparison between the MF-DFS and Original EEG Features
In this section, we compare the MF-DFS with the baseline condition, where the extracted classical and differential entropy features are directly fed to the classifiers. The derived cross-subject emotion classification accuracy is presented in Tables 7-9. For the DEAP database, the average classification accuracy over the valence and arousal dimensions, all five classifiers and both feature sets is improved by 5.41%. For the MAHNOB-HCI database, the accuracy of the classical feature set is increased by 8.30% and that of the differential entropy feature set by 8.24%. For the SEED database, the average classification accuracies of the classical and differential entropy feature sets are improved by 3.82% and 5.95%, respectively. The results show the competency of the MF-DFS model against the case using the original EEG features without proper feature selection. Note: The highest accuracy in each row for the arousal or valence dimension is in boldface. The terms CL and DE indicate that the classical and differential entropy features are used for deriving the accuracy.

Discussion
In this study, a three-class ER system based on the NCA-GFK feature fusion and the DFS feature selection has been proposed. Specifically, the ER system is developed under the cross-subject paradigm. Due to individual differences among the subjects in each database, it is difficult for such an ER system to achieve a satisfactory generalization capability. In practice, large amounts of EEG data from a specific subject are difficult to acquire. Thus, efficiently identifying EEG features related to emotional variations that remain relevant across different individuals is particularly critical.
The MF-DFS combines a manifold feature fusion technique and a dynamical feature selection approach to achieve domain adaptation and knowledge transfer of the EEG statistics. It successfully reduces the feature dimension and improves transferability across the classical PSD and differential entropy feature sets of different individuals. In particular, the leave-one-subject-out accuracy of the proposed DFS significantly outperforms four competitive feature selection methods. The potential reason is the introduction of a dynamical feature filter that adapts to the personality of the EEG distribution of each user of the brain-computer interface. The fairness of the comparison is ensured since both the NCA and GFK are leveraged to discover the proper manifold for all feature selection methods. The analysis of the cross-subject classification accuracy shows that the proposed MF-DFS is capable of improving individual-independent emotion recognition on three different physiological databases.
The limitations of the proposed ER framework mainly manifest in the following two points. (1) The cross-subject emotion classification accuracy is still lower than 50% for the three-class cases, which remains an obstacle to practical online implementation of the algorithm. (2) The DFS is essentially a dynamic recursive process, which induces a high computational cost for finding the most relevant features.

Conclusions
In this study, we proposed an EEG feature selection method termed the MF-DFS, specifically designed for cross-subject emotion recognition. The MF-DFS adopts the merits of local geometrical information-based feature selection (NCA), manifold estimation with domain adaptation (GFK) and dynamical feature selection to boost the performance of emotion classifiers. We validated the MF-DFS on the classical and differential entropy feature sets from three EEG databases: the DEAP, MAHNOB-HCI and SEED. We observed that the MF-DFS significantly outperforms classical feature selection methods with five machine learning classifiers, which partially demonstrates its generalization capability and transferability for inter-individual EEG feature selection. Future work will focus on reducing the computational cost of its recursive procedure and further improving its usability in practical applications.