Coupled Projection Transfer Metric Learning for Cross-Session Emotion Recognition from EEG

Abstract: Distribution discrepancies between different sessions greatly degrade the performance of video-evoked electroencephalogram (EEG) emotion recognition. These discrepancies arise because the EEG signal is weak and non-stationary; they are manifested across different trials within each session, and even among trials belonging to the same emotion. To this end, we propose a Coupled Projection Transfer Metric Learning (CPTML) model to jointly complete domain alignment and graph-based metric learning, a unified framework that simultaneously minimizes cross-session and cross-trial divergences. By experimenting on the SEED_IV emotional dataset, we show that (1) CPTML exhibits significantly better performance than several other approaches; (2) the cross-session distribution discrepancies are minimized and the emotion metric graph across different trials is optimized in the CPTML-induced subspace, indicating the effectiveness of data alignment and metric exploration; and (3) critical EEG frequency bands and channels for emotion recognition are automatically identified from the learned projection matrices, providing more insights into the occurrence of the affective effect.


Introduction
Endowing machines with emotional intelligence is indispensable for natural human-machine interactions, making machines more humanized in communication [1,2]. EEG directly manifests the electrical activities of the human cerebral cortex, providing an objective and reliable approach for emotion recognition [3]. Nowadays, EEG-based emotion recognition has received increasing attention from researchers due to its inherent characteristics, such as being noninvasive, inexpensive and easy to use. In view of these advantages, EEG-based emotion recognition has potential applications in diverse fields such as healthcare, education, entertainment and neuromarketing [4-6].
A typical EEG-based emotion recognition system is shown in Figure 1, which is usually composed of three stages. First, emotional video clips are played to healthy subjects to evoke the corresponding emotional states, while raw EEG data are recorded using EEG acquisition devices. Second, the raw EEG data are preprocessed, including artifact removal and down-sampling. Third, features are extracted and then fed into a model to conduct emotion recognition. In this paper, we mainly focus on the third stage. In the past decade, many emotion recognition models ranging from machine learning to deep learning have been proposed [3,7]. For example, Shen et al. designed a multi-scale frequency band ensemble learning model for EEG-based emotion classification, which effectively combined information from frequency bands of different scales and further enhanced the performance [8]. Nevertheless, the EEG signal is weak and non-stationary, such that even for the same subject, the distributions of EEG samples collected in different sessions do not match well, which is generally known as cross-session discrepancies [10,11]. Figure 2a demonstrates a simple setting: EEG samples of session 1 are collected on one day, and those of session 2 are collected on another day. Due to the distribution discrepancies between these two sessions, even a classifier well trained on EEG samples from session 1 may not achieve promising performance on session 2. To solve this problem, unsupervised domain adaptation (UDA) [12] has been introduced, which treats EEG samples of session 1 as the source domain and those of session 2 as the target domain, and then seeks a shared subspace to minimize the distribution discrepancies, thus improving the generalization ability of the source classifier, as shown in Figure 2b. Following the UDA paradigm, Zheng et al.
built several personalized EEG emotion recognition models by exploiting shared features and transferable model parameters between both domains [13]. Li et al. enforced the latent representations of the two domains to be similar and minimized the classification error on the source domain [14]. These methods improved cross-session emotion recognition performance in comparison with non-transfer methods. However, there are still some limitations in existing approaches. On the one hand, they only focus on the cross-session discrepancies of EEG while ignoring the cross-trial differences. Generally, each session consists of multiple trials, each of which corresponds to a certain emotional state. Strictly speaking, every trial is a slightly new task [10]. That is, there are differences among EEG samples from different trials even with the same emotional state, which are called cross-trial differences. For example, suppose there are six trials in each session with three emotional states; we build a similarity graph to depict the connections of these EEG samples [15], as shown in Figure 3a. Obviously, we obtain 12 rather than three diagonal blocks in the similarity graph, which is inconsistent with the ideal graph structure, in which the number of diagonal blocks should equal the number of emotional classes [16]. That is, the cross-trial differences of EEG are large, and the similarity graph cannot well depict the underlying emotional states. Therefore, we should learn a more reliable emotion metric which makes EEG samples belonging to the same emotional state as interconnected as possible, whose corresponding similarity graph is shown in Figure 3b, such that the recognition performance can be greatly improved.
On the other hand, most transfer methods mainly pursue higher emotion recognition accuracy, and only a few of them further explore what has been learned by the models. For example, Cui et al. developed a convolutional neural network combined with an interpretation technique for cross-subject driver drowsiness recognition [17]. They designed an interpretation technique to reveal the relevant regions of the input signals that were important for prediction. However, their model required extra parameters to discover common patterns of mental states across different subjects, which was time-consuming. In contrast, we directly explore the properties of the coupled projection matrices without extra parameters. The rationale is that, from the perspective of transfer learning, the learned projection matrices mainly extract domain-invariant features by strengthening the common components between domains while weakening the non-common components. Therefore, they may reveal common information of the two sessions that does not change over time. That is, we can investigate them further to explore the stable EEG patterns related to cross-session emotion expression. To address both issues, we propose a model termed Coupled Projection Transfer Metric Learning (CPTML) for cross-session EEG emotion recognition, which not only improves the emotion recognition accuracy by minimizing both the cross-session and cross-trial discrepancies of EEG data, but also automatically reveals the stable EEG patterns of emotion from the learned subspace. Generally, CPTML projects data from the two domains into respective subspaces by coupled projection matrices. Then, it unifies the domain alignment and the graph-based metric learning into a single objective to jointly optimize the two subspaces. The main contributions of this paper are summarized below.
• We propose a transfer metric learning method to address two critical issues in cross-session EEG emotion recognition. Specifically, for the cross-session discrepancies, domain alignment is proposed to extract domain-invariant features, which aligns the distributions of the two sessions; for the cross-trial differences, graph-based metric learning is designed to learn discriminative features, which connects EEG samples from different trials that belong to the same emotional state. Extensive experimental results explicitly demonstrate the effectiveness of these two modules.
• We combine the domain alignment and the graph-based metric learning into a unified framework. On the one hand, better aligned data can provide more accurate sample similarity measures for the subsequent graph-based metric learning; on the other hand, the emotionally discriminative features learned by metric learning can offer more accurate target pseudo labels, which further contributes to the conditional distribution alignment of both domains. Experimental results verify that these two modules are complementary to each other.
• Apart from improving emotion recognition accuracy, our CPTML model can explore the specific EEG patterns which are stable in cross-session emotion expression. It can automatically identify the critical EEG frequency bands and channels (brain regions) by investigating the feature weighting ability of the coupled projection matrices, which not only provides insights into the occurrence of the affective effect, but also provides theoretical guidance for engineers designing wearable devices for emotion-related EEG acquisition.
The remaining contents are arranged as follows. In Section 2, we introduce the detailed formulation and optimization of our proposed CPTML model. Experiments evaluating the efficacy of CPTML in EEG emotion recognition are presented in Section 3. Section 4 concludes the paper.

Problem Definition
Suppose we have labeled EEG samples from one session, $\{X_s, Y_s\} = \{(x_{si}, y_{si})\}_{i=1}^{n_s}$, denoted as the source domain $D_s$, and unlabeled EEG samples from the other session, $X_t = \{x_{ti}\}_{i=1}^{n_t}$, denoted as the target domain $D_t$. $C$ is the number of emotional states, and $n_s$ and $n_t$ are the numbers of samples in the source and target domains, respectively. The feature space and label space of both domains are the same, i.e., $\mathcal{X}_s = \mathcal{X}_t$ and $\mathcal{Y}_s = \mathcal{Y}_t$; however, their marginal distributions and conditional distributions differ due to the domain shift: $P_s(X_s) \neq P_t(X_t)$ and $P_s(Y_s|X_s) \neq P_t(Y_t|X_t)$.
As shown in Figure 4, we project the source and target domain data into respective subspaces by two matrices, and then minimize the discrepancies between the projected data of the two domains. Suppose $A \in \mathbb{R}^{d \times d_f}$ is the projection matrix for the source domain and $B \in \mathbb{R}^{d \times d_f}$ is that for the target domain, where $d$ is the original feature dimension and $d_f$ is the dimension of the corresponding subspaces. Then, the projected data of the two domains can be represented as $A^T X_s$ and $B^T X_t$, respectively.

Domain Alignment
Due to the distribution discrepancies of EEG from different sessions, we simultaneously minimize the marginal and conditional distribution divergences between their projected representations according to the Maximum Mean Discrepancy (MMD) criterion [18]. In detail, marginal distribution alignment can be achieved by minimizing the distance between the sample means of the two domains; that is,

$$F_{DistM} = \min_{A,B} \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} A^T x_{si} - \frac{1}{n_t} \sum_{i=1}^{n_t} B^T x_{ti} \right\|^2.$$

Similarly, conditional distribution alignment aims to minimize the distance between the sample means belonging to the same class in the two domains; that is,

$$F_{DistC} = \min_{A,B} \sum_{c=1}^{C} \left\| \frac{1}{n_s^c} \sum_{x_{si} \in D_s^c} A^T x_{si} - \frac{1}{n_t^c} \sum_{x_{ti} \in D_t^c} B^T x_{ti} \right\|^2,$$

where $n_s^c$ and $n_t^c$ denote the numbers of samples belonging to the $c$-th ($c = 1, \dots, C$) emotional state in the source and target domains, respectively. Since the label information of the target domain is not available, we utilize pseudo labels to estimate the conditional distribution in the target domain, similar to [19,20]. Target pseudo labels are predicted by the classifier trained on source domain data and updated over the iterations.
For simplicity, we combine $F_{DistM}$ and $F_{DistC}$ with the same weight. Thus, the joint distribution alignment is formulated as

$$F_{Dist} = F_{DistM} + F_{DistC}. \quad (3)$$

For clarity, we rewrite Equation (3) in matrix form as

$$F_{Dist} = \min_{A,B} \mathrm{Tr}\!\left( [A;B]^T \begin{bmatrix} X_s W_{ss} X_s^T & X_s W_{st} X_t^T \\ X_t W_{ts} X_s^T & X_t W_{tt} X_t^T \end{bmatrix} [A;B] \right),$$

where $W_{ss}$, $W_{st}$, $W_{ts}$ and $W_{tt}$ are the MMD coefficient matrices constructed from the class indicators and the all-one column vectors $1_s \in \mathbb{R}^{n_s}$ and $1_t \in \mathbb{R}^{n_t}$. Additionally, to avoid too much divergence between the two subspaces, we minimize the distance between them as

$$F_{Sub} = \min_{A,B} \|A - B\|_F^2.$$
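To make the alignment term concrete, the following sketch evaluates the joint MMD objective for given projected data, using target pseudo labels for the conditional term. The function names, array shapes and integer label encoding are illustrative assumptions of ours, not the authors' implementation; only the equal weighting of the marginal and conditional terms follows the description above.

```python
import numpy as np

def linear_mmd(Zs, Zt):
    """Squared distance between the sample means of two projected domains.
    Zs: (d_f, n_s) projected source data; Zt: (d_f, n_t) projected target data."""
    return float(np.sum((Zs.mean(axis=1) - Zt.mean(axis=1)) ** 2))

def joint_mmd(Zs, ys, Zt, yt_pseudo, n_classes):
    """Marginal MMD plus class-conditional MMD with equal weights (F_DistM + F_DistC),
    where the conditional term relies on target pseudo labels."""
    f_dist = linear_mmd(Zs, Zt)          # marginal term F_DistM
    for c in range(n_classes):           # conditional term F_DistC
        s_mask, t_mask = (ys == c), (yt_pseudo == c)
        if s_mask.any() and t_mask.any():
            f_dist += linear_mmd(Zs[:, s_mask], Zt[:, t_mask])
    return f_dist
```

As a sanity check, identical domains with identical labels give a joint MMD of zero, and any constant shift of the target data increases both terms.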

Graph-Based Metric Learning
Though the discrepancies between the source and target domains have been minimized, it still cannot be guaranteed that the projected data of the two domains are discriminative enough for emotion recognition. As described in Section 1, the discrepancies between different trials are large, and EEG samples with the same emotional state cannot connect well with each other in the graph, leading to degenerated recognition performance. Therefore, we design a graph-based metric learning method to encourage intra-class EEG sample pairs to stay close while inter-class ones stay apart. Specifically, for each domain, we leverage two kinds of graphs, i.e., an intrinsic graph and a penalty graph, to preserve the distance relationships between sample pairs [21,22]. In the intrinsic graph, each EEG sample is connected with its $k_{intra}$ nearest samples within the same class in the same domain, while in the penalty graph, each EEG sample is connected with its $k_{inter}$ nearest samples from different classes in the same domain. Then, we calculate the Laplacian matrix for each graph. Inspired by the Fisher criterion [23], it is intuitive to compact samples in the same class and separate samples in different classes, which can be achieved by

$$F_{Metric} = \min_{A,B} \, \mathrm{Tr}(A^T X_s L_I^s X_s^T A) - \mathrm{Tr}(A^T X_s L_P^s X_s^T A) + \mathrm{Tr}(B^T X_t L_I^t X_t^T B) - \mathrm{Tr}(B^T X_t L_P^t X_t^T B),$$

where $L_I^s \in \mathbb{R}^{n_s \times n_s}$ and $L_P^s \in \mathbb{R}^{n_s \times n_s}$ represent the Laplacian matrices of the intrinsic and penalty graphs in the source domain, respectively; $L_I^t \in \mathbb{R}^{n_t \times n_t}$ and $L_P^t \in \mathbb{R}^{n_t \times n_t}$ are those in the target domain. The Laplacian matrix is calculated as $L = D - (S^T + S)/2$, where $S$ is the weight matrix and $D_{ii} = \sum_j S_{ij}$. In this paper, the weight matrix $S$ is computed by the "HeatKernel" function; i.e., $S_{ij} = \exp(-\|x_i - x_j\|^2/2)$ if $x_i$ and $x_j$ are connected, and $S_{ij} = 0$ otherwise.
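The graph construction above can be sketched as follows for one domain. The heat-kernel weights and the Laplacian $L = D - (S^T + S)/2$ follow the text; the neighbour-selection loop is our own minimal reading of the $k_{intra}$/$k_{inter}$ rule, not the authors' code.

```python
import numpy as np

def heat_kernel(xi, xj):
    return np.exp(-np.sum((xi - xj) ** 2) / 2.0)

def graph_laplacians(X, y, k_intra=5, k_inter=5):
    """Intrinsic/penalty graph Laplacians for one domain.
    X: (n, d) samples, y: (n,) labels. Returns (L_I, L_P)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    S_I, S_P = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        # connect each sample to its k_intra nearest same-class neighbours ...
        for j in same[np.argsort(dist[i, same])][:k_intra]:
            S_I[i, j] = heat_kernel(X[i], X[j])
        # ... and to its k_inter nearest different-class neighbours
        for j in diff[np.argsort(dist[i, diff])][:k_inter]:
            S_P[i, j] = heat_kernel(X[i], X[j])
    def laplacian(S):
        D = np.diag(S.sum(axis=1))     # D_ii = sum_j S_ij
        return D - (S + S.T) / 2.0
    return laplacian(S_I), laplacian(S_P)
```

Because the symmetrized weight matrix $(S + S^T)/2$ is subtracted from a diagonal $D$, each Laplacian is symmetric, which is what the trace terms in $F_{Metric}$ expect.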

Overall Objective Function
As stated previously, domain alignment and graph-based metric learning are complementary to each other. On the one hand, domain alignment offers better aligned data, based on which more accurate sample-pair distances and class information can be obtained for the subsequent graph-based metric learning; on the other hand, the emotionally discriminative features learned by graph-based metric learning help to predict more accurate target labels, which is beneficial for better aligning the conditional distributions. Therefore, we integrate them into a unified framework.
Further, we impose two constraints on the target subspace, as [24,25] did. First, to avoid features of the target domain data being projected into irrelevant dimensions, we maximize the variance of the target projected data by

$$F_{Cons1} = \max_B \mathrm{Tr}(B^T X_t H_t X_t^T B),$$

where $H_t = I_t - \frac{1}{n_t} 1_t 1_t^T$ is the centering matrix and $I_t \in \mathbb{R}^{n_t \times n_t}$ is the identity matrix. Second, we control the scale of the target subspace by $F_{Cons2} = \min_B \mathrm{Tr}(B^T B)$. Thus, the final objective function of CPTML can be formulated as

$$F = \max_{A,B} \frac{F_{Cons1}}{F_{Dist} + \lambda F_{Metric} + \gamma F_{Sub} + \mu F_{Cons2}},$$

where $\lambda$, $\gamma$ and $\mu$ are trade-off parameters. By denoting $H = [A; B] \in \mathbb{R}^{2d \times d_f}$, it can be rewritten in the compact matrix form

$$F = \max_H \frac{\mathrm{Tr}(H^T Q H)}{\mathrm{Tr}(H^T P H)},$$

where $Q \in \mathbb{R}^{2d \times 2d}$ contains the target variance term, $P \in \mathbb{R}^{2d \times 2d}$ collects the alignment, metric and constraint terms, $I \in \mathbb{R}^{d \times d}$ is the identity matrix, and $0 \in \mathbb{R}^{d \times d}$ is the zero matrix filling the corresponding off-diagonal blocks.

Optimization
The two projection matrices $A$ and $B$ are the variables to be optimized in the final objective, which is a generalized Rayleigh quotient and can be optimized by generalized eigenvalue decomposition. Denoting $H = [A; B]$ and writing the objective as $\max_H \mathrm{Tr}(H^T Q H)/\mathrm{Tr}(H^T P H)$, where $Q$ and $P$ collect, respectively, the target variance term and the alignment, metric and constraint terms, we introduce a Lagrange multiplier $\Phi$, and the Lagrangian function can be written as

$$L(H, \Phi) = \mathrm{Tr}(H^T Q H) + \mathrm{Tr}\big((I - H^T P H)\Phi\big).$$

Setting its derivative with respect to $H$ to zero yields the generalized eigendecomposition problem

$$Q H = P H \Phi,$$

where $\Phi = \mathrm{diag}(\phi_1, \dots, \phi_{d_f})$ and $\phi_1, \dots, \phi_{d_f}$ are the $d_f$ largest eigenvalues of the above eigendecomposition problem, and $H = [H_1, \dots, H_{d_f}]$ contains the corresponding $d_f$ eigenvectors. Once the matrix $H$ is solved, the optimal projection matrices $A$ and $B$ can be obtained. The procedure of the proposed CPTML method is summarized in Algorithm 1.
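Assuming the solver step amounts to the generalized eigenproblem $Q h = \phi P h$ with $H = [A; B]$ and keeping the $d_f$ largest eigenvalues, it can be sketched with SciPy as below. The small regularization constant is an implementation detail we add for numerical stability; it is not part of the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def solve_projections(P, Q, d, d_f):
    """Solve Q h = phi * P h and keep the eigenvectors of the d_f largest
    eigenvalues.  P, Q are (2d, 2d); the stacked solution H = [A; B] is split
    into the source and target projection matrices."""
    P = P + 1e-6 * np.eye(2 * d)   # regularize so P is positive definite
    phi, H = eigh(Q, P)            # scipy returns eigenvalues in ascending order
    H = H[:, ::-1][:, :d_f]        # keep the d_f largest
    A, B = H[:d], H[d:]            # split H = [A; B]
    return A, B
```

With a diagonal toy problem the selected eigenvector is simply the coordinate axis of the largest diagonal entry of $Q$, which makes the routine easy to verify.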

Computational Complexity
The computational complexity of CPTML consists of the following two parts. First, computing $W_{ss}$, $W_{st}$, $W_{ts}$, $W_{tt}$, $R_{ss}$ and $R_{tt}$ in step 3 of Algorithm 1 costs $O(TCn^2)$, where $n = n_s + n_t$. Second, solving the generalized eigenvalue decomposition problem in step 4 of Algorithm 1 costs $O(Tdd_f^2)$. Overall, the computational complexity of CPTML is $O(TCn^2 + Tdd_f^2)$. Specifically, in this paper, $T$ is less than 30, which is enough to ensure convergence of the algorithm, and $d_f \ll d \ll n$.

Algorithm 1
The procedure of the CPTML framework.
Input: Source domain data $\{X_s, Y_s\}$; target domain data $X_t$; trade-off parameters $\lambda$, $\gamma$, $\mu$; subspace dimension $d_f$; maximum number of iterations $T$.
Output: Projection matrices $A$ and $B$; target label predictions $\hat{Y}_t$.
1: Initialize the target pseudo labels $\hat{Y}_t$ with a classifier trained on the source domain data;
2: while not converged do
3: Compute $W_{ss}$, $W_{st}$, $W_{ts}$, $W_{tt}$, $R_{ss}$ and $R_{tt}$ according to Equations (17) and (18);
4: Solve the generalized eigenvalue decomposition problem to obtain $H = [A; B]$;
5: Train a new classifier $f$ by $\ell_2$-LSR on the updated source projected data $\{A^T X_s, Y_s\}$; then update the pseudo labels of the target domain data $\hat{Y}_t = f(B^T X_t)$;
6: end while

Dataset
SEED_IV [11] is a widely used dataset for EEG-based emotion recognition. It is a video-evoked EEG dataset in which 72 carefully chosen video clips are used to elicit four desired emotional states (sad, fear, happy, and neutral). A total of 15 healthy subjects participated in the EEG data collection experiment 3 times on different days, corresponding to 3 sessions. In each session, each subject was asked to watch 24 video clips (6 video clips per emotional state) to evoke the four emotional states; that is, each session has 24 trials. While the subjects watched the video clips, their EEG data were recorded by the ESI NeuroScan system with a 62-channel cap whose electrodes are placed according to the standard 10-20 system. The differential entropy (DE) features of the EEG data are used to evaluate model performance in our experiments; they constitute the preprocessed version of the SEED_IV dataset and can be downloaded from https://bcmi.sjtu.edu.cn/home/seed/seed-iv.html (accessed on 25 March 2022). DE features have also been shown to be more stable and accurate for emotion recognition than traditional features [26,27]. Since the DE features were extracted from 5 EEG frequency bands, namely Delta (1-4 Hz), Theta (4-8 Hz), Alpha (8-14 Hz), Beta (14-31 Hz) and Gamma (31-50 Hz), over 62 channels, the data format is 62 × n × 5, where n is the number of samples for each subject in each session (approximately 830). We reshape the DE features into 310 × n by concatenating the 62 values of the 5 frequency bands into a vector, normalize them into [−1, 1] by row, and then decentralize them.
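The reshaping and normalization step can be sketched as follows, assuming the downloaded DE features arrive as a (62, n, 5) array and that the 310-dimensional vector is stacked band-major (the 62 channels of Delta first, then Theta, and so on); the stacking order and function name are our assumptions for illustration.

```python
import numpy as np

def prepare_de_features(de):
    """de: (62 channels, n samples, 5 bands) DE features.
    Returns a (310, n) matrix: band-major stacking, each feature row scaled
    to [-1, 1], then decentralized (row mean removed)."""
    n = de.shape[1]
    X = de.transpose(2, 0, 1).reshape(310, n)            # rows 0-61 = band 1, ...
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    X = 2 * (X - lo) / np.maximum(hi - lo, 1e-12) - 1    # row-wise min-max to [-1, 1]
    return X - X.mean(axis=1, keepdims=True)             # decentralize by row
```

Under this layout, row $62(j-1)+k$ of the output corresponds to band $j$ and channel $k$, which is the correspondence used later for knowledge discovery.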

Experimental Settings
To investigate the performance of CPTML in the cross-session EEG emotion recognition task, we set up the experiments as follows. For every subject, the samples and labels from one session form the labeled source domain, and the samples from another session form the target domain. Therefore, each subject has three cross-session tasks in chronological order, i.e., "session 1 → session 2", "session 1 → session 3" and "session 2 → session 3".

Recognition Results and Analysis
The results of the five models on the three cross-session EEG emotion recognition tasks are, respectively, shown in Tables 1-3, where we highlight the best accuracies in boldface. From the obtained results, we make the following observations: (1) Generally, CPTML performs the best among the five models. It obtains the highest recognition accuracies in most of the 45 cases in total and the best average accuracies of 82.16%, 83.69% and 84.68% in the three cross-session EEG emotion recognition tasks, indicating the superiority of joint data alignment and emotion metric learning. (2) The strategy of simultaneously minimizing cross-session and cross-trial discrepancies performs better than considering only one of them. Specifically, as shown in Table 1, CPTML achieves 11.98% and 10.90% improvements in comparison with JDA and SDA, which only take into account the discrepancies between different sessions. Similar phenomena can also be found in Tables 2 and 3. Additionally, compared with JOSRVFL, which mainly focuses on learning discriminative features from EEG samples, CPTML exceeds it by 6.75%, 9.44% and 5.62% in the three cross-session tasks, respectively. (3) CPTML outperforms JDASRN by 4.75%, 4.59% and 5.28% in the three tasks. The main difference between them is that CPTML employs a joint framework to complete the domain alignment and graph-based metric learning, while JDASRN uses a two-stage manner. Since there inevitably exist interactions between these two modules, our CPTML model is more powerful in approximating the global optimum. In addition to comparing the average accuracy rates of the five models, the Friedman test [31] is used to assess the statistical significance among them. It ranks all methods by combining the results on each group, and a higher ranking indicates better performance of the corresponding model. The details of the statistical test are as follows. The underlying hypothesis is that "all the models have the same performance".
Whether to accept the hypothesis is determined by the variable $\tau_F$, which is defined as

$$\tau_F = \frac{(N-1)\,\tau_{\chi^2}}{N(K-1) - \tau_{\chi^2}},$$

where $K$ is the number of models, $N$ is the number of result groups, and $\tau_{\chi^2}$ is calculated as

$$\tau_{\chi^2} = \frac{12N}{K(K+1)} \left( \sum_{i=1}^{K} r_i^2 - \frac{K(K+1)^2}{4} \right),$$

where $r_i$ is the average ranking of the $i$-th model. In this paper, $K = 5$, $N = 45$, and the average rankings of the five models are $[4.21, 3.84, 3.17, 2.47, 1.31]$, from which we obtain $\tau_F = 50.51$. At the default significance level $\alpha = 0.05$, the critical value of the Friedman test is 2.423 [32]. Since $\tau_F = 50.51$ is greater than 2.423, the hypothesis is rejected. Further, the Nemenyi test is used to distinguish whether there are significant differences among these models, and the result is shown in Figure 5. In the figure, the solid circles denote the average rankings of these models, and the length of the vertical line denotes the critical distance (CD) of the Nemenyi test, which is calculated as

$$CD = q_\alpha \sqrt{\frac{K(K+1)}{6N}},$$

where $q_\alpha$ is the critical value, equal to 2.728 when $\alpha = 0.05$ [33]. If the vertical lines of two models do not overlap, the corresponding models have statistically different performance. As shown in Figure 5, our proposed CPTML does not overlap with JDASRN, JOSRVFL, SDA and JDA, which means that it is significantly better than the other four models in the cross-session emotion recognition tasks. To give an insight into the recognition performance of CPTML on each emotional state, we use the average confusion graph to present the recognition results of the three cross-session tasks again in Figure 6. First, we obtain the average recognition accuracies of the four emotional states. For example, the average accuracies of the neutral, sad, fear and happy emotional states classified by CPTML are 90%, 79%, 81% and 83%, respectively. Second, the misclassification rates of all emotional states are explicitly provided.
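The Friedman statistic and the Nemenyi critical distance can be reproduced directly from the reported average rankings; small deviations from the reported $\tau_F = 50.51$ arise because the rankings are rounded to two decimals.

```python
import numpy as np

def friedman_tau_f(avg_ranks, n_groups):
    """Friedman statistic tau_F from the average model rankings over N groups."""
    k = len(avg_ranks)
    r = np.asarray(avg_ranks, dtype=float)
    tau_chi2 = 12 * n_groups / (k * (k + 1)) * (np.sum(r ** 2) - k * (k + 1) ** 2 / 4)
    return (n_groups - 1) * tau_chi2 / (n_groups * (k - 1) - tau_chi2)

def nemenyi_cd(q_alpha, k, n_groups):
    """Critical distance of the Nemenyi post-hoc test."""
    return q_alpha * np.sqrt(k * (k + 1) / (6 * n_groups))

# Values reported in the text: K = 5 models, N = 45 result groups.
tau_f = friedman_tau_f([4.21, 3.84, 3.17, 2.47, 1.31], n_groups=45)
cd = nemenyi_cd(2.728, k=5, n_groups=45)
```

With these inputs, `tau_f` comes out near the reported 50.51 and well above the critical value 2.423, and `cd` evaluates to about 0.91, matching the vertical-line length in the CD diagram of Figure 5.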
For example, 90% of the neutral EEG samples are correctly classified, while 3%, 4% and 3% of them are wrongly recognized as the sad, fear and happy states, respectively. Third, the neutral emotional state achieves the highest average accuracy in comparison with the others, making it the easiest state to recognize based on our experimental results.

Effect of Domain Alignment and Emotion Metric Learning
In this section, we first investigate the data alignment ability of CPTML in minimizing the domain discrepancies. As shown in Figure 7, taking subject 5 as an example and using the t-SNE [34] visualization method, we plot the distributions of the original data representation and the CPTML-induced subspace representation. From this figure, we find that the data distributions of the two domains are significantly different in the original feature space; however, the discrepancies are dramatically reduced in the new subspaces learned by CPTML. This is exactly what transfer learning is designed to do. In the learned subspaces, the source and target data share similar representations, benefiting from the joint marginal and conditional distribution alignment. This indicates that the domain alignment module of CPTML is effective in mitigating cross-session differences. Second, we illustrate the effectiveness of CPTML in graph-based metric learning. By means of the similarity matrix, which characterizes the connections among all source and target EEG samples, we can display the emotion metric graphs corresponding to the original feature space and the learned CPTML subspaces. Figure 8, which also takes subject 5 as an example, shows the learned results of its three cross-session recognition tasks. We observe that there are 48 diagonal blocks in the graphs of the original feature space, which is exactly the total number of trials in the source and target domains. According to graph theory, since the values within each block denote the affinity of corresponding sample pairs in one class, the number of diagonal blocks should theoretically be equal to the number of classes (i.e., 4 in the present research, since there are four emotional states in SEED_IV). The fact that the number of diagonal blocks in the left column of Figure 8 is 48 instead of 4 means that the cross-trial divergences are greater than those between different emotional states.
Specifically, even if two trials have the same emotional state, the similarities of samples across these two trials cannot be well built due to the large cross-trial differences. Fortunately, in the learned CPTML subspaces shown in the right column of Figure 8, the contours of the 4 diagonal blocks are significantly enhanced. This means that CPTML finds a way to appropriately build connections between EEG samples belonging to the same emotional state. We informally refer to this process as emotion metric learning, which corresponds to minimizing distances between intra-class samples and maximizing distances between inter-class samples.

Knowledge Discovery on Frequency Bands and Channels
For cross-session EEG emotion recognition, we are interested not only in the classification accuracy but also in the stable EEG patterns of cross-session emotion expression. Since the latter can offer more insights into the neural mechanisms of emotion processing, we expect the CPTML model to be competent in exploring where the cross-session stable EEG features mainly come from, i.e., identifying the critical EEG frequency bands and channels. From the perspective of transfer learning, the model seeks shared subspaces for both source and target samples; therefore, the two coupled projection matrices A and B should learn domain-invariant features by strengthening the common components between domains while weakening the non-common ones. Accordingly, we propose a quantitative approach to measure the feature weighting ability of the two projection matrices. The detailed procedures are as follows.
Due to the multi-rhythm and multi-channel properties of EEG, the sample vector is usually formed by concatenating spectral features extracted from different frequency bands. To be specific, the dimensionality $d = 310$ of the SEED_IV dataset is obtained as 62 points (corresponding to the 62 channels) for each of the 5 frequency bands (Delta, Theta, Alpha, Beta, and Gamma). Inspired by [35], the weight of each feature dimension can be quantitatively measured by the normalized $\ell_2$-norm of the corresponding row of the projection matrices. Taking $A \in \mathbb{R}^{d \times d_f}$ for example, the weight $a_i$ of the $i$-th EEG feature can be calculated as

$$a_i = \frac{\|A_{i,:}\|_2}{\sum_{j=1}^{d} \|A_{j,:}\|_2},$$

where $A_{i,:}$ is the $i$-th row of $A$ and its $\ell_2$-norm is defined as $\|A_{i,:}\|_2 = \sqrt{\sum_k A_{ik}^2}$. Once the feature weight vector $a = [a_1, a_2, \dots, a_{310}]$ is obtained, we can establish the correspondence between EEG features and frequency bands (channels), as illustrated in Figure 9. Then, the importance of each frequency band can be measured by summing the weights of the EEG features belonging to that band; that is,

$$\mathrm{Band}_j = \sum_{i=62(j-1)+1}^{62j} a_i, \quad (23)$$

where $j = 1, 2, 3, 4, 5$ denotes the five frequency bands Delta, Theta, Alpha, Beta, and Gamma, respectively. Similarly, the importance of the $k$-th EEG channel is

$$\mathrm{Channel}_k = \sum_{j=1}^{5} a_{62(j-1)+k}. \quad (24)$$

Figure 9. The correspondence between the projection matrix and the weight of frequency bands (channels).
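The feature-weighting procedure can be sketched as follows, assuming the band-major feature layout described above (features $62(j-1)+1, \dots, 62j$ belong to the $j$-th band); the function names are ours.

```python
import numpy as np

def feature_weights(A):
    """Normalized l2-norm of each row of a projection matrix A (d x d_f):
    a_i = ||A_i,:||_2 / sum_j ||A_j,:||_2."""
    row_norms = np.linalg.norm(A, axis=1)
    return row_norms / row_norms.sum()

def band_channel_importance(a, n_bands=5, n_channels=62):
    """Band-major layout assumed: feature i belongs to band i // 62 and
    channel i % 62 (bands ordered Delta, Theta, Alpha, Beta, Gamma).
    Returns (band weights, channel weights)."""
    W = a.reshape(n_bands, n_channels)
    return W.sum(axis=1), W.sum(axis=0)
```

Since the weights are normalized to sum to one, band and channel importances are directly comparable across the source matrix $A$ and the target matrix $B$.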

Weights of channels (the channel order can be found from the SEED_IV website).
After obtaining the learned projection matrices A and B, the importance of the EEG frequency bands and channels in both the source and target domains can be computed. According to Equation (23), the weights of the frequency bands in both domains are shown in Figure 10. From this figure, we observe that in all cases, the frequency band importance ranking of the source domain is consistent with that of the target domain. Further, the Gamma band carries the greatest weight and is thus considered the most important frequency band in EEG emotion recognition, which coincides with some previous studies [36,37]. Additionally, we calculate the importance of all channels by Equation (24), and the top 10 important channels of both source and target domains are shown in Figure 11. We observe that the FP1, FPZ, CPZ, CZ, FP2, T7 and T8 channels are selected in all cases, meaning that these channels are important for cross-session emotion recognition. To be more intuitive, we transform the weights of all channels into brain topographical maps, where yellow denotes significant brain regions, as shown in Figure 12. From them, we find that the channels in the prefrontal, left/right temporal and central parietal lobes have larger weights. This finding is consistent with previous research [38,39]. Based on the above observations, we conclude that our proposed CPTML model can effectively perform emotion knowledge discovery on the critical EEG frequency bands and channels, which are more powerful in cross-session emotion expression.

Conclusions
In this paper, we proposed the CPTML model to simultaneously minimize the cross-session and cross-trial data discrepancies for EEG emotion recognition. Extensive experiments on the SEED_IV dataset demonstrated that (1) CPTML achieved better recognition performance than the other models by jointly taking into consideration the domain alignment and the graph-based metric learning; (2) in the subspaces induced by the coupled projection matrices of CPTML, the data distributions of the source and target domains were well aligned, and in the learned emotion metric graph, the connections between EEG samples from different trials but with the same emotional state were significantly enhanced; and (3) the Gamma frequency band and the brain regions of the prefrontal, left/right temporal and central parietal lobes were identified by CPTML as the more important ones in cross-session emotion expression.