Abstract
Background/Objectives: It has become a major direction of research in affective computing to explore the brain-information-processing mechanisms based on physiological signals such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS). However, existing research has mostly focused on feature- and decision-level fusion, with little investigation into the causal relationship between these two modalities. Methods: In this paper, we propose a novel emotion recognition framework based on simultaneously acquired EEG and fNIRS signals. This framework integrates the Granger causality (GC) method and a modality–frequency attention mechanism within a convolutional neural network backbone (MFA-CNN). First, we employed GC to quantify the causal relationships between the EEG and fNIRS signals, revealing emotion-processing mechanisms from the perspectives of neuro-electrical activity and hemodynamic interactions. Then, we designed a 1D2D-CNN framework that fuses temporal and spatial representations and introduced the MFA module to dynamically allocate weights across modalities and frequency bands. Results: Experimental results demonstrated that the proposed method outperforms strong baselines under both single-modal and multi-modal conditions, showing the effectiveness of causal features in emotion recognition. Conclusions: These findings indicate that combining GC-based cross-modal causal features with modality–frequency attention improves EEG–fNIRS-based emotion recognition and provides a more physiologically interpretable view of emotion-related brain activity.
1. Introduction
As global brain science research progresses, exploring the information-processing and cognitive mechanisms of the human brain through physiological signals in affective computing [1,2,3,4] has become increasingly prominent. Generally, physiological signals associated with neural activity fall into two categories [5,6]: one comprises electrophysiological signals, such as electroencephalography (EEG) and magnetoencephalography (MEG); the other comprises signals of metabolic changes, such as functional near-infrared spectroscopy (fNIRS) and functional magnetic resonance imaging (fMRI). In the field of emotion recognition, EEG and fNIRS are often favored over MEG and fMRI owing to their portability, cost-effectiveness, and ease of operation [7,8]. In particular, the simultaneous acquisition of EEG–fNIRS signals, which provides more comprehensive brain information than does single-modality data [9,10], has become a popular topic in emotion recognition research.
Research on neurovascular coupling (NVC) mechanisms posits [11,12] that neural activity in the brain is accompanied by electrical activity and corresponding hemodynamic changes. In response to external stimuli, neuronal activation triggers electrical activity while simultaneously increasing local cerebral blood flow to meet the increased oxygen demands. This leads to an elevation in the intravascular concentration of oxyhemoglobin (HbO2) and a reduction in deoxyhemoglobin (HbR) [13,14]. Therefore, EEG reflects neural electrical activity with high temporal resolution, while fNIRS reveals metabolic processes through changes in blood oxygenation, offering superior spatial localization capabilities compared to EEG. Together, EEG and fNIRS provide a complementary toolset for investigating the neural mechanisms of emotion-related brain activity because of their temporal and spatial complementarity.
However, the temporal and spatial differences between EEG and fNIRS also present challenges for joint analysis [15,16,17,18]. Existing research has predominantly addressed this challenge by feature-level or decision-level fusion [19,20,21,22], which makes it difficult to reveal the potential coupling mechanisms between the two signals. EEG and fNIRS respectively reflect neural electrical activity and blood oxygenation dynamics in response to the same stimulus, suggesting a potential causal relationship between them. Therefore, it is necessary to employ measures that can quantitatively characterize causal direction and strength. Granger causality (GC), originally proposed in economics [23,24], has been widely used in neuroscience to explore the causal connections of physiological signals such as EEG, fNIRS, and fMRI [25,26,27,28]. There remains a relative paucity of systematic studies on cross-modal causality between EEG and fNIRS. Thus, this paper applies the GC method to analyze in depth the causal coupling between EEG and fNIRS signals during emotion induction, which is crucial for revealing mechanisms of emotional processing in the human brain and optimizing emotion recognition performance.
Furthermore, the EEG features, fNIRS features, and the combinations of different frequency bands of EEG and fNIRS features contribute differently to affective representations. Traditional methods often consider all modalities and frequency bands equally, ignoring differences in informativeness and limiting overall performance. In recent years, attention mechanisms have demonstrated advantages in multimodal fusion due to their ability to dynamically allocate weights based on feature importance. Inspired by this, we introduce a modality–frequency attention (MFA) mechanism based on cross-modal causal features to adaptively highlight modality and frequency-band features that are more sensitive to emotional states, thereby further enhancing recognition accuracy and robustness.
In summary, although EEG or fNIRS alone can provide insights into emotion-related brain activity, the complexity of brain function means that single-modality approaches are inherently limited. Therefore, this paper focuses on synchronously acquired EEG–fNIRS signals, and the primary contributions include the following:
- We proposed a novel causal metric for EEG-fNIRS to quantify their causal relationship.
- We designed an MFA-CNN emotion recognition model that integrates a 1D–2D CNN framework with a modality–frequency attention mechanism, which assigns weights to EEG, fNIRS, and cross-modal features to enhance recognition performance.
- We conducted systematic experiments on the TYUT-3.0 EEG-fNIRS multimodal emotion dataset [29] to evaluate MFA-CNN, including unimodal and multimodal settings, ablation studies, and comparisons with alternative classifiers.
The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 introduces EEG-fNIRS data acquisition and preprocessing. Section 4 details the proposed MFA-CNN model. Section 5 presents the experimental setup, results, and analysis. Section 6 concludes the paper and outlines future research directions.
2. Related Works
In this section, we introduce the related works involved in this paper, including the GC method, EEG electrode arrangement strategy, and multimodal emotion recognition network.
2.1. The GC Method
The basic idea of the GC method is that if the prediction error of the future values of a time series X can be reduced by introducing the historical information of another time series Y, then the time series Y is assumed to have a causal relationship with the time series X. In recent years, researchers have explored the causal relationships between EEG, fNIRS, and combined EEG-fNIRS signals in numerous fields, including workload, emotion recognition, motor control, cognitive assessment, and rehabilitation [30,31,32,33,34,35]. In EEG research, Guo et al. [36] used the sparse group lasso-Granger method to construct an EEG causal brain network, which was then used to analyze the changes in human affective states. Bagherzadeh et al. [37] utilized a frequency-domain GC method to construct EEG brain networks and used them to study the connectivity and differences of the brain under different emotions. Pugh et al. [38] applied GC analysis to determine how an emotion is affected by a person’s cultural background and situation. In fNIRS research, Hu et al. [39] combined conditional GC analysis and the fNIRS neuroimaging technique to examine the differences in brain activation and effective connectivity between calculation, planning, and reasoning, and discovered that the performance of planning and reasoning was correlated with activation in the frontal and parietal cortices, respectively. Lee et al. [40] investigated cerebral cortex activation during active and passive movement by using fNIRS signals.
In EEG-fNIRS research, Al-Shargie et al. proposed an ROC-based decision fusion of EEG and fNIRS signals for improving the detection rate of mental stress [30]. Nour et al. designed a multiple-bandwidth method with an optimized CNN model to fuse the time–frequency features of EEG and fNIRS signals and to enhance the recognition performance of motor imagery [31]. Saadati et al. down-sampled the original EEG signals to match the sampling rate of the fNIRS signals and then input the simultaneous EEG-fNIRS signals into a CNN network, which increased the recognition performance of mental workload tasks by 20% and 7% compared with the single-modal EEG and fNIRS signals, respectively [32]. Sun et al. adopted a feature-stitching method to integrate the time–frequency features of EEG signals in different frequency bands and fNIRS signals, whose emotion recognition performance was improved by 8% to 18% compared with single-modal signals [33]. Liang et al. adopted coherence and Granger causality (GC) between the EEG and fNIRS signals as the degree of neurovascular coupling to explore the causal relationship between neurophysiology and hemodynamics in children and adults during general anesthesia [34]. In [35], Kamat et al. constructed the GC brain network of EEG-fNIRS signals to explore the differences in the flow of information between experts and novices during the performance of the fundamentals of laparoscopic surgery.
2.2. Multimodal Emotion Recognition Network
With the advancement of deep learning, various attention mechanisms have been gradually integrated into recognition models such as CNN and CRNN to further explore the time–frequency characteristics in EEG and fNIRS signals related to emotion. For example, Hamidi et al. [41] combined Transformers with graph convolutional networks (GCNs) to capture the spatio-temporal features of EEG signals. In [42], a multi-channel hybrid fusion network based on EEG-fNIRS signals was proposed, which employs an efficient attention mechanism module to integrate the spatial information of EEG and fNIRS signals. In [43], an early fusion strategy based on deep learning was proposed to combine the spatial weight coefficients obtained from the fNIRS with EEG features and to adaptively allocate the weights coefficients according to their learned importance.
Moreover, the attention mechanism has also been introduced in the frequency domain of EEG to highlight the differential contributions of frequency bands to emotional states. The frequency–spatial attention mechanism proposed in [44] simultaneously models weight distributions across both frequency and spatial domains. In [45], a multi-band graph feature fusion method was proposed to enhance classification performance by integrating cross-band information. In summary, attention-enhanced multimodal networks have demonstrated clear benefits for EEG-fNIRS emotion recognition, providing a solid basis for integrating GC-derived cross-modal causal features into end-to-end models.
3. Materials
3.1. Participants
Fifty healthy, right-handed undergraduate students (25 males and 25 females; age 22–26 years; mean 24.1 years) participated in the experiments. None had a history of mental or other brain-related illnesses. Before each experiment, the participants were briefed about the purposes and procedures of the experiments. All participants provided informed consent and received financial compensation for their involvement. The experimental protocol adhered to the Declaration of Helsinki for human experimentation and was approved by the Taiyuan University of Technology Research Ethics Committee.
3.2. Experimental Design
Figure 1 overviews the EEG-fNIRS acquisition pipeline for emotion induction, including stimulus selection, synchronized signal recording, and preprocessing.
Figure 1.
The framework of EEG-fNIRS signal acquisition.
3.2.1. Stimulus Selection
Compared with static images or speech, video clips elicit stronger affective responses. We first curated 120 video clips with diverse emotions from films and documentaries, and each clip lasted approximately 1–2 min. Independent volunteers then assessed each clip on a 5-point Likert scale for valence (pleasure), arousal, and dominance based on their felt experience. Based on these ratings, we compiled a final stimulus set comprising four emotion categories (happy, fear, sad, and calm), with each category including 15 video clips.
3.2.2. Acquisition Equipment
For the EEG-fNIRS acquisition, the EEG electrodes and fNIRS optodes were placed at non-overlapping scalp locations. EEG was recorded using Neuroscan SynAmps2 (Neuroscan USA, Ltd., Charlotte, NC, USA), with 62 electrodes positioned according to the international 10–20 system and sampled at 1000 Hz, covering the whole scalp. fNIRS was recorded using NirSmart (DanYang HuiChuang Medical Equipment Co., Ltd., Danyang, China), with 10 sources and 8 detectors, operating at 760/850 nm and 11 Hz. Given the key role of the frontal and temporal cortices in emotion processing, we placed 6 sources/4 detectors over the temporal region and 4 sources/4 detectors over the frontal region. The details of the EEG electrodes and fNIRS optodes are illustrated in Figure 2, where S denotes a source, D denotes a detector, and each S–D pair defines a measurement channel.
Figure 2.
The arrangement and layout of EEG electrodes and fNIRS optodes.
3.2.3. Data Collection
To ensure high data quality, experiments were conducted in a shielded laboratory to minimize electromagnetic and environmental noise. During the experiment, participants sat comfortably and received both visual and auditory stimuli: videos were presented on an LCD monitor at about a 1 m viewing distance, and audio was delivered via high-fidelity, head-mounted headphones. The protocol was programmed in E-prime 3, and each participant completed 60 trials.
The experimental procedure is shown in Figure 1b. It commenced with a “+” symbol displayed on the screen for 2 s to prompt the participant to focus on the upcoming emotional clip. Next, a selected emotional clip was randomly presented, and EEG and fNIRS signals were recorded synchronously. After each video clip, participants were given 30 s to assess their felt emotions in terms of pleasure, arousal, dominance, and emotion type. This step was essential for verifying whether the video clips effectively elicited the intended emotions.
3.3. Signal Pre-Processing
EEG preprocessing: EEG data were re-referenced, downsampled to 200 Hz, band-pass filtered (4–45 Hz), and baseline-corrected. Artifacts (e.g., ocular movements) were removed using independent component analysis (ICA). Time–frequency representations were obtained via the short-time Fourier transform (STFT), and band-specific features were extracted for the θ (4–8 Hz), α (8–12 Hz), β (12–30 Hz), and γ (30–45 Hz) bands.
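As a minimal sketch of the band-specific feature extraction (assuming the preprocessed EEG is a NumPy array of shape channels × samples at 200 Hz; the function name, window length, and averaging step are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.signal import stft

BANDS = {"theta": (4, 8), "alpha": (8, 12), "beta": (12, 30), "gamma": (30, 45)}

def eeg_band_power(eeg, fs=200, nperseg=200):
    """Return a dict mapping band name -> (channels, frames) mean STFT power."""
    freqs, _, Z = stft(eeg, fs=fs, nperseg=nperseg)   # Z: (channels, freqs, frames)
    power = np.abs(Z) ** 2
    return {name: power[:, (freqs >= lo) & (freqs < hi), :].mean(axis=1)
            for name, (lo, hi) in BANDS.items()}

# Usage: feats = eeg_band_power(np.random.randn(62, 1200))  # 62 channels, 6 s at 200 Hz
```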
fNIRS preprocessing: Major noise sources include cardiac pulsation (1–1.5 Hz) and respiration (0.2–0.4 Hz), whereas task-related hemodynamic components mainly lie within 0.005–0.21 Hz. Preprocessing comprised baseline correction, motion-artifact attenuation, and 0.01–0.2 Hz band-pass filtering. Changes in oxyhemoglobin (HbO2) and deoxyhemoglobin (HbR) concentrations were computed using the modified Beer–Lambert law. To temporally align with EEG, the HbO2/HbR time series were upsampled to 200 Hz via envelope-based interpolation.
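A hedged sketch of the fNIRS filtering and temporal-alignment steps (plain cubic interpolation stands in for the envelope-based scheme described above; shapes and parameter names are illustrative):

```python
import numpy as np
from scipy.signal import butter, filtfilt
from scipy.interpolate import interp1d

def preprocess_fnirs(hb, fs_in=11.0, fs_out=200.0, band=(0.01, 0.2), order=4):
    """Baseline-correct and band-pass HbO2/HbR (channels x samples), then resample to the EEG rate."""
    b, a = butter(order, [band[0] / (fs_in / 2), band[1] / (fs_in / 2)], btype="band")
    hb = hb - hb[:, :1]                      # crude baseline correction against the first sample
    hb = filtfilt(b, a, hb, axis=1)          # zero-phase 0.01-0.2 Hz band-pass
    t_in = np.arange(hb.shape[1]) / fs_in
    t_out = np.arange(0.0, t_in[-1], 1.0 / fs_out)
    return interp1d(t_in, hb, kind="cubic", axis=1)(t_out)  # upsampled to 200 Hz
```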
4. The Proposed MFA-CNN Method
Figure 3 illustrates the proposed MFA-CNN framework for emotion recognition. After preprocessing, three feature streams are extracted: (i) GC-based EEG connectivity matrices, (ii) fNIRS hemodynamic features (HbO2/HbR), and (iii) EEG–fNIRS cross-modal GC features. Each stream is fed into a modality-specific encoder—a 1D–CNN for the fNIRS time series and 2D-CNNs for the matrix-form inputs—to obtain deep representations. A modality–frequency attention (MFA) module then adaptively reweights and fuses the three streams, and a Softmax classifier outputs the final emotion results.
Figure 3.
The framework of the proposed MFA-CNN method, where the blue blocks indicate the fNIRS feature processing branch, the orange blocks indicate the EEG–fNIRS cross-modal GC feature processing branch, the green blocks indicate the GC-based EEG feature processing branch, and the purple block denotes the MFA module.
4.1. The GC Features of EEG Signal
According to Granger causality, for two time series $X$ and $Y$, if the prediction of $Y$ based on the joint past of $X$ and $Y$ is better than the prediction based on the past of $Y$ alone, then $X$ is said to Granger-cause $Y$. To illustrate, let $X_t$ and $Y_t$ denote two EEG time series. The univariate (restricted) AR($p$) model for $Y_t$ is
$$Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \varepsilon_t, \tag{1}$$
and the bivariate (full) VAR($p$) model for $Y_t$ is
$$Y_t = \sum_{i=1}^{p} b_i Y_{t-i} + \sum_{i=1}^{p} c_i X_{t-i} + \eta_t, \tag{2}$$
where $p$ is the model order, $a_i$, $b_i$, and $c_i$ are autoregressive coefficients, and $\varepsilon_t$ and $\eta_t$ denote the prediction errors of the restricted and full models, respectively.
Let $\operatorname{var}(\varepsilon_t)$ and $\operatorname{var}(\eta_t)$ denote the variances of the residuals in the restricted and full models, respectively. Then the GC effect from $X$ to $Y$ is defined as
$$F_{X \to Y} = \ln \frac{\operatorname{var}(\varepsilon_t)}{\operatorname{var}(\eta_t)}. \tag{3}$$
If $F_{X \to Y} > 0$, there exists a GC relationship between $X$ and $Y$, i.e., $X$ Granger-causes $Y$. The converse measure $F_{Y \to X}$ is defined analogously by interchanging the roles of $X$ and $Y$.
In summary, the causal features of EEG in the θ, α, β, and γ bands can be computed via Equations (1)–(3). Similarly, the causal features of the fNIRS signals can be computed.
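A minimal NumPy sketch of Equations (1)–(3), fitting the restricted and full AR models by ordinary least squares (names are illustrative; in practice a dedicated toolbox such as statsmodels can be used and the model order selected via AIC/BIC):

```python
import numpy as np

def granger_causality(x, y, p=5):
    """GC from x to y (Eq. (3)): log ratio of restricted to full residual variance."""
    n = len(y)
    # lagged regressors: y_{t-1..t-p} for the restricted model, plus x_{t-1..t-p} for the full model
    Y_lags = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
    X_lags = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    target = y[p:]
    res_r = target - Y_lags @ np.linalg.lstsq(Y_lags, target, rcond=None)[0]
    full = np.hstack([Y_lags, X_lags])
    res_f = target - full @ np.linalg.lstsq(full, target, rcond=None)[0]
    return np.log(np.var(res_r) / np.var(res_f))

# F_xy > 0 indicates that x Granger-causes y for the chosen order p.
```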
4.2. The Causal Feature of EEG-fNIRS
In the previous section, the fNIRS signals were upsampled to match the EEG sampling rate, ensuring temporal alignment between the two modalities. Following the GC method, let $E_t$ denote the EEG time series at channel $z$ and let $H_t$ denote the fNIRS blood-oxygenation time series (HbO2 or HbR) at optode $f$. To test whether fNIRS Granger-causes EEG, we compare a restricted AR model for $E_t$ with a full bivariate model that additionally includes past fNIRS terms:
$$E_t = \sum_{i=1}^{q} \phi_i E_{t-i} + u_t, \tag{4}$$
$$E_t = \sum_{i=1}^{q} \psi_i E_{t-i} + \sum_{i=1}^{q} \xi_i H_{t-i} + v_t, \tag{5}$$
where $q$ denotes the model order (e.g., selected via AIC/BIC), $\phi_i$, $\psi_i$, and $\xi_i$ are the autoregressive coefficients, and $u_t$ and $v_t$ are the prediction errors of the restricted model in (4) and the full model in (5), respectively.
Accordingly, the GC from fNIRS to EEG is defined as the log ratio of the residual variances:
$$F_{\mathrm{fNIRS} \to \mathrm{EEG}}(f, z) = \ln \frac{\operatorname{var}(u_t)}{\operatorname{var}(v_t)}. \tag{6}$$
A larger $F_{\mathrm{fNIRS} \to \mathrm{EEG}}(f, z)$ indicates that including past fNIRS samples improves the prediction of EEG; i.e., fNIRS Granger-causes EEG at the tested pair $(f, z)$. By repeating (4)–(6) for all fNIRS-EEG channel pairs, we obtain a channel-wise GC matrix. The reverse (EEG-to-fNIRS) matrix is computed analogously by swapping the roles of the EEG and fNIRS series, as shown in Figure 4.
Figure 4.
The GC matrix of fNIRS to EEG signal. Darker colors indicate lower GC values (weaker directed causal influence between EEG and fNIRS), whereas brighter colors indicate higher GC values (stronger causal coupling between the two modalities).
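Under the same assumptions, the channel-wise fNIRS-to-EEG GC matrix visualized in Figure 4 can be obtained by looping the pairwise sketch above over all channel pairs:

```python
import numpy as np

def gc_matrix(fnirs, eeg, p=5):
    """Entry (f, z) holds GC(fNIRS channel f -> EEG channel z); inputs are (channels, samples)."""
    F = np.zeros((fnirs.shape[0], eeg.shape[0]))
    for f in range(fnirs.shape[0]):
        for z in range(eeg.shape[0]):
            F[f, z] = granger_causality(fnirs[f], eeg[z], p)
    return F

# The reverse EEG-to-fNIRS matrix is gc_matrix(eeg, fnirs, p), with the roles swapped.
```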
4.3. The MFA-CNN Module
As described in Figure 3, the proposed MFA-CNN fuses the three feature streams, and each stream is first passed through a feature-specific CNN encoder composed of three convolutional layers (with 32/64/128 filters) followed by a fully connected (FC) layer with 128 units, outputting a 128-dimensional deep representation per stream. Pooling layers are intentionally omitted between adjacent convolutions to preserve the spatial structure that could be lost through repeated down-sampling. The three 128-dimensional vectors are then concatenated into a 384-dimensional fused vector, which is fed to a final FC layer and a Softmax classifier to produce the emotion label. Table 1 reports the detailed hyper-parameters of the MFA-CNN.
Table 1.
Parameters of the MFA-CNN model.
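Complementing Table 1, the following simplified PyTorch sketch illustrates the branch encoders and late fusion described above (kernel sizes, the final spatial collapse before the FC layer, and input channel counts are assumptions, since they are not fully specified here):

```python
import torch
import torch.nn as nn

class Branch2D(nn.Module):
    """2D-CNN encoder for matrix-form inputs (e.g., GC matrices); no pooling between convolutions."""
    def __init__(self, in_ch, feat_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # single collapse before the FC layer (an assumption)
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x):
        return self.fc(self.convs(x).flatten(1))

class Branch1D(nn.Module):
    """1D-CNN encoder for the fNIRS time series."""
    def __init__(self, in_ch, feat_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_ch, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, 5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x):
        return self.fc(self.convs(x).squeeze(-1))

class FusionHead(nn.Module):
    """Concatenate the three 128-d branch features and classify with a Softmax head."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.fc = nn.Linear(3 * 128, n_classes)

    def forward(self, f_eeg, f_fnirs, f_cross):
        return self.fc(torch.cat([f_eeg, f_fnirs, f_cross], dim=1))  # logits; Softmax in the loss
```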
Prior research [30] has demonstrated that EEG signals in different frequency bands differ in their ability to discriminate emotional states. Likewise, the GC-based cross-modal features between EEG bands and fNIRS exhibit varying discriminative power. To exploit this heterogeneity, we developed a modality–frequency attention (MFA) module that learns weights over modality–frequency combinations and rescales the corresponding deep features before fusion, as illustrated in Figure 5. In this way, features most informative of emotional state receive higher importance, improving both accuracy and robustness.
Figure 5.
The framework of MFA module.
In the MFA module, we first apply global average pooling (GAP) along the feature dimension of the input $X \in \mathbb{R}^{C \times D}$ (where $C$ denotes the number of modality–frequency channels and $D$ denotes the number of features per channel). The GAP step yields a channel descriptor $s \in \mathbb{R}^{C}$:
$$s_c = \frac{1}{D} \sum_{d=1}^{D} X_{c,d}, \quad c = 1, \ldots, C. \tag{7}$$
Next, the descriptor vector $s$ is partitioned into three groups: the EEG descriptors $s_{\mathrm{EEG}}$, the fNIRS descriptors $s_{\mathrm{fNIRS}}$, and the EEG-fNIRS cross-modal descriptors $s_{\mathrm{cross}}$. For each group, attention weights are produced by a two-layer fully connected network with ReLU followed by sigmoid, as given in Equations (8)–(10):
$$w_{\mathrm{EEG}} = \sigma\bigl(W_2\, \delta(W_1 s_{\mathrm{EEG}})\bigr), \tag{8}$$
$$w_{\mathrm{fNIRS}} = \sigma\bigl(W_2\, \delta(W_1 s_{\mathrm{fNIRS}})\bigr), \tag{9}$$
$$w_{\mathrm{cross}} = \sigma\bigl(W_2\, \delta(W_1 s_{\mathrm{cross}})\bigr), \tag{10}$$
where $W_1$ and $W_2$ are the weights of the two FC layers, and $\delta(\cdot)$ and $\sigma(\cdot)$ denote the ReLU and sigmoid activations, respectively.
The resulting weights are broadcast back to the corresponding rows of $X$ and applied by element-wise multiplication, producing reweighted features $\tilde{X}_{\mathrm{EEG}}$, $\tilde{X}_{\mathrm{fNIRS}}$, and $\tilde{X}_{\mathrm{cross}}$, which are then concatenated and forwarded to the classifier.
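The following PyTorch sketch illustrates the MFA reweighting of Equations (7)–(10); the per-group channel counts and the hidden size of the two-layer gate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MFA(nn.Module):
    """Modality-frequency attention: GAP per channel, per-group FC-ReLU-FC-sigmoid gates,
    then element-wise rescaling of the corresponding rows of the input."""
    def __init__(self, groups=None, hidden=8):
        super().__init__()
        # e.g., 4 EEG bands, 1 fNIRS stream, 4 EEG-fNIRS cross-modal combinations (assumed split)
        self.groups = groups or {"eeg": 4, "fnirs": 1, "cross": 4}
        self.gates = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(c, hidden), nn.ReLU(),
                                nn.Linear(hidden, c), nn.Sigmoid())
            for name, c in self.groups.items()
        })

    def forward(self, x):                       # x: (batch, C, D) modality-frequency features
        s = x.mean(dim=2)                       # Eq. (7): GAP over the feature dimension
        out, start = [], 0
        for name, c in self.groups.items():
            w = self.gates[name](s[:, start:start + c])           # Eqs. (8)-(10): group weights
            out.append(x[:, start:start + c] * w.unsqueeze(-1))   # broadcast over D
            start += c
        return torch.cat(out, dim=1)            # reweighted features, same shape as x
```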
4.4. The Proposed Weighted Multi-Loss Function (WML)
To better exploit informative features for emotion recognition, we adopt a weighted multi-loss scheme to jointly optimize the MFA-CNN. We define the global loss of the fused prediction as $L_{\mathrm{global}}$ and the local loss from each modality–frequency branch as $L_{\mathrm{local}}$. The total loss is
$$L = L_{\mathrm{global}} + \lambda L_{\mathrm{local}}, \tag{11}$$
where $\lambda$ balances the contributions of global and local supervision.
For the global term $L_{\mathrm{global}}$, we use the cross-entropy between the ground-truth one-hot label $y$ and the fused Softmax prediction $\hat{y}$:
$$L_{\mathrm{global}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c, \tag{12}$$
where $C$ denotes the number of emotion categories.
For the local loss $L_{\mathrm{local}}$, we aggregate three branch-specific cross-entropies:
$$L_{\mathrm{local}} = w_1 L_1 + w_2 L_2 + w_3 L_3, \tag{13}$$
where $L_1$, $L_2$, and $L_3$ are computed from the corresponding branch outputs, and $w_1$, $w_2$, and $w_3$ are the weights reflecting the recognition strength of each branch.
In practice, the weights are updated from the branch accuracies $\mathrm{acc}_k$ (e.g., running averages over training batches) and normalized to sum to one:
$$w_k = \frac{\mathrm{acc}_k}{\sum_{j=1}^{3} \mathrm{acc}_j}, \quad k = 1, 2, 3. \tag{14}$$
This weighting prioritizes branches with higher discriminative power while still allowing gradients from all branches to contribute to learning.
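A minimal PyTorch sketch of the WML scheme in Equations (11)–(14), assuming the fused logits and the three branch logits are available (the value of λ and the accuracy bookkeeping are illustrative):

```python
import torch
import torch.nn.functional as F

def weighted_multi_loss(fused_logits, branch_logits, target, branch_acc, lam=0.5):
    """Global cross-entropy plus accuracy-weighted branch cross-entropies (Eqs. (11)-(14))."""
    global_loss = F.cross_entropy(fused_logits, target)                   # Eq. (12)
    acc = torch.tensor(branch_acc, dtype=torch.float32, device=fused_logits.device)
    w = acc / acc.sum()                                                   # Eq. (14): normalize to 1
    local_loss = sum(w[k] * F.cross_entropy(branch_logits[k], target)     # Eq. (13)
                     for k in range(len(branch_logits)))
    return global_loss + lam * local_loss                                 # Eq. (11)

# Usage (shapes illustrative): loss = weighted_multi_loss(
#     fused, [z_eeg, z_fnirs, z_cross], labels, branch_acc=[0.90, 0.86, 0.70])
```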
5. Experimental Results and Analysis
5.1. Parameter Settings
In this paper, all experiments were conducted under a subject-dependent setting. Specifically, our experiments were run on Windows 10 using Python 3.9 and PyTorch 2.0, on an Intel Core i9-13900KF (3.0 GHz) CPU with an NVIDIA GeForce RTX 4090 GPU. For optimization, the Adam optimizer was employed with a learning rate of 0.005 and a batch size of 32, and a dropout rate of 0.3 was applied to mitigate overfitting.
For the original EEG and fNIRS signals, the raw streams were segmented with a 3 s sliding window and a 1.5 s overlap. In the subject-dependent setting, splits were performed clip-wise per participant to avoid leakage: for each emotion class, 12 clips were used for training and the remaining 3 clips were reserved for validation/test (applied before windowing so that no trial contributes windows to multiple splits). To evaluate emotion recognition, 5-fold cross-validation was implemented across all 50 participants, and the resulting average recognition accuracy and standard deviation are used as evaluation metrics.
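As an illustration of this protocol, the sketch below segments a recording into 3 s windows with 1.5 s overlap and draws a clip-wise split per emotion class (array shapes and the random seed are illustrative):

```python
import numpy as np

def segment(signal, fs=200, win_s=3.0, overlap_s=1.5):
    """Slice a (channels, samples) array into overlapping windows -> (n_windows, channels, win)."""
    win, step = int(win_s * fs), int((win_s - overlap_s) * fs)
    starts = range(0, signal.shape[1] - win + 1, step)
    return np.stack([signal[:, s:s + win] for s in starts])

def clipwise_split(n_clips=15, n_train=12, seed=0):
    """Split clip indices within one emotion class before windowing to avoid leakage."""
    order = np.random.default_rng(seed).permutation(n_clips)
    return order[:n_train], order[n_train:]   # training clips, held-out clips
```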
5.2. The Emotion Recognition Performance of Single Modal Features
In this section, we describe the evaluation of single features extracted from EEG and fNIRS using the same CNN classifier, and the results are summarized in Figure 6.
Figure 6.
The emotion recognition rate of single features.
(1) For fNIRS features, the average accuracies of the HbO2 and HbR features are 77.45% and 75.49%, respectively. The cascaded HbO2 + HbR feature lifts performance to 86.08%, achieving improvements of 8.63% and 10.59% over the single HbO2 and HbR features, respectively. These results indicate that the two hemoglobin components provide complementary information for emotion recognition.
(2) For the GC features of the EEG signal, the average emotion recognition accuracies for the θ, α, β, and γ frequency bands are 71.83%, 76.09%, 80.27%, and 84.04%, respectively. Furthermore, the combined features achieve significant improvements over single-band features. This improvement can be attributed to the fact that EEG signals in different frequency bands express distinct emotional information; exploiting the complementarity between them enhances emotion recognition performance.
(3) For the EEG-fNIRS causal features, the accuracies between the EEG signals in the four frequency bands and the fNIRS signals are 45.90%, 45.81%, 46.40%, and 67.29%, respectively. It is worth noting that one frequency band yields a markedly higher accuracy than the others. This observation suggests a stronger cross-modal coupling for that band and supports the presence of an informative causal relationship between EEG and fNIRS. These observations motivate leveraging causal features in the subsequent multimodal fusion.
5.3. Emotion Recognition Performance of Combination Features
Table 2 summarizes recognition accuracies for different combinations of EEG, fNIRS, and EEG-fNIRS features using the proposed 1D2D-CNN model.
Table 2.
Emotion recognition accuracy of different combination features (%).
The results demonstrate that each combined feature outperforms its corresponding single-modality baseline. Specifically, combining EEG with HbO2, HbR, and HbO2 + HbR yields accuracies of 94.65%, 93.66%, and 94.76%, i.e., improvements of 4.11%, 3.12%, and 4.22% over the single EEG feature, respectively. Similarly, the EEG + EEG-fNIRS combination achieves improvements of 2.89% and 23.12% over the single EEG and EEG-fNIRS features, respectively, and the HbO2 + HbR + EEG-fNIRS combination shows increases of 6% and 20.77% in accuracy over the single HbO2 + HbR and EEG-fNIRS features, respectively. Moreover, the three-way fusion achieves the highest emotion recognition accuracy of 94.86%, which is 4.32%, 8.78%, and 24.55% higher than the single EEG, HbO2 + HbR, and EEG–fNIRS features, respectively. These findings demonstrate strong cross-modal complementarity.
Notably, although the single EEG-fNIRS causal feature yields lower standalone accuracy, it effectively enhances the recognition performance when fused with EEG and fNIRS features. In addition, comparisons across single frequency-modality features indicate heterogeneous contributions to emotion recognition, providing an empirical basis for the proposed MFA-CNN.
5.4. Emotion Recognition Performance of the Proposed MFA-CNN Models
To assess the contributions of each component in the MFA-CNN model, we conducted ablation studies with the following variants:
Table 3 shows the emotion recognition results and training time of the different methods. Compared with the 1D2D-CNN, integrating the MFA or the WML module alone improves the combined EEG + fNIRS recognition accuracy by 0.96% and 0.61%, respectively. This indicates that the proposed MFA module better captures and utilizes the modality–frequency information, and the WML provides beneficial auxiliary supervision. In addition, the full MFA-CNN achieves the highest emotion recognition accuracy of 98.65%, outperforming the 1D2D-CNN, 1D2D-CNN(WML), 1D2D-CNN(MFA), and MFA-CNN (unweighted) by 1.89%, 1.03%, 1.38%, and 0.64%, respectively.
Table 3.
Ablation experimental results of the MFA-CNN model (%).
Table 3 also presents the model parameters and training times, showing that the computational overhead of the proposed variants is negligible. Training times are nearly identical across models (624 s for 1D2D-CNN and 632.6 s for MFA-CNN), indicating that performance gains are achieved without materially increasing runtime. Parameter counts show the same pattern: 1D2D-CNN and 1D2D-CNN(WML) each contain 128,972,816 parameters, while models incorporating the MFA module add just 72 parameters, a practically imperceptible increase. These results demonstrate that MFA-CNN improves recognition accuracy while maintaining comparable computational complexity and model size.
To quantify the contributions of different feature combinations, we visualized the attention weights learned by MFA-CNN. The average attention weights of the nine modality–frequency feature channels were 0.96, 0.94, 1.04, 1.01, 0.95, 0.93, 1.03, 0.89, and 0.90. The largest weights were assigned to two of the EEG band features and to a low-frequency EEG-fNIRS cross-modal feature, whereas the higher-frequency EEG-fNIRS cross-modal features received the lowest weights. This indicates that the model relies more on low-frequency cross-modal coupling and on discriminative higher-frequency EEG information, consistent with the slow time course of NVC and with established emotion-related EEG rhythms.
Moreover, a Friedman test across the five schemes was significant, with a moderate effect size (Kendall’s W). Post hoc Wilcoxon signed-rank tests with Holm correction showed that MFA-CNN significantly outperformed all other schemes, with large to very large effect sizes (Cliff’s δ).
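For reference, the reported comparison can be reproduced along the following lines with SciPy and statsmodels (the per-subject accuracy matrix shown here is a random placeholder):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

acc = np.random.rand(50, 5)          # placeholder: (subjects, schemes), last column = MFA-CNN

stat, p = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
kendalls_w = stat / (acc.shape[0] * (acc.shape[1] - 1))       # effect size for the Friedman test

raw_p = [wilcoxon(acc[:, -1], acc[:, j]).pvalue for j in range(acc.shape[1] - 1)]
reject, p_holm, _, _ = multipletests(raw_p, method="holm")    # Holm-corrected pairwise tests
```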
5.5. Performance Comparisons for Latest Schemes
In this section, we compare the proposed MFA-CNN with several state-of-the-art methods on the TYUT-3.0 emotion dataset. To ensure fairness, all methods used the same data division and five-fold cross-validation. Table 4 presents the comparison of recognition accuracy and standard deviation, with the proposed MFA-CNN achieving the highest overall accuracy of 96.85%.
Table 4.
Comparison of performance between different emotion recognition models (%). Entries in the last column are mean/SD.
(1) Compared with Cascade+SVM, Cascade+CNN, and Cascade+GCN, the MFA-CNN improves accuracy by 10.73%, 4.52%, and 5.68%, respectively. This is mainly because those baselines rely on feature stacking for fusion and do not explicitly model the relative importance of modalities or EEG frequency bands. In contrast, MFA-CNN introduces directed cross-modal information flow via GC features and applies MFA for adaptive reweighting, thereby substantially enhancing emotion recognition performance.
(2) Relative to GNN-fusion and weighted fusion+GCN, the MFA-CNN increases overall accuracy by 2.52% and 1.70%, respectively. Although graph models capture inter-channel and regional relationships, these are typically correlation or coherence measures and underrepresent directed neurovascular coupling. MFA-CNN embeds the GC relationship of EEG-fNIRS signals as a structural prior and performs explicit selection at both modality and frequency levels, which is more consistent with the physiological mechanisms of EEG-fNIRS.
(3) Compared with algebraic fusion models such as tensor fusion and p-order polynomial fusion, MFA-CNN improves accuracy by 3.94% and 3.11%, respectively. While algebraic fusion can capture cross-modal interactions, it is prone to dimensional explosion and noise amplification, making it sensitive to regularization and data scale. MFA-CNN uses a 1D–2D CNN architecture to encode temporal and spatial information and couples this with a WML method to balance supervision across branches, thereby avoiding these drawbacks.
(4) Compared with capsule-network variants such as MLF-CapsNet, ST-CapsNet, and MBA-CF-cCapsNet, MFA-CNN consistently achieves the best accuracies. Capsule networks are effective for hierarchical and pose-invariant representations, but they often lack physiological priors and explicit frequency-level constraints in multimodal settings. MFA-CNN integrates causal priors with modality–frequency attention, focusing discriminative power on emotion-sensitive bands and key modality combinations.
(5) Compared with the Transformer method, MFA-CNN achieves a 10.01% improvement. This is primarily because without explicit modality–frequency priors, Transformers struggle to learn stable cross-modal coupling and are more sensitive to hyperparameters and dataset size.
Furthermore, Figure 7 presents the confusion matrix of the MFA-CNN model, showcasing its strong classification accuracy for all four emotion types. Although there were a few misclassifications, they were negligible. Notably, the emotions sad and calm (1.64% and 1.14% confusion, respectively), along with happy and fear (1.67% and 0.94% confusion, respectively), were more susceptible to confusion. This might be attributed to sad and calm both being low-activation emotion types, while happy and fear are both high-activation types.
Figure 7.
The confusion matrix of the MFA–CNN.
5.6. The Subject-Independent Experiments
The results in the previous subsections demonstrate that MFA-CNN achieves strong emotion recognition performance under the subject-dependent setting. To further assess its generalization and provide a comprehensive view of performance across different application scenarios, we conducted subject-independent experiments on the TYUT-3.0 dataset using a leave-one-subject-out (LOSO) protocol. In each fold, data from 49 participants were used for training, and data from the held-out participant were used exclusively for testing.
Table 5 shows the recognition results of different fusion methods; the MFA-CNN achieves higher accuracy than the representative fusion baselines from the literature. As expected, due to the pronounced distribution shift between training and testing subjects in the LOSO setting, absolute accuracies are lower than those in the subject-dependent experiments. Nevertheless, MFA-CNN remains the top performer.
Table 5.
The subject-independent results of different fusion methods (%).
6. Conclusions
In this paper, we present MFA-CNN, a multimodal EEG-fNIRS emotion-recognition framework that couples a 1D–2D CNN backbone with an MFA module and WML strategy. First, we quantified the causal coupling between EEG and fNIRS via the GC method and verified that incorporating GC-derived cross-modal features enhances recognition performance. The network fuses EEG, fNIRS, and EEG-fNIRS causal features, while MFA adaptively assigns weights across modalities and frequency bands, and WML jointly optimizes global and branch-specific losses. Experiments on the TYUT-3.0 dataset show that MFA-CNN consistently outperforms representative EEG-fNIRS fusion methods in both accuracy and stability. Beyond performance, the GC analysis provides a quantitative perspective on neurovascular coupling during affective processing, offering evidence for the underlying neural mechanisms and indicating a promising applicability to diverse EEG-fNIRS-based brain–computer interface tasks. Practically, our findings can be leveraged for driver fatigue monitoring, workplace stress assessment, student mental-workload monitoring, and clinical rehabilitation scenarios such as emotion and pain evaluation.
Future work will pursue cross-subject emotion recognition by incorporating advanced transfer learning and domain adaptation techniques (e.g., subject-invariant representations and zero-shot adaptation). In parallel, guided by the MFA-identified modality–frequency combinations most sensitive to emotion, we will explore lightweighting of the emotion recognition model, deepen understanding of the brain’s emotion-processing mechanisms, and develop an efficient, low-latency online EEG–fNIRS BCI system.
Author Contributions
Conceptualization, J.Z. and A.W.; methodology, J.Z.; software, J.Z. and D.Z.; validation, J.Z., A.W. and S.L.; formal analysis, J.Z. and X.L.; writing—original draft preparation, J.Z.; writing—review and editing, J.Z., A.W., S.L., D.Z. and X.L.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Fundamental Research Program of Shanxi Province (no. 202403021212260), the Doctoral Scientific Research Starting Foundation of Taiyuan University of Science and Technology (no. 2023098), the Funding Awards for Outstanding Doctors Volunteering to Work in Shanxi Province (no. 20242011), and the Science and Technology Innovation Project of Colleges and Universities in Shanxi Province (no. 2024L259).
Institutional Review Board Statement
The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Taiyuan University of Technology (TYUT-20220309, approved on 9 March 2022).
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
The data presented in this study are available on request due to privacy and ethical reasons. To obtain access, please download and complete the license agreement at https://gitee.com/tycgj/enter (accessed on 1 September 2025). After completing the form, kindly email it to 2023025@tyust.edu.cn. A download link will be provided via email after your application is reviewed.
Conflicts of Interest
Author Li Xin was employed by the company The 2nd Research Institute of China Electronics Technology Group Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| EEG | electroencephalography |
| fNIRS | functional near-infrared spectroscopy |
| GC | Granger causality |
| CNN | convolutional neural network |
| MFA | modality–frequency attention |
| HbO2 | oxyhemoglobin |
| HbR | deoxyhemoglobin |
| GCN | graph convolutional network |
| WML | weighted multi-loss |
| SVM | support vector machine |
References
- Geng, Y.; Shi, S.; Hao, X. Deep learning-based EEG emotion recognition: A comprehensive review. Neural Comput. Appl. 2025, 37, 1919–1950. [Google Scholar] [CrossRef]
- Al-Hadithy, S.S.; Abdalkafor, A.S.; Al-Khateeb, B. Emotion recognition in EEG signals: Deep and machine learning approaches, challenges, and future directions. Comput. Biol. Med. 2025, 196, 110713. [Google Scholar] [CrossRef]
- Ma, F.; Yuan, Y.; Xie, Y.; Ren, H.; Liu, I.; He, Y.; Ren, F.; Yu, F.; Ni, S. Generative technology for human emotion recognition: A scoping review. Inf. Fusion 2025, 115, 102753. [Google Scholar] [CrossRef]
- Gkintoni, E.; Aroutzidis, A.; Antonopoulou, H.; Halkiopoulo, C. From neural networks to emotional networks: A systematic review of EEG-based emotion recognition in cognitive neuroscience and real-world applications. Brain Sci. 2025, 15, 220. [Google Scholar] [CrossRef] [PubMed]
- Liu, Z.M.; Shore, J.; Wang, M.; Yuan, F.P.; Buss, A.; Zhao, X.P. A systematic review on hybrid EEG/fNIRS in brain-computer interface. Biomed. Signal Process. Control 2021, 68, 102595. [Google Scholar] [CrossRef]
- Suhaimi, N.S.; Mountstephens, J.; Teo, J. EEG-based emotion recognition: A state-of-the-art review of current trends and opportunities. Comput. Intell. Neurosci. 2020, 2020, 8875426. [Google Scholar] [CrossRef]
- Sun, X.; Dai, C.; Wu, X.; Han, T.; Li, Q.; Lu, Y.; Liu, X.; Yuan, H. Current implications of EEG and fNIRS as functional neuroimaging techniques for motor recovery after stroke. Med. Rev. 2024, 4, 492–509. [Google Scholar] [CrossRef] [PubMed]
- Si, X.; Huang, H.; Yu, J.; Ming, D. EEG microstates and fNIRS metrics reveal the spatiotemporal joint neural processing features of human emotions. IEEE Trans. Affect. Comput. 2024, 15, 2128–2138. [Google Scholar] [CrossRef]
- Wang, R.; Zhuang, P. A strategy for network multi-layer information fusion based on multimodel in user emotional polarity analysis. Int. J. Cogn. Comput. Eng. 2025, 6, 120–130. [Google Scholar] [CrossRef]
- Chen, M.S.; Cai, Q.; Omari, D.; Sanghvi, D.; Lyu, S.; Bonanno, G. Emotion regulation and mental health across cultures: A systematic review and meta-analysis. Nat. Hum. Behav. 2025, 9, 1176–1200. [Google Scholar] [CrossRef] [PubMed]
- Mao, H.; Meng, H.; Tan, Q.; Samuel, O.; Li, G.; Fang, P. Evaluation of neurovascular coupling behaviors for FES-induced wrist movements based on synchronized EEG-fNIRS signals. IEEE Trans. Neural Syst. Rehabil. Eng. 2025, 33, 2622–2630. [Google Scholar] [CrossRef] [PubMed]
- Zhong, J.; Li, G.; Lv, Z.; Chen, J.; Wang, C.; Shao, A.; Gong, Z.; Wang, J.; Liu, S.; Luo, J.; et al. Neuromodulation of cerebral blood flow: A physiological mechanism and methodological review of neurovascular coupling. Bioengineering 2025, 12, 442. [Google Scholar] [CrossRef]
- Lin, J.; Lu, J.; Shu, Z.; Han, J.; Yu, N. Subject-specific modeling of EEG-fNIRS neurovascular coupling by task-related tensor decomposition. IEEE Trans. Neural Syst. Rehabil. Eng. 2024, 32, 452–461. [Google Scholar] [CrossRef]
- Burma, J.; Oni, I.; Lapointe, A.; Rattana, S.; Schneider, K.; Debert, C.; Smirl, J.; Dunn, J. Quantifying neurovascular coupling through a concurrent assessment of arterial, capillary, and neuronal activation in humans: A multimodal EEG-fNIRS–TCD investigation. NeuroImage 2024, 302, 120910. [Google Scholar] [CrossRef]
- Chen, H.; Chan, I.Y.S.; Dong, Z.; Samuel, T. Unraveling trust in collaborative human–machine intelligence from a neurophysiological perspective: A review of EEG and fNIRS features. Adv. Eng. Inform. 2025, 67, 103555. [Google Scholar] [CrossRef]
- Llamas-Ramos, R.; Alvarado-Omenat, J.J.; Llamas-Ramos, I. Early EEG and NIRS measurements in preterm babies: A systematic review. Eur. J. Pediatr. 2024, 183, 4169–4178. [Google Scholar] [CrossRef]
- Suzen, E.; Yardibi, F.; Ozkan, O.; Ozkan, O.; Colak, O.; Ozen, S. The role of fNIRS in assessing motor function: A bibliometric review on extremity applications. Medicine 2025, 104, e43707. [Google Scholar] [CrossRef] [PubMed]
- Lee, H.T.; Shim, M.; Liu, X.; Cheon, H.; Kim, S.; Han, C.; Hwang, H. A review of hybrid EEG-based multimodal human–computer interfaces using deep learning: Applications, advances, and challenges. Biomed. Eng. Lett. 2025, 15, 587–618. [Google Scholar] [CrossRef]
- Bunterngchit, C.; Wang, J.; Su, J.; Wang, Y.; Liu, S.; Hou, Z. Temporal attention fusion network with custom loss function for EEG-fNIRS classification. J. Neural Eng. 2024, 21, 066016. [Google Scholar] [CrossRef] [PubMed]
- Qiu, L.; Feng, W.; Ying, Z.; Pan, J. EFMLNet: Fusion model based on end-to-end mutual information learning for hybrid EEG-fNIRS brain–computer interface applications. Proc. Annu. Meet. Cogn. Sci. Soc. 2024, 46, 5752–5758. [Google Scholar]
- Liu, M.; Yang, B.; Meng, L.; Zhang, Y.; Gao, S.; Zan, P.; Xia, X. STA-Net: Spatial–temporal alignment network for hybrid EEG-fNIRS decoding. Inf. Fusion 2025, 119, 103023. [Google Scholar]
- Nia, A.; Tang, V.; Talou, G.; Billinghurst, M. Decoding emotions through personalized multi-modal fNIRS-EEG systems: Exploring deterministic fusion techniques. Biomed. Signal Process. Control 2025, 105, 107632. [Google Scholar] [CrossRef]
- Ghouse, A.; Faes, L.; Valenza, G. Inferring directionality of coupled dynamical systems using Gaussian process priors: Application on neurovascular systems. Phys. Rev. E 2021, 104, 064208. [Google Scholar] [CrossRef]
- Granger, C.W.J. Investigating causal relations by econometric models and cross-spectral methods. Econom. J. Econom. Soc. 1969, 37, 424–438. [Google Scholar] [CrossRef]
- Zhang, X.; Li, Y.; Zhang, P.; Wang, D.; Yao, D.; Xu, P. Central-peripheral nervous system activation in exoskeleton modes: A Granger causality analysis via EEG–EMG fusion. Expert Syst. Appl. 2025, 268, 126311. [Google Scholar] [CrossRef]
- Wismüller, A.; Vosoughi, A.; Kasturi, A. Large-scale nonlinear Granger causality (lsNGC) analysis of functional MRI data for schizophrenia classification. In Proceedings of the Medical Imaging 2025: Computer-Aided Diagnosis (SPIE), San Diego, CA, USA, 16–21 February 2025; Volume 13407, pp. 300–307. [Google Scholar]
- Chen, J.; Zhou, G.; Han, J.; Su, P.; Zhang, H.; Tang, D. The effect of perceived groove in music on effective brain connectivity during cycling: An fNIRS study. Med. Sci. Sport. Exerc. 2024, 57, 857–866. [Google Scholar]
- Lin, R.; Dong, C.; Zhou, P.; Ma, P.; Ma, S.; Chen, X.; Liu, H. Motor imagery EEG task recognition using a nonlinear Granger causality feature extraction and an improved Salp swarm feature selection. Biomed. Signal Process. Control 2024, 88, 105626. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, X.; Chen, G.; Huang, L.; Sun, Y. EEG-fNIRS emotion recognition based on multi-brain attention mechanism capsule fusion network. J. Zhejiang Univ. (Eng. Sci.) 2024, 58, 2247–2257. [Google Scholar]
- Al-Shargie, F.; Tang, T.B.; Kiguchi, M. Stress assessment based on decision fusion of EEG and fNIRS signals. IEEE Access 2017, 5, 19889–19896. [Google Scholar] [CrossRef]
- Nour, M.; Öztürk, Ş.; Polat, K. A novel classification framework using multiple bandwidth method with optimized CNN for brain–computer interfaces with EEG-fNIRS signals. Neural Comput. Appl. 2021, 33, 15815–15829. [Google Scholar]
- Saadati, M.; Nelson, J.; Ayaz, H. Convolutional neural network for hybrid fNIRS-EEG mental workload classification. In Proceedings of the AHFE 2019 International Conference on Neuroergonomics and Cognitive Engineering, and the AHFE International Conference on Industrial Cognitive Ergonomics and Engineering Psychology, Washington, DC, USA, 24–28 July 2019; Springer: Cham, Switzerland, 2019; pp. 221–232. [Google Scholar]
- Sun, Y.J.; Ayaz, H.; Akansu, A.N. Multimodal affective state assessment using fNIRS+EEG and spontaneous facial expression. Brain Sci. 2020, 10, 85. [Google Scholar] [CrossRef]
- Liang, Z.; Wang, X.; Yu, Z.; Tong, Y.; Li, X.; Ma, Y.; Guo, H. Age-dependent neurovascular coupling characteristics in children and adults during general anesthesia. Biomed. Opt. Express 2023, 14, 2240–2259. [Google Scholar] [CrossRef] [PubMed]
- Kamat, A.; Norfleet, J.; Intes, X.; Dutta, A.; De, S. Perception–action cycle-related brain networks differentiate experts and novices: A combined EEG, fNIRS study during a complex surgical motor task. In Proceedings of the Clinical and Translational Neurophotonics 2022, SPIE BiOS, San Francisco, CA, USA, 22 January–28 February 2022; Volume 11945, pp. 47–55. [Google Scholar]
- Guo, J.L.; Fang, F.; Wang, W.; Ren, F.J. EEG emotion recognition based on Granger causality and CapsNet neural network. In Proceedings of the 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), Nanjing, China, 23–25 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 47–52. [Google Scholar]
- Bagherzadeh, S.; Maghooli, K.; Shalbaf, A.; Maghsoudi, A. Emotion recognition using effective connectivity and pre-trained convolutional neural networks in EEG signals. Cogn. Neurodyn. 2022, 16, 1087–1106. [Google Scholar] [CrossRef]
- Pugh, Z.H.; Choo, S.; Leshin, J.C.; Lindquist, K.A.; Nam, C.S. Emotion depends on context, culture and their interaction: Evidence from effective connectivity. Soc. Cogn. Affect. Neurosci. 2022, 17, 206–217. [Google Scholar] [CrossRef] [PubMed]
- Hu, Z.S.; Lam, K.F.; Xiang, Y.T.; Yuan, Z. Causal cortical network for arithmetic problem-solving represents brain’s planning rather than reasoning. Int. J. Biol. Sci. 2019, 15, 1148–1160. [Google Scholar] [CrossRef]
- Lee, S.H.; Jin, S.H.; An, J. Distinction of directional coupling in sensorimotor networks between active and passive finger movements using fNIRS. Biomed. Opt. Express 2018, 9, 2859–2870. [Google Scholar] [CrossRef]
- Hamidi, A.; Kiani, K. Motor imagery EEG signals classification using a Transformer-GCN approach. Appl. Soft Comput. 2025, 170, 112686. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, X.; Ming, D. Early-stage fusion of EEG and fNIRS improves classification of motor imagery. Front. Neurosci. 2023, 16, 1062889. [Google Scholar] [CrossRef] [PubMed]
- Kwak, Y.; Song, W.J.; Kim, S.E. FGANet: FNIRS-guided attention network for hybrid EEG-fNIRS brain-computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 30, 329–339. [Google Scholar] [CrossRef]
- Zhang, J.; Zhang, X.; Chen, G.; Huang, L.; Sun, Y. EEG emotion recognition based on cross-frequency Granger causality feature extraction and fusion in the left and right hemispheres. Front. Neurosci. 2022, 16, 974673. [Google Scholar] [CrossRef]
- Sun, Z.; Huang, Z.; Duan, F.; Liu, Y. A novel multimodal approach for hybrid brain–computer interface. IEEE Access 2020, 8, 89909–89918. [Google Scholar] [CrossRef]
- Hou, M.; Zhang, X.; Chen, G.; Huang, L.; Sun, Y. Emotion recognition based on an EEG–fNIRS hybrid brain network in the source space. Brain Sci. 2024, 14, 1166. [Google Scholar] [CrossRef]
- Wei, Y.; Liu, Y.; Li, C.; Cheng, J.; Song, R.; Chen, X. TC-Net: A Transformer capsule network for EEG-based emotion recognition. Comput. Biol. Med. 2023, 152, 106463. [Google Scholar] [CrossRef]
- Liu, Y.; Ding, Y.; Li, C.; Cheng, J.; Song, R.; Wan, F.; Chen, X. Multi-channel EEG-based emotion recognition via a multi-level features guided capsule network. Comput. Biol. Med. 2020, 123, 103927. [Google Scholar] [CrossRef]
- Wang, Z.; Chen, C.; Li, J.; Wan, F.; Sun, Y.; Wang, H. ST-CapsNet: Linking spatial and temporal attention with capsule network for P300 detection improvement. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 991–1000. [Google Scholar] [CrossRef] [PubMed]
- Zhao, W.; Jiang, X.; Zhang, B.; Xiao, S.; Weng, S. CTNet: A convolutional transformer network for EEG-based motor imagery classification. Sci. Rep. 2024, 14, 20237. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).