ICTD: Combination of Improved CNN–Transformer and Enhanced Deep Canonical Correlation Analysis for Eye-Movement Emotion Classification
Highlights
- This paper proposes a deep canonical correlation analysis method based on cosine similarity, which non-linearly transforms the feature vectors of different modalities into more strongly correlated representations to improve the accuracy of emotion classification (a minimal code sketch follows this list).
- This paper proposes an incremental feature feedforward network (IFFN) that both enhances and simplifies feature transformations, replacing the feedforward network (FFN) of the original transformer module.
- Cosine similarity attends to the direction of vectors rather than their magnitude, is less affected by outliers, and does not require the data to satisfy specific distributional assumptions. These characteristics make it well suited to eye-movement input data and provide an appropriate processing method for eye-movement-based emotion classification.
- Existing studies mostly rely on the statistical characteristics of the raw data for emotion analysis and treat every emotional feature as equally important, so the role of key features cannot be highlighted. By assigning higher weights to key features, the proposed design reduces the influence of indiscriminately input features, addresses this lack of prioritization, and significantly improves the ability of eye-movement features to characterize emotional states.
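As a concrete illustration of the first highlight, the sketch below aligns two modality-specific encoders by maximizing the cosine similarity of their projected features. It is a minimal sketch, assuming simple MLP encoders and a `1 − mean cosine` loss; the class name, layer sizes, and latent dimension are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineDCCA(nn.Module):
    """Two modality-specific encoders whose latent outputs are aligned by
    maximizing cosine similarity (a stand-in for the proposed cos-DCCA loss)."""
    def __init__(self, dim_a: int, dim_b: int, latent_dim: int = 64):
        super().__init__()
        self.encoder_a = nn.Sequential(nn.Linear(dim_a, 128), nn.ReLU(),
                                       nn.Linear(128, latent_dim))
        self.encoder_b = nn.Sequential(nn.Linear(dim_b, 128), nn.ReLU(),
                                       nn.Linear(128, latent_dim))

    def forward(self, xa: torch.Tensor, xb: torch.Tensor):
        za, zb = self.encoder_a(xa), self.encoder_b(xb)   # (batch, latent_dim)
        cos = F.cosine_similarity(za, zb, dim=1)          # direction, not magnitude
        corr_loss = 1.0 - cos.mean()                      # minimize -> stronger correlation
        return za, zb, corr_loss

# Toy usage with two 33-dimensional feature views
model = CosineDCCA(dim_a=33, dim_b=33)
_, _, loss = model(torch.randn(8, 33), torch.randn(8, 33))
loss.backward()
```

Because the loss depends only on vector direction, an outlying magnitude in either view cannot dominate the objective, which is the property the highlights emphasize.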
Abstract
1. Introduction
2. Related Work
2.1. Eye-Movement Features for Emotion Recognition
2.2. Ocular Response-Based Emotional Computing for Differentiated Individuals
2.3. Multimodal Emotion Analysis Model with Eye-Tracking Data
3. Proposed Method
3.1. Task Definition
3.2. The Structure of the Proposed Method
3.2.1. CNN–Transformer Module
3.2.2. Incremental Feature Feedforward Network
| Algorithm 1: Incremental Feature Feedforward Network. |
| Input: the input features, reconstructed by the autoencoder's encoding–decoding |
| Output: the enhanced output features |
| Step 1: Partial Convolution Projection |
| Step 2: Channel Splitting |
| Step 3: Cross-Channel Interaction |
| Step 4: Final Projection |
| Step 5: Return the enhanced output features |
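A hedged PyTorch sketch of Algorithm 1's five steps follows. Since the algorithm above only names the steps, the partial-channel ratio, kernel size, expansion factor, and residual return are all assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class IFFN(nn.Module):
    """Sketch of the incremental feature feedforward network following the
    five steps of Algorithm 1; layer sizes and the exact interaction rule
    are assumptions, not taken from the paper."""
    def __init__(self, channels: int, partial_ratio: float = 0.25):
        super().__init__()
        self.pc = int(channels * partial_ratio)     # channels touched by partial conv
        # Step 1: partial convolution projection (only the first pc channels convolved)
        self.partial_conv = nn.Conv1d(self.pc, self.pc, kernel_size=3, padding=1)
        # Step 3: cross-channel interaction via a pointwise expansion
        self.interact = nn.Sequential(nn.Conv1d(channels, 2 * channels, kernel_size=1),
                                      nn.GELU())
        # Step 4: final projection back to the input width
        self.project = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, length)
        x1, x2 = x[:, :self.pc], x[:, self.pc:]          # Step 2: channel splitting
        x1 = self.partial_conv(x1)                        # Step 1 applied to the split
        y = torch.cat([x1, x2], dim=1)
        y = self.interact(y)                              # Step 3: cross-channel mixing
        y = self.project(y)                               # Step 4: final projection
        return x + y                                      # Step 5: residual return

# Toy usage on a (batch, channels, sequence) feature map
out = IFFN(channels=64)(torch.randn(2, 64, 16))
```

Convolving only a fraction of the channels (Step 1) keeps the FLOP count low, in the spirit of partial convolution [37], while the pointwise layers in Steps 3 and 4 restore cross-channel mixing.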
3.2.3. Deep Canonical Correlation Analysis Model Based on Cosine Similarity
3.2.4. Loss Optimization Module
4. Experimental Results and Analysis
4.1. Experimental Procedure
4.1.1. Experimental Design
4.1.2. Experimental Environment
4.1.3. Experiment Parameter Configuration
4.1.4. Experiment Dataset
SEED-V Dataset
SEED-IV Dataset
eSEE-d Dataset
4.1.5. Data Preprocessing and Feature Extraction
4.2. Experimental Results and Comparison
4.2.1. Evaluation of ICTD’s Effectiveness on the eSEE-d Dataset
4.2.2. Evaluation of ICTD’s Effectiveness on the SEED-IV and SEED-V Datasets
4.2.3. Ablation Experiment
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| IFFN | Incremental feature feedforward network |
| LA | Low arousal |
| MA | Medium arousal |
| HA | High arousal |
| NV | Negative valence |
| MV | Medium valence |
| PV | Positive valence |
| LSTM | Long short-term memory |
| KNN | k-nearest neighbors |
| SVM | Support vector machine |
| RNN | Recurrent neural network |
| DGCNN | Deep gradient convolutional neural network |
| DMLP | Deep multi-layer perceptron |
| STFT | Short-time Fourier transform |
| DCCA | Deep canonical correlation analysis |
| ANOVA | Analysis of variance |
| ICTD | Improved CNN–transformer combined with enhanced deep canonical correlation analysis network |
| SEED | The SJTU emotion EEG dataset |
| SEED-FRA | SEED subset containing data from French subjects |
| SEED-IV | SEED variant with four emotion categories |
| SEED-V | SEED variant with five emotion categories |
| fNIRS | Functional near-infrared spectroscopy |
References
- Salovey, P.; Mayer, J.D. Emotional intelligence. Imagin. Cogn. Personal. 1990, 9, 185–211.
- Picard, R.W. Affective Computing; MIT Press: Cambridge, MA, USA, 2000.
- Mühl, C.; Allison, B.; Nijholt, A.; Chanel, G. A survey of affective brain computer interfaces: Principles, state-of-the-art, and challenges. Brain-Comput. Interfaces 2014, 1, 66–84.
- Wu, D.; Lu, B.L.; Hu, B.; Zeng, Z. Affective brain–computer interfaces (aBCIs): A tutorial. Proc. IEEE 2023, 111, 1314–1332.
- Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80.
- Shanechi, M.M. Brain–machine interfaces from motor to mood. Nat. Neurosci. 2019, 22, 1554–1564.
- Widge, A.S.; Malone, D.A., Jr.; Dougherty, D.D. Closing the loop on deep brain stimulation for treatment-resistant depression. Front. Neurosci. 2018, 12, 175.
- Liu, Y.H.; Wu, C.T.; Cheng, W.T.; Hsiao, Y.T.; Chen, P.M.; Teng, J.T. Emotion recognition from single-trial EEG based on kernel Fisher’s emotion pattern and imbalanced quasiconformal kernel support vector machine. Sensors 2014, 14, 13361–13388.
- Alonso-Martin, F.; Malfaz, M.; Sequeira, J.; Gorostiza, J.F.; Salichs, M.A. A multimodal emotion detection system during human–robot interaction. Sensors 2013, 13, 15549–15581.
- Peng, Y.; Lu, B.L. Discriminative extreme learning machine with supervised sparsity preserving for image classification. Neurocomputing 2017, 261, 242–252.
- Si, X.; Huang, H.; Yu, J.; Ming, D. EEG microstates and fNIRS metrics reveal the spatiotemporal joint neural processing features of human emotions. IEEE Trans. Affect. Comput. 2024, 15, 2128–2138.
- Liu, X.; Hu, B.; Si, Y.; Wang, Q. The role of eye movement signals in non-invasive brain-computer interface typing system. Med. Biol. Eng. Comput. 2024, 62, 1981–1990.
- Startsev, M.; Zemblys, R. Evaluating eye movement event detection: A review of the state of the art. Behav. Res. Methods 2023, 55, 1653–1714.
- Bradley, M.M.; Codispoti, M.; Cuthbert, B.N.; Lang, P.J. Emotion and motivation I: Defensive and appetitive reactions in picture processing. Emotion 2001, 1, 276.
- Zheng, W.L.; Zhu, J.Y.; Peng, Y.; Lu, B.L. EEG-based emotion classification using deep belief networks. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo (ICME); IEEE: New York, NY, USA, 2014; pp. 1–6.
- Holmqvist, K.; Nyström, M.; Andersson, R.; Dewhurst, R.; Jarodzka, H.; Van de Weijer, J. Eye Tracking: A Comprehensive Guide to Methods and Measures; Oxford University Press: Oxford, UK, 2011.
- Worthy, D.A.; Lahey, J.N.; Priestley, S.L.; Palma, M.A. An examination of the effects of eye-tracking on behavior in psychology experiments. Behav. Res. Methods 2024, 56, 6812–6825.
- Ionescu, A.; Ștefănescu, E.; Strilciuc, Ș.; Rafila, A.; Mureșanu, D. Correlating eye-tracking fixation metrics and neuropsychological assessment after ischemic stroke. Medicina 2023, 59, 1361.
- Ibragimov, B.; Mello-Thoms, C. The use of machine learning in eye tracking studies in medical imaging: A review. IEEE J. Biomed. Health Inform. 2024, 28, 3597–3612.
- Lim, J.Z.; Mountstephens, J.; Teo, J. Emotion recognition using eye-tracking: Taxonomy, review and current challenges. Sensors 2020, 20, 2384.
- Oliva, M.; Anikin, A. Pupil dilation reflects the time course of emotion recognition in human vocalizations. Sci. Rep. 2018, 8, 4871.
- Gilzenrat, M.S.; Nieuwenhuis, S.; Jepma, M.; Cohen, J.D. Pupil diameter tracks changes in control state predicted by the adaptive gain theory of locus coeruleus function. Cogn. Affect. Behav. Neurosci. 2010, 10, 252–269.
- Aracena, C.; Basterrech, S.; Snáel, V.; Velásquez, J. Neural networks for emotion recognition based on eye tracking data. In Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics; IEEE: New York, NY, USA, 2015; pp. 2632–2637.
- Cheng, B.; Titterington, D.M. Neural networks: A review from a statistical perspective. Stat. Sci. 1994, 9, 2–30.
- Palm, R.B. Prediction as a Candidate for Learning Deep Hierarchical Models of Data. Master’s Thesis, Technical University of Denmark, Kongens Lyngby, Denmark, 2012.
- Skaramagkas, V.; Ktistakis, E.; Manousos, D.; Kazantzaki, E.; Tachos, N.S.; Tripoliti, E.; Fotiadis, D.I.; Tsiknakis, M. eSEE-d: Emotional state estimation based on eye-tracking dataset. Brain Sci. 2023, 13, 589.
- Fu, B.; Gu, C.; Fu, M.; Xia, Y.; Liu, Y. A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals. Front. Neurosci. 2023, 17, 1234162.
- Li, Y.; Deng, J.; Wu, Q.; Wang, Y. Eye-tracking signals based affective classification employing deep gradient convolutional neural networks. Int. J. Interact. Multimedia Artif. Intell. 2021, 7, 34–43.
- Tarnowski, P.; Kołodziej, M.; Majkowski, A.; Rak, R.J. Eye-tracking analysis for emotion recognition. Comput. Intell. Neurosci. 2020, 2020, 2909267.
- Wang, Y.; Lv, Z.; Zheng, Y. Automatic emotion perception using eye movement information for E-healthcare systems. Sensors 2018, 18, 2826.
- Gong, X.; Chen, C.P.; Hu, B.; Zhang, T. CiABL: Completeness-induced adaptative broad learning for cross-subject emotion recognition with EEG and eye movement signals. IEEE Trans. Affect. Comput. 2024, 15, 1970–1984.
- Jiménez-Guarneros, M.; Fuentes-Pineda, G.; Grande-Barreto, J. Multimodal semi-supervised domain adaptation using cross-modal learning and joint distribution alignment for cross-subject emotion recognition. IEEE Trans. Instrum. Meas. 2023, 72, 2523911.
- Liu, W.; Zheng, W.L.; Li, Z.; Wu, S.Y.; Gan, L.; Lu, B.L. Identifying similarities and differences in emotion recognition with EEG and eye movements among Chinese, German, and French people. J. Neural Eng. 2022, 19, 026012.
- Skaramagkas, V.; Ktistakis, E.; Manousos, D.; Tachos, N.S.; Kazantzaki, E.; Tripoliti, E.E.; Fotiadis, D.I.; Tsiknakis, M. A machine learning approach to predict emotional arousal and valence from gaze extracted features. In Proceedings of the 2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE); IEEE: New York, NY, USA, 2021; pp. 1–5.
- Zheng, W.L.; Liu, W.; Lu, Y.; Lu, B.L.; Cichocki, A. EmotionMeter: A multimodal framework for recognizing human emotions. IEEE Trans. Cybern. 2018, 49, 1110–1122.
- Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Van Gool, L. LocalViT: Bringing locality to vision transformers. arXiv 2021, arXiv:2104.05707.
- Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 12021–12031.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Qiu, J.L.; Liu, W.; Lu, B.L. Multi-view emotion recognition using deep canonical correlation analysis. In Proceedings of the International Conference on Neural Information Processing; Springer International Publishing: Cham, Switzerland, 2018; pp. 221–231.
- Li, T.H.; Liu, W.; Zheng, W.L.; Lu, B.L. Classification of five emotions from EEG and eye movement signals: Discrimination ability and stability over time. In Proceedings of the 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER); IEEE: New York, NY, USA, 2019; pp. 607–610.



| Authors | Objectives | Modalities | Features | Results (Accuracy) |
|---|---|---|---|---|
| Skaramagkas [26] | To test new machine learning methods and develop the best model that uses eye-tracking features to predict emotional states. | Eye | Fixation and blink frequency, saccade amplitude and duration, pupil diameter, fixation duration kurtosis, saccade duration variation, and pupil diameter kurtosis. | 84% |
| Fu [27] | To enhance the accuracy and stability of emotion recognition with a dual-branch feature extraction module that extracts features from both modalities simultaneously while maintaining temporal alignment between the modal signals. | EEG, Eye | 33-dimensional eye-movement features (pupil diameter, gaze, saccade, blink, and event statistics) and 62-channel EEG signals. | 87% |
| Oliva [21] | To investigate the relationship between pupil size fluctuations and the process of emotion recognition: participants heard human nonverbal vocalizations and indicated the emotional state of the speakers as soon as they had identified it. | Eye | Peak pupil dilation, time of peak dilation, rate of pre-peak dilation, and rate of post-peak contraction. | 81% |
| Li [28] | To develop a deep gradient convolutional neural network (DGCNN) using eye-movement signals for emotion classification. | Eye | Image features | 87% |
| Aracena [23] | To use the evolution of eye-tracking data over a time window to recognize emotions provoked by visual stimuli of colored images labeled negative, neutral, and positive. | Eye | Pupil size average, pupil size change average | 72% |
| Gong [31] | To overcome individual differences caused by non-stationarity and low signal-to-noise ratios, which challenge cross-subject emotion recognition, with a multimodal model that improves generalization to unseen target-domain subjects. | EEG, Eye | 33-dimensional eye-movement features (pupil diameter, gaze, saccade, blink, and event statistics) and 62-channel EEG signals. | 96% |
| Liu [33] | To identify the similarities and differences among Chinese, German, and French individuals in emotion recognition with EEG and eye movements from an affective-computing perspective. | EEG, Eye | 33-dimensional eye-movement features (pupil diameter, gaze, saccade, blink, and event statistics) and 62-channel EEG signals. | 84% |
| Tarnowski [29] | To classify three emotional states using support vector machines, linear discriminant analysis, and k-nearest neighbors, evaluated with leave-one-subject-out cross-validation. | Eye | 18-dimensional eye-movement features such as average fixation duration, skewness of fixation durations, average saccade amplitude, variation of saccade amplitudes, average pupil diameter, pupil diameter variance, and pupil diameter kurtosis. | 80% |
| Category/Parameter | Value |
|---|---|
| Training and optimization | |
| Optimizer | Adam |
| Learning rate | 0.0005 |
| Momentum | (0.9, 0.999) |
| Batch size | 128 |
| Epochs | 100 (eSEE-d), 50 (SEED-IV, SEED-V) |
| Decay rate | 0.1 |
| Weight decay | 0.01 |
| Model architecture | |
| Hidden layer1 | 64 |
| Hidden layer2 | 128 |
| Dropout | 0.3 |
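For concreteness, the training configuration above maps onto PyTorch roughly as in the sketch below; `model` is a placeholder for the ICTD network, and the scheduler's step interval is an assumption, since the table gives only the decay rate of 0.1.

```python
import torch

# Placeholder for the ICTD network (illustrative only)
model = torch.nn.Linear(33, 5)

# Adam with the table's settings: lr 0.0005, momentum (0.9, 0.999), weight decay 0.01
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

# Decay the learning rate by the table's factor of 0.1; the step interval of 30
# epochs is an assumption. Batch size is 128; 100 epochs on eSEE-d, 50 on SEED-IV/V.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```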
| Eye-Movement Parameters | Extracted Features |
|---|---|
| Pupil diameters (left and right) | Mean, standard deviation |
| Fixation duration (ms) | Mean, standard deviation |
| Dispersion (X and Y) | Mean, standard deviation |
| Saccade duration (ms) | Mean, standard deviation |
| Saccade amplitude (°) | Mean, standard deviation |
| Blink duration (ms) | Mean, standard deviation |
| Event statistics | Fixation frequency, maximum fixation duration, maximum fixation dispersion, total fixation dispersion, saccade frequency, average saccade duration, average saccade amplitude, average saccade latency |
| Eye-Movement Parameters | Extracted Features |
|---|---|
| Pupil diameter | Mean, Var, CV |
| Fixation duration (ms) | Mean, Var, CV |
| Saccade duration (ms) | Kurt, Skew, CV |
| Saccade speed | Kurt, Skew |
| Saccade distance | Kurt, Skew |
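The statistics named in the two feature tables above (mean, variance, coefficient of variation, kurtosis, skewness) can be computed per eye-movement parameter as in this hedged NumPy/SciPy sketch; the function name and estimator settings are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy import stats

def eye_feature_stats(signal: np.ndarray) -> dict:
    """Compute the per-parameter statistics listed in the feature tables:
    mean, variance, coefficient of variation (CV), kurtosis, and skewness."""
    mean = signal.mean()
    var = signal.var()
    return {
        "mean": mean,
        "var": var,
        "cv": np.sqrt(var) / mean if mean != 0 else np.nan,  # std relative to mean
        "kurt": stats.kurtosis(signal),                      # tail heaviness
        "skew": stats.skew(signal),                          # asymmetry
    }

# Toy usage on a synthetic pupil-diameter trace (mm)
features = eye_feature_stats(np.random.normal(3.5, 0.2, size=500))
```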
| Method | Arousal Accuracy (%) ± Std | Arousal F1-Score (%) ± Std | Valence Accuracy (%) ± Std | Valence F1-Score (%) ± Std |
|---|---|---|---|---|
| SVM | 64.6 ± 1.6 | 63.5 ± 1.5 | 70.1 ± 1.1 | 68.3 ± 1.2 |
| Random forest | 62.9 ± 1.3 | 61.8 ± 1.2 | 56.5 ± 0.8 | 54.8 ± 0.8 |
| CNN | 69.5 ± 2.2 | 68.2 ± 2.0 | 71.2 ± 2.6 | 70.1 ± 2.5 |
| CNN–transformer | 71.3 ± 1.8 | 70.0 ± 1.7 | 73.2 ± 2.3 | 73.5 ± 2.1 |
| LSTM | 71.8 ± 2.5 | 70.5 ± 2.3 | 72.6 ± 2.4 | 75.6 ± 2.4 |
| DCCA [39] | 75.5 ± 3.6 | 74.2 ± 3.3 | 76.3 ± 3.1 | 76.8 ± 3.0 |
| DMLP [26] | 72.0 | — | 84.0 | — |
| DGCNN [28] | 78.8 ± 1.5 | 77.5 ± 1.3 | 82.1 ± 1.3 | 81.5 ± 1.2 |
| ICTD (ours) | 81.8 ± 0.9 | 80.4 ± 0.7 | 85.2 ± 1.0 | 84.2 ± 0.9 |
| Method | SEED-IV Accuracy (%) ± Std | SEED-IV F1-Score (%) ± Std |
|---|---|---|
| SVM | 72.3 ± 1.8 | 68.5 ± 2.1 |
| Random forest | 66.2 ± 1.5 | 63.1 ± 1.8 |
| CNN | 75.4 ± 2.9 | 71.7 ± 2.5 |
| CNN–transformer | 81.1 ± 2.6 | 78.4 ± 2.2 |
| LSTM | 80.3 ± 1.7 | 77.6 ± 1.4 |
| DGCNN [28] | 87.8 ± 1.6 | 84.5 ± 1.7 |
| DCCA [39] | 88.9 ± 2.8 | 85.7 ± 2.4 |
| ICTD (Ours) | 91.2 ± 2.1 | 89.3 ± 2.0 |
| Method | SEED-V Accuracy (%) ± Std | SEED-V F1-Score (%) ± Std |
|---|---|---|
| SVM | 68.6 ± 2.3 | 67.5 ± 2.5 |
| Random forest | 59.1 ± 1.7 | 58.2 ± 1.9 |
| CNN | 71.2 ± 3.1 | 70.3 ± 3.3 |
| CNN–transformer | 77.8 ± 3.4 | 76.8 ± 3.6 |
| LSTM | 76.6 ± 1.6 | 76.0 ± 1.5 |
| DGCNN [28] | 82.1 ± 2.3 | 81.5 ± 2.5 |
| DCCA [39] | 82.0 ± 3.2 | 80.3 ± 3.4 |
| ICTD (Ours) | 85.1 ± 2.8 | 84.5 ± 3.0 |
| Method | Accuracy (%) ± Std | F1-Score (%) ± Std |
|---|---|---|
| ICTD (ours) | 91.2 ± 2.1 | 89.3 ± 2.0 |
| w/o IFFN | 89.4 ± 1.8 | 88.1 ± 1.7 |
| w/o cos-DCCA | 85.5 ± 1.3 | 84.9 ± 1.2 |
| w/o IFFN + cos-DCCA | 81.1 ± 2.6 | 78.4 ± 2.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.