Structured and Factorized Multi-Modal Representation Learning for Physiological Affective State and Music Preference Inference
Abstract
1. Introduction
- Structured multi-task affective modeling paradigm: We propose a unified modeling framework that jointly infers emotional valence, arousal, and music liking from physiological signals. By explicitly modeling inter-task relationships, this formulation promotes representation consistency across correlated affective dimensions and improves robustness compared with conventional single-task approaches.
- Structured token-based representation learning: We introduce a token-based representation learning strategy in which affective representations are explicitly factorized into cross-series interaction modeling and intra-series temporal–spectral modeling components. This decomposition, realized via the Cross-Series Intersection (CSI) and Intra-Series Intersection (ISI) modules, enables interpretable, modular, and structurally coherent physiological modeling.
- Hierarchical cross-modal integration mechanism: We propose a hierarchical fusion mechanism that integrates neural, autonomic, and attentional modalities in a structure-aware and representation-consistent manner, preserving modality-level regularities while enabling adaptive cross-modal interaction.
- Strong empirical performance and generalization: Extensive experiments on the DEAP dataset demonstrate that the proposed structured modeling framework achieves competitive performance under both single-task and multi-task learning settings.
2. Related Work
2.1. Emotion Recognition Based on EEG Signals
2.2. Emotion Recognition by Merging Multiple Physiological Signals
2.3. Summary and Motivation
3. Methodology
3.1. Dynamic Token Feature Extractor (DTFE)
3.1.1. Module Architecture
Input Formulation
Instance Normalization
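The instance-normalization step can be illustrated with a minimal PyTorch fragment; the batch, channel, and timestep shapes below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch: per-sample, per-channel instance normalization over the time axis.
# Shapes (B = batch, C = physiological channels, T = timesteps) are assumptions,
# e.g., 32 channels sampled for 10 s at 128 Hz.
x = torch.randn(8, 32, 1280)                      # (B, C, T)
inorm = nn.InstanceNorm1d(num_features=32, affine=True)
x_norm = inorm(x)                                 # zero mean / unit variance per channel, per sample
```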
3.1.2. Multi-Dimensional Projection
Channel-Wise Projection to Latent Tokens
3.1.3. Token-Based Processing
Learnable Temporal Tokenization
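As a concrete illustration of the channel-wise projection (Section 3.1.2) and the learnable temporal tokenization matrix (which the ablation study later contrasts with fixed AvgPool and a frozen random matrix), the following minimal sketch maps a multi-channel window to K latent tokens. All dimensions and the softmax normalization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnableTokenizer(nn.Module):
    """Sketch: channel-wise projection to a latent dimension, then a learnable
    T-by-K tokenization matrix that softly pools timesteps into K tokens."""
    def __init__(self, in_channels: int, seq_len: int, d_model: int, num_tokens: int):
        super().__init__()
        self.channel_proj = nn.Linear(in_channels, d_model)        # mix channels -> latent dim
        self.token_matrix = nn.Parameter(torch.randn(seq_len, num_tokens))

    def forward(self, x):                                          # x: (B, T, C)
        z = self.channel_proj(x)                                   # (B, T, D)
        w = torch.softmax(self.token_matrix, dim=0)                # each token = convex combination of timesteps
        return torch.einsum('btd,tk->bkd', z, w)                   # (B, K, D)

# Illustrative usage (dimensions are assumptions):
# tokenizer = LearnableTokenizer(in_channels=40, seq_len=1280, d_model=64, num_tokens=4)
# tokens = tokenizer(torch.randn(8, 1280, 40))                     # (8, 4, 64)
```

Replacing `token_matrix` with a fixed average-pooling operator recovers the "AvgPool" ablation variant; freezing it after random initialization recovers the "Random (frozen)" variant.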
Design Rationale
3.1.4. Representation Learning
Intuitive Motivation
Cross-Series Intersection (CSI)
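As a hedged sketch of cross-series interaction, CSI can be realized as self-attention applied across the token axis so that tokens derived from different series exchange information; the attention-based realization below is an illustrative assumption, not the paper's verbatim design.

```python
import torch
import torch.nn as nn

class CrossSeriesInteraction(nn.Module):
    """Illustrative CSI sketch: tokens attend to one another across the token
    axis, followed by a residual connection and layer normalization."""
    def __init__(self, d_model: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens):                         # tokens: (B, K, D)
        mixed, _ = self.attn(tokens, tokens, tokens)   # cross-token information exchange
        return self.norm(tokens + mixed)               # residual + norm
```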
Intra-Series Intersection (ISI)
Time-Domain Projection and Nonlinearity
Frequency-Domain Transformation
Learnable Frequency Gating
Inverse Transformation and Residual Fusion
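Putting the four ISI steps together (time-domain projection with a nonlinearity, frequency-domain transformation, learnable frequency gating, inverse transformation with residual fusion), a minimal sketch might look as follows. The per-bin gating parameterization and the choice of axis are assumptions, kept consistent with the ablation's "FFT + IFFT only" and "learnable frequency gating" variants.

```python
import torch
import torch.nn as nn

class IntraSeriesInteraction(nn.Module):
    """Illustrative ISI sketch: time-domain projection + nonlinearity, rFFT,
    a learnable per-frequency-bin gate, inverse rFFT, and residual fusion."""
    def __init__(self, d_model: int, seq_len: int):
        super().__init__()
        self.time_proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        n_bins = seq_len // 2 + 1                       # rFFT output length
        self.freq_gate = nn.Parameter(torch.ones(n_bins, d_model))

    def forward(self, x):                               # x: (B, T, D), T == seq_len
        h = self.time_proj(x)                           # time-domain projection + nonlinearity
        spec = torch.fft.rfft(h, dim=1)                 # (B, T//2+1, D), complex spectrum
        spec = spec * self.freq_gate                    # learnable gating per frequency bin
        h_f = torch.fft.irfft(spec, n=x.size(1), dim=1) # back to the temporal domain
        return x + h_f                                  # residual fusion
```

Zeroing out the gate's learnability (`requires_grad_(False)`) approximates the "FFT + IFFT only" ablation, while a hard 0/1 mask over bins corresponds to the fixed band-pass variant.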
Design Rationale
3.1.5. Output Generation
Temporal Aggregation via Learnable Attention Pooling
Output Embedding Projection
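A minimal sketch of learnable attention pooling over tokens followed by the output embedding projection is given below; the linear scoring function is an illustrative choice.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch: learnable per-token relevance scores, softmax-weighted pooling,
    then projection to the modality-level output embedding."""
    def __init__(self, d_model: int, d_out: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)                  # learnable relevance score
        self.out_proj = nn.Linear(d_model, d_out)

    def forward(self, tokens):                              # tokens: (B, K, D)
        alpha = torch.softmax(self.score(tokens), dim=1)    # (B, K, 1) attention weights
        pooled = (alpha * tokens).sum(dim=1)                # weighted sum over tokens
        return self.out_proj(pooled)                        # (B, d_out) modality embedding
```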
3.2. Cross-Modal Interaction and Fusion
- Overview. As illustrated in Figure 3, the cross-modal interaction and fusion module integrates modality-specific representations extracted by DTFE into a unified affective embedding. Instead of performing early fusion through direct feature concatenation, the proposed design conducts interaction modeling at the representation level, allowing adaptive integration of complementary physiological cues from neural, autonomic, and attentional systems; a minimal sketch of this interaction stage follows this list.
- Modality-level feature construction.
- Cross-modal attention-based interaction.
- Feature refinement and residual learning.
- Global fusion and unified representation.
- Discussion. The proposed fusion strategy provides three practical advantages: (1) it performs cross-modal interaction at a high-level representation space, reducing noise propagation from raw signal domains; (2) it operates on compact modality embeddings, avoiding redundant temporal modeling at the fusion stage; and (3) it enables adaptive, data-driven weighting of physiological cues according to affective context. By conducting interaction modeling on modality-level tokens, the fusion module offers a scalable and interpretable mechanism for integrating heterogeneous physiological signals, which is particularly suitable for music therapy applications where multiple physiological systems jointly contribute to emotional responses.
- Computational efficiency. In addition to accuracy, computational efficiency is a practical consideration in multi-modal affective modeling. Our design emphasizes compact token-based encoding and lightweight cross-modal interaction so that performance gains do not rely on excessive model scaling. As reported in Table 2, our method achieves the best overall multi-task performance while maintaining moderate computational cost (2.86 M parameters and 0.82 G FLOPs), which is lower than Transformer-based baselines and notably more efficient than recent EEG-specific architectures. These results indicate that the proposed structured DTFE and hierarchical fusion introduce only limited overhead, yet they deliver substantial accuracy improvements.
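As referenced in the overview above, the following sketch illustrates the representation-level fusion stage, assuming each DTFE branch emits one modality-level embedding (e.g., neural, autonomic, attentional). Layer sizes and the mean-based global fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative fusion sketch: modality embeddings interact via attention,
    are refined with a residual block, then fused into one affective embedding."""
    def __init__(self, d_model: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.refine = nn.Sequential(nn.LayerNorm(d_model),
                                    nn.Linear(d_model, d_model), nn.GELU())
        self.fuse = nn.Linear(d_model, d_model)

    def forward(self, modality_embs):                  # (B, M, D), M = number of modalities
        mixed, _ = self.attn(modality_embs, modality_embs, modality_embs)
        refined = modality_embs + self.refine(mixed)   # feature refinement + residual learning
        return self.fuse(refined.mean(dim=1))          # global fusion -> unified (B, D) embedding
```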
3.2.1. Multi-Task Loss Function
Overall Objective
Task-Specific Loss Formulation
Hyperparameter Selection and Sensitivity Analysis
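The overall objective combines the three task-specific losses under learnable weights (cf. the ablation in Table 15). The sketch below shows one common realization, homoscedastic-uncertainty weighting; it is an illustrative assumption, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Sketch: learnable weighting of the valence / arousal / liking losses via
    learned log-variances (Kendall-style uncertainty weighting)."""
    def __init__(self, num_tasks: int = 3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))   # one learned log-variance per task

    def forward(self, task_losses):                    # list of per-task scalar losses
        losses = torch.stack(task_losses)
        weights = torch.exp(-self.log_vars)            # harder (noisier) tasks get down-weighted
        return (weights * losses + self.log_vars).sum()
```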
4. Experiments
4.1. Dataset and Task Setup
- Strict subject-level partition before window segmentation.
- Valence, Arousal, and Liking scores, originally rated on a 9-point scale in the DEAP dataset, are treated as nine ordered categories and formulated as parallel 9-class classification tasks. Each affective dimension is modeled with an independent prediction head equipped with softmax activation. The optimization objective adopts focal cross-entropy with label smoothing to mitigate potential class imbalance and reduce over-confidence under subjective affective annotations; a minimal sketch of this objective follows the list below.
- Aligned setting for prior-work comparison (Table 4). In addition to the above 9-class formulation, we also evaluate a reduced-label configuration: valence and arousal are binarized into high vs. low (2-class) tasks following common DEAP practice, and categorical emotion is evaluated under a 5-class scheme (Q1–Q4 plus neutral). This setting is not the main task configuration of this paper and is used solely for fair comparison with existing single-task studies. In addition, an auxiliary branch is introduced for discrete affective quadrant classification:
- Quadrants (Q1–Q4) and Neutral states are modeled as a five-dimensional multi-label classification task, trained independently from the valence–arousal–liking pipeline. This separation enables an explicit evaluation of categorical emotion distribution without mutual interference from ordinal regression targets.
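A minimal sketch of the focal cross-entropy with label smoothing used for the 9-class heads is given below; the gamma and smoothing values are illustrative assumptions rather than the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def focal_ce_with_smoothing(logits, targets, gamma=2.0, smoothing=0.1):
    """Sketch: label-smoothed cross-entropy with a focal modulation term.
    logits: (B, num_classes); targets: (B,) integer class indices."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Smoothed target distribution: (1 - eps) on the true class, eps spread uniformly.
    with torch.no_grad():
        true_dist = torch.full_like(log_probs, smoothing / (num_classes - 1))
        true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    ce = -(true_dist * log_probs).sum(dim=-1)                        # smoothed cross-entropy
    pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()  # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()                         # focal down-weighting of easy samples
```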
- Model performance is quantitatively evaluated using three standard metrics: Accuracy (Acc) measures the proportion of correctly classified samples among the total number of samples, reflecting overall recognition correctness. F1-score (F1) represents the harmonic mean of precision and recall, providing a balanced assessment under potential class imbalance. Precision (Prec) quantifies the ratio of true positive predictions to all positive predictions, indicating the reliability of positive classifications.
4.2. Class Distribution Analysis
4.3. Training Configuration
4.4. Comparison with Prior Works on DEAP Dataset
4.5. Multi-Task Classification Results and Analysis
4.6. Ordinal Structure Analysis and Modeling Justification
4.7. Statistical Performance Analysis
- LOSO cross-validation and windowing effect analysis (a minimal subject-level splitting sketch follows this list).
- Statistical Significance Test.
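The sketch below illustrates the strict subject-level LOSO protocol, in which subjects are partitioned before window segmentation so that no window from a held-out subject can leak into training. Window length and stride values are illustrative (10 s windows at 128 Hz; the stride parameter also reproduces the overlapping vs. non-overlapping protocols compared in Section 4.13).

```python
import numpy as np

def subject_level_loso(trials, labels, subjects, win_len=1280, stride=128):
    """Sketch: leave-one-subject-out splits with segmentation AFTER the split.
    trials: list of time-major arrays (T, C); labels: per-trial targets;
    subjects: per-trial subject IDs. stride=128 ~ 1 s overlap protocol,
    stride=win_len ~ non-overlapping protocol."""
    def segment(trial):
        return [trial[s:s + win_len] for s in range(0, len(trial) - win_len + 1, stride)]

    for held_out in np.unique(subjects):
        X_train, y_train, X_test, y_test = [], [], [], []
        for trial, label, subj in zip(trials, labels, subjects):
            wins = segment(trial)                       # windows created per split, per trial
            if subj == held_out:
                X_test.extend(wins); y_test.extend([label] * len(wins))
            else:
                X_train.extend(wins); y_train.extend([label] * len(wins))
        yield held_out, (X_train, y_train), (X_test, y_test)
```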
4.8. Robustness Analysis Under Different Train–Test Splits
4.9. Detailed 9-Class Precision Analysis
4.10. Ablation Study on DTFE and Fusion Modules
4.11. Ablation Study: Learnable vs. Fixed Loss Weights
- Analysis. As shown in Table 15, enabling the loss weights to be learnable significantly improves classification performance across all three affective dimensions. Compared to fixed equal weights, the dynamic weighting mechanism allows the model to adaptively prioritize more difficult or underperforming tasks during training. This leads to more balanced optimization and consistent gains in accuracy, F1-score, and precision. We conclude that learnable task weights are essential for achieving optimal performance in multi-task affective modeling.
4.12. Disentangling Modality Scaling and Structural Contributions
4.13. Window Overlap Effect Analysis
5. Potential Applications and Limitations
- Prospective Therapeutic Monitoring: The model may provide a quantitative foundation for analyzing trends in valence, arousal, and liking signals during music exposure. In future clinical settings, such physiological-based prediction frameworks could potentially assist clinicians in assessing emotional responses during therapy sessions. However, clinical feasibility and reliability would require dedicated validation on patient cohorts.
- Adaptive Affective Computing Systems: Given its multi-modal fusion capability, the proposed framework could be extended toward affective-aware music recommendation or interactive systems. Nevertheless, real-time deployment would require latency optimization, hardware integration, and robustness testing under unconstrained environments.
- Biofeedback-Driven Interfaces: The architecture may serve as a computational backbone for biofeedback-driven VR or immersive environments. Future work should investigate real-time physiological acquisition, noise robustness, and user-dependent calibration mechanisms.
- Clinical Decision Support (Long-Term Vision): With longitudinal physiological recordings and validated clinical protocols, multi-task affective models could eventually contribute to clinical dashboards for emotional trend monitoring. Such applications, however, remain beyond the scope of the present DEAP-based study.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Yang, Y.; Wu, Q.M.J.; Zheng, W.L.; Lu, B.L. EEG-Based Emotion Recognition Using Hierarchical Network with Subnetwork Nodes. IEEE Trans. Cogn. Dev. Syst. 2018, 10, 408–419.
- Wang, X.W.; Nie, D.; Lu, B.L. Emotional State Classification from EEG Data Using Machine Learning Approach. Neurocomputing 2014, 129, 94–106.
- Li, C.; Xu, C.; Feng, Z. Analysis of Physiological Signals for Emotion Recognition with the IRS Model. Neurocomputing 2016, 178, 103–111.
- Verma, G.K.; Tiwary, U.S. Multimodal Fusion Framework: A Multiresolution Approach for Emotion Classification and Recognition from Physiological Signals. NeuroImage 2014, 102, 162–172.
- Liu, W.; Zheng, W.L.; Lu, B.L. Emotion Recognition Using Multimodal Deep Learning. In Proceedings of the 23rd International Conference on Neural Information Processing (ICONIP), Kyoto, Japan; Springer: Cham, Switzerland, 2016; pp. 521–529.
- Yin, Z.; Zhao, M.; Wang, Y.; Yang, J.; Zhang, J. Recognition of Emotions Using Multimodal Physiological Signals and an Ensemble Deep Learning Model. Comput. Methods Programs Biomed. 2017, 140, 93–110.
- Ma, J.; Tang, H.; Zheng, W.L.; Lu, B.L. Emotion Recognition Using Multimodal Residual LSTM Network. In Proceedings of the 27th ACM International Conference on Multimedia; ACM: New York, NY, USA, 2019; pp. 176–183.
- Hassan, M.M.; Alam, M.G.R.; Uddin, M.Z.; Huda, S.; Almogren, A.; Fortino, G. Human Emotion Recognition Using Deep Belief Network Architecture. Inf. Fusion 2019, 51, 10–18.
- Liu, W.; Qiu, J.L.; Zheng, W.L.; Lu, B.L. Multimodal Emotion Recognition Using Deep Canonical Correlation Analysis. arXiv 2019, arXiv:1908.05349.
- Zhang, Y.; Cheng, C.; Zhang, Y. Multimodal Emotion Recognition Based on Manifold Learning and Convolution Neural Network. Multimed. Tools Appl. 2022, 81, 33253–33268.
- Tang, J.; Ma, Z.; Gan, K.; Zhang, J.; Yin, Z. Hierarchical Multimodal Fusion of Physiological Signals for Emotion Recognition with Scenario Adaption and Contrastive Alignment. Inf. Fusion 2024, 103, 102129.
- Li, Q.; Jin, D.; Huang, J.; Zhong, Q.; Xu, L.; Lin, J.; Jiang, D. DEMA: Deep EEG-First Multi-Physiological Affect Model for Emotion Recognition. Biomed. Signal Process. Control 2025, 99, 106812.
- Lan, Z.; Sourina, O.; Wang, L.; Liu, Y. Real-Time EEG-Based Emotion Monitoring Using Stable Features. Vis. Comput. 2016, 32, 347–358.
- Li, Y.; Zheng, W.; Cui, Z.; Zong, Y.; Ge, S. EEG Emotion Recognition Based on Graph Regularized Sparse Linear Regression. Neural Process. Lett. 2019, 49, 555–571.
- Cheng, J.; Chen, M.; Li, C.; Liu, Y.; Song, R.; Liu, A.; Chen, X. Emotion Recognition from Multi-Channel EEG via Deep Forest. IEEE J. Biomed. Health Inform. 2020, 25, 453–464.
- Song, T.; Zheng, W.; Song, P.; Cui, Z. EEG Emotion Recognition Using Dynamical Graph Convolutional Neural Networks. IEEE Trans. Affect. Comput. 2020, 11, 532–541.
- Zhang, Y.; Liu, H.; Zhang, D.; Chen, X.; Qin, T.; Zheng, Q. EEG-Based Emotion Recognition with Emotion Localization via Hierarchical Self-Attention. IEEE Trans. Affect. Comput. 2023, 14, 2458–2469.
- Ding, Y.; Tong, C.; Zhang, S.; Jiang, M.; Li, Y.; Lim, K.J.; Guan, C. EmT: A Novel Transformer for Generalized Cross-Subject EEG Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 10381–10393.
- Ding, Y.; Zhang, S.; Tang, C.; Guan, C. MASA-TCN: Multi-Anchor Space-Aware Temporal Convolutional Neural Networks for Continuous and Discrete EEG Emotion Recognition. IEEE J. Biomed. Health Inform. 2024, 28, 3953–3964.
- Xiao, M.; Zhu, Z.; Xie, K.; Jiang, B. MEEG and AT-DGNN: Improving EEG Emotion Recognition with Music Introducing and Graph-Based Learning. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); IEEE: Piscataway, NJ, USA, 2024; pp. 4201–4208.
- Liu, S.; Zhao, Y.; An, Y.; Zhao, J.; Wang, S.H.; Yan, J. GLFANet: A Global to Local Feature Aggregation Network for EEG Emotion Recognition. Biomed. Signal Process. Control 2023, 85, 104799.
- Jin, H.; Gao, Y.; Wang, T.; Gao, P. DAST: A Domain-Adaptive Learning Combining Spatio-Temporal Dynamic Attention for Electroencephalography Emotion Recognition. IEEE J. Biomed. Health Inform. 2023, 28, 2512–2523.
- Huang, H.; Xie, Q.; Pan, J.; He, Y.; Wen, Z.; Yu, R.; Li, Y. An EEG-Based Brain–Computer Interface for Emotion Recognition and Its Application in Patients with Disorder of Consciousness. IEEE Trans. Affect. Comput. 2019, 12, 832–842.
- Gu, X.; Cao, Z.; Jolfaei, A.; Xu, P.; Wu, D.; Jung, T.P.; Lin, C.T. EEG-Based Brain–Computer Interfaces (BCIs): A Survey of Recent Studies on Signal Sensing Technologies and Computational Intelligence Approaches and Their Applications. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 18, 1645–1666.
- Kandemir, M.; Vetek, A.; Gönen, M.; Klami, A.; Kaski, S. Multi-Task and Multi-View Learning of User State. Neurocomputing 2014, 139, 97–106.
- Tang, H.; Liu, W.; Zheng, W.L.; Lu, B.L. Multimodal Emotion Recognition Using Deep Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing (ICONIP), Guangzhou, China; Springer: Cham, Switzerland, 2017; pp. 811–819.
- Kim, B.H.; Jo, S. Deep Physiological Affect Network for the Recognition of Human Emotions. IEEE Trans. Affect. Comput. 2020, 11, 230–243.
- Kusumaningrum, T.D.; Faqih, A.; Kusumoputro, B. Emotion Recognition Based on DEAP Database Using EEG Time-Frequency Features and Machine Learning Methods. J. Phys. Conf. Ser. 2020, 1501, 012020.
- Zhang, X.; Liu, J.; Shen, J.; Li, S.; Hou, K.; Hu, B.; Gao, J.; Zhang, T. Emotion Recognition from Multimodal Physiological Signals Using a Regularized Deep Fusion of Kernel Machine. IEEE Trans. Cybern. 2021, 51, 4386–4399.
- Liu, J.W.; Yang, D.; Feng, T.W.; Fu, J.J. MDFD2-DETR: A Real-Time Complex Road Object Detection Model Based on Multi-Domain Feature Decomposition and De-Redundancy. IEEE Trans. Intell. Veh. 2025, 10, 4343–4359.
- Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to Forget: Continual Prediction with LSTM. Neural Comput. 2000, 12, 2451–2471.
- Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The Performance of LSTM and BiLSTM in Forecasting Time Series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA; IEEE: Piscataway, NJ, USA, 2019; pp. 3285–3292.
- Rana, R. Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech. arXiv 2016, arXiv:1612.07778.
- Pinkus, A. Approximation Theory of the MLP Model in Neural Networks. Acta Numer. 1999, 8, 143–195.
- Tang, W.; Long, G.; Liu, L.; Zhou, T.; Jiang, J.; Blumenstein, M. Rethinking 1D-CNN for Time Series Classification: A Stronger Baseline. arXiv 2020, arXiv:2002.10061.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 785–794.
- Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 2980–2988.
- Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A Database for Emotion Analysis Using Physiological Signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31.
| Method | Task Type | Emotion Categories | Modalities |
|---|---|---|---|
| SVM [4] | Single | 5-class | EEG, EDA, GSR, SCR, skin temp |
| MT-MKL [25] | Single | 2-class | EEG, GSR, RB, skin temp |
| IRS [3] | Single | 4-class | ECG, GSR, PPG |
| BDAE [5] | Single | 2-class | EEG, eye movement |
| Bimodal-LSTM [26] | Single | 2-class | EDA, PPG, EMG |
| MESAE [6] | Single | 5-class | EEG, EOG, EMG, GSR, temp, BP |
| DCCA [9] | Single | 2/4-class | EEG, eye, GSR, EMG, PPG |
| MM-ResLSTM [7] | Single | 2-class | EEG, peripheral signals |
| FGSVM [8] | Single | 5-class | EDA, PPG, EMG |
| DPAN [27] | Single | 2-class | EDA, PPG |
| Random Forest [28] | Single | 2-class | EMG, EOG |
| RDFKM [29] | Single | 2-class | EEG, EMG, GSR, RES |
| i-Isomap + DCNN [10] | Single | 4-class | EEG, peripheral, eye |
| RHRPNet [11] | Single | 2/4-class | EEG, peripheral signals |
| DEMA [12] | Single | 2/5-class | EEG, GSR, BP, RB |
| Our Method | Multi-task | 3 × 9-class | EEG, GSR, BVP, EMG-Zyg, EMG-Trap, Resp., Temp., EOG |
Table 2. Multi-task performance comparison on the DEAP dataset (all metrics in %).
| Model | Val Acc | Val F1 | Val Prec | Aro Acc | Aro F1 | Aro Prec | Like Acc | Like F1 | Like Prec | Params (M) | FLOPs (G) | Train Time (min) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 85.6 | 87.3 | 86.1 | 84.5 | 87.0 | 85.6 | 83.2 | 86.1 | 84.4 | 1.62 | 0.24 | 56 |
| BiLSTM [32] | 87.0 | 83.6 | 85.0 | 83.1 | 84.2 | 85.3 | 84.1 | 83.8 | 82.5 | 2.45 | 0.40 | 74 |
| GRU [33] | 86.4 | 87.1 | 86.7 | 85.0 | 85.3 | 84.9 | 86.0 | 84.0 | 83.7 | 1.38 | 0.20 | 52 |
| MLP [34] | 85.8 | 86.2 | 85.5 | 84.3 | 85.1 | 84.4 | 85.1 | 83.5 | 83.2 | 0.62 | 0.06 | 38 |
| 1D-CNN [35] | 84.0 | 84.6 | 83.9 | 83.7 | 83.5 | 82.8 | 85.0 | 84.0 | 83.1 | 1.08 | 0.16 | 46 |
| XGBoost [36] | 83.2 | 84.1 | 83.0 | 87.2 | 89.0 | 88.1 | 83.0 | 86.4 | 85.2 | 0.28 | 0.02 | 26 |
| CNN-LSTM [37] | 83.7 | 84.5 | 84.2 | 83.5 | 87.1 | 85.5 | 84.3 | 86.0 | 85.1 | 3.12 | 0.58 | 96 |
| Transformer [38] | 88.4 | 85.0 | 85.7 | 90.1 | 89.5 | 89.6 | 84.2 | 87.3 | 86.4 | 3.78 | 1.05 | 112 |
| EmT [18] | 91.1 | 89.8 | 89.4 | 90.3 | 91.6 | 90.8 | 88.5 | 89.9 | 89.1 | 3.92 | 3.18 | 125 |
| MASA-TCN [19] | 89.6 | 88.4 | 88.1 | 88.9 | 90.3 | 89.5 | 86.7 | 88.2 | 87.4 | 2.41 | 2.02 | 85 |
| AT-DGNN [20] | 90.3 | 89.1 | 88.7 | 89.7 | 91.0 | 90.2 | 87.9 | 89.3 | 88.5 | 3.18 | 2.45 | 98 |
| Ours (Multi-Task) | 92.8 | 92.6 | 92.2 | 91.8 | 93.7 | 93.0 | 93.6 | 92.4 | 91.9 | 2.86 | 0.82 | 88 |
| Variant | Val Acc (%) | Aro Acc (%) | Like Acc (%) | Avg |
|---|---|---|---|---|
| Ours (Full) | 92.8 | 91.8 | 93.6 | 92.7 |
| (0) Single-task vs. Multi-task | ||||
| Single-task (Val only) | 92.9 | – | – | N/A |
| Single-task (Aro only) | – | 92.0 | – | N/A |
| Single-task (Like only) | – | – | 92.5 | N/A |
| Multi-task (shared backbone) | 92.8 | 91.8 | 93.6 | 92.7 |
| (1) CSI vs. ISI contributions | ||||
| w/o CSI (keep ISI) | 91.2 | 90.4 | 92.0 | 91.2 |
| w/o ISI (keep CSI) | 90.6 | 89.3 | 91.1 | 90.3 |
| w/o CSI & ISI (token only) | 88.9 | 87.8 | 89.6 | 88.8 |
| (2) Effect of learnable tokenization matrix | ||||
| Replace with AvgPool (fixed pooling) | 91.5 | 90.1 | 92.1 | 91.2 |
| Random (frozen, not learned) | 91.8 | 90.6 | 92.5 | 91.6 |
| (3) Token number sensitivity (EEG/Periph/EOG tokens) | ||||
| (2/1/1) tokens | 91.3 | 90.2 | 92.2 | 91.2 |
| (4/2/1) tokens (default) | 92.8 | 91.8 | 93.6 | 92.7 |
| (8/4/2) tokens | 92.6 | 91.6 | 93.4 | 92.5 |
| (4) Normalization choice in DTFE | ||||
| BatchNorm (BN) | 91.0 | 89.9 | 91.8 | 90.9 |
| LayerNorm (LN) | 92.1 | 91.0 | 92.9 | 92.0 |
| InstanceNorm (IN, default) | 92.8 | 91.8 | 93.6 | 92.7 |
| (5) ISI frequency gating: learnable vs. fixed band-pass | ||||
| ISI w/o frequency gating (FFT + IFFT only) | 91.0 | 89.6 | 91.9 | 90.8 |
| Fixed band-pass filtering | 91.8 | 90.7 | 92.6 | 91.7 |
| Learnable frequency gating (default) | 92.8 | 91.8 | 93.6 | 92.7 |
Table 4. Aligned-setting comparison with prior works on the DEAP dataset.
| Method | Valence (%) | Arousal (%) | Q1–Q4 + Neutral (%) |
|---|---|---|---|
| SVM [4] | 81.45 | – | – |
| MT-MKL [25] | 60.00 | 58.00 | – |
| BDAE [5] | 85.20 | 80.50 | – |
| Bimodal-LSTM [26] | 83.82 | 83.23 | – |
| MESAE [6] | 83.04 | 84.18 | 84.18 |
| DCCA [9] | 85.62 | 84.33 | – |
| MM-ResLSTM [7] | 92.30 | 92.87 | – |
| FGSVM [8] | – | – | 89.53 |
| DPAN [27] | 78.72 | 79.03 | – |
| Random Forest [28] | 62.58 | – | – |
| RDFKM [29] | 64.50 | 63.10 | – |
| i-Isomap + DCNN [10] | – | – | 90.05 |
| RHRPNet [11] | 74.17 | 74.34 | – |
| DEMA [12] | 97.55 | 97.61 | 97.01 |
| Ours (Single-task) | 98.21 ± 0.29 | 98.34 ± 0.31 | 97.85 ± 0.26 |
| Score | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Ratio (%) | 6 | 8 | 12 | 15 | 18 | 15 | 12 | 8 | 6 | 100 |
| True∖Pred | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 95.4 | 1.2 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.1 |
| C2 | 1.5 | 94.1 | 1.3 | 0.8 | 0.7 | 0.6 | 0.5 | 0.3 | 0.2 |
| C3 | 0.9 | 1.4 | 95.0 | 0.9 | 0.6 | 0.5 | 0.4 | 0.2 | 0.1 |
| C4 | 0.8 | 1.0 | 1.1 | 89.3 | 3.2 | 2.1 | 1.2 | 0.8 | 0.5 |
| C5 | 0.7 | 0.8 | 0.9 | 3.5 | 88.6 | 2.9 | 1.3 | 0.8 | 0.5 |
| C6 | 0.6 | 0.7 | 0.8 | 2.8 | 3.1 | 89.8 | 1.4 | 0.6 | 0.2 |
| C7 | 0.5 | 0.6 | 0.4 | 1.2 | 1.4 | 1.3 | 94.2 | 0.3 | 0.1 |
| C8 | 0.3 | 0.4 | 0.3 | 0.7 | 0.9 | 0.8 | 0.5 | 95.8 | 0.3 |
| C9 | 0.2 | 0.3 | 0.2 | 0.6 | 0.7 | 0.6 | 0.3 | 0.4 | 96.7 |
| True∖Pred | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 93.8 | 1.4 | 1.0 | 0.9 | 0.8 | 0.7 | 0.5 | 0.5 | 0.4 |
| C2 | 1.6 | 92.7 | 1.5 | 1.0 | 0.9 | 0.8 | 0.6 | 0.5 | 0.4 |
| C3 | 1.0 | 1.5 | 93.4 | 1.1 | 0.9 | 0.8 | 0.6 | 0.4 | 0.3 |
| C4 | 0.9 | 1.2 | 1.3 | 87.5 | 3.9 | 2.8 | 1.5 | 0.6 | 0.3 |
| C5 | 0.8 | 1.0 | 1.1 | 4.2 | 86.9 | 3.4 | 1.6 | 0.7 | 0.3 |
| C6 | 0.7 | 0.9 | 1.0 | 3.6 | 3.8 | 87.8 | 1.6 | 0.4 | 0.2 |
| C7 | 0.6 | 0.7 | 0.6 | 1.5 | 1.6 | 1.7 | 93.1 | 0.1 | 0.1 |
| C8 | 0.5 | 0.6 | 0.5 | 0.8 | 0.9 | 0.8 | 0.3 | 95.2 | 0.4 |
| C9 | 0.4 | 0.4 | 0.4 | 0.6 | 0.7 | 0.6 | 0.2 | 0.3 | 96.4 |
| True∖Pred | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 96.2 | 0.9 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.3 | 0.1 |
| C2 | 1.0 | 95.1 | 1.0 | 0.8 | 0.7 | 0.6 | 0.4 | 0.3 | 0.1 |
| C3 | 0.8 | 1.1 | 95.8 | 0.9 | 0.6 | 0.4 | 0.2 | 0.1 | 0.1 |
| C4 | 0.7 | 0.9 | 1.0 | 90.8 | 3.2 | 1.8 | 0.9 | 0.5 | 0.2 |
| C5 | 0.6 | 0.8 | 0.9 | 3.4 | 90.2 | 2.1 | 1.0 | 0.6 | 0.4 |
| C6 | 0.5 | 0.7 | 0.6 | 2.0 | 2.3 | 91.3 | 1.2 | 0.9 | 0.5 |
| C7 | 0.4 | 0.5 | 0.3 | 1.0 | 1.1 | 1.2 | 95.0 | 0.3 | 0.2 |
| C8 | 0.3 | 0.4 | 0.2 | 0.6 | 0.7 | 0.8 | 0.4 | 96.3 | 0.3 |
| C9 | 0.2 | 0.3 | 0.1 | 0.4 | 0.6 | 0.5 | 0.2 | 0.3 | 97.4 |
| Model | Val Acc (%) | Val F1 (%) | Val Prec (%) | Aro Acc (%) | Aro F1 (%) | Aro Prec (%) | Like Acc (%) | Like F1 (%) | Like Prec (%) |
|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 85.18 ± 0.38 | 86.92 ± 0.36 | 85.77 ± 0.35 | 84.07 ± 0.40 | 86.55 ± 0.37 | 85.14 ± 0.36 | 82.76 ± 0.39 | 85.72 ± 0.36 | 83.91 ± 0.35 |
| BiLSTM [32] | 86.52 ± 0.36 | 83.21 ± 0.39 | 84.63 ± 0.37 | 82.74 ± 0.40 | 83.81 ± 0.39 | 84.90 ± 0.36 | 83.58 ± 0.38 | 83.41 ± 0.38 | 82.06 ± 0.37 |
| GRU [33] | 86.03 ± 0.34 | 86.71 ± 0.35 | 86.28 ± 0.33 | 84.63 ± 0.37 | 84.94 ± 0.36 | 84.52 ± 0.35 | 85.58 ± 0.36 | 83.66 ± 0.37 | 83.29 ± 0.36 |
| MLP [34] | 85.32 ± 0.39 | 85.79 ± 0.37 | 85.12 ± 0.36 | 83.88 ± 0.40 | 84.74 ± 0.38 | 83.97 ± 0.37 | 84.69 ± 0.39 | 83.12 ± 0.38 | 82.81 ± 0.37 |
| 1D-CNN [35] | 83.61 ± 0.40 | 84.21 ± 0.38 | 83.54 ± 0.36 | 83.28 ± 0.40 | 83.02 ± 0.39 | 82.33 ± 0.38 | 84.57 ± 0.39 | 83.63 ± 0.38 | 82.79 ± 0.37 |
| XGBoost [36] | 82.76 ± 0.40 | 83.65 ± 0.39 | 82.58 ± 0.38 | 86.81 ± 0.38 | 88.54 ± 0.36 | 87.76 ± 0.35 | 82.64 ± 0.40 | 85.97 ± 0.37 | 84.82 ± 0.36 |
| CNN-LSTM [37] | 83.21 ± 0.39 | 84.07 ± 0.37 | 83.81 ± 0.36 | 83.08 ± 0.40 | 86.66 ± 0.38 | 84.97 ± 0.36 | 83.94 ± 0.39 | 85.55 ± 0.37 | 84.67 ± 0.36 |
| Transformer [38] | 87.92 ± 0.32 | 84.61 ± 0.35 | 85.21 ± 0.34 | 89.63 ± 0.30 | 88.94 ± 0.31 | 89.02 ± 0.30 | 83.81 ± 0.35 | 86.91 ± 0.33 | 85.97 ± 0.32 |
| EmT [18] | 90.98 ± 0.31 | 89.62 ± 0.33 | 89.25 ± 0.32 | 90.16 ± 0.34 | 91.44 ± 0.31 | 90.56 ± 0.32 | 88.31 ± 0.36 | 89.74 ± 0.34 | 88.95 ± 0.33 |
| MASA-TCN [19] | 89.48 ± 0.35 | 88.27 ± 0.36 | 87.95 ± 0.35 | 88.73 ± 0.37 | 90.11 ± 0.35 | 89.32 ± 0.34 | 86.53 ± 0.38 | 88.04 ± 0.36 | 87.25 ± 0.35 |
| AT-DGNN [20] | 90.17 ± 0.33 | 88.94 ± 0.34 | 88.54 ± 0.33 | 89.56 ± 0.35 | 90.78 ± 0.32 | 90.01 ± 0.33 | 87.74 ± 0.36 | 89.16 ± 0.34 | 88.37 ± 0.33 |
| Ours (Multi-Task) | 92.47 ± 0.29 | 92.21 ± 0.31 | 91.88 ± 0.29 | 91.42 ± 0.33 | 93.18 ± 0.30 | 92.55 ± 0.30 | 93.08 ± 0.26 | 92.01 ± 0.30 | 91.63 ± 0.31 |
| Metric | Valence | Arousal | Liking |
|---|---|---|---|
| Accuracy (%) | 91.7 ± 2.1 (CI ± 0.73) | 90.9 ± 2.3 (CI ± 0.80) | 92.5 ± 1.9 (CI ± 0.66) |
| F1-score (%) | 91.2 ± 2.0 (CI ± 0.69) | 92.0 ± 2.1 (CI ± 0.73) | 91.8 ± 1.8 (CI ± 0.62) |
| Precision (%) | 90.8 ± 2.2 (CI ± 0.76) | 92.6 ± 2.0 (CI ± 0.69) | 91.4 ± 1.9 (CI ± 0.66) |
| Model | Val Acc (%) | Val F1 (%) | Val Prec (%) | Aro Acc (%) | Aro F1 (%) | Aro Prec (%) | Like Acc (%) | Like F1 (%) | Like Prec (%) |
|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 84.3 | 85.8 | 84.6 | 83.2 | 85.4 | 84.1 | 82.1 | 84.7 | 83.0 |
| BiLSTM [32] | 85.6 | 82.4 | 83.7 | 82.0 | 83.1 | 84.0 | 83.0 | 82.6 | 81.5 |
| GRU [33] | 85.1 | 85.6 | 85.2 | 83.8 | 84.1 | 83.6 | 84.6 | 82.9 | 82.6 |
| MLP [34] | 84.7 | 84.9 | 84.2 | 83.1 | 83.8 | 83.2 | 84.0 | 82.4 | 82.0 |
| 1D-CNN [35] | 82.9 | 83.5 | 82.7 | 82.6 | 82.4 | 81.9 | 83.8 | 82.7 | 81.9 |
| XGBoost [36] | 81.9 | 82.7 | 81.6 | 86.1 | 87.6 | 86.7 | 81.8 | 85.1 | 84.0 |
| CNN-LSTM [37] | 82.6 | 83.4 | 83.0 | 82.4 | 85.8 | 84.1 | 83.1 | 84.7 | 83.8 |
| Transformer [38] | 87.2 | 83.6 | 84.4 | 88.9 | 88.1 | 88.2 | 82.9 | 85.8 | 84.9 |
| EmT [18] | 90.6 | 89.3 | 88.8 | 90.0 | 91.2 | 90.4 | 88.9 | 90.0 | 89.3 |
| MASA-TCN [19] | 89.2 | 88.0 | 87.6 | 88.5 | 89.8 | 89.1 | 87.4 | 88.7 | 88.0 |
| AT-DGNN [20] | 90.0 | 88.8 | 88.4 | 89.4 | 90.6 | 89.9 | 88.2 | 89.5 | 88.8 |
| Ours (Multi-Task) | 91.7 | 91.4 | 91.0 | 90.6 | 92.5 | 91.8 | 92.4 | 91.2 | 90.7 |
| Model | Val Acc (%) | Val F1 (%) | Val Prec (%) | Aro Acc (%) | Aro F1 (%) | Aro Prec (%) | Like Acc (%) | Like F1 (%) | Like Prec (%) |
|---|---|---|---|---|---|---|---|---|---|
| LSTM [31] | 82.6 | 84.1 | 82.9 | 81.4 | 83.2 | 82.0 | 80.3 | 82.8 | 81.2 |
| BiLSTM [32] | 83.8 | 80.9 | 82.1 | 80.6 | 81.7 | 82.5 | 81.5 | 81.1 | 80.1 |
| GRU [33] | 83.5 | 84.0 | 83.6 | 82.1 | 82.4 | 81.9 | 82.7 | 81.2 | 80.9 |
| MLP [34] | 83.0 | 83.2 | 82.5 | 81.6 | 82.3 | 81.7 | 82.0 | 80.6 | 80.3 |
| 1D-CNN [35] | 81.2 | 81.8 | 81.1 | 80.9 | 80.7 | 80.2 | 82.1 | 81.0 | 80.2 |
| XGBoost [36] | 80.1 | 80.9 | 79.8 | 84.7 | 86.1 | 85.3 | 80.4 | 83.7 | 82.5 |
| CNN-LSTM [37] | 81.5 | 82.3 | 81.9 | 81.1 | 84.2 | 82.8 | 81.7 | 83.3 | 82.3 |
| Transformer [38] | 85.6 | 82.1 | 82.8 | 87.4 | 86.8 | 86.9 | 81.4 | 84.0 | 83.1 |
| EmT [18] | 89.1 | 87.9 | 87.4 | 88.6 | 89.8 | 89.0 | 87.1 | 88.3 | 87.6 |
| MASA-TCN [19] | 87.8 | 86.6 | 86.1 | 87.0 | 88.3 | 87.6 | 85.6 | 86.9 | 86.2 |
| AT-DGNN [20] | 88.6 | 87.4 | 86.9 | 88.0 | 89.2 | 88.4 | 86.4 | 87.7 | 87.0 |
| Ours (Multi-Task) | 90.2 | 89.9 | 89.4 | 88.9 | 90.8 | 90.1 | 90.9 | 89.7 | 89.2 |
| Class | Valence | Arousal | Liking |
|---|---|---|---|
| 1 | 88.4 | 88.1 | 87.6 |
| 2 | 89.3 | 88.8 | 88.4 |
| 3 | 90.5 | 89.7 | 89.1 |
| 4 | 91.1 | 90.4 | 89.8 |
| 5 | 92.0 | 91.3 | 90.6 |
| 6 | 90.8 | 90.6 | 90.2 |
| 7 | 89.7 | 89.5 | 89.4 |
| 8 | 90.4 | 89.9 | 89.9 |
| 9 | 89.6 | 89.0 | 89.6 |
| Macro Avg | 90.2 | 89.9 | 89.4 |
| Model Variant | Val Acc (%) | Val F1 (%) | Val Prec (%) | Aro Acc (%) | Aro F1 (%) | Aro Prec (%) | Like Acc (%) | Like F1 (%) | Like Prec (%) |
|---|---|---|---|---|---|---|---|---|---|
| w/o DTFE | 88.5 | 87.3 | 86.8 | 87.1 | 85.9 | 85.5 | 88.0 | 86.2 | 85.4 |
| w/o Cross-Modal Fusion | 89.4 | 88.2 | 87.9 | 88.3 | 87.0 | 86.4 | 89.1 | 87.1 | 86.8 |
| Ours (Full) | 92.8 | 92.6 | 92.2 | 91.8 | 93.7 | 93.0 | 93.6 | 92.4 | 91.9 |
Table 15. Ablation on fixed vs. learnable multi-task loss weights (all metrics in %).
| Loss Weight Type | Val Acc | Val F1 | Val Prec | Aro Acc | Aro F1 | Aro Prec | Like Acc | Like F1 | Like Prec |
|---|---|---|---|---|---|---|---|---|---|
| Fixed Weights | 90.3 | 89.7 | 88.9 | 89.1 | 89.4 | 88.7 | 89.5 | 88.8 | 88.0 |
| Learnable Weights (Ours) | 92.8 | 92.6 | 92.2 | 91.8 | 93.7 | 93.0 | 93.6 | 92.4 | 91.9 |
| Model | Modalities | Val Acc (%) | Aro Acc (%) | Like Acc (%) | Avg | Params (M) | FLOPs (G) | Time (min) |
|---|---|---|---|---|---|---|---|---|
| MLP-MT | EEG | 83.6 | 82.9 | 84.2 | 83.6 | 0.48 | 0.05 | 24 |
| MLP-MT | EEG + Peripheral | 85.9 | 85.3 | 86.7 | 86.0 | 0.56 | 0.07 | 28 |
| MLP-MT | EEG + Peripheral + EOG | 86.8 | 86.1 | 87.4 | 86.8 | 0.59 | 0.08 | 30 |
| Ours (Full) | EEG + Peripheral + EOG | 92.8 | 91.8 | 93.6 | 92.7 | 2.86 | 0.82 | 88 |
| Protocol | Val Acc (%) | Aro Acc (%) | Like Acc (%) |
|---|---|---|---|
| Overlapping (stride = 1 s) | 92.47 | 91.42 | 93.08 |
| Non-overlapping (stride = 10 s) | 91.27 | 90.12 | 91.98 |