Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features
Abstract
1. Introduction
Research Objectives
- Design a multimodal input architecture that extracts and integrates features from audio using Wav2Vec2, visual data represented by Skeleton Landmarks, and textual utterances encoded through BERT and Doc2Vec;
- Reflect inter-modality relationships and optimize recognition performance through a fusion structure based on the Self-Attention mechanism;
- Improve the model execution speed and enable real-time applicability through a lightweight computation structure based on parallel processing;
- Analyze class separability and the structure of latent representations using visualization based on t-SNE;
- Verify the effectiveness of the proposed model by comparing its performance and interpretability with existing RNN-based personality recognition approaches.
2. Related Work
2.1. Overview of Automatic Personality Recognition
2.2. Unimodal Approaches
2.3. Multimodal Personality Recognition
2.4. Limitations of Existing Fusion Methods
2.5. Attention-Based Research Trends
3. Proposed Method
3.1. Data Collection and Preprocessing
3.2. Feature Extraction from Each Modality
3.2.1. Text
3.2.2. Audio
3.2.3. Video
3.2.4. Feature Normalization and Encoding for Fusion
3.3. Attention-Based Encoding for Each Modality
- Text: Sentence-level embeddings are extracted using Doc2Vec and BERT in parallel. These embeddings are concatenated and passed through Self-Attention to model inter-sentence dependencies. Mean pooling is applied across sentences to obtain the final text vector.
- Audio: Frame-wise representations are obtained using the Wav2Vec2 model. The resulting sequence is processed with multi-head Self-Attention, and a learned linear pooling layer is used to emphasize salient time segments.
- Visual: Skeleton-based joint trajectory data is processed frame-by-frame into joint vectors, which are then passed through a Self-Attention module to model motion dynamics. Mean pooling is applied to obtain the final visual representation.
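The shared encode-then-pool pattern in the three bullets above can be illustrated with a minimal single-head self-attention encoder in NumPy. All names, weight initializations, and dimensions here are illustrative only (the actual model uses multi-head attention over features from pretrained extractors such as Wav2Vec2 and BERT):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_encode(X, Wq, Wk, Wv):
    """Encode an (n, d) feature sequence with single-head self-attention,
    then mean-pool over the sequence to a single (d,) vector."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n, n) attention map
    H = weights @ V                                    # contextualized sequence
    return H.mean(axis=0)                              # mean pooling over time

# Toy run mirroring the text modality: 50 sentence embeddings of dimension 120.
rng = np.random.default_rng(0)
n, d = 50, 120
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
vec = self_attention_encode(X, Wq, Wk, Wv)
print(vec.shape)  # (120,)
```

The audio branch differs only in the last step: instead of a uniform mean, a learned linear pooling computes a trainable weighted sum over time steps to emphasize salient segments.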
3.4. Personality Classification Using Fully Connected Layers
- Bottom 33.3%: Low;
- Middle 33.3%: Medium;
- Top 33.3%: High.
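The three-way labeling above amounts to a tertile split of the continuous trait scores. A minimal sketch, assuming per-trait scores are collected in an array `scores` (the data and variable names are illustrative):

```python
import numpy as np

def tertile_labels(scores):
    """Map continuous trait scores to Low/Medium/High by 33.3% tertiles."""
    lo, hi = np.percentile(scores, [100 / 3, 200 / 3])
    return np.where(scores <= lo, "Low",
                    np.where(scores <= hi, "Medium", "High"))

# Nine evenly spaced scores split cleanly into three classes of three.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
labels = tertile_labels(scores)
print(labels)
```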
3.5. Parallel Learning Structure and Real-Time Feasibility
4. Experiment Setup
4.1. Data Splitting and Preprocessing
4.2. Evaluation Metrics
4.3. Experimental Design for Performance Validation
4.3.1. Comparison with Related Work
4.3.2. Unimodal Performance Comparison
4.3.3. Multimodal Fusion Performance Comparison
4.4. Robustness Evaluation Based on Input Length
4.4.1. Performance Variation by Number of Sentences
4.4.2. Unimodal vs. Multimodal Comparison Under Sentence Constraints
4.5. Latent Space Clustering Based on t-SNE Visualization
4.6. Training Environment and Parameter Settings
5. Results and Analysis
5.1. Performance Comparison with Existing Studies
5.2. Performance Comparison Based on Unimodal Input
5.3. Multimodal Performance Comparison
5.4. Analysis of Performance Variation According to the Number of Input Sentences
5.5. Interpretation of Personality Features Using t-SNE Visualization
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Harris, K.; Vazire, S. On Friendship Development and the Big Five Personality Traits. Soc. Personal. Psychol. Compass 2016, 10, 647–667. [Google Scholar] [CrossRef]
- Mund, M.; Finn, C.; Hagemeyer, B.; Neyer, F.J. Understanding Dynamic Transactions Between Personality Traits and Partner Relationships. Curr. Dir. Psychol. Sci. 2016, 25, 411–416. [Google Scholar] [CrossRef]
- Bui, H.T. Big Five Personality Traits and Job Satisfaction: Evidence from a National Sample. J. Gen. Manag. 2017, 42, 21–30. [Google Scholar] [CrossRef]
- Vinciarelli, A.; Mohammadi, G. A Survey of Personality Computing. IEEE Trans. Affect. Comput. 2014, 5, 273–291. [Google Scholar] [CrossRef]
- Digman, J.M. Personality Structure: Emergence of the Five-Factor Model. Annu. Rev. Psychol. 1990, 41, 417–440. [Google Scholar] [CrossRef]
- Hogan, R.; Johnson, J.; Briggs, S. Handbook of Personality Psychology; Academic Press: Cambridge, MA, USA, 1997. [Google Scholar]
- Allport, G.W. Pattern and Growth in Personality; Springer: Berlin/Heidelberg, Germany, 1961. [Google Scholar]
- Han, S.; Huang, H.; Tang, Y. Knowledge of Words: An Interpretable Approach for Personality Recognition from Social Media. Knowl.-Based Syst. 2020, 194, 105550. [Google Scholar] [CrossRef]
- Poria, S.; Gelbukh, A.; Agarwal, B.; Cambria, E.; Howard, N. Common Sense Knowledge Based Personality Recognition from Text. In Advances in Soft Computing and Its Applications; Hutchison, D., Kanade, T., Kittler, J., Kleinberg, J.M., Mattern, F., Mitchell, J.C., Naor, M., Nierstrasz, O., Pandu Rangan, C., Steffen, B., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8266, pp. 484–496. [Google Scholar] [CrossRef]
- Carducci, G.; Rizzo, G.; Monti, D.; Palumbo, E.; Morisio, M. Twitpersonality: Computing Personality Traits from Tweets Using Word Embeddings and Supervised Learning. Information 2018, 9, 127. [Google Scholar] [CrossRef]
- KN, P.K.; Gavrilova, M.L. Latent Personality Traits Assessment from Social Network Activity Using Contextual Language Embedding. IEEE Trans. Comput. Soc. Syst. 2021, 9, 638–649. [Google Scholar] [CrossRef]
- Tadesse, M.M.; Lin, H.; Xu, B.; Yang, L. Personality Predictions Based on User Behavior on the Facebook Social Media Platform. IEEE Access 2018, 6, 61959–61969. [Google Scholar] [CrossRef]
- Bindroo, R.; Sujit, S.D.; Seshadri, A.; Sathyanarayan, M. Psychometric Precision: ML-Driven Learning Strategies Informed on Big Five Traits. In Proceedings of the 2024 2nd International Conference on Networking, Embedded and Wireless Systems (ICNEWS), Bangalore, India, 22–23 August 2024; pp. 1–7. [Google Scholar] [CrossRef]
- Karpagam, G.; VM, H.V.; Kabilan, K.; Pranav, P.; Ramesh, P.; B, S.S. Multimodal Fusion for Precision Personality Trait Analysis: A Comprehensive Model Integrating Video, Audio, and Text Inputs. In Proceedings of the 2024 International Conference on Smart Systems for Electrical, Electronics, Communication and Computer Engineering (ICSSEECC), Coimbatore, India, 28–29 June 2024; pp. 327–332. [Google Scholar]
- Ma, Z.; Ma, F.; Sun, B.; Li, S. Hybrid Multimodal Fusion for Dimensional Emotion Recognition. In Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge, Virtual, 24 October 2021; pp. 29–36. [Google Scholar]
- Ma, F.; He, Y.; Sun, B.; Li, S. Multimodal Prompt Alignment for Facial Expression Recognition. arXiv 2025, arXiv:2506.21017. [Google Scholar] [CrossRef]
- Kosan, M.A.; Karacan, H.; Urgen, B.A. Predicting personality traits with semantic structures and LSTM-based neural networks. Alex. Eng. J. 2022, 61, 8007–8025. [Google Scholar] [CrossRef]
- Jaysundara, A.; De Silva, D.; Kumarawadu, P. Personality prediction of social network users using LSTM based sentiment analysis. In Proceedings of the 2022 International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN), Villupuram, India, 25–26 March 2022; pp. 1–6. [Google Scholar]
- Cantador, I.; Fernández-Tobías, I.; Bellogín, A.; Kosinski, M.; Stillwell, D. Relating Personality Types with User Preferences in Multiple Entertainment Domains. In Proceedings of the UMAP Workshops, Rome, Italy, 10–14 June 2013; Volume 997. [Google Scholar]
- Strickhouser, J.E.; Zell, E.; Krizan, Z. Does Personality Predict Health and Well-Being? A Metasynthesis. Health Psychol. 2017, 36, 797. [Google Scholar] [CrossRef]
- Widiger, T.A.; Costa Jr, P.T. Personality and Personality Disorders. J. Abnorm. Psychol. 1994, 103, 78. [Google Scholar] [CrossRef]
- Suen, H.Y.; Hung, K.E.; Lin, C.L. TensorFlow-based Automatic Personality Recognition Used in Asynchronous Video Interviews. IEEE Access 2019, 7, 61018–61023. [Google Scholar] [CrossRef]
- Song, S.; Jaiswal, S.; Sanchez, E.; Tzimiropoulos, G.; Shen, L.; Valstar, M. Self-Supervised Learning of Person-Specific Facial Dynamics for Automatic Personality Recognition. IEEE Trans. Affect. Comput. 2021, 14, 178–195. [Google Scholar] [CrossRef]
- Mehta, Y.; Majumder, N.; Gelbukh, A.; Cambria, E. Recent Trends in Deep Learning Based Personality Detection. Artif. Intell. Rev. 2020, 53, 2313–2339. [Google Scholar] [CrossRef]
- Ahmad, H.; Asghar, M.Z.; Khan, A.S.; Habib, A. A Systematic Literature Review of Personality Trait Classification from Textual Content. Open Comput. Sci. 2020, 10, 175–193. [Google Scholar] [CrossRef]
- Zumma, M.T.; Munia, J.A.; Halder, D.; Rahman, M.S. Personality Prediction from Twitter Dataset Using Machine Learning. In Proceedings of the 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Virtual, 3–5 October 2022; pp. 1–5. [Google Scholar]
- Jagannath, D.J.; Sreelakshmi, T.; George, J.; Achsah, M. Forecasting Traits: Human Personality Prediction with Machine Learning Methodology: A Comparative Study. In Proceedings of the 2nd International Conference on Computer Vision and Internet of Things (ICCVIoT 2024), Coimbatore, India, 10–11 December 2024; Volume 2024, pp. 93–99. [Google Scholar]
- Pennebaker, J.W.; King, L.A. Linguistic Styles: Language Use as an Individual Difference. J. Personal. Soc. Psychol. 1999, 77, 1296. [Google Scholar] [CrossRef]
- Asghar, J.; Akbar, S.; Asghar, M.Z.; Ahmad, B.; Al-Rakhami, M.S.; Gumaei, A. Detection and Classification of Psychopathic Personality Trait from Social Media Text Using Deep Learning Model. Comput. Math. Methods Med. 2021, 2021, 1–10. [Google Scholar] [CrossRef]
- Christian, H.; Suhartono, D.; Chowanda, A.; Zamli, K.Z. Text Based Personality Prediction from Multiple Social Media Data Sources Using Pre-Trained Language Model and Model Averaging. J. Big Data 2021, 8, 68. [Google Scholar] [CrossRef]
- Wang, Y.; Zheng, J.; Li, Q.; Wang, C.; Zhang, H.; Gong, J. Xlnet-Caps: Personality Classification from Textual Posts. Electronics 2021, 10, 1360. [Google Scholar] [CrossRef]
- Leonardi, S.; Monti, D.; Rizzo, G.; Morisio, M. Multilingual Transformer-Based Personality Traits Estimation. Information 2020, 11, 179. [Google Scholar] [CrossRef]
- Waqas, M.; Zhang, F.; Laghari, A.A.; Almadhor, A.; Petrinec, F.; Iqbal, A.; Khalil, M.M.Y. TraitBertGCN: Personality Trait Prediction Using BertGCN with Data Fusion Technique. Int. J. Comput. Intell. Syst. 2025, 18, 64. [Google Scholar] [CrossRef]
- Mohammadi, G.; Vinciarelli, A. Automatic Personality Perception: Prediction of Trait Attribution Based on Prosodic Features. IEEE Trans. Affect. Comput. 2012, 3, 273–284. [Google Scholar] [CrossRef]
- Yang, L.; Li, S.; Luo, X.; Xu, B.; Geng, Y.; Zeng, Z.; Zhang, F.; Lin, H. Computational Personality: A Survey. Soft Comput. 2022, 26, 9587–9605. [Google Scholar] [CrossRef]
- Tsani, E.F.; Suhartono, D. Personality Identification from Social Media Using Ensemble BERT and RoBERTa. Informatica 2023, 47, 537–544. [Google Scholar] [CrossRef]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
- Yan, L.; Li, K.; Gao, R.; Wang, C.; Xiong, N. An intelligent weighted object detector for feature extraction to enrich global image information. Appl. Sci. 2022, 12, 7825. [Google Scholar] [CrossRef]
- Lin, C.B.; Dong, Z.; Kuan, W.K.; Huang, Y.F. A Framework for Fall Detection Based on OpenPose Skeleton and LSTM/GRU Models. Appl. Sci. 2021, 11, 329. [Google Scholar] [CrossRef]
- Nguyen, H.C.; Nguyen, T.H.; Scherer, R.; Le, V.H. Deep Learning for Human Activity Recognition on 3D Human Skeleton: Survey and Comparative Study. Sensors 2023, 23, 5121. [Google Scholar] [CrossRef]
- Liu, J.; Akhtar, N.; Mian, A. Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 10–19. [Google Scholar]
- Zhao, X.; Liao, Y.; Tang, Z.; Xu, Y.; Tao, X.; Wang, D.; Wang, G.; Lu, H. Integrating Audio and Visual Modalities for Multimodal Personality Trait Recognition via Hybrid Deep Learning. Front. Neurosci. 2023, 16, 1107284. [Google Scholar] [CrossRef] [PubMed]
- Lee, C.H.; Yang, H.C.; Su, X.Q.; Tang, Y.X. A Multimodal Affective Sensing Model for Constructing a Personality-Based Financial Advisor System. Appl. Sci. 2022, 12, 10066. [Google Scholar] [CrossRef]
- Giritlioğlu, D.; Mandira, B.; Yilmaz, S.F.; Ertenli, C.U.; Akgür, B.F.; Kınıklıoğlu, M.; Kurt, A.G.; Mutlu, E.; Gürel, Ş.C.; Dibeklioğlu, H. Multimodal analysis of personality traits on videos of self-presentation and induced behavior. J. Multimodal User Interfaces 2021, 15, 337–358. [Google Scholar] [CrossRef]
- Stern, J.; Schild, C.; Jones, B.C.; DeBruine, L.M.; Hahn, A.; Puts, D.A.; Zettler, I.; Kordsmeyer, T.L.; Feinberg, D.; Zamfir, D. Do Voices Carry Valid Information about a Speaker’s Personality? J. Res. Personal. 2021, 92, 104092. [Google Scholar] [CrossRef]
- Zhao, X.; Tang, Z.; Zhang, S. Deep Personality Trait Recognition: A Survey. Front. Psychol. 2022, 13, 839619. [Google Scholar] [CrossRef]
- Yan, L.; Ye, Y.; Wang, C.; Sun, Y. LocMix: Local saliency-based data augmentation for image classification. Signal Image Video Process. 2024, 18, 1383–1392. [Google Scholar] [CrossRef]
- Moorthy, S.; Moon, Y.K. Hybrid Multi-Attention Network for Audio–Visual Emotion Recognition Through Multimodal Feature Fusion. Mathematics 2025, 13, 1100. [Google Scholar] [CrossRef]
- Praveen, R.G.; de Melo, W.C.; Ullah, N.; Aslam, H.; Zeeshan, O.; Denorme, T.; Pedersoli, M.; Koerich, A.L.; Bacon, S.; Cardinal, P. A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2486–2495. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A Comparative Study on Transformer vs. RNN in Speech Applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 449–456. [Google Scholar] [CrossRef]
- Vásquez, R.L.; Ochoa-Luna, J. Transformer-based approaches for personality detection using the MBTI model. In Proceedings of the 2021 XLVII Latin American Computing Conference (CLEI), Cartago, Costa Rica, 25–29 October 2021; pp. 1–7. [Google Scholar]
- Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv 2016, arXiv:1609.08675v1. [Google Scholar]
- Li, B.B.; Huang, H.W. Cognition and beyond: Intersections of Personality Traits and Language. Psychol. Learn. Motiv. 2024, 80, 105–148. [Google Scholar]
- Łysiak, M. Inner Dialogical Communication and Pathological Personality Traits. Front. Psychol. 2019, 10, 1663. [Google Scholar] [CrossRef]
- Kim, S.Y.; Kim, J.M.; Yoo, J.A.; Bae, K.Y.; Kim, S.W.; Yang, S.J.; Shin, I.S.; Yoon, J.S. Standardization and validation of big five inventory-Korean version (BFI-K) in elders. Korean J. Biol. Psychiatry 2010, 17, 15–25. [Google Scholar]
- Wang, Q.; Liu, A.; Yan, K.; Hou, J.; Li, W. BigFive: A Chinese Textual Dataset Supporting Psychology Knowledge Graph Construction. In Proceedings of the 2023 IEEE International Conference on Knowledge Graph (ICKG), Shanghai, China, 1–2 December 2023; pp. 77–83. [Google Scholar]
- Cherukuru, R.K.; Kumar, A.; Srivastava, S.; Verma, V.K. Prediction of Personality Trait using Machine Learning on Online Texts. In Proceedings of the 2022 International Conference for Advancement in Technology (ICONAT), Goa, India, 21–22 January 2022; pp. 1–8. [Google Scholar]
- Bhin, H.; Lim, Y.; Choi, J. Multimodal Personality Prediction: A Real-Time Recognition System for Social Robots with Data Acquisition. In Proceedings of the 2024 21st International Conference on Ubiquitous Robots (UR), Manhattan, NY, USA, 24–27 June 2024; pp. 673–676. [Google Scholar] [CrossRef]

| Trait | High | Low |
|---|---|---|
| Openness | Imaginative, curious, open-minded | Conventional, resistant to change, narrow interests |
| Conscientiousness | Organized, responsible, goal-oriented | Careless, impulsive, disorganized |
| Extraversion | Sociable, energetic, enthusiastic | Reserved, quiet, solitary |
| Agreeableness | Kind, cooperative, compassionate | Suspicious, antagonistic, uncooperative |
| Neuroticism | Prone to anxiety, moodiness, emotional instability | Calm, emotionally stable, resilient |

| Personality Trait | Before Trimming | After Trimming |
|---|---|---|
| Openness | 0.78 | 0.86 |
| Conscientiousness | 0.72 | 0.89 |
| Extraversion | 0.74 | 0.88 |
| Agreeableness | 0.69 | 0.87 |
| Neuroticism | 0.82 | 0.90 |
| Average | 0.75 | 0.88 |

| Modality | Seq. Length (n) | Feature Dimension (d) | Heads (h) | Pooling Type |
|---|---|---|---|---|
| Text | 50 | 120 | 4 | Mean pooling |
| Audio | 45 | 120 | 6 | Linear pooling |
| Visual | 40 | 128 | 4 | Mean pooling |

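Given the pooled per-modality dimensions in the table above (text and audio 120-d, visual 128-d), the concatenation fusion step ahead of the fully connected classifier reduces to a simple join. A sketch with random placeholder vectors standing in for the encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
# Pooled outputs of the three attention encoders (dimensions from the table).
text_vec = rng.standard_normal(120)
audio_vec = rng.standard_normal(120)
visual_vec = rng.standard_normal(128)

# Concatenation fusion: one joint vector feeds the fully connected classifier.
fused = np.concatenate([text_vec, audio_vec, visual_vec])
print(fused.shape)  # (368,)
```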
| Category | Features | Learning Sequence | Learning Duration | Test Duration | FPS | FPS@15 Snapshot |
|---|---|---|---|---|---|---|
| Audio, Video, Text | Each Feature | Each Modality | 404 min | 43 min | 20 | 3.2 |
| Multimodal | Concatenation Fusion | Multi-Process | 373 min | 25 min | 23.2 | 10.52 |
| Multimodal | Concatenation Fusion | Threading Process | 260 min | 15 min | 30 | 16.92 |

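The threading configuration compared in the table above can be sketched with a thread pool that dispatches the three modality pipelines concurrently. The encoder functions below are stubs, not the actual Wav2Vec2/BERT/skeleton pipelines; one plausible reason threading outperforms multiprocessing in this setting is that heavy model inference releases the GIL while avoiding inter-process data copies.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub per-modality encoders standing in for the real feature pipelines.
def encode_text(x):
    return ("text", len(x))

def encode_audio(x):
    return ("audio", len(x))

def encode_visual(x):
    return ("visual", len(x))

def encode_parallel(text, audio, visual):
    # Dispatch the three extractors concurrently and collect the results
    # in submission order, so downstream fusion sees a fixed layout.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(encode_text, text),
                   pool.submit(encode_audio, audio),
                   pool.submit(encode_visual, visual)]
        return [f.result() for f in futures]

out = encode_parallel("hello", [0.1] * 45, [[0, 0]] * 40)
print(out)  # [('text', 5), ('audio', 45), ('visual', 40)]
```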
| Features | Year | Models | Performance |
|---|---|---|---|
| Text [27] | 2024 | SVM | F1-score: 0.74 |
| Text [26] | 2022 | Naïve Bayes, RF | Accuracy: 0.75 |
| Text [57] | 2023 | Correlation Analysis | F1-score: 0.66 |
| Audio, Text [58] | 2024 | Multiple Classifiers, RNN | F1-score: 0.82 |
| Audio, Video, Text [13] | 2024 | BERT, RNN | Accuracy: 0.80 |
| Audio, Video [14] | 2024 | XGBoost, RNN | F1-score: 0.76 |
| Audio, Video, Text | 2025 | Proposed Model | F1-score: 0.868 |

| Features | Year | Models | Performance (Paper) | Performance (Test) |
|---|---|---|---|---|
| Text [27] | 2024 | SVM | F1-score: 0.74 | Accuracy: 0.573 F1-score: 0.435 |
| Text [26] | 2022 | Naïve Bayes, RF | Accuracy: 0.75 | Accuracy: 0.685 F1-score: 0.652 |
| Text | 2025 | Proposed Model | – | F1-score: 0.821 |

| Features | Year | Models | Performance (Paper) | Performance (Test) |
|---|---|---|---|---|
| Audio, Video, Text [58] | 2024 | BERT, RNN | Accuracy: 0.80 | Accuracy: 0.840 F1-score: 0.791 |
| Audio, Video, Text [59] | 2024 | BiRNN | F1-score: 0.736 | Accuracy: 0.826 F1-score: 0.736 |
| Audio, Video, Text | 2025 | Proposed Model | – | Accuracy: 0.903 F1-score: 0.868 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Bhin, H.; Choi, J. Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features. Electronics 2025, 14, 2837. https://doi.org/10.3390/electronics14142837