Article

Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features

by Hyeonuk Bhin 1,2 and Jongsuk Choi 1,2,*
1 Department of AI-Robot, Korea National University of Science and Technology (UST), Daejeon 34113, Republic of Korea
2 Center for Humanoid, Korea Institute of Science and Technology (KIST), Seoul 02792, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2837; https://doi.org/10.3390/electronics14142837
Submission received: 16 June 2025 / Revised: 14 July 2025 / Accepted: 14 July 2025 / Published: 15 July 2025
(This article belongs to the Special Issue Explainable Machine Learning and Data Mining)

Abstract

Personality is a fundamental psychological trait that exerts a long-term influence on human behavior patterns and social interactions. Automatic personality recognition (APR) has exhibited increasing importance across various domains, including Human–Robot Interaction (HRI), personalized services, and psychological assessments. In this study, we propose a multimodal personality recognition model that classifies the Big Five personality traits by extracting features from three heterogeneous sources: audio processed using Wav2Vec2, video represented as Skeleton Landmark time series, and text encoded through Bidirectional Encoder Representations from Transformers (BERT) and Doc2Vec embeddings. Each modality is handled through an independent Self-Attention block that highlights salient temporal information, and these representations are then summarized and integrated using a late fusion approach to effectively reflect both the inter-modal complementarity and cross-modal interactions. Compared to traditional recurrent neural network (RNN)-based multimodal models and unimodal classifiers, the proposed model achieves an improvement of up to 12 percent in the F1-score. It also maintains a high prediction accuracy and robustness under limited input conditions. Furthermore, a visualization based on t-distributed Stochastic Neighbor Embedding (t-SNE) demonstrates clear distributional separation across the personality classes, enhancing the interpretability of the model and providing insights into the structural characteristics of its latent representations. To support real-time deployment, a lightweight thread-based processing architecture is implemented, ensuring computational efficiency. By leveraging deep learning-based feature extraction and the Self-Attention mechanism, we present a novel personality recognition framework that balances performance with interpretability. The proposed approach establishes a strong foundation for practical applications in HRI, counseling, education, and other interactive systems that require personalized adaptation.

1. Introduction

Human personality is defined as a long-term and stable trait that deeply influences individual behavior, decision-making, and social interaction. In particular, personality recognition has emerged as a core technology for understanding user characteristics and generating personalized responses in a wide range of application domains, including HRI, personalized services, education, counseling, and healthcare [1,2,3]. For these reasons, research on personality recognition has been continuously expanding, with growing interest in APR systems [4].
Among various personality theories, the Big Five personality model has become the most widely accepted theoretical framework due to its universality and reliability [5]. This model systematically categorizes human personality into five dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The reliability of the model has been empirically validated in various linguistic and cultural contexts [6,7]. Table 1 presents the characteristic traits associated with each of the five personality dimensions.
Previous research in APR has primarily explored unimodal approaches, utilizing textual utterances, audio signals, or video data to predict personality traits. Among these, text-based methods have been actively studied in conjunction with advances in natural language processing. However, relying solely on linguistic content does not adequately reflect nonverbal behaviors or emotional cues. This limitation often leads to a reduced generalization performance and results that are highly sensitive to individual differences in expression styles [8,9,10,11,12].
To overcome these drawbacks, recent studies have focused on multimodal approaches that integrate audio, visual, and textual information [13,14]. These methods aim to improve the prediction consistency and reliability by capturing complementary characteristics across different modalities. Despite recent advances, effective multimodal personality recognition systems still face significant technical challenges. Integrating audio, video, and textual modalities often leads to high computational complexity, as processing these heterogeneous and high-dimensional features simultaneously demands considerable resources. This computational burden poses a significant challenge for real-time deployment, particularly in environments with constrained hardware capabilities, such as mobile robots or embedded systems.
Similar technical strategies have been adopted in recent affective computing challenges. For instance, a top-performing model combined Long Short-Term Memory (LSTM) and Self-Attention mechanisms along with late fusion to predict emotional valence, arousal, and psycho-physiological states from multimodal inputs including audio, video, and biosignals [15]. Their results demonstrated that temporal modeling with dynamic fusion can significantly improve prediction robustness, even in stress-inducing environments such as the Trier Social Stress Test.
Feature concatenation remains a commonly used fusion strategy due to its simplicity and ease of application across diverse modalities. However, this approach has inherent limitations in capturing the complex, non-linear relationships between modalities. Moreover, when certain modalities are missing or corrupted by noise, the prediction performance can degrade significantly. These limitations present a critical barrier to the practicality and reliability of multimodal recognition systems.
In addition to these issues, the interpretability of model outputs remains a persistent concern. Deep learning models typically function as black boxes, providing limited transparency into how different input features or modalities influence the final predictions. This opacity can undermine user trust and hinder the integration of such models into critical applications where accountability and explainability are essential, such as education, counseling, and healthcare.
Recent studies in related fields have explored strategies such as cross-modal alignment and prompt learning to enhance the interpretability and granularity of multimodal models. For example, in the domain of facial expression recognition, some approaches have utilized Large Language Models (LLMs) to generate fine-grained textual descriptions for each expression and align them with visual features. By minimizing the discrepancy between textual and visual representations, these methods improve semantic alignment and provide more interpretable outputs [16]. Such strategies offer promising directions for improving the explainability and precision of future personality recognition systems.
In deep learning-based models, RNN architectures have been widely used to process temporal dependencies [17,18]. However, these structures are known to struggle with long-term dependency, limited parallelism, and computational inefficiencies, which hinder their application in real-time systems. Such limitations are especially problematic in HRI environments, where timely responses are essential and performance degradation can directly impact usability.
To address these issues, this study proposes a multimodal personality recognition model based on the Self-Attention mechanism. The model builds on a conventional concatenation-based fusion architecture, where Self-Attention is applied independently within each modality to capture salient temporal patterns. This enables the model to selectively focus on informative segments in each modality, improving the modeling of long-term dependencies while preserving modularity. The architecture also supports parallel processing, making it well-suited for real-time applications. Furthermore, to enhance model transparency and interpretability, we employ t-SNE to visualize the modality-specific representations learned by the model. This approach achieves both high recognition performance and real-time feasibility, contributing to robust personality recognition in interactive systems.

Research Objectives

The primary goal of this study is to develop a high-performance personality recognition model that integrates audio, visual, and textual modalities while ensuring both real-time operability and interpretability. To this end, the following specific research objectives are defined:
  • Design a multimodal input architecture that extracts and integrates features from audio using Wav2Vec2, visual data represented by Skeleton Landmarks, and textual utterances encoded through BERT and Doc2Vec;
  • Reflect inter-modality relationships and optimize recognition performance through a fusion structure based on the Self-Attention mechanism;
  • Improve the model execution speed and enable real-time applicability through a lightweight computation structure based on parallel processing;
  • Analyze class separability and the structure of latent representations using visualization based on t-SNE;
  • Verify the effectiveness of the proposed model by comparing its performance and interpretability with existing RNN-based personality recognition approaches.

2. Related Work

Human personality is a psychological trait that significantly influences individual behavior patterns, decision-making, and interpersonal relationships. As such, it has been widely studied and applied in various fields [19,20,21]. Traditionally, personality assessment has relied on self-reported questionnaires and expert evaluations. However, these methods are often time-consuming, costly, and subject to response bias. To address these limitations and enable more efficient and objective personality evaluation, APR technologies have been introduced [22,23].

2.1. Overview of Automatic Personality Recognition

APR refers to the technology that automatically estimates personality traits based on various behavioral data, including audio, visual, and textual information. These systems extract personality cues from spontaneous user behaviors, thereby reducing the response burden and enabling real-time personality estimation, long-term monitoring, and personalized services [24]. In particular, APR has demonstrated promising potential in application areas such as HRI, personalized education systems, psychological counseling, healthcare monitoring, and marketing [25]. The development of APR has been accelerated by advancements in sensor technologies, deep learning-based feature extraction techniques, and the growing availability of large-scale datasets. Following this trend, the present study proposes a new approach that integrates multimodal data using advanced deep learning techniques for effective personality estimation.

2.2. Unimodal Approaches

Early studies on APR have primarily focused on unimodal approaches that estimate personality traits from one type of data [26,27]. While these approaches allow for an in-depth analysis of individual modality characteristics, they are inherently limited in terms of generalizability due to their dependence on restricted information sources.
Text-based personality recognition is among the most actively studied areas. Researchers have attempted to predict personality using natural language data collected from social media, emails, essays, and interview transcripts [28,29]. Early methods relied on frequency-based features such as Term Frequency-Inverse Document Frequency (TF-IDF), Linguistic Inquiry and Word Count (LIWC), and bag-of-words. More recently, deep learning-based embedding techniques such as Doc2Vec, Word2Vec, and BERT have significantly improved the prediction performance [30,31]. Widely used datasets include myPersonality and PAN-2015 Author Profiling [30,32,33]. Despite its accessibility and scalability, text-based analysis lacks the ability to reflect nonverbal information.
Audio-based personality recognition aims to infer personality traits from acoustic features present in speech. Common features include the speaking rate, pitch, intensity, and voice activity duration, which have been reported to correlate with the Big Five dimensions to some extent [34,35]. Recently, pretrained speech models such as Wav2Vec2 and BERT have been adopted for more refined feature extraction [36]. Nevertheless, this approach remains sensitive to the recording conditions, utterance length, and speakers’ emotional states.
Visual-based personality recognition focuses on extracting personality-relevant cues from nonverbal behaviors such as facial expressions, gaze, gestures, posture, and body movements. Notably, Skeleton Landmark-based time-series data and pose extraction tools such as OpenPose [37] and advanced object detection models based on Convolutional Neural Networks (CNNs) [38] have been used to quantify movement patterns for temporal analysis [39,40,41]. While this approach effectively captures nonverbal traits, it can be affected by environmental factors such as lighting, video resolution, and limited behavioral diversity among participants.
Although these unimodal approaches have shown unique advantages and technological advancements, relying on only one modality presents structural limitations in capturing the complexity of personality. To address this, recent studies have actively explored personality recognition based on multimodal data integration [42,43].

2.3. Multimodal Personality Recognition

While unimodal approaches are effective in capturing specific aspects of behavioral and expressive information, there is a growing consensus that accurate predictions of personality, an inherently complex and high-dimensional psychological trait, require the integration of multiple sources. The greatest advantage of the multimodal approach lies in its ability to combine complementary information across modalities, thereby compensating for the limitations of each and enhancing the prediction reliability. For example, the linguistic content in utterances reflects cognitive tendencies [44], acoustic features capture emotional attributes such as emotional stability and extraversion, and visual signals are useful for assessing social expressiveness and tension levels [45]. Integrating such heterogeneous features has been shown to significantly improve performance compared to unimodal approaches, as evidenced by numerous prior studies [46]. Moreover, recent works have also explored data augmentation strategies to further enhance the robustness of multimodal models [47].

2.4. Limitations of Existing Fusion Methods

Although multimodal personality recognition has shown clear advantages in enhancing prediction performance by integrating features from diverse modalities, several structural limitations remain in practical implementations. In particular, the RNN-based fusion models that have been widely adopted in previous studies present the following challenges.
First, although RNN architectures are well suited for modeling temporal dependencies, they often suffer from the long-term dependency problem. As the input sequences become longer, earlier information tends to vanish, which is problematic given that personality traits often emerge from global patterns of speech or behavior. This structural limitation can lead to performance degradation.
Second, the relative importance of each modality may vary depending on context and individual differences. However, many existing fusion methods treat all inputs with equal weight or use simple averaging or weighted summation schemes. Such approaches make it difficult to assign dynamic weights based on the salience of information, potentially failing to capture critical cues.
Third, most existing fusion models operate as black boxes, offering limited interpretability of the prediction outcomes. In real-world applications such as robotics, education, and psychological assessment, the quantitative interpretation and visual explanation of the results are crucial. The lack of interpretability thus presents a major obstacle to practical deployment.
Fourth, the increased computational complexity resulting from multimodal integration cannot be overlooked. In environments where real-time interaction is essential, not only the predictive accuracy but also the computational efficiency becomes a critical factor. Existing fusion architectures often fall short of meeting these dual demands.
These technical and practical challenges underscore the need for more flexible and interpretable fusion structures in multimodal personality recognition. In this context, Attention-based models have recently received considerable interest as a promising alternative.

2.5. Attention-Based Research Trends

To overcome the structural limitations of existing RNN-based multimodal personality recognition models, recent studies have increasingly focused on models utilizing the Attention mechanism [48,49]. Attention enables the model to selectively assign weights to important information within the input, allowing it to represent the relative importance of information in a learnable manner. This property is considered a promising approach to effectively addressing the issue of information imbalance that often arises when integrating complex multimodal data.
In particular, the introduction of Transformer-based architectures has significantly contributed to resolving the problem of long-term dependency. Unlike RNNs, Attention-based architectures can process entire input sequences in parallel while flexibly modeling relationships between different positions. These characteristics make them highly suitable for handling high-dimensional information that includes sequential patterns and interdependent cues, as is typical in personality-related data. Indeed, Transformer-based models have demonstrated a superior performance to traditional architectures in various domains such as Natural Language Processing (NLP), speech recognition, and video understanding [50,51].
In the field of APR, this trend has been partially reflected through models that utilize Transformers within unimodal or limited fusion settings [52]. For example, some studies have applied fine-tuned BERT models for text-based personality prediction, or implemented Self-Attention mechanisms on audio time-series data. However, most of these models remain restricted to unimodal settings or fail to dynamically capture relationships across modalities, thus limiting technological advancement in multimodal fusion contexts.
Moreover, Attention-based models offer not only performance benefits but also enhanced interpretability. By visualizing Attention weights, one can analyze which input cues the model focuses on during inference, providing a basis for increasing user trust in APR systems deployed in real-world applications.
Against this backdrop, the present study proposes a multimodal personality recognition model based on the Attention mechanism to address the structural limitations identified in previous research.

3. Proposed Method

This study proposes a multimodal personality recognition model that takes audio, video, and utterance as inputs and predicts personality traits through an Attention-based fusion architecture. The overall system consists of the following five stages.

3.1. Data Collection and Preprocessing

A total of 156 samples were selected based on the publicly available YouTube-8M dataset [53]. Since personality is often revealed through reflective experiences and internal thoughts [54,55], videos were filtered using keywords such as “vlog”, “experience”, “thoughts”, and “opinions” during data selection. To ensure the consistency of multimodal information, only videos with clearly visible faces and upper bodies were included, allowing for the synchronized acquisition of audio, video, and textual data.
For ground truth labeling, a third-party evaluation experiment was conducted using a custom web-based application, as illustrated in Figure 1. The interface featured a fixed video panel at the top of the screen and a Big Five personality questionnaire below, facilitating consistent evaluation by five independent annotators.
Each annotator rated the personality traits of video samples based on the Big Five Inventory-10-K (BFI-10-K) questionnaire. While Cronbach’s alpha was not computed, prior studies have demonstrated the BFI-10-K’s acceptable internal consistency [56].
To mitigate the effect of annotation outliers, the top and bottom 10 percent of scores were trimmed. Fleiss' kappa was used to assess inter-rater agreement, yielding an average of 0.75 on the raw ratings, which increased to 0.88 after trimming. Trait-wise results are summarized in Table 2, demonstrating high annotation reliability across all five traits.
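Fleiss' kappa can be computed directly from an item-by-category count matrix. The sketch below is a minimal numpy implementation; the function name and the toy rating matrix are illustrative and not taken from the paper's annotation pipeline.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) count matrix.

    ratings[i, j] = number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators n.
    """
    n = ratings.sum(axis=1)[0]          # annotators per item
    N = ratings.shape[0]                # number of items
    # Per-item observed agreement P_i
    P_i = (np.sum(ratings ** 2, axis=1) - n) / (n * (n - 1))
    # Chance agreement from marginal category proportions
    p_j = ratings.sum(axis=0) / (N * n)
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement: all 5 annotators pick the same category per item -> kappa = 1
perfect = np.array([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
print(round(fleiss_kappa(perfect), 2))  # 1.0
```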
Through this process, a total of 156 multimodal samples were constructed, each comprising approximately five minutes of audio, video, and utterance data, along with the corresponding Big Five personality scores as ground truth. The entire data processing workflow was automated using a backend server and database handler and was efficiently managed in a globally hosted environment linked to the experimental server (Figure 2).

3.2. Feature Extraction from Each Modality

In this study, features were extracted separately from the three modalities—audio, video, and text—using modality-specific methods and then were normalized to be compatible with the Attention-based architecture. Figure 3 illustrates the overall process of feature extraction and preprocessing.

3.2.1. Text

Utterances were embedded using both Doc2Vec and BERT in parallel to capture contextual information. For Doc2Vec, paragraph IDs were assigned at the sentence level rather than at the document level, allowing the narrative context of each sentence to be embedded. This resulted in a 300-dimensional vector per sentence. For BERT, each sentence was tokenized and a 768-dimensional vector was extracted using the output of the classification token. The resulting embeddings were averaged and normalized before being input into the model.
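The averaging-and-normalization step can be sketched as follows, assuming the per-sentence encoder outputs are already available; random arrays stand in for real BERT and Doc2Vec vectors, and the concatenation of the two summaries is an illustrative layout, not the paper's exact tensor format.

```python
import numpy as np

def summarize_embeddings(sentence_vecs: np.ndarray) -> np.ndarray:
    """Average per-sentence embeddings and L2-normalize the result."""
    mean_vec = sentence_vecs.mean(axis=0)
    norm = np.linalg.norm(mean_vec)
    return mean_vec / norm if norm > 0 else mean_vec

# Placeholder stand-ins for real encoder outputs (dimensions from the paper):
rng = np.random.default_rng(0)
bert_vecs = rng.normal(size=(12, 768))     # 12 sentences x 768-d BERT vectors
doc2vec_vecs = rng.normal(size=(12, 300))  # 12 sentences x 300-d Doc2Vec vectors

text_feature = np.concatenate([summarize_embeddings(bert_vecs),
                               summarize_embeddings(doc2vec_vecs)])
print(text_feature.shape)  # (1068,)
```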

3.2.2. Audio

Raw audio was converted into a high-dimensional frame-level vector sequence using Wav2Vec2, generating a 768-dimensional vector for each frame. Since personality-related cues primarily appear during speech segments, Voice Activity Detection (VAD) was applied to isolate those segments before extracting features. The resulting features were then converted into a fixed-length representation through average pooling.
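The VAD-gated average pooling can be sketched as below; the frame matrix and boolean speech mask are placeholders for real Wav2Vec2 and VAD outputs, and the zero-vector fallback for silent clips is our assumption.

```python
import numpy as np

def pool_speech_frames(frames: np.ndarray, vad_mask: np.ndarray) -> np.ndarray:
    """Average-pool only the frames flagged as speech by VAD.

    frames:   (T, 768) Wav2Vec2-style frame vectors
    vad_mask: (T,) boolean, True where speech was detected
    """
    speech = frames[vad_mask]
    if speech.size == 0:                 # no speech detected: fall back to zeros
        return np.zeros(frames.shape[1])
    return speech.mean(axis=0)

rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 768))
vad_mask = np.zeros(100, dtype=bool)
vad_mask[20:60] = True                   # pretend frames 20..59 are speech
audio_feature = pool_speech_frames(frames, vad_mask)
print(audio_feature.shape)  # (768,)
```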

3.2.3. Video

Visual data were represented as time-series Skeleton Landmarks extracted using OpenPose. A total of 18 upper-body and facial joints (13 upper-body, 5 facial) were used, yielding 49-dimensional vectors for each frame, including x/y coordinates and confidence values. Only frames corresponding to the detected speech segments were used, and the time-series data were passed through a Gated Recurrent Unit (GRU)-based encoder to obtain high-dimensional representations.

3.2.4. Feature Normalization and Encoding for Fusion

All three modality outputs were transformed into normalized 256 × 256 input matrices for the subsequent Attention-based fusion process. The position and modality of each feature were explicitly specified in a metadata mask. The remaining entries of the fixed-size input were padded with null values, which were excluded from computation during processing.
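The fixed-size padding with a validity mask can be sketched as follows; the feature block shape is illustrative, and the mask here simply records which entries are real versus padding (the paper's metadata mask additionally records the modality of each entry).

```python
import numpy as np

def pad_to_fixed(features: np.ndarray, size: int = 256):
    """Place a (rows, cols) feature block into a size x size zero canvas.

    Returns the padded matrix and a boolean mask marking valid entries;
    padded positions stay zero and can be skipped during computation.
    """
    canvas = np.zeros((size, size))
    mask = np.zeros((size, size), dtype=bool)
    r, c = features.shape
    canvas[:r, :c] = features
    mask[:r, :c] = True
    return canvas, mask

feat = np.ones((40, 120))                 # illustrative modality feature block
padded, mask = pad_to_fixed(feat)
print(padded.shape, int(mask.sum()))  # (256, 256) 4800
```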

3.3. Attention-Based Encoding for Each Modality

To extract modality-specific contextual representations while preserving temporal and semantic structure, we applied Self-Attention encoding followed by pooling to each modality stream—text, audio, and visual. This unified attention-based framework allows for parallel yet independent processing of heterogeneous input types, resulting in a set of fixed-length summary vectors suitable for late fusion.
Let $X_m \in \mathbb{R}^{n \times d}$ denote the input sequence for modality $m$, where $n$ is the number of time steps (e.g., sentences, audio frames, or pose frames) and $d$ is the embedding dimension. Each input is projected into query, key, and value spaces via learnable linear layers:
$$Q = X_m W^Q, \quad K = X_m W^K, \quad V = X_m W^V, \qquad W^Q, W^K, W^V \in \mathbb{R}^{d \times d_h}.$$
We apply scaled dot-product attention with $h$ heads to capture contextual dependencies:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_h}}\right) V, \qquad \mathrm{MultiHead}(X_m) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O.$$
The attention outputs are summarized into fixed-size vectors through either mean pooling or learned linear pooling, depending on the modality:
$$f_m = \mathrm{Pooling}(\mathrm{MultiHead}(X_m)) \in \mathbb{R}^{d}.$$
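The attention equations above can be sketched in numpy as follows. The weights are random rather than learned, so this illustrates only the shapes and flow of the computation, not the trained model; mean pooling stands in for the modality-dependent pooling choice.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, heads, d_h, rng):
    """Scaled dot-product Self-Attention with several heads (random weights)."""
    d = X.shape[1]
    outs = []
    for _ in range(heads):
        W_q, W_k, W_v = (rng.normal(scale=d**-0.5, size=(d, d_h)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(d_h))          # (n, n) attention weights
        outs.append(A @ V)                           # (n, d_h) per-head context
    W_o = rng.normal(scale=(heads * d_h) ** -0.5, size=(heads * d_h, d))
    return np.concatenate(outs, axis=1) @ W_o        # (n, d) output projection

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 64))                        # 10 time steps, 64-d features
H = multi_head_self_attention(X, heads=4, d_h=16, rng=rng)
f_m = H.mean(axis=0)                                 # mean pooling -> fixed-length summary
print(H.shape, f_m.shape)  # (10, 64) (64,)
```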
Each modality is handled as follows:
  • Text: Sentence-level embeddings are extracted using Doc2Vec and BERT in parallel. These embeddings are concatenated and passed through Self-Attention to model inter-sentence dependencies. Mean pooling is applied across sentences to obtain the final text vector.
  • Audio: Frame-wise representations are obtained using the Wav2Vec2 model. The resulting sequence is processed with multi-head Self-Attention, and a learned linear pooling layer is used to emphasize salient time segments.
  • Visual: Skeleton-based joint trajectory data is processed frame-by-frame into joint vectors, which are then passed through a Self-Attention module to model motion dynamics. Mean pooling is applied to obtain the final visual representation.
The output vector $f_m$ for each modality has a dimensionality of $d \in [4800, 5500]$, depending on the sequence length and attention head configuration. These vectors are concatenated and reshaped (with optional zero-padding) into a 128 × 128 matrix, which is then passed to the classification module. Table 3 summarizes the attention parameters for each modality.
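The concatenate-pad-reshape step can be sketched as below; the three summary-vector lengths are illustrative values within the stated range, and the capacity check is our addition.

```python
import numpy as np

def fuse_to_matrix(vectors, size=128):
    """Concatenate modality summary vectors and zero-pad into a size x size matrix."""
    flat = np.concatenate(vectors)
    total = size * size
    if flat.shape[0] > total:
        raise ValueError("fused vector exceeds the target matrix capacity")
    padded = np.zeros(total)
    padded[:flat.shape[0]] = flat
    return padded.reshape(size, size)

rng = np.random.default_rng(7)
# Illustrative per-modality summary lengths within the paper's [4800, 5500] range:
f_text, f_audio, f_vis = (rng.normal(size=n) for n in (5000, 4800, 5200))
fused = fuse_to_matrix([f_text, f_audio, f_vis])
print(fused.shape)  # (128, 128)
```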

3.4. Personality Classification Using Fully Connected Layers

After modality-specific attention encoding and pooling, the resulting summary vectors, each capturing salient temporal and semantic features, are integrated through a late fusion strategy. As illustrated in Figure 4, the three modality-specific vectors are concatenated and reshaped into a unified 128 × 128 matrix, with zero-padding applied as needed to ensure size consistency. This structured matrix serves as the input to the final classification module, allowing the model to learn inter-modal dependencies jointly.
The classification module consists of two fully connected (FC) layers. The fused matrix is first flattened into a one-dimensional vector $x \in \mathbb{R}^{16{,}384}$, which is then passed through the following operations:
$$h = \mathrm{ReLU}(W_1 x + b_1), \qquad W_1 \in \mathbb{R}^{d_{\mathrm{hidden}} \times 16{,}384}, \; b_1 \in \mathbb{R}^{d_{\mathrm{hidden}}}$$
$$z = W_2 h + b_2, \qquad W_2 \in \mathbb{R}^{3 \times d_{\mathrm{hidden}}}, \; b_2 \in \mathbb{R}^{3}$$
$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{3} \exp(z_j)} \quad \text{for } i \in \{1, 2, 3\}$$
Here, $p_i$ represents the predicted probability of each of the three personality levels: High, Medium, and Low. Positional encodings and optional mask metadata can be added to the fused representation before the first FC layer to enhance temporal coherence and handle missing data if necessary.
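The forward pass of this classification head can be sketched as follows; the weights are random rather than trained, and the hidden width of 512 is an illustrative choice since the paper leaves $d_{\mathrm{hidden}}$ unspecified.

```python
import numpy as np

def classify(x, W1, b1, W2, b2):
    """Two FC layers with ReLU, then softmax over {Low, Medium, High}."""
    h = np.maximum(0.0, W1 @ x + b1)       # hidden representation
    z = W2 @ h + b2                        # 3 class logits
    e = np.exp(z - z.max())                # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
d_in, d_hidden = 16_384, 512               # d_hidden is an illustrative choice
x = rng.normal(size=d_in)                  # flattened 128 x 128 fused matrix
W1, b1 = rng.normal(scale=d_in**-0.5, size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(scale=d_hidden**-0.5, size=(3, d_hidden)), np.zeros(3)
p = classify(x, W1, b1, W2, b2)
print(p.shape, round(p.sum(), 6))  # (3,) 1.0
```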
To convert continuous Big Five personality scores into categorical labels for classification, the score distribution was divided into three intervals based on equal percentiles. Each interval was then assigned a corresponding categorical label, enabling the formulation of the task as a multi-class classification problem.
  • Bottom 33.3%: Low;
  • Middle 33.3%: Medium;
  • Top 33.3%: High.
Splitting the scores into equal-sized groups based on their distribution helps ensure class balance and reduce overfitting.
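The percentile-based split into three labels can be sketched as below; the score array is toy data, and the integer encoding 0/1/2 for Low/Medium/High is our convention.

```python
import numpy as np

def tertile_labels(scores: np.ndarray) -> np.ndarray:
    """Map continuous scores to 0=Low, 1=Medium, 2=High by 33.3/66.7 percentiles."""
    lo, hi = np.percentile(scores, [100 / 3, 200 / 3])
    # < lo -> 0, [lo, hi) -> 1, >= hi -> 2
    return np.digitize(scores, [lo, hi])

scores = np.array([1.2, 3.4, 2.2, 4.8, 2.9, 4.1, 1.9, 3.7, 2.5])
labels = tertile_labels(scores)
print(labels)  # [0 1 0 2 1 2 0 2 1] -> three samples per class
```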

3.5. Parallel Learning Structure and Real-Time Feasibility

The model was designed so that the three modality-specific processing streams operate in parallel within the same training session. To address the imbalance in feature dimensionality across modalities, adaptive pooling and metadata-guided backpropagation techniques were applied. These strategies help minimize information loss and support efficient feature learning.
In addition, a thread-based parallel processing structure was implemented to evaluate real-time feasibility. During both training and inference, the average frame-per-second (FPS) rate was quantitatively measured. Table 4 presents the real-time performance across different training configurations. The model consistently achieves an average of over 15 FPS, demonstrating its suitability for real-time interactive environments.
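A thread-based FPS measurement of this kind can be sketched with the standard library; the per-frame `sleep` is a stand-in for real feature extraction, so the measured rates here are illustrative, not the paper's reported figures.

```python
import threading
import time

def process_stream(name, n_frames, results, frame_work=0.001):
    """Simulate one modality stream and record its achieved FPS."""
    start = time.perf_counter()
    for _ in range(n_frames):
        time.sleep(frame_work)            # stand-in for per-frame processing
    elapsed = time.perf_counter() - start
    results[name] = n_frames / elapsed

results = {}
threads = [threading.Thread(target=process_stream, args=(m, 30, results))
           for m in ("audio", "video", "text")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # ['audio', 'text', 'video']
```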

4. Experiment Setup

The experiments were designed to evaluate the performance and robustness of the proposed Attention-based multimodal personality recognition model.

4.1. Data Splitting and Preprocessing

A total of 156 multimodal samples were divided into training and test sets in a 4:1 ratio, considering the balanced distribution of personality labels. To ensure training stability and reliable evaluation, 5-fold cross-validation was also conducted in addition to this fixed split. This enabled the computation of the average performance across independent test folds, providing a quantitative measure of generalization. Each sample includes approximately five minutes of audio, video, and utterance data, with corresponding Big Five personality scores collected through participant surveys. To reduce the effect of outliers and enhance label stability, the final personality labels were computed using a trimmed mean. All input data were normalized to fixed-size 256 × 256 tensors, and segments with missing or noisy data in each modality were masked to maintain training consistency.

4.2. Evaluation Metrics

To evaluate both classification accuracy and balanced predictive performance, we employed two primary metrics: accuracy and the F1-score. Accuracy indicates the proportion of correctly classified samples and provides an intuitive assessment of overall model performance, especially when class distributions are balanced. In contrast, the F1-score, the harmonic mean of precision and recall, offers a more precise evaluation in scenarios with class imbalance. F1-scores were calculated for each of the three levels (High, Medium, Low) within each personality trait, and the overall model performance was then assessed using the macro-average of these class-wise F1-scores. These two metrics were used not only to compare different model architectures but also to analyze sensitivity under varying input conditions.
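The macro-averaged F1 used here can be sketched in pure Python; the label lists are toy data for illustration.

```python
def macro_f1(y_true, y_pred, classes=(0, 1, 2)):
    """Macro-averaged F1 over three classes (e.g., Low=0, Medium=1, High=2)."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```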

4.3. Experimental Design for Performance Validation

4.3.1. Comparison with Related Work

The performance of the proposed model was explicitly compared with the F1-score results reported in previous studies on automatic personality recognition. To ensure fair comparison across models, the results were organized and discussed based on the experimental settings specified in the literature, including the number of classes, label format, and evaluation metrics.

4.3.2. Unimodal Performance Comparison

In addition to the proposed multimodal model, independent experiments were conducted on the same dataset using models based solely on individual modalities. For frameworks not publicly available, models were reproduced following the descriptions provided in the respective papers. All experiments were conducted under the same data splits and training conditions for consistent performance comparison.

4.3.3. Multimodal Fusion Performance Comparison

The proposed model was also compared with existing multimodal fusion models using the same dataset. Officially released frameworks from prior studies were utilized to replicate their results and the differences in fusion strategies were analyzed to assess the relative performance of the proposed approach.

4.4. Robustness Evaluation Based on Input Length

4.4.1. Performance Variation by Number of Sentences

To assess the impact of input length, we evaluated how the F1-score changes with the number of input sentences for both traditional classifiers (e.g., a Doc2Vec-based Support Vector Machine (SVM) and Random Forest (RF)) and the proposed Attention-based model. The experiments were repeated on the same sample segments while incrementally increasing the number of input sentences. This setup allowed us to analyze each model’s sensitivity and stability with respect to the amount of available information.
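The incremental-input protocol amounts to truncating each transcript to its first n sentences and re-scoring. A hedged sketch, where `evaluate` stands in for the full train-and-score routine (Doc2Vec + classifier, or the Attention model) and is not the authors' implementation:

```python
def truncate_transcript(sentences, n):
    """Keep only the first n sentences of one sample's transcript."""
    return sentences[:n]

def sentence_count_sweep(sentence_counts, transcripts, evaluate):
    """Return {n: score} for each input-length condition; `evaluate` is the
    (assumed) train-and-score routine applied to one truncated dataset."""
    return {n: evaluate([truncate_transcript(t, n) for t in transcripts])
            for n in sentence_counts}

# Toy stand-in: the "score" is just the total number of sentences seen.
transcripts = [list(range(50)), list(range(30))]
scores = sentence_count_sweep([5, 10], transcripts,
                              lambda data: sum(len(t) for t in data))
```

In the actual experiments, the sweep runs from five to fifty sentences and the reported score is the macro F1 on the held-out split.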

4.4.2. Unimodal vs. Multimodal Comparison Under Sentence Constraints

Under the same sentence-length conditions, we compared the performance of the unimodal models (text, audio, and video, respectively) with the proposed multimodal model. This evaluation aimed to assess how effectively multimodal fusion compensates for limited input and to examine the robustness and representational capacity of the fusion model. All experiments were conducted under identical training settings, and results were averaged using the F1-score as the primary metric.

4.5. Latent Space Clustering Based on t-SNE Visualization

To interpret the distributional characteristics of the final fused representations, we performed t-SNE-based visualization on the latent representations of the test samples. The embeddings were color-coded by personality class (Low, Medium, High) to observe the clustering patterns and inter-class separability. This visualization aimed to qualitatively understand the model’s behavior and served as an auxiliary tool to enhance interpretability.
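A minimal sketch of this visualization step using scikit-learn's t-SNE: the random features below stand in for the model's final Attention-layer outputs, and the perplexity value is an illustrative choice rather than the paper's setting.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
latents = rng.normal(size=(30, 64))      # placeholder fused latent vectors
classes = rng.integers(0, 3, size=30)    # 0=Low, 1=Medium, 2=High

# Project to 2-D; each point can then be colored by its personality class.
emb = TSNE(n_components=2, perplexity=5.0, init="pca",
           random_state=0).fit_transform(latents)
# e.g. plt.scatter(emb[:, 0], emb[:, 1], c=classes)
```

With real embeddings, well-trained representations appear as compact, separated clusters per class, which is the qualitative pattern examined in Figure 7.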

4.6. Training Environment and Parameter Settings

All experiments were conducted on a system equipped with an NVIDIA RTX 2080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA), which provides a theoretical performance of 13.45 teraflops at 32-bit floating-point precision. The learning rate was set to 1 × 10⁻⁴, the batch size to 32, and the maximum number of training epochs to 50. Model selection was guided by early stopping based on the F1-score evaluated on the validation set. The training objective was the softmax-based cross-entropy loss, optimized using the Adam algorithm.
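The early-stopping rule described above (halt when the validation F1-score stops improving, within the 50-epoch budget) can be sketched as a small framework-agnostic helper. The patience value is an assumption; the paper does not specify it.

```python
class EarlyStopping:
    """Stop training once validation F1 has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")   # best validation F1 seen so far
        self.bad_epochs = 0         # epochs since the last improvement

    def step(self, val_f1):
        """Record one epoch's validation F1; return True when training should stop."""
        if val_f1 > self.best:
            self.best, self.bad_epochs = val_f1, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Inside the training loop, `if stopper.step(f1): break` then terminates before the 50-epoch cap once the validation F1 plateaus, and the checkpoint with the best score is retained.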
The proposed model was implemented with a thread-level parallel structure that enables simultaneous processing of each modality within a single training session. This architectural choice was made with future deployment scenarios in mind, particularly on edge computing devices with integrated GPUs. One such target platform is the NVIDIA Jetson AGX Orin with 64 GB of memory, which delivers up to 10.65 teraflops in 32-bit floating point operations. Given the comparable processing capabilities between the training environment and the target device, the proposed model is sufficiently lightweight and efficient for real-time operation on offline robotic platforms. This ensures practical deployability without relying on external servers or cloud-based computation, even in resource-limited environments.
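The thread-level parallel structure can be illustrated with Python's `threading` module: one worker per modality, joined before fusion. The extractor callables below are placeholders for the Wav2Vec2, skeleton, and BERT pipelines, not the authors' implementation.

```python
import threading

def extract_parallel(extractors, sample):
    """Run one feature extractor per modality in its own thread and join the
    results before fusion; `extractors` maps modality name -> callable."""
    results = {}

    def worker(name, fn):
        results[name] = fn(sample)  # each thread writes its own key

    threads = [threading.Thread(target=worker, args=(name, fn))
               for name, fn in extractors.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the three extraction pipelines are I/O- and GPU-bound rather than purely Python-bound, overlapping them this way is what raises throughput in the "Threading Process" row of Table 4.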

5. Results and Analysis

5.1. Performance Comparison with Existing Studies

The performance of the proposed Attention-based multimodal personality recognition model was compared with that of models described in previously published studies on personality recognition. The comparison covered key elements such as employed modalities, model architectures, data sources, and evaluation metrics including the F1-score and accuracy, all of which are summarized in Table 5.
Our model, built upon a state-of-the-art architecture (as of 2025), achieved an average F1-score of 0.868. This result demonstrates a more than 12 percent improvement over existing text-based models under identical experimental conditions, providing empirical evidence for the effectiveness of multimodal fusion and the Attention-based representation framework.

5.2. Performance Comparison Based on Unimodal Input

To evaluate the effectiveness of the Attention-based classifier, a key component of the proposed model, we conducted comparative experiments using identical text input across traditional classifiers. The comparison models included SVM, XGBoost, and Logistic Regression, all of which were independently implemented using the same methodology. Consistent settings were applied, including Doc2Vec embeddings, a 4:1 train–test split, and trimmed-mean Big Five personality scores as the ground truth. All model hyperparameters were optimized using a grid search method. Table 6 presents the experimental results. The Attention-based fully connected model achieved an F1-score of 0.821, outperforming the conventional text classifiers. This result indicates that the Attention mechanism effectively captures inter-sentence importance and enables the learning of more refined personality representations.
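The baseline tuning step can be sketched with scikit-learn's grid search; the 128-d random features stand in for the actual Doc2Vec embeddings, and the parameter grid is illustrative rather than the one used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(156, 128))      # placeholder Doc2Vec document vectors
y = rng.integers(0, 3, size=156)     # Low/Medium/High labels

# 4:1 train-test split, matching the fixed split used in the experiments.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(SVC(),
                    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                    scoring="f1_macro", cv=5)
grid.fit(X_tr, y_tr)
```

Scoring with `f1_macro` keeps hyperparameter selection aligned with the macro-F1 metric reported in Table 6.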

5.3. Multimodal Performance Comparison

In this experiment, the performance of the proposed Attention-based fusion model was compared with that of an existing publicly released multimodal model composed of BERT for text, Mel-Frequency Cepstral Coefficients (MFCC) for audio, and Skeleton-based motion features for video [58]. The same dataset containing 156 multimodal samples was used, and consistency was maintained across the input modalities, ground-truth labels, and preprocessing procedures. However, training conditions such as the number of epochs, the type of optimizer, and the choice of loss function were customized to suit the specific structure of each model. All experiments were independently repeated under identical GPU settings. Table 7 summarizes the results. The proposed model achieved an accuracy of 0.903 and an F1-score of 0.868, an improvement of approximately 6–7 percent over the existing multimodal structure; it also remained robust when the number of input sentences was limited or utterances were short. Notably, our previous study [59] employed a multimodal architecture on the same dataset without any Attention-based features, and the proposed model outperformed it by a similar margin even under varying training conditions. These results highlight the effectiveness of the Attention-based late fusion mechanism for integrating information, as well as the contribution of the masking-based normalization strategy to robust performance, particularly under limited input length or short utterances.
To further investigate the classification performance of each modality, we analyzed the precision–recall (PR) curves and computed the area under the curve (AUC) scores. As illustrated in Figure 5, the multimodal model achieves the highest AUC of 0.882, followed by the text modality with an AUC of 0.851, audio with 0.791, and video with 0.786. The multimodal curve consistently maintains high precision across a wide range of recall values, indicating superior robustness and generalization. These results reaffirm the advantage of multimodal fusion in personality trait classification tasks.
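The PR-curve and AUC computation follows the standard scikit-learn pattern shown below; the one-vs-rest labels and scores are hypothetical stand-ins for the classifier outputs behind Figure 5.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical per-sample scores for one trait level (one-vs-rest).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.75, 0.8, 0.65, 0.3, 0.7, 0.55, 0.2])

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)   # area under the precision-recall curve
```

Repeating this per modality (text, audio, video, and the fused model) yields the AUC values of 0.851, 0.791, 0.786, and 0.882 reported above.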

5.4. Analysis of Performance Variation According to the Number of Input Sentences

The sensitivity of model performance was evaluated by adjusting the number of input sentences, and the Attention-based fully connected model demonstrated relatively higher robustness than the traditional classifiers. The left side of Figure 6 presents the change in the F1-score for the SVM, RF, and Attention-based fully connected models as the number of input sentences increases from five to fifty. The right side of the figure compares the convergence behavior of the unimodal models (text, audio, and video) with that of the proposed multimodal model.
The multimodal architecture reaches stable convergence with an F1-score exceeding 0.85 when more than approximately twenty sentences are provided. Compared to the unimodal models, it exhibits a faster and more consistent learning curve. This result indicates that the Attention mechanism effectively integrates complementary features across different modalities by processing them simultaneously. In conclusion, the proposed model maintained a strong predictive performance even under limited input conditions, demonstrating its suitability for real-time applications.

5.5. Interpretation of Personality Features Using t-SNE Visualization

To analyze the interpretability of the model and the structural characteristics of its latent representations, t-SNE-based visualization was conducted on the learned features from each modality. The models compared in this analysis include a text model based on Doc2Vec, a speech model using Wav2Vec2, a video model utilizing skeleton-based features, and the proposed multimodal model. The output vectors from the final Attention layer were projected onto a two-dimensional space after dimensionality reduction. Figure 7 illustrates the resulting t-SNE projections. The unimodal models showed limited clustering for certain personality traits; in particular, the text and audio models exhibited frequent overlaps in the representation space for traits such as Openness and Conscientiousness. In contrast, the multimodal model produced clearer boundaries between personality classes, with minimal interference between representations and a more structurally separated distribution. These findings suggest that the Attention-based fusion mechanism enables more consistent learning of personality-related features and that integrating diverse modalities yields a meaningful, well-separated representation space for personality classification. The learned embeddings could therefore serve as a foundation for future applications such as interpretable personality analysis, cluster-based classifier design, and anomaly detection, demonstrating the model’s potential for both enhanced interpretability and broader applicability.

6. Conclusions

This study proposed an Attention-based personality recognition model that integrates heterogeneous multimodal inputs, including audio, video, and text, to predict the Big Five personality traits. Through a series of experiments, the model’s effectiveness and practical applicability were empirically validated. The proposed approach achieved the following key outcomes.
First, an end-to-end architecture was implemented that extracts high-dimensional features from each of the three modalities, processes them independently using Self-Attention mechanisms, and integrates them through a late fusion strategy. This design effectively leveraged the complementary nature of each modality to capture complex personality signals.
Second, the Self-Attention-based fusion structure achieved up to a 12 percent improvement in the F1-score compared to the unimodal and RNN-based personality recognition models. The model also maintained a high accuracy and robustness under limited input conditions, demonstrating strong data efficiency.
Third, by employing a threading-based training architecture optimized for parallel processing and lightweight computation, the model improved training and inference efficiency, achieving an average processing speed of over 15 frames per second. This confirms the model’s applicability to real-time interactive environments.
Fourth, the t-SNE visualizations of the latent representations revealed distinct clustering and separation among the personality classes. The multimodal fusion model showed more clearly structured latent spaces, indicating that it produces consistent internal representations and supports interpretability.
Finally, comprehensive experiments demonstrated that the proposed model outperforms traditional machine learning classifiers and existing multimodal approaches in terms of both the prediction accuracy and practical usability. The proposed framework offers a unified solution that balances performance, interpretability, and real-time applicability in personality recognition tasks.
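As a compact illustration of the first outcome above, the per-modality Self-Attention followed by late fusion can be sketched in numpy. This is a simplification, not the trained model: a single head with identity Q/K/V projections replaces the learned multi-head projections, the inputs are random placeholders, and the dimensions loosely follow Table 3.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over one (seq_len, d) modality sequence.
    Identity Q/K/V projections and one head keep the sketch short."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def late_fusion(modalities):
    """Attend within each modality, mean-pool over time, then concatenate."""
    pooled = [self_attention(x).mean(axis=0) for x in modalities]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
text = rng.normal(size=(50, 120))    # sequence lengths / dims follow Table 3
audio = rng.normal(size=(45, 120))
video = rng.normal(size=(40, 128))
fused = late_fusion([text, audio, video])   # 120 + 120 + 128 = 368-dim vector
```

Keeping the attention blocks independent per modality, as here, is what allows each stream to weight its own salient time steps before the fused vector is classified.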
Future work will focus on applying the model to real-time personality inference scenarios based on user interaction and dialogue. We also plan to integrate explainable AI techniques to provide users with interpretable explanations and expand the system toward socially intelligent HRI.

Author Contributions

Conceptualization, H.B. and J.C.; Methodology, H.B.; Software, H.B.; Validation, H.B. and J.C.; Formal analysis, H.B.; Resources, J.C.; Data curation, H.B.; Writing—original draft preparation, H.B.; Writing—review and editing, J.C.; Visualization, H.B.; Supervision, J.C.; Project administration, J.C.; Funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Institute of Science and Technology (KIST) Institutional Program under Grant (2E33602) and by the Technology Innovation Program (RS-2024-00419883, Development of a Collaborative Robot System and Multimodal Human–Robot Interaction Services for Supporting Young Children’s Daily Activity Care; RS-2024-00507746) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea).

Data Availability Statement

The data used in this study were obtained from experiments conducted under a government-funded research project. These data are not publicly available at this time due to institutional and funding agency policies. Access may be granted in the future subject to approval by the relevant institutions and agencies.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Harris, K.; Vazire, S. On Friendship Development and the Big Five Personality Traits. Soc. Personal. Psychol. Compass 2016, 10, 647–667. [Google Scholar] [CrossRef]
  2. Mund, M.; Finn, C.; Hagemeyer, B.; Neyer, F.J. Understanding Dynamic Transactions Between Personality Traits and Partner Relationships. Curr. Dir. Psychol. Sci. 2016, 25, 411–416. [Google Scholar] [CrossRef]
  3. Bui, H.T. Big Five Personality Traits and Job Satisfaction: Evidence from a National Sample. J. Gen. Manag. 2017, 42, 21–30. [Google Scholar] [CrossRef]
  4. Vinciarelli, A.; Mohammadi, G. A Survey of Personality Computing. IEEE Trans. Affect. Comput. 2014, 5, 273–291. [Google Scholar] [CrossRef]
  5. Digman, J.M. Personality Structure: Emergence of the Five-Factor Model. Annu. Rev. Psychol. 1990, 41, 417–440. [Google Scholar] [CrossRef]
  6. Hogan, R.; Johnson, J.; Briggs, S. Handbook of Personality Psychology; Academic Press: Cambridge, MA, USA, 1997. [Google Scholar]
  7. Allport, G.W. Pattern and Growth in Personality; Springer: Berlin/Heidelberg, Germany, 1961. [Google Scholar]
  8. Han, S.; Huang, H.; Tang, Y. Knowledge of Words: An Interpretable Approach for Personality Recognition from Social Media. Knowl.-Based Syst. 2020, 194, 105550. [Google Scholar] [CrossRef]
  9. Poria, S.; Gelbukh, A.; Agarwal, B.; Cambria, E.; Howard, N. Common Sense Knowledge Based Personality Recognition from Text. In Advances in Soft Computing and Its Applications; Hutchison, D., Kanade, T., Kittler, J., Kleinberg, J.M., Mattern, F., Mitchell, J.C., Naor, M., Nierstrasz, O., Pandu Rangan, C., Steffen, B., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8266, pp. 484–496. [Google Scholar] [CrossRef]
  10. Carducci, G.; Rizzo, G.; Monti, D.; Palumbo, E.; Morisio, M. Twitpersonality: Computing Personality Traits from Tweets Using Word Embeddings and Supervised Learning. Information 2018, 9, 127. [Google Scholar] [CrossRef]
  11. KN, P.K.; Gavrilova, M.L. Latent Personality Traits Assessment from Social Network Activity Using Contextual Language Embedding. IEEE Trans. Comput. Soc. Syst. 2021, 9, 638–649. [Google Scholar] [CrossRef]
  12. Tadesse, M.M.; Lin, H.; Xu, B.; Yang, L. Personality Predictions Based on User Behavior on the Facebook Social Media Platform. IEEE Access. 2018, 6, 61959–61969. [Google Scholar] [CrossRef]
  13. Bindroo, R.; Sujit, S.D.; Seshadri, A.; Sathyanarayan, M. Psychometric Precision: ML-Driven Learning Strategies Informed on Big Five Traits. In Proceedings of the 2024 2nd International Conference on Networking, Embedded and Wireless Systems (ICNEWS), Bangalore, India, 22–23 August 2024; pp. 1–7. [Google Scholar] [CrossRef]
  14. Karpagam, G.; VM, H.V.; Kabilan, K.; Pranav, P.; Ramesh, P.; B, S.S. Multimodal Fusion for Precision Personality Trait Analysis: A Comprehensive Model Integrating Video, Audio, and Text Inputs. In Proceedings of the 2024 International Conference on Smart Systems for Electrical, Electronics, Communication and Computer Engineering (ICSSEECC), Coimbatore, India, 28–29 June 2024; pp. 327–332. [Google Scholar]
  15. Ma, Z.; Ma, F.; Sun, B.; Li, S. Hybrid multimodal fusion for dimensional emotion recognition. In Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge, Virtual, 24 October 2021; pp. 29–36. [Google Scholar]
  16. Ma, F.; He, Y.; Sun, B.; Li, S. Multimodal Prompt Alignment for Facial Expression Recognition. arXiv 2025, arXiv:2506.21017. [Google Scholar] [CrossRef]
  17. Kosan, M.A.; Karacan, H.; Urgen, B.A. Predicting personality traits with semantic structures and LSTM-based neural networks. Alex. Eng. J. 2022, 61, 8007–8025. [Google Scholar] [CrossRef]
  18. Jaysundara, A.; De Silva, D.; Kumarawadu, P. Personality prediction of social network users using LSTM based sentiment analysis. In Proceedings of the 2022 International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN), Villupuram, India, 25–26 March 2022; pp. 1–6. [Google Scholar]
  19. Cantador, I.; Fernández-Tobías, I.; Bellogín, A.; Kosinski, M.; Stillwell, D. Relating Personality Types with User Preferences in Multiple Entertainment Domains. In Proceedings of the UMAP Workshops, Rome, Italy, 10–14 June 2013; Volume 997. [Google Scholar]
  20. Strickhouser, J.E.; Zell, E.; Krizan, Z. Does Personality Predict Health and Well-Being? A Metasynthesis. Health Psychol. 2017, 36, 797. [Google Scholar] [CrossRef]
  21. Widiger, T.A.; Costa Jr, P.T. Personality and Personality Disorders. J. Abnorm. Psychol. 1994, 103, 78. [Google Scholar] [CrossRef]
  22. Suen, H.Y.; Hung, K.E.; Lin, C.L. TensorFlow-based Automatic Personality Recognition Used in Asynchronous Video Interviews. IEEE Access. 2019, 7, 61018–61023. [Google Scholar] [CrossRef]
  23. Song, S.; Jaiswal, S.; Sanchez, E.; Tzimiropoulos, G.; Shen, L.; Valstar, M. Self-Supervised Learning of Person-Specific Facial Dynamics for Automatic Personality Recognition. IEEE Trans. Affect. Comput. 2021, 14, 178–195. [Google Scholar] [CrossRef]
  24. Mehta, Y.; Majumder, N.; Gelbukh, A.; Cambria, E. Recent Trends in Deep Learning Based Personality Detection. Artif. Intell. Rev. 2020, 53, 2313–2339. [Google Scholar] [CrossRef]
  25. Ahmad, H.; Asghar, M.Z.; Khan, A.S.; Habib, A. A Systematic Literature Review of Personality Trait Classification from Textual Content. Open Comput. Sci. 2020, 10, 175–193. [Google Scholar] [CrossRef]
  26. Zumma, M.T.; Munia, J.A.; Halder, D.; Rahman, M.S. Personality Prediction from Twitter Dataset Using Machine Learning. In Proceedings of the 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Virtual, 3–5 October 2022; pp. 1–5. [Google Scholar]
  27. Jagannath, D.J.; Sreelakshmi, T.; George, J.; Achsah, M. Forecasting Traits: Human Personality Prediction with Machine Learning Methodology-a Comparative Study. In Proceedings of the 2nd International Conference on Computer Vision and Internet of Things (ICCVIoT 2024), Coimbatore, India, 10–11 December 2024; Volume 2024, pp. 93–99. [Google Scholar]
  28. Pennebaker, J.W.; King, L.A. Linguistic Styles: Language Use as an Individual Difference. J. Personal. Soc. Psychol. 1999, 77, 1296. [Google Scholar] [CrossRef]
  29. Asghar, J.; Akbar, S.; Asghar, M.Z.; Ahmad, B.; Al-Rakhami, M.S.; Gumaei, A. Detection and Classification of Psychopathic Personality Trait from Social Media Text Using Deep Learning Model. Comput. Math. Methods Med. 2021, 2021, 1–10. [Google Scholar] [CrossRef]
  30. Christian, H.; Suhartono, D.; Chowanda, A.; Zamli, K.Z. Text Based Personality Prediction from Multiple Social Media Data Sources Using Pre-Trained Language Model and Model Averaging. J. Big Data 2021, 8, 68. [Google Scholar] [CrossRef]
  31. Wang, Y.; Zheng, J.; Li, Q.; Wang, C.; Zhang, H.; Gong, J. Xlnet-Caps: Personality Classification from Textual Posts. Electronics 2021, 10, 1360. [Google Scholar] [CrossRef]
  32. Leonardi, S.; Monti, D.; Rizzo, G.; Morisio, M. Multilingual Transformer-Based Personality Traits Estimation. Information 2020, 11, 179. [Google Scholar] [CrossRef]
  33. Waqas, M.; Zhang, F.; Laghari, A.A.; Almadhor, A.; Petrinec, F.; Iqbal, A.; Khalil, M.M.Y. TraitBertGCN: Personality Trait Prediction Using BertGCN with Data Fusion Technique. Int. J. Comput. Intell. Syst. 2025, 18, 64. [Google Scholar] [CrossRef]
  34. Mohammadi, G.; Vinciarelli, A. Automatic Personality Perception: Prediction of Trait Attribution Based on Prosodic Features. IEEE Trans. Affect. Comput. 2012, 3, 273–284. [Google Scholar] [CrossRef]
  35. Yang, L.; Li, S.; Luo, X.; Xu, B.; Geng, Y.; Zeng, Z.; Zhang, F.; Lin, H. Computational Personality: A Survey. Soft Computing 2022, 26, 9587–9605. [Google Scholar] [CrossRef]
  36. Tsani, E.F.; Suhartono, D. Personality Identification from Social Media Using Ensemble BERT and RoBERTa. Informatica 2023, 47, 537–544. [Google Scholar] [CrossRef]
  37. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
  38. Yan, L.; Li, K.; Gao, R.; Wang, C.; Xiong, N. An intelligent weighted object detector for feature extraction to enrich global image information. Appl. Sci. 2022, 12, 7825. [Google Scholar] [CrossRef]
  39. Lin, C.B.; Dong, Z.; Kuan, W.K.; Huang, Y.F. A framework for fall detection based on OpenPose skeleton and LSTM/GRU models. Appl. Sci. 2020, 11, 329. [Google Scholar] [CrossRef]
  40. Nguyen, H.C.; Nguyen, T.H.; Scherer, R.; Le, V.H. Deep Learning for Human Activity Recognition on 3D Human Skeleton: Survey and Comparative Study. Sensors 2023, 23, 5121. [Google Scholar] [CrossRef]
  41. Liu, J.; Akhtar, N.; Mian, A. Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 10–19. [Google Scholar]
  42. Zhao, X.; Liao, Y.; Tang, Z.; Xu, Y.; Tao, X.; Wang, D.; Wang, G.; Lu, H. Integrating Audio and Visual Modalities for Multimodal Personality Trait Recognition via Hybrid Deep Learning. Front. Neurosci. 2023, 16, 1107284. [Google Scholar] [CrossRef] [PubMed]
  43. Lee, C.H.; Yang, H.C.; Su, X.Q.; Tang, Y.X. A Multimodal Affective Sensing Model for Constructing a Personality-Based Financial Advisor System. Appl. Sci. 2022, 12, 10066. [Google Scholar] [CrossRef]
  44. Giritlioğlu, D.; Mandira, B.; Yilmaz, S.F.; Ertenli, C.U.; Akgür, B.F.; Kınıklıoğlu, M.; Kurt, A.G.; Mutlu, E.; Gürel, Ş.C.; Dibeklioğlu, H. Multimodal analysis of personality traits on videos of self-presentation and induced behavior. J. Multimodal User Interfaces 2021, 15, 337–358. [Google Scholar] [CrossRef]
  45. Stern, J.; Schild, C.; Jones, B.C.; DeBruine, L.M.; Hahn, A.; Puts, D.A.; Zettler, I.; Kordsmeyer, T.L.; Feinberg, D.; Zamfir, D. Do Voices Carry Valid Information about a Speaker’s Personality? J. Res. Personal. 2021, 92, 104092. [Google Scholar] [CrossRef]
  46. Zhao, X.; Tang, Z.; Zhang, S. Deep Personality Trait Recognition: A Survey. Front. Psychol. 2022, 13, 839619. [Google Scholar] [CrossRef]
  47. Yan, L.; Ye, Y.; Wang, C.; Sun, Y. LocMix: Local saliency-based data augmentation for image classification. Signal Image Video Process. 2024, 18, 1383–1392. [Google Scholar] [CrossRef]
  48. Moorthy, S.; Moon, Y.K. Hybrid Multi-Attention Network for Audio–Visual Emotion Recognition Through Multimodal Feature Fusion. Mathematics 2025, 13, 1100. [Google Scholar] [CrossRef]
  49. Praveen, R.G.; de Melo, W.C.; Ullah, N.; Aslam, H.; Zeeshan, O.; Denorme, T.; Pedersoli, M.; Koerich, A.L.; Bacon, S.; Cardinal, P. A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2486–2495. [Google Scholar]
  50. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  51. Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A Comparative Study on Transformer vs. RNN in Speech Applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 449–456. [Google Scholar] [CrossRef]
  52. Vásquez, R.L.; Ochoa-Luna, J. Transformer-based approaches for personality detection using the MBTI model. In Proceedings of the 2021 XLVII Latin American Computing Conference (CLEI), Cartago, Costa Rica, 25–29 October 2021; pp. 1–7. [Google Scholar]
  53. Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv 2016, arXiv:1609.08675v1. [Google Scholar]
  54. Li, B.B.; Huang, H.W. Cognition and beyond: Intersections of Personality Traits and Language. Psychol. Learn. Motiv. 2024, 80, 105–148. [Google Scholar]
  55. Łysiak, M. Inner Dialogical Communication and Pathological Personality Traits. Front. Psychol. 2019, 10, 1663. [Google Scholar] [CrossRef]
  56. Kim, S.Y.; Kim, J.M.; Yoo, J.A.; Bae, K.Y.; Kim, S.W.; Yang, S.J.; Shin, I.S.; Yoon, J.S. Standardization and validation of big five inventory-Korean version (BFI-K) in elders. Korean J. Biol. Psychiatry 2010, 17, 15–25. [Google Scholar]
  57. Wang, Q.; Liu, A.; Yan, K.; Hou, J.; Li, W. BigFive: A Chinese Textual Dataset Supporting Psychology Knowledge Graph Construction. In Proceedings of the 2023 IEEE International Conference on Knowledge Graph (ICKG), Shanghai, China, 1–2 December 2023; pp. 77–83. [Google Scholar]
  58. Cherukuru, R.K.; Kumar, A.; Srivastava, S.; Verma, V.K. Prediction of Personality Trait using Machine Learning on Online Texts. In Proceedings of the 2022 International Conference for Advancement in Technology (ICONAT), Goa, India, 21–22 January 2022; pp. 1–8. [Google Scholar]
  59. Bhin, H.; Lim, Y.; Choi, J. Multimodal Personality Prediction: A Real-Time Recognition System for Social Robots with Data Acquisition. In Proceedings of the 2024 21st International Conference on Ubiquitous Robots (UR), Manhattan, NY, USA, 24–27 June 2024; pp. 673–676. [Google Scholar] [CrossRef]
Figure 1. Web application interface for personality trait labeling experiment.
Figure 2. Overview of the data preparation process for training the APR model.
Figure 3. Feature extraction and preprocessing steps for each modality.
Figure 4. Architecture and training flow of the Attention-based multimodal fusion model.
Figure 5. Precision–recall curves and AUC scores for different modalities.
Figure 6. F1-score comparison according to the number of input sentences: (a) comparison of text-based models using Doc2Vec with various classifiers; (b) comparison of models by modality (unimodal vs. multimodal).
Figure 7. t-SNE clustering visualization of personality features extracted from individual modalities.
Table 1. Descriptions of the Big Five personality traits.

Trait | High | Low
Openness | Imaginative, curious, open-minded | Conventional, resistant to change, narrow interests
Conscientiousness | Organized, responsible, goal-oriented | Careless, impulsive, disorganized
Extraversion | Sociable, energetic, enthusiastic | Reserved, quiet, solitary
Agreeableness | Kind, cooperative, compassionate | Suspicious, antagonistic, uncooperative
Neuroticism | Prone to anxiety, moodiness, emotional instability | Calm, emotionally stable, resilient
Table 2. Fleiss’ Kappa scores for each personality trait before and after trimming.

Personality Trait | Before Trimming | After Trimming
Openness | 0.78 | 0.86
Conscientiousness | 0.72 | 0.89
Extraversion | 0.74 | 0.88
Agreeableness | 0.69 | 0.87
Neuroticism | 0.82 | 0.90
Average | 0.75 | 0.88
Table 3. Modality-specific Self-Attention configurations.

Modality | Seq. Length (n) | Feature Dimension (d) | Heads (h) | Pooling Type
Text | 50 | 120 | 4 | Mean pooling
Audio | 45 | 120 | 6 | Linear pooling
Visual | 40 | 128 | 4 | Mean pooling
Table 4. Comparison of learning and inference efficiency.

Category | Features | Learning Sequence | Learning Duration | Test Duration | FPS | FPS@15 Snapshot
Audio, Video, Text | Each Feature | Each Modality | 404 min | 43 min | 20 | 3.2
Multimodal | Concatenation Fusion | Multi-Process | 373 min | 25 min | 23.2 | 10.52
Multimodal | Concatenation Fusion | Threading Process | 260 min | 15 min | 30 | 16.92
Table 5. Comparison of recognition performance with recent state-of-the-art methods.

Features | Year | Models | Performance
Text [27] | 2024 | SVM | F1-score: 0.74
Text [26] | 2022 | Naïve Bayes, RF | Accuracy: 0.75
Text [57] | 2023 | Correlation Analysis | F1-score: 0.66
Audio, Text [58] | 2024 | Multiple Classifiers, RNN | F1-score: 0.82
Audio, Video, Text [13] | 2024 | BERT, RNN | Accuracy: 0.80
Audio, Video [14] | 2024 | XGBoost, RNN | F1-score: 0.76
Audio, Video, Text | 2025 | Proposed Model | F1-score: 0.868
Table 6. Comparison of unimodal (text-based) personality recognition models.

Features | Year | Models | Performance (Paper) | Performance (Test)
Text [27] | 2024 | SVM | F1-score: 0.74 | Accuracy: 0.573; F1-score: 0.435
Text [26] | 2022 | Naïve Bayes, RF | Accuracy: 0.75 | Accuracy: 0.685; F1-score: 0.652
Text | 2025 | Proposed Model | - | F1-score: 0.821
Table 7. Comparison of multimodal personality recognition models.

Features | Year | Models | Performance (Paper) | Performance (Test)
Audio, Video, Text [58] | 2024 | BERT, RNN | Accuracy: 0.80 | Accuracy: 0.840; F1-score: 0.791
Audio, Video, Text [59] | 2024 | BiRNN | F1-score: 0.736 | Accuracy: 0.826; F1-score: 0.736
Audio, Video, Text | 2025 | Proposed Model | - | Accuracy: 0.903; F1-score: 0.868