1. Introduction
A central goal of next-generation artificial intelligence (AI) is to build systems that can interact with humans in a natural and effective manner. Achieving this goal requires not only understanding explicit semantic content, but also perceiving deeper human attributes such as personality traits and emotional states, which enable AI systems to respond in a more adaptive and human-like way. In the fields of personality and emotion recognition, such capability is considered a key foundation for more natural human–AI interactions [1]. However, progress in this area remains constrained by a practical challenge: high-quality real-world data are often limited in scale. Collecting and annotating video data is typically expensive, time-consuming, and subject to privacy and ethical constraints, making it difficult for existing datasets to cover sufficiently diverse behavioral patterns and environmental conditions. As a result, models trained on such limited data are more prone to overfitting and may disproportionately rely on dataset-specific appearance or contextual cues.
Recently, in fields such as medicine and autonomous driving, the use of synthetic data to augment real-world datasets has attracted increasing attention [2,3]. Inspired by this trend, we ask whether transforming existing real-world data during training, rather than generating fully synthetic data from scratch, can also benefit model learning. Based on this intuition, we propose style-abstraction-based data augmentation. Specifically, this method employs cartoonization to generate visually abstract representations that preserve key behavioral information, such as facial dynamics, body posture, and gestures, while suppressing low-level visual cues, including texture, illumination, background clutter, and appearance-specific details. We hypothesize that such style abstraction acts as an effective regularizer, encouraging the model to rely less on task-irrelevant superficial patterns and more on higher-level behavioral representations related to personality and emotion. Following this idea, we jointly use original and style-abstracted videos during training to enrich data diversity and further improve recognition performance in real-world scenarios.
To empirically test our hypothesis, we designed a rigorous experimental framework. We evaluated performance using the standard Big-5 personality model [4] and four emotion categories. Our evaluation covered four diverse benchmark datasets: First Impression v2 [5], UDIVA v0.5 [6], KETI, and an emotion dataset. These datasets span different languages, interaction contexts (monologues vs. dialogues), annotation methods, and tasks, enabling a robust assessment of our method’s generalization capabilities. Covering both tasks allows us to determine whether visual abstraction through cartoonization improves robustness not only in continuous personality regression but also in discrete emotion classification, thereby validating broader applicability across affective computing tasks.
As a concrete form of visual abstraction, we employed cartoonization. We then systematically trained state-of-the-art video models, including ViViT (Video Vision Transformer) [7], VST (Video Swin Transformer) [8], and TimeSformer [9], on training sets created with progressively larger proportions of abstracted data.
Our experiments show a consistent trend. As the proportion of abstracted data in the training set increases, we observe improvements in personality and emotion recognition performance across multiple models and datasets. These results suggest that visual abstraction can act as a form of regularization by reducing the influence of appearance-related features. Furthermore, the findings indicate that the proposed approach can serve as an effective data augmentation strategy that encourages models to focus on behavior-relevant cues. In this sense, the method may provide a practical direction for alleviating data scarcity challenges in affective computing. Specifically, our main contributions are threefold.
Insights into Style-Abstraction-based Regularization: We show that our style-abstraction-based augmentation can act as a regularization mechanism that reduces the model’s reliance on shortcut cues, such as background textures, lighting conditions, and other environmental artifacts. By abstracting these appearance-related factors, the model is encouraged to focus more on behavior-relevant cues, such as facial expressions and hand gestures, resulting in more stable representation learning for affective computing tasks.
A Practical Framework for Robust Affective Computing: We introduce style-abstraction-based data augmentation, a lightweight yet effective cartoonization-based framework for improving recognition models. Unlike complex generative models or full 3D simulations, it offers a practical and scalable way to mitigate overfitting and address data scarcity, reframing the gap between simulated and real data as a stylized-to-original paradigm.
Comprehensive Empirical Validation: We conduct extensive experiments on two tasks using large-scale benchmarks (First Impression v2, UDIVA v0.5, KETI, and an emotion dataset) with multiple state-of-the-art Transformer-based video models. Our results consistently show significant performance improvements across different demographic groups, interaction contexts, tasks, and model architectures, thereby establishing the broad applicability and effectiveness of our proposed approach.
4. Experiments
In this section, we present a series of experiments to evaluate the effectiveness of our proposed style-abstraction-based data augmentation for both personality and emotion recognition. Our evaluation is conducted on three diverse personality benchmark datasets (First Impression v2, UDIVA v0.5, and KETI) and one emotion dataset, which differ in language, interaction context, and annotation methods, as summarized in Table 2, thereby enabling an assessment of the robustness of our approach. We employ three state-of-the-art video-based models (ViViT, VST, and TimeSformer) and systematically analyze their performance by training them on datasets with progressively larger proportions of style-abstracted data.
4.1. Datasets
4.1.1. First Impression v2 Dataset
The First Impression v2 dataset is a widely used benchmark in the field of personality recognition, originally introduced in the ECCV 2016 ChaLearn LAP challenge [23]. It contains over 3000 videos collected from YouTube, which were segmented into approximately 15 s clips. This process yielded around 10,000 front-facing video clips of individuals speaking in English. The dataset is divided into training, validation, and test sets with a 3:1:1 split. The individuals in the videos cover a diverse range of nationalities, ethnicities, genders, and ages [29,30,31]. Personality labels were annotated using Amazon Mechanical Turk (AMT) based on the Big-5 OCEAN personality model.
4.1.2. UDIVA v0.5 Dataset
The UDIVA v0.5 dataset consists of videos capturing interactions among 147 volunteers from 22 countries and regions [32]. A total of 134 participants were assigned to one of four task-based sessions: Talk, Lego, Ghost, and Animals. Spanish was the primary language, followed by Catalan and English. Each participant’s voice was recorded with both a lavalier microphone and a desk-mounted omnidirectional microphone, and they also wore a first-person-view camera. The dataset includes not only participants’ videos but also metadata such as nationality, gender, and age, along with automatically extracted annotations including facial contours, body poses, hand movements, and speech transcriptions. The data is split into training (116 sessions, 99 participants), validation (18 sessions, 20 participants), and test (11 sessions, 15 participants) sets. Participants further provided self-assessed Big-5 OCEAN personality scores using the BFI-2 (Big Five Inventory–2) [33]. These scores are normalized using a z-score transformation, resulting in continuous values with a mean of 0 and a standard deviation of 1.
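The z-score transformation mentioned above can be sketched as follows. This is a minimal illustration, not the dataset's own preprocessing code: the function name `z_score`, the sample values, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def z_score(scores):
    """Standardize scores to zero mean and unit standard deviation.

    Assumption: the population standard deviation is used; the source
    does not state whether the population or sample form was applied.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("z-score is undefined when all scores are identical")
    return [(s - mu) / sigma for s in scores]


# Hypothetical raw trait scores for five participants (illustrative only).
raw = [2.5, 3.0, 3.5, 4.0, 4.5]
normalized = z_score(raw)
```

After this transformation, a score of 0 corresponds to the cohort average for that trait, and positive or negative values indicate how many standard deviations a participant lies above or below it.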
4.1.3. KETI Dataset
The KETI multimodal dialogue corpus is a multimodal dataset containing video recordings and transcribed dialogue from 516 participants. The participants interact in groups of four while discussing three task-oriented scenarios: prioritizing the invitation list for an event, planning the setup of a festival booth, and organizing a Korean trip for a foreign friend. In addition to these structured scenarios, free-form conversation sessions were also collected to assess the recording environment. Each video is accompanied by precise speech timing information aligned with the corresponding dialogue content. The dataset further provides OCEAN personality scores for each participant, obtained from both the IPIP-NEO-120 and the abridged IPIP-NEO-60 questionnaires, with all scores reported as integers.
4.1.4. Emotion Dataset
This categorical emotion recognition dataset comprises recordings of 100 acting students and professional actors. Each participant performed approximately 100 scripted utterances per emotion, resulting in a total of 10,351 high-resolution videos. The dataset contains seven emotion categories: Happiness, Surprise, Neutral, Fear, Disgust, Anger, and Sadness. For our experiments, we selected four representative emotions, namely Happiness, Anger, Neutral, and Sadness, resulting in a filtered subset of 5314 samples. All recordings were captured in FHD (1920 × 1080) at 30 fps in m2ts format with synchronized 16-bit, 48 kHz audio.
4.1.5. Identity Overlap Analysis in First Impression v2
Unlike UDIVA v0.5, KETI, and the emotion dataset, which are partitioned by participant ID, First Impression v2 employs an official pre-defined split. We conducted an identity overlap analysis using the YouTube video ID (extracted as the prefix of the clip index) as a proxy for individual identity. The analysis reveals substantial overlap: 1222 IDs are shared between the training and validation sets, involving 2948 out of 6000 training clips and 1674 out of 2000 validation clips. These results indicate that the official validation set is not strictly identity-disjoint, which may lead to an overestimation of generalization performance on unseen individuals.
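The overlap analysis described above can be sketched in a few lines. This is an illustrative sketch, not the script used in the study: the helper names and the toy file names are hypothetical, and it assumes that the source-video ID is the part of the clip name before the first dot.

```python
def video_id(clip_name):
    """Return the source-video ID, assumed to be the text before the first '.'."""
    return clip_name.split(".", 1)[0]


def identity_overlap(train_clips, val_clips):
    """Find video IDs present in both splits and count the clips they cover."""
    train_ids = {video_id(c) for c in train_clips}
    val_ids = {video_id(c) for c in val_clips}
    shared = train_ids & val_ids
    n_train = sum(video_id(c) in shared for c in train_clips)
    n_val = sum(video_id(c) in shared for c in val_clips)
    return shared, n_train, n_val


# Toy clip names (hypothetical), not real dataset entries.
train = ["vidA.000.mp4", "vidA.001.mp4", "vidB.000.mp4", "vidC.002.mp4"]
val = ["vidA.003.mp4", "vidD.000.mp4", "vidC.000.mp4"]
shared, n_train, n_val = identity_overlap(train, val)
```

Note that a shared video ID is only a proxy: the same person may appear under several video IDs, so a count obtained this way is a lower bound on the true identity overlap.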
4.2. Preprocessing
A unified preprocessing and augmentation pipeline was employed for three video-based personality recognition datasets (First Impression v2, UDIVA v0.5, and KETI) and one emotion dataset. For all videos, a fixed set of 15 frames was uniformly sampled across the entire clip duration, which in many cases exceeds 15 s in length. This global sampling strategy helps the model capture long-term temporal dynamics and the progression of affective expressions, rather than relying solely on instantaneous static cues. Consequently, the approach preserves essential temporal context while maintaining computational efficiency for training large-scale video models.
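One possible way to pick 15 uniformly spaced frames is sketched below. The function name and the endpoint-inclusive spacing are assumptions; the source states only that 15 frames were sampled uniformly across the clip.

```python
def uniform_frame_indices(total_frames, num_samples=15):
    """Return frame indices spread evenly over the whole clip.

    Assumption: the first and last frames are included. If the clip has
    fewer frames than num_samples, some indices repeat.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples == 1:
        return [(total_frames - 1) // 2]
    last = total_frames - 1
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]


# Hypothetical clip with 450 frames.
indices = uniform_frame_indices(450)
```

Because the spacing scales with clip length, a one-minute clip and a 15 s clip both yield exactly 15 frames, so the input size stays fixed while the temporal coverage spans the full clip.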
For the training set, we designed four experimental groups (G1–G4) by progressively increasing the proportion of style-abstracted data relative to a fixed baseline of original videos. Starting from the baseline of original videos (G1), we incrementally added style-abstracted samples in stages: G2 includes an additional amount of style-abstracted data equivalent to one-third of the baseline, G3 increases this to two-thirds, and G4 reaches a balanced mixture where the amount of style-abstracted data equals the amount of original data. Across all groups, the validation set consisted entirely of original videos. This setup allows us to analyze the incremental contribution of style-abstracted samples and determine whether they provide complementary information that improves model robustness as the total data volume increases.
Furthermore, to examine whether the observed performance gains are specifically due to the effect of style abstraction rather than a simple increase in training data volume, we conducted two additional control experiments while keeping the total sample count identical to G4: (1) Control A, trained exclusively with 100% original videos, and (2) Control B, trained exclusively with 100% style-abstracted videos. By comparing G4 with these control groups, we can isolate the impact of combining different visual styles under a constant data volume.
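The composition of the groups and controls described above can be expressed as a small sketch. The function name `group_sizes` is hypothetical, and the baseline of 3000 is used only as an example that corresponds to the First Impression v2 setting.

```python
from fractions import Fraction


def group_sizes(n_baseline):
    """Return (n_original, n_style_abstracted) for G1-G4 and Controls A/B.

    G1-G4 keep the original baseline fixed and add style-abstracted data
    equal to 0, 1/3, 2/3, and 1 times the baseline. Controls A and B
    match the total sample count of G4.
    """
    ratios = {
        "G1": Fraction(0),
        "G2": Fraction(1, 3),
        "G3": Fraction(2, 3),
        "G4": Fraction(1),
    }
    sizes = {name: (n_baseline, int(n_baseline * r)) for name, r in ratios.items()}
    total_g4 = sum(sizes["G4"])
    sizes["Control A"] = (total_g4, 0)  # 100% original videos
    sizes["Control B"] = (0, total_g4)  # 100% style-abstracted videos
    return sizes


sizes = group_sizes(3000)
```

Comparing G4 with the two controls keeps the total count fixed, so any difference can be attributed to the mixture of styles rather than to the amount of data.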
4.2.1. First Impression v2 Preprocessing
The videos in the First Impression v2 dataset are relatively short, with an average duration of approximately 15 s. Given this duration, we processed the videos directly to maintain visual consistency. The validation set consists of 2000 videos, from each of which 15 original frames were extracted to ensure data reliability.
For the training set, we utilized 6000 videos in total to construct our experimental groups. A baseline was formed using 3000 videos, where 15 original frames were extracted from each. To evaluate the incremental impact of style abstraction, style-abstracted frames were sequentially added in increments corresponding to 1000 videos (each contributing 15 frames) across the groups G1 to G4. Furthermore, two control settings were introduced to isolate the effect of the augmentation type: one trained with 6000 original videos only (Control A) and the other with 6000 style-abstracted videos only (Control B). In total, this results in six experimental configurations, as summarized in
Table 3.
4.2.2. UDIVA v0.5 Preprocessing
For the UDIVA v0.5 dataset, due to the relatively small number of participants, we merged all four task-based scenarios (Talk, Lego, Ghost, and Animals). The dataset was originally split into training, validation, and test sets according to participant identity. Each participant’s videos, ranging from several minutes to tens of minutes, were further segmented into one-minute clips, resulting in approximately 1100 videos for the validation set and 6516 videos for the training set. In the validation set, 15 frames were uniformly sampled from each video.
The training set includes 99 participants in total. As a baseline, data from 49 participants were used exclusively with 15 uniformly sampled original frames each. To evaluate the impact of our approach, the remaining 50 participants were utilized to generate style-abstracted samples, which were then incrementally combined with the baseline data to form groups G1 through G4. Furthermore, two control settings were introduced by training the model using data from all 99 participants with either 100% original videos (Control A) or 100% style-abstracted videos (Control B). In total, this design results in six experimental configurations, as summarized in
Table 4.
4.2.3. KETI Dataset Preprocessing
The KETI dataset was processed in a manner similar to UDIVA v0.5, but at a significantly larger scale. We selected the scenario Planning a Korean trip for a foreign friend, which has previously demonstrated the most robust personality recognition performance. Since the KETI dataset is not pre-split, we first segmented each participant’s approximately 20 min video into 30 s clips to both improve recognition performance and address computational memory constraints. All clips originating from the same participant were assigned to the same data split to prevent identity leakage.
The dataset was partitioned by participant identity into training, validation, and test sets with an approximate 3:1:1 ratio. Specifically, participant IDs 1–310 were assigned to the training set, IDs 311–410 to the validation set, and IDs 411–516 to the test set, resulting in 29,037 training clips, 9184 validation clips, and 13,043 test clips. In this study, we adopted the 24–120 labeling scheme and divided each label by 120 so that all values fall within the [0, 1] interval.
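The participant-level split and the label normalization described above can be sketched as follows. The function names are hypothetical; the ID ranges and the 24–120 scale are taken from the text.

```python
def assign_split(participant_id):
    """Map a KETI participant ID to its split using the stated ID ranges."""
    if 1 <= participant_id <= 310:
        return "train"
    if 311 <= participant_id <= 410:
        return "val"
    if 411 <= participant_id <= 516:
        return "test"
    raise ValueError(f"unknown participant ID: {participant_id}")


def normalize_label(raw_score, max_score=120):
    """Divide a raw trait score on the 24-120 scale by the maximum score."""
    if not 24 <= raw_score <= max_score:
        raise ValueError("raw score must lie between 24 and 120")
    return raw_score / max_score


# Count participants per split; every clip inherits its participant's split.
counts = {"train": 0, "val": 0, "test": 0}
for pid in range(1, 517):
    counts[assign_split(pid)] += 1
```

Because the minimum raw score is 24, the normalized labels occupy the upper part of the unit interval, from 0.2 to 1.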
For the validation set, 15 original frames were uniformly sampled from each video. The training set consists of 310 participants, including a baseline subset of 155 participants. To evaluate the incremental impact of our approach, the remaining 155 participants were used to generate style-abstracted samples, which were then progressively combined with the baseline data to form groups G1 through G4. Furthermore, two control settings were introduced by training the model using data from all 310 participants with either 100% original videos (Control A) or 100% style-abstracted videos (Control B). In total, this results in six experimental configurations, as summarized in
Table 5.
4.2.4. Emotion Dataset Preprocessing
The emotion recognition dataset was split following the same strategy used for the KETI and UDIVA v0.5 datasets. Specifically, we employed a subject-independent splitting scheme to mitigate shortcut learning and prevent data leakage. Based on participant identity, all samples from the same speaker were assigned to a single subset. This ensures that the model cannot rely on speaker-specific traits or script-dependent linguistic patterns, and instead is encouraged to learn more generalizable emotional cues.
The dataset was divided into training and validation sets according to speaker identity, resulting in 3966 training samples from 67 speakers and 1348 validation samples from 23 speakers. Within the training set, data from 34 speakers (2047 samples) were retained as a fixed original baseline, while the remaining 33 speakers (1919 samples) were used to generate style-abstracted data for incremental experiments across groups G1 to G4. Furthermore, three control settings were introduced using the full training set of 3966 samples: (Control A) 100% original videos, (Control B) 100% style-abstracted videos, and (Control C) a balanced 1:1 mixture of original and style-abstracted videos. In total, this design results in seven experimental configurations, as summarized in
Table 6.
4.3. Models Used in Experiments
This section presents the architectures of three Transformer-based video models used for personality and emotion recognition: ViViT, VST, and TimeSformer. Detailed explanations of their structures and characteristics are provided in the following subsections.
To ensure consistency in comparison and enhance the efficiency of spatio-temporal feature extraction, we consistently used a CNN (Convolutional Neural Network)-based R2plus1D architecture as the visual backbone for all three models. Unlike standard 3D convolutions, which suffer from high computational cost and a large number of parameters that increase the risk of overfitting, the R2plus1D architecture factorizes the 3D convolution into two sequential operations. The first is a 2D convolution across spatial dimensions to capture shapes and objects within each frame, followed by a 1D convolution along the temporal dimension to model the evolution of these spatial features over time. This decomposition significantly reduces the total number of parameters, improves efficiency, and also introduces a non-linear activation between the spatial and temporal operations. As a result, the network’s expressive power is enhanced, enabling it to learn more complex and robust feature representations.
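To make the effect of the factorization concrete, the weight counts of a full 3D convolution and its (2+1)D counterpart can be compared with simple arithmetic. This is a sketch under stated assumptions: bias terms are ignored, and the intermediate channel count `c_mid` between the spatial and temporal convolutions is a free design parameter whose value here is chosen purely for illustration.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d


def conv2plus1d_params(c_in, c_out, t, d, c_mid):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal


# Illustrative layer: 64 -> 64 channels with a 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, t=3, d=3)
factorized = conv2plus1d_params(64, 64, t=3, d=3, c_mid=64)
```

The saving depends on `c_mid`: with `c_mid` equal to the output width, the factorized layer in this example uses fewer than half the weights, whereas a larger `c_mid` can bring the count back up to that of the full 3D convolution.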
The R2plus1D backbone was originally pre-trained on the Kinetics-400 dataset, which consists of natural, real-world videos. We acknowledge the potential domain gap between natural textures and the abstract, style-abstracted inputs used in our study. To address this issue, the backbone was not used with frozen weights. Instead, the entire network was fine-tuned end-to-end on our dataset. This allows the pre-trained spatial filters and temporal kernels to adapt to the simplified textures and quantized color spaces of style-abstracted videos, enabling the extraction of robust spatiotemporal representations relevant to our task.
Owing to its efficiency and strong performance in spatiotemporal modeling, R2plus1D has been widely adopted as a backbone for video understanding tasks. Therefore, we utilized the R2plus1D backbone to extract spatio-temporal features from the input video, where the raw video data is represented as .
4.3.1. Video Vision Transformer (ViViT)
ViViT was the first to extend the Vision Transformer from images to videos, with the goal of effectively modeling spatio-temporal information using a global self-attention mechanism [
34]. As shown in
Figure 2, the ViViT-based personality recognition model used in our study first feeds time-ordered video frames into a pre-trained R2plus1D backbone to extract spatio-temporal features. The resulting features are divided into 3D patches, which are flattened, linearly projected into tokens, and augmented with positional encoding. These tokens are then processed by Transformer blocks equipped with Factorized Self-Attention (FSA), which models spatial and temporal dependencies separately in a sequential manner. Finally, the model utilizes a regression or classification layer to output the five OCEAN personality scores or the four emotion indicators.
4.3.2. TimeSformer
TimeSformer is one of the earliest Transformer-based models for video understanding. It extends the application of transformers from images to videos by analyzing spatio-temporal features from sequences of frame-level patches [
35].
As illustrated in Figure 3, sequential frames extracted from a video are first processed by the pre-trained R2plus1D backbone to extract spatio-temporal features. The output is then divided into multiple 2D patches, which are flattened and linearly projected into tokens, producing a sequence of T × N patch tokens, where T is the number of frames and N is the number of patches per frame. Positional embeddings are added to encode both temporal and spatial locations. The resulting token sequence is then processed by TimeSformer blocks employing Divided Space-Time Attention, where temporal attention and spatial attention are applied sequentially. Finally, the model utilizes a regression or classification layer to output the five OCEAN personality scores or the four emotion indicators.
4.3.3. Video Swin Transformer (VST)
The VST model is derived from the Swin Transformer, extending its architecture from spatial to spatio-temporal domains to enable video understanding. It retains the core shifted-window attention mechanism of the Swin Transformer, which allows for the efficient capture of both local and global spatio-temporal information when processing videos.
As illustrated in
Figure 4, spatio-temporal features are first extracted using the pre-trained R2plus1D backbone. The extracted features are then divided into spatio-temporal patches with shape
, which are linearly projected into embedding vectors. Video Swin Transformer blocks are subsequently applied, using 3D window-based and shifted-window attention mechanisms to capture spatio-temporal dependencies. Finally, the model utilizes a regression or classification layer to output the five OCEAN personality scores or the four emotion indicators.
4.4. Model Training
All experiments were conducted under a unified training and validation protocol. Across all configurations, the validation set consisted exclusively of original videos, while the training set was constructed by progressively increasing the proportion of style-abstracted samples. This design ensures that evaluation is consistently performed in the original visual domain, enabling a direct assessment of generalization rather than mere adaptation to stylized appearances.
For the First Impression v2 dataset, four experimental groups were defined based on its 6000 training videos. A fixed baseline of 3000 original videos was maintained across all groups, while 1000, 2000, and 3000 style-abstracted versions of the remaining videos were incrementally added. This resulted in training sets of 4000, 5000, and 6000 videos. This setup allows the effect of style-abstracted data to be isolated while keeping the original data distribution constant.
For the UDIVA v0.5, KETI, and emotion datasets, experimental groups were constructed at the participant level to prevent identity leakage across training and validation sets. Although the grouping strategy was tailored to each dataset’s structure, the core experimental principle remained consistent: a fixed subset of original samples was retained as the baseline, while the proportion of style-abstracted data was gradually increased. In all cases, the validation set consisted only of original videos, ensuring that performance changes could be clearly attributed to the introduction of style-abstracted samples rather than to shifts in the original data distribution. This design allows the effect of style abstraction to be evaluated consistently under participant-independent settings across both personality and emotion recognition tasks.
By fixing the validation domain to original videos and introducing style-abstracted samples only during training, this protocol explicitly tests our core hypothesis that style abstraction can function as a regularization mechanism that suppresses shortcut learning within the evaluated domain. Rather than encouraging adaptation to a specific visual style, our style-abstraction-based data augmentation promotes greater reliance on behavior-relevant cues, such as facial motion dynamics and gesture patterns, which tend to remain more stable across varying visual conditions.
All experiments were conducted using three Transformer-based video models: ViViT, VST, and TimeSformer. This consistent training design across multiple architectures ensures that the observed performance improvements reflect enhanced robustness and generalization, rather than a mere memorization of stylized appearances or increased data redundancy.
4.5. Experimental Environment
All experiments were conducted in a high-performance computing environment, as summarized in Table 7. The hardware setup included an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz, four NVIDIA GeForce RTX 4090 GPUs, and 192 GB of RAM. The software environment was based on Ubuntu 20.04.6 LTS, with implementation in Python 3.10.13, PyTorch 2.2.0, and Torchvision 0.17.0. GPU computation was accelerated with CUDA 12.2.
As summarized in
Table 8, the models were trained using the AdamW optimizer, which is effective for weight decay regularization. Training was conducted for 120 epochs with a learning rate of
and a batch size of 4. To adequately capture temporal information, each video input was sampled at 15 frames per clip.
4.6. Experiment Results
As described earlier, to evaluate the impact of style-abstracted data on the performance of personality and emotion recognition models, we conducted experiments using four datasets: First Impression v2, UDIVA v0.5, KETI, and the emotion dataset. For model comparison, we employed three architectures: ViViT, VST, and TimeSformer. The evaluation metrics were selected according to the label characteristics of each dataset, as explained in the previous section. Specifically, for First Impression v2 and KETI, since the labels are normalized continuous values within the [0, 1] range, we adopted the 1-MAE metric. This choice is motivated by two factors: first, unlike MSE, which squares each error and therefore shrinks the small errors typical of this narrow range, MAE provides a linear and direct representation of error magnitude; second, the 1-MAE transformation offers an intuitive accuracy-like score aligned with standard benchmarks in personality computing.
For UDIVA v0.5, with labels represented as continuous float values in the [−3, 3] range, we applied the MSE (Mean Squared Error) metric. We chose MSE over MAE for this dataset because of its quadratic penalty property: it disproportionately penalizes larger errors compared to smaller ones. Given the wider value range and the polarized nature of the labels, employing MSE ensures that significant deviations, such as predicting opposite traits, are heavily weighted, thereby encouraging the model to minimize extreme prediction failures.
In addition to personality recognition, we also evaluated emotion recognition performance to further strengthen the validity of our experimental findings. Since the emotion labels are categorical, classification accuracy was used as the evaluation metric, allowing direct comparison of prediction correctness across classes. To ensure the stability and reliability of the results, all major experiments were repeated three times under the same experimental conditions, and the final performance is reported as the mean ± standard deviation over the three runs.
Equation (6) defines the 1-MAE metric as the complement of the Mean Absolute Error computed between the predicted values $\hat{y}_i$ and the ground-truth values $y_i$, where $N$ denotes the total number of samples. Higher values, closer to 1, indicate better performance.
$$1 - \mathrm{MAE} = 1 - \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right| \qquad (6)$$
Equation (7) defines the MSE metric as the mean of the squared differences between the predicted values $\hat{y}_i$ and the ground-truth values $y_i$, where $N$ denotes the total number of samples. Lower values indicate better prediction performance.
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 \qquad (7)$$
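The three evaluation metrics described above, together with the mean ± standard deviation reporting over repeated runs, can be sketched in plain Python. The function names and the toy values are illustrative assumptions, and the use of the sample standard deviation is also an assumption, since the source does not state which form was used.

```python
from statistics import mean, stdev


def one_minus_mae(pred, target):
    """Complement of the mean absolute error; higher is better."""
    return 1.0 - mean(abs(p - t) for p, t in zip(pred, target))


def mse(pred, target):
    """Mean of squared differences; lower is better."""
    return mean((p - t) ** 2 for p, t in zip(pred, target))


def accuracy(pred_labels, true_labels):
    """Fraction of samples whose predicted class matches the true class."""
    hits = sum(p == t for p, t in zip(pred_labels, true_labels))
    return hits / len(true_labels)


def mean_std(run_scores):
    """Summarize repeated runs as (mean, sample standard deviation)."""
    return mean(run_scores), stdev(run_scores)


# Toy values (illustrative only, not experimental results).
pred = [0.5, 0.7, 0.4]
target = [0.6, 0.6, 0.5]
score_mae = one_minus_mae(pred, target)
score_mse = mse(pred, target)
acc = accuracy(["happy", "angry", "sad", "neutral"],
               ["happy", "sad", "sad", "neutral"])
avg, spread = mean_std([0.90, 0.91, 0.92])
```

In this toy case every prediction is off by 0.1, so the squared errors are ten times smaller than the absolute errors, which illustrates why MSE compresses small deviations on a [0, 1] label range.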
As shown in Table 9, Table 10, Table 11, and Table 12, the four benchmarks generally exhibited a positive trend in performance as the proportion of style-abstracted data in the training set increased.
On the First Impression v2 dataset, the TimeSformer model showed a steady improvement, with its score rising from 0.9087 (G1) to 0.9138 (G4), representing a gain of 0.0051. Both ViViT and VST also reached their highest scores at G4 (100% Style-Abs.), suggesting that reducing stylistic variance through style abstraction may assist the models in capturing more generalized personality-related cues.
For the UDIVA v0.5 dataset, a reduction in MSE was observed across all architectures. Specifically, the ViViT model achieved its lowest MSE at G4, with a decrease of 0.0758 (from 1.1708 to 1.0950). While VST and TimeSformer reached their peak performance at G3 (66% Style-Abs.), the overall reduction in prediction gaps suggests that style-abstracted data helps the model avoid being distracted by unnecessary visual details, such as background or lighting.
On the KETI dataset, the performance gains were modest but consistent, with all models reaching their peak scores at G4 (100% Style-Abs.). The TimeSformer model improved by 0.0023, while ViViT and VST each saw a marginal gain of 0.0015. Although these improvements are incremental, their consistency across different architectures highlights the potential of the proposed style-abstraction-based augmentation method.
Finally, on the emotion dataset, the TimeSformer model demonstrated an accuracy increase of 3.44 percentage points (from 54.67% to 58.11%) in the G4 setting. Similarly, VST improved by 2.62 percentage points at G4, and ViViT showed its best performance at G3 (66% Style-Abs.) with an increase of 1.83 percentage points. These results suggest that our style-abstraction-based data augmentation may encourage models to prioritize expressive motion dynamics over static, low-level visual features, thereby contributing to more stable performance in affective computing tasks.
To verify that the observed performance improvements are not simply caused by an increase in the number of training samples, but rather by the effect of our style-abstraction-based data augmentation, we conducted experiments using the three video models (ViViT, VST, and TimeSformer) across four datasets. In addition to three personality recognition datasets (First Impression v2, UDIVA v0.5, and KETI), we further evaluated the approach on a four-class categorical emotion dataset to assess its generalizability. For each dataset, three training settings were considered: 100% original videos, 100% style-abstracted videos, and a balanced mixture with 50% original and 50% style-abstracted videos. Importantly, to eliminate the effect of increased training data, the total number of training samples was kept identical across all configurations for each dataset, with only the proportion of original and style-abstracted videos varying.
For the UDIVA v0.5 (Table 13) and KETI (Table 14) datasets, the training and validation sets were split by participant identity, ensuring that no individual appears in both sets. This identity-disjoint protocol prevents identity leakage and encourages the models to learn more generalizable visual cues for personality recognition. Under this setting, the balanced mixture strategy (50% original + 50% style-abstracted videos) consistently achieves the best performance, suggesting that combining style-abstracted and original videos increases data diversity while preserving essential visual information.
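An identity-disjoint split of this kind can be sketched as below. The function name `identity_disjoint_split`, the `(sample_id, participant_id)` input format, the 20% validation fraction, and the seed are assumptions made for illustration; the actual split sizes used for UDIVA v0.5 and KETI are not specified in this section.

```python
import random
from collections import defaultdict

def identity_disjoint_split(samples, val_fraction=0.2, seed=0):
    """Split (sample_id, participant_id) pairs so that no participant
    contributes samples to both the training and validation sets."""
    by_participant = defaultdict(list)
    for sample_id, participant_id in samples:
        by_participant[participant_id].append(sample_id)

    # Shuffle participants, not samples, so whole identities move together.
    participants = sorted(by_participant)
    random.Random(seed).shuffle(participants)
    n_val = max(1, round(len(participants) * val_fraction))
    val_participants = set(participants[:n_val])

    train = [s for p in participants[n_val:] for s in by_participant[p]]
    val = [s for p in participants[:n_val] for s in by_participant[p]]
    return train, val, val_participants

# Hypothetical data: 20 clips recorded from 5 participants.
samples = [(f"clip_{i}", f"P{i % 5}") for i in range(20)]
train, val, val_participants = identity_disjoint_split(samples)
print(len(train), len(val), sorted(val_participants))
```

Splitting at the participant level is what prevents a model from scoring well simply by recognizing a person it has already seen during training.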
In contrast, the results on the First Impression v2 dataset (Table 15) differ from our initial expectation. This dataset provides predefined training and validation splits, which we follow to ensure comparability with prior work. Under this protocol, training with 100% original videos yields the best performance. A possible explanation is that identity-related appearance cues may overlap between the splits, allowing the model to benefit from learning person-specific visual characteristics.
To further examine the generalizability of the proposed approach, we evaluated it on a four-class categorical emotion dataset. As shown in Table 16, the results follow a trend similar to that of the identity-disjoint personality datasets. All three models achieved their best performance with the balanced mixture configuration. Notably, the ViViT model showed the largest improvement, with accuracy increasing from 64.47% to 68.25% (a gain of 3.78 percentage points). This suggests that providing a mixture of styles may help models focus on expressive facial motion and intensity dynamics, which are essential for emotion recognition.
Overall, these results indicate that the effectiveness of our style-abstraction-based data augmentation can vary depending on the dataset characteristics and split strategy. When identity-related visual shortcuts are limited, incorporating a balanced mixture of style-abstracted data can assist models in learning more stable representations for personality and emotion recognition.
4.7. Qualitative Spatial Attention Visualization
To investigate how the proportion of style-abstracted data in the training set affects the model’s spatial attention, we conduct experiments using the VST model on the KETI and UDIVA v0.5 datasets.
Figure 5 compares the attention maps generated under three training settings: (1) a model trained exclusively on original videos (left column), (2) a model trained on a balanced mixture (50% original and 50% style-abstracted videos) (middle column), and (3) a model trained exclusively on style-abstracted videos (right column).
In the visualization, the numerical values in each patch represent normalized attention scores. The results show that the model selectively attends to key expressive regions across both benchmarks. As the proportion of style-abstracted data increases, a noticeable qualitative shift in spatial attention emerges. Specifically, the attention distribution becomes increasingly concentrated on the subject’s primary expressive regions, such as the face and hands.
By reducing background textures and other visually distracting patterns, the proposed style-abstraction-based augmentation may encourage the model to rely more on structural and behavioral cues than on superficial appearance details. Models trained with a balanced mixture of original and style-abstracted data tend to focus more consistently on subject-related regions where social and emotional signals are expressed. This shift in attention from background information to person-centered cues, such as facial expressions and hand gestures, may help the model learn more stable behavioral representations and improve generalization across different environments.
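One possible way to summarize such attention maps numerically is to measure how much of the normalized attention falls inside a region of interest. The sketch below is only an illustration under stated assumptions: it assumes the per-patch scores are normalized to sum to one (the normalization used for Figure 5 is not specified here), and the grid values, the `face_box` coordinates, and both function names are hypothetical.

```python
def normalize_attention(grid):
    """Scale a 2-D grid of non-negative attention values so they sum to 1."""
    total = sum(sum(row) for row in grid)
    if total <= 0:
        raise ValueError("attention grid must have positive total mass")
    return [[v / total for v in row] for row in grid]

def attention_share(grid, region):
    """Fraction of normalized attention inside region.

    region = (top, left, bottom, right) in patch indices, end-exclusive.
    """
    top, left, bottom, right = region
    norm = normalize_attention(grid)
    return sum(norm[r][c] for r in range(top, bottom) for c in range(left, right))

# Toy 4x4 patch grid with higher raw attention in the centre.
raw = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 4.0, 4.0, 1.0],
    [1.0, 4.0, 4.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
face_box = (1, 1, 3, 3)  # hypothetical patch coordinates of the face
print(round(attention_share(raw, face_box), 3))  # 16 / 28 -> 0.571
```

Comparing this share across the three training settings would give a simple scalar counterpart to the visual comparison, provided the region annotations are available.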
4.8. Performance Comparison of Data Augmentation Methods
In this section, we evaluate the effectiveness of the proposed style-abstraction-based data augmentation by comparing it with a conventional geometric rotation baseline. For a fair comparison, both settings used a balanced mixture of 50% original and 50% augmented samples. Specifically, 50% of the training samples remained unchanged, while the remaining 50% were either randomly rotated within a fixed angular range (Original + Rotation) or transformed using the proposed style-abstraction pipeline (Original + Style-Abs.).
The comparative results across multiple video architectures are summarized in Table 17. Overall, the balanced mixture with style abstraction performs on par with or better than the rotation-based baseline across the evaluated benchmarks.
Performance Improvement on UDIVA v0.5: The most noticeable improvement is observed on the UDIVA v0.5 dataset. For all evaluated models, the proposed style-abstraction-based data augmentation achieves lower prediction errors than the rotation baseline. In particular, the ViViT model shows the largest improvement, reducing the error from 1.1716 to 1.0851.
Results on KETI and First Impression v2: On these datasets, the proposed approach achieves performance that is largely comparable to the rotation baseline. Although the numerical differences are smaller than those observed for UDIVA v0.5, the results remain consistent across different backbone models. This suggests that visual abstraction can serve as a viable augmentation strategy while maintaining stable predictive performance.
Observed Improvements in Emotion Recognition: The proposed balanced mixture also shows a positive impact on the Emotion dataset. In particular, with the VST model, our style-abstraction-based data augmentation reaches an accuracy of 62.09% compared to 58.09% with the rotation baseline, a gain of 4.00 percentage points. Similarly, TimeSformer shows an increase of 3.41 percentage points. This suggests that by simplifying unnecessary visual details, the model may be better able to focus on essential affective cues such as facial motion and posture, which are important for emotion recognition.
Overall, while rotation-based augmentation introduces spatial variation, the proposed style-abstraction-based augmentation provides an alternative form of variability by abstracting low-level appearance patterns. This abstraction may encourage the model to focus more strongly on behavior-relevant cues such as facial expressions and hand gestures, while reducing sensitivity to background textures and other appearance-related artifacts.
4.9. Impact of Style Abstraction Intensity on Model Performance
In this section, we investigate how different levels of stylistic abstraction, generated by our style abstraction pipeline, influence model performance on the First Impression v2 dataset. For this analysis, we constructed an augmented training set consisting of 3000 original samples and 1000 style-abstracted samples, where the intensity of the transformation was systematically varied.
Figure 6 illustrates the visual transitions generated by our pipeline, ranging from the original frame to three distinct levels of intensity: low, medium, and high.
Notably, the medium style abstraction intensity used in this evaluation corresponds to the default parameter configuration adopted in our previous experiments. As shown in Table 18, this medium level consistently yielded the best performance across all evaluated architectures (ViViT, VST, and TimeSformer).
A plausible interpretation is that the low intensity retains redundant pixel-level noise from the original frames, whereas the high intensity loses subtle facial cues through excessive abstraction. The medium intensity appears to provide the most effective balance between visual abstraction and the preservation of affect-relevant facial dynamics. These results support our selection of the medium style abstraction configuration as the best of the three tested settings on the First Impression v2 dataset.
5. Discussion, Limitations, and Future Directions
While the proposed style-abstraction-based data augmentation demonstrates promising performance improvements, several limitations remain to be addressed. First, the current style-abstraction-based data augmentation process relies on a sequence of deterministic operations, including Bilateral Filtering, Edge Detection to create the edge map E, Color Space Conversion to HSV, Adaptive Color Quantization, and Outline Drawing. At present, this pipeline is treated as a black box regarding its specific contribution to model regularization. Due to the computational complexity and the defined scope of this study, a granular ablation study on these individual components was not feasible. Future work should involve a comparative analysis to identify which specific operations or combinations of operations provide the most significant regularization benefits, which will be key to optimizing the augmentation pipeline for diverse vision tasks.
Another key limitation is the restricted style diversity of our current style-abstraction-based data augmentation. Future studies should explore diffusion-based generative stylization and controllable multi-style augmentation to improve expressive variation and robustness. Such generative approaches could allow for precise control over stylistic parameters, such as stroke intensity and color palette, further enhancing the model’s generalization capabilities across different domains.
Furthermore, we acknowledge that the style-abstraction-based data augmentation process can obscure nuanced physical cues, such as subtle skin textures and tones. This loss of detail may distort facial or bodily cues unevenly across different demographic groups. Because the current datasets lack fine-grained demographic metadata, a comprehensive subgroup failure analysis was not feasible in this study.
Ensuring that style-abstraction-based augmentation does not introduce or exacerbate unintended biases is a priority for future iterations. Finally, extending this framework to multimodal fusion by integrating speech and text modalities remains a critical direction for achieving fair and robust affective intelligence in real-world AGI systems.