Style-Abstraction-Based Data Augmentation for Robust Affective Computing

Qiu, Xu; Kim, Taewan; Kim, Bongjae

doi:10.3390/app16063109

by Xu Qiu †, Taewan Kim † and Bongjae Kim *

Department of Computer Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2026, 16(6), 3109; https://doi.org/10.3390/app16063109

Submission received: 19 February 2026 / Revised: 16 March 2026 / Accepted: 20 March 2026 / Published: 23 March 2026

(This article belongs to the Special Issue Advances in Computer Vision and Digital Image Processing)


Abstract

Personality recognition and emotion recognition, two core tasks within affective computing, are fundamentally constrained by data scarcity, as collecting and annotating human behavioral data is expensive and restricted by privacy concerns. Under these limited data conditions, existing models tend to rely on superficial shortcut features such as background appearance, lighting conditions, or color variations, rather than behavior-relevant cues including facial expressions, posture, and motion dynamics. To address this issue, we propose Style-Abstraction-based Data Augmentation, a style transfer-based augmentation strategy that reduces dependency on low-level appearance information while preserving high-level semantic cues. Specifically, we employ cartoonization to generate stylized variants of training videos that retain expressive characteristics but remove stylistic bias. We validate our approach on three diverse personality benchmarks (First Impression v2, UDIVA v0.5, and KETI) and one emotion benchmark (Emotion Dataset) using state-of-the-art models including ViViT (Video Vision Transformer), TimeSformer, and VST (Video Swin Transformer). Our experiments indicate that increasing the proportion of style-abstracted data in the training set can improve performance on the evaluated datasets. Notably, our method yields consistent gains across all benchmarks: a 0.0893 reduction in MSE on UDIVA v0.5 (with VST), a 0.0023 improvement in 1-MAE on KETI (with TimeSformer), and a 0.0051 improvement on First Impression v2 (with TimeSformer). Furthermore, extending style-abstraction-based data augmentation to a four-class categorical emotion recognition task demonstrates similar performance gains, achieving up to a 3.44% accuracy increase with the TimeSformer backbone. These findings indicate that our style-abstraction-based data augmentation facilitates learning of behavior-relevant features by reducing reliance on superficial shortcuts. Overall, cartoonization-based style abstraction for data augmentation functions as both an effective augmentation strategy and a regularization mechanism, encouraging the model to learn more stable and generalizable representations for affective computing applications.

Keywords:

affective computing; personality recognition; data augmentation; shortcut learning

1. Introduction

A central goal of next-generation artificial intelligence (AI) is to build systems that can interact with humans in a natural and effective manner. Achieving this goal requires not only understanding explicit semantic content, but also perceiving deeper human attributes such as personality traits and emotional states, which enable AI systems to respond in a more adaptive and human-like way. In the fields of personality and emotion recognition, such capability is considered a key foundation for more natural human–AI interactions [1]. However, progress in this area remains constrained by a practical challenge: high-quality real-world data are often limited in scale. Collecting and annotating video data is typically expensive, time-consuming, and subject to privacy and ethical constraints, making it difficult for existing datasets to cover sufficiently diverse behavioral patterns and environmental conditions. As a result, models trained on such limited data are more prone to overfitting and may disproportionately rely on dataset-specific appearance or contextual cues.

Recently, in fields such as medicine and autonomous driving, the use of synthetic data to augment real-world datasets has attracted increasing attention [2,3]. Inspired by this trend, we ask whether transforming existing real-world data during training, rather than generating fully synthetic data from scratch, can also benefit model learning. Based on this intuition, we propose style-abstraction-based data augmentation. Specifically, this method employs cartoonization to generate visually abstract representations that preserve key behavioral information, such as facial dynamics, body posture, and gestures, while suppressing low-level visual cues, including texture, illumination, background clutter, and appearance-specific details. We hypothesize that such style abstraction acts as an effective regularizer, encouraging the model to rely less on task-irrelevant superficial patterns and more on higher-level behavioral representations related to personality and emotion. Following this idea, we jointly use original and style-abstracted videos during training to enrich data diversity and further improve recognition performance in real-world scenarios.

To empirically test our hypothesis, we designed a rigorous experimental framework. We evaluated performance using the standard Big-5 personality model [4] and four emotion categories. Our evaluation covered four diverse benchmark datasets: First Impression v2 [5], UDIVA v0.5 [6], KETI, and an emotion dataset. These datasets span different languages, interaction contexts (monologues vs. dialogues), annotation methods, and tasks, enabling a robust assessment of our method’s generalization capabilities. Covering both tasks allows us to determine whether visual abstraction through cartoonization improves robustness not only in continuous personality regression but also in discrete emotion classification, thereby validating broader applicability across affective computing tasks.

As a concrete form of visual abstraction, we employed cartoonization. We then systematically trained state-of-the-art video models, including ViViT (Video Vision Transformer) [7], VST (Video Swin Transformer) [8], and TimeSformer [9], on training sets created with progressively larger proportions of abstracted data.

Our experiments show a consistent trend. As the proportion of abstracted data in the training set increases, we observe improvements in personality and emotion recognition performance across multiple models and datasets. These results suggest that visual abstraction can act as a form of regularization by reducing the influence of appearance-related features. Furthermore, the findings indicate that the proposed approach can serve as an effective data augmentation strategy that encourages models to focus on behavior-relevant cues. In this sense, the method may provide a practical direction for alleviating data scarcity challenges in affective computing. Specifically, our main contributions are threefold.

Insights into Style-Abstraction-based Regularization: We show that our style-abstraction-based augmentation can act as a regularization mechanism that reduces the model’s reliance on shortcut cues, such as background textures, lighting conditions, and other environmental artifacts. By abstracting these appearance-related factors, the model is encouraged to focus more on behavior-relevant cues, such as facial expressions and hand gestures, resulting in more stable representation learning for affective computing tasks.
A Practical Framework for Robust Affective Computing: We introduce style-abstraction-based data augmentation, a lightweight yet effective framework based on cartoonization to improve recognition models. Unlike complex generative models or full 3D simulations, our cartoonization-based style abstraction offers a practical and scalable solution to mitigate overfitting and address data scarcity, bridging the gap between simulated and real data by reframing it as a stylized-to-original paradigm.
Comprehensive Empirical Validation: We conduct extensive experiments on two tasks using large-scale benchmarks (First Impression v2, UDIVA v0.5, KETI, and an emotion dataset) with multiple state-of-the-art Transformer-based video models. Our results consistently show significant performance improvements across different demographic groups, interaction contexts, tasks, and model architectures, thereby establishing the broad applicability and effectiveness of our proposed approach.

2. Related Works

This study lies at the intersection of data augmentation and deep learning-based approaches to personality and emotion recognition. In the following, we review related work in both domains to highlight recent progress.

2.1. Generative and Simulation-Based Data Augmentation

Data augmentation is a critical technique to improve model generalization, particularly in data-scarce domains. Traditional methods such as geometric transformations and color-space adjustments [10] are computationally efficient. However, these approaches are limited to superficial modifications of existing data. They cannot generate samples with new semantic or stylistic attributes, which becomes a critical limitation when training data is inherently biased.

To overcome this, recent work has focused on two primary paradigms: deep generative models and simulation-based data generation. Deep generative models, such as Generative Adversarial Networks (GANs) [11] and Variational Autoencoders (VAEs) [12], learn the underlying data distribution and synthesize high-fidelity samples. More recently, Diffusion Models [13] have demonstrated superior quality and stability in data generation. Among recent generative approaches, large-scale pre-trained text-to-image models have gained particular attention for controllable data augmentation [14].

Alongside generative models, simulation-based approaches have emerged as a powerful paradigm for creating large-scale, perfectly annotated datasets. This approach constructs controllable virtual 3D environments for systematic data generation. A prominent example is the Unity Perception package introduced by Borkman et al. [15], which provides a framework for generating vast quantities of synthetic data through domain randomization: systematically varying parameters such as lighting, object textures, and camera positions. Their work demonstrates that models trained on a mix of such synthetic data and a small portion of real-world data can surpass the performance of models trained solely on real-world data, thereby bridging the gap between simulated and real domains.

2.2. Shortcut Learning and Domain Generalization Challenges

Deep learning models often rely on shortcut cues such as background textures, lighting conditions, or other appearance-related signals rather than semantically meaningful behavioral information. This shortcut reliance can lead to poor performance under domain shifts and limits generalization beyond the training environment [16]. To address this issue, domain generalization research aims to encourage models to learn invariant representations that remain stable across different environments. Stylization-based augmentation has been proposed as an effective approach, as it modifies structure-independent textures while preserving high-level semantic cues, thereby reducing reliance on superficial appearance patterns. However, most existing stylization methods primarily increase appearance diversity through texture randomization and do not explicitly suppress shortcut cues present in the data. In contrast, our approach applies visual abstraction through cartoonization to remove low-level appearance patterns while preserving behavioral structures. This encourages models to rely more on semantically meaningful cues, such as facial expressions and hand gestures, for personality recognition. Unlike conventional augmentation strategies that mainly introduce additional visual variability, the proposed method explicitly aims to suppress appearance-based shortcuts through style abstraction, thereby promoting the learning of behavior-relevant representations. Table 1 summarizes representative studies addressing shortcut learning and cross-domain robustness.

2.3. Deep Learning for Affective Behavior Recognition

Deep learning has advanced affective computing by modeling multimodal behavioral cues. Early methods relied on CNNs to extract visual expression features [21]. Later, 3D CNNs [22] improved spatiotemporal modeling for video-based analysis. More recently, Transformer-based architectures, including ViViT, Video Swin Transformer, TimeSformer, and multimodal Transformers [23,24], have achieved state-of-the-art performance by enabling long-range temporal reasoning and cross-modal fusion.

Despite significant progress, prior studies indicate that deep models frequently rely on shortcut features such as background appearance, lighting, and skin tone rather than semantically meaningful cues [18]. This shortcut dependency results in poor robustness under distribution shifts and limits generalization to realistic, in-the-wild environments. Consequently, mitigating appearance bias and promoting learning of behavior-relevant, domain-agnostic features remains a critical open challenge in affective computing. These limitations directly motivate our style-abstraction-based data augmentation approach, which applies visual abstraction through cartoonization to suppress shortcut cues and promote learning from behavior-relevant signals such as facial dynamics and gestures.

2.4. Recent Advances in Multimodal Personality and Emotion Recognition

Recent research in affective computing has increasingly focused on multimodal approaches for automatic personality trait assessment based on the Big Five (OCEAN) model. Ryumina et al. [25] proposed a Gated Siamese Fusion Network for multimodal personality prediction that integrates video, audio, and text modalities. Their framework employs a Siamese architecture with a gated fusion mechanism to adaptively integrate multimodal information, demonstrating improved performance on the ChaLearn First Impressions v2 and MuPTA datasets.

Bhin and Choi [26] proposed a multimodal personality recognition framework integrating audio, visual, and textual information via a self-attention-based fusion strategy. The model extracts modality-specific features using Wav2Vec2, OpenPose-based skeleton landmarks, and BERT/Doc2Vec embeddings, integrating them through a late fusion mechanism for improved personality recognition.

Beyond personality recognition, multimodal emotion recognition (MER) has also seen significant advances in addressing long-range dependency and cross-modal alignment. Chen et al. [27] proposed a framework integrating Mamba and Liquid Neural Networks (LNNs) to capture global context and transient emotional dynamics, employing Optimal Transport (OT)-based sparse alignment across video, audio, text, and EEG modalities. Evaluated on IEMOCAP and CMU-MOSEI, the model demonstrates superior performance in handling asynchronous heterogeneous multimodal data.

Fang et al. [28] introduced the EMOE framework to address modality imbalance in MER. Their Mixture of Modality Experts (MoME) module dynamically adjusts modality importance weights, while a Unimodal Distillation (UD) strategy preserves modality-specific discriminative capabilities during fusion, achieving state-of-the-art performance on multiple MER benchmarks.

3. Proposed Augmentation Technique

In this study, to ensure diverse visual representations and to rigorously evaluate domain generalization across different visual styles, we applied a style-abstraction-based data augmentation technique implemented using an OpenCV-based cartoonization function originally proposed by Lu Tianming. This method transforms real-world images into a simplified, structure-preserving cartoon style by combining edge-aware smoothing, contour extraction, and color abstraction.

By suppressing low-level texture details while retaining essential structural cues, our style-abstraction-based data augmentation enables a controlled evaluation of whether recognition models focus on high-level behavioral information rather than overfitting to superficial appearance characteristics. This property is particularly desirable for affective computing tasks, where subtle behavioral patterns such as facial dynamics and gestures should be emphasized over irrelevant stylistic variations. Moreover, by increasing visual diversity in the training data, this augmentation strategy helps alleviate data scarcity and improves both emotion and personality recognition performance in low-resource settings.

The implementation of our style-abstraction-based data augmentation consists of the following sequential steps:

1. Bilateral Filtering: Bilateral filtering is applied to perform edge-preserving smoothing by suppressing fine-grained textures, such as skin details, noise, and other high-frequency variations, while retaining salient structural boundaries. For each pixel $p$, neighboring pixels $q$ within a local window $\Omega$ are aggregated using a weighted average, where the weights are determined by both spatial proximity and color similarity. Specifically, the spatial weight is governed by the squared Euclidean distance $\lVert p - q \rVert^{2}$, with $\sigma_{s}$ controlling the spatial extent of smoothing, while the range (color) weight depends on the squared color difference $\lVert I^{(c)}(p) - I^{(c)}(q) \rVert^{2}$, with $\sigma_{r}$ regulating sensitivity to color variations. The weights are normalized by a factor $W_{p}$ to ensure unit sum, yielding the final filtered value at pixel $p$:

$\tilde{I}^{(c)}(p) = \frac{1}{W_{p}} \sum_{q \in \Omega} \exp\left( -\frac{\lVert p - q \rVert^{2}}{2\sigma_{s}^{2}} - \frac{\lVert I^{(c)}(p) - I^{(c)}(q) \rVert^{2}}{2\sigma_{r}^{2}} \right) I^{(c)}(q), \quad c \in \{r, g, b\}.$

(1)

2. Edge Detection: To preserve structural information, edge detection is performed on the smoothed image to extract salient contours. The resulting edge map is defined as

$E = E(\tilde{I}), \quad E \in \{0, 1\}^{H \times W},$

(2)

where $E(\cdot)$ denotes a standard edge detection operator, such as Canny or Sobel. The extracted edges provide structural cues for subsequent outline rendering.

3. Color Space Conversion: Since the RGB color space entangles luminance and chromatic information, the smoothed image is transformed into the HSV color space, which offers a more perceptually meaningful representation and allows brightness and color information to be handled more independently:

$I_{HSV} = \Phi_{RGB \to HSV}(\tilde{I}).$

(3)

4. Adaptive Color Quantization: To reduce color complexity and promote region-level abstraction, adaptive color quantization is performed in the HSV space. For each pixel $p$, its color is assigned to the nearest color center $c_{k} \in C$ according to the Euclidean distance:

$I_{Q}(p) = \arg\min_{c_{k} \in C} \lVert I_{HSV}(p) - c_{k} \rVert_{2}.$

(4)

5. Outline Drawing: To enhance the visibility of region boundaries, detected edge pixels are rendered in black, while non-edge pixels retain their quantized colors. The resulting image $I_{O}$ is defined as

$I_{O}(p) = \begin{cases} 0, & \text{if } E(p) = 1, \\ I_{Q}(p), & \text{otherwise}. \end{cases}$

(5)

6. Final Composition: All processing steps are sequentially combined to produce the final style-abstracted image, which exhibits reduced texture complexity while preserving essential structural information.
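
To make the sequence of steps concrete, the following is a minimal pure-Python sketch on a single-channel toy image. It is not the OpenCV-based function used in this study: the central-difference edge detector is a simple stand-in for Canny or Sobel, the color-space conversion of Step 3 is skipped because the toy image has one channel, and all function names, the window radius, the threshold, and the two color centers are illustrative assumptions.

```python
import math


def bilateral_filter(img, radius, sigma_s, sigma_r):
    """Eq. (1) for one channel; img is a list of rows of floats."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            weighted_sum, w_p = 0.0, 0.0
            # Local window Omega, clipped at the image border.
            for qy in range(max(0, y - radius), min(h, y + radius + 1)):
                for qx in range(max(0, x - radius), min(w, x + radius + 1)):
                    spatial = ((y - qy) ** 2 + (x - qx) ** 2) / (2 * sigma_s ** 2)
                    color = (img[y][x] - img[qy][qx]) ** 2 / (2 * sigma_r ** 2)
                    weight = math.exp(-spatial - color)
                    weighted_sum += weight * img[qy][qx]
                    w_p += weight  # normalization factor W_p
            out[y][x] = weighted_sum / w_p
    return out


def edge_map(img, threshold):
    """Binary edge map as in Eq. (2), using a central-difference gradient
    (an assumed stand-in for a Canny or Sobel operator)."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            edges[y][x] = 1 if math.hypot(gx, gy) > threshold else 0
    return edges


def quantize(img, centers):
    """Eq. (4): assign each pixel to its nearest color center."""
    return [[min(centers, key=lambda c: abs(v - c)) for v in row] for row in img]


def draw_outline(quantized, edges):
    """Eq. (5): edge pixels become black (0), others keep their quantized value."""
    return [
        [0.0 if e else q for q, e in zip(q_row, e_row)]
        for q_row, e_row in zip(quantized, edges)
    ]


def cartoonize(img, centers, radius=1, sigma_s=1.0, sigma_r=0.1, threshold=0.3):
    """Compose the steps: smooth, detect edges, quantize, draw outlines."""
    smoothed = bilateral_filter(img, radius, sigma_s, sigma_r)
    edges = edge_map(smoothed, threshold)
    return draw_outline(quantize(smoothed, centers), edges)


# Toy image: a dark region and a bright region, each with slight texture.
toy = [[0.18, 0.22, 0.20, 0.80, 0.78, 0.82] for _ in range(4)]
result = cartoonize(toy, centers=[0.2, 0.8])
```

On this toy input, every row of `result` is `[0.2, 0.2, 0.0, 0.0, 0.8, 0.8]`: the small texture variations inside each region collapse to a single quantized value, and the boundary between the two regions is drawn in black.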

Figure 1 shows examples of face images before and after the proposed style-abstraction-based data augmentation, illustrating the simplification of visual information and the emphasis on structural features. We conducted experiments to investigate the impact of this style abstraction technique on enhancing both the expressive power and domain robustness of video-based affective computing models.

4. Experiments

In this section, we present a series of experiments to evaluate the effectiveness of our proposed style-abstraction-based data augmentation for both personality and emotion recognition. Our evaluation is conducted on three diverse personality benchmark datasets (First Impression v2, UDIVA v0.5, and KETI) and one emotion dataset, which differ in language, interaction context, and annotation methods, as summarized in Table 2, thereby enabling an assessment of the robustness of our approach. We employ three state-of-the-art video-based models (ViViT, VST, and TimeSformer) and systematically analyze their performance by training them on datasets with progressively larger proportions of style-abstracted data.

4.1. Datasets

4.1.1. First Impression v2 Dataset

The First Impression v2 dataset is a widely used benchmark in the field of personality recognition, originally introduced in the ECCV 2016 ChaLearn LAP challenge [23]. It contains over 3000 videos collected from YouTube, which were segmented into approximately 15 s clips. This process yielded around 10,000 front-facing video clips of individuals speaking in English. The dataset is divided into training, validation, and test sets with a 3:1:1 split. The individuals in the videos cover a diverse range of nationalities, ethnicities, genders, and ages [29,30,31]. Personality labels were annotated using Amazon Mechanical Turk (AMT) based on the Big-5 OCEAN personality model.

4.1.2. UDIVA v0.5 Dataset

The UDIVA v0.5 dataset consists of videos capturing interactions among 147 volunteers from 22 countries and regions [32]. A total of 134 participants were assigned to one of four task-based sessions: Talk, Lego, Ghost, and Animals. Spanish was the primary language, followed by Catalan and English. Each participant’s voice was recorded with both a lavalier microphone and a desk-mounted omnidirectional microphone, and they also wore a first-person-view camera. The dataset includes not only participants’ videos but also metadata such as nationality, gender, and age, along with automatically extracted annotations including facial contours, body poses, hand movements, and speech transcriptions. The data is split into training (116 sessions, 99 participants), validation (18 sessions, 20 participants), and test (11 sessions, 15 participants) sets. Participants further provided self-assessed Big-5 OCEAN personality scores using the BFI-2 (Big Five Inventory–2) [33]. These scores are normalized using a z-score transformation, resulting in continuous values with a mean of 0 and a standard deviation of 1.
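
The z-score transformation mentioned above can be sketched as follows. The function name and the sample values are illustrative, and the use of the population standard deviation is an assumption, since the dataset description does not state which estimator was used.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Made-up raw trait scores for five participants.
raw = [2.0, 3.0, 4.0, 5.0, 1.0]
normalized = z_scores(raw)
```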

4.1.3. KETI Dataset

The KETI multimodal dialogue corpus is a multimodal dataset containing video recordings and transcribed dialogue from 516 participants. The participants interact in groups of four while discussing three task-oriented scenarios: prioritizing the invitation list for an event, planning the setup of a festival booth, and organizing a Korean trip for a foreign friend. In addition to these structured scenarios, free-form conversation sessions were also collected to assess the recording environment. Each video is accompanied by precise speech timing information aligned with the corresponding dialogue content. The dataset further provides OCEAN personality scores for each participant, obtained from both the IPIP-NEO-120 and the abridged IPIP-NEO-60 questionnaires, with all scores reported as integers.

4.1.4. Emotion Dataset

This categorical emotion recognition dataset comprises recordings of 100 acting students and professional actors. Each participant performed approximately 100 scripted utterances per emotion, resulting in a total of 10,351 high-resolution videos. The dataset contains seven emotion categories, namely Happiness, Surprise, Neutral, Fear, Disgust, Anger, and Sadness. For our experiments, we selected four representative emotions (Happiness, Anger, Neutral, and Sadness), resulting in a filtered subset of 5314 samples. All recordings were captured in FHD (1920 × 1080) at 30 fps in m2ts format with synchronized 16-bit, 48 kHz audio.

4.1.5. Identity Overlap Analysis in First Impression v2

Unlike UDIVA v0.5, KETI, and the emotion dataset, which are partitioned by participant ID, First Impression v2 employs an official pre-defined split. We conducted an identity overlap analysis using the YouTube video ID (extracted as the prefix of the clip index) as a proxy for individual identity. The analysis reveals substantial overlap: 1222 IDs are shared between the training and validation sets, involving 2948 out of 6000 training clips and 1674 out of 2000 validation clips. These results indicate that the official validation set is not strictly identity-disjoint, which may lead to an overestimation of generalization performance on unseen individuals.
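
An overlap analysis of this kind can be sketched as below. The clip naming pattern `<video_id>.<clip_index>.mp4` and the example file names are assumptions made for illustration, and the helper names are hypothetical.

```python
def video_id(clip_name):
    """Return the video ID prefix, assuming names like '<video_id>.<clip_index>.mp4'."""
    return clip_name.rsplit(".", 2)[0]


def identity_overlap(train_clips, val_clips):
    """Return the shared video IDs and how many clips in each split they cover."""
    train_ids = {video_id(c) for c in train_clips}
    val_ids = {video_id(c) for c in val_clips}
    shared = train_ids & val_ids
    n_train = sum(video_id(c) in shared for c in train_clips)
    n_val = sum(video_id(c) in shared for c in val_clips)
    return shared, n_train, n_val


# Made-up clip names, not taken from the dataset.
train = ["abc123.000.mp4", "abc123.001.mp4", "xyz789.000.mp4"]
val = ["abc123.002.mp4", "qrs456.000.mp4"]
shared, n_train, n_val = identity_overlap(train, val)
```

Here one video ID is shared, covering two training clips and one validation clip. The same counting applied to the official split would give the figures reported above.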

4.2. Preprocessing

A unified preprocessing and augmentation pipeline was employed for three video-based personality recognition datasets (First Impression v2, UDIVA v0.5, and KETI) and one emotion dataset. For all videos, a fixed set of 15 frames was uniformly sampled across the entire clip duration, which in many cases exceeds 15 s in length. This global sampling strategy helps the model capture long-term temporal dynamics and the progression of affective expressions, rather than relying solely on instantaneous static cues. Consequently, the approach preserves essential temporal context while maintaining computational efficiency for training large-scale video models.
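
One way to pick 15 uniformly spaced frames is sketched below. The exact index formula is not specified in this section, so taking the midpoint of each of 15 equal segments is an assumption, as is the function name.

```python
def uniform_frame_indices(num_frames, num_samples=15):
    """Return num_samples frame indices spread evenly over the whole clip.

    Each index is the midpoint of one of num_samples equal-length segments.
    """
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    return [((2 * i + 1) * num_frames) // (2 * num_samples) for i in range(num_samples)]


# A 15 s clip at 30 fps has 450 frames.
indices = uniform_frame_indices(450)
```

For a 450-frame clip this yields indices 15, 45, ..., 435, that is, one frame per second of video.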

For the training set, we designed four experimental groups (G1–G4) by progressively increasing the proportion of style-abstracted data relative to a fixed baseline of original videos. Starting from the baseline of original videos (G1), we incrementally added style-abstracted samples in stages: G2 includes an additional amount of style-abstracted data equivalent to one-third of the baseline, G3 increases this to two-thirds, and G4 reaches a balanced mixture where the amount of style-abstracted data equals the amount of original data. Across all groups, the validation set consisted entirely of original videos. This setup allows us to analyze the incremental contribution of style-abstracted samples and determine whether they provide complementary information that improves model robustness as the total data volume increases.

Furthermore, to examine whether the observed performance gains are specifically due to the effect of style abstraction rather than a simple increase in training data volume, we conducted two additional control experiments while keeping the total sample count identical to G4: (1) Control A, trained exclusively with 100% original videos, and (2) Control B, trained exclusively with 100% style-abstracted videos. By comparing G4 with these control groups, we can isolate the impact of combining different visual styles under a constant data volume.
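
The composition of the groups and controls can be written out as simple arithmetic. The sketch below assumes that the pool of style-abstracted samples is the same size as the original baseline and uses an illustrative baseline of 3000 videos; the function name is hypothetical.

```python
def group_composition(baseline):
    """Return {name: (n_original, n_style_abstracted)} for G1-G4 and both controls."""
    groups = {}
    for k in range(4):
        # G1..G4 add 0, 1/3, 2/3, and 3/3 of the baseline as style-abstracted data.
        groups[f"G{k + 1}"] = (baseline, baseline * k // 3)
    total = 2 * baseline  # the controls match G4's total sample count
    groups["Control A"] = (total, 0)  # original videos only
    groups["Control B"] = (0, total)  # style-abstracted videos only
    return groups


composition = group_composition(3000)
```

With a baseline of 3000, G2 to G4 add 1000, 2000, and 3000 style-abstracted videos, and each control uses 6000 videos of a single type.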

4.2.1. First Impression v2 Preprocessing

The videos in the First Impression v2 dataset are relatively short, with an average duration of approximately 15 s. Given this duration, we processed the videos directly to maintain visual consistency. The validation set consists of 2000 videos, from each of which 15 original frames were extracted to ensure data reliability.

For the training set, we utilized 6000 videos in total to construct our experimental groups. A baseline was formed using 3000 videos, where 15 original frames were extracted from each. To evaluate the incremental impact of style abstraction, style-abstracted frames were sequentially added in increments corresponding to 1000 videos (each contributing 15 frames) across the groups G1 to G4. Furthermore, two control settings were introduced to isolate the effect of the augmentation type: one trained with 6000 original videos only (Control A) and the other with 6000 style-abstracted videos only (Control B). In total, this results in six experimental configurations, as summarized in Table 3.

4.2.2. UDIVA v0.5 Preprocessing

For the UDIVA v0.5 dataset, due to the relatively small number of participants, we merged all four task-based scenarios (Talk, Lego, Ghost, and Animals). The dataset was originally split into training, validation, and test sets according to participant identity. Each participant’s videos, ranging from several minutes to tens of minutes, were further segmented into one-minute clips, resulting in approximately 1100 videos for the validation set and 6516 videos for the training set. In the validation set, 15 frames were uniformly sampled from each video.

The training set includes 99 participants in total. As a baseline, data from 49 participants were used exclusively with 15 uniformly sampled original frames each. To evaluate the impact of our approach, the remaining 50 participants were utilized to generate style-abstracted samples, which were then incrementally combined with the baseline data to form groups G1 through G4. Furthermore, two control settings were introduced by training the model using data from all 99 participants with either 100% original videos (Control A) or 100% style-abstracted videos (Control B). In total, this design results in six experimental configurations, as summarized in Table 4.

4.2.3. KETI Dataset Preprocessing

The KETI dataset was processed in a manner similar to UDIVA v0.5, but at a significantly larger scale. We selected the scenario Planning a Korean trip for a foreign friend, which has previously demonstrated the most robust personality recognition performance. Since the KETI dataset is not pre-split, we first segmented each participant’s approximately 20 min video into 30 s clips to both improve recognition performance and address computational memory constraints. All clips originating from the same participant were assigned to the same data split to prevent identity leakage.
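
Segmenting a long recording into fixed-length clips can be sketched as follows. Dropping a trailing remainder shorter than one clip is an assumption, since the handling of leftover footage is not described here; the function name is hypothetical.

```python
def clip_bounds(duration_s, clip_s):
    """Return (start, end) times in seconds of consecutive full-length clips.

    A trailing remainder shorter than clip_s is dropped (assumed behavior).
    """
    n_clips = int(duration_s // clip_s)
    return [(i * clip_s, (i + 1) * clip_s) for i in range(n_clips)]


# A 20 min recording cut into 30 s clips.
bounds = clip_bounds(20 * 60, 30)
```

A 20 min recording yields 40 clips of 30 s each under this scheme.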

The dataset was partitioned by participant identity into training, validation, and test sets with an approximate 3:1:1 ratio. Specifically, participant IDs 1–310 were assigned to the training set, IDs 311–410 to the validation set, and IDs 411–516 to the test set, resulting in 29,037 training clips, 9184 validation clips, and 13,043 test clips. In this study, we adopted the 24–120 labeling scheme, normalizing each label by 120 to constrain values to the [0, 1] interval.
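
The identity-based partition and the label normalization described above amount to the following sketch. The ID ranges and the divisor come from the text; the function names and split labels are illustrative.

```python
def keti_split(participant_id):
    """Map a participant ID to its split using the ID ranges given in the text."""
    if 1 <= participant_id <= 310:
        return "train"
    if 311 <= participant_id <= 410:
        return "val"
    if 411 <= participant_id <= 516:
        return "test"
    raise ValueError(f"participant ID out of range: {participant_id}")


def normalize_label(score, max_score=120):
    """Divide a 24-120 trait score by 120."""
    return score / max_score


lowest = normalize_label(24)
highest = normalize_label(120)
```

Because the raw scores start at 24, the normalized labels actually occupy the range from 0.2 to 1.0 within the unit interval.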

For the validation set, 15 original frames were uniformly sampled from each video. The training set consists of 310 participants, including a baseline subset of 155 participants. To evaluate the incremental impact of our approach, the remaining 155 participants were used to generate style-abstracted samples, which were then progressively combined with the baseline data to form groups G1 through G4. Furthermore, two control settings were introduced by training the model using data from all 310 participants with either 100% original videos (Control A) or 100% style-abstracted videos (Control B). In total, this results in six experimental configurations, as summarized in Table 5.

4.2.4. Emotion Dataset Preprocessing

The emotion recognition dataset was split following the same strategy used for the KETI and UDIVA v0.5 datasets. Specifically, we employed a subject-independent splitting scheme to mitigate shortcut learning and prevent data leakage. Based on participant identity, all samples from the same speaker were assigned to a single subset. This ensures that the model cannot rely on speaker-specific traits or script-dependent linguistic patterns, and instead is encouraged to learn more generalizable emotional cues.

The dataset was divided into training and validation sets according to speaker identity, resulting in 3966 training samples from 67 speakers and 1348 validation samples from 23 speakers. Within the training set, data from 34 speakers (2047 samples) were retained as a fixed original baseline, while the remaining 33 speakers (1919 samples) were used to generate style-abstracted data for incremental experiments across groups G1 to G4. Furthermore, three control settings were introduced using the full training set of 3966 samples: (Control A) 100% original videos, (Control B) 100% style-abstracted videos, and (Control C) a balanced 1:1 mixture of original and style-abstracted videos. In total, this design results in seven experimental configurations, as summarized in Table 6.

4.3. Models Used in Experiments

This section presents the architectures of three Transformer-based video models used for personality and emotion recognition: ViViT, VST, and TimeSformer. Detailed explanations of their structures and characteristics are provided in the following subsections.

To ensure consistency in comparison and enhance the efficiency of spatio-temporal feature extraction, we used a CNN (Convolutional Neural Network)-based R2plus1D architecture as the visual backbone for all three models. Unlike standard 3D convolutions, which incur a high computational cost and a large number of parameters that increase the risk of overfitting, the R2plus1D architecture factorizes each 3D convolution into two sequential operations. The first is a 2D convolution across the spatial dimensions to capture shapes and objects within each frame; the second is a 1D convolution along the temporal dimension to model the evolution of these spatial features over time. Depending on the width of the intermediate channel dimension, this decomposition can reduce the total number of parameters, and it introduces an additional non-linear activation between the spatial and temporal operations. As a result, the network’s expressive power is enhanced, enabling it to learn more complex and robust feature representations.
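To make the factorization concrete, the sketch below counts convolution weights for a full 3D kernel and for its (2+1)D counterpart. The layer sizes (64 input and output channels, a 3 × 3 × 3 kernel) are illustrative assumptions, not the configuration used in this study. The `matched_mid_channels` helper follows the intermediate-width rule from the original R(2+1)D paper (Tran et al.), which keeps the factorized block close to the 3D parameter budget.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d


def conv2plus1d_params(c_in, c_out, t, d, c_mid):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal


def matched_mid_channels(c_in, c_out, t, d):
    """Intermediate width that keeps the factorized block near the 3D budget."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


full = conv3d_params(64, 64, t=3, d=3)
narrow = conv2plus1d_params(64, 64, t=3, d=3, c_mid=64)
m = matched_mid_channels(64, 64, t=3, d=3)
matched = conv2plus1d_params(64, 64, t=3, d=3, c_mid=m)
print(full, narrow, m, matched)
```

With these sizes, a narrow intermediate width (64 channels) gives fewer weights than the full 3D kernel, while the matched width (144 channels) reproduces the 3D count. In the matched case the benefit comes from the extra non-linearity rather than from a smaller model.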

The R2plus1D backbone was originally pre-trained on the Kinetics-400 dataset, which consists of natural, real-world videos. We acknowledge the potential domain gap between natural textures and the abstract, style-abstracted inputs used in our study. To address this issue, the backbone was not used with frozen weights. Instead, the entire network was fine-tuned end-to-end on our dataset. This allows the pre-trained spatial filters and temporal kernels to adapt to the simplified textures and quantized color spaces of style-abstracted videos, enabling the extraction of robust spatiotemporal representations relevant to our task.

Owing to its efficiency and strong performance in spatiotemporal modeling, R2plus1D has been widely adopted as a backbone for video understanding tasks. Therefore, we utilized the R2plus1D backbone to extract spatio-temporal features from the input video, where the raw video data is represented as $\mathrm{Video} \in \mathbb{R}^{C \times T \times H \times W}$, with $C$, $T$, $H$, and $W$ denoting the number of channels, the number of frames, the frame height, and the frame width, respectively.

4.3.1. Video Vision Transformer (ViViT)

ViViT was among the first models to extend the Vision Transformer from images to videos, with the goal of effectively modeling spatio-temporal information using a global self-attention mechanism [34]. As shown in Figure 2, the ViViT-based personality recognition model used in our study first feeds time-ordered video frames into a pre-trained R2plus1D backbone to extract spatio-temporal features. The resulting features are divided into 3D patches, which are flattened, linearly projected into tokens, and augmented with positional encoding. These tokens are then processed by Transformer blocks equipped with Factorized Self-Attention (FSA), which models spatial and temporal dependencies separately in a sequential manner. Finally, the model utilizes a regression or classification layer to output the five OCEAN personality scores or the four emotion indicators.

4.3.2. TimeSformer

TimeSformer is one of the earliest Transformer-based models for video understanding. It extends the application of transformers from images to videos by analyzing spatio-temporal features from sequences of frame-level patches [35].

As illustrated in Figure 3, sequential frames extracted from a video are first processed by the pre-trained R2plus1D backbone. This step extracts spatio-temporal features. The output is then divided into multiple 2D patches. These are flattened and linearly projected into tokens, producing a sequence of $T \times N$ patch tokens, where $T$ is the number of frames and $N$ is the number of patches per frame. Positional embeddings are added to encode both temporal and spatial locations. The resulting token sequence is then processed by TimeSformer blocks employing Divided Space-Time Attention, where temporal attention and spatial attention are applied sequentially. Finally, the model utilizes a regression or classification layer to output the five OCEAN personality scores or the four emotion indicators.
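A small calculation shows why applying temporal and spatial attention sequentially is cheaper than attending over all $T \times N$ tokens at once. The values of `T` and `N` below are hypothetical and chosen only for illustration, and the classification token is ignored for simplicity.

```python
def joint_attention_comparisons(num_frames, patches_per_frame):
    """Keys each query token attends to under joint space-time attention."""
    return num_frames * patches_per_frame


def divided_attention_comparisons(num_frames, patches_per_frame):
    """Keys per query when temporal and spatial attention run one after the other."""
    temporal = num_frames        # same patch position across all frames
    spatial = patches_per_frame  # all patches within the same frame
    return temporal + spatial


T, N = 15, 196  # hypothetical: 15 frames, 196 patches per frame
joint = joint_attention_comparisons(T, N)
divided = divided_attention_comparisons(T, N)
print(joint, divided)
```

Under these assumed sizes, each query compares against 2940 keys with joint attention but only 211 keys with the divided scheme.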

4.3.3. Video Swin Transformer (VST)

The VST model is derived from the Swin Transformer, extending its architecture from spatial to spatio-temporal domains to enable video understanding. It retains the core shifted-window attention mechanism of the Swin Transformer, which allows for the efficient capture of both local and global spatio-temporal information when processing videos.

As illustrated in Figure 4, spatio-temporal features are first extracted using the pre-trained R2plus1D backbone. The extracted features are then divided into spatio-temporal patches with shape $T \times H \times W$, which are linearly projected into embedding vectors. Video Swin Transformer blocks are subsequently applied, using 3D window-based and shifted-window attention mechanisms to capture spatio-temporal dependencies. Finally, the model utilizes a regression or classification layer to output the five OCEAN personality scores or the four emotion indicators.

4.4. Model Training

All experiments were conducted under a unified training and validation protocol. Across all configurations, the validation set consisted exclusively of original videos, while the training set was constructed by progressively increasing the proportion of style-abstracted samples. This design ensures that evaluation is consistently performed in the original visual domain, enabling a direct assessment of generalization rather than mere adaptation to stylized appearances.

For the First Impression v2 dataset, four experimental groups (G1 to G4) were defined based on its 6000 training videos. A fixed baseline of 3000 original videos was maintained across all groups. G1 used only this baseline, while G2, G3, and G4 added 1000, 2000, and 3000 style-abstracted versions of the remaining videos, respectively. This resulted in training sets of 3000, 4000, 5000, and 6000 videos. This setup allows the effect of style-abstracted data to be isolated while keeping the set of original videos constant.
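The incremental grouping can be sketched as below. This is an illustration under two stated assumptions: that the first group contains only the original baseline, and that the style-abstracted subsets are nested (each group extends the previous one). The function name and the placeholder video IDs are hypothetical.

```python
def build_incremental_groups(baseline, stylized_pool, steps):
    """Combine a fixed baseline with growing prefixes of the style-abstracted pool."""
    groups = {}
    for idx, n_added in enumerate(steps, start=1):
        if n_added > len(stylized_pool):
            raise ValueError("step exceeds the size of the style-abstracted pool")
        # Assumption: subsets are nested, so each group reuses the previous additions.
        groups[f"G{idx}"] = list(baseline) + list(stylized_pool[:n_added])
    return groups


baseline = [f"orig_{i}" for i in range(3000)]          # placeholder IDs
stylized = [f"style_{i}" for i in range(3000, 6000)]   # placeholder IDs
groups = build_incremental_groups(baseline, stylized, steps=(0, 1000, 2000, 3000))
sizes = {name: len(items) for name, items in groups.items()}
print(sizes)
```

Every group shares the same 3000 original videos, so any difference between groups comes from the added style-abstracted samples (and the larger training set they produce).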

For the UDIVA v0.5, KETI, and emotion datasets, experimental groups were constructed at the participant level to prevent identity leakage across training and validation sets. Although the grouping strategy was tailored to each dataset’s structure, the core experimental principle remained consistent: a fixed subset of original samples was retained as the baseline, while the proportion of style-abstracted data was gradually increased. In all cases, the validation set consisted only of original videos, ensuring that performance changes could be clearly attributed to the introduction of style-abstracted samples rather than to shifts in the original data distribution. This design allows the effect of style abstraction to be evaluated consistently under participant-independent settings across both personality and emotion recognition tasks.

By fixing the validation domain to original videos and introducing style-abstracted samples only during training, this protocol explicitly tests our core hypothesis that style abstraction can function as a regularization mechanism that suppresses shortcut learning within the evaluated domain. Rather than encouraging adaptation to a specific visual style, our style-abstraction-based data augmentation promotes greater reliance on behavior-relevant cues, such as facial motion dynamics and gesture patterns, which tend to remain more stable across varying visual conditions.

All experiments were conducted using three Transformer-based video models: ViViT, VST, and TimeSformer. Applying the same training design across multiple architectures helps verify that the observed performance improvements reflect enhanced robustness and generalization, rather than mere memorization of stylized appearances or increased data redundancy.

4.5. Experimental Environment

All experiments were conducted in a high-performance computing environment, as summarized in Table 7. The hardware setup included an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz, four NVIDIA GeForce RTX 4090 GPUs, and 192 GB of RAM. The software environment was based on Ubuntu 20.04.6 LTS, with implementation in Python 3.10.13, PyTorch 2.2.0, and Torchvision 0.17.0. GPU computation was accelerated with CUDA 12.2.

As summarized in Table 8, the models were trained using the AdamW optimizer, which decouples weight decay from the gradient-based update. Training was conducted for 120 epochs with a learning rate of $3 \times 10^{-5}$ and a batch size of 4. To adequately capture temporal information, 15 frames were sampled from each video clip.
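One simple way to pick 15 evenly spaced frames from a clip is sketched below. The exact sampling rule used in the experiments is not specified here, so the choice of taking the center of each of 15 equal segments is an assumption, as is the function name.

```python
def uniform_frame_indices(num_frames, num_samples=15):
    """Return indices at the centers of num_samples equal-length segments."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Integer arithmetic avoids floating-point rounding at segment boundaries.
    return [((2 * i + 1) * num_frames) // (2 * num_samples) for i in range(num_samples)]


indices = uniform_frame_indices(150, 15)
print(indices)
```

For a 150-frame clip this selects frames 5, 15, ..., 145. For clips shorter than 15 frames, some indices repeat, which keeps the input length fixed.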

4.6. Experiment Results

As described earlier, to evaluate the impact of style-abstracted data on the performance of personality and emotion recognition models, we conducted experiments using four datasets: First Impression v2, UDIVA v0.5, KETI, and the Emotion dataset. For model comparison, we employed three architectures: ViViT, VST, and TimeSformer. The evaluation metrics were selected according to the label characteristics of each dataset, as explained in the previous section. Specifically, for First Impression v2 and KETI, since the labels are normalized continuous values within the [0, 1] range, we adopted the 1-MAE metric. This choice is motivated by two factors: first, unlike MSE, which squares each error and therefore shrinks the already small errors that occur in this narrow range, MAE provides a linear and direct representation of error magnitude; second, the 1-MAE transformation offers an intuitive accuracy score aligned with standard benchmarks in personality computing.

For UDIVA v0.5, with labels represented as continuous float values in the [−3, 3] range, we applied the MSE (Mean Squared Error) metric. We chose MSE over MAE for this dataset because of its quadratic penalty property: it disproportionately penalizes larger errors compared to smaller ones. Given the wider value range and the polarized nature of the labels, employing MSE ensures that significant deviations, such as predicting opposite traits, are heavily weighted, thereby encouraging the model to minimize extreme prediction failures.

In addition to personality recognition, we also evaluated emotion recognition performance to further strengthen the validity of our experimental findings. Since the emotion labels are categorical, classification accuracy was used as the evaluation metric, allowing direct comparison of prediction correctness across classes. To ensure the stability and reliability of the results, all major experiments were repeated three times under the same experimental conditions, and the final performance is reported as the mean ± standard deviation over the three runs.
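Reporting results as mean ± standard deviation over repeated runs can be done as in the sketch below. Whether the sample or the population standard deviation was used is not stated, so the sample version (`statistics.stdev`) is an assumption, and the run scores are made-up placeholders rather than results from this study.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return mean(scores), stdev(scores)


def format_mean_std(scores, digits=4):
    """Format repeated-run scores as 'mean ± std'."""
    m, s = summarize_runs(scores)
    return f"{m:.{digits}f} ± {s:.{digits}f}"


runs = [0.90, 0.91, 0.92]  # placeholder scores from three hypothetical runs
run_mean, run_std = summarize_runs(runs)
print(format_mean_std(runs))
```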

Equation (6) represents the 1-MAE formula, which is defined as the complement of the Mean Absolute Error computed between the predicted values $\hat{y}_{i}$ and the ground-truth values $y_{i}$, where $N$ denotes the total number of samples. Higher values, closer to 1, indicate better performance.

$$1 - \mathrm{MAE} = 1 - \frac{1}{N} \sum_{i=1}^{N} | y_{i} - \hat{y}_{i} | \tag{6}$$

Equation (7) represents the MSE formula, which computes the mean of the squared differences between the predicted values $\hat{y}_{i}$ and the ground-truth values $y_{i}$, where $N$ denotes the total number of samples. Lower values indicate better prediction performance.

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} ( y_{i} - \hat{y}_{i} )^{2} \tag{7}$$
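Both metrics are straightforward to compute. The sketch below implements Equations (6) and (7) in plain Python; the function names and the toy values are illustrative and do not come from the experiments.

```python
def _check(y_true, y_pred):
    """Validate that both sequences are non-empty and of equal length."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")


def one_minus_mae(y_true, y_pred):
    """Equation (6): complement of the mean absolute error (higher is better)."""
    _check(y_true, y_pred)
    return 1.0 - sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Equation (7): mean of the squared errors (lower is better)."""
    _check(y_true, y_pred)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels in [0, 1], as for First Impression v2 and KETI.
score = one_minus_mae([0.5, 0.7], [0.4, 0.9])
# Toy labels in [-3, 3], as for UDIVA v0.5.
error = mse([1.0, -1.0], [0.0, 1.0])
print(score, error)
```

The toy values also illustrate the metric choice discussed above: an error of 0.1 contributes only 0.01 after squaring, whereas an error of 2.0 contributes 4.0, so MSE emphasizes large deviations and de-emphasizes small ones.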

As shown in Table 9, Table 10, Table 11 and Table 12, the four benchmarks generally exhibited a positive trend in performance as the proportion of style-abstracted data in the training set increased.

On the First Impression v2 dataset, the TimeSformer model showed a steady improvement, with its 1-MAE score rising from 0.9087 (G1) to 0.9138 (G4), representing a gain of 0.0051. Both ViViT and VST also reached their highest scores at G4 (100% Style-Abs.), suggesting that reducing stylistic variance through style abstraction may assist the models in capturing more generalized personality-related cues.

For the UDIVA v0.5 dataset, a reduction in MSE was observed across all architectures. Specifically, the ViViT model achieved its lowest MSE at G4, with a decrease of 0.0758 (from 1.1708 to 1.0950). VST and TimeSformer reached their best performance at G3 (66% Style-Abs.). The overall reduction in prediction error suggests that style-abstracted data may help the models avoid being distracted by irrelevant visual details, such as background or lighting.

On the KETI dataset, the performance gains were modest but consistent, with all models reaching their peak 1-MAE scores at G4 (100% Style-Abs.). The TimeSformer model improved by 0.0023, while ViViT and VST each saw a marginal gain of 0.0015. Although these improvements are incremental, their consistency across different architectures highlights the potential of the proposed style-abstraction-based augmentation method.

Finally, on the Emotion dataset, the TimeSformer model demonstrated an accuracy increase of 3.44 percentage points (from 54.67% to 58.11%) in the G4 setting. Similarly, VST improved by 2.62 percentage points at G4, and ViViT showed its best performance at G3 (66% Style-Abs.) with an increase of 1.83 percentage points. These results suggest that our style-abstraction-based data augmentation may encourage models to prioritize expressive motion dynamics over static, low-level visual features, thereby contributing to more stable performance in affective computing tasks.

To verify that the observed performance improvements are not simply caused by an increase in the number of training samples, but rather by the effect of our style-abstraction-based data augmentation, we conducted experiments using three backbone models (ViViT, VST and TimeSformer) across four datasets. In addition to three personality recognition datasets (First Impression v2, UDIVA v0.5, and KETI), we further evaluated the approach on a four-class categorical emotion dataset to assess its generalizability. For each dataset, three training settings were considered: 100% original videos, 100% style-abstracted videos and a balanced mixture with 50% original and 50% style-abstracted videos. Importantly, to eliminate the effect of increased training data, the total number of training samples was kept identical across all configurations for each dataset, only varying the proportion of original and style-abstracted videos.
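A fixed-size mixture of this kind can be built by deciding, for each training video, whether its original or its style-abstracted version is used. The sketch below is one possible construction and is an assumption on our part: the function name, the labels, and the random assignment are illustrative, and the actual selection procedure may differ.

```python
import random


def mix_fixed_size(video_ids, style_fraction, seed=0):
    """Tag each video as 'style' or 'original' while keeping the total count fixed."""
    if not 0.0 <= style_fraction <= 1.0:
        raise ValueError("style_fraction must lie in [0, 1]")
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n_style = round(len(ids) * style_fraction)
    # The first n_style shuffled videos use the style-abstracted version.
    return [(vid, "style" if i < n_style else "original") for i, vid in enumerate(ids)]


video_ids = [f"video_{i}" for i in range(100)]  # placeholder IDs
settings = {frac: mix_fixed_size(video_ids, frac) for frac in (0.0, 0.5, 1.0)}
style_counts = {
    frac: sum(1 for _, tag in items if tag == "style")
    for frac, items in settings.items()
}
print(style_counts)
```

All three settings contain the same number of samples, so differences in validation performance cannot be attributed to training-set size.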

For the UDIVA v0.5 (Table 13) and KETI (Table 14) datasets, the training and validation sets were split by participant identity, ensuring that no individual appears in both sets. This identity-disjoint protocol prevents identity leakage and encourages the models to learn more generalizable visual cues for personality recognition. Under this setting, the balanced mixture strategy (50% original + 50% style-abstracted videos) consistently achieves the best performance, suggesting that combining style-abstracted and original videos increases data diversity while preserving essential visual information.

In contrast, the results on the First Impression v2 dataset (Table 15) differ from our initial expectation. This dataset provides predefined training and validation splits, which we follow to ensure comparability with prior work. Under this protocol, training with 100% original videos yields the best performance. A possible explanation is that identity-related appearance cues may overlap between the splits, allowing the model to benefit from learning person-specific visual characteristics.

To further examine the generalizability of the proposed approach, we evaluated it on a four-class categorical emotion dataset. As shown in Table 16, the results follow a trend similar to the identity-disjoint personality datasets. All three models achieved their best performance with the balanced mixture configuration. Notably, the ViViT model showed the most distinct improvement, with accuracy increasing from 64.47% to 68.25% (a gain of 3.78 percentage points). This suggests that providing a mixture of styles may help models focus on expressive facial motion and intensity dynamics, which are essential for emotion recognition.

Overall, these results indicate that the effectiveness of our style-abstraction-based data augmentation can vary depending on the dataset characteristics and split strategy. When identity-related visual shortcuts are limited, incorporating a balanced mixture of style-abstracted data can assist models in learning more stable representations for personality and emotion recognition.

4.7. Qualitative Spatial Attention Visualization

To investigate how the proportion of style-abstracted data in the training set affects the model’s spatial attention, we conduct experiments using the VST model on the KETI and UDIVA v0.5 datasets. Figure 5 compares the attention maps generated under three training settings: (1) a model trained exclusively on original videos (left column), (2) a model trained on a balanced mixture (50% original and 50% style-abstracted videos) (middle column), and (3) a model trained exclusively on style-abstracted videos (right column).

In the visualization, the numerical values in each patch represent normalized attention scores. The results show that the model selectively attends to key expressive regions across both benchmarks. As the proportion of style-abstracted data increases, a noticeable qualitative shift in spatial attention emerges. Specifically, the attention distribution becomes increasingly concentrated on the subject’s primary expressive regions, such as the face and hands.

By reducing background textures and other visually distracting patterns, the proposed style-based augmentation may encourage the model to rely more on structural and behavioral cues rather than superficial appearance details. Models trained with a balanced mixture of original and style-abstracted data tend to focus more consistently on subject-related regions where social and emotional signals are expressed. This shift in attention from background information to person-centered cues, such as facial expressions and hand gestures, may help the model to learn more stable behavioral representations and improve generalization across different environments.

4.8. Performance Comparison of Data Augmentation Methods

In this section, we evaluate the effectiveness of the proposed style-abstraction-based data augmentation by comparing it with a conventional geometric rotation baseline. For a fair comparison, both settings used a balanced mixture of 50% original and 50% augmented samples. Specifically, 50% of the training samples remained unchanged, while the remaining 50% were either randomly rotated within the range of $[-10^{\circ}, 10^{\circ}]$ (Original + Rotation) or transformed using the proposed style-abstraction-based data augmentation pipeline (Original + Style-Abs.).
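The rotation baseline can be sketched as an assignment of rotation angles to half of the training clips. Two choices here are assumptions rather than details from the experiments: one angle is drawn per clip (so all frames of a clip share it), and the rotated half is chosen uniformly at random. The function name and clip IDs are hypothetical.

```python
import random


def assign_rotation_angles(clip_ids, rotate_fraction=0.5, max_degrees=10.0, seed=0):
    """Map each clip ID to a rotation angle in degrees, or None if left unchanged."""
    rng = random.Random(seed)
    ids = list(clip_ids)
    n_rotate = round(len(ids) * rotate_fraction)
    chosen = set(rng.sample(ids, n_rotate))
    return {
        cid: (rng.uniform(-max_degrees, max_degrees) if cid in chosen else None)
        for cid in ids
    }


clip_ids = [f"clip_{i}" for i in range(100)]  # placeholder IDs
angles = assign_rotation_angles(clip_ids)
rotated = [a for a in angles.values() if a is not None]
print(len(rotated))
```

The same 50/50 structure applies to the style-abstraction setting, with the rotated half replaced by style-abstracted clips.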

The comparative results across multiple video architectures are summarized in Table 17. Overall, the balanced mixture with style-abstracted data performs comparably to or better than the rotation-based baseline across the evaluated benchmarks.

Performance Improvement on UDIVA v0.5: The most noticeable improvement is observed on the UDIVA v0.5 dataset. For all evaluated models, the proposed style-abstraction-based data augmentation achieves lower prediction errors than the rotation baseline. In particular, the ViViT model shows the largest improvement, reducing the error from 1.1716 to 1.0851.
Results on KETI and First Impression v2: On these datasets, the proposed approach achieves performance that is largely comparable to the rotation baseline. Although the numerical differences are smaller than those observed for UDIVA v0.5, the results remain consistent across different backbone models. This suggests that visual abstraction can serve as a viable augmentation strategy while maintaining stable predictive performance.
Observed Improvements in Emotion Recognition: The proposed balanced mixture also shows a positive impact on the Emotion dataset. In particular, when using the VST model, our style-abstraction-based data augmentation results in an accuracy of 62.09% compared to 58.09% with the rotation baseline, a gain of 4.00 percentage points. Similarly, TimeSformer shows an increase of 3.41 percentage points. This suggests that by simplifying unnecessary visual details, the model may be better able to focus on essential affective cues such as facial motion and posture, which are important for emotion recognition.

Overall, while rotation-based augmentation introduces spatial variation, the proposed style-abstraction-based augmentation provides an alternative form of variability by abstracting low-level appearance patterns. This abstraction may encourage the model to focus more strongly on behavior-relevant cues such as facial expressions and hand gestures, while reducing sensitivity to background textures and other appearance-related artifacts.

4.9. Impact of Style Abstraction Intensity on Model Performance

In this section, we investigate how different levels of stylistic abstraction, generated by our style abstraction pipeline, influence model performance on the First Impression v2 dataset. For this analysis, we constructed an augmented training set consisting of 3000 original samples and 1000 style-abstracted samples, where the intensity of the transformation was systematically varied. Figure 6 illustrates the visual transitions generated by our pipeline, ranging from the original frame to three distinct levels of intensity: low, medium, and high.

Notably, the medium style abstraction intensity used in this evaluation corresponds to the default parameter configuration adopted in our previous experiments. As shown in Table 18, this medium level yielded the best performance across all evaluated architectures, including ViViT, VST, and TimeSformer.

While the low style abstraction intensity retained redundant pixel-level noise from the original frames and the high intensity led to the loss of subtle facial cues due to excessive abstraction, the medium intensity provided the most effective balance between visual abstraction and the preservation of affect-relevant facial dynamics. These results validate our selection of the medium style abstraction configuration as the optimal setting for enhancing affective representation in the First Impression v2 dataset.

5. Discussion, Limitations, and Future Directions

While the proposed style-abstraction-based data augmentation demonstrates promising performance improvements, several limitations remain to be addressed. First, the current style-abstraction-based data augmentation process relies on a sequence of deterministic operations, including Bilateral Filtering, Edge Detection to create the edge map E, Color Space Conversion to HSV, Adaptive Color Quantization, and Outline Drawing. At present, this pipeline is treated as a black box regarding its specific contribution to model regularization. Due to the computational complexity and the defined scope of this study, a granular ablation study on these individual components was not feasible. Future work should involve a comparative analysis to identify which specific operations or combinations of operations provide the most significant regularization benefits, which will be key to optimizing the augmentation pipeline for diverse vision tasks.

Another key limitation is the restricted style diversity of our current style-abstraction-based data augmentation. Future studies should explore diffusion-based generative stylization and controllable multi-style augmentation to improve expressive variation and robustness. Such generative approaches could allow for precise control over stylistic parameters, such as stroke intensity and color palette, further enhancing the model’s generalization capabilities across different domains.

Furthermore, we acknowledge that the style-abstraction-based data augmentation process can obscure nuanced physical cues, such as subtle skin textures and tones. This loss of detail may potentially distort facial or bodily cues unevenly across different demographic groups. Due to the lack of fine-grained metadata in the current dataset, a comprehensive subgroup failure analysis was not feasible in this study.

Ensuring that style-abstraction-based augmentation does not introduce or exacerbate unintended biases is a priority for future iterations. Finally, extending this framework to multimodal fusion by integrating speech and text modalities remains a critical direction for achieving fair and robust affective intelligence in real-world AGI systems.

6. Conclusions

Artificial General Intelligence (AGI) is widely regarded as a long-term goal in the development of artificial intelligence. However, one of the major challenges is the scarcity of large and diverse datasets, which constrains the training of effective models and limits progress toward general intelligence. In this study, we investigated the effect of style-abstraction-based data augmentation on video-based affective behavior recognition, including both personality recognition and emotion recognition tasks. Experiments using three backbone models (ViViT, VST, and TimeSformer) across four datasets (First Impression v2, UDIVA v0.5, KETI, and an emotion dataset) showed that performance generally improved as the proportion of style-abstracted samples increased.

These findings suggest that style-abstraction-based data augmentation may help reduce the model’s reliance on superficial appearance cues while encouraging greater attention to behavior-relevant information. Importantly, the performance gains observed not only in personality recognition but also in emotion recognition indicate that style-abstracted data can contribute to general improvements in affective computing tasks. This suggests that the proposed technique represents a practical solution to the data scarcity problem, enabling improved model generalization without requiring the creation of fully synthetic datasets.

Author Contributions

Conceptualization, X.Q. and T.K.; methodology, X.Q.; software, X.Q. and T.K.; validation, X.Q. and T.K.; formal analysis, X.Q. and T.K.; investigation, X.Q. and T.K.; resources, B.K.; data curation, X.Q. and T.K.; writing—original draft preparation, X.Q.; writing—review and editing, X.Q., T.K. and B.K.; visualization, X.Q.; supervision, B.K.; project administration, B.K.; funding acquisition, B.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2026-RS-2020-II201462).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study include both publicly available benchmarks and institutionally restricted datasets. The First Impression v2 and UDIVA v0.5 datasets are publicly accessible and can be obtained from their respective official project websites. Specifically, the First Impression v2 dataset is available at https://chalearnlap.cvc.uab.es/dataset/24/description/ (accessed on 19 March 2026), and the UDIVA v0.5 dataset can be accessed at https://chalearnlap.cvc.uab.cat/dataset/41/description/ (accessed on 19 March 2026). In contrast, the KETI and Emotion datasets are not publicly released due to institutional data governance policies and privacy considerations.

Acknowledgments

This research was supported by the Korea Electronics Technology Institute (KETI), which provided the KETI dataset used in this study. The authors gratefully acknowledge its contribution.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fei, N.; Lu, Z.; Gao, Y.; Yang, G.; Huo, Y.; Wen, J.; Lu, H.; Song, R.; Gao, X.; Xiang, T.; et al. Towards artificial general intelligence via a multimodal foundation model. Nat. Commun. 2022, 13, 3094.
Moro, C.; Birt, J.; Stromberga, Z.; Phelps, C.; Clark, J.; Glasziou, P.; Scott, A.M. Virtual and augmented reality enhancements to medical and science student physiology and anatomy test performance: A systematic review and meta-analysis. Anat. Sci. Educ. 2021, 14, 368–376.
Checa, D.; Miguel-Alonso, I.; Bustillo, A. Immersive virtual-reality computer-assembly serious game to enhance autonomous learning. Virtual Real. 2023, 27, 3301–3318.
Xu, Q.; Han, G.; Kim, B. Performance Analysis for Accuracy of Personality Recognition Models based on Setting of Margin Values at Face Region Extraction. J. Inst. Internet Broadcast. Commun. 2024, 24, 141–147.
Ponce-López, V.; Chen, B.; Oliu, M.; Corneanu, C.; Clapés, A.; Guyon, I.; Baró, X.; Escalante, H.J.; Escalera, S. Chalearn lap 2016: First round challenge on first impressions-dataset and results. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 400–418.
Palmero, C.; Selva, J.; Smeureanu, S.; Junior, J.; CS, J.; Clapés, A.; Moseguí, A.; Zhang, Z.; Gallardo, D.; Guilera, G.; et al. Context-aware personality inference in dyadic scenarios: Introducing the udiva dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 1–12.
Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 6836–6846.
Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2022; pp. 3192–3201.
Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 1–12.
Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
Trabucco, B.; Doherty, K.; Gurinas, M.; Salakhutdinov, R. Effective data augmentation with diffusion models. arXiv 2023, arXiv:2302.07944.
Borkman, S.; Crespi, A.; Dhakad, S.; Ganguly, S.; Hogins, J.; Jhang, Y.C.; Kamalzadeh, M.; Li, B.; Leal, S.; Parisi, P.; et al. Unity perception: Generate synthetic data for computer vision. arXiv 2021, arXiv:2107.04259.
Zhou, K.; Yang, Y.; Qiao, Y.; Xiang, T. Domain generalization with mixstyle. arXiv 2021, arXiv:2104.02008.
Jackson, P.T.; Abarghouei, A.A.; Bonner, S.; Breckon, T.P.; Obara, B. Style augmentation: Data augmentation via style randomization. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Piscataway, NJ, USA, 2019; Volume 6, pp. 10–11.
Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673.
Wang, J.; Lan, C.; Liu, C.; Ouyang, Y.; Qin, T.; Lu, W.; Chen, Y.; Zeng, W.; Yu, P.S. Generalizing to unseen domains: A survey on domain generalization. IEEE Trans. Knowl. Data Eng. 2022, 35, 8052–8072.
Wang, S.; Veldhuis, R.; Brune, C.; Strisciuglio, N. What do neural networks learn in image classification? A frequency shortcut perspective. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2023; pp. 1433–1442.
Aljuhani, N.M.; Al-Ghamdi, A.A.M.; Alghamdi, H.S.; Saleem, F. Convolutional Bi-LSTM for Automatic Personality Recognition from Social Media Texts. IEEE Access 2025, 13, 65582–65603.
Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2018; pp. 6450–6459.
Sun, M.; Zhang, K. Multimodal Co-attention Transformer for Video-Based Personality Understanding. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData); IEEE: Piscataway, NJ, USA, 2023; pp. 1450–1459.
Agrawal, T.; Balazia, M.; Müller, P.; Brémond, F. Multimodal vision transformers with forced attention for behavior analysis. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2023; pp. 3381–3391.
Ryumina, E.; Markitantov, M.; Ryumin, D.; Karpov, A. Gated Siamese Fusion Network based on multimodal deep and hand-crafted features for personality traits assessment. Pattern Recognition Letters 2024, 185, 45–51.
Bhin, H.; Choi, J. Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features. Electronics 2025, 14, 2837.
Chen, G.; Liao, Y.; Zhang, D.; Yang, W.; Mai, Z.; Xu, C. Multimodal Emotion Recognition via the Fusion of Mamba and Liquid Neural Networks with Cross-Modal Alignment. Electronics 2025, 14, 3638.
Fang, Y.; Huang, W.; Wan, G.; Su, K.; Ye, M. Emoe: Modality-specific enhanced dynamic emotion experts. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2025; pp. 14314–14324. [Google Scholar]
Ventura, C.; Masip, D.; Lapedriza, A. Interpreting cnn models for apparent personality trait regression. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Piscataway, NJ, USA, 2017; pp. 1705–1713. [Google Scholar]
Eddine Bekhouche, S.; Dornaika, F.; Ouafi, A.; Taleb-Ahmed, A. Personality traits and job candidate screening via analyzing facial videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Piscataway, NJ, USA, 2017; pp. 1660–1663. [Google Scholar]
Kaya, H.; Gurpinar, F.; Ali Salah, A. Multi-modal score fusion and decision trees for explainable automatic job candidate screening from video cvs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2017; pp. 1651–1659. [Google Scholar]
Agrawal, T.; Guermal, M.; Balazia, M.; Bremond, F. CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2025; pp. 7379–7388. [Google Scholar]
Kassab, K.; Kashevnik, A.; Mayatin, A.; Zubok, D. Vptd: Human face video dataset for personality traits detection. Data 2023, 8, 113. [Google Scholar] [CrossRef]
Higashi, T.; Ishibashi, R.; Meng, L. ViViT fall detection and action recognition. In Proceedings of the 2024 International Conference on Advanced Mechatronic Systems (ICAMechS); IEEE: Piscataway, NJ, USA, 2024; pp. 291–296. [Google Scholar]
Hong, J.; Najafizadeh, L. TopoEEG: A Timesformer-Based Topographic Image Representation Method for Single-Trial Early Detection of P300. In Proceedings of the 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI); IEEE: Piscataway, NJ, USA, 2025; pp. 1–4. [Google Scholar]

Figure 1. Visual examples of the proposed Style-Abstraction-based Data Augmentation applied to the First Impression v2 dataset (ChaLearn LAP Challenge), a publicly available dataset for academic research.

Figure 2. Video Vision Transformer (ViViT) Architecture.

Figure 3. TimeSformer Architecture.

Figure 4. Video Swin Transformer (VST) Architecture.

Figure 5. Qualitative visualization of spatial attention maps across KETI and UDIVA v0.5 datasets.

Figure 6. Visual comparison of style abstraction intensities: (a) Original; (b) Low; (c) Medium; (d) High.

Table 1. Representative studies addressing shortcut learning and domain generalization.

Reference	Research Focus	Key Contributions
Jackson et al. (2019) [17]	Model Robustness	Introduces style augmentation based on random style transfer to randomize texture and color.
Geirhos et al. (2020) [18]	Shortcut Learning	Identifies model reliance on texture and background cues, leading to poor generalization.
Zhou et al. (2021) [16]	Domain Generalization	Proposes MixStyle to enhance domain invariance by mixing feature statistics.
Wang et al. (2022) [19]	Generalization Survey	Categorizes algorithms into data manipulation, representation learning, and learning strategies.
Wang et al. (2023) [20]	Structural Bias	Reveals architecture-dependent reliance on frequency-based shortcuts (low vs. high frequency).

Table 2. Summary of the datasets used in this study.

Dataset	Participants	Videos	Language	Duration	Annotation
First Impression v2	>3000	10,000	English	15 s (fixed)	Big-5 (AMT)
UDIVA v0.5	134	290	Spanish, Catalan, English	2–20 min	Big-5 (BFI-2)
KETI Dataset	516	2575	Korean	13 min (avg.)	Big-5 (IPIP-NEO)
Emotion Dataset	100	5314 *	Korean	2–6 s	4-class categorical labels

* Subset of the full 10,351-sample dataset, filtered to four emotion classes.

Table 3. Composition of the training and validation sets for the First Impression v2 dataset.

	Training Set		Validation Set
Group	Original Samples	Style-Abstracted Samples	Original Samples
G1 (0% Style-Abs.)	3000	0	2000
G2 (33% Style-Abs.)	3000	1000	2000
G3 (66% Style-Abs.)	3000	2000	2000
G4 (100% Style-Abs.)	3000	3000	2000
Control A (100% Orig.)	6000	0	2000
Control B (100% Style-Abs.)	0	6000	2000
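
The group compositions in Table 3 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the percentage in each group label is the number of style-abstracted samples relative to the 3000 original training samples, taken in thirds. The names `group_composition` and `FRACTIONS` are hypothetical and do not come from the authors' code. The participant-level counts for the other datasets may follow a different rounding or sampling rule, so this sketch is checked against Table 3 only.

```python
from fractions import Fraction

# Assumed mapping from group label to the style-abstracted share,
# expressed relative to the number of original training samples.
FRACTIONS = {
    "G1": Fraction(0),
    "G2": Fraction(1, 3),
    "G3": Fraction(2, 3),
    "G4": Fraction(1),
}


def group_composition(n_original, style_fraction):
    """Return (original, style_abstracted) training counts for one group."""
    n_style = round(Fraction(style_fraction) * n_original)
    return n_original, n_style


# First Impression v2 uses 3000 original training samples per group (Table 3).
groups = {name: group_composition(3000, frac) for name, frac in FRACTIONS.items()}

for name, (n_orig, n_style) in groups.items():
    print(f"{name}: {n_orig} original + {n_style} style-abstracted")
```

Exact fractions are used instead of 0.33 and 0.66 so that the counts come out as whole thousands rather than 990 and 1980.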

Table 4. Composition of the training and validation sets for the UDIVA v0.5 dataset.

	Training Set		Validation Set
Group	Original Participants	Style-Abstracted Participants	Original Participants
G1 (0% Style-Abs.)	49	0	20
G2 (33% Style-Abs.)	49	17	20
G3 (66% Style-Abs.)	49	34	20
G4 (100% Style-Abs.)	49	50	20
Control A (100% Orig.)	99	0	20
Control B (100% Style-Abs.)	0	99	20

Table 5. Composition of the training and validation sets for the KETI dataset.

	Training Set		Validation Set
Group	Original Participants	Style-Abstracted Participants	Original Participants
G1 (0% Style-Abs.)	155	0	100
G2 (33% Style-Abs.)	155	52	100
G3 (66% Style-Abs.)	155	104	100
G4 (100% Style-Abs.)	155	155	100
Control A (100% Orig.)	310	0	100
Control B (100% Style-Abs.)	0	310	100

Table 6. Composition of the training and validation sets for the Emotion dataset.

	Training Set		Validation Set
Group	Original Samples	Style-Abstracted Samples	Original Samples
G1 (0% Style-Abs.)	2047	0	1348
G2 (33% Style-Abs.)	2047	633	1348
G3 (66% Style-Abs.)	2047	1267	1348
G4 (100% Style-Abs.)	2047	1919	1348
Control A (100% Orig.)	3966	0	1348
Control B (100% Style-Abs.)	0	3966	1348
Control C (Balanced Mixture)	1983	1983	1348

Table 7. Hardware and software configuration of the experimental environment.

Component	Specification
CPU	Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz
GPU	NVIDIA GeForce RTX 4090 (4×)
RAM	192 GB
OS	Ubuntu 20.04.6 LTS
Python	3.10.13
PyTorch	2.2.0
Torchvision	0.17.0
CUDA	12.2

Table 8. Experimental hyperparameters. The same hyperparameters are applied to ViViT, VST, and TimeSformer; only the loss function differs by dataset.

Hyperparameter	Configuration
Optimizer	AdamW
Epochs	120
Learning Rate	$3 \times 10^{-5}$
Batch Size	4
Frame Number	15
Loss Function (UDIVA v0.5)	MSELoss
Loss Function (First Impression v2)	L1Loss
Loss Function (KETI)	L1Loss
Loss Function (Emotion Dataset)	CrossEntropy
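
For readers who want to reuse these settings, the values in Table 8 could be captured in a small configuration object. This is a minimal sketch, not the authors' training script: `TrainConfig`, `LOSS_BY_DATASET`, and `make_config` are hypothetical names, and the optimizer and loss entries are kept as plain strings exactly as the table lists them, so the snippet runs without any deep learning framework installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters shared by ViViT, VST, and TimeSformer (Table 8)."""
    optimizer: str = "AdamW"
    epochs: int = 120
    learning_rate: float = 3e-5
    batch_size: int = 4
    num_frames: int = 15


# Loss function per dataset, using the names given in Table 8.
LOSS_BY_DATASET = {
    "UDIVA v0.5": "MSELoss",
    "First Impression v2": "L1Loss",
    "KETI": "L1Loss",
    "Emotion Dataset": "CrossEntropy",
}


def make_config(dataset):
    """Return the shared config and the dataset-specific loss name."""
    if dataset not in LOSS_BY_DATASET:
        raise KeyError(f"unknown dataset: {dataset!r}")
    return TrainConfig(), LOSS_BY_DATASET[dataset]


config, loss_name = make_config("KETI")
print(config, loss_name)
```

Keeping the loss in a separate mapping mirrors the table: the regression datasets use MSE or L1 loss, while the four-class emotion task uses cross-entropy.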

Table 9. Comparison of model performance (1-MAE↑) across different proportions of style-abstracted data on the First Impression v2 dataset.

Model	G1 (0% Style-Abs.)	G2 (33% Style-Abs.)	G3 (66% Style-Abs.)	G4 (100% Style-Abs.)
ViViT	0.9121 ± 0.0001	0.9128 ± 0.0006	0.9139 ± 0.0003	0.9142 ± 0.0003
VST	0.9089 ± 0.0005	0.9099 ± 0.0010	0.9112 ± 0.0001	0.9121 ± 0.0002
TimeSformer	0.9087 ± 0.0021	0.9119 ± 0.0005	0.9129 ± 0.0006	0.9138 ± 0.0002

↑ indicates that higher values correspond to better performance. Bold values denote the best performance.
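
The two regression metrics used in the result tables can be written out explicitly. The sketch below is a plain-Python illustration under stated assumptions: predictions and labels are flat lists of trait scores, and the error is averaged over all entries. How the authors aggregate over traits and samples is not stated in this section, so the function names and the averaging scheme are assumptions.

```python
def mean_absolute_error(pred, target):
    """Average absolute difference between predictions and labels."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def one_minus_mae(pred, target):
    """1-MAE: higher is better, 1.0 means a perfect prediction."""
    return 1.0 - mean_absolute_error(pred, target)


def mean_squared_error(pred, target):
    """MSE: lower is better, 0.0 means a perfect prediction."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy example with two predicted trait scores (illustrative values only).
pred = [0.5, 0.7]
target = [0.6, 0.5]
print(one_minus_mae(pred, target))
print(mean_squared_error(pred, target))
```

Because 1-MAE rewards small errors and MSE penalizes large ones, the arrows in the table captions point in opposite directions for the two metrics.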

Table 10. Comparison of model performance (MSE↓) across different proportions of style-abstracted data on the UDIVA v0.5 dataset.

Model	G1 (0% Style-Abs.)	G2 (33% Style-Abs.)	G3 (66% Style-Abs.)	G4 (100% Style-Abs.)
ViViT	1.1708 ± 0.0072	1.1432 ± 0.0112	1.1147 ± 0.0082	1.0950 ± 0.0165
VST	1.2214 ± 0.0070	1.2107 ± 0.0178	1.1321 ± 0.0044	1.1368 ± 0.0085
TimeSformer	1.2914 ± 0.0232	1.2235 ± 0.0046	1.2210 ± 0.0013	1.2250 ± 0.0090

↓ indicates that lower values correspond to better performance. Bold values denote the best performance.

Table 11. Comparison of model performance (1-MAE↑) across different proportions of style-abstracted data on the KETI dataset.

Model	G1 (0% Style-Abs.)	G2 (33% Style-Abs.)	G3 (66% Style-Abs.)	G4 (100% Style-Abs.)
ViViT	0.9126 ± 0.0004	0.9121 ± 0.0002	0.9130 ± 0.0004	0.9141 ± 0.0002
VST	0.9080 ± 0.0005	0.9079 ± 0.0007	0.9089 ± 0.0007	0.9095 ± 0.0007
TimeSformer	0.9082 ± 0.0007	0.9087 ± 0.0004	0.9098 ± 0.0004	0.9105 ± 0.0011

↑ indicates that higher values correspond to better performance. Bold values denote the best performance.

Table 12. Comparison of model performance (Accuracy↑) across different proportions of style-abstracted data on the Emotion dataset.

Model	G1 (0% Style-Abs.)	G2 (33% Style-Abs.)	G3 (66% Style-Abs.)	G4 (100% Style-Abs.)
ViViT	62.14 ± 1.08	63.00 ± 0.65	63.97 ± 1.44	63.92 ± 1.45
VST	53.96 ± 1.09	55.29 ± 1.04	54.67 ± 0.93	56.58 ± 0.76
TimeSformer	54.67 ± 2.21	57.15 ± 0.88	57.84 ± 1.90	58.11 ± 1.55

↑ indicates that higher values correspond to better performance. Bold values denote the best performance.
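
The headline gains quoted in the abstract follow directly from the mean values in Tables 9–12. The short check below recomputes them as the difference between the G1 baseline and the best group for each reported dataset and model pair. The helper `best_gain` and the `RESULTS` layout are hypothetical; the numbers are copied from the tables (means only, standard deviations omitted).

```python
# Mean scores for G1..G4, copied from Tables 9-12.
RESULTS = {
    ("UDIVA v0.5", "VST", "MSE"): [1.2214, 1.2107, 1.1321, 1.1368],
    ("KETI", "TimeSformer", "1-MAE"): [0.9082, 0.9087, 0.9098, 0.9105],
    ("First Impression v2", "TimeSformer", "1-MAE"): [0.9087, 0.9119, 0.9129, 0.9138],
    ("Emotion", "TimeSformer", "Accuracy"): [54.67, 57.15, 57.84, 58.11],
}

# Metrics for which a smaller value is better.
LOWER_IS_BETTER = {"MSE"}


def best_gain(values, metric):
    """Improvement of the best group over the G1 baseline (first entry)."""
    baseline = values[0]
    if metric in LOWER_IS_BETTER:
        return round(baseline - min(values), 4)
    return round(max(values) - baseline, 4)


gains = {key: best_gain(vals, key[2]) for key, vals in RESULTS.items()}

for (dataset, model, metric), gain in gains.items():
    print(f"{dataset} / {model}: {metric} improves by {gain}")
```

The computed differences (0.0893, 0.0023, 0.0051, and 3.44) match the figures stated in the abstract. Note that for UDIVA v0.5 with VST the best mean occurs at G3 rather than G4.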

Table 13. Performance comparison on the UDIVA v0.5 dataset (MSE ↓) under different training data compositions.

Model	Orig. (100%)	Balanced Mixture (50%:50%)	Style-Abs. (100%)
VST	1.4058	1.1255	1.4249
ViViT	1.2882	1.0851	1.3578
TimeSformer	1.4218	1.2193	1.3910

↓ indicates that lower values correspond to better performance. Bold values denote the best performance.

Table 14. Performance comparison on the KETI dataset (1-MAE ↑) under different training data compositions.

Model	Orig. (100%)	Balanced Mixture (50%:50%)	Style-Abs. (100%)
VST	0.9100	0.9105	0.9083
ViViT	0.9127	0.9142	0.9146
TimeSformer	0.9109	0.9114	0.9118

↑ indicates that higher values correspond to better performance. Bold values denote the best performance.

Table 15. Performance comparison on the First Impression v2 dataset (1-MAE ↑) under different training data compositions.

Model	Orig. (100%)	Balanced Mixture (50%:50%)	Style-Abs. (100%)
VST	0.9138	0.9124	0.9078
ViViT	0.9166	0.9147	0.9085
TimeSformer	0.9157	0.9141	0.9050

↑ indicates that higher values correspond to better performance. Bold values denote the best performance.

Table 16. Performance comparison on the Emotion dataset (Accuracy ↑) under different training data compositions.

Model	Orig. (100%)	Balanced Mixture (50%:50%)	Style-Abs. (100%)
VST	60.83	62.09	40.36
ViViT	64.47	68.25	44.51
TimeSformer	62.02	63.13	40.28

↑ indicates that higher values correspond to better performance. Bold values denote the best performance.

Table 17. Comparison of model performance between rotation and style-abstraction-based data augmentation across four datasets.

Dataset	Model	Original + Rotation	Original + Style-Abs. (Ours)	Metric
UDIVA v0.5	VST	1.1951	1.1255	MSE ↓
	ViViT	1.1716	1.0851
	TimeSformer	1.3093	1.2193
KETI	VST	0.9079	0.9105	1-MAE ↑
	ViViT	0.9140	0.9142
	TimeSformer	0.9080	0.9114
First Impression v2	VST	0.9116	0.9124	1-MAE ↑
	ViViT	0.9142	0.9147
	TimeSformer	0.9138	0.9141
Emotion	VST	58.09	62.09	Accuracy ↑
	ViViT	67.80	68.25
	TimeSformer	59.72	63.13

Bold values denote the best performance.

Table 18. Comparison of model performance (1-MAE) across style abstraction intensities on the First Impression v2 dataset.

Model	Intensity	1-MAE
ViViT	Low	0.9131
	Medium *	0.9137
	High	0.9126
TimeSformer	Low	0.9117
	Medium *	0.9122
	High	0.9120
VST	Low	0.9101
	Medium *	0.9109
	High	0.9095

* Medium is the default intensity setting; all other experiments in this paper use it.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
