Article

Generating Synthetic Facial Expression Images Using EmoStyle

by Clément Gérard Daniel Darne 1,2,*, Changqin Quan 1,* and Zhiwei Luo 1

1 Graduate School of Systems Informatics, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
2 Grenoble Institute of Technology–Ensimag, Grenoble Alpes University, 681 Rue de la Passerelle–BP 72, 38402 Saint Martin d’Hères CEDEX, France
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10636; https://doi.org/10.3390/app151910636
Submission received: 21 July 2025 / Revised: 21 September 2025 / Accepted: 26 September 2025 / Published: 1 October 2025

Featured Application

An open-source framework for generating synthetic data of facial expression images using the EmoStyle generative model.

Abstract

Synthetic data has emerged as a significant alternative to more costly and time-consuming data collection methods. This is particularly true for training facial expression recognition (FER) and generation models. The EmoStyle model represents a state-of-the-art method for editing images of facial expressions in the latent space of StyleGAN2, using a continuous valence–arousal (VA) representation of emotions. While the model has demonstrated promising results in terms of high-quality image generation and strong identity preservation, its accuracy in reproducing facial expressions across the VA space remains to be systematically examined. To address this gap, the present study proposes a systematic evaluation of EmoStyle’s ability to generate facial expressions across the full VA space, including four levels of emotional intensity. While prior work on expression manipulation has mainly evaluated perceptual quality, diversity, identity preservation, or classification accuracy, to the best of our knowledge, no study to date has systematically evaluated the accuracy of generated expressions across the VA space. The evaluation reveals a consistent weakness in the 242–329° range of VA directions, where EmoStyle fails to produce distinct expressions. Building on these findings, we outline recommendations for enhancing the generation pipeline and release an open-source EmoStyle-based toolkit that integrates fixes to the original EmoStyle repository, an API wrapper, and our experiment scripts. Collectively, these contributions provide both novel insights into the model’s capacities and practical resources for further research.

1. Introduction

Facial expression recognition (FER) and generation are active topics in machine learning (ML) research [1]. ML models have grown increasingly complex, with larger parameter counts over the years in order to deal with more challenging problems [2]. In ML training, larger networks not only increase the number of computations required but also exponentially increase the volume of data needed to achieve good results—an issue known as the curse of dimensionality [3]. Typically, as the number of features or dimensions grows, the amount of information represented by each training sample compared to the information required to cover the entire feature space becomes exponentially smaller. This leads to the necessity of collecting enormous amounts of training data in order to ensure accurate generalization [4]. In the field of facial expressions, multiple datasets have been created in the past [5,6,7]. They often consist of labeled in-the-wild images of people or of images of professional actors specifically recruited to perform a set of different facial expressions. However, both approaches are highly time-consuming [8,9]. Therefore, it is difficult to quickly produce a large volume of data. Furthermore, the use of real-world data raises privacy and confidentiality issues, especially regarding face images [10,11]. In the past, several cases of widely used datasets, such as MegaFace [12], MS-Celeb-1M [13], and VGGFace2 [14], have been withdrawn due to ethical or privacy reasons and are no longer publicly available [10,11,15]. These concerns should therefore be carefully considered when dealing with facial image data.
Synthetic data—artificially generated via algorithms, simulations or generative models—offers an alternative to real-world data [16,17]. A major benefit of synthetic data is that it can be generated in large quantities, which is particularly useful for training ML algorithms when real-world data is scarce or difficult to produce [16]. Synthetic data can then completely replace the need for real-world data or be used to augment existing datasets [18,19]. Another benefit is the greater control over data distribution, which allows, for instance, data to be evenly distributed across different classes or tailored to a specific task [17]. Synthetic data can also help with the privacy issues mentioned earlier by containing anonymized or de-identified data, which reduces the risk of exposing sensitive personal information [16,17].
However, generating synthetic data comes with its own challenges. The primary challenge is ensuring that the synthetic data is sufficiently realistic to faithfully represent real-world data [17,18]. If the synthetic data does not reflect the complexity and diversity of the real world, then the trained models can struggle and achieve poor performance when applied to real-world data [17]. Another challenge is avoiding the amplification of existing biases or the introduction of new ones, such as an inequitable representation of demographic groups [17,20]. If these challenges are not addressed, the generated synthetic data could produce models that exhibit behaviors misaligned with human expectations [17].
A commonly used model today for generating realistic human face images is StyleGAN2 [21], a generative adversarial network (GAN) [22] capable of high-quality results. The benefit of GANs for generating images is the ability to produce slight variations of the same image through small manipulations of its latent code [23]. These variations can affect features such as age, gender, hair color or even facial expression [24]. The recent state-of-the-art EmoStyle model [25] uses StyleGAN2’s latent space to edit facial expressions. EmoStyle is based on a continuous representation of emotions that uses two variables called valence and arousal (VA). Valence measures how positive or negative an emotion is, while arousal indicates the emotion’s level of activity or passivity. This continuous, two-dimensional representation permits a wider range of emotions than discrete representations can. Figure 1 illustrates the circumplex model of affect [26], which maps 28 different emotions based on their valence and arousal values.
Most prior studies on facial expression generation have focused on evaluating the perceptual quality, diversity, identity preservation, or classification accuracy of the generated images [25,27,28]. However, the extent to which generated facial expressions reproduce the targeted continuous VA coordinates across the VA space has not been systematically evaluated. This assessment is of crucial importance since downstream applications—such as training FER models—require synthetic data to be accurately labeled with the intended facial expressions.
This article makes the following contributions:
  • We propose a systematic evaluation protocol of EmoStyle’s facial expression accuracy across 28 VA directions and 4 intensity levels, with the objective of quantifying the alignment between the intended and perceived emotions.
  • Our evaluation identifies multiple weaknesses in EmoStyle’s ability to generate distinct and accurate facial expressions, notably a consistent failure region within the 242°–329° VA direction range.
  • We explore simple image-quality-based filtering methods to improve the quality of the generated images.
  • We release an open-source EmoStyle-based toolkit that comprises fixes to the original EmoStyle repository, an API wrapper, and experiment scripts (https://github.com/ClementDrn/emostyle-wrapper, accessed on 29 September 2025).
  • Finally, the study provides practical recommendations for enhancing the image generation pipeline, including expression accuracy correction, artifact removal, sunglasses masking, and temporal consistency improvements.

2. Related Work

2.1. Facial Expression Generation

Different approaches exist for generating numerous facial expressions with AI models, such as facial expression transfer and facial expression editing.
One such approach is emotion transfer from one image to another [29,30]. It uses two input images: one provides the source facial expression and the other one contains the target face. The model then transfers the facial expression from the former image onto the face of the latter image. The benefit of this method is the ability to reproduce any complex emotion from a single image onto different face images. Moreover, there are methods to generate high-quality face images with AI, making it straightforward to provide the target identity image. However, the downside is that one of the source images must contain a facial expression. As a result, labeled images of facial expressions are still necessary to generate new images. Nonetheless, it requires a smaller number of actors or image annotations, since a single source facial expression can be replicated across many faces.
A second approach represents the input emotion as a variable rather than as an image of a facial expression. This technique is called facial expression editing [23]. On one hand, the variable can hold a discrete, categorical value representing a specific emotion, such as one of the seven widely used: anger, disgust, fear, happiness, sadness, surprise, and the neutral state [31]. On the other hand, a continuous representation of emotions is also possible with one or more continuous emotional parameters, allowing for the representation of a wider range of emotions [25,31,32]. Examples of continuous representations include the circumplex model of affect, which uses VA [26], and a representation based on facial action units (AUs) [33], each of which encodes a facial muscle’s movement as a continuous value rather than a binary one. Because it takes one fewer image as input, this approach requires only a single face image and a set of values representing the desired emotion. Therefore, if the face image is also generated by AI, manual image labeling and actors are no longer needed, and the data generation process can be fully automated to create synthetic data.
EmoStyle [25] is a state-of-the-art model that falls into the facial expression editing category. It uses VA parameters to continuously represent emotions. EmoStyle is a generative model designed to separate emotions from other facial characteristics in order to edit facial expressions in images. A pre-trained generator from StyleGAN2 serves as the backbone of EmoStyle due to its rich latent space, which typically allows face attributes to be separated from other facial features. However, because EmoStyle operates in StyleGAN2’s latent space, editing an image also requires that image’s latent vector in StyleGAN2’s latent space. If the input image was generated by StyleGAN2, the latent vector is readily available. Otherwise, for a real image, a latent vector can be estimated using the adapted inversion method provided by EmoStyle’s authors [25].
In contrast to facial expression generation methods that use discrete representations of emotions, such as StarGAN [34], EmoStyle uses a continuous representation, enabling it to generate controlled variations of a given emotion. The generated images are also of high quality thanks to the use of StyleGAN2’s generator. EmoStyle has furthermore been demonstrated to be effective in identity preservation, meaning that the faces depicted in the input and output images look very similar even though the facial expressions differ. This enables generating multiple visually coherent facial expressions of the same person, which is needed for training identity-invariant FER models [35]. A comparative analysis of EmoStyle’s performance against other facial expression editing methods was conducted in [25]. This analysis indicates that EmoStyle outperforms other methods in terms of image quality, similarity between generated and real images, and identity preservation, while maintaining a high VA standard deviation. Regarding the VA distribution of the generated images, EmoStyle is also able to generate images within a wider range of VA vectors than the initial StyleGAN2 model [25].
However, EmoStyle’s VA distribution is not perfect either. As indicated in its original study, the model currently has difficulty generating emotions with arousal lower than 0.7 and positive valence [25]. Moreover, the accuracy of the facial expressions generated by EmoStyle has not been systematically evaluated. An evaluation of emotional accuracy would quantify the model’s ability to generate the intended facial expressions.

2.2. Synthetic Data Generation

A number of recent research works have explored synthetic data generation for facial expression-related tasks such as FER. An early investigation used 3D synthetic video sequences generated from facial AUs [36]. The study’s findings demonstrated that classifiers trained solely on synthetic data could outperform state-of-the-art models trained on real datasets [36]. Subsequently, another study advanced this direction by procedurally rendering approximately 100,000 high-fidelity and diverse synthetic faces, combining a procedurally generated parametric 3D model with other assets (e.g., expression, hair, clothing) [18]. Their findings indicated that models trained with this synthetic data could match the in-the-wild accuracy of models trained on real data for landmark localization and face parsing tasks, confirming the viability of synthetic FER data without real images [18]. Most recently, SynFER [37], a diffusion-based approach, synthesized expression-rich images using high-level text descriptions and facial AU control. SynFER achieved 67.2% accuracy on AffectNet when trained on synthetic data, which increased to 69.8% when scaling the synthetic dataset up to 5 times the original size, demonstrating consistently superior performance compared to real-data-only baselines [37].
However, these works represent emotions with AU or text labels, thereby leaving unexplored synthetic data based on VA representation.

3. Method

The employed method involves the procedural generation of facial expression images through the editing of StyleGAN2-generated face images with EmoStyle [25]. The pipeline consists of two primary components: first, the generation of face images, and second, the editing of the images to produce different facial expressions.

3.1. Face Image Generation

Face image generation was performed with a StyleGAN2 model pretrained on the FFHQ dataset [38], which contains 70,000 images at 1024 × 1024 resolution from Flickr. Along with the generated images, the latent vectors of the images are also saved to allow later editing. The images can optionally be filtered with MagFace [40], a state-of-the-art Face Image Quality Assessment (FIQA) [39] model. MagFace extracts a face feature vector $f_i$ for each image $i \in I$, and the magnitude of this vector yields a facial-recognition-driven quality score $s_i$. Note that MagFace only accepts images of size 112 × 112 pixels, so the generated images are first duplicated and resized to this resolution. We then regenerate the images whose quality score falls below a threshold $\tau$. The following equation expresses which images $I'$ are kept after filtering:

$$ I' = \{\, i \in I \mid s_i \geq \tau \,\}, \qquad \text{where } s_i = \lVert f_i \rVert_2 . $$
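To make the filtering step concrete, the following is a minimal Python sketch of the quality-based selection described above. It assumes a generic `extract_features` callable standing in for a MagFace feature extractor (the actual MagFace interface may differ), and the threshold value is purely illustrative.

```python
import numpy as np

def quality_score(features: np.ndarray) -> float:
    # MagFace-style quality score: the L2 magnitude of the face feature vector.
    return float(np.linalg.norm(features))

def filter_images(images, latents, extract_features, tau=25.0):
    """Keep only the (image, latent) pairs whose quality score s_i >= tau.

    `images` are PIL images, `latents` their StyleGAN2 latent vectors, and
    `extract_features` is a stand-in for a MagFace extractor that takes a
    112x112 RGB image and returns a 1-D feature vector. `tau` is illustrative.
    """
    kept = []
    for img, w in zip(images, latents):
        small = img.resize((112, 112))          # MagFace expects 112x112 inputs
        s_i = quality_score(extract_features(small))
        if s_i >= tau:                          # I' = { i in I | s_i >= tau }
            kept.append((img, w))
    return kept
```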

3.2. Facial Expression Editing

Facial expression editing was performed with EmoStyle. We used the official pretrained weights provided in the original work [25]. The previously generated face images are provided to EmoStyle along with their corresponding latent vectors, in addition to VA vectors representing the desired facial expressions. Since EmoStyle operates in StyleGAN2’s latent space, it can directly edit the images by manipulating their latent vectors. EmoStyle works here in the W+ latent space, since it was shown to be more effective than the W space [25]. For ease of use, the discrete emotion labels defined in [26] can also be used as input to our framework instead of VA vectors. The labels are then translated to the corresponding VA vector directions using the mappings in Table 1. The values in the table are based on measurements from the circumplex model of affect [26] and were reported in [41]. In this way, a VA vector can be represented by a direction (i.e., an angle) encoding the emotion type and a magnitude encoding the intensity or strength of the emotion, as shown in the following equation:
$$ \mathrm{VA} = \begin{bmatrix} \cos(\mathrm{direction}) \\ \sin(\mathrm{direction}) \end{bmatrix} \times \mathrm{magnitude} $$
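As an illustration of this conversion, the short Python sketch below maps a discrete emotion label to a VA vector from its direction angle and a chosen magnitude; only a few of the 28 angles from Table 1 are listed here for brevity.

```python
import math

# A few direction angles (in degrees) from Table 1; the full list has 28 emotions.
EMOTION_ANGLES = {"Happy": 8, "Angry": 99, "Sad": 208, "Calm": 316}

def emotion_to_va(emotion: str, magnitude: float) -> tuple:
    """Translate a discrete emotion label and an intensity into a (valence, arousal) pair."""
    angle = math.radians(EMOTION_ANGLES[emotion])
    return (math.cos(angle) * magnitude, math.sin(angle) * magnitude)

# Full-strength sadness lies in the negative-valence, negative-arousal quadrant:
print(emotion_to_va("Sad", 1.0))   # roughly (-0.88, -0.47)
```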

4. Experimental Results

Although EmoStyle has been evaluated in terms of image quality (i.e., FID, LPIPS), identity preservation, and VA distribution, it lacks an evaluation of facial expression accuracy that would ensure it can produce reliable synthetic data. The present article thus evaluates how well the images generated by EmoStyle match the VA vectors that were used to generate them.

4.1. Experimental Settings

Evaluating EmoStyle’s facial expression accuracy required several preliminary steps: (1) generating a set of random face images with StyleGAN2, (2) editing the facial expressions of the generated images using EmoStyle, and (3) predicting the VA vectors of the newly generated images with another model.
For our tests, we manually selected 100 different face images generated by StyleGAN2 that did not exhibit strong visual artifacts.
We then provided these images and their corresponding latent vectors to EmoStyle. For each of the 100 face images, we generated facial expression images for each of the 28 VA vector directions listed in Table 1, applying several vector magnitudes to each direction. The evaluation was conducted at different predefined emotion intensity levels (i.e., vector magnitudes) to assess how the model performs across varying intensities. To the best of our knowledge, no previous study has explicitly evaluated facial expression accuracy over several predefined emotion intensity levels. Through our experiments, we established that 4 equidistant levels offered a good balance between the number of evaluation samples and the observed difference between adjacent samples: 3 levels produce too few samples given how easily they can be distinguished, while 5 levels yield adjacent images whose differences become insignificant relative to the extra cost in sample size, as shown in Figure A3. The vector magnitudes used were thus 0.00, 0.33, 0.66 and 1.00. Through this editing process, we obtained 11,200 images, including 8401 unique facial expressions after removing duplicates with a VA vector magnitude of 0.00.
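For reference, the enumeration of target VA vectors can be sketched as follows; the angle list is assumed to come from Table 1, and magnitude 0.00 is deduplicated because it yields the same neutral vector regardless of direction.

```python
import itertools
import math

MAGNITUDES = [0.00, 0.33, 0.66, 1.00]

def build_va_targets(angles_deg):
    """Enumerate the unique target VA vectors: every direction at every intensity.
    All magnitude-0.00 combinations collapse to the neutral vector (0, 0)."""
    targets = set()
    for angle, mag in itertools.product(angles_deg, MAGNITUDES):
        rad = math.radians(angle)
        targets.add((round(math.cos(rad) * mag, 3), round(math.sin(rad) * mag, 3)))
    return sorted(targets)

# Each face image is edited with 28 directions x 4 magnitudes = 112 target vectors,
# giving 11,200 images for the 100 selected faces before deduplication.
```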
The last step was to predict the VA vectors of the generated images. For this, we used a state-of-the-art model capable of estimating the VA levels from face images [42]. Note that this model was also used in the training of EmoStyle to assist the network in producing realistic outputs [25]. After passing the 11,200 images to the VA prediction model, we obtained the same number of predicted VA vectors. With these predicted VA vectors, it is now possible to evaluate EmoStyle’s facial expression accuracy.

4.2. Results

The images generated by EmoStyle achieved high realism and quality, as shown in Figure 2. Also, as illustrated in Figure 3, the facial expressions change while preserving the identity of the faces, one of the main advantages of using EmoStyle. The different facial expressions also appeared to represent well the emotions intended by the VA vectors. Additionally, close emotions such as “sad”, “gloomy” and “depressed” are visually very similar, which can be explained by their proximity in the VA space.
Moreover, we observed artifacts in the background or the face of some images, as seen in Figure 2. These artifacts originate from the StyleGAN2 generation step and greatly decrease the realism and the quality of the generated dataset. As a workaround, we tried to filter out images with low quality scores as defined by the MagFace model. However, this quality score actually does not effectively evaluate realism but rather how recognizable the face is. Thus, filtering images based on the MagFace quality score mainly removes images containing baby faces or sunglasses rather than images with artifacts. Therefore, the MagFace quality score is not a good metric for the purpose of filtering out artifacts generated by StyleGAN2.
Regarding the quantitative evaluation of EmoStyle’s facial expression accuracy, we compared the predicted VA vectors with the original VA vectors used to generate the images.
As seen in Figure 4, the distributions of the original VA vectors and the predicted VA vectors differ substantially for some emotions. This is typically the case for the emotions with VA direction angles between 242–329°, for which the predicted VA vectors are all clustered around outer VA directions and tend to have lower magnitudes than the original VA vectors. These emotions are “Bored”, “Droopy”, “Tired”, “Sleepy”, “Calm”, “Relaxed”, “Satisfied”, “At ease”, “Content” and “Serene”. As shown in Figure A2, an additional evaluation was conducted on a 144-cell grid covering the entire VA space, which confirms this observation.
Moreover, Figure 5 shows that this lack of magnitude generalizes to all tested VA vectors with an expected magnitude of 1.0, and nearly to the entire VA space, as shown in Figure A2c. However, some direction angles, such as those between 208° and 211°, perform better and have a magnitude error close to 0. Table 2 shows that the emotions that perform best in terms of direction error are “Astonished”, “Aroused”, “Distressed”, “Frustrated”, “Sad”, “Gloomy” and “Depressed”, with direction errors below the 25% percentile. In terms of distance error, those that perform best are “Excited”, “Astonished”, “Aroused”, “Frustrated”, “Sad”, “Gloomy” and “Depressed”. Nevertheless, a low mean absolute error (MAE) does not necessarily imply a low standard deviation (SD); for instance, the “Frustrated” emotion has a distance MAE of 0.25 with an SD of 0.14, whereas “Miserable” has a distance MAE of 0.36 with an SD of 0.08, almost half as large. The well-performing areas in which these emotions are located can also be easily identified in Figure A2d.
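The error measures reported in Figure 5 and Table 2 can be computed along the following lines; this is one plausible formulation (the exact conventions used in the article, e.g., for angle wrapping, may differ slightly).

```python
import numpy as np

def va_errors(expected: np.ndarray, predicted: np.ndarray) -> dict:
    """Per-sample absolute errors between expected and predicted VA vectors.
    Both arrays have shape (N, 2) with columns (valence, arousal)."""
    val_err = np.abs(predicted[:, 0] - expected[:, 0])
    aro_err = np.abs(predicted[:, 1] - expected[:, 1])
    dist_err = np.linalg.norm(predicted - expected, axis=1)   # Euclidean distance
    mag_err = np.abs(np.linalg.norm(predicted, axis=1)
                     - np.linalg.norm(expected, axis=1))
    # Direction error in degrees, wrapped into [0, 180]
    dir_pred = np.degrees(np.arctan2(predicted[:, 1], predicted[:, 0]))
    dir_exp = np.degrees(np.arctan2(expected[:, 1], expected[:, 0]))
    dir_err = np.abs((dir_pred - dir_exp + 180.0) % 360.0 - 180.0)
    return {"direction": dir_err, "magnitude": mag_err,
            "valence": val_err, "arousal": aro_err, "distance": dist_err}

# Per-emotion MAE and SD, as reported in Table 2:
# errs = va_errors(expected, predicted)
# print(errs["distance"].mean(), errs["distance"].std())
```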
Furthermore, the images produced for the neutral emotion, i.e., the zero VA vectors (VA magnitude of 0.0), are not accurately measured as neutral. The predicted VA vectors for the neutral emotion have a mean position of (0.0666, 0.1390) in the VA space, corresponding to a slightly happier average facial expression, as illustrated in Figure 4c. It is important to note that there are also large outliers for the neutral emotion, with some predicted VA vectors lying further than a distance of 0.75 from the (0, 0) origin.
Based on our supplementary evaluation on the FFHQ dataset (i.e., the training dataset of StyleGAN2), illustrated in Figure A1, the dataset’s VA distribution is similar to StyleGAN2’s VA distribution, in which the positive-valence and low-arousal area is under-represented. Therefore, EmoStyle’s accuracy gap in this specific area can be linked to the lack of strong facial expressions for the same area in the training data of StyleGAN2. Note that the VA prediction model’s ability to estimate VA vectors is likewise shaped by its own training data distribution.
Depending on how critical the application using the generated images is and on the range of facial expressions to be generated, the observed errors may be more or less acceptable. For instance, if images are only generated for the seven basic emotions (i.e., happiness, sadness, anger, surprise, disgust, fear, and neutral), the model performs well enough for most applications. However, for a wider range of emotions, the observed errors can become problematic. In any case, accuracy in the low-arousal, positive-valence area of the VA space, as well as the limited intensity of the generated facial expressions, should clearly be improved.
Another issue with EmoStyle’s editing can be observed when editing images that include facial accessories such as sunglasses. As illustrated in Figure 6, EmoStyle can make them fade away, especially when generating sad facial expressions. EmoStyle provides a masking mechanism for some parts of the image so that elements such as the background do not change when editing the facial expressions. However, this masking is not applied to sunglasses or other face accessories [25].

5. Discussion

5.1. Main Contribution

The present study makes several contributions to the understanding and practical use of EmoStyle for generating synthetic facial expression data.
First, a systematic evaluation of EmoStyle’s expression accuracy was conducted across 28 VA directions and 4 intensity levels, complemented by a broader 144-cell grid evaluation. Our evaluation demonstrates that while the majority of facial expressions are accurately generated, EmoStyle consistently underperforms in multiple areas of the VA space, particularly in the region between 242–329°. This performance gap was attributed to imbalances in the distribution of StyleGAN2’s training data. This finding underscores a concrete limitation that can be addressed in future research.
Secondly, we explored methods to enhance the quality of generated images. Artifact filtering based on MagFace’s quality score was found to be ineffective, as it evaluates face recognizability rather than image quality and realism. This finding underscores the need to develop more suitable quality assessment methods in future research. Moreover, an issue was identified with EmoStyle wherein sunglasses vanish during expression editing, particularly for expressions close to the “Sad” emotion. This indicates that, while EmoStyle demonstrates efficacy in identity preservation, it lacks accessory awareness.
Finally, we release an open-source toolkit at https://github.com/ClementDrn/emostyle-wrapper (accessed on 29 September 2025) that integrates fixes to the original EmoStyle repository, an API wrapper, and experiment scripts. This resource enables reproducibility and facilitates future research on the topic.
In summary:
  • This study proposes a systematic VA-space evaluation protocol that is applied to EmoStyle across a broad discretization of the VA space.
  • We identify and analyze a consistent weakness in VA directions 242–329°, linked to training data imbalance.
  • We provide practical insights on limitations of current quality filtering and accessory handling.
  • An open-source toolkit is released with repository fixes, an API wrapper, and experiment scripts to enable reproducible research.

5.2. Limitations

Even though the current evaluation provides good insight into the facial expression accuracy of EmoStyle, the same VA prediction model is used for both training and evaluation, so the evaluation is not fully independent and therefore not entirely reliable. Still, if the evaluation is biased, it is biased in a way that favors EmoStyle: since it already reveals some of EmoStyle’s flaws, the results would only be worse with a different VA prediction model, and the issues identified here therefore remain valid. The evaluation can nevertheless be replicated with alternative VA prediction models to strengthen the findings of this study and provide a more reliable assessment. Alternatives include, for instance, the well-performing EmotiEffLib [43,44] and AffectNet [31] models.
Another limitation of this evaluation is the use of a fixed set of VA vectors to generate the images instead of random VA vectors, which means the evaluation is not fully representative of the entire VA space. Using random VA vectors would have allowed us to evaluate EmoStyle’s facial expression accuracy on a wider range of emotions. The current evaluation focuses primarily on the emotions that would be used in the practical application of this study, which deals with a defined and finite set of emotions. An additional 144-cell grid evaluation was also conducted, which indeed allows for a broader evaluation of the VA space. However, it still discretizes the VA space and does not cover it entirely, whereas random VA vectors can better approximate the full VA space. Although the model’s performance on untested VA vectors is uncertain, StyleGAN models are capable of interpolating seamlessly between different classes in their latent space [38]. Consequently, the performance on nearby VA vectors should generalize to untested VA vectors, with only a few small irregularities, as observed in Figure A2.
The proposed filtering step based on MagFace’s quality score [40] is not capable of removing images with artifacts that could reduce the quality of the final synthetic dataset. Instead, this filtering step can reduce the diversity of the generated images. Therefore, it should not be used for this purpose in this pipeline, and alternative filtering methods should be explored in future work. Other preliminary experiments that we conducted suggest that the GIQA image quality assessment method [45] is a promising candidate for this task. Moreover, this method is designed to evaluate image quality and realism rather than face recognizability.
The 100 generated face images used as the database for the evaluation were selected by manual inspection to ensure the absence of strong visual artifacts. This selection method allows us to assess the quality of EmoStyle’s results in the real-world scenario where the input images are filtered. However, it introduces a selection bias that may not accurately represent the raw performance of EmoStyle.

5.3. Future Work

One of the main observations highlighted by this study is that some VA vectors might not be well represented. Ideally, every different area of the VA domain space should produce clearly distinct image results to allow the model to output a wider spectrum of emotions. This suggests that future work should investigate the reasons behind these observations and try to fix them.
In order to improve the accuracy of facial expressions, we propose oversampling vectors in the poorly performing areas of the VA space during training, while maintaining uniform sampling elsewhere. We suggest incorporating an adapter module [46] within the EmoStyle architecture [25], trained on the adjusted training dataset while the remaining model parameters are frozen. This new small network would take the output of EmoStyle’s EmoExtract module [25] and apply a correction to the latent vector before passing it to the StyleGAN2 generator [21]. The adapter would have its own emotion correction losses while retaining the original identity, background and reconstruction losses. The advantage of this adapter method is that it isolates the correction from the rest of the model, thereby preventing the latter from forgetting what it has already learned [46]. Additional changes, such as basing the adapter losses on direction and magnitude values rather than valence and arousal values, could also be explored. It is also advisable to consider an alternative VA predictor model and assess its impact on performance.
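To make the proposal more tangible, below is a minimal PyTorch-style sketch of such an adapter. It assumes the EmoExtract output can be treated as a single latent offset per image and that the VA target is available as a 2-dimensional tensor; the module name, dimensions, and architecture are illustrative, not part of EmoStyle.

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """Illustrative adapter: learns a small residual correction to the latent offset
    produced by EmoStyle's EmoExtract module, before it is combined with the W+
    latent and passed to the frozen StyleGAN2 generator."""

    def __init__(self, latent_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden_dim),  # latent offset + (valence, arousal)
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, emo_offset: torch.Tensor, va: torch.Tensor) -> torch.Tensor:
        # Residual form: when the correction is zero, the original EmoStyle
        # behaviour is recovered, so the frozen backbone is not disturbed.
        correction = self.net(torch.cat([emo_offset, va], dim=-1))
        return emo_offset + correction

# Training sketch: freeze EmoStyle and StyleGAN2, optimise only the adapter on a
# dataset that oversamples the poorly performing 242-329 degree region of the VA space.
```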
Additionally, ways to prevent StyleGAN2 from producing artifacts in the generated images should be explored. One possible solution is to find a filtering method suited to spotting artifacts in the generated images. We believe that the GIQA [45] assessment method is a promising candidate for filtering out images with low realism and quality. The problem could also be addressed at the source by improving the training data or the model architecture of StyleGAN2.
Another area for improvement is the “vanishing sunglasses” issue. We believe that the masking solutions in EmoStyle’s pipeline [25] can be extended to cover face accessories, such as sunglasses, to prevent them from being altered when editing the facial expressions. The area to mask can be identified with a sunglasses detection model. If the masking reduces the VA predictor model’s accuracy during training, a potential solution is to reapply the sunglasses area once the facial expression has been edited.
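As a simple post-processing sketch of this reapplication idea (not part of EmoStyle's pipeline), the accessory region of the original image can be composited back onto the edited image, assuming a binary sunglasses mask is available from a detector:

```python
import numpy as np

def reapply_accessory(original: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Paste the accessory region of the original image back onto the edited image.

    `original` and `edited` are HxWx3 uint8 arrays of the same size; `mask` is an
    HxW boolean array marking the sunglasses region (e.g., from a detector).
    """
    result = edited.copy()
    result[mask] = original[mask]
    return result
```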
Finally, since our framework can be used to generate multiple facial expressions for the same face image, it is important that the depicted identity remains consistent. In the StyleGAN3 article [47], StyleGAN2’s architecture was improved to allow for better temporal consistency when multiple frames based on the same original image are generated, with the purpose of producing videos. We believe that replacing StyleGAN2 with StyleGAN3 [47] in the EmoStyle pipeline would improve the quality of the generated dataset for applications where consistency between different facial expressions of the same person is important.

6. Conclusions

This article presents a systematic evaluation of EmoStyle’s capacity to generate synthetic facial expression images [25]. A novel evaluation method is proposed to assess the accuracy of generated expressions across 28 VA directions and 4 intensity levels, complemented by a broader 144-cell grid analysis. The findings demonstrate that EmoStyle is capable of producing results that are generally realistic and identity-preserving. However, they also reveal consistently weak accuracy for VA directions within the 242–329° range. This performance gap notably affects the emotions “Bored”, “Droopy”, “Tired”, “Sleepy”, “Calm”, “Relaxed”, “Satisfied”, “At ease”, “Content”, and “Serene”. It is likely linked to imbalances in the distribution of StyleGAN2’s training data [38].
The findings thus confirm that EmoStyle has the potential to produce high-quality synthetic datasets for FER-related tasks. Nevertheless, the observed limitations highlight the current trade-offs. While realism and identity preservation are particularly strong, accuracy and expressiveness are limited in certain VA regions, which may negatively impact downstream tasks that require precise emotion labeling. Therefore, it is essential to balance priorities between visual realism and expression accuracy before using data generated by this framework.
To further facilitate research, an open-source toolkit was released at https://github.com/ClementDrn/emostyle-wrapper (accessed on 29 September 2025). It integrates fixes to the original EmoStyle repository, an API wrapper, and scripts from our experiments. Furthermore, we identified areas for future improvements including (1) the implementation of more suitable quality-based filtering such as GIQA [45]; (2) migration to StyleGAN3 [47] for enhanced temporal consistency; (3) extension of EmoStyle’s masking solutions to preserve accessories such as sunglasses during editing [25]; and (4) the incorporation of an adapter module to address underrepresented areas of the VA space [46].

Author Contributions

Conceptualization, methodology, investigation, software, and data analysis were performed by C.G.D.D.; original draft of the article was written by C.G.D.D. under the supervision of C.Q. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Number JP25K15078.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code of the framework presented in this article is openly available on GitHub at https://github.com/ClementDrn/emostyle-wrapper (accessed on 29 September 2025). The original data presented in the study can be reproduced using the provided source code.

Acknowledgments

During the preparation of this study, the authors used code generated with OpenAI ChatGPT-5 for the purpose of generating the article’s diagrams. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI      Artificial Intelligence
AU      Action Unit
FER     Facial Expression Recognition
FID     Fréchet Inception Distance
FIQA    Face Image Quality Assessment
GAN     Generative Adversarial Network
LPIPS   Learned Perceptual Image Patch Similarity
MAE     Mean Absolute Error
ML      Machine Learning
SD      Standard Deviation
VA      Valence and Arousal

Appendix A

Figure A1. Distribution of the VA vectors of the FFHQ dataset. VA vectors were predicted using a state-of-the-art model [42]. The distribution resembles that of StyleGAN2, showing a notable under-representation of low-arousal expressions, especially those with positive valence.
Figure A2. Mean valence, arousal and distance (i.e., VA vector magnitude) errors depending on the expected VA vectors. 100 images were edited with 441 different evenly spaced VA vectors (every 0.1 on both axes). The mean errors were then calculated based on the predicted VA vectors of the edited images. Generally, we observe that the errors are higher in the positive-valence and low-arousal area. (a) Valence error tends to be positive in the negative valence area and negative in the positive valence area, with exceptions, notably in the low arousal area. (b) Arousal error tends to be positive in the low arousal area and negative or smaller in the high arousal area; the error reaches a peak higher than 0.8 in the positive valence and low arousal area. (c) Arrows help visualize the direction and the magnitude of the error vectors; arrow lengths are logarithmically scaled. Errors tend to point towards the center of the VA space and never diverge away from it. (d) High distance error is clearly distinguishable in the positive valence and low arousal area. Four regions with small error appear near (0.0, 0.8), (0.6, 0.8), (0.4, 0.8) and (0.3, 0.8), corresponding approximately to the emotions “Depressed”, “Frustrated”, “Astonished” and “Pleased”.
Figure A3. Sample of generated images for different numbers of discretization classes of the VA magnitude. With 5 classes, the magnitudes are 0.00, 0.25, 0.50, 0.75 and 1.00. With 4 classes, the magnitudes are 0.00, 0.33, 0.66 and 1.00. With 3 classes, the magnitudes are 0.00, 0.50 and 1.00. Sample images were generated for the distinct emotions “Happy”, “Angry”, “Sad” and “Tired”. Using 3 classes produces highly different images that could benefit from an additional class. Using 5 classes allows for more subtle differences between the generated images and thereby a finer analysis; however, it does not provide a significant improvement compared to 4 classes, especially for low-arousal facial expressions.

References

  1. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2022, 13, 1195–1215. [Google Scholar] [CrossRef]
  2. Villalobos, P.; Sevilla, J.; Besiroglu, T.; Heim, L.; Ho, A.; Hobbhahn, M. Machine Learning Model Sizes and the Parameter Gap. arXiv 2022, arXiv:2207.02852. [Google Scholar] [CrossRef]
  3. Bellman, R. Dynamic Programming. Science 1966, 153, 34–37. [Google Scholar] [CrossRef] [PubMed]
  4. Crespo Márquez, A. The curse of dimensionality. In Digital Maintenance Management: Guiding Digital Transformation in Maintenance; Springer Series in Reliability Engineering; Springer: Cham, Switzerland, 2022; pp. 67–86. [Google Scholar]
  5. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2019, 10, 18–31. [Google Scholar] [CrossRef]
  6. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
  7. Kollias, D.; Zafeiriou, S. Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition. arXiv 2019, arXiv:1811.07770. [Google Scholar]
  8. Fabian Benitez-Quiroz, C.; Srinivasan, R.; Martinez, A.M. EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  9. Ramis, S.; Buades Rubio, J.M.; Perales, F.; Manresa-Yee, C. A Novel Approach to Cross dataset studies in Facial Expression Recognition. Multimed. Tools Appl. 2022, 81, 39507–39544. [Google Scholar] [CrossRef]
  10. Boutros, F.; Huber, M.; Siebke, P.; Rieber, T.; Damer, N. SFace: Privacy-friendly and Accurate Face Recognition using Synthetic Data. arXiv 2022, arXiv:2206.10520. [Google Scholar] [CrossRef]
  11. Harvey, A. Exposing.ai. Available online: https://exposing.ai (accessed on 29 September 2025).
  12. Kemelmacher-Shlizerman, I.; Seitz, S.; Miller, D.; Brossard, E. The MegaFace Benchmark: 1 Million Faces for Recognition at Scale. arXiv 2015, arXiv:1512.00596. [Google Scholar] [CrossRef]
  13. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. arXiv 2016, arXiv:1607.08221. [Google Scholar]
  14. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A dataset for recognising faces across pose and age. arXiv 2018, arXiv:1710.08092. [Google Scholar] [CrossRef]
  15. Peng, K.; Mathur, A.; Narayanan, A. Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers. arXiv 2021, arXiv:2108.02922. [Google Scholar] [CrossRef]
  16. Nikolenko, S.I. Synthetic Data for Deep Learning. arXiv 2019, arXiv:1909.11512. [Google Scholar] [CrossRef]
  17. Liu, R.; Wei, J.; Liu, F.; Si, C.; Zhang, Y.; Rao, J.; Zheng, S.; Peng, D.; Yang, D.; Zhou, D.; et al. Best Practices and Lessons Learned on Synthetic Data. arXiv 2024, arXiv:2404.07503. [Google Scholar] [CrossRef]
  18. Wood, E.; Baltrušaitis, T.; Hewitt, C.; Dziadzio, S.; Johnson, M.; Estellers, V.; Cashman, T.J.; Shotton, J. Fake It Till You Make It: Face analysis in the wild using synthetic data alone. arXiv 2021, arXiv:2109.15102. [Google Scholar] [CrossRef]
  19. Jaipuria, N.; Zhang, X.; Bhasin, R.; Arafa, M.; Chakravarty, P.; Shrivastava, S.; Manglani, S.; Murali, V.N. Deflating Dataset Bias Using Synthetic Data Augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  20. Xu, T.; White, J.; Kalkan, S.; Gunes, H. Investigating Bias and Fairness in Facial Expression Recognition. arXiv 2020, arXiv:2007.10075. [Google Scholar] [CrossRef]
  21. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. arXiv 2020, arXiv:1912.04958. [Google Scholar] [CrossRef]
  22. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  23. Melnik, A.; Miasayedzenkau, M.; Makaravets, D.; Pirshtuk, D.; Akbulut, E.; Holzmann, D.; Renusch, T.; Reichert, G.; Ritter, H. Face Generation and Editing with StyleGAN: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3557–3576. [Google Scholar] [CrossRef]
  24. Abdal, R.; Zhu, P.; Mitra, N.J.; Wonka, P. StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images using Conditional Continuous Normalizing Flows. ACM Trans. Graph. 2021, 40, 1–21. [Google Scholar] [CrossRef]
  25. Azari, B.; Lim, A. EmoStyle: One-Shot Facial Expression Editing Using Continuous Emotion Parameters. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6373–6382. [Google Scholar] [CrossRef]
  26. Russell, J. A Circumplex Model of Affect. J. Personal. Soc. Psychol. 1980, 39, 1161–1178. [Google Scholar] [CrossRef]
  27. Pumarola, A.; Agudo, A.; Martinez, A.M.; Sanfeliu, A.; Moreno-Noguer, F. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 818–833. [Google Scholar]
  28. Han, S.; Guo, Y.; Zhou, X.; Huang, J.; Shen, L.; Luo, Y. A chinese face dataset with dynamic expressions and diverse ages synthesized by deep learning. Sci. Data 2023, 10, 878. [Google Scholar] [CrossRef] [PubMed]
  29. Yang, C.; Lim, S.N. Unconstrained Facial Expression Transfer using Style-based Generator. arXiv 2019, arXiv:1912.06253. [Google Scholar] [CrossRef]
  30. Hu, X.; Aldausari, N.; Mohammadi, G. 2CET-GAN: Pixel-Level GAN Model for Human Facial Expression Transfer. arXiv 2022, arXiv:2211.11570. [Google Scholar]
  31. Kollias, D.; Zafeiriou, S. Affect Analysis in-the-wild: Valence-Arousal, Expressions, Action Units and a Unified Framework. arXiv 2021, arXiv:2103.15792. [Google Scholar]
  32. Ding, H.; Sricharan, K.; Chellappa, R. ExprGAN: Facial Expression Editing with Controllable Expression Intensity. arXiv 2017, arXiv:1709.03842. [Google Scholar] [CrossRef]
  33. Ekman, P.; Friesen, W.V. Facial Action Coding System: A Technique for the Measurement of Facial Movement; Consulting Psychologists Press: Palo Alto, CA, USA, 1978. [Google Scholar]
  34. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. arXiv 2018, arXiv:1711.09020. [Google Scholar]
  35. Kim, D.; Song, B.C. Optimal Transport-based Identity Matching for Identity-invariant Facial Expression Recognition. arXiv 2022, arXiv:2209.12172. [Google Scholar]
  36. Abbasnejad, I.; Sridharan, S.; Nguyen, D.; Denman, S.; Fookes, C.; Lucey, S. Using Synthetic Data to Improve Facial Expression Analysis with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Venice, Italy, 22–29 October 2017. [Google Scholar]
  37. He, X.; Luo, C.; Xian, X.; Li, B.; Song, S.; Khan, M.H.; Xie, W.; Shen, L.; Ge, Z. SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data. arXiv 2024, arXiv:2410.09865. [Google Scholar] [CrossRef]
  38. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv 2019, arXiv:1812.04948. [Google Scholar] [CrossRef]
  39. Schlett, T.; Rathgeb, C.; Henniger, O.; Galbally, J.; Fierrez, J.; Busch, C. Face Image Quality Assessment: A Literature Survey. ACM Comput. Surv. 2022, 54, 1–49. [Google Scholar] [CrossRef]
  40. Meng, Q.; Zhao, S.; Huang, Z.; Zhou, F. MagFace: A Universal Representation for Face Recognition and Quality Assessment. arXiv 2021, arXiv:2103.06627. [Google Scholar] [CrossRef]
  41. Paplu, S.H.; Mishra, C.; Berns, K. Real-time Emotion Appraisal with Circumplex Model for Human-Robot Interaction. arXiv 2022, arXiv:2202.09813. [Google Scholar]
  42. Toisoul, A.; Kossaifi, J.; Bulat, A.; Tzimiropoulos, G.; Pantic, M. Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nat. Mach. Intell. 2021, 3, 42–50. [Google Scholar] [CrossRef]
  43. Savchenko, A.V.; Savchenko, L.V.; Makarov, I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Trans. Affect. Comput. 2022, 13, 2132–2143. [Google Scholar] [CrossRef]
  44. Savchenko, A. Facial Expression Recognition with Adaptive Frame Rate based on Multiple Testing Correction. In Proceedings of the 40th International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; JMLR, Inc.: Cambridge, MA, USA, 2023; Volume 202, pp. 30119–30129. [Google Scholar]
  45. Gu, S.; Bao, J.; Chen, D.; Wen, F. GIQA: Generated Image Quality Assessment. arXiv 2020, arXiv:2003.08932. [Google Scholar] [CrossRef]
  46. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; JMLR, Inc.: Cambridge, MA, USA; pp. 2790–2799. [Google Scholar]
  47. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-Free Generative Adversarial Networks. arXiv 2021, arXiv:2106.12423. [Google Scholar] [CrossRef]
Figure 1. The circumplex model of affect [26]. 28 different emotions are represented based on valence (positive-negative) and arousal (excited-calm).
Figure 2. Sample images of random faces and random expressions generated by our framework based on EmoStyle [25]. The images depict high realism and quality, as well as a variety of facial expressions. In the rightmost column’s images, artifacts can be observed on the faces, which originate from StyleGAN2’s generation step [21].
Figure 3. Images generated by EmoStyle for all emotions in Table 1 and for strengths 0.0, 0.33, 0.66 and 1.0. Facial expressions change smoothly and represent well the intended emotions, while preserving the identity of the faces. Some facial expressions are visually very similar, such as “sad”, “gloomy” and “depressed”, which can be explained by their proximity in the VA space.
Figure 4. Distribution of the predicted VA vectors for each of the 28 original VA vectors. For each circle representing one original VA pair, there are 100 triangles of the same color representing the predicted VA vectors from the images generated with that original VA vector. (a,b) show that vector directions between 242–329° produce results whose predictions are all clustered around outer VA directions and tend to have lower magnitudes than expected. (c) shows that the predicted VA vectors for the neutral expression (i.e., expected magnitude 0.0) are slightly biased towards positive valence and high arousal, and have a mean of (0.0666, 0.1390).
Figure 5. Direction (left, green) and magnitude (right, orange) errors of the predicted VA vectors for each of the 28 emotions with strength 1.0. The predicted magnitude is always lower than expected. However, emotions between 208° and 211° perform better and have a lower magnitude error. Direction error is especially high for emotions between 242–329°, and particularly from 316°, which confirms the observations from Figure 4.
Figure 6. Images generated by EmoStyle where sunglasses fade away. The left image is the original image, while on the right side are sad (208°) facial expressions generated with EmoStyle from strength 0.0 to strength 1.0 (left to right).
Table 1. List of VA vectors used for the EmoStyle editing. The VA coordinates were calculated using a magnitude of 1.0 and the direction angles defined in the circumplex model of affect [26], as reported in [41]. Angles in the table were rounded to the nearest integer.
Emotion      Angle (°)   Valence   Arousal
Happy        8            0.991     0.136
Delighted    25           0.907     0.421
Excited      49           0.661     0.750
Astonished   70           0.345     0.938
Aroused      74           0.279     0.960
Tense        93          -0.049     0.999
Alarmed      97          -0.113     0.994
Angry        99          -0.156     0.988
Afraid       116         -0.438     0.899
Annoyed      123         -0.545     0.839
Distressed   138         -0.743     0.669
Frustrated   141         -0.777     0.629
Miserable    189         -0.988    -0.151
Sad          208         -0.887    -0.462
Gloomy       209         -0.875    -0.485
Depressed    211         -0.857    -0.515
Bored        242         -0.469    -0.883
Droopy       257         -0.230    -0.973
Tired        268         -0.040    -0.999
Sleepy       272          0.033    -0.999
Calm         316          0.722    -0.692
Relaxed      318          0.743    -0.669
Satisfied    319          0.755    -0.656
At ease      321          0.777    -0.629
Content      323          0.799    -0.602
Serene       329          0.854    -0.521
Glad         349          0.982    -0.191
Pleased      353          0.993    -0.118
Table 2. Mean absolute errors (MAEs) of the predicted VA vectors for each of the 28 VA emotions with strength 1.0. Presented errors are direction error, magnitude error, valence error, arousal error and distance error. Distance error is the Euclidean distance between the predicted VA vector and the expected VA vector. Results in the 25% lowest percentile are highlighted with a ↓ marker. Standard deviations are included for all errors as well.
Emotion      Angle (°)   Dir. MAE (°)    Mag. MAE       Val. MAE       Arous. MAE     Dist. MAE
Happy        8           10.2 ± 8.3      0.22 ± 0.15    0.23 ± 0.17    0.14↓ ± 0.08   0.30 ± 0.14
Delighted    25          14.9 ± 7.9      0.19 ± 0.12    0.16 ± 0.13    0.25 ± 0.13    0.33 ± 0.11
Excited      49          10.7 ± 9.3      0.15↓ ± 0.09   0.12↓ ± 0.10   0.19 ± 0.15    0.25↓ ± 0.14
Astonished   70          7.0↓ ± 6.5      0.16↓ ± 0.09   0.11↓ ± 0.08   0.16 ± 0.10    0.21↓ ± 0.11
Aroused      74          7.0↓ ± 5.6      0.17↓ ± 0.09   0.10↓ ± 0.08   0.17 ± 0.09    0.22↓ ± 0.10
Tense        93          9.6 ± 7.0       0.23 ± 0.09    0.13↓ ± 0.09   0.24 ± 0.10    0.29 ± 0.10
Alarmed      97          10.7 ± 7.8      0.24 ± 0.10    0.15 ± 0.10    0.25 ± 0.10    0.31 ± 0.11
Angry        99          11.4 ± 8.8      0.24 ± 0.09    0.16 ± 0.11    0.25 ± 0.10    0.31 ± 0.11
Afraid       116         13.8 ± 10.0     0.26 ± 0.10    0.21 ± 0.16    0.24 ± 0.10    0.35 ± 0.13
Annoyed      123         11.7 ± 9.7      0.26 ± 0.11    0.20 ± 0.16    0.23 ± 0.09    0.33 ± 0.13
Distressed   138         9.2↓ ± 7.5      0.20 ± 0.14    0.17 ± 0.15    0.18 ± 0.07    0.27 ± 0.13
Frustrated   141         8.6↓ ± 7.3      0.19↓ ± 0.15   0.16 ± 0.16    0.16↓ ± 0.07   0.25↓ ± 0.14
Miserable    189         12.0 ± 7.9      0.27 ± 0.11    0.29 ± 0.09    0.16↓ ± 0.12   0.36 ± 0.08
Sad          208         6.8↓ ± 4.1      0.15↓ ± 0.09   0.14 ± 0.08    0.11↓ ± 0.09   0.20↓ ± 0.09
Gloomy       209         6.1↓ ± 3.9      0.14↓ ± 0.09   0.13 ± 0.08    0.11↓ ± 0.09   0.19↓ ± 0.09
Depressed    211         5.2↓ ± 3.8      0.14↓ ± 0.09   0.12↓ ± 0.07   0.10↓ ± 0.09   0.17↓ ± 0.09
Bored        242         20.9 ± 6.0      0.20 ± 0.13    0.18 ± 0.13    0.33 ± 0.13    0.41 ± 0.09
Droopy       257         29.8 ± 8.5      0.36 ± 0.16    0.23 ± 0.15    0.50 ± 0.12    0.58 ± 0.09
Tired        268         32.8 ± 12.8     0.50 ± 0.16    0.26 ± 0.16    0.60 ± 0.12    0.68 ± 0.09
Sleepy       272         33.0 ± 15.1     0.55 ± 0.16    0.27 ± 0.16    0.63 ± 0.12    0.71 ± 0.09
Calm         316         41.2 ± 7.1      0.20 ± 0.15    0.16 ± 0.10    0.64 ± 0.08    0.67 ± 0.08
Relaxed      318         39.6 ± 7.1      0.20 ± 0.15    0.15 ± 0.09    0.62 ± 0.08    0.65 ± 0.08
Satisfied    319         38.8 ± 7.1      0.20 ± 0.15    0.14 ± 0.09    0.61 ± 0.08    0.64 ± 0.08
At ease      321         37.1 ± 7.3      0.19 ± 0.15    0.14 ± 0.10    0.59 ± 0.08    0.61 ± 0.09
Content      323         35.3 ± 7.5      0.19 ± 0.15    0.13↓ ± 0.10   0.57 ± 0.09    0.59 ± 0.09
Serene       329         30.3 ± 7.6      0.20 ± 0.14    0.12↓ ± 0.11   0.49 ± 0.09    0.52 ± 0.11
Glad         349         12.9 ± 10.0     0.24 ± 0.15    0.23 ± 0.16    0.20 ± 0.11    0.32 ± 0.17
Pleased      353         9.7 ± 10.8      0.24 ± 0.16    0.25 ± 0.17    0.13↓ ± 0.11   0.29 ± 0.19
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
