Article
Peer-Review Record

Generating Synthetic Facial Expression Images Using EmoStyle

Appl. Sci. 2025, 15(19), 10636; https://doi.org/10.3390/app151910636
by Clément Gérard Daniel Darne 1,2,*, Changqin Quan 1,* and Zhiwei Luo 1
Reviewer 1: Anonymous
Reviewer 3: Anonymous
Submission received: 21 July 2025 / Revised: 21 September 2025 / Accepted: 26 September 2025 / Published: 1 October 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Although the paper provides a systematic assessment of the performance of the EmoStyle model throughout the valence-arousal (VA) space, the genuine contributions of the work seem exaggerated. The majority of the paper relies on the execution of pre-existing models (StyleGAN2 for image synthesis, EmoStyle for facial expression manipulation, and Toisoul et al. for VA estimation), with the absence of new methodology or meaningful technical advancement.

The purported contributions are as follows:

(1) Creating synthetic facial expression images with EmoStyle

(2) Assessing EmoStyle's expression accuracy throughout the VA space

(3) Identifying vulnerabilities in certain VA sectors (242°–328.6°)

(4) Providing an open-source wrapper for EmoStyle

Statement (1) simply repeats what was already demonstrated in the EmoStyle paper.

Conclusion (2) is solely based on existing VA predictive models with no methodological novelty.

Point (3) amounts to empirical observations without further analysis or modeling to explain or resolve the identified issues.

Point (4) is limited to packaging existing functionalities in a wrapper, which is helpful but not a technical contribution.

In summary, the paper is at the level of a test report of an existing model, not providing new algorithmic intuitions, theory contributions, or new applications. Presenting this as a key contribution is thus potentially misleading. In order for the paper to be in suitable form for publication, the authors could add either (a) more thorough testing involving multiple models instead of the use of a single model, (b) practical suggestions for enhancing the performance of the basic model, or (c) new use cases and applications that go beyond the original intent of EmoStyle.

Author Response

Thank you for your review.

I have attached the "latexdiff" between the previous submission and the current one, which highlights what has changed in the text. However, note that it appears to cover only the body text (not the abstract or sub-captions).

Comment 1: Although the paper provides a systematic assessment of the performance of the EmoStyle model throughout the valence-arousal (VA) space, the genuine contributions of the work seem exaggerated. The majority of the paper relies on the execution of pre-existing models (StyleGAN2 for image synthesis, EmoStyle for facial expression manipulation, and Toisoul et al. for VA estimation), with the absence of new methodology or meaningful technical advancement.

Response 1: I understand the confusion. The advancements presented in the article are now highlighted more clearly. See the Abstract from L13 and from L17, the Introduction paragraph at L74, the Main Contribution from L286, and the Conclusion at L384 and from L399.

Comment 2: (1) Creating synthetic facial expression images with EmoStyle
Statement (1) simply repeats what was already demonstrated in the EmoStyle paper.

Response 2: Yes, you are completely right.

Comment 3: (2) Assessing EmoStyle's expression accuracy throughout the VA space
Conclusion (2) is solely based on existing VA predictive models with no methodological novelty.

Response 3: Yes, the models used are not novel. However, the expression accuracy evaluation of EmoStyle is novel, as stated at L12.

Comment 4: (3) Seeing vulnerabilities in certain VA sectors (242°–328.6°)
Point (3) amounts to empirical observations without further analysis or modeling to explain or resolve the identified issues.

Response 4: Thank you for this comment. I added an analysis and explanation of this observation at L288.
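For illustration, here is a minimal sketch of the circumplex geometry behind that sector, assuming the usual convention that valence = m·cos(θ) and arousal = m·sin(θ) (this convention is an assumption of the sketch, not a statement from the paper):

```python
import math

def va_from_angle(angle_deg: float, magnitude: float = 1.0):
    """Map a circumplex angle and magnitude to a (valence, arousal) pair,
    assuming valence = m*cos(theta) and arousal = m*sin(theta)."""
    theta = math.radians(angle_deg)
    return magnitude * math.cos(theta), magnitude * math.sin(theta)

# Bounds of the low-accuracy sector reported in the paper, plus a midpoint.
for angle in (242.0, 285.3, 328.6):
    valence, arousal = va_from_angle(angle)
    print(f"{angle:6.1f} deg -> valence {valence:+.2f}, arousal {arousal:+.2f}")
```

Under this convention, the whole 242°–328.6° sector has negative arousal, i.e. it lies in the low-arousal half of the VA space.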

Comment 5: (4) Providing an open-source wrapper for EmoStyle
Point (4) is limited to packaging existing functionalities in a wrapper, which is helpful but not a technical contribution.

Response 5: The source code is provided to support further research. It includes fixes to EmoStyle's repository, the API wrapper, and the code for the artifact-filtering experiments. I revised the text to make this clearer (see L77 and L308).

Comment 6: In summary, the paper is at the level of a test report of an existing model, not providing new algorithmic intuitions, theory contributions, or new applications. Presenting this as a key contribution is thus potentially misleading. In order for the paper to be in suitable form for publication, the authors could add either (a) more thorough testing involving multiple models instead of the use of a single model, (b) practical suggestions for enhancing the performance of the basic model, or (c) new use cases and applications that go beyond the original intent of EmoStyle.

Response 6: Thank you for your critical review. I added suggestions for solving every issue highlighted in the article. This includes expression accuracy improvement (L349), artifact removal (L363), sunglasses masking (L369), and temporal consistency (L376).

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript makes a relevant and timely contribution to the field of synthetic facial expression generation by utilizing EmoStyle’s latent space editing capabilities and a continuous valence-arousal (VA) representation. The proposed evaluation of EmoStyle’s accuracy in generating facial expressions addresses a gap in the literature and holds potential value for both the research community and practical applications in facial expression recognition (FER). However, several issues related to presentation, methodology, and clarity need to be addressed to enhance the overall quality and impact of the work.

 

From a content perspective, the paper is generally well-structured and easy to follow, with clear sections for the introduction, related work, methodology, results, discussion, and conclusions. Nonetheless, some sections contain inconsistencies and editorial issues that could hinder comprehension. For instance, in line 28, the term “curse of dimensionality” is mentioned, but its explanation is brief and could benefit from additional context for non-specialist readers. Additionally, there is a typographical error in the caption for Figure 1 (line 56), where “original” should be corrected.

 

Regarding figures and tables, there are notable inconsistencies. Figures 2, 3, 4, 5, and 6 are referenced in the text, but their captions vary in detail; some are self-explanatory, while others require cross-referencing with the main text for full understanding. Figures A1 and A2 in the appendix provide useful supplementary information, yet the sub-figures in Figure A2 could be labeled more clearly to indicate which error metric is represented. Table 1 is well-organized, but the values in the “Valence” and “Arousal” columns should be verified for precision and consistency with the corresponding angles; uniform rounding conventions should be applied throughout the table. In Table 2, the mention of highlighting the lowest 25th percentile results is not visually evident in the provided format. The authors should ensure proper formatting so that these highlights are visible in the final publication.

 

Several issues were identified regarding references. For example, references [21] and [39] both refer to the circumplex model of affect but appear redundant; consolidating them would reduce repetition. Additionally, references [12], [13], and [14] cite datasets that have been withdrawn for privacy reasons; the text should clarify their current availability status to avoid confusion. In some instances, such as with references [36] and [41], the citation style for arXiv is inconsistent; uniform formatting should be ensured according to Applied Sciences guidelines.

 

On the methodological side, the choice to use the VA predictor by Toisoul et al. [40] for both training assistance and evaluation introduces a potential bias, as acknowledged by the authors in Section 5.2. While the manuscript notes this limitation, a more robust discussion of alternative evaluation methods, even if not implemented, would strengthen the argument and offer readers a clearer path for replication using independent models. Additionally, the evaluation is limited to a fixed set of VA vectors; although justified within the context of practical applications, the implications for generalization to unseen VA combinations should be elaborated further.

Some minor but important textual issues were identified:

  • Line 120: "fo training" should be corrected to "for training."

  • Line 175: "an direction" should be corrected to "a direction."

  • Line 251: The sentence "the input emotions are accurately generated expressions are accurate…" is grammatically incorrect and needs to be restructured for clarity.

  • Several figure references (e.g., Figure 4(c) in line 235) appear as text fragments instead of being fully integrated into the sentence.

In the results section, the analysis of VA direction and magnitude errors is informative, but it could benefit from additional statistical summaries, such as standard deviations and confidence intervals, to better quantify variability across samples. Furthermore, while the claim that MagFace filtering is ineffective for artifact removal is well-supported, it would be beneficial to explicitly suggest alternative approaches, such as perceptual loss-based metrics and CLIP-based filtering, in the discussion.

Finally, the conclusions effectively summarize the findings but could more strongly emphasize the practical implications for FER dataset creation, particularly the trade-offs between realism, expression accuracy, and identity preservation. This would enhance the manuscript's applicability to real-world scenarios.

Recommendation: The study presents original contributions but requires major revisions before it can be considered for publication in Applied Sciences. The authors should focus on the following:

  • Correcting typographical and grammatical errors.
  • Standardizing table and figure formatting, ensuring all highlights and captions are clear.
  • Addressing reference redundancies and inconsistencies in style.
  • Expanding the methodological discussion to include alternative evaluation strategies and the implications of the fixed VA set.
  • Strengthening the conclusions to better connect the results to practical applications in FER.

Addressing these revisions will significantly improve the paper’s clarity, methodological rigor, and overall impact.

Comments on the Quality of English Language

The manuscript is generally understandable; however, it contains several grammatical errors, awkward phrasings, and typographical mistakes that affect its clarity and flow. For instance, there are incorrect articles, such as “an direction” instead of “a direction,” missing prepositions, and occasional run-on sentences (e.g., line 251). Additionally, some sentences are repetitive or include redundant phrases, such as “the input emotions are accurately generated expressions are accurate,” which need restructuring. 

Moreover, certain figure captions and table descriptions lack grammatical consistency and could be rephrased for better readability. 

It is recommended to have the manuscript carefully proofread by a native or fluent English speaker or to use a professional editing service. This will help ensure consistency in tense usage, subject–verb agreement, and punctuation, as well as standardize terminology throughout the text.

Author Response

Thank you for your review.

I have attached the "latexdiff" between the previous submission and the current one, which highlights what has changed in the text. However, note that it appears to cover only the body text (not the abstract or sub-captions).

Comments 1: The manuscript makes a relevant and timely contribution to the field of synthetic facial expression generation by utilizing EmoStyle’s latent space editing capabilities and a continuous valence-arousal (VA) representation. The proposed evaluation of EmoStyle’s accuracy in generating facial expressions addresses a gap in the literature and holds potential value for both the research community and practical applications in facial expression recognition (FER). However, several issues related to presentation, methodology, and clarity need to be addressed to enhance the overall quality and impact of the work.

Response 1: Thank you. For instance, the advancements presented in the article are now highlighted more clearly. See the Abstract from L13 and from L17, the Introduction paragraph at L74, the Main Contribution from L286, and the Conclusion at L384 and from L399.

Comments 2: From a content perspective, the paper is generally well-structured and easy to follow, with clear sections for the introduction, related work, methodology, results, discussion, and conclusions. Nonetheless, some sections contain inconsistencies and editorial issues that could hinder comprehension. For instance, in line 28, the term “curse of dimensionality” is mentioned, but its explanation is brief and could benefit from additional context for non-specialist readers. Additionally, there is a typographical error in the caption for Figure 1 (line 56), where “original” should be corrected.

Response 2: The curse of dimensionality now has a bit more context (L30). The typographical error no longer appears.

Comments 3: Regarding figures and tables, there are notable inconsistencies. Figures 2, 3, 4, 5, and 6 are referenced in the text, but their captions vary in detail; some are self-explanatory, while others require cross-referencing with the main text for full understanding. Figures A1 and A2 in the appendix provide useful supplementary information, yet the sub-figures in Figure A2 could be labeled more clearly to indicate which error metric is represented. Table 1 is well-organized, but the values in the “Valence” and “Arousal” columns should be verified for precision and consistency with the corresponding angles; uniform rounding conventions should be applied throughout the table. In Table 2, the mention of highlighting the lowest 25th percentile results is not visually evident in the provided format. The authors should ensure proper formatting so that these highlights are visible in the final publication.

Response 3: All figures and tables are now self-explanatory. I extended Figure A2's sub-captions (and added an extra sub-figure for better clarity). Since some angles in Table 1 had fewer decimal places, I rounded every angle to a common precision. In Table 2, I added a down-arrow marker next to the bold values for better readability.

Comments 4: Several issues were identified regarding references. For example, references [21] and [39] both refer to the circumplex model of affect but appear redundant; consolidating them would reduce repetition. Additionally, references [12], [13], and [14] cite datasets that have been withdrawn for privacy reasons; the text should clarify their current availability status to avoid confusion. In some instances, such as with references [36] and [41], the citation style for arXiv is inconsistent; uniform formatting should be ensured according to Applied Sciences guidelines.

Response 4: I consolidated references at L182 and in Figure 1's caption. At L41, it is now clear that datasets are no longer available. The citation style for arXiv now follows the format: Authors. Title. arXiv DATE, arXiv:XXXX.XXXXX

Comments 5: On the methodological side, the choice to use the VA predictor by Toisoul et al. [40] for both training assistance and evaluation introduces a potential bias, as acknowledged by the authors in Section 5.2. While the manuscript notes this limitation, a more robust discussion of alternative evaluation methods, even if not implemented, would strengthen the argument and offer readers a clearer path for replication using independent models. Additionally, the evaluation is limited to a fixed set of VA vectors; although justified within the context of practical applications, the implications for generalization to unseen VA combinations should be elaborated further.

Response 5: Alternative prediction models are suggested at L318. I explained implications for generalization to unseen VA values at L325.

Comments 6: Some minor but important textual issues were identified:

  • Line 120: "fo training" should be corrected to "for training."

  • Line 175: "an direction" should be corrected to "a direction."

  • Line 251: The sentence "the input emotions are accurately generated expressions are accurate…" is grammatically incorrect and needs to be restructured for clarity.

  • Several figure references (e.g., Figure 4(c) in line 235) appear as text fragments instead of being fully integrated into the sentence.

Response 6: The typos were corrected (L128, L184, and L287). I integrated every figure and table reference into the text, without parentheses (e.g., L263).

Comments 7: In the results section, the analysis of VA direction and magnitude errors is informative, but it could benefit from additional statistical summaries, such as standard deviations and confidence intervals, to better quantify variability across samples. Furthermore, while the claim that MagFace filtering is ineffective for artifact removal is well-supported, it would be beneficial to explicitly suggest alternative approaches, such as perceptual loss-based metrics and CLIP-based filtering, in the discussion.

Response 7: I added standard deviations (SD) to Table 2. I suggested the GIQA quality assessment method as an alternative to MagFace (L334).
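For illustration, a minimal sketch of how such per-sample summaries could be computed (the numbers below are synthetic and are not results from the paper):

```python
import numpy as np
from scipy import stats

def summarize_errors(errors, confidence=0.95):
    """Mean, sample standard deviation, and t-based confidence interval
    for a 1-D array of per-sample errors (e.g. VA direction errors in degrees)."""
    errors = np.asarray(errors, dtype=float)
    mean = errors.mean()
    sd = errors.std(ddof=1)
    half = stats.t.ppf((1 + confidence) / 2, df=errors.size - 1) * sd / np.sqrt(errors.size)
    return mean, sd, (mean - half, mean + half)

# Synthetic example with 100 samples (the evaluation set size mentioned by Reviewer 3).
rng = np.random.default_rng(0)
fake_direction_errors = rng.normal(loc=12.0, scale=5.0, size=100)
print(summarize_errors(fake_direction_errors))
```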

Comments 8: Finally, the conclusions effectively summarize the findings but could more strongly emphasize the practical implications for FER dataset creation, particularly the trade-offs between realism, expression accuracy, and identity preservation. This would enhance the manuscript's applicability to real-world scenarios.

Response 8: I mentioned the importance of reflecting these trade-offs at L393.

Comments 9: 

Recommendation: The study presents original contributions but requires major revisions before it can be considered for publication in Applied Sciences. The authors should focus on the following:

  • Correcting typographical and grammatical errors.
  • Standardizing table and figure formatting, ensuring all highlights and captions are clear.
  • Addressing reference redundancies and inconsistencies in style.
  • Expanding the methodological discussion to include alternative evaluation strategies and the implications of the fixed VA set.
  • Strengthening the conclusions to better connect the results to practical applications in FER.

Addressing these revisions will significantly improve the paper’s clarity, methodological rigor, and overall impact.

Response 9: Thank you very much for your insightful review.

Comments 10: 

The manuscript is generally understandable; however, it contains several grammatical errors, awkward phrasings, and typographical mistakes that affect its clarity and flow. For instance, there are incorrect articles, such as “an direction” instead of “a direction,” missing prepositions, and occasional run-on sentences (e.g., line 251). Additionally, some sentences are repetitive or include redundant phrases, such as “the input emotions are accurately generated expressions are accurate,” which need restructuring. 

Moreover, certain figure captions and table descriptions lack grammatical consistency and could be rephrased for better readability. 

It is recommended to have the manuscript carefully proofread by a native or fluent English speaker or to use a professional editing service. This will help ensure consistency in tense usage, subject–verb agreement, and punctuation, as well as standardize terminology throughout the text.

Response 10: Thank you. I reviewed multiple parts of the text and asked native English speakers to help find language mistakes. Please see the latexdiff file for more information on what has changed.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript entitled “Generating Synthetic Facial Expression Images Using EmoStyle” presents an open-source EmoStyle evaluation framework. EmoStyle is a state-of-the-art model based on StyleGAN2 for generating synthetic face image datasets with adapted expressed emotions. The authors focus on evaluating the accuracy of the generated expressions. The proposed work is relevant, timely and seems promising, as it explores the quality of EmoStyle's image generation. The authors also published the evaluation framework (EmoStyle Wrapper) on GitHub for public use.

Major concerns: 

  1. While the abstract claims that EmoStyle maintains strong identity preservation in the generated images, the evaluation focuses solely on expression accuracy. As noted in Lines 215–216, MagFace is deemed unsuitable for assessing identity preservation, yet no alternative metric is proposed. Since identity preservation is presented as an important quality aspect, this limitation should be more clearly acknowledged, or possible metrics should be discussed. 
  2. Authors declared use of ChatGPT for figure generation, which is commendable and transparent. Perhaps they could further clarify the use – if the tool was used to generate the code for the graphs or for the image examples in Figure 2. 
  3. The evaluation was conducted on 100 images that were manually selected not to include strong visual artifacts. The impact of selection bias should be mentioned in the limitation section. Authors should also consider publishing the dataset with the selection of the initial photos, and their adaptations with EmoStyle and multiple magnitudes to support future benchmarking and comparative evaluation efforts. 
  4. The selection of vector magnitudes (0.0, 0.33, 0.66 and 1.0) used in the evaluation should be justified, ideally with reference to prior work or exploratory analysis. While there is no strict benchmark in the field, common practice would involve validating the selection of such values empirically. 
  5. On page 8 (L 220), the description of the results in Figure 4 is general and should be extended to further capture the essence of all three subfigures (a-c). The contribution and point of Figure 4c are not discussed at the moment. 
  6. Additionally, a discussion of VA errors in figure A2 could be extended (L220) to include the additional interpretation about acceptable, expected and actual errors to give more context to the reader. 
  7. In the conclusion section, it would be helpful to again list the emotions with low accuracy (L307), that lie between the listed angles. 

Minor concerns: 

  1. The IEEE citation style is not used consistently throughout the manuscript. Citations using only the author's last name appear multiple times (L167, 172, 197, etc.) and should be standardized. 
  2. In the abbreviation section, AU (Action unit) should be added, as it is used in the manuscript. 

 

This manuscript offers a valuable contribution by critically assessing an existing facial expression synthesis model, EmoStyle. The open-source release of the evaluation framework supports transparency and reproducibility. Nonetheless, a more comprehensive evaluation of identity preservation (or a comment on its exclusion) and a clearer interpretation of the results would strengthen the manuscript's overall impact. 

Author Response

Thank you for your review.

I have attached the "latexdiff" between the previous submission and the current one, which highlights what has changed in the text. However, note that it appears to cover only the body text (not the abstract or sub-captions).

Comments 1:

  • While the abstract claims that EmoStyle maintains strong identity preservation in the generated images, the evaluation focuses solely on expression accuracy. As noted in Lines 215–216, MagFace is deemed unsuitable for assessing identity preservation, yet no alternative metric is proposed. Since identity preservation is presented as an important quality aspect, this limitation should be more clearly acknowledged, or possible metrics should be discussed. 

Response 1: Artifact removal is indeed not related to identity preservation, only image quality. I suggested an alternative quality assessment method called GIQA (L334).

Comments 2: 

  • Authors declared use of ChatGPT for figure generation, which is commendable and transparent. Perhaps they could further clarify the use – if the tool was used to generate the code for the graphs or for the image examples in Figure 2. 

Response 2: I improved the statement at L418.

Comments 3:

  • The evaluation was conducted on 100 images that were manually selected not to include strong visual artifacts. The impact of selection bias should be mentioned in the limitation section. Authors should also consider publishing the dataset with the selection of the initial photos, and their adaptations with EmoStyle and multiple magnitudes to support future benchmarking and comparative evaluation efforts. 

Response 3: Selection bias is now discussed at L341. Since the source code is available and allows the images to be generated easily, I preferred not to publish large amounts of additional data on the cloud. However, if you still think it would be better, I will try to publish the datasets.

Comments 4: 

  • The selection of vector magnitudes (0.0, 0.33, 0.66 and 1.0) used in the evaluation should be justified, ideally with reference to prior work or exploratory analysis. While there is no strict benchmark in the field, common practice would involve validating the selection of such values empirically. 

Response 4: I added an explanation about the choice of these magnitude values at L203. There is also an additional Figure A3.
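For illustration, a minimal sketch of how such a fixed grid of target VA vectors can be built by combining emotion angles with the four tested magnitudes (the angle list below is a placeholder, not the actual set from Table 1 of the paper):

```python
import math

ANGLES_DEG = [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]  # placeholder angles
MAGNITUDES = [0.0, 0.33, 0.66, 1.0]  # magnitudes evaluated in the paper

# Each target is a scaled point on the circumplex circle; magnitude 0.0
# collapses every angle onto the neutral point (0, 0).
va_targets = [
    (round(m * math.cos(math.radians(a)), 3), round(m * math.sin(math.radians(a)), 3))
    for a in ANGLES_DEG
    for m in MAGNITUDES
]
print(f"{len(va_targets)} target VA vectors")
```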

Comments 5: 

  • On page 8 (L 220), the description of the results in Figure 4 is general and should be extended to further capture the essence of all three subfigures (a-c). The contribution and point of Figure 4c are not discussed at the moment. 

Response 5: Figure 4's caption was extended.

Comments 6: 

  • Additionally, a discussion of VA errors in figure A2 could be extended (L220) to include the additional interpretation about acceptable, expected and actual errors to give more context to the reader. 

Response 6: Figure A2's caption was extended, and a new sub-figure was added.

Comments 7: 

  • In the conclusion section, it would be helpful to again list the emotions with low accuracy (L307), that lie between the listed angles. 

Response 7: I listed the emotions in the Conclusion (L391).

Comments 8: 

  1. The IEEE citation style is not used consistently throughout the manuscript. Citations using only the author's last name appear multiple times (L167, 172, 197, etc.) and should be standardized. 
  2. In the abbreviation section, AU (Action unit) should be added, as it is used in the manuscript. 

Response 8: I removed the citations that used author names and added the AU abbreviation. Thank you.

Comments 9: 

This manuscript offers a valuable contribution by critically assessing an existing facial expression synthesis model EmoStyle. The open-source release of the evaluation framework supports transparency and reproducibility. Nonetheless, a more comprehensive evaluation (or comment of its exclusion) regarding identity preservation, and clearer interpretation of the results would strengthen the manuscript’s overall impact. 

Response 9: The advancements are now clearer. See the Abstract from L13 and from L17, the Introduction paragraph at L74, the Main Contribution from L286, and the Conclusion at L384 and from L399.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I understand the novelty of the paper mentioned by the author. I believe the author has sufficiently explained it in the response. However, I still have doubts about whether the novelty of the research is sufficient. 

Author Response

Thank you for your honest review. As usual, I attached the "diff" PDF to facilitate the next review. Note that it may miss some edits, such as in the abstract.

Comments: I understand the novelty of the paper mentioned by the author. I believe the author has sufficiently explained it in the response. However, I still have doubts about whether the novelty of the research is sufficient. 

Response: Thank you for these comments.

First, to the best of our knowledge, no prior work has conducted an accuracy evaluation across the VA space. This includes EmoStyle itself (L11), which does not assess accuracy at all, as well as previous studies in the field, which primarily focused on perceived quality, diversity, or identity preservation (as explained at L14 and L77). The closest work we could find is a classification accuracy evaluation on 7 different emotions [28]. We therefore added more explanation about the importance of the presented novelty. For the same purpose, the additional evaluation across the VA space that resulted in Figure A2 is now further explained (L264, L310, L354, and L418).

Secondly, the implications of the open-source toolkit are further described: it is not only an API wrapper, but also includes fixes to the original EmoStyle repository and experiment scripts (notably on artifact removal), thereby enabling reproducibility and facilitating future research on the topic. This is mentioned at L21, L93, L324, L335, and L432.

The other novelties are the potential avenues for improvement mentioned in the conclusion (from L434) and further detailed in the future work section. Practical recommendations are provided, such as the implementation of an adapter module (L382), the introduction of the GIQA filtering method (L396), the use of masking or override strategies to fix accessory vanishing (L401 and L405), and the adoption of StyleGAN3 for temporal consistency (L411). Beyond practical recommendations, explanations of the findings are also provided (e.g., L287 explains the possible reason behind the lack of accuracy in specific VA regions).

Finally, the abstract (L4), introduction (L27), main contributions (L306), and conclusions (L414) were improved so that the importance of the novelties is easier to grasp. These edits include the previously cited points, as well as bulleted and numbered lists for more clarity.

In conclusion, the article now clearly proposes multiple "practical suggestions for enhancing the performance of the basic model" (as you requested in your previous comments), as well as a systematic evaluation across the VA space that fills the gap in the accuracy assessment of EmoStyle and in the accuracy assessment methods found in the literature.

In summary, the article clearly contributes with the following:

  1. A systematic accuracy evaluation protocol across the VA space;

  2. New empirical findings on where EmoStyle fails and why;

  3. Actionable recommendations derived from those findings;

  4. A reproducibility-focused open-source toolkit.

Please tell me if you have any further recommendations.

Author Response File: Author Response.pdf
