Review Reports
- Katherine Lin Shu 1 and
- Mu-Jiang-Shan Wang 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript presents a highly robust, well-written, and experimentally rigorous investigation addressing critical challenges in Facial Expression Recognition (FER) under unconstrained conditions. The authors' proposal of the Multi-Domain Feature Enhancement and Fusion Transformer (MDFEFT) framework is particularly novel, complementing the global modeling capabilities of the Vision Transformer (ViT) with fine-grained features extracted from complementary frequency, spatial, and channel domains.
The main achievement of this work is demonstrating the proposed MDFEFT model's consistently superior performance over strong baselines, including CNNs, standard Transformers (ViT), and ensemble-based methods, across both controlled (KDEF) and "in-the-wild" (FER2013 and RAF-DB) datasets. Furthermore, the Cross-Domain Feature Enhancement and Fusion (CDFEF) module effectively enhances generalization and robustness against noise by adaptively integrating heterogeneous features. The ablation studies convincingly validate the individual and synergistic contributions of the CDFEF and the multi-branch feature extraction module to discriminative feature representation.
To fully substantiate the model's multiclass classification capability and its reliability under conditions prone to class imbalance, the following additions are necessary:
The authors must calculate and incorporate the Matthews Correlation Coefficient (MCC) into Table 1 (Quantitative comparison) and Table 2 (Statistical stability results).
The authors are required to add the Confusion Matrix for the proposed full model on at least one of the two most challenging datasets: FER2013 or RAF-DB. (A brief sketch of how both requested metrics could be computed is given after these points.)
Please review the references again. For instance, the DOI link for Reference 2 is erroneous.
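For concreteness, a minimal sketch of how the two requested metrics could be obtained is shown below. It assumes scikit-learn and placeholder label arrays y_true and y_pred (hypothetical names, not taken from the manuscript); the authors' own evaluation pipeline may of course differ.

```python
# Minimal sketch (not the authors' code): multiclass MCC and a confusion matrix
# with scikit-learn. y_true / y_pred stand in for the ground-truth and predicted
# emotion labels of one dataset (e.g., RAF-DB with 7 classes).
import numpy as np
from sklearn.metrics import matthews_corrcoef, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, size=100)   # placeholder ground-truth labels
y_pred = rng.integers(0, 7, size=100)   # placeholder model predictions

mcc = matthews_corrcoef(y_true, y_pred)   # multiclass MCC in [-1, 1]
cm = confusion_matrix(y_true, y_pred)     # rows: true classes, columns: predictions
print(f"MCC: {mcc:.3f}")
print(cm)
```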
Author Response
Please find the attached response report.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript proposes a Multi-Domain Feature Enhancement and Fusion Transformer for facial expression recognition (FER). The framework combines a ViT-based global encoder with three complementary branches in the channel, spatial, and frequency domains, and an adaptive Cross-Domain Feature Enhancement and Fusion (CDFEF) module. The model is evaluated on three benchmark datasets (KDEF, FER2013, RAF-DB), with extensive comparisons against CNN-, Transformer-, and ensemble-based baselines, as well as ablation studies, statistical stability analysis, Grad-CAM visualizations, and cross-domain experiments. Overall, the work is technically solid and experimentally thorough. However, some aspects of the experimental design and presentation require attention.
1. Use of validation sets and unclear data splitting
You state that early stopping is based on validation accuracy and that, unless otherwise specified, all quantitative results are obtained from the validation sets. This raises concerns:
- Reporting performance on the same validation sets that are used for model selection (early stopping and hyperparameter tuning) can lead to optimistic bias.
- Many prior FER works report scores on standard test splits, so mixing validation-based and test-based numbers may not be comparable.
Requests:
- Clearly describe, for each dataset, how the data are split into training, validation, and (if used) test sets.
- If possible, provide results on official or commonly used test splits and treat those as your main reported numbers.
- If you must rely only on validation sets, please explicitly state this limitation and explain how you mitigated overfitting to the validation data.
- In addition, to more thoroughly evaluate the proposed model, I strongly recommend reporting class-wise performance for each dataset (e.g., per-class precision, recall, and F1-score, or at least normalized confusion matrices). This would allow readers to verify that the gains are not dominated by a subset of majority classes, to see how well the model handles the more difficult emotion categories, and to more convincingly support the claim that the proposed method improves FER performance across all classes.
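A minimal sketch of the per-class reporting requested above is given below; it assumes scikit-learn and placeholder arrays y_true and y_pred with hypothetical class names, not the authors' actual labels or pipeline.

```python
# Minimal sketch (not from the manuscript): per-class precision/recall/F1 and a
# row-normalized confusion matrix, using placeholder labels and class names.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

classes = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
rng = np.random.default_rng(1)
y_true = rng.integers(0, len(classes), size=200)   # placeholder ground truth
y_pred = rng.integers(0, len(classes), size=200)   # placeholder predictions

# Per-class precision, recall, F1-score, and support.
print(classification_report(y_true, y_pred, labels=list(range(len(classes))),
                            target_names=classes, digits=3))

# Confusion matrix normalized over the true labels (each row sums to 1).
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")
print(np.round(cm_norm, 2))
```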
2. Inconsistent naming of the proposed method
The manuscript uses multiple acronyms for the proposed method (e.g., MDFEF vs. MDFEFT). This is confusing for the reader.
Please choose a single official name and acronym for the method and use it consistently throughout the manuscript (title, abstract, main text, figures, and tables).
3. Presentation and formatting
- Ensure consistent capitalization of section titles (e.g., “Methodology” rather than “methodology”).
- Use consistent notation and numerical format (e.g., always report accuracy either as percentages or as fractions, and keep this consistent across the text and tables).
4. Hyperparameter explanation
- Some hyperparameter choices (e.g., relatively large weight decay values) could be briefly justified. Indicating whether you performed a hyperparameter search would increase transparency.
Author Response
Please find the attached response report.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The comments and recommendations for the authors are provided in the attached file.
Comments for author File: Comments.pdf
Author Response
Please find the attached response report.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Overall, the paper is now clear, technically sound, and suitable for publication in Symmetry. I have no further substantive comments and am satisfied with the current version. A light language polish would be optional but not strictly necessary.
I therefore support the acceptance of this manuscript.