Efficient Dynamic Emotion Recognition from Facial Expressions Using Statistical Spatio-Temporal Geometric Features
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper introduces a low-cost facial-expression recognizer that converts every pair of facial landmark tracks into simple statistical descriptors, then selects the most informative distances for a linear SVM that runs in real time and tops 94.65% accuracy on front-facing datasets. While the idea is elegant and hardware-friendly, its novelty over prior geometry-only pipelines is modest, and the evaluation is weakened by subject leakage, missing deep-learning baselines, and a sharp drop to 76% on the more varied MMI set. With stricter cross-subject protocols, deeper comparisons, and tests on in-the-wild video, the work could be a solid contribution to the Computer Vision field. Also, there is one typo in the Fig. 5 caption: "Tehe".
Author Response
Comment 1: … While the idea is elegant and hardware-friendly, its novelty over prior geometry-only pipelines is modest, and the evaluation is weakened by subject leakage …
Response 1: Thank you for your valuable comment. You are correct that our pipeline appears relatively simple—this was an intentional design choice aimed at reducing complexity, ensuring ease of implementation, and improving computational efficiency. In previous work, we explored appearance-based descriptors. However, we found them to be highly sensitive to illumination conditions, which motivated our shift toward geometry-based features.
While the architecture may not appear radically novel compared to prior geometry-only pipelines, our main contribution lies in proposing a lightweight and hardware-friendly approach that remains effective without relying on high-performance computing infrastructure. This is particularly suitable for real-time applications and small- to medium-scale datasets, which are common in the domain we target.
Regarding the evaluation protocol: we strictly ensured that each video sequence appears exclusively in either the training or the test set, following standard practices in the literature. We explicitly avoided splitting frames from the same sequence across both sets, a practice observed in some prior studies that may introduce data leakage. However, we acknowledge that our current validation protocol is not fully subject-independent. This choice was guided by the class imbalance across subjects and emotions, which we feared might introduce other biases if a purely subject-independent strategy were enforced.
That said, we agree with your point and plan to include a subject-independent evaluation in future work to better assess generalization capabilities.
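For concreteness, the sketch below shows one way such a sequence-exclusive or subject-independent split could be enforced with scikit-learn's group-aware splitters; this is an illustration with placeholder data, not the code used in our experiments.

```python
# Illustrative only: group-aware cross-validation with placeholder data.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((120, 300))                # one statistical feature vector per video sequence
y = rng.integers(0, 6, size=120)          # six basic-emotion labels
subjects = rng.integers(0, 30, size=120)  # subject ID of each sequence

# Grouping by subject yields the stricter, subject-independent protocol;
# grouping by sequence ID would reproduce the sequence-exclusive protocol used here.
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=subjects):
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```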
Comment 2: … missing deep-learning baselines …
Response 2: Thank you for raising this important point. Initially, our manuscript included only a limited selection of deep learning-based methods. However, we have now addressed this shortcoming by incorporating a broader set of recent works. In total, we have added 9 additional deep learning-based studies, bringing the total number of compared methods to 11, as reflected in Tables 4, 7, and 10.
These works were selected based on a commonly accepted standard in the field, which involves reviewing literature from the past five years (2019–2024). Furthermore, we specifically included only those studies that used at least one of the same datasets as ours for both training and evaluation. This criterion ensures a fair and objective comparison with our method, avoiding performance discrepancies due to differences in datasets or evaluation protocols.
We believe that this extended benchmark provides a more comprehensive and meaningful context to assess the performance of our approach relative to recent deep learning baselines.
Comment 3: … a sharp drop to 76% on the more varied MMI set …
Response 3: Thank you for pointing this out. Indeed, our method achieves lower performance on the MMI dataset (75.59%) compared to CK+ (94.64%) and MUG (94.19%). This drop can be attributed to the intrinsic differences among the datasets. Each database was collected under distinct acquisition protocols and environmental conditions, which can significantly impact system performance—especially for models that do not rely on extensive data augmentation or domain adaptation techniques.
To further investigate this issue, we analyzed whether this performance drop is specific to our method or a general trend. As shown in Table 7, several state-of-the-art methods published in international journals between 2019 and 2024 exhibit similar limitations on the MMI dataset. Notably, even the best-performing method in that comparison achieved only 76.34%, which is very close to our result. It is worth emphasizing that our approach does not rely on deep convolutional neural networks, making this level of performance particularly encouraging given its lower computational complexity.
These observations suggest that the MMI dataset remains a challenging benchmark, and the performance drop is not unique to our method.
Comment 4: … With stricter cross-subject protocols, deeper comparisons, and tests on in-the-wild video, the work could be a solid contribution to the Computer Vision field …
Response 4: We sincerely thank you for this encouraging and constructive comment. We fully agree that implementing stricter cross-subject validation protocols and including evaluations on more challenging, in-the-wild video data would further strengthen the robustness and generalizability of our approach.
Regarding the validation protocol, we adopted a strategy in line with several previous works, ensuring that each video sequence appears exclusively in either the training or testing set. However, we acknowledge that a subject-independent protocol would provide a more rigorous assessment of generalization. We are currently extending our work to include such evaluations and plan to report these results in future studies.
As for the use of in-the-wild data, this is indeed a valuable direction. The current study focuses on controlled datasets (CK+, MUG, MMI), which, although widely used in the field, do not fully reflect the variability of real-world conditions. We are exploring the integration of more diverse datasets, including spontaneous and unconstrained recordings, to better evaluate the practical applicability of our method.
We also took your suggestion on deeper comparisons seriously and expanded our experimental benchmarks by adding several recent state-of-the-art methods (see revised Tables 4, 7, and 10), ensuring a more comprehensive evaluation.
Once again, we appreciate your insightful feedback and see it as an important guide for enhancing our current and future work.
Comment 5: … Also, there is one typo in the Fig. 5 caption "Tehe" …
Response 5: Thank you for pointing this out. The typo in the caption of Figure 5 has been corrected in the revised version. We appreciate your careful review.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
You have proposed strategies to solve the problems of Automatic Facial Expression Recognition.
I have some questions as below.
1. In the Introduction part, you should introduce other works and show the current problems instead of pointing out the challenges directly.
2. In the Introduction part, what is your innovation? What is the difference from other similar works?
3. "2. Fundamentals" should be put after the related work.
4. Since there is no Section 3.2, the heading "3.1" is not necessary.
5. In the related work, cite more papers published in 2023~2025.
6. In the related work, show in more depth how your paper differs from other papers. What are the problems of the other papers?
7. In Tables 3, 5, and 7, do the comparison with the latest papers published in 2025.
Thank you.
Author Response
Comment 1: … in the Introduction part, you should introduce other works and show the current problems instead of pointing out the challenges directly …
Response 1: Thank you for your insightful observation. We agree that highlighting existing works and discussing their limitations provides a more contextualized and compelling motivation for our study. In response, we have revised the Introduction to include several representative recent studies, emphasizing both their contributions and their shortcomings. This new structure helps clarify the specific research gaps our work aims to address, and better situates our approach within the current state of the art.
We sincerely appreciate your suggestion, which has helped us improve the clarity and relevance of the introductory section.
Comment 2: … in the Introduction part, what is your innovation? What’s the difference with other similar works …
Response 2: Thank you for your question. We understand the importance of clearly articulating the novelty of our contribution, especially in comparison with prior work. While our initial introduction included a list of three key contributions, we realize that the distinction from related approaches may not have been sufficiently emphasized at that stage.
To clarify: our main innovation lies in proposing a purely statistical and lightweight spatio-temporal representation of facial expressions, built exclusively from facial landmarks, without relying on deep neural networks or appearance-based features. This approach offers a good trade-off between accuracy and computational cost, making it suitable for real-time or resource-constrained applications.
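As a rough, hypothetical sketch of the kind of representation meant here (assuming 68 landmarks per frame and variance, skewness, and kurtosis as the temporal statistics; this is not the exact implementation used in the paper):

```python
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.spatial.distance import pdist
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

def sequence_descriptor(landmarks_seq):
    """landmarks_seq: (n_frames, n_landmarks, 2) array of (x, y) coordinates.
    Returns the variance, skewness, and kurtosis of every pairwise-distance track."""
    dists = np.array([pdist(frame) for frame in landmarks_seq])  # (n_frames, n_pairs)
    return np.concatenate([dists.var(axis=0),
                           skew(dists, axis=0),
                           kurtosis(dists, axis=0)])

# Tree-based feature selection followed by an SVM keeps the whole system lightweight.
model = make_pipeline(
    SelectFromModel(ExtraTreesClassifier(n_estimators=200, random_state=0)),
    SVC(kernel="linear"),
)
# X = np.stack([sequence_descriptor(seq) for seq in sequences]); model.fit(X, y)
```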
Unlike prior methods such as those of Perveen et al., which use kernel-based models, or Ngoc et al., which rely on graph neural networks and require substantial training data and compute power, our method achieves competitive results through simple yet effective statistical descriptors derived from geometric features. We have revised the Introduction accordingly to better highlight this distinction.
We appreciate your comment, which helped us refine the presentation of our contributions.
Comment 3: … “2. Fundamentals” should be put after the related work …
Response 3: Thank you for your suggestion. We understand that, in some cases, presenting the related work earlier can help situate the proposed contribution within the existing literature from the outset. However, in our case, we chose to present the “Fundamentals” section first in order to introduce key concepts and notations that are essential for understanding both the proposed method and the comparative analysis of prior work.
This structure was designed to guide the reader progressively: first by establishing the geometric and temporal descriptors used in AFER systems, then by showing how these have been employed or extended in related studies. We believe this order improves clarity, especially for readers less familiar with geometry-based FER approaches.
That said, we remain open to adapting the structure in future revisions if needed.
Comment 4: … since there is no 3.2, 3.1 is not necessary …
Response 4: Thank you for your observation. We agree with your comment, and the section numbering has been adjusted accordingly in the revised version.
Comment 5: … in the related work, cite more papers published in 2023~2025 …
Response 5: Thank you for your comment. In the revised version, we have added nine additional recent works, most of which are journal articles published between 2020 and 2024. These were included in addition to the nine works already cited in the initial submission, which covered the following distribution: 3 papers from 2019, 2 from 2020, 1 from 2021, 2 from 2022, and 1 from 2023.
The newly added references comprise 3 papers from 2020, 2 from 2022, 2 from 2023, and 2 from 2024, bringing the total to 18 references spanning 2019–2024. This time window aligns with standard practice in literature reviews for this research area.
It is worth noting that we also aimed to include only studies that used the same datasets as ours for training and evaluation (CK+, MUG, or MMI), in order to ensure a fair and objective performance comparison. This constraint limits the pool of very recent publications, but helps maintain consistency in the experimental analysis.
Comment 6: … in the related work, show the difference of your paper and other papers deeply. What’s the problems of other paper …
Response 6: Thank you for your valuable comment. In the revised version of the paper, we expanded the Related Work section to provide a more detailed comparison between existing approaches and our proposed method. Specifically, we highlighted the key limitations of previous works—such as the need for sequence normalization, the loss of temporal information, the use of large feature vectors, or the reliance on deep and computationally intensive architectures. Our approach was designed precisely to address these issues. Thus, the main difference lies in how we overcome the limitations identified in the literature by proposing a lightweight, compact, and efficient system that preserves sequence integrity and remains suitable for real-time applications.
Comment 7: … in table.3,5,7, do the comparison with the latest papers published in 2025 …
Response 7: Thank you for your insightful comment. In response, we have added nine additional recent works to the revised version of the manuscript, most of which are journal articles published between 2020 and 2024. These complement the nine references already cited in the original submission, distributed as follows: 3 from 2019, 2 from 2020, 1 from 2021, 2 from 2022, and 1 from 2023.
The newly added references comprise 3 papers from 2020, 2 from 2022, 2 from 2023, and 2 from 2024, bringing the total to 18 references covering 2019–2024. This time window aligns with standard practice in recent literature reviews within this research domain.
We acknowledge the importance of including very recent papers (e.g., from 2025). However, we encountered a limitation: few of the most recent publications rely on the same benchmark datasets used in our study (CK+, MUG, or MMI). To ensure fair and meaningful comparisons, we selected works that use the same evaluation protocol as ours.
As a result, Tables 3, 5, and 7 have been updated accordingly. We also highlight in the text how our method performs favorably in comparison with these state-of-the-art approaches, particularly in terms of recognition accuracy.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The author proposed an efficient and dynamic automatic facial expression recognition method based on a novel spatiotemporal representation in the manuscript. Although the method is innovative and practical, it still has the following problems:
1. In the experimental part of the manuscript, the generalization ability across data sets is insufficient. The author only tested on three data sets, CK+, MMI, and MUG, and did not verify the performance of the model on cross-cultural or cross-scenario data sets (such as RAF-DB and AFEW).
2. Insufficient interpretability and robustness analysis, lack of feature importance explanation.
Although ExtRa-Trees are used for feature selection, it is not visualized which spatiotemporal features contribute most to classification (such as which facial area movements are most critical to "surprise"); noise robustness is not tested, and the performance of the model under noise interference (such as blur, low resolution) or incomplete expressions (such as turning the head in the middle) is not evaluated.
3. The depth of the comparative experiment is insufficient. The author's research only compares the work before 2020, and does not compare with the latest methods.
4. The generalization ability of the model is limited. The performance on the MMI dataset is poor (75.59%), indicating that it has poor adaptability to complex environments (such as different lighting, backgrounds, and performance styles). It is recommended to use data enhancement, multi-source transfer learning, or cross-database training to improve generalization capabilities.
Author Response
Comment 1: … In the experimental part of the manuscript, the generalization ability across data sets is insufficient. The author only tested on three data sets, CK+, MMI, and MUG, and did not verify the performance of the model on cross-cultural or cross-scenario data sets (such as RAF-DB and AFEW) …
Response 1: Thank you for this insightful comment. In this work, we focused on three well-established benchmark datasets—CK+, MMI, and MUG—that are widely used in the literature for evaluating facial expression recognition methods. These datasets were collected under controlled conditions, which aligns with our target application scenarios where data is captured in consistent and structured environments (e.g., healthcare or training interfaces).
We acknowledge the importance of cross-cultural and cross-scenario generalization. To address this aspect partially, we note that the CK+ dataset includes subjects from diverse age groups and ethnic backgrounds, offering a degree of cross-cultural variability. However, our method is specifically designed for video sequences, as it relies on the generation of spatiotemporal representations from frame sequences.
Based on our research, RAF-DB primarily contains still images rather than temporal sequences, which makes it less compatible with our current approach. Similarly, while AFEW is a valuable dataset for evaluating performance "in the wild," it presents challenges such as uncontrolled lighting, head pose variations, and occlusions—conditions that fall outside the scope of this study. Nevertheless, we recognize the value of such datasets and consider their use an interesting direction for future work to assess the model's robustness in more unconstrained environments.
Comment 2: … Insufficient interpretability and robustness analysis, lack of feature importance explanation. Although ExtRa-Trees are used for feature selection, it is not visualized which spatiotemporal features contribute most to classification (such as which facial area movements are most critical to "surprise"); noise robustness is not tested, and the performance of the model under noise interference (such as blur, low resolution) or incomplete expressions (such as turning the head in the middle) is not evaluated …
Response 2: Thank you for your valuable comment. Regarding feature importance and interpretability, we used the ExtRa-Trees algorithm for feature selection, which ranks features based on information gain. While we did not include a detailed visualization of individual features, we reported the optimal number of selected attributes (see Table 12). Due to the high dimensionality of the spatiotemporal features, providing an intuitive visualization of all relevant features—such as specific facial areas or movement patterns—is not straightforward and was not feasible in the current study.
We acknowledge that interpretability could be improved by incorporating visual analyses (e.g., identifying which regions of the face contribute most to each emotion). This is a relevant and interesting direction that we plan to explore in future work, possibly by using model-agnostic explanation techniques or heatmap-based approaches.
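As an indication of how such an analysis could be carried out, the hypothetical sketch below uses placeholder data, assumes one statistic per pairwise distance, and relies on the pair ordering produced by scipy's pdist:

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import ExtraTreesClassifier

n_landmarks = 68                                    # assumed landmark count
pairs = list(combinations(range(n_landmarks), 2))   # pdist ordering of landmark pairs

rng = np.random.default_rng(0)
X = rng.random((120, len(pairs)))   # placeholder: one statistic per distance track
y = rng.integers(0, 6, size=120)    # placeholder emotion labels

et = ExtraTreesClassifier(n_estimators=300, random_state=0).fit(X, y)
top = np.argsort(et.feature_importances_)[::-1][:20]
for idx in top:
    i, j = pairs[idx]
    print(f"landmarks ({i}, {j}): importance {et.feature_importances_[idx]:.4f}")
```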
As for robustness analysis, our method was evaluated on datasets acquired under controlled conditions, which is consistent with the targeted application domain. Therefore, we did not explicitly test the model's behavior under noise, blur, low resolution, or occlusion. While these conditions are indeed important for in-the-wild scenarios, they fall outside the scope of this work. However, we agree that assessing robustness under such challenging conditions would be a valuable extension in future research.
Comment 3: … The depth of the comparative experiment is insufficient. The author's research only compares the work before 2020, and does not compare with the latest methods …
Response 3: Thank you for your insightful comment. In response, we have added nine additional recent works to the revised version of the manuscript, most of which are journal articles published between 2020 and 2024. These complement the nine references already cited in the original submission, distributed as follows: 3 from 2019, 2 from 2020, 1 from 2021, 2 from 2022, and 1 from 2023.
The newly added references comprise 3 papers from 2020, 2 from 2022, 2 from 2023, and 2 from 2024, bringing the total to 18 references covering 2019–2024. This time window aligns with standard practice in recent literature reviews within this research domain.
We acknowledge the importance of including very recent papers (e.g., from 2025). However, we encountered a limitation: few of the most recent publications rely on the same benchmark datasets used in our study (CK+, MUG, or MMI). To ensure fair and meaningful comparisons, we selected works that use the same evaluation protocol as ours.
As a result, Tables 3, 5, and 7 have been updated accordingly. We also highlight in the text how our method performs favorably in comparison with these state-of-the-art approaches, particularly in terms of recognition accuracy.
Comment 4: … The generalization ability of the model is limited. The performance on the MMI dataset is poor (75.59%), indicating that it has poor adaptability to complex environments (such as different lighting, backgrounds, and performance styles). It is recommended to use data enhancement, multi-source transfer learning, or cross-database training to improve generalization capabilities …
Response 4: Thank you for your constructive feedback and valuable suggestions. We fully acknowledge the importance of improving the model's generalization capability, and we plan to explore strategies such as data augmentation, cross-database training, and multi-source transfer learning in future work.
Regarding the performance on the MMI dataset, we would like to clarify that this dataset was acquired under conditions that differ significantly from those of the CK+ and MUG datasets — particularly in terms of lighting conditions, background variability, and expression performance style. These differences contribute to the lower recognition rate (75.59%), compared to the >90% achieved on the other two datasets.
However, this limitation is consistent across the literature: as shown in Table 7, nearly all baseline methods also experience a performance drop on the MMI dataset. For instance, the best-performing method we compared against achieves only 76.34% accuracy, and it is based on deep neural networks — which we deliberately avoided in our approach to maintain a lightweight and computationally efficient model.
We believe that our method offers a good trade-off between efficiency and performance, and the drop in performance on MMI reflects not only the challenging nature of this dataset, but also the limitations shared by many methods in controlled settings.
Nonetheless, we agree that improving robustness to diverse conditions remains an important goal, and we will incorporate the reviewer’s suggestions in future work to further enhance generalization.
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
This manuscript presents a dynamic AFER (Automatic Facial Expression Recognition) approach based on statistical modeling of spatio-temporal geometric features. The method involves computing all pairwise Euclidean distances from facial landmarks across sequences and summarizing their temporal evolution using variance, skewness, and kurtosis. Feature selection is performed via the ExtRa-Trees algorithm, and classification is conducted using SVM and k-NN. Experiments on CK+, MUG, and MMI datasets demonstrate competitive recognition accuracy.
The proposed approach is technically sound and contributes to the ongoing effort to develop efficient FER systems. However, several critical issues must be addressed before the manuscript can be considered for publication.
- Overreliance on Accuracy as the Sole Evaluation Metric: Throughout the experimental section, the evaluation of classifiers and comparison with state-of-the-art methods rely almost exclusively on accuracy. This is insufficient, particularly in multi-class and potentially imbalanced settings. For instance, in Table 3 (CK+ dataset), multiple methods achieve similar accuracy (e.g., 94.90% vs. 94.64%), making the comparison inconclusive without additional metrics.
- Inadequate Analysis of Poor Performance on the MMI Dataset (Especially the 'Fear' Class): The confusion matrix for the MMI dataset reveals extremely poor classification performance on the "Fear" (FE) class—only 42.86% accuracy (Table 4). However, the authors neither acknowledge nor explain this substantial degradation.
- Lack of Explanation and Interpretation of Confusion Matrices: The confusion matrices provided (Tables 2, 4, 6, and 8) are not properly introduced or discussed. The manuscript fails to define the layout (e.g., whether rows represent actual labels and columns predictions) and does not interpret the implications of common misclassifications.
- Statistical significance: Given the relatively small performance gaps between the proposed method and competing methods, no statistical significance testing is performed to verify whether differences in accuracy are meaningful.
- Classifier Performance Justification: The authors claim that SVM and k-NN are the best-performing classifiers, but deeper insights into why simpler models like DT and MLP perform worse are missing. Given that DT is conceptually similar to the ExtRa-Trees used for feature selection, a clearer explanation is warranted.
- Discussion of Limitations is Minimal: Although Section 8 briefly mentions the frontal-view limitation, other constraints—such as the inability to generalize across age, ethnicity, or occlusions—are not addressed. Additionally, the sensitivity of the model to facial landmark detection errors, which the method heavily relies on, deserves more attention.
- Typo in figure caption: “Accuracy of tehe proposed dynamic AFER using different classifiers” should be corrected.
Author Response
Comment 1: … Overreliance on Accuracy as the Sole Evaluation Metric: Throughout the experimental section, the evaluation of classifiers and comparison with state-of-the-art methods rely almost exclusively on accuracy. This is insufficient, particularly in multi-class and potentially imbalanced settings. For instance, in Table 3 (CK+ dataset), multiple methods achieve similar accuracy (e.g., 94.90% vs. 94.64%), making the comparison inconclusive without additional metrics …
Response 1: Thank you for raising this important point. While accuracy is indeed a widely used metric in facial expression recognition studies, we agree that relying solely on it would not be sufficient for a comprehensive evaluation, particularly in multi-class and potentially imbalanced scenarios.
To address this, our manuscript includes several additional evaluation metrics beyond accuracy:
- For each dataset (CK+, MUG, MMI), we provide detailed confusion matrices (see Tables 2, 5, 8, and 11), which allow for class-wise performance analysis.
- We report standard classification metrics including Precision, Recall, and F1-Score, as well as the number of selected features, which are summarized in Table 12.
- Furthermore, to assess the computational efficiency of our method, we include training and prediction times in Table 13.
It is important to note that in many of the compared state-of-the-art works, only a limited set of metrics (often only accuracy) are reported, making full multi-metric comparisons difficult. For instance, in the CK+ dataset comparison you referenced (94.90% vs. 94.64%), Shahid et al. only provided accuracy and F1-score, whereas we report a broader range of evaluation metrics for a more comprehensive performance assessment.
Additionally, when confusion matrices were provided in prior work, we included comparative visualizations for those as well, to facilitate a deeper analysis of per-class performance.
We hope these additions and clarifications adequately address your concern and strengthen the robustness of our experimental evaluation.
Comment 2: … Inadequate Analysis of Poor Performance on the MMI Dataset (Especially the 'Fear' Class): The confusion matrix for the MMI dataset reveals extremely poor classification performance on the "Fear" (FE) class—only 42.86% accuracy (Table 4). However, the authors neither acknowledge nor explain this substantial degradation …
Response 2: Thank you for pointing out this important observation. In the revised version of the manuscript, we have addressed the issue of poor performance on the "Fear" (FE) class in the MMI dataset more thoroughly.
As shown in Table 5, our method achieves 42.86% accuracy for the FE class, which is notably lower than for other classes. This degradation is indeed acknowledged and can be explained by the fact that fear expressions are often confused with other emotions due to their subtle visual cues and inter-class similarities. Additionally, the conditions under which the MMI dataset was collected — including different participants and recording protocols — differ significantly from those of CK+ or MUG, which likely affects the generalization capacity of classifiers trained on limited data.
To provide context and better analyze this result, we have added three comparative confusion matrix tables (Tables 3, 6 and 9), showing side-by-side comparisons between our method and recent baseline approaches. These tables highlight that poor classification performance on the FE class is a common challenge across studies. For example, some recent works from 2022 and 2023 reported only 31% and 33.5% accuracy respectively for the FE class. The highest reported performance among the compared methods is 54.00%, which remains relatively close to our result (42.86%).
This additional analysis helps to put our findings into perspective and shows that, despite the challenges, our method achieves competitive performance on this particularly difficult class.
Comment 3: … Lack of Explanation and Interpretation of Confusion Matrices: The confusion matrices provided (Tables 2, 4, 6, and 8) are not properly introduced or discussed. The manuscript fails to define the layout (e.g., whether rows represent actual labels and columns predictions) and does not interpret the implications of common misclassifications …
Response 3: Thank you for your observation regarding the confusion matrices. In the revised version of the manuscript, we have explicitly clarified the layout of the confusion matrices—specifying that rows represent the true (actual) labels and columns represent the predicted labels. We have also added labels and axis titles to each matrix table to ensure clarity for the reader.
In addition, we have significantly expanded the discussion and interpretation of the confusion matrices (Tables 2, 4, 6, and 8) to better highlight recurring misclassification patterns and explain their potential causes. For example, we now detail how certain expressions such as fear and surprise may be confused due to overlapping facial features, particularly in the MMI dataset. These insights provide a more nuanced understanding of our model’s strengths and limitations across different classes and datasets.
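For the reader's convenience, a minimal example of the layout convention adopted (rows are true labels, columns are predictions) is given below; the class codes and predictions are invented purely for illustration:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

emotions = ["AN", "DI", "FE", "HA", "SA", "SU"]          # hypothetical class codes
y_true = ["FE", "FE", "SU", "HA", "FE", "SA"]
y_pred = ["SU", "FE", "SU", "HA", "SU", "SA"]

cm = confusion_matrix(y_true, y_pred, labels=emotions)   # rows: true, columns: predicted
print(pd.DataFrame(cm,
                   index=[f"true {e}" for e in emotions],
                   columns=[f"pred {e}" for e in emotions]))
```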
We appreciate your helpful comment, which allowed us to improve the clarity and depth of our evaluation section.
Comment 4: … Statistical significance: Given the relatively small performance gaps between the proposed method and competing methods, no statistical significance testing is performed to verify whether differences in accuracy are meaningful …
Response 4: Thank you for your valuable comment regarding statistical significance. In this study, we followed the standard practice commonly observed in the field of facial expression recognition, as reflected in the 18 baseline works we cited. The vast majority of these references do not include statistical significance testing and instead rely on comparative performance metrics such as accuracy, F1-score, precision, recall, and confusion matrices to demonstrate improvements.
That said, we fully acknowledge the added value of statistical analysis in confirming whether observed differences are indeed meaningful, especially when performance margins are narrow. While such testing could enhance the robustness of our conclusions, it is often constrained by the lack of access to the raw experimental results of prior methods, which are required to perform meaningful significance comparisons.
Nonetheless, our method outperforms recent state-of-the-art approaches on multiple datasets and across multiple evaluation metrics, as detailed in the tables of the “Results & Discussion” section. These consistent improvements across benchmarks lend credibility to the strength of our approach. We have also added a note in the manuscript acknowledging the absence of statistical testing and suggesting it as a valuable enhancement for future work.
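For future reference, the kind of test that becomes feasible once per-sample predictions of two methods on the same test set are available is illustrated below (McNemar's test on the 2x2 agreement table; the counts are invented purely for illustration):

```python
from statsmodels.stats.contingency_tables import mcnemar

# Agreement table between two classifiers on the same test samples:
# [[both correct, ours correct / baseline wrong],
#  [ours wrong / baseline correct, both wrong]]
table = [[410, 22],
         [15, 53]]
result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
```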
We sincerely appreciate your suggestion and will consider integrating statistical analysis in follow-up studies where comparative data is fully accessible.
Comment 5: … Classifier Performance Justification: The authors claim that SVM and k-NN are the best-performing classifiers, but deeper insights into why simpler models like DT and MLP perform worse are missing. Given that DT is conceptually similar to the ExtRa-Trees used for feature selection, a clearer explanation is warranted …
Response 5: Thank you for this valuable comment. In the revised version of the manuscript, we have added a more detailed justification of the classifier performance differences.
The classification results were obtained empirically by testing all classifiers under the same conditions and with optimized hyperparameters. We observed that SVM and k-NN consistently achieved the best performance across datasets. One explanation is that SVM handles high-dimensional and non-linear data well thanks to its use of kernel functions, which makes it particularly effective when the number of selected features is high. Similarly, k-NN, when properly tuned (e.g., using optimal values of k and appropriate distance metrics), can perform very well, especially in settings where the feature space captures discriminative patterns effectively.
In contrast, Decision Trees (DT) and Multilayer Perceptrons (MLP) are more sensitive to feature space complexity and can suffer from overfitting or convergence issues, particularly in cases with limited training samples or imbalanced classes, as is often the case in facial expression datasets. The MLP, for instance, requires careful architecture tuning and sufficient data to generalize well, which can be challenging under our experimental setup.
Regarding your observation on ExtRa-Trees (used for feature selection) and its conceptual similarity to DT: indeed, ExtRa-Trees is an ensemble method that aggregates multiple randomized decision trees, which typically improves robustness and generalization compared to a single DT. This ensemble nature explains why it performs better in feature selection tasks, even though we did not use it for classification. We agree that exploring ExtRa-Trees or similar ensembles (e.g., Random Forests) as classifiers could be a promising direction for future work.
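A minimal sketch of the kind of like-for-like comparison described above (same features and folds for each classifier) is shown below; the hyperparameters and the synthetic data are placeholders, not the settings used in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=60, n_informative=20,
                           n_classes=6, random_state=0)

classifiers = {
    "SVM":  SVC(kernel="rbf", C=10),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "DT":   DecisionTreeClassifier(random_state=0),
    "MLP":  MLPClassifier(hidden_layer_sizes=(128,), max_iter=1000, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```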
We thank you again for this suggestion, which has helped us clarify an important aspect of our methodological choices.
Comment 6: … Discussion of Limitations is Minimal: Although Section 8 briefly mentions the frontal-view limitation, other constraints—such as the inability to generalize across age, ethnicity, or occlusions—are not addressed. Additionally, the sensitivity of the model to facial landmark detection errors, which the method heavily relies on, deserves more attention …
Response 6: Thank you for this insightful observation. We agree that the discussion of limitations can be further elaborated, and we have accordingly expanded Section 8 in the revised manuscript.
In our study, we focused primarily on frontal facial expressions, as the datasets used (CK+, MUG, and MMI) were collected in controlled environments and under frontal-view conditions. This choice aligns with the specific applications we are targeting, where user positioning can be reasonably constrained (e.g., desktop-based systems or clinical settings).
Regarding the generalization across age, ethnicity, and gender, we acknowledge the importance of these factors. While our work does not explicitly address demographic variability, we note that the CK+ dataset, in particular, includes participants with a diverse range of ages and ethnic backgrounds, which provides some degree of variability in training. However, we recognize that further evaluation on larger and more diverse datasets is necessary to robustly assess cross-demographic generalization.
As for facial occlusions, we agree that this remains a significant challenge in real-world applications. However, it falls beyond the scope of the current work, which did not include occluded samples. We consider occlusion-robust facial expression recognition a valuable direction for future research.
Finally, with respect to the sensitivity of our approach to facial landmark detection errors, we used a fast and lightweight method for landmark extraction to maintain low computational overhead. Nonetheless, we acknowledge that this step introduces a dependency, and more robust alternatives—such as deep-learning-based detectors (e.g., MediaPipe by Google)—could enhance performance, especially in the presence of variability in head pose or lighting conditions. We have added this consideration to the revised limitations section.
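Purely as an illustration of the alternative mentioned above, a sketch of extracting landmarks with MediaPipe Face Mesh follows (this detector was not used in the submitted system, and the helper name is hypothetical):

```python
import cv2
import mediapipe as mp

def extract_landmarks(frame_bgr):
    """Return a list of (x, y) pixel coordinates for one detected face, or None."""
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1) as face_mesh:
        results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if not results.multi_face_landmarks:
            return None
        h, w = frame_bgr.shape[:2]
        return [(lm.x * w, lm.y * h)
                for lm in results.multi_face_landmarks[0].landmark]
```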
Once again, we thank you for this helpful suggestion, which led us to clarify and expand the discussion of the model's constraints.
Comment 7: … Typo in figure caption: “Accuracy of tehe proposed dynamic AFER using different classifiers” should be corrected …
Response 7: Thank you for pointing this out. The typo in the caption of Figure 5 has been corrected in the revised version. We appreciate your careful review.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
1. For the reply to my question 1, you should show the current problems (the latest problems, from papers published in 2024-2025).
2. I found that your proposed method does not perform better in some cases in Tables 4, 6, 7, and 9, so I do not think your method is effective.
Thank you.
Author Response
Comment 1: … For the reply to my question 1, you should show the current problems (the latest problems, paper published in 2024-2025 …
Response 1: Thank you for your valuable comment. We have substantially revised the “Limitations & Challenges” section to reflect the most recent issues and trends reported in the dynamic AFER literature.
To support our analysis, we have incorporated eight new references from high-quality journals published in 2024 and 2025 (e.g., Pattern Recognition, Information Fusion, Neural Computing & Applications, Signal, Image and Video Processing). These include:
- Poster++ (Mao et al., 2025), highlighting the trade-off between simplicity and robustness;
- MetaFormer approaches (Khelifa et al., 2025), which target complex compound emotions but remain sensitive to noise and data scarcity;
- RS-Xception (Liao et al., 2024) and MVT-CEAM (Wang et al., 2024), proposing lightweight networks that still struggle with generalization in “in-the-wild” scenarios;
- Recent surveys such as Kopalidis et al. (2024), which consolidate open challenges across datasets, models, and modalities.
These works point to persistent and emerging problems such as:
(1) overfitting on small and imbalanced datasets;
(2) poor recognition of subtle or compound expressions;
(3) inefficiency of large or deep models for real-time use;
(4) difficulty in preserving authentic sequence dynamics;
(5) limited generalization across datasets and deployment conditions.
We have accordingly updated our Essential Requirements and refined our discussion to emphasize how our proposed method addresses these current issues — through a lightweight, geometry-based representation that maintains sequence integrity, minimizes complexity, and performs robustly across benchmarks.
We hope these modifications meet the reviewer’s expectations and clarify the relevance and novelty of our contribution within the latest research landscape.
Comment 2: … I found that your proposed method does not perform better in some case of table 4,6,7,9, so I do not think your method is effective …
Response 2: Thank you for this important observation. We respectfully acknowledge that our method does not always outperform all deep learning-based approaches in terms of absolute accuracy, particularly in Tables 4 and 7. However, this accuracy gap is minimal in most cases (e.g., 94.64% vs. 95.10% on CK+ in Table 4; 75.59% vs. 76.34% on MMI in Table 7), and remains within 0.5–1% of the best-performing methods, which typically rely on computationally intensive architectures such as 3D-CNNs or ConvLSTMs.
We would like to emphasize that our goal is not to surpass all methods in accuracy, but to propose a solution that balances recognition performance with key practical constraints such as:
- Low computational complexity (no deep CNNs or recurrent layers),
- No need for GPU inference or large-scale training datasets,
- Compact spatio-temporal representation based only on facial landmarks,
- Suitability for real-time or low-resource deployment (e.g., embedded systems, mobile devices).
These aspects are particularly important for real-time FER applications in resource-constrained environments, where many high-accuracy deep models cannot be deployed in practice.
We have clarified this point further in the first revised manuscript (see Section "Results & Discussion") and now explicitly discuss the trade-offs between accuracy, complexity, and runtime feasibility. We hope this addresses the reviewer’s concern and better highlights the practical value and robustness of our approach despite the marginal drop in accuracy in some cases.
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have addressed the key concerns raised in the first review, and the manuscript has been substantially improved. The responses are clear, and the additions—such as expanded evaluation metrics and improved result interpretation—are appropriate. While statistical significance testing is not included, the authors have reasonably explained this and acknowledged it as a limitation. A brief clarification in the final version would be sufficient. I recommend acceptance with minor revision.
Author Response
Comment 1: … The authors have addressed the key concerns raised in the first review, and the manuscript has been substantially improved. The responses are clear, and the additions—such as expanded evaluation metrics and improved result interpretation—are appropriate …
Response 1: We sincerely thank the reviewer for their positive feedback and appreciation of our efforts to improve the manuscript. We are grateful for the constructive comments provided during the review process, which helped us strengthen both the experimental analysis and the clarity of our presentation.
Comment 2: … While statistical significance testing is not included, the authors have reasonably explained this and acknowledged it as a limitation. A brief clarification in the final version would be sufficient. I recommend acceptance with minor revision …
Response 2: We thank the reviewer for their insightful comment and positive recommendation. We would like to clarify that, since the comparative results reported for other methods are taken from published studies without access to their individual run data or evaluation splits, it is not possible to perform formal statistical significance testing on the comparisons.
Therefore, our evaluation reports single average accuracy values per dataset and method. We acknowledge this limitation and have added a brief clarification in the manuscript (see Conclusion & Future Work) to explicitly state this point.
Author Response File:
Author Response.pdf
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for your modifications.
For my comment 2, although it may sound acceptable, I do not think the gap is small. Most of the results are not better than those of other papers.