Peer-Review Record

Comparing CNN and ViT for Open-Set Face Recognition†

Electronics 2025, 14(19), 3840; https://doi.org/10.3390/electronics14193840
by Ander Galván *, Mariví Higuero, Ane Sanz, Asier Atutxa, Eduardo Jacob and Mario Saavedra
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 9 July 2025 / Revised: 23 September 2025 / Accepted: 25 September 2025 / Published: 27 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. Is the reference citation in Figure 1 of the article written incorrectly? Please check.
  2. The literature summarized and cited in Table 1 needs to be supplemented with at least 5 relevant studies published in the past three years.
  3. The authors mention that the ViT model has better recognition and detection performance than CNN models when dealing with large-scale training data. However, in practice, not all studies can obtain large-scale training data, so the advantages of CNNs remain widely relevant. Therefore, the authors need to provide a more comprehensive and objective analysis comparing the application value of the ViT and CNN models.
  4. The references cited in the author's article need to be supplemented with research from the past three years to support the conclusions analyzed in the article.
  5. Please show the loss, precision (P), and recall (R) curves of each model during the training phase to objectively evaluate the models' training performance.
  6. The article lacks a visual display of the comparative analysis of different models on various facial recognition effects. Please provide recognition effect diagrams for each model and compare and analyze them.

Author Response

Comment 1: Is the reference citation in Figure 1 of the article written incorrectly? Please check.

Response 1: The citation in Figure 1 was indeed incorrect. The error has now been corrected.

Comment 2: The literature summarized and cited in Table 1 needs to be supplemented with at least 5 relevant studies published in the past three years.

Response 2: The first paragraph of Section 4 (Features of the CNN and ViT Models) has been modified. In addition, several recent studies published within the past three years have been incorporated to strengthen the discussion of the models presented in Table 1, as suggested in Comment 2.

Comment 3: The authors mention that the ViT model has better recognition and detection performance than CNN models when dealing with large-scale training data. However, in practice, not all studies can obtain large-scale training data, so the advantages of CNNs remain widely relevant. Therefore, the authors need to provide a more comprehensive and objective analysis comparing the application value of the ViT and CNN models.

Response 3: Section 6 (Results and Discussion) has been reorganized into different subsections for improved clarity. In Subsection 6.4 (Open-Set Recognition Performance), an additional paragraph has been included to address Comment 3: “Nevertheless, it is important to note that CNNs remain highly competitive and widely used, especially in contexts...”. Likewise, in Section 7 (Conclusions), the following sentence has been incorporated into the penultimate paragraph: “Nonetheless, it is important to mention that CNNs remain highly...”.

Comment 4: The references cited in the author's article need to be supplemented with research from the past three years to support the conclusions analyzed in the article.

Response 4: Section 3 (Related Work) has been slightly modified, and additional studies published within the past three years have been incorporated to support the conclusions. Furthermore, research on hybrid models has also been included, as suggested by another reviewer, to enrich the context.

Comment 5: Please show the loss, precision (P), and recall (R) curves of each model during the training phase to objectively evaluate the models' training performance.

Response 5: Subsection 6.2 (Fine-Tuning Performance) within Section 6 (Results and Discussion) has been added to present the requested curves (loss, precision and recall) for each model during the fine-tuning phase, thereby addressing Comment 5.

Comment 6: The article lacks a visual display of the comparative analysis of different models on various facial recognition effects. Please provide recognition effect diagrams for each model and compare and analyze them.

Response 6: In Section 6 (Results and Discussion), Subsection 6.4 (Open-Set Recognition Performance), the paragraph “To enhance our understanding…” has been added, together with a figure, to provide a clearer visual representation and illustrative examples of how each model classifies known and unknown individuals. Additionally, a new figure has been included to summarize the overall performance comparison between models, as referenced in the paragraph: “The experimental evaluation carried out highlights…”.

Reviewer 2 Report

Comments and Suggestions for Authors
  1. While the results clearly show that ViT-B-16 outperforms the CNN models in both CSR and OSR scenarios, there is no mention of statistical tests (e.g., confidence intervals or p-values) to confirm whether the performance differences are statistically significant. This limits the reliability of the claimed superiority of ViT in OSR settings.
  2. The authors are encouraged to include and discuss recent work (2025) such as the following, which may provide valuable insights on transfer learning and facial expression recognition:

    Kumar, R.; Corvisieri, G.; Fici, T.F.; Hussain, S.I.; Tegolo, D.; Valenti, C. Transfer Learning for Facial Expression Recognition. Information 2025, 16, 320. https://doi.org/10.3390/info16040320

  3. Although the paper uses VGGFace2 and CASIA-WebFace datasets, there is no discussion on the demographic diversity of these datasets. A brief note on dataset bias and its potential influence on OSR performance would strengthen the paper.

  4. The abstract mentions a GitHub repository; the full link must also be given in the Data Availability section.

  5. The paper applies OpenMax for OSR handling but does not provide an ablation study to assess how much OpenMax contributes compared to vanilla SoftMax thresholds. This would help isolate the effect of the OpenMax algorithm.

Author Response

Comment 1: While the results clearly show that ViT-B-16 outperforms the CNN models in both CSR and OSR scenarios, there is no mention of statistical tests (e.g., confidence intervals or p-values) to confirm whether the performance differences are statistically significant. This limits the reliability of the claimed superiority of ViT in OSR settings.

Response 1: The tests (fine-tuning and evaluation in both CSR and OSR scenarios) were repeated 10 times with different seeds. Section 6 (Results and Discussion) now reports the mean, standard deviation, and 95% confidence intervals, reflecting variability across runs and providing a more reliable statistical assessment of the performance differences between ViT and CNN models.
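As a minimal illustration of how such an interval can be computed (the per-seed accuracies below are placeholders, not the paper's actual results), a Student's t interval over 10 seeded runs looks like this:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed accuracies from the 10 repeated runs
# (placeholder values, not the paper's actual results).
runs = np.array([0.912, 0.905, 0.918, 0.909, 0.915,
                 0.908, 0.911, 0.917, 0.906, 0.913])

mean = runs.mean()
std = runs.std(ddof=1)          # sample standard deviation
sem = std / np.sqrt(len(runs))  # standard error of the mean

# 95% confidence interval using Student's t with n - 1 = 9 degrees of freedom
lo, hi = stats.t.interval(0.95, len(runs) - 1, loc=mean, scale=sem)
print(f"mean={mean:.4f}  std={std:.4f}  95% CI=[{lo:.4f}, {hi:.4f}]")
```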

Comment 2: The authors are encouraged to include and discuss recent work (2025) such as the following, which may provide valuable insights on transfer learning and facial expression recognition: Kumar, R.; Corvisieri, G.; Fici, T.F.; Hussain, S.I.; Tegolo, D.; Valenti, C. Transfer Learning for Facial Expression Recognition. Information 2025, 16, 320. https://doi.org/10.3390/info16040320

Response 2: Section 3 (Related Work) has been updated to include recent studies relevant to this work, as well as research on hybrid models, in line with suggestions from another reviewer to enrich the context. The study mentioned in this comment (Kumar et al., 2025) has not been included, as it is not fully aligned with the focus and scope of the present work.

Comment 3: Although the paper uses VGGFace2 and CASIA-WebFace datasets, there is no discussion on the demographic diversity of these datasets. A brief note on dataset bias and its potential influence on OSR performance would strengthen the paper.

Response 3: A paragraph has been added in Section 7 (Conclusions): “However, it is important to note that while the implemented…”, addressing the potential influence of dataset bias and providing a brief discussion on the demographic diversity of the VGGFace2 and CASIA-WebFace datasets.

Comment 4: The abstract mentions a GitHub repository; the full link must also be given in the Data Availability section.

Response 4: The GitHub link has been added to the Data Availability section as requested.

Comment 5: The paper applies OpenMax for OSR handling but does not provide an ablation study to assess how much OpenMax contributes compared to vanilla SoftMax thresholds. This would help isolate the effect of the OpenMax algorithm.

Response 5: Section 6 (Results and Discussion) has been reorganized into different subsections for improved clarity. Within this section, Subsection 6.5 (Ablation Study: SoftMax Thresholding and OpenMax) has been added to demonstrate that using SoftMax Thresholding instead of OpenMax leads to the same conclusions, confirming that ViT outperforms CNN models. This shows that the results are not dependent on the use of OpenMax.
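For illustration, a minimal sketch of a SoftMax-thresholding baseline of this kind is shown below (the threshold value and logits are illustrative, not the paper's settings; OpenMax additionally recalibrates activations with per-class Weibull models):

```python
import torch
import torch.nn.functional as F

UNKNOWN = -1  # label assigned to rejected (unknown) samples

def softmax_threshold_predict(logits: torch.Tensor, threshold: float = 0.9):
    """Reject a sample as 'unknown' when its top SoftMax probability
    falls below the threshold; otherwise return the argmax class."""
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    pred[conf < threshold] = UNKNOWN
    return pred

# Example: a batch of 3 samples over 5 known classes
logits = torch.tensor([[4.0, 0.1, 0.2, 0.1, 0.0],   # confident -> class 0
                       [1.0, 0.9, 1.1, 0.8, 1.0],   # flat -> rejected
                       [0.2, 5.0, 0.1, 0.3, 0.2]])  # confident -> class 1
print(softmax_threshold_predict(logits))  # tensor([ 0, -1,  1])
```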

Reviewer 3 Report

Comments and Suggestions for Authors

The following are my comments. The authors need to carefully address them:

  1. Although the paper clearly states the selected CNN and ViT models, it lacks sufficient justification for the choice of these specific CNN architectures. Consider providing explicit criteria or a rationale explaining why these CNN models were selected over other available options, ensuring the reader understands their significance in your comparison context.
  2. In the methodology (pre-training phase), several data augmentation techniques were mentioned (page 8). It would enhance readability and methodological transparency if you briefly justify why these specific augmentations (e.g., rotations, grayscale conversion, random cropping) were chosen and how each particularly benefits face recognition tasks in OSR scenarios.
  3. Regarding lines 360 to 361, "To understand how the models handle both known and unknown individuals, a detailed analysis of the False Positive Rate (FPR), True Negative Rate (TNR), True Positive Rate (TPR), and False Negative Rate (FNR) was conducted", the authors need to cite the following paper, since this is not fully described: Automated fault diagnosis detection of air handling units using real operational labelled data and transformer-based methods at 24-hour operation hospital
  4. While the OpenMax parameters are listed clearly (page 10), a more thorough explanation or justification of these parameter values (η = 5, α = 2, ε = 0.9) would strengthen the manuscript. Discussing the sensitivity of the results to these parameters would also improve the manuscript's depth.
  5. The literature review (pages 6-7) presents several relevant studies but primarily focuses on direct comparisons between CNN and ViT. Expanding this section slightly to include more discussion on recent developments or hybrid methods would enrich the context and highlight areas for potential future work.
  6. Although the OSR results (page 11-12) clearly identify ViT as the best-performing model, a deeper error analysis would greatly enhance the results' interpretation. Consider including examples or a visual analysis of specific misclassification cases (e.g., unknown individuals classified as known). This would provide insights into model behavior and help readers understand potential limitations.
  7. The paper highlights model accuracy extensively (pages 11-12) but could benefit from an expanded discussion on the computational complexity and practical considerations (inference speed, resource demands, and deployment challenges) of the evaluated models. This practical perspective would add significant value, particularly for readers considering real-world implementation of these face recognition systems.

Author Response

Comment 1: Although the paper clearly states the selected CNN and ViT models, it lacks sufficient justification for the choice of these specific CNN architectures. Consider providing explicit criteria or a rationale explaining why these CNN models were selected over other available options, ensuring the reader understands their significance in your comparison context.

Response 1: The first paragraph of Section 4 (Features of the CNN and ViT Models) has been modified to include the selection criteria for the chosen CNN architectures, as requested. Additionally, recent studies supporting the use of these models have been incorporated to reinforce their relevance in the comparison context.

Comment 2: In the methodology (pre-training phase), several data augmentation techniques were mentioned (page 8). It would enhance readability and methodological transparency if you briefly justify why these specific augmentations (e.g., rotations, grayscale conversion, random cropping) were chosen and how each particularly benefits face recognition tasks in OSR scenarios.

Response 2: The paragraph in Section 5 (Baseline Conditions for Analytical Comparison) beginning with “Before pre-training the models, the images underwent…” has been modified to explain why the specific data augmentations were chosen and how each particularly benefits face recognition tasks in OSR scenarios.
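As an illustration only (the parameter values below are placeholders, not the exact settings used in the paper), such an augmentation pipeline could be expressed with torchvision as:

```python
from torchvision import transforms

# Illustrative pre-training augmentation pipeline; each transform targets a
# variation that face recognition in OSR scenarios must tolerate.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random cropping: robustness to framing and partial occlusion
    transforms.RandomRotation(degrees=10),                # small rotations: tolerance to head tilt and pose variation
    transforms.RandomGrayscale(p=0.1),                    # grayscale conversion: reduces reliance on color cues
    transforms.ToTensor(),
])
```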

Comment 3: Regarding lines 360 to 361, "To understand how the models handle both known and unknown individuals, a detailed analysis of the False Positive Rate (FPR), True Negative Rate (TNR), True Positive Rate (TPR), and False Negative Rate (FNR) was conducted", the authors need to cite the following paper, since this is not fully described: Automated fault diagnosis detection of air handling units using real operational labelled data and transformer-based methods at 24-hour operation hospital

Response 3: The suggested reference has not been included, since it does not mention FPR, TNR, TPR, or FNR. Therefore, its contribution to the current discussion is unclear.
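For reference, the four rates quoted above follow directly from the binary known-vs-unknown confusion matrix. A minimal sketch, assuming "known" is treated as the positive class (the paper may adopt a different convention):

```python
import numpy as np

def osr_rates(y_true_known: np.ndarray, y_pred_known: np.ndarray) -> dict:
    """Compute FPR, TNR, TPR, FNR from boolean arrays indicating whether
    each sample truly is / is predicted to be a known individual."""
    tp = np.sum(y_true_known & y_pred_known)    # known correctly accepted
    fn = np.sum(y_true_known & ~y_pred_known)   # known wrongly rejected
    tn = np.sum(~y_true_known & ~y_pred_known)  # unknown correctly rejected
    fp = np.sum(~y_true_known & y_pred_known)   # unknown wrongly accepted
    return {"TPR": tp / (tp + fn), "FNR": fn / (tp + fn),
            "TNR": tn / (tn + fp), "FPR": fp / (tn + fp)}
```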

Comment 4: While the OpenMax parameters are listed clearly (page 10), a more thorough explanation or justification of these parameter values (η = 5, α = 2, ε = 0.9) would strengthen the manuscript. Discussing the sensitivity of the results to these parameters would also improve the manuscript's depth.

Response 4: In Subsection 5.3 (Evaluation) of Section 5 (Baseline Conditions for Analytical Comparison), a paragraph beginning with “The selected OpenMax parameters…” has been added to provide justification for these parameter values (η = 5, α = 2, ε = 0.9) and to discuss the sensitivity of the results to these parameters.
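For orientation, the sketch below illustrates the roles these three parameters play in an OpenMax-style pipeline. It is a schematic approximation, not the authors' implementation: the original OpenMax (Bendale and Boult) fits Weibull models with libMR, whereas scipy is used here, and the rank-weighting scheme is simplified.

```python
import numpy as np
from scipy.stats import weibull_min

# eta: Weibull tail size; alpha: number of top classes revised;
# epsilon: rejection threshold on the final probability.
ETA, ALPHA, EPSILON = 5, 2, 0.9

def fit_weibull(dists_to_mav):
    """Fit a Weibull on the ETA largest distances between correctly
    classified training activations and their class mean (MAV)."""
    return weibull_min.fit(np.sort(dists_to_mav)[-ETA:], floc=0)

def openmax_predict(av, mavs, weibulls):
    """av: activation vector of a test sample; mavs[c]: mean activation
    vector of class c; weibulls[c]: fitted Weibull parameters."""
    revised, unknown = av.copy(), 0.0
    for rank, c in enumerate(np.argsort(av)[::-1][:ALPHA]):
        weight = (ALPHA - rank) / ALPHA       # rank-based weight
        dist = np.linalg.norm(av - mavs[c])   # distance to class MAV
        p_out = weight * weibull_min.cdf(dist, *weibulls[c])
        unknown += av[c] * p_out              # mass moved to 'unknown'
        revised[c] = av[c] * (1 - p_out)
    scores = np.append(revised, unknown)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # numerically stable softmax
    # Reject if the 'unknown' class wins or confidence falls below EPSILON.
    return -1 if probs.argmax() == len(av) or probs.max() < EPSILON else int(probs.argmax())
```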

Comment 5: The literature review (pages 6-7) presents several relevant studies but primarily focuses on direct comparisons between CNN and ViT. Expanding this section slightly to include more discussion on recent developments or hybrid methods would enrich the context and highlight areas for potential future work.

Response 5: Although the present work does not focus on hybrid model comparisons, Section 3 (Related Work) has been updated to include studies on facial recognition using hybrid models, enriching the context. Additionally, potential future work involving hybrid models is mentioned in Section 7 (Conclusions).

Comment 6: Although the OSR results (page 11-12) clearly identify ViT as the best-performing model, a deeper error analysis would greatly enhance the results' interpretation. Consider including examples or a visual analysis of specific misclassification cases (e.g., unknown individuals classified as known). This would provide insights into model behavior and help readers understand potential limitations.

Response 6: Section 6 (Results and Discussion) has been reorganized into different subsections for improved clarity. Specifically, in Subsection 6.4 (Open-Set Recognition Performance), an image-based analysis has been included for three known and three unknown individuals, showing the true labels and model predictions. This is presented in the paragraph beginning “To enhance our understanding of the models’ performance” and illustrated in Figures 7 and 8, providing a deeper interpretation of model behavior and potential misclassifications.

Comment 7: The paper highlights model accuracy extensively (pages 11-12) but could benefit from an expanded discussion on the computational complexity and practical considerations (inference speed, resource demands, and deployment challenges) of the evaluated models. This practical perspective would add significant value, particularly for readers considering real-world implementation of these face recognition systems.

Response 7: To address this comment, Section 6 (Results and Discussion) now includes Subsection 6.6 (Computational Cost and Inference Time Analysis), which presents an evaluation of the floating-point operations (FLOPs) and inference time per image for each model. This addition provides practical insights into the computational complexity and deployment considerations of the evaluated face recognition systems.
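One common way to obtain such measurements (assuming a PyTorch model and the fvcore library for FLOP counting; the input size and run counts below are illustrative) is:

```python
import time
import torch
from fvcore.nn import FlopCountAnalysis  # pip install fvcore

def profile(model, device="cuda", n_warmup=10, n_runs=100):
    """Measure FLOPs and mean inference time per image for one model."""
    model = model.eval().to(device)
    x = torch.randn(1, 3, 224, 224, device=device)  # single 224x224 RGB image

    flops = FlopCountAnalysis(model, x).total()  # total FLOPs as counted by fvcore

    with torch.no_grad():
        for _ in range(n_warmup):  # warm-up: CUDA init, kernel caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
        ms = (time.perf_counter() - start) / n_runs * 1e3

    print(f"GFLOPs: {flops / 1e9:.2f}  |  inference: {ms:.2f} ms/image")
```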

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

There are no further comments. I recommend that this paper be published.
