1. Introduction
In recent years, advances in Artificial Intelligence (AI), and especially in the field of Deep Learning (DL), have significantly impacted various sectors. One of the areas where these innovations have been particularly relevant is biometrics, with a prominent focus on face recognition. Built on DL models, current systems can carry out the complex tasks of person identification and verification while maintaining a high level of accuracy even under adverse conditions. Thanks to this high accuracy, face recognition has become a fundamental component in applications such as access control for company premises and airport boarding lounges.
Beyond its influential role in key sectors, face recognition represents a booming technology. According to [
1], its market revenue demonstrates a consistently growing trend, with an estimated Compound Annual Growth Rate (CAGR) of approximately 14% over the forecast horizon. Starting from USD 5 billion in 2022, the market is projected to expand to approximately USD 9 billion by 2026 and exceed USD 13 billion by 2029. As technological maturation progresses, revenue streams are expected to increase further, reaching USD 17 billion by 2031 and USD 19 billion by 2032. This growth trajectory is illustrated in
Figure 1.
This rapid market growth underscores the growing reliance on face recognition systems across diverse sectors. However, despite their widespread adoption, current systems continue to encounter significant technical challenges when deployed in real-world scenarios. In practice, most of these systems are implemented as Closed-Set Recognition (CSR) models and trained on a predefined set of labeled face images corresponding to known identities. Under this paradigm, the models are designed to assign each input face to one of the known identities, operating under the assumption that all faces observed during testing are part of the training set. However, in many real-world scenarios, it is common for the system to encounter people who were not involved during training. This situation, formally known as Open-Set Recognition (OSR) [
2,
3], requires not only the correct identification of known identities but also the ability to detect unknown faces.
Furthermore, face recognition systems have traditionally relied on Convolutional Neural Networks (CNNs). However, due to their inherent limitations in capturing long-range dependencies and global context, recent research has explored alternative architectures with improved capabilities for modeling global representations. Among these, the Vision Transformer (ViT) [
4], introduced in 2020, has garnered significant interest within the computer vision community, as it captures global relationships between different parts of an image rather than focusing solely on local patterns (as CNNs do).
In light of the above, the present study conducts a comparative analysis of CNN and ViT architectures for face recognition in OSR scenarios. The main objective is to compare the performance of both architectures in scenarios that require accurate facial recognition from known identities and the detection of unknown faces with high precision.
The rest of the document is structured as follows:
Section 2 presents an overview of CNN, ViT, and OSR.
Section 3 reviews relevant existing literature in the domain of face recognition.
Section 4 and
Section 5, respectively, describe the models selected for comparison in this study and detail the experimental methodology employed.
Section 6 offers a discussion of the results, along with an in-depth analysis, before finally,
Section 7 summarizes the key conclusions.
2. Background
This section provides an overview of CNNs and ViTs, emphasizing their main features and suitability for biometric face recognition. Furthermore, the concept of OSR is introduced, as is the OpenMax algorithm used in this study to address the associated challenges.
2.1. Convolutional Neural Networks
CNNs are a type of DL architecture commonly utilized in computer vision applications. These networks operate through a sequence of layers that apply filters, also known as kernels, to input images [
5], progressively extracting features ranging from basic elements such as edges, colors, and gradients to more complex, high-level representations that can distinguish specific objects or categories. The initial convolutional layers focus on feature extraction by scanning the image with filters, while subsequent pooling layers reduce the spatial dimensions of the data, emphasizing the most relevant features [
6]. The final steps involve flattening the data and passing it through fully connected layers to perform the classification task. An overview of a typical CNN architecture for face recognition is illustrated in
Figure 2.
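To make this pipeline concrete, the following minimal sketch in PyTorch illustrates the convolution–pooling feature extractor followed by flattening and a fully connected classifier described above. The layer sizes and class count are hypothetical; this is an illustrative toy model, not one of the architectures evaluated in this study.

```python
# Illustrative toy CNN for face classification (hypothetical sizes, not a model from this study).
import torch
import torch.nn as nn

class TinyFaceCNN(nn.Module):
    def __init__(self, num_classes: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # early layers: edges, colors, gradients
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling reduces spatial dimensions
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # deeper layers: higher-level face features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # flatten before fully connected layers
            nn.Linear(64 * 56 * 56, num_classes),         # assumes 224x224 inputs
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = TinyFaceCNN()(torch.randn(1, 3, 224, 224))       # -> shape (1, 100)
```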
Despite their widespread success, CNNs have certain limitations. According to [
7], a primary concern is their lack of interpretability, as these models often operate as black boxes with limited transparency regarding their decision-making processes. Additionally, CNNs are susceptible to domain shift, which can lead to performance deterioration when applied to data that differs from the original training set.
2.2. Vision Transformers
ViTs employ an entirely different methodology for image processing: the input image is divided into smaller patches, which are then flattened and converted into a sequence of tokens through linear projection. These tokens are subsequently processed through multiple transformer encoder layers, which are effective at capturing relationships across different image regions. This is achieved via the multi-head self-attention mechanism [
5].
Figure 3 provides a visual representation of the ViT architecture for face recognition.
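As an illustration of this tokenization step, the sketch below (in PyTorch) splits an image into patches, linearly projects them into tokens, and passes the sequence through a transformer encoder layer. The patch size and embedding dimension follow ViT-B/16-style values for concreteness; this is a simplified illustration rather than the full ViT model.

```python
# Illustrative ViT-style patch tokenization and self-attention encoding (simplified sketch).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                     # one 224x224 RGB face image
patch_size, embed_dim = 16, 768                         # ViT-B/16-style values (assumed)

# Split into non-overlapping 16x16 patches and flatten each one: (1, 196, 3*16*16)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# Linear projection of flattened patches into a sequence of tokens: (1, 196, 768)
tokens = nn.Linear(3 * patch_size * patch_size, embed_dim)(patches)

# Transformer encoder layers relate all patches to each other via multi-head self-attention.
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
encoded = nn.TransformerEncoder(encoder_layer, num_layers=1)(tokens)
print(encoded.shape)                                    # torch.Size([1, 196, 768])
```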
As reported by [
7], ViTs have achieved state-of-the-art performance on several benchmark datasets, demonstrating excellent results on large-scale datasets such as ImageNet-21k [
8] and JFT-300M [
9] and outperforming many CNNs in terms of accuracy. For example, ViT has achieved accuracies of 88.55% on ImageNet and 94.55% on CIFAR-100 [
4].
Nevertheless, it is important to consider some of the limitations associated with ViTs. They generally require larger training datasets compared to CNNs and tend to demand higher computational resources. Moreover, as a relatively newer architecture, there is less established knowledge and fewer best practices available for their implementation compared to CNNs [
7].
2.3. Theoretical Analysis: Convolutional Neural Networks vs. Vision Transformers
Building upon the CNN and ViT architectures discussed in previous sections, the primary distinction between these models lies in their methods of processing spatial information within images. CNNs perform local, hierarchical analysis through convolutional layers, progressively extracting increasingly complex features from small, contiguous regions. Conversely, ViTs adopt a global processing approach by dividing images into patches and treating them as sequences of tokens. This methodology enables ViTs to model long-range dependencies across the image, overcoming the limitations inherent to local CNN operations.
As noted in [
4], this fundamental difference influences their inductive biases. CNNs inherently incorporate biases such as locality, emphasizing the importance of nearby pixels, and translation equivariance, recognizing patterns regardless of their position (see
Figure 4). These biases facilitate assumptions about image structure and enhance training efficiency, particularly when data resources are limited. In contrast, ViTs do not possess these domain-specific biases, providing greater flexibility to learn diverse spatial relationships. However, this lack of inherent biases necessitates larger volumes of training data to effectively learn image structures from scratch.
2.4. Open-Set Recognition and OpenMax
As previously mentioned, image classification models are typically trained within a CSR scenario. This means that the model is trained using images belonging to a predefined set of classes, referred to as known classes. Fundamentally, the training process for classification models relies on the assumption that all classes the model will encounter during testing have been previously observed during training.
However, during testing, images from unknown classes may be encountered in real-world applications. In such situations, the model incorrectly classifies these images as belonging to one of the known classes, since it lacks the ability to recognize images of unknown classes and does not consider the possibility that an image may not belong to any of the classes with which it was trained.
To address this issue, one potential solution involves applying a threshold to the maximum probability produced by the SoftMax layer. The idea is that, when an image of an unknown class is presented, the probabilities assigned to the known classes should be low, so a threshold can help classify such images as unknown. However, this approach has been shown to be insufficient, as discussed in [
10]. Their study demonstrates that even images entirely unrelated to known classes can sometimes produce high probabilities that surpass the threshold, resulting in misclassifications.
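For concreteness, the baseline just described can be sketched as follows; the threshold value and the example activation vector are illustrative only.

```python
# Illustrative SoftMax-thresholding baseline: reject as "unknown" when the top probability is low.
import torch
import torch.nn.functional as F

def softmax_threshold_predict(logits: torch.Tensor, threshold: float = 0.9) -> int:
    """Return the predicted known-class index, or -1 for the unknown class."""
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    return pred.item() if conf.item() >= threshold else -1

logits = torch.tensor([2.0, 0.1, 0.3])       # example activation vector for 3 known classes
print(softmax_threshold_predict(logits))     # accepted or rejected depending on confidence
```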
Consequently, methods such as OpenMax [
10] have been developed to enable the model to classify images of known classes while effectively handling those from unknown classes, thereby enhancing performance in OSR scenarios. OpenMax extends the SoftMax layer with an explicit prediction for the unknown class. Specifically, the scores obtained from the fully connected layer preceding the SoftMax layer, referred to as the Activation Vector (AV), are used to determine whether an image is significantly different from the training data.
The OpenMax algorithm accounts for the likelihood of errors within the recognition system, which is used to assess whether an image belongs to the unknown class. To facilitate this assessment, an adaptation of Meta-Recognition [
11] is used, involving the evaluation of model scores to identify prediction errors and determine whether an image originates from the unknown class. Additionally, previous research on Meta-Recognition [
11] examined the final scores using Extreme Value Theory (EVT), establishing that these scores conform to a Weibull distribution. In this context, such a distribution is fitted for each known class, following the fitting process described in detail below.
For each correctly classified image, obtain the AV.
For each known class, compute the Mean Activation Vector (MAV) by averaging the AVs of all correctly classified training samples belonging to that known class.
Fit a Weibull distribution to the largest distances (the tail size) between the AVs and the MAV of each known class.
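The following sketch illustrates this fitting procedure. It uses SciPy's weibull_min in place of the libMR library commonly used in OpenMax implementations, and the per-class activation vectors (avs) and the tail size value are placeholders introduced for illustration.

```python
# Illustrative per-class Weibull fitting for OpenMax (SciPy used instead of libMR; values are placeholders).
import numpy as np
from scipy.stats import weibull_min

TAIL_SIZE = 20                                            # number of largest distances kept (assumption)

rng = np.random.default_rng(0)
avs = {c: rng.normal(size=(50, 100)) for c in range(100)} # placeholder AVs: 50 samples x 100 classes per class

mavs, weibulls = {}, {}
for c, class_avs in avs.items():
    mav = class_avs.mean(axis=0)                          # Mean Activation Vector (MAV) of the class
    dists = np.linalg.norm(class_avs - mav, axis=1)       # distance of each AV to the MAV
    tail = np.sort(dists)[-TAIL_SIZE:]                    # keep only the largest distances (the tail)
    shape, loc, scale = weibull_min.fit(tail, floc=0)     # fit a Weibull distribution to the tail
    mavs[c], weibulls[c] = mav, (shape, loc, scale)
```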
Once the Weibull distribution has been fitted for each known class, the OpenMax algorithm is employed during the testing phase to enhance the detection of images belonging to unknown individuals. When a new image is presented, its AV is computed, and the subsequent estimation process is carried out as described below.
Sort the known classes in the activation vector in descending order of their activation values.
Initialize the weights to 1 for all known classes.
For the known classes (top classes to revise) with the highest activation values, compute a weight using the distance between the image and the MAV for that known class, along with the corresponding Weibull Cumulative Distribution Function (CDF) probability.
Modify the original AV by applying the weights to obtain a revised AV.
Calculate the activation value for the unknown class by computing a pseudo-activation.
Calculate the new OpenMax probabilities using the revised AV (including the unknown class).
To conclude, OpenMax classifies an image as belonging to the unknown class either when the highest probability corresponds to the unknown class or when the highest probability among the known classes does not exceed a predefined threshold.
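Building on the fitted distributions above, the sketch below outlines the score-revision and rejection steps in simplified form, reusing the mavs and weibulls dictionaries from the previous snippet. The number of top classes to revise (ALPHA) and the threshold are illustrative placeholders, and the weighting formula follows the common formulation of OpenMax rather than the exact implementation used in this study.

```python
# Simplified OpenMax score revision and unknown-class rejection (parameter values are placeholders).
import numpy as np
from scipy.stats import weibull_min

ALPHA, THRESHOLD = 10, 0.9                                # top classes to revise and rejection threshold

def openmax_predict(av: np.ndarray) -> int:
    """Return a known-class index, or -1 for the unknown class."""
    ranked = np.argsort(av)[::-1]                         # classes sorted by activation, descending
    weights = np.ones_like(av)
    for rank, c in enumerate(ranked[:ALPHA]):             # revise only the top-ALPHA known classes
        shape, loc, scale = weibulls[c]
        dist = np.linalg.norm(av - mavs[c])
        wscore = weibull_min.cdf(dist, shape, loc=loc, scale=scale)   # Weibull CDF probability
        weights[c] = 1.0 - wscore * (ALPHA - rank) / ALPHA            # larger distance -> smaller weight
    revised = av * weights                                # revised activation vector
    unknown_act = np.sum(av * (1.0 - weights))            # pseudo-activation of the unknown class
    scores = np.append(revised, unknown_act)
    probs = np.exp(scores - scores.max()); probs /= probs.sum()       # OpenMax probabilities
    if probs.argmax() == len(av) or probs[: len(av)].max() < THRESHOLD:
        return -1                                         # rejected as unknown
    return int(probs[: len(av)].argmax())

pred = openmax_predict(np.random.default_rng(1).normal(size=100))     # -1 if rejected as unknown
```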
3. Related Work
Until five years ago, CNNs dominated the field of computer vision. However, in 2020, the landscape began to change with the introduction of ViTs by the authors of [
4]. This work demonstrated that ViTs can achieve high levels of accuracy in image classification tasks, provided that proper training is carried out. In particular, when ViTs are trained on small- to medium-sized datasets, such as ImageNet with 1000 classes and 1.3 M images, without applying significant regularization, they fail to outperform CNNs due to the lack of certain inductive biases, such as locality and translational equivariance. However, when ViTs are pre-trained with large-sized datasets, such as ImageNet-21k (a superset of ImageNet with 21,000 classes and 14 M images) or JFT-300M, before being fine-tuned to smaller datasets, they can match or even surpass the results of the CNNs. In summary, large-scale training tends to outperform reliance solely on inductive biases.
In order to justify the superior performance of ViTs over CNNs, a comparative analysis of the representations generated by both architectures was carried out in [
12]. The study concludes that ViTs present more uniform representations across all layers, unlike CNNs, which exhibit a hierarchical structure in which extracted features are progressively more abstract with increasing depth. These differences are attributed to the self-attention mechanism—which allows for capturing global information from the first layers—as well as to residual connections, which facilitate the efficient propagation of features from lower to higher layers.
In the field of face recognition, the authors of [
5] conducted a comparative analysis of different architectures, demonstrating that ViTs outperform CNNs in accuracy for face identification and verification tasks across five diverse datasets. Specifically, ViTs were evaluated against widely used CNN architectures, including EfficientNet [
13], Inception [
14], MobileNet [
15], ResNet [
16], and VGG [
17]. The study also indicates that ViTs’ inference speed is competitive: although approximately 24% slower than MobileNet, the fastest model tested, ViTs possess roughly seven times more parameters. Additionally, ViTs exhibit greater robustness to variations in image distance and are notably more effective in managing partial occlusions of the face, owing to their global receptive fields that are less impacted by local obstructions compared to CNNs. Overall, the results suggest that ViTs provide superior performance and enhanced adaptability under varying conditions.
Further studies [
18] confirmed the advantage of transformer-based models, including ViT and Swin Transformer [
19], in face recognition tasks involving masked faces. These models, pre-trained on ImageNet-21k, achieved accuracies above 90% for unmasked images and around 87% for masked images, illustrating the importance of large-scale pre-training in complex real-world scenarios. Similarly, the authors of [
20] proposed the Sparse Vision Transformer (S-ViT) for face recognition, which incorporates relative positional encoding [
21] to better capture spatial relationships. S-ViT outperformed conventional CNNs and standard ViTs, achieving up to 3.27% higher accuracy than ResNet50 using ArcFace loss [
22]. ARTriViT [
23], a ViT-based Siamese network with triplet loss, further demonstrated the effectiveness of ViTs in handling occlusions, pose variations, lighting changes, and limited sample sizes, achieving state-of-the-art performance on the Celeb-DF v2 dataset [
24].
In addition, several hybrid architectures combining CNNs and ViTs have been proposed to capitalize on the strengths of both approaches. For example, the authors of [
25] introduced EdgeFace, a lightweight face recognition model optimized for deployment on edge devices, achieving high accuracy on LFW [
26], IJB-B [
27], and IJB-C [
28] datasets while maintaining low computational requirements. Similarly, ref. [
29] developed MobileFaceFormer, which features parallel CNN and ViT branches with a feature fusion mechanism that retains both local and global facial features. Another approach, FaceLiVT [
30], integrates a lightweight multi-head linear attention mechanism within a hybrid CNN-ViT architecture, delivering competitive performance across multiple benchmarks and enabling faster inference suitable for mobile applications. Lastly, ref. [
31] proposed a hybrid model combining CNNs, ViTs, and Multi-Layer Perceptrons (MLPs) that utilizes CNNs for local feature extraction, ViTs for capturing global context, and an MLP classifier for decision-making, resulting in improved accuracy, robustness, and computational efficiency over pure architectures.
To date, no comprehensive comparison of CNNs and ViTs has been conducted in OSR scenarios for face recognition. This study therefore aims to evaluate the performance of pure CNN and pure ViT architectures in such a scenario, with hybrid models explicitly excluded from the scope of this research.
4. Features of the CNN and ViT Models
As noted earlier, this paper compares CNNs and ViTs for face recognition in OSR scenarios; the selected models are detailed below.
The selection of CNN architectures was guided by specific criteria: (i) historical significance, representing key milestones in the development of CNNs for computer vision; and (ii) architectural diversity, covering varying levels of complexity, connectivity patterns, and number of parameters. These criteria ensure that the comparison with ViTs captures a broad range of CNN models.
Based on these considerations, the following CNN models were selected, as summarized in
Table 1: Inception-V3, ResNet50, DenseNet201, MobileNet-V3-Large, and EfficientNet-B0. These models have widespread use in the existing literature [
5,
31,
32,
33,
34] and have been extensively validated across various computer vision applications [
35,
36,
37,
38], providing a relevant benchmark for comparing traditional CNNs with transformer-based architectures.
Since the introduction of CNN architectures, a series of increasingly advanced models have contributed to significant progress in the field of DL for computer vision. In 2015, the Inception-V3 model [
14] introduced factorized convolutions and optimized module designs, improving both accuracy and computational efficiency. That same year, ResNet50 [
16] revolutionized deep network training through residual connections, enabling significantly deeper architectures without compromising performance.
Later, DenseNet201 [
39], published in 2017, proposed dense connectivity between layers to strengthen feature propagation and reduce parameter redundancy. In 2019, MobileNet-V3-Large [
40] targeted efficient inference on mobile and embedded devices by combining neural architecture search with lightweight building blocks. EfficientNet-B0 [
13], also released in 2019, achieved a balance between performance and efficiency through compound scaling and careful architecture design.
Finally, ViTs [
4] (ViT-B-16), introduced in 2020, marked a paradigm shift by adapting Transformers [
41] to computer vision, reaching competitive performance with CNNs when pre-trained on large-scale datasets.
5. Baseline Conditions for Analytical Comparison
This section provides a detailed account of the procedures followed in this study, which are divided into three main phases: pre-training, fine-tuning, and evaluation in an OSR scenario.
5.1. Pre-Training
All the models listed in
Table 1 were initially pre-trained on a large-scale dataset to learn discriminative facial representations with strong generalization capabilities. For this pre-training phase, the VGGFace2 dataset [
42] was employed, given its widespread adoption in the literature for its large volume of facial images. Developed by Oxford University, this dataset comprises a total of 3.3 M images representing 9131 individuals. The dataset is divided into two partitions: 8631 training classes containing 3.1 M images and 500 test classes containing 169,396 images. For this work, only the training split was used, which contains an average of 364 images per class. The distribution of the number of images per class in the training and testing splits, along with the total images across both splits, is illustrated in
Figure 5.
Before pre-training the models, the images underwent a comprehensive pre-processing pipeline, including resizing, channel normalization and data augmentation techniques, with the aim of improving the generalization capacity and robustness of the model to variations in the capture conditions, such as changes in illumination or posture. The images were resized uniformly to 224 × 224 pixels, and normalization was applied using a mean and standard deviation of 0.5 per channel. In addition, various data augmentation techniques were incorporated, such as random horizontal flipping, rotations up to ±15 degrees, conversion to grayscale with a 10% probability, random adjustments in brightness, contrast, saturation, and hue, random cropping with scaling between 70% and 100% of the image area, and blurring. Each of these augmentations enhanced the models’ robustness: flipping and rotations improved resilience to pose variations, grayscale conversion and color adjustments mitigated biases related to skin tone and lighting conditions, random cropping and scaling directed the models’ attention to essential facial features rather than background details, and blurring increased resilience to low-quality or out-of-focus images. The application of these augmentations reduced overfitting, allowing the models to start the subsequent fine-tuning phase with weights that captured general facial representations.
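A possible realization of this pre-processing pipeline with torchvision is sketched below. The values stated in the text (224 × 224 resizing, rotations up to ±15 degrees, 10% grayscale probability, 70–100% crop scale, per-channel mean and standard deviation of 0.5) are used directly, while the remaining jitter and blur parameters are assumptions.

```python
# Illustrative torchvision pipeline for the pre-training augmentations described above.
from torchvision import transforms

pretrain_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),           # crop 70-100% of the area, resize to 224x224
    transforms.RandomHorizontalFlip(),                             # robustness to pose variations
    transforms.RandomRotation(15),                                 # rotations up to +/-15 degrees
    transforms.RandomGrayscale(p=0.1),                             # grayscale with 10% probability
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),               # brightness/contrast/saturation/hue jitter (assumed strengths)
    transforms.GaussianBlur(kernel_size=3),                        # resilience to blurry, low-quality images
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # per-channel mean/std of 0.5
])
```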
Furthermore, the set of images used was divided into 95% for training and 5% for validation in order to evaluate and monitor the generalizability of the models. The pre-training process was carried out over a maximum of 50 epochs, with both training and validation phases performed in each epoch. In addition, an early stopping strategy was implemented with a patience parameter of 5, so that if the accuracy of the validation set did not improve for 5 consecutive epochs, training was stopped early to prevent overfitting and optimize the use of computational resources.
In relation to learning rates, a uniform value was assigned to all models, and a combined scheduler integrating a linear warm-up phase followed by a CosineAnnealingWarmRestarts readjustment was implemented to efficiently manage the learning rate during the pre-training process.
Finally, pre-training was performed using the CrossEntropyLoss loss function, and optimization was carried out using the AdamW algorithm [
43], which combines the advantages of Adam with decoupled weight decay regularization.
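The optimizer and learning-rate schedule described above could be set up as in the following sketch; the base learning rate, warm-up length, restart period, and weight-decay value are placeholders rather than the exact settings used in this study.

```python
# Illustrative AdamW optimizer with linear warm-up followed by cosine annealing with warm restarts.
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingWarmRestarts, SequentialLR

model = nn.Linear(512, 8631)                                       # stand-in for any backbone in Table 1
criterion = nn.CrossEntropyLoss()                                  # loss function used for pre-training
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)   # placeholder values

warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)      # linear warm-up phase
cosine = CosineAnnealingWarmRestarts(optimizer, T_0=10)            # cosine annealing with restarts afterwards
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(50):
    # ... training and validation loop over the VGGFace2 split (early stopping with patience 5 in practice) ...
    scheduler.step()
```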
5.2. Fine-Tuning
After gaining a broad understanding of human faces through extensive pre-training on the large-scale VGGFace2 dataset, the models were subsequently fine-tuned using a smaller, more specific dataset containing a limited number of individuals. Specifically, the fine-tuning dataset consisted of exactly 100 people. This second phase aimed to adapt the models to the targeted dataset, thereby enhancing their performance within this particular group of people. During the fine-tuning phase, all parameters from the pre-trained models were preserved, except for the final classification layer, which was replaced and reinitialized to accommodate the change in the number of output classes from 8631 to 100.
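As an illustration, this head replacement could be performed as follows for two of the evaluated backbones, using the torchvision model definitions; in practice, the weights would be loaded from the pre-training phase rather than initialized from scratch, and the other backbones expose similar classification heads.

```python
# Illustrative replacement of the final classification layer (8631 pre-training classes -> 100 known people).
import torch.nn as nn
from torchvision import models

resnet = models.resnet50()                                   # weights loaded from pre-training in practice
resnet.fc = nn.Linear(resnet.fc.in_features, 100)            # new 100-class head for the known identities

vit = models.vit_b_16()
vit.heads.head = nn.Linear(vit.heads.head.in_features, 100)  # same replacement for the ViT classifier
```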
For this second phase, 100 individuals from the CASIA-WebFace [
44] dataset were used. CASIA-WebFace [
44] is a face recognition dataset developed by the Chinese Academy of Sciences, comprising 494,414 images representing 10,575 individuals.
In this case, each of the 100 known individuals was represented by 70 training images and 10 validation images. Prior to fine-tuning, all images underwent a pre-processing procedure to ensure consistency with the pre-training phase. This process included face detection using Multi-task Cascaded Convolutional Networks (MTCNN) [
45], which cropped the facial region by removing background noise. Subsequently, each cropped face was resized to 224 × 224 pixels and normalized using a mean and standard deviation of 0.5.
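A minimal sketch of this pre-processing step is shown below, using the facenet-pytorch implementation of MTCNN as an assumed library choice and a hypothetical image path.

```python
# Illustrative MTCNN face cropping, resizing, and normalization (library choice and path are assumptions).
from facenet_pytorch import MTCNN
from PIL import Image

mtcnn = MTCNN(image_size=224, margin=0, post_process=False)  # detect and crop the facial region at 224x224
img = Image.open("subject_0001.jpg")                         # hypothetical CASIA-WebFace image
face = mtcnn(img)                                            # float tensor (3, 224, 224) in [0, 255], or None
if face is not None:
    face = (face / 255.0 - 0.5) / 0.5                        # per-channel mean and std of 0.5
```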
Once the training and validation images of the 100 known individuals were pre-processed, the fine-tuning phase was conducted over 40 epochs using the CrossEntropyLoss function and the Adam optimizer with a fixed learning rate. The final models were selected based on their validation set performance, specifically those with the highest classification accuracy.
5.3. Evaluation
After fine-tuning the models using images of 100 known individuals, their performance was first assessed in a CSR scenario to verify their ability to accurately identify these known individuals. In this scenario, the SoftMax function was used as the component responsible for converting the models’ output scores into probability distributions over the 100 known classes. This allowed for the evaluation of the models’ ability to differentiate and classify the previously mentioned 100 known people. For this purpose, 10 test images from each known person were used. These images were processed by applying MTCNN for face cropping, followed by resizing to 224 × 224 pixels and normalization using a mean and standard deviation of 0.5 in each channel.
Subsequently, the models were evaluated in an OSR scenario, which comprised both known and unknown individuals who were not part of the training set, neither during the pre-training nor the fine-tuning phases. For this scenario, the OpenMax algorithm was implemented, with the selected parameter values as summarized in
Table 2.
The selected OpenMax parameters (the tail size, the number of top classes to revise, and the rejection threshold) were adopted from [46], which focused on the optimization of OpenMax parameters for OSR scenarios. More precisely, the tail size controls the number of extreme values used to fit the Weibull distributions: larger values reduce sensitivity to unknown faces, while smaller values could make the model more sensitive to unknown faces but also more prone to false positives. The number of top classes to revise specifies how many of the highest-scoring known classes are adjusted by OpenMax; increasing it may distort the probability distribution, whereas decreasing it might leave some overconfident predictions uncorrected. The threshold was set to 0.9 to effectively balance the detection of unknown faces while maintaining high accuracy on known identities; raising it would make the model more conservative in identifying known faces, while lowering it could increase the risk of misclassifying unknown faces as known.
To simulate images of unknown people, 1000 images were used, each representing individuals from the CASIA-WebFace dataset who were not included among the known people. These images were incorporated into the test set of known people and labeled as belonging to an unknown class. As in previous steps, all images were first cropped using MTCNN, then resized to 224 × 224 pixels and subsequently normalized using a mean and standard deviation of 0.5 in each channel.
6. Results and Discussion
This section presents the results of the study, beginning with an overview of the hardware and software environment employed. Subsequently, the models’ learning behavior during the fine-tuning phase is examined through training and validation curves for key metrics, including loss, recall, and precision, providing insight into their performance evolution. Following this, the models’ accuracy is evaluated in both a CSR scenario, involving known individuals, and an OSR scenario, which includes unknown identities. In order to isolate the effect of the OpenMax algorithm, an ablation study is conducted by applying a threshold to the probabilities generated by SoftMax rather than implementing the OpenMax mechanism, demonstrating that the observed performance differences between models are not solely attributable to the choice of mechanism for OSR scenarios. Finally, the discussion includes an analysis of the computational cost of the evaluated models, considering both FLOPs and inference time per image.
All results reported correspond to 10 independent runs with different seeds. For each metric, the mean, standard deviation, and 95% confidence interval were computed to ensure reliability and enable a robust comparison of the models.
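For reference, the per-metric statistics could be computed as in the following sketch; the use of a t-based confidence interval and the placeholder per-run scores are assumptions introduced for illustration.

```python
# Illustrative computation of mean, standard deviation, and a 95% confidence interval over 10 runs.
import numpy as np
from scipy import stats

scores = np.random.default_rng(0).normal(0.9, 0.01, size=10)   # placeholder for 10 per-run macro-F1 values
mean, std = scores.mean(), scores.std(ddof=1)
ci_low, ci_high = stats.t.interval(0.95, df=len(scores) - 1,
                                   loc=mean, scale=stats.sem(scores))
print(f"mean={mean:.4f}, std={std:.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```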
6.1. Hardware and Software Environment
The experiments were conducted on a Dell PowerEdge R730 (Dell Inc., Round Rock, TX, USA) server equipped with 80-core Intel Xeon Silver 4416+ processors (Intel Corporation, Santa Clara, CA, USA), 128 GB of RAM, and two NVIDIA A40 GPUs (NVIDIA Corporation, Santa Clara, CA, USA), each with 48 GB of dedicated GPU memory. The software environment was based on Ubuntu 22.04.5 LTS (Canonical Ltd., London, UK), using the Python programming language, version 3.10.12. Moreover, Python libraries, as detailed in
Table 3, played a key role in this study.
6.2. Fine-Tuning Performance
Before delving into scenario-specific results, we should first examine how the models learned from the training images during the fine-tuning phase.
Figure 6 illustrates the progression of loss, recall, and precision across epochs, with each plot corresponding to a specific metric and displaying both training and validation curves for all models. The curves represent the average performance across all runs of each model at each epoch, offering a comprehensive overview of the models’ overall behavior during the fine-tuning phase.
The training curves demonstrate effective model learning, with loss decreasing and recall and precision steadily increasing. The validation curves exhibit a similar trend, remaining, as expected, slightly below the training performance. This pattern indicates that the models effectively classify and differentiate known individuals while generalizing well to unseen face images.
6.3. Closed-Set Recognition Performance
Having observed the fine-tuning performance, the next step is to assess how well the models classify known individuals. In the CSR scenario, all models perform strongly, as indicated by the macro-averaged F1 score statistics presented in
Table 4. Notably, DenseNet201 and ViT-B-16 achieve the highest mean macro-averaged F1 scores, showing a slight edge over the other CNN-based architectures. EfficientNet-B0 also performs well, with a mean macro-averaged F1 score close to these leading models. ResNet50 and Inception-V3 yield robust results, while MobileNet-V3-Large, designed for lightweight deployment, shows slightly lower performance relative to the other architectures.
6.4. Open-Set Recognition Performance
Beyond known identities, it is critical to evaluate the models’ ability to handle unfamiliar faces.
Table 5 summarizes the macro-averaged F1 score statistics obtained by the evaluated models in the OSR scenario. In this context, the ViT-B-16 model attains the highest mean macro-averaged F1 score of 90.50%, accompanied by minimal standard deviation and a narrow 95% confidence interval, indicating consistent performance. MobileNet-V3-Large and EfficientNet-B0 also demonstrate strong results, with mean macro-averaged F1 scores of 87.32% and 87.65%, respectively, and relatively low variability across runs. ResNet50 and DenseNet201 exhibit moderate performance, while Inception-V3 achieves the lowest mean macro-averaged F1 score and shows greater variability, as reflected by its larger standard deviation and wider confidence interval.
To further evaluate how the models manage both known and unknown individuals, a detailed analysis of the False Positive Rate (FPR), True Negative Rate (TNR), True Positive Rate (TPR), and False Negative Rate (FNR) was conducted. As detailed in
Table 6, the ViT-B-16 model demonstrates a favorable balance, achieving a high TPR of 98.50% combined with a low FPR of 15.08%, indicating strong generalization capabilities. The EfficientNet-B0 model also performs competitively, with a TPR of 98.42% and an FPR of 19.35%. MobileNet-V3-Large is notable for having the highest TNR at 84.81%, suggesting a conservative approach in detecting unknown individuals, though this is accompanied by a lower TPR of 91.97%. ResNet50 and DenseNet201 exhibit very high TPRs of 99.23% and 99.45%, respectively; however, their higher FPRs of 22.37% and 24.16% may indicate a greater tendency to misclassify unknown faces as known. Inception-V3 displays a comparatively weaker performance, with the highest FPR at 36.98% and the lowest TNR at 63.02%, despite achieving a near-perfect TPR of 99.70%.
To enhance our understanding of the models’ performance in the OSR scenario,
Figure 7 and
Figure 8 illustrate three known and three unknown faces, along with their true labels and the predictions generated by each model, respectively. It can be observed that ViT-B-16 correctly classifies all six faces, demonstrating its superior accuracy and robustness compared to the other models on this set of images.
The experimental evaluation carried out highlights the challenges associated with OSR scenarios, where models must accurately distinguish between known and unknown individuals. Among all tested models, ViT-B-16 demonstrates the most balanced and robust performance, achieving the highest mean macro-averaged F1 score and maintaining an optimal trade-off between correctly identifying unknown faces and minimizing false recognitions. EfficientNet-B0 and MobileNet-V3-Large also perform strongly, with relatively low FPRs and high TNRs that make them effective at handling unfamiliar identities. By contrast, CNN-based models such as DenseNet201 and Inception-V3, despite reaching near-perfect TPRs, show higher FPRs, reflecting a tendency to misclassify unknown individuals as known. These findings suggest that ViT architectures offer superior performance in OSR scenarios and may be better suited for real-world applications where encountering unknown people is common.
Figure 9 presents a visual summary of the metrics across all evaluated models in the OSR scenario, with FPR, TPR, TNR, and FNR displayed in the radar chart on the left, and mean macro-averaged F1 scores shown in the legend.
Nevertheless, it is important to note that CNNs remain highly competitive and widely used, especially in contexts where training data is limited, such as in small- and medium-scale datasets. In these cases, ViTs tend to fail to generalize, which negatively affects their performance. An illustrative example is the study presented by [
50], which showed that when trained with approximately 20,000 images, ViTs achieved superior performance; however, when the dataset was halved, CNNs performed better. In summary, although ViTs offer clear advantages in large-scale training scenarios, CNNs continue to play a fundamental role in a variety of applications with limited data.
6.5. Ablation Study: SoftMax Thresholding and OpenMax
To evaluate whether the observed performance differences depend on the OpenMax algorithm, an ablation study was conducted using SoftMax Thresholding as an alternative. In this approach, images whose highest predicted probability did not exceed a predetermined threshold of 0.9—the same value used in OpenMax—were categorized as unknown.
Table 7 presents the corresponding macro-averaged F1 score statistics.
The results clearly demonstrate that ViT-B-16 consistently surpasses all CNN architectures in the OSR scenario, independent of whether OpenMax or SoftMax Thresholding is employed. ViT-B-16 attains the highest mean macro-averaged F1 score with minimal variance, reflecting its enhanced capacity for identifying known and unknown identities. This finding indicates that the observed differences between models are not due to the choice of using OpenMax or another alternative but rather reflect the inherent robustness and accuracy of ViT-B-16.
6.6. Computational Cost and Inference Time Analysis
In addition to evaluating model accuracy, the deployment of DL models requires consideration of their computational cost. To this end,
Figure 10 summarizes key metrics for all evaluated models, including floating-point operations (FLOPs) and inference time per image.
As illustrated in the graph on the left, the ViT-B-16 architecture has the highest computational cost at approximately 16.86 GFLOPs. This is mainly attributed to its attention-based design, which allows it to capture global relationships between all patches in the image, unlike CNNs, which focus on extracting local features through hierarchical convolutions. CNN architectures such as DenseNet201 (4.29 GFLOPs), ResNet50 (4.09 GFLOPs), and Inception-V3 (2.84 GFLOPs) offer a balance between depth and computational efficiency. Lighter models, such as MobileNet-V3-Large (0.22 GFLOPs) and EfficientNet-B0 (0.38 GFLOPs), significantly reduce the number of FLOPs, making them particularly suitable for resource-constrained environments.
The graph on the right shows the inference time per image, demonstrating that a higher number of FLOPs does not necessarily imply longer inference times. For example, the ViT-B-16 model requires an average of approximately 7.2 ms per image, remaining competitive thanks to the optimized parallelization of its attention mechanisms. Although DenseNet201 has a lower FLOPs cost than ViT-B-16, it has a higher inference time. On the other hand, models such as ResNet50, Inception-V3, EfficientNet-B0, and MobileNet-V3-Large show moderate inference times, indicating a balance between architectural depth and computational efficiency.
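As an illustration of how such figures can be obtained, the sketch below measures FLOPs with the fvcore library and times inference on a single image. The choice of profiler, the warm-up and iteration counts, and the CPU-only setup are assumptions, as the exact measurement procedure is not detailed here.

```python
# Illustrative FLOPs and per-image inference-time measurement (tooling and loop sizes are assumptions).
import time
import torch
from torchvision import models
from fvcore.nn import FlopCountAnalysis

model = models.vit_b_16().eval()
dummy = torch.randn(1, 3, 224, 224)                       # one 224x224 input image

flops = FlopCountAnalysis(model, dummy).total()           # operation count for a single forward pass
print(f"{flops / 1e9:.2f} GFLOPs")

with torch.no_grad():
    for _ in range(10):                                   # warm-up iterations before timing
        model(dummy)
    start = time.perf_counter()
    for _ in range(100):
        model(dummy)
    print(f"{(time.perf_counter() - start) / 100 * 1e3:.2f} ms per image")
```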
In summary, these results highlight the trade-offs between model accuracy and computational efficiency. While ViT-B-16 offers greater accuracy and robustness in capturing global dependencies in images, its high FLOPs requirements may limit its implementation in resource-constrained scenarios. On the other hand, lightweight architectures such as MobileNet-V3-Large and EfficientNet-B0 represent an attractive option, achieving competitive performance at a significantly lower computational cost, making them ideal for real-time applications or environments with limited hardware.
7. Conclusions
This paper presents a comparison between CNN and ViT architectures for face recognition in an OSR scenario. Specifically, five CNN models—Inception-V3, ResNet50, DenseNet201, MobileNet-V3-Large and EfficientNet-B0—are compared to the base ViT model, ViT-B-16. For the classification of individuals in such a scenario, the OpenMax algorithm is implemented, allowing the models to identify and assign an unknown category instead of forcing classification into one of the known identities from the training phase.
The results show that ViT-B-16 achieves the strongest performance in handling both known and unknown individuals, attaining a mean macro-averaged F1 score of 90.50% across multiple runs. These results are in line with the findings reported in related works, which demonstrate that ViTs constitute an emerging architecture capable of outperforming traditional CNNs when pre-trained on large-scale image datasets. In this work, the strong results achieved by ViT-B-16 can be attributed to such pre-training.
However, it is important to note that while the implemented VGGFace2 and CASIA-WebFace datasets are commonly used in face recognition research, they may not comprehensively represent the demographic diversity found in real-world populations. Variations in age, gender, and ethnicity can impact model performance, particularly in OSR scenarios, and may influence the overall identification accuracy of the evaluated models.
In practical applications such as surveillance and access control, the primary objective is typically to identify a specific group of individuals, rather than incorporating external individuals during training. Therefore, it is crucial to develop models that can reliably differentiate between known and unknown people, enhancing system robustness, security, and operational reliability. This work represents a further step in the development and deployment of such systems for real-world applications, where accuracy is a key requirement. As a result of the conducted study, we showed that ViTs emerge as a particularly promising alternative in OSR scenarios, especially when appropriate model pre-training is performed. Nonetheless, it is important to mention that CNNs remain highly effective and extensively utilized, especially in scenarios with limited training data, and therefore continue to serve a vital function in numerous practical applications.
In terms of future research directions, we propose conducting a comparative evaluation of different variants of ViTs, including Swin Transformers. In addition, researchers could explore hybrid models that take advantage of the complementary advantages of both approaches.