1. Introduction
Deep learning models have become ubiquitous across a wide range of application domains due to their efficiency and high performance. The field of biometric authentication is also adopting deep learning-based systems such as face, fingerprint, and iris recognition systems. However, in most cases, the results or decisions of these models are not transparent, i.e., it is not clear why a deep learning model arrived at a specific prediction. This can lead to a lack of trust in the results. To achieve transparency and increase the trustworthiness of such systems, explainable AI techniques have drawn much attention [1]. Explainability seeks to provide the reason behind the prediction/classification/decision of a deep learning model. There are multiple types of explanation techniques; however, this paper primarily focuses on the visual explanation of deep learning models [2], also called discriminative localization. Discriminative localization refers to the process of identifying and highlighting the specific regions in an image that are most relevant to a model's decision making, particularly for distinguishing between different classes [3].
Explainability can enhance trust in deep learning-based biometric systems such as face recognition. Explaining the predictions of deep learning models is a complicated task, as their structure is very complex [4]. There are various techniques for visually explaining the decision of a deep learning model, such as Saliency Maps [5], LIME [6], Class Activation Mapping (CAM) [3], and Gradient-weighted Class Activation Mapping (Grad-CAM) [7].
In this paper, we use a CAM-based discriminative localization technique, Scaled Directed Divergence (SDD) [8], for narrow/specific localization of the spatial regions of face images that are relevant to the deep learning model's prediction/decision. A Class Activation Map (CAM) highlights the class-specific discriminative regions the deep learning model relies on to arrive at its prediction/decision. The SDD technique is particularly useful for the fine localization of class-specific features in situations of class overlap. Class overlap occurs when images from different classes share similar regions or features, making it challenging for a model to distinguish between them. Therefore, the SDD technique is suitable for visually explaining the decisions of face recognition systems, which often deal with similar face images (face images of different individuals that share similar features).
In a previous paper, Williford et al. [9] performed explainable face recognition using a new evaluation protocol called the "Inpainting Game". Given a set of three images (probe, mate 1, and non-mate), the explainable face recognition algorithm had to identify the pixels in a specific region that contribute more to the mate's identity than to the non-mate's identity. This identification was represented through a saliency map indicating the discriminative regions for mate recognition. The "Inpainting Game" protocol was used to evaluate the identification made by the explainable face recognition algorithm. However, Williford et al. did not consider a multi-class model for explainable face recognition where there are both similar and non-similar classes. Rajpal et al. [10] used LIME to visually explain the predictions of face recognition models. However, they did not consider the case of class overlap (common areas highlighted in the visual explanations of different classes); therefore, their explanations failed to narrowly show the important features responsible for a decision.
In this work, we implement the Scaled Directed Divergence (SDD) method to enhance the interpretability of convolutional neural network (CNN) predictions for face recognition tasks. A CNN model is trained on the multi-class FaceScrub dataset [11], and its classification accuracy is evaluated on the test set. For a test image, we identify the predicted class and apply the SDD technique to generate the class activation map (SDD CAM), enabling fine localization of the most relevant spatial regions contributing to the model's decision. Compared to the traditional CAM, the SDD CAM provides more precise and focused explanations. We perform the same experiments with another multi-class dataset sampled from CASIA-WebFace [12]. We evaluate the quality and effectiveness of the visual explanations generated by the SDD method. Additionally, we demonstrate the adaptability of the SDD technique by integrating it with the Score-CAM method [13] to generate SDD Score-CAMs, and we assess their explanatory performance as well. The key contributions of this paper are as follows:
Applying the SDD method for explainable face recognition;
Evaluating the visual explanation quality of SDD CAM;
Demonstrating the integration of SDD with Score-CAM to generate SDD Score-CAM explanations, showing the adaptability of the SDD method.
The paper is organized as follows: Section 2 reviews related work; Section 3 presents the methodology and algorithm; Section 4 describes the face recognition models, datasets, experimental setup, and visual explanation results; Section 5 provides the evaluation outcomes, with Section 5.3 offering a discussion of the results; Section 6 explores the adaptability of the SDD technique; and Section 7 concludes the paper.
2. Related Work
This section briefly highlights key visual explanation techniques, which are foundational to interpretability in deep learning models.
Saliency Maps [5] prioritize pixels in an image based on their impact on the output score of a CNN. This is achieved through a first-order Taylor expansion that approximates the model's nonlinear response. The resulting map, computed via backpropagation, highlights the most influential pixels. Local Interpretable Model-agnostic Explanations (LIME) [6] provides local explanations by approximating a complex model with an interpretable one around a specific prediction. LIME helps demystify why a model made a particular prediction by simplifying the decision process for individual instances.
DeepLIFT [14] decomposes neural network outputs by attributing the prediction to the contributions of input features, comparing activations against a reference. The method computes these contributions efficiently in a single backward pass. Guided Backpropagation [15] improves visualization by combining the "deconvnet" approach with backpropagation, enabling the visualization of important features at both the final and intermediate layers of the network. The technique preserves only positive gradients, making it well suited to understanding what drives activations in the network. Tang et al. [16] present a two-stage meta-learning framework that integrates attention-guided pyramidal features to enhance performance in few-shot fine-grained recognition tasks. The proposed method effectively captures discriminative features at multiple scales, facilitating improved classification accuracy with limited labeled samples.
Class Activation Mapping (CAM) [3] highlights discriminative image regions by projecting the weights of the final fully connected layer onto the convolutional feature maps. This technique has been foundational for image classification and is described in detail in Section 3. The Scaled Directed Divergence (SDD) technique, based on CAM, is used in this work to explain face recognition in Section 4. Gradient-weighted Class Activation Mapping (Grad-CAM) [7] extends CAM by incorporating gradient information, which helps identify important image regions associated with specific classes. Grad-CAM enhances class-specific localization by leveraging gradients from the final convolutional layers. Guided Grad-CAM, which combines Grad-CAM with Guided Backpropagation, offers high-resolution, class-discriminative visualizations.
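For reference, in the standard CAM formulation [3], the map for class $c$ at spatial location $(x, y)$ is

$$M_c(x, y) = \sum_k w_k^c \, f_k(x, y),$$

where $f_k$ is the $k$-th feature map of the last convolutional layer and $w_k^c$ is the weight of the final fully connected layer connecting the $k$-th globally average-pooled feature to class $c$.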
4. Explainable Face Recognition
Our experiment is divided into two parts: (1) using SDD to explain the face recognition models' results, and (2) evaluating the SDD results. We first describe the two face recognition models and use SDD to visually explain their results (Section 4.1 and Section 4.2).
4.1. Face Recognition Model—1
We use the state-of-the-art AdaFace model [17] for face recognition. The AdaFace model is a ResNet100 pre-trained on the WebFace12M face dataset [18]. For the face classification task, the pre-trained AdaFace model is fine-tuned on the FaceScrub dataset [11] (a public dataset). The FaceScrub dataset is a high-quality resource for face recognition research, offering several advantages in terms of diversity and data integrity:
Extensive Coverage: Comprising 43,148 face images of 530 celebrities (265 male and 265 female), FaceScrub provides a substantial number of images per individual. This extensive coverage enhances the dataset’s utility for training and evaluating face recognition models.
Diverse Real-World Data: The images were collected from various online sources, including news outlets and entertainment websites, capturing subjects in uncontrolled environments. This diversity introduces variations in pose, lighting, and expression, which are crucial for developing robust face recognition systems.
Rigorous Data Cleaning Process: To ensure data quality, FaceScrub employs an automated cleaning approach that detects and removes irrelevant or mislabeled images. This process is complemented by manual verification, resulting in a dataset that is both large and accurate.
These attributes make FaceScrub a valuable dataset for advancing face recognition technologies, providing researchers with a rich and diverse set of images for model training and evaluation.
In total, there are 530 classes (each person is a class): 265 male classes (23,216 images) and 265 female classes (19,932 images). We divide the dataset into train, validation, and test sets with a ratio of 80/10/10, giving 34,099 training images, 4555 validation images, and 4494 test images. The AdamW optimizer is used for fine-tuning the model with a learning rate of 0.005 for 20 epochs. The StepLR learning rate scheduler is used with step_size = 10 and gamma = 0.1. The training time is around 1 h and 10 min. The overall test accuracy of the model is 96.33%. We used an NVIDIA RTX A2000 GPU (12 GB memory, driver version 525.125.06, CUDA version 12.0) for training and testing the model.
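For reproducibility, the fine-tuning configuration above corresponds to roughly the following PyTorch setup. This is a minimal sketch: load_adaface_resnet100 and train_loader are hypothetical placeholders for the actual AdaFace checkpoint loading and FaceScrub DataLoader.

import torch
from torch import nn, optim

# Placeholder: the actual AdaFace (ResNet100, WebFace12M) loading code differs.
model = load_adaface_resnet100(pretrained="webface12m")
model.fc = nn.Linear(model.fc.in_features, 530)  # 530 FaceScrub identities

optimizer = optim.AdamW(model.parameters(), lr=0.005)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    for images, labels in train_loader:  # placeholder FaceScrub DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decays the learning rate by gamma = 0.1 every 10 epochs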
Visual Explanation of Face Recognition
Here, we visually explain the prediction of the face recognition model. We describe the fine localization of the most relevant face features (considered significant by the model), which visually explains the prediction of the CNN model. We consider four test images of four different classes (subject_4, subject_120, subject_63, and subject_134) that belong to the test set.
In Figure 3, the class of the test image is subject_4, which is predicted correctly by the face model. All the CAMs are overlaid on the test image of class subject_4, and the target class is subject_4. For the SDD method, the top five classes (based on the probabilities predicted by the model) are considered. In the bottom row, the first five CAMs (from the left) are generated for the top five classes; the top row shows sample images of the classes for which the CAMs are generated. There is significant overlap among the CAMs of these classes. The last CAM in the bottom row is generated using SDD for the target class subject_4 and shows a very narrow/specific localization (compared to the traditional CAM of class subject_4) of the most relevant face features. The SDD CAM highlights the most important features of class subject_4 (with respect to the other four classes) for the correct prediction of the model.
In Figure 4, the class of the test image is subject_120, which the face model predicted correctly. As in the previous example, the CAMs are overlaid on the test image, the target class is subject_120, and the top five classes are considered for the SDD method. The last CAM in the bottom row is generated using SDD for the target class subject_120 and shows a very narrow/specific localization (compared to the traditional CAM of class subject_120) of the most relevant face features. The SDD CAM highlights the most important features of class subject_120.
In Figure 5, two similar examples are shown. In the top row, the test image belongs to class subject_63, and in the bottom row, the test image belongs to class subject_134; both are predicted correctly by the model. In both cases, the top five classes (based on the model's predictions) are considered for the SDD method. The last column shows the SDD CAMs for classes subject_63 and subject_134, respectively.
4.2. Face Recognition Model—2
We use the same pre-trained AdaFace model, fine-tuned on a face dataset of 500 classes sampled from the CASIA-WebFace dataset [12] (a public dataset). CASIA-WebFace is another high-quality and diverse dataset used for face recognition research. In the sampled dataset, there are a total of 500 classes (each person is a class), including both male and female classes. We divide the dataset into train, validation, and test sets with a ratio of 80/10/10, giving 38,313 training images, 5046 validation images, and 4990 test images. The hyperparameters are the same as described in Section 4.1. The training time is around 1 h and 18 min. The overall test accuracy of the model is 81.86%.
Visual Explanation of Face Recognition
In this section, we show the same kind of examples for the face recognition model described above. Figure 6 shows SDD CAMs generated for four test images, all of which are predicted correctly by the model.
5. Evaluation of the Visual Explanation
In this section, we evaluate the visual explanations generated by the Scaled Directed Divergence method for the face recognition models. We use the deletion-and-retention evaluation scheme [19]. The deletion method evaluates the effectiveness of a visual explanation by removing the most important regions of an image identified by the explanation map and measuring the model's confidence drop (a higher drop is better). The retention method, conversely, retains only the most significant regions of an image identified by the explanation map and measures the model's confidence drop (a lower drop is better).
5.1. Deletion
During deletion, we remove the top 20% of the SDD CAM values by replacing the corresponding pixels with 0. We remove the top 20% of values to demonstrate the impact of the most important regions highlighted in the SDD CAM. Note that the top 20% of values covers more area than is visibly highlighted in the SDD CAM, because regions with lower CAM intensity can still fall within the top 20% of values even though they are not visibly highlighted.
Figure 7 shows an example of the deletion scheme. To compare against the important regions of the SDD CAM, we also remove random areas of the same dimensions (we call this the random CAM). When certain regions of a test image are removed/covered, the prediction confidence of the face recognition model drops. When SDD CAM regions are removed/covered, the confidence drop should be higher; when random regions are removed/covered, the confidence drop should be lower.
For example, suppose that for a test image, the original confidence of the model is 99% (for the top class), and when the SDD CAM regions are removed, the confidence falls to 70% (for the original top class); the confidence drop is then 29 percentage points. When random regions are removed instead, the remaining confidence should be higher than 70% and the confidence drop lower than 29 points, because the SDD CAM regions are more important to the model's prediction than random regions. This hypothesis may not hold for every test image, but it should be true for most of them. Therefore, we calculate the average confidence drop over the whole test set (4494 images) of the FaceScrub dataset as well as over the test set (4990 images) of the sampled CASIA-WebFace dataset, removing/covering the SDD CAM regions (setting the top 20% of SDD CAM values to 0, as this demonstrates the impact properly). We also calculate the average confidence drop after removing/covering random regions of the same dimensions.
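To make the metric concrete, below is a minimal sketch of the deletion measurement under these assumptions: model returns logits, image is a (C, H, W) tensor, and cam is the explanation map already upsampled to the input resolution. The name deletion_drop is illustrative, not part of a released codebase.

import torch

def deletion_drop(model, image, cam, top_frac=0.20):
    # image: (C, H, W) input tensor; cam: (H, W) explanation map.
    with torch.no_grad():
        probs = torch.softmax(model(image.unsqueeze(0)), dim=1)[0]
        top_class = int(probs.argmax())
        original_conf = float(probs[top_class])

        # Zero out the pixels carrying the top 20% of CAM values.
        threshold = torch.quantile(cam.flatten(), 1.0 - top_frac)
        mask = (cam < threshold).float()  # 0 where the CAM is highest
        probs_del = torch.softmax(model((image * mask).unsqueeze(0)), dim=1)[0]

    drop = original_conf - float(probs_del[top_class])
    changed = int(probs_del.argmax()) != top_class
    return drop, changed

The retention variant of Section 5.2 inverts the mask (image * (1 - mask)), keeping only the top-valued regions; averaging drop and changed over the test set yields the two metrics reported below, and a random CAM baseline can be obtained by randomly shifting the mask before applying it.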
Another metric for evaluating the SDD CAM results is the change in prediction. In many cases, the prediction of the model changes when certain areas of the test image are removed/covered. For deletion, the prediction change percentage over the whole test set should be higher for SDD CAMs and lower for random CAMs. For the deletion method on the FaceScrub test set, the average confidence drop for SDD CAMs is 62.33%, whereas the average confidence drop for random CAMs is 42.50%. The prediction change percentage for SDD CAMs is 51.36%, whereas for random CAMs it is 29.73%. The results are shown in Table 1. For the CASIA-WebFace test set, the average confidence drop for SDD CAMs is 45.05%, whereas the average confidence drop for random CAMs is 26.83%. The prediction change percentage for SDD CAMs is 60.40%, whereas for random CAMs it is 37.90%. For both evaluation metrics, (1) Average Confidence Drop and (2) Prediction Change Percentage, and for both datasets, the explanations generated by the SDD CAM outperform those of the random CAM. This means that the SDD CAM highlights more relevant image regions, leading to a greater drop in model confidence and more frequent prediction changes when those regions are removed/covered.
5.2. Retention
In the retention method, only those regions that were removed/covered in the deletion method are retained. Figure 8 shows an example of the retention scheme.
As only certain regions are retained, the prediction confidence of the model drops significantly for both the SDD CAM and the random CAM. However, when the SDD CAM regions are retained, the confidence drop should be lower, as the SDD CAM regions are the most relevant regions to the model; when random CAM regions are retained, the confidence drop should be higher, as these regions are generally less significant. The prediction change percentage should likewise be lower for the SDD CAM than for the random CAM. For the FaceScrub test set, the average confidence drop for the retention method is 69.43% for SDD CAMs, whereas for random CAMs it is 82.21%, which is higher. The prediction change percentage for SDD CAMs is 57.68% using the retention method, whereas for random CAMs it is 79.93%, which is much higher. The results are shown in Table 2. For the CASIA-WebFace test set, the average confidence drop is 46.30% for SDD CAMs and 61.95% for random CAMs. The prediction change percentage is 59.98% for SDD CAMs and 84.99% for random CAMs. The evaluation results clearly show that the SDD CAM provides more effective explanations than the random CAM.
5.3. Discussion
The SDD results show that the most relevant regions of a face that a deep learning model considers for its prediction/decision can be localized very specifically (in a very narrow manner compared to the traditional CAM) using the SDD technique. The SDD explanations are evaluated using the deletion-and-retention scheme (Section 5). The evaluation results validate that the SDD CAM highlights the most important regions of a test image, which the face recognition model considers for its prediction. For the target class, the face features are determined with respect to the other classes.
While the SDD technique is introduced as an effective means of enhancing class-discriminative localization, a comparison with the traditional CAM in terms of computational efficiency and implementation complexity further supports its utility. The traditional CAM is computationally efficient and simple to implement, as it requires only a forward pass and access to the final convolutional feature maps and class weights. SDD builds on this framework by introducing additional steps, specifically generating CAMs for the non-target classes and computing the scaled divergence. Although this adds some computational overhead due to multiple forward passes or class-specific CAM generation, the implementation remains relatively straightforward within the same architectural pipeline. The additional cost is justified by the substantial improvement in localization precision and interpretability, particularly in class-overlapping or fine-grained scenarios. Thus, SDD strikes a practical balance between ease of integration and improved explanation quality, making it a compelling choice over the traditional CAM. With the GPU, it takes around 1.5 s to generate an SDD CAM from the CAMs of similar classes.
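To make these extra steps concrete, the following is a minimal sketch of one plausible SDD-style combination, assuming, per the description above, that the scaled evidence of the competing top-ranked classes is subtracted from the target class's CAM. The exact divergence formulation and scaling are those of Section 3.2; sdd_cam here is an illustrative approximation, not the reference implementation.

import torch

def sdd_cam(target_cam, other_cams, scale=1.0):
    # target_cam: (H, W) CAM of the target class.
    # other_cams: list of (H, W) CAMs of competing (e.g., top-5) classes.
    # Assumed simplification: subtract the scaled average of the competing
    # CAMs, keeping only the evidence unique to the target class.
    overlap = torch.stack(other_cams).mean(dim=0)
    refined = torch.clamp(target_cam - scale * overlap, min=0)
    return refined / (refined.max() + 1e-8)  # normalize for visualization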
The face features highlighted in SDD CAMs are significant to the face recognition model for making its prediction. Such insight can be beneficial for explaining the behavior of deep learning-based face recognition models and increasing their transparency. It can also enhance trust in the decisions of DL-based face recognition systems and be used to improve their performance.
6. SDD Adaptability
SDD is method-agnostic and can serve as a refinement or alternative within the CAM family of techniques. It can be extended to work with other attribution maps by applying the SDD formulation over a set of such maps. In this section, we generate SDD Score-CAMs instead of SDD CAMs to demonstrate the SDD technique’s adaptability.
6.1. SDD Score-CAM
SDD Score-CAM is based on Score-CAM [13]. Score-CAM (Score-Weighted Class Activation Mapping) is a gradient-free visual explanation method proposed to improve the interpretability of CNNs. Unlike gradient-based techniques such as Grad-CAM, which rely on backpropagated gradients to assess the importance of activation maps, Score-CAM leverages the model's prediction scores directly. Specifically, Score-CAM generates attention maps by first extracting activation maps from a chosen convolutional layer, then upsampling and normalizing them to create spatial masks. These masks are individually applied to the input image to form masked inputs, which are fed forward through the model to obtain class-specific confidence scores. The contribution of each activation map is quantified by the score it yields, and the final class activation map is produced as a weighted combination of these maps using the corresponding scores as weights. This gradient-independent approach reduces noise and enhances the reliability of the visual explanations, making Score-CAM more robust and architecture-agnostic for interpreting CNN decisions.
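As a compact illustration of the procedure just described, the sketch below assumes model(x) returns logits and feature_maps holds the (K, h, w) activations of the chosen convolutional layer; it omits refinements of the original method [13] such as baseline subtraction and batched masking.

import torch
import torch.nn.functional as F

def score_cam(model, image, feature_maps, target_class):
    # image: (C, H, W); feature_maps: (K, h, w) activations of a conv layer.
    H, W = image.shape[1:]
    with torch.no_grad():
        weights = []
        for fmap in feature_maps:
            # Upsample and min-max normalize each activation map into a mask.
            mask = F.interpolate(fmap[None, None], size=(H, W),
                                 mode="bilinear", align_corners=False)[0, 0]
            mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)
            # Score the masked input for the target class.
            logits = model((image * mask).unsqueeze(0))
            weights.append(torch.softmax(logits, dim=1)[0, target_class])
        weights = torch.stack(weights)  # (K,) class-specific scores
        cam = torch.relu((weights[:, None, None] * feature_maps).sum(dim=0))
    return F.interpolate(cam[None, None], size=(H, W),
                         mode="bilinear", align_corners=False)[0, 0]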
The details of the SDD framework described in Section 3.2 remain the same for generating the SDD Score-CAM; a set of Score-CAMs is used in place of a set of CAMs to generate the SDD Score-CAM for the target class. Four examples for test images of the FaceScrub dataset are shown in Figure 9. In the first row, a test image of class subject_4 is considered for the SDD Score-CAM implementation, and the model correctly predicted its class. Score-CAMs are generated for the top five classes (based on the model's predictions) and overlaid on the test image. The target class is subject_4, and the last image in the row shows the SDD Score-CAM for class subject_4. Similarly, in the second row, a test image of class subject_120 is considered (correctly predicted by the model); Score-CAMs are generated for the top five classes and overlaid on the test image, and the last image in the row shows the SDD Score-CAM for the target class subject_120. Two more similar examples are shown in the third and fourth rows for a test image of class subject_63 and a test image of class subject_134, respectively (both correctly predicted by the model). SDD Score-CAMs for test images of the CASIA-WebFace dataset are shown in Figure 10.
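Using the illustrative helpers sketched earlier (score_cam above and sdd_cam from the discussion in Section 5.3), the swap amounts to composing the two functions; top5_classes and target_class below are assumed to come from the model's prediction.

# Hypothetical composition of the two sketches above.
maps = {c: score_cam(model, image, feature_maps, c) for c in top5_classes}
sdd_score_cam = sdd_cam(maps[target_class],
                        [m for c, m in maps.items() if c != target_class])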
6.2. Evaluation of SDD Score-CAM Results
To evaluate the SDD Score-CAM results, we use the same deletion-and-retention method described in Section 5, with the same two evaluation metrics: (1) Average Confidence Drop and (2) Prediction Change Percentage. During deletion, we remove the top 20% of the SDD Score-CAM values, and we remove random areas of the same dimensions (random CAM) for comparison with the most important regions of the SDD Score-CAM. For the deletion method on the FaceScrub test set, the average confidence drop for SDD Score-CAMs is 39.17%, whereas the average confidence drop for random CAMs is 13.66%. The prediction change percentage for SDD Score-CAMs is 29.33%, whereas for random CAMs it is 7.85%. The results are shown in Table 3. For the CASIA-WebFace test set, the average confidence drop for SDD Score-CAMs is 45.48%, whereas the average confidence drop for random CAMs is 11.44%. The prediction change percentage for SDD Score-CAMs is 61.30%, whereas for random CAMs it is 20.84%. From the evaluation results, it is evident that the SDD Score-CAM explanations are better than those of the random CAM.
Only the top 20% of the SDD Score-CAM values are retained during the retention process; for the random CAM, the retained regions are shifted randomly. For the retention method on the FaceScrub test set, the average confidence drop for SDD Score-CAMs is 86.69%, which is lower than the average confidence drop for random CAMs (89.22%). The prediction change percentage for SDD Score-CAMs is also lower (89.99%) than that for random CAMs (97.82%). For the CASIA-WebFace test set, the average confidence drop for SDD Score-CAMs is 54.39%, lower than the average confidence drop for random CAMs (66.79%), and the prediction change percentage for SDD Score-CAMs is lower (70.60%) than that for random CAMs (96.33%). A lower average drop and prediction change are better for the retention method, which shows the effectiveness of the SDD Score-CAM explanations. The results are shown in Table 4.
6.3. Discussion
Although the traditional CAM is considered less precise than methods like Score-CAM, our results show that the SDD CAM, based on the traditional CAM, outperforms the SDD Score-CAM in the deletion and retention evaluations. This is because the traditional CAM inherently leverages class-specific weights from the final fully connected layer, producing activation maps that are already focused on class-relevant features. When combined with SDD, these maps become even more discriminative through the suppression of overlapping evidence from other classes. In contrast, Score-CAM, while generally strong for localization, generates broader activation maps that benefit less from the SDD subtraction process. As a result, the SDD CAM tends to highlight more compact and class-unique regions, which leads to a higher confidence drop and prediction change during deletion and a lower drop and prediction change during retention, both of which indicate stronger explanation quality.
Another aspect is computational efficiency. SDD Score-CAM combines the strengths of Score-CAM with the class-discriminative power of the SDD framework, but it comes with increased computational cost and implementation complexity. Unlike the traditional CAM, which requires only a single forward pass, Score-CAM involves multiple forward passes, one per masked input derived from an activation map, to compute class-specific importance weights. Integrating SDD into this process further increases the load by requiring Score-CAM generation for the non-target classes to perform the divergence-based subtraction. As a result, SDD Score-CAM is computationally more intensive, especially in multi-class settings or with high-resolution images. With the GPU, it takes around 27 s to generate an SDD Score-CAM from the Score-CAMs of similar classes. Despite these challenges, SDD Score-CAM can be beneficial in cases where broader localization is preferred.