Article

Beyond Local Explanations: A Framework for Global Concept-Based Interpretation in Image Classification

Department of EECS, College of Engineering, Oregon State University, Corvallis, OR 97331, USA
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3230; https://doi.org/10.3390/electronics14163230
Submission received: 3 July 2025 / Revised: 4 August 2025 / Accepted: 5 August 2025 / Published: 14 August 2025

Abstract

Explaining the decisions of learned models in human-interpretable language is critical for building trustworthy AI. Much of the current work on explainable AI in image classification focuses on generating local (image-specific) explanations that highlight parts of the image responsible for decisions. Although valuable, these local explanations are often not human-interpretable, lack generalizability across entire classes or datasets, and fail to cater to a diverse set of stakeholders. In this work, we introduce a novel framework for generating global explanations in terms of human-aligned concepts that are applicable to any image-based classifier, irrespective of the architecture. Our methods provide both local and global explanations across multiple images from multiple classes. We present our framework for generating global explanations along with experimental results on multiple datasets that demonstrate the effectiveness of our technique. Our method achieves a test coverage of 99.3% on the Stanford Cars dataset.

1. Introduction

Understanding deep neural networks has been a key focus of research since the advent of AlexNet [1]. As machine learning models have become increasingly complex with deeper architectures, sophisticated design choices, and complex activation functions, the need to align them with human values and concepts has come into sharper focus. The advent of regulatory regimes such as GDPR [2] and the AI Act [3] has increased the demand for explanations with properties such as completeness, comprehensibility, and compactness [4].
Explanatory methods are typically classified as white-box or black-box. White-box methods, such as [5,6,7], assume access to a model’s internals, whereas black-box methods [8,9,10,11] operate solely on input–output relationships. Notably, inherently interpretable methods such as ProtoPNet [12] do not fit neatly into either category because they impose specific architectural requirements that prioritize explainability during model training, which may not always be practical.
In this work, we focus exclusively on black-box approaches for explaining image classification models. Many companies would find it impractical to comply with explainability requirements by open-sourcing their models, given the immediate financial implications. Black-box methods offer a viable alternative, ensuring compliance without compromising proprietary interests. Our approach is model-agnostic in that it imposes no constraints on model architecture design or training processes. In fact, it is versatile enough to explain any machine learning model that accepts an image as input and predicts a confidence score for the predicted label.
Explanations can be categorized into local and global explanations. Local explanations clarify the decision-making process for individual images, while global explanations provide insight into the general reasons behind the classification of images across an entire dataset. We assert that a complete understanding of a model’s behavior requires examining both local and global explanations to capture specific decisions and overarching patterns, respectively.
We build on the previous work of [11] that generates multiple local explanations for image classification, and we introduce a novel method for generating global explanations in terms of human-aligned concepts. Traditional image explanations, such as saliency maps, binary maps, and graphs, can be highly subjective depending on the user due to the lack of a common-domain consensus. For instance, there is no definitive answer to what objects/regions in a saliency map should be important for a class and by how much. More importantly, these explanations are image-specific and are not generalizable across datasets. To address this issue, our method grounds explanations in a human-designed taxonomy of objects and object parts present in the dataset [13,14]. Given that deep networks lack explicit knowledge of objects within a scene, our approach provides approximate explanations in terms of human-annotated objects and object parts while reducing the human annotation burden through automatic correspondence methods. Although this might reduce the completeness of the explanation compared to the true model, it significantly enhances the model’s comprehensibility due to its precision and the use of a human-aligned vocabulary. In doing so, our work enables recognition and validation of domain knowledge relevant to the datasets used in image classification.
Global explanations aim to encapsulate multiple decisions, but it is crucial not to provide too much or too little detail. Incomplete global explanations fail to explain many images within a class (e.g., complex scenes with high intraclass variance), whereas overly complex explanations (e.g., a decision tree with hundreds of leaf nodes) may be incomprehensible to humans. Our approach aims to strike a trade-off between the two extremes by carefully controlling the size and form of the explanations.
We evaluated the explanations through coverage graphs that show the fraction of the test data covered by each part of the global explanation. We present results on different image classifiers on three datasets, namely, CUB-200 [15], MIT Scene Parsing [16], and the Stanford Cars [17]. The research questions we aim to answer are as follows:
  • (RQ1): Can global explanation methods be grounded with human taxonomy?
  • (RQ2): What is the relationship between local and global explanations?
  • (RQ3): How can our method be adapted to any model and dataset?
Our approach differs from proxy-based methods such as LIME [18] and ANCHOR [19], which approximate model behavior with local classifiers. Instead, we formalize local explanations as causal/contributing factors (e.g., parts of detected objects) directly responsible for the model’s prediction on a specific instance, in line with concept-based interpretation [20]. A global explanation is then a systematic aggregation of local rationales across a dataset that exposes the model’s decision patterns. Unlike surrogate methods that prioritize replicating the decision boundary, we emphasize explanatory sufficiency [21]: explanations justify predictions through domain-based concepts without needing to independently classify instances. For example, a model can classify a bird as a Robin based on the shape of its beak, and the explanation can identify the “beak” or even refer to its shape, but the complexity of the shape itself often exceeds what can be captured in words. Consequently, while a global explanation may cover all instances by identifying influential features, it does not provide a direct classification rule. This avoids the fidelity–complexity trade-off of proxy models [22] while improving human–AI collaboration in visual reasoning tasks.

2. Related Work

Explainability has been a longstanding focus in machine learning, extending beyond the era of deep models to encompass simple models like Logistic Regression, Matrix or Subspace Methods, and Support Vector Machines. The interpretability of deep networks gained significant momentum with the pioneering work of Zeiler and Fergus [9] on visualizing convolutional networks. GradCAM, introduced by Selvaraju et al. [7], has inspired numerous gradient-based approaches to model interpretability. Khorram et al. [10] further advanced this field by integrating gradient techniques with optimization paradigms to derive explanations.
Attention mechanisms have also played a crucial role in interpretability. Studies by Petsiuk et al. [8], Zheng et al. [23], Zhang et al. [24], and others have utilized saliency maps to highlight important regions of an image, providing valuable insights into model decision making. Despite their utility, saliency maps often suffer from subjectivity and cognitive biases such as overgeneralization, correlation fallacy, and confirmation bias [25]. Our work addresses these biases by aggregating multiple local explanations for the same image and across different images to construct a comprehensive global explanation.
Structured local explanations, as presented by Shitole et al. [11], offer a quantitative local measure through graphical representations. However, these explanations are not in human-understandable language and do not solve the challenge of global explanations. Our approach bridges this gap by approximating important regions using a human taxonomy, ensuring that the explanations are directly interpretable by humans. This methodology improves both the comprehensibility and applicability of explanations in deep image classification models.
The global explanations proposed by Singla et al. [26] leverage predefined concepts from radiology reports and employ causal mediation analysis through counterfactual interventions. The work of Sharma et al. [27] attempts to generate global rules for tabular data by optimizing for a hypercube that fits all instances of a single class. Comprehensive overviews of various approaches to explainability have been provided by [28,29,30,31]. Our work draws inspiration from the intersecting research of Balayn et al. [32], which generates global concepts that are highly correlated with the image label but uses a statistical test such as Cramér’s V to determine importance. In contrast, our goal is to generate human-comprehensible rules that remain globally consistent with a machine learning model exhibiting complex nonlinear interactions. The work in [33] attempts to express explanations of low-dimensional digit recognition tasks as logical formulas but fails to scale to large images because its primitive literals are at the pixel level.
The work in [34] attempts to generate global explanations through common grounding for graph classification using motif abstraction, but it does not address image data, which lack motif-like structured literals. To the best of our knowledge, we are the first to do so on large real-world image data.

3. Datasets and Annotations

The problem we seek to address is to generate global explanations of the classification decisions of a black-box model over an image dataset using a human vocabulary of objects and their parts. We adopt a training-set/test-set methodology to evaluate global explanations. In particular, we divide each dataset into a training set and a test set, derive global explanations of the decisions of the neural network model on the training set, and evaluate how much of the test set is covered by these global explanations. Aggregating local explanations requires a common taxonomy of objects or object parts. Identifying global patterns requires the locations and names of these objects/parts, i.e., instance segmentation and the associated names aligned with the human vocabulary.
To show the generality of our approach, we selected datasets with varying levels of annotation complexity: ADE20k [16], CUB-200 [15], and the Stanford Cars dataset [17]. The ADE20k dataset, with its 25,574 training images and 2000 validation images, is fully annotated with 150 objects and parts, providing a robust testbed for our method. In contrast, CUB-200 and Stanford Cars, which lack comprehensive object annotations, required us to employ a label correspondence pipeline that facilitates the transfer of concepts across datasets. The Stanford Cars dataset comprises 16,185 images of 196 car classes, divided into 8144 training images and 8041 testing images. Similarly, the CUB-200 dataset includes 11,788 images spanning 200 species of birds. For these datasets, we create a condensed subset by selecting 15% of the images of each class for manual annotation, chosen to maximize viewpoint coverage. Using the Roboflow platform, we performed part annotations, allowing users to define labels and annotate each image. For example, a bird image was annotated with labels such as head, wings, tail, beak, eye, foot, chest, and background, while car images were labeled with door, screen, headlight, mirror, backscreen, wheel, hood, bumper, and background.
To promote reproducibility and support future research, we will release these user-defined annotations. This release aims to facilitate the broader adoption and application of our approach, contributing to the ongoing development of interpretable AI systems that align with human concepts.

4. Annotation Transfer Across Images

Global explanations are minimal sets of human-interpretable labels of objects and parts, where each set, called a minimal sufficient explanation (MSX), covers or explains a subset of examples in the dataset. We seek to maximize the coverage of the global explanation, which is the percentage of images that are collectively covered by all of its MSXs. Critical to the construction of MSXs is the annotation of object and part labels, which is labor-intensive. In the rest of this section, we describe how we can reduce the human annotation effort through automatic label transfer.

4.1. Finding Visually Nearest Neighbor

When a dataset is not fully annotated with objects/parts, our approach transfers part labels from a preannotated gallery set $G$ of images within each category. Each image $g \in G$ is annotated with user-defined object part labels, as illustrated in Figure 1. The pixel-wise part annotation is performed using the Roboflow tool. To maintain high precision for label transfer from the gallery images $G$ to a larger query set $Q$, we need to retrieve visually similar images from $G$ for each query image $q \in Q$. We define visual similarity on the basis of shared characteristics, such as camera angles and object distribution across images. Label transfer problems are typically evaluated on standard benchmark datasets such as Polo, SIFT-Flow, and MSRC-21 [35,36,37], which contain predefined image pairs. However, since such image pairs are not readily available in most image classification datasets, we resort to retrieving visually similar images.
Previous work [38,39] has demonstrated the effectiveness of using network gradients as a form of weak supervision to guide the learning of visual features. We adopt this approach by using latent features, weighted by their corresponding gradients, to retrieve visually similar neighbors. Let $g(\theta)$ denote a CNN image classifier with parameters $\theta$. Let $A$ be the output feature maps of the final convolutional layer in $g$, and let $A^l$ be the $l$-th feature map within $A$. The gradient of the confidence score $y^c$ with respect to spatial location $(i, j)$ in the feature map $A^l$ is $\Delta_{ij}^{lc} = \partial y^c / \partial A_{ij}^l$. We filter out locations with negative gradients, assigning them a weight of zero, and retain those with positive gradients. Positive gradients indicate feature locations that increase the model’s confidence in its prediction [40,41], and such positively contributing regions (e.g., edges, textures, or object regions) align better with how humans interpret visual salience [7,42]. The weight of the spatial location $(i, j)$ in the $l$-th feature map is then formally represented as
$$ w_{ij}^{lc} = \mathrm{ReLU}\!\left( \Delta_{ij}^{lc} \right) $$
where $i$ and $j$ are the spatial locations on the activation maps from layer $l$ for class $c$. To compute the new feature set, we multiply the activation value at each spatial location in the feature map by its weight, which helps to emphasize spatial similarity across images [6,43]:
$$ \hat{A}_{ij}^{l} = w_{ij}^{lc} \cdot A_{ij}^{l} $$
Finally, we use the weighted features $\hat{A}^{l}$ to compute the nearest neighbor for each query image from the gallery image set using the cosine distance over the flattened features. This method allows for a more accurate identification of visually similar images, ensuring that the label transfer process is reliable and aligned with the underlying features learned by the model.
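To make the retrieval step concrete, the following is a minimal PyTorch sketch of gradient-weighted feature extraction and cosine-distance retrieval. It assumes a classifier whose final convolutional block can be hooked; the function names and the use of a forward hook are illustrative choices, not part of the original implementation.

```python
import torch
import torch.nn.functional as F

def extract_weighted_features(model, feature_layer, image, class_idx):
    """Gradient-weighted features for visual nearest-neighbor retrieval (sketch).

    `model` is the classifier under explanation and `feature_layer` its final
    convolutional block. Positive gradients of the class score re-weight the
    activations, following the two equations above, and the flattened result
    is compared with gallery descriptors via cosine distance.
    """
    activations = {}
    def hook(_, __, output):
        activations["A"] = output            # A: (1, L, H, W) feature maps
    handle = feature_layer.register_forward_hook(hook)

    score = model(image.unsqueeze(0))[0, class_idx]   # confidence y^c
    grads = torch.autograd.grad(score, activations["A"])[0]
    handle.remove()

    weights = F.relu(grads)                  # w = ReLU(dy^c / dA): drop negative gradients
    weighted = weights * activations["A"]    # element-wise A_hat = w * A
    return weighted.detach().flatten(1).squeeze(0)    # flattened descriptor

def nearest_gallery_index(query_feat, gallery_feats):
    """Index of the gallery image most similar to the query (cosine similarity)."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    return int(sims.argmax())
```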

4.2. Unsupervised Segmentation

The visually similar images are then subjected to unsupervised segmentation. We use Simple Linear Iterative Clustering (SLIC) [44] to segment the query image into meaningful subsegments. This method clusters image pixels based on similarity, forming what are known as superpixels, using a spatially localized version of the k-means clustering algorithm. However, it is important to note that the superpixels generated by this process are based solely on pixel values and do not inherently correspond to human-aligned concept names. We set $k = 60$ as the fixed number of subsegments for each image from the CUB-200 and Stanford Cars datasets, while for ADE20k we use the segmentation maps provided with the dataset. The choice of $k$ depends on the complexity of the scene being segmented; for classes with a greater number of concepts, a higher value of $k$ is recommended to capture the finer details within the image.
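As an illustration, a superpixel over-segmentation with $k = 60$ can be obtained with scikit-image’s SLIC implementation; the compactness value and helper names below are assumptions rather than settings reported here.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.util import img_as_float

def superpixel_segments(image_rgb, k=60):
    """Over-segment a query image into roughly k superpixels (sketch).

    k = 60 follows the setting used for the bird and car images; scenes with
    more concepts may need a larger k. The returned label map assigns every
    pixel to a superpixel, which is later named through part-label transfer.
    """
    image = img_as_float(image_rgb)
    # compactness trades color similarity against spatial proximity; 10 is the
    # library default and an assumption here, not a value from the paper.
    return slic(image, n_segments=k, compactness=10, start_label=0)

def segment_centroids(labels):
    """Centroids (C_x, C_y) of each superpixel, later used as keypoints for HPF."""
    centroids = {}
    for seg_id in np.unique(labels):
        ys, xs = np.nonzero(labels == seg_id)
        centroids[int(seg_id)] = (float(xs.mean()), float(ys.mean()))
    return centroids
```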

4.3. Part Label Transfer with Correspondence

Our approach to transfer part labels varies depending on the complexity of the images. For complex tasks like scene classification, the focus is on understanding the relationships between multiple objects within a scene. In contrast, simpler tasks, such as bird or car classification, emphasize learning part-based relationships specific to objects.
To achieve consistent representations of human-aligned concepts across multiple images, it is essential to extract meaningful concepts from these images. For datasets that require part-based annotations, we utilize HyperPixelFlow (HPF) [45] to propagate hand-labeled concepts in a few-shot manner. HPF performs matching over deep features, taking advantage of the fact that different layers of a deep neural network (DNN) progressively learn more complex concepts as the network depth increases. This approach is inspired by the concept of hypercolumns [46], which are used in object segmentation and detection. Given an image, the intermediate outputs of a Convolutional Neural Network (CNN), $(f_{l_0}, f_{l_1}, f_{l_2}, \ldots, f_{l_{Z-1}})$, are pooled and upsampled to create a hyperimage that combines spatial features from $Z$ layers. We recommend that the CNN used to extract the intermediate outputs be the network being explained; when a model cannot be used to extract multi-level features, one could use a pre-trained feature extractor to obtain $F$:
$$ F = \left[\, f_{l_0},\; U(f_{l_1}),\; U(f_{l_2}),\; \ldots,\; U(f_{l_{Z-1}}) \,\right] $$
where $U$ is a function that upscales the input feature map to match the size of $f_{l_0}$, the base map. Each spatial position $p$ in the hyperimage corresponds to image coordinates $\mathbf{x}_p$ and a hyperpixel feature $\mathbf{f}_p = F(\mathbf{x}_p)$; the hyperpixel at position $p$ is denoted $h_p = (\mathbf{x}_p, \mathbf{f}_p)$. The core idea of matching is to re-weight appearance similarity by Hough-space voting to enforce geometric consistency. Let $D = (\mathcal{H}, \mathcal{H}')$ be a pair of hyperpixel sets and $m = (h, h')$ a hyperpixel match, where $h \in \mathcal{H}$ and $h' \in \mathcal{H}'$. Given a Hough space $\mathcal{X}$ of possible offsets between the two hyperpixels, the confidence for the match $m$ is computed as
$$ p(m \mid D) \;\propto\; p(m_a) \sum_{x \in \mathcal{X}} p(m_g \mid x) \sum_{m' \in \mathcal{H} \times \mathcal{H}'} p(m'_a)\, p(m'_g \mid x) $$
where $p(m_a)$ represents the confidence in appearance matching and $p(m_g \mid x)$ is the confidence in geometric matching with an offset $x$, measuring how close the offset induced by $m$ is to $x$. To compute $p(m_g \mid x)$, we construct a two-dimensional offset space, quantize it into a grid of bins, and use the set of bin center points as $\mathcal{X}$. For Hough voting, each match is assigned to its corresponding offset bin, incrementing the bin’s score by the appearance similarity score $p(m_a)$. Overall, the pseudocode below explains how the appearance and geometric constraints are weighted:
  • For each hyperpixel correspondence $(i, j)$:
    (a) Appearance weight: the appearance matching confidence is an exponentiated, clamped cosine similarity over hyperpixel features:
        $$ W_{\text{app}}(i, j) = \max\big(0,\; \operatorname{cosine\_sim}(f_i, f_j)\big)^{d} $$
    (b) Hough voting:
        • Map the correspondence to its offset bin in Hough space.
        • Add $W_{\text{app}}(i, j)$ to that bin’s accumulator.
    (c) Spatial regularization:
        • Convolve the Hough space with a smoothing filter.
        • Read out the spatial consistency $W_{\text{spatial}}(i, j)$ from the bin of $(i, j)$.
    (d) Final weight:
        $$ W_{\text{final}}(i, j) = W_{\text{app}}(i, j) \times W_{\text{spatial}}(i, j) $$
where the ReLU (the max with zero) clamps negative similarities to zero and the exponent $d$ helps to emphasize differences between hyperpixel features. When integrated with Hough voting, this similarity function with $d \geq 2$ increases matching performance by reducing the impact of noisy activations. The centroids $(C_x, C_y)$ of each segment of the query image act as keypoints in the HyperPixelFlow process, allowing the corresponding regions to be matched and parts to be transferred from the gallery image to the query image. The combined use of HPF with SLIC enables the assignment of human-aligned names to superpixel groups, thereby identifying meaningful concepts. Figure 2 shows the result of transferring part labels from a labeled gallery image on the left to an unlabeled query image on the right.
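The weighting scheme in steps (a)–(d) can be sketched in a few lines of NumPy. The bin count, image size, and box-filter smoothing below are illustrative assumptions; only the exponentiated cosine similarity, the Hough accumulation, and the final product of the two weights follow the pseudocode above.

```python
import numpy as np

def hough_match_weights(feats_q, feats_g, pos_q, pos_g, d=2, bins=16, img_size=224):
    """Re-weight candidate hyperpixel matches with Hough-space voting (sketch).

    feats_q/feats_g: (Nq, D) and (Ng, D) hyperpixel features; pos_q/pos_g hold
    the corresponding (x, y) coordinates. Names and bin count are illustrative;
    the exponent d >= 2 follows the text.
    """
    # (a) appearance weight: exponentiated, clamped cosine similarity
    fq = feats_q / np.linalg.norm(feats_q, axis=1, keepdims=True)
    fg = feats_g / np.linalg.norm(feats_g, axis=1, keepdims=True)
    w_app = np.maximum(0.0, fq @ fg.T) ** d            # (Nq, Ng)

    # (b) Hough voting: accumulate appearance weight in offset bins
    offsets = pos_g[None, :, :] - pos_q[:, None, :]    # (Nq, Ng, 2) offsets
    bin_idx = ((offsets + img_size) * bins // (2 * img_size)).astype(int)
    bin_idx = np.clip(bin_idx, 0, bins - 1)
    hough = np.zeros((bins, bins))
    np.add.at(hough, (bin_idx[..., 0], bin_idx[..., 1]), w_app)

    # (c) spatial regularization: smooth the Hough space with a 3x3 box filter
    padded = np.pad(hough, 1, mode="edge")
    smooth = sum(padded[di:di + bins, dj:dj + bins]
                 for di in range(3) for dj in range(3)) / 9.0

    # (d) final weight: appearance confidence times spatial consistency of its bin
    w_spatial = smooth[bin_idx[..., 0], bin_idx[..., 1]]
    return w_app * w_spatial
```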

5. Local and Global Explanations

Notation: The dataset that we attempt to explain is $D = \{(x_i, y_i, s_i) \mid i = 1, \ldots, N\}$. It contains $N$ input images $x_i \in X$, categorical labels $y_i \in Y = \{1, 2, \ldots, c\}$ indicating the image class, and a segmentation map $s_i$ of the same spatial resolution as $x_i$. Let $s_i = \{s_i^1, s_i^2, \ldots, s_i^p\}$ be the set of binary segmentation maps for the $p$ object categories present in $x_i$.

5.1. Minimal Sufficient Explanations

For a given image $x_i$, $f(x_i, y_i; \theta)$ is a deep convolutional network that predicts $y_i$ with a score $p_i$. Following [11], we use beam search over subsets of $s_i$ to find the minimal subsets that yield a score of at least $0.95\,p_i$. Previous work [11] found that a single image might admit multiple such subsets. We call these subsets minimal sufficient explanations (MSXs), denoted $S_i^{\min}$, and find them all with a single beam search guided by the score of the machine learning model as the heuristic. At each node of the search graph, we apply a Gaussian blur to mask all parts of the image except the object categories selected at that node, using the pixel-level annotations in $s_i$. The Gaussian mask is smoothed to avoid introducing abrupt edges that could push the input image outside the distribution seen during training. We select object segmentation maps for subsequent levels of the graph only if they yield an image score for class $y_c$ that is at least 95% of the original score of $y_c$ with no masking applied. Each resulting set of objects/parts from this search process is a minimal sufficient explanation for the given class label $y_c$. Continuing the search provides a systematic approach to identify and represent multiple MSXs, each of which is a concise and meaningful explanation expressed as a set of objects or object parts.
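A minimal sketch of this beam search is given below. The `score_fn` callable is assumed to apply the Gaussian-blur masking and return the model’s confidence for the predicted class; the beam width and depth are illustrative values, not those used in our experiments.

```python
def find_msxs(score_fn, parts, threshold=0.95, beam_width=5, max_depth=4):
    """Beam search for minimal sufficient explanations (MSXs) -- a sketch.

    `score_fn(kept_parts)` is assumed to blur everything outside the union of
    the kept part masks and return the model's confidence for the predicted
    class; `parts` is the set of part labels present in the image.
    """
    full_score = score_fn(frozenset(parts))   # confidence with all parts visible
    target = threshold * full_score           # 0.95 * original score
    beam = [frozenset()]                      # partial subsets to expand
    msxs = []

    for _ in range(max_depth):
        # Expand each subset in the beam by one additional part.
        candidates = {s | {p} for s in beam for p in parts if p not in s}
        scored = [(score_fn(c), c) for c in candidates]

        for score, c in scored:
            # Sufficient: keeps >= 95% of the original confidence on its own.
            # Minimal: no smaller MSX found earlier is contained in it.
            if score >= target and not any(m < c for m in msxs):
                msxs.append(c)

        # Continue the search only through not-yet-sufficient subsets.
        remaining = sorted((s for s in scored if s[0] < target), key=lambda t: t[0])
        beam = [c for _, c in remaining[-beam_width:]]
        if not beam:
            break
    return msxs
```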

5.2. Symbolic Representation

While we borrow the core idea of computing multiple local explanations from SAG, we modify its perturbation strategy, which computes explanations by dividing the image into a 7 × 7 grid. Rather than perturbing fixed grid cells, we perturb object segmentation maps in the image that are directly interpretable by humans. To ensure a consistent representation of the minimal sufficient explanations (MSXs) across all images within a class, we encode them as a sparse matrix $G_{y_c} \in \mathbb{R}^{k \times m}$, where $k$ is the number of objects and $m$ is the number of local MSXs for class $y_c$ throughout the dataset. This matrix captures the presence or absence of each object/part in the MSXs for a specific class: a value of 1 at position $G_{y_c}[i, j]$ indicates the presence of object $i$ in MSX $j$ for class $y_c$, while 0 represents its absence.
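The encoding itself is straightforward; the following sketch builds the binary matrix $G_{y_c}$ from a list of symbolic MSXs, with the part vocabulary and function name chosen purely for illustration.

```python
import numpy as np

def encode_msxs(msxs, part_vocab):
    """Encode a class's MSXs as the binary matrix G_yc of Section 5.2 (sketch).

    `msxs` is a list of part-label sets (one per local MSX found for the class)
    and `part_vocab` is the ordered list of k part names. Entry [i, j] is 1
    when part i appears in MSX j.
    """
    index = {name: i for i, name in enumerate(part_vocab)}
    G = np.zeros((len(part_vocab), len(msxs)), dtype=np.uint8)
    for j, msx in enumerate(msxs):
        for part in msx:
            G[index[part], j] = 1
    return G
```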

5.3. Deriving Global Explanations

To derive global explanations from the symbolic representation of MSXs, we make use of a monotonic Disjunctive Normal Form (mDNF) where each literal corresponds to an object or part label. A symbolic MSX of an image identifies a minimal set of parts that is sufficient to recognize the class of that image by the given model and can be interpreted as a conjunction of positive literals that represent object/part labels. For a global explanation, we seek the smallest set of conjunctions that collectively cover the dataset.
However, finding the smallest set of conjunctions is an NP-hard problem [47]. Exhaustive search is impractical due to the large number of images in each dataset; back-of-the-envelope calculations suggest that one would need on the order of $10^9$ comparisons for the ADE dataset alone. We avoid using SHAP values to generate global explanations due to their computational complexity and unreliability [48]; they are also not consistent with our goal of finding succinct interpretable global explanations as sets of object/image parts. We instead employ the well-known greedy set cover algorithm [49], which iteratively selects the conjunction (MSX) that covers the most remaining images until all image instances are covered. Algorithm 1 outputs such a set. Here, $R$ represents the set of selected MSXs. Step 3 of the algorithm selects the MSX that has the largest intersection with the images not yet covered, Step 5 adds it to $R$, and Step 4 updates the uncovered set of images. Although the algorithm is not guaranteed to produce the smallest $R$, it performs well in practice and has a weak approximation guarantee.
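For reference, the greedy cover step can be sketched as follows; `msx_to_images`, which maps each symbolic MSX to the training images it explains, is an assumed input format rather than the exact data structure of Algorithm 1.

```python
def greedy_global_explanation(msx_to_images):
    """Greedy set cover over symbolic MSXs -- a sketch of Algorithm 1.

    `msx_to_images` maps each candidate MSX (a frozenset of part labels) to the
    set of training-image ids it covers, i.e. the images for which it is a
    local explanation. Returns the selected rule set R in selection order.
    """
    uncovered = set().union(*msx_to_images.values())
    R = []
    while uncovered:
        # Pick the MSX covering the most still-uncovered images.
        best = max(msx_to_images, key=lambda m: len(msx_to_images[m] & uncovered))
        gain = msx_to_images[best] & uncovered
        if not gain:            # no rule covers anything new; stop early
            break
        R.append(best)
        uncovered -= gain
    return R
```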
This completes the discussion of our approach, summarized in the flow diagram in Figure 3. To answer RQ1, the global explanations are grounded in human taxonomy by transferring human-annotated labels to query images and to the superpixels in MSXs. The superpixels grounded in local explanations are used to construct global explanations through the greedy set cover algorithm, which answers RQ2.
The total time complexity of our approach is $O(H(N) + mc + G(kc) + B(N))$, where $m$ is the maximum number of images in a class $y_c$ from the dataset, $c$ is the number of categories, and $k$ is the number of neighbors used to find visually similar images. $H(\cdot)$, $G(\cdot)$, and $B(\cdot)$ denote the time taken by unsupervised segmentation, part label transfer, and beam search, respectively. Since unsupervised segmentation $H(\cdot)$, label transfer $G(\cdot)$, and beam search $B(\cdot)$ can be parallelized across images, computing the nearest neighbor, i.e., the $mc$ term, is the most time-consuming step. Beam search $B(\cdot)$ has a worst-case complexity of $O(\text{beam width} \times \text{depth} \times p)$ and only takes about 100 ms per image on an NVIDIA DGX.

6. Results and Discussion

The evaluation consists of deriving the global explanations for each class from the training dataset and measuring their coverage on the held-out test dataset. The “test coverage” of a global explanation or rule set measures what fraction of the images in each class have one of their MSXs captured by some rule in the rule set. It can be thought of as measuring the fidelity of the global explanation.
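A sketch of this metric is shown below, assuming the test-set MSXs have already been computed. Whether a rule “captures” an MSX is treated here as exact set equality, which is one reading of the definition above rather than a statement of the exact implementation.

```python
def test_coverage(rule_set, test_msxs):
    """Fraction of test images with at least one MSX matched by a global rule (sketch).

    `test_msxs` maps each test-image id to the set of its local MSXs
    (frozensets of part labels); `rule_set` is the output of the greedy cover.
    An image counts as covered when one of its MSXs equals some rule.
    """
    rules = set(rule_set)
    covered = sum(1 for msxs in test_msxs.values() if any(m in rules for m in msxs))
    return covered / max(1, len(test_msxs))
```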
In Table 1, we present the total coverage and standard deviation across all classes for three pre-trained model architectures (VGG19, ResNet101, and DenseNet121) on three datasets (ADE20k, CUB-200, and Stanford Cars). The results indicate a notable advantage of defining human-aligned concepts and transferring them to the query images, as evidenced by higher mean coverage on the CUB-200 and Stanford Cars datasets. The degradation in average test coverage observed on the ADE20k dataset can likely be attributed to intra-object variance within different scenes, where numerous visually distinct instances of the same object class, such as bed, are labeled identically. In contrast, bird and car parts exhibit higher consistency across images, leading to more stable coverage. Additionally, the ADE20k dataset includes 150 distinct concepts, while the CUB-200 and Stanford Cars datasets contain 9 concepts each. The increased intra-object variance coupled with the larger number of concepts makes it more challenging to achieve significant coverage across the ADE20k dataset. This complexity also contributes to the higher standard deviation in test coverage between classes in the ADE20k dataset. In comparison, the Stanford Cars dataset, characterized by higher object part rigidity, promotes better generalization of explanations from training to test datasets, as reflected in its lower standard deviation. Although our primary focus is not on the performance of different architectures, given that our proposed approach is a black-box method universally applicable to any image recognition model, it is worth noting that the ResNet101 architecture achieved the highest mean test coverage across all three datasets. This suggests that ResNet101 was more effective in learning human-aligned concepts during training compared to VGG19 and DenseNet121.
Figure 4 provides a qualitative visualization of global explanations for the classes “Indigo Bunting,” “Bugatti Veyron,” and “Kitchen” from the CUB-200, Stanford Cars, and ADE20k datasets, respectively, with a ResNet101 model. The bar graph illustrates the distribution of global rules computed on the training images, while the pie chart depicts the test coverage of the respective rules (color-coded) on the test images, indicating the extent to which the network has internalized human-aligned concepts within these classes. Interpreting results for class “Kitchen” in the ADE20k dataset, the human-aligned rule Background, Cabinet, Wall accounted for 15% of the training coverage and 3% of the test coverage. The empty segment in the pie chart represents test images that were misclassified by the ResNet101 model.
In Figure 5, we plot the cumulative per-class test coverage against the number of rules explaining the training dataset for the three datasets, focusing on the top three and bottom three classes when sorted by test coverage. Unlike the average test coverage reported in Table 1, this figure aims to capture the varying levels of difficulty in explaining different classes. Ideally, high test coverage should be achieved with the smallest set of rules. The figure highlights the diversity in the rate of test coverage gained with each additional rule. Notably, the “White Pelican,” “Aston Martin Virage Coupe 2012,” and “miscellaneous” classes achieved the highest class test coverage for the CUB-200, Stanford Cars, and ADE20k datasets, respectively, with “Aston Martin Virage Coupe 2012” being able to explain all test and train examples with as few as 17 rules. The figure also visually demonstrates the higher intraclass deviation in mean test coverage for the ADE20k and CUB-200 datasets, whereas classes in the Stanford Cars dataset achieve higher test coverage with fewer global rules.

7. Conclusions and Future Work

In this work, we have presented a novel framework for generating global explanations that align AI decision-making with human concepts by leveraging human-annotated object and part labels. Through comprehensive experiments on diverse datasets, we addressed three key research questions.
First, we demonstrated that global explanation methods can indeed be grounded in human taxonomy, as evidenced by the higher mean test coverage achieved on datasets with well-defined object parts like CUB-200 and Stanford Cars (RQ1). This grounding not only enhances the interpretability of the explanations but also ensures that they are more comprehensible to a broader audience with little to no additional training required. Second, we explored the benefits of aggregating multiple local explanations to construct global explanations. Our findings show that this aggregation leads to a more robust and comprehensive understanding of the model decision-making process across various classes, capturing both specific and overarching patterns in the data (RQ2). Finally, we demonstrated the adaptability of our method to different models and datasets, showcasing its versatility as a model-agnostic approach that can be applied to any image-based classifier (RQ3). The consistent performance across multiple architectures, particularly the superior results achieved with ResNet101, further validates the effectiveness of our approach.
Future work includes extending our methods to other modalities such as video and text, as well as answering more refined discriminative questions such as why the model classifies an image as a cat and not a dog.

Author Contributions

Conceptualization: B.V., K.R. and P.T.; Methodology: B.V., K.R. and P.T.; Software and Experiments: B.V. and K.R.; Investigation and Data Curation: K.R.; Visualization: B.V. and K.R.; Writing—Original Draft: B.V. and K.R.; Writing—Review and Editing: B.V. and P.T.; Supervision: P.T.; Funding Acquisition: P.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded through the NSF grant CNS-1941892, the ARO grant W911NF2210251 and the Industry–University Cooperative Research Center on Pervasive Personalized Intelligence.

Data Availability Statement

All base image datasets used in this study—CUB-200-2011, ADE20K, and Stanford Cars—are publicly available from their original authors under the licenses specified in their respective repositories. The additional manual annotations, class splits, and evaluation scripts can be found in the GitHub repo: https://github.com/vbhavank/LoctoGlobal (accessed on 1 July 2025).

Acknowledgments

We thank Rahul Khanna, Raffa Giuseppe, Kai Ishikawa, and Kunihiko Sadamasa for their valuable feedback.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
  2. General Data Protection Regulation (GDPR). Regulation (EU) 2016/679 of the European Parliament and of the Council; Publications Office of the European Union: Luxembourg, 2016. [Google Scholar]
  3. Smuha, N.A.; Ahmed-Rengers, E.; Harkens, A.; Li, W.; MacLaren, J.; Piselli, R.; Yeung, K. How the EU Can Achieve Legally Trustworthy AI: A Response to the European Commission’s Proposal for an Artificial Intelligence Act. 2021. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3899991 (accessed on 13 July 2025).
  4. Fresz, B.; Dubovitskaya, E.; Brajovic, D.; Huber, M.; Horz, C. How should AI decisions be explained? Requirements for Explanations from the Perspective of European Law. arXiv 2024, arXiv:2404.12762. [Google Scholar] [CrossRef]
  5. Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6541–6549. [Google Scholar]
  6. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  7. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  8. Petsiuk, V.; Das, A.; Saenko, K. Rise: Randomized input sampling for explanation of black-box models. arXiv 2018, arXiv:1806.07421. [Google Scholar] [CrossRef]
  9. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I 13. Springer: Cham, Switzerland, 2014; pp. 818–833. [Google Scholar]
  10. Khorram, S.; Lawson, T.; Fuxin, L. iGOS++ integrated gradient optimized saliency by bilateral perturbations. In Proceedings of the Conference on Health, Inference, and Learning, Virtual, 8–10 April 2021; pp. 174–182. [Google Scholar]
  11. Shitole, V.; Li, F.; Kahng, M.; Tadepalli, P.; Fern, A. One explanation is not enough: Structured attention graphs for image classification. Adv. Neural Inf. Process. Syst. 2021, 34, 11352–11363. [Google Scholar]
  12. Chen, C.; Li, O.; Tao, D.; Barnett, A.; Rudin, C.; Su, J.K. This looks like that: Deep learning for interpretable image recognition. Adv. Neural Inf. Process. Syst. 2019, 8928–8939. [Google Scholar]
  13. Fei-Fei, L.; Iyer, A.; Koch, C.; Perona, P. What do we perceive in a glance of a real-world scene? J. Vis. 2007, 7, 10. [Google Scholar] [CrossRef] [PubMed]
  14. Biederman, I. Recognition-by-components: A theory of human image understanding. Psychol. Rev. 1987, 94, 115. [Google Scholar] [CrossRef] [PubMed]
  15. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Computation & Neural Systems Technical Report, CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011; Available online: https://www.vision.caltech.edu/datasets/cub_200_2011/ (accessed on 13 July 2025).
  16. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 633–641. [Google Scholar]
  17. Krause, J.; Deng, J.; Stark, M.; Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. In Proceedings of the Second Workshop on Fine-Grained Visual Categorization (FGVC), IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Portland, OR, USA, 23–28 June 2013; Available online: https://ai.stanford.edu/~jkrause/papers/fgvc13.pdf (accessed on 13 July 2025).
  18. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  19. Ribeiro, M.T.; Singh, S.; Guestrin, C. Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  20. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; Sayres, R. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018; pp. 2668–2677. [Google Scholar]
  21. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
  22. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar] [CrossRef]
  23. Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5209–5217. [Google Scholar]
  24. Zhang, N.; Donahue, J.; Girshick, R.; Darrell, T. Part-based R-CNNs for fine-grained category detection. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 834–849. [Google Scholar]
  25. Kliegr, T.; Bahník, Š.; Fürnkranz, J. A review of possible effects of cognitive biases on interpretation of rule-based machine learning models. Artif. Intell. 2021, 295, 103458. [Google Scholar] [CrossRef]
  26. Singla, S.; Wallace, S.; Triantafillou, S.; Batmanghelich, K. Using causal analysis for conceptual deep learning explanation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part III 24. Springer: Cham, Switzerland, 2021; pp. 519–528. [Google Scholar]
  27. Sharma, R.; Reddy, N.; Kamakshi, V.; Krishnan, N.C.; Jain, S. MAIRE-a model-agnostic interpretable rule extraction procedure for explaining classifiers. In Proceedings of the Machine Learning and Knowledge Extraction: 5th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2021, Virtual Event, 17–20 August 2021; Proceedings 5. Springer: Cham, Switzerland, 2021; pp. 329–349. [Google Scholar]
  28. Friedman, J.H.; Popescu, B.E. Predictive learning via rule ensembles. Ann. Appl. Stat. 2008, 2, 916–954. [Google Scholar] [CrossRef]
  29. Lundberg, S.M.; Erion, G.G.; Lee, S.I. Consistent individualized feature attribution for tree ensembles. arXiv 2018, arXiv:1802.03888. [Google Scholar]
  30. Harris, C.; Pymar, R.; Rowat, C. Joint Shapley values: A measure of joint feature importance. arXiv 2021, arXiv:2107.11357. [Google Scholar]
  31. Kamakshi, V.; Krishnan, N.C. Explainable image classification: The journey so far and the road ahead. AI 2023, 4, 620–651. [Google Scholar] [CrossRef]
  32. Balayn, A.; Soilis, P.; Lofi, C.; Yang, J.; Bozzon, A. What do you mean? Interpreting image classification with crowdsourced concept extraction and analysis. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1937–1948. [Google Scholar]
  33. Ji, C.; Darwiche, A. A new class of explanations for classifiers with non-binary features. In Proceedings of the European Conference on Logics in Artificial Intelligence, Dresden, Germany, 20–22 September 2023; Springer: Cham, Switzerland, 2023; pp. 106–122. [Google Scholar]
  34. Azzolin, S.; Longa, A.; Barbiero, P.; Lio, P.; Passerini, A. Global Explainability of GNNs via Logic Combination of Learned Concepts. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2022. [Google Scholar]
  35. Zhang, H.; Fang, T.; Chen, X.; Zhao, Q.; Quan, L. Partial similarity based nonparametric scene parsing in certain environment. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; IEEE: New York, NY, USA, 2011; pp. 2241–2248. [Google Scholar]
  36. Liu, C.; Yuen, J.; Torralba, A. Nonparametric scene parsing: Label transfer via dense scene alignment. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 1972–1979. [Google Scholar]
  37. Criminisi, A. Microsoft Research Cambridge (MSRC) Object Recognition Image Database, Version 2.0; Microsoft Research Cambridge: Cambridge, UK, 2004; Available online: http://research.microsoft.com/vision/cambridge/recognition/default.htm (accessed on 13 July 2025).
  38. Joulin, A.; Van Der Maaten, L.; Jabri, A.; Vasilache, N. Learning visual features from large weakly supervised data. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14. Springer: Cham, Switzerland, 2016; pp. 67–84. [Google Scholar]
  39. Zhang, X.; Wei, Y.; Kang, G.; Yang, Y.; Huang, T. Self-produced guidance for weakly-supervised object localization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 597–613. [Google Scholar]
  40. Simonyan, K. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
  41. Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv 2014, arXiv:1412.6806. [Google Scholar]
  42. Samek, W. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv 2017, arXiv:1708.08296. [Google Scholar] [CrossRef]
  43. Islam, M.J.; Wang, R.; Sattar, J. SVAM: Saliency-guided visual attention modeling by autonomous underwater robots. arXiv 2020, arXiv:2011.06252. [Google Scholar]
  44. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef] [PubMed]
  45. Min, J.; Lee, J.; Ponce, J.; Cho, M. Hyperpixel Flow: Semantic Correspondence with Multi-Layer Neural Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  46. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 447–456. [Google Scholar]
  47. Slavík, P. A tight analysis of the greedy algorithm for set cover. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, Philadelphia, PA, USA, 22–24 May 1996; pp. 435–441. [Google Scholar]
  48. Huang, X.; Marques-Silva, J. On the failings of Shapley values for explainability. Int. J. Approx. Reason. 2024, 171, 109112. [Google Scholar] [CrossRef]
  49. Young, N.E. Greedy set-cover algorithms (1974–1979, Chvátal, Johnson, Lovász, Stein). In Encyclopedia of Algorithms; Springer: New York, NY, USA, 2008; pp. 379–381. [Google Scholar]
Figure 1. Image from CUB200 dataset (left) and its respective part-based annotation overlaid (right).
Figure 2. A figure showing how part labels are transferred to an unlabeled query image (right) from a labeled gallery image (left) with named superpixels obtained from SLIC.
Figure 3. The overall flow diagram for the proposed method consists of mining for visually similar images with nearest neighbor, part-based matching with HPF, computing symbolic local MSXs, and deriving class-level global symbolic explanations using greedy set cover.
Figure 4. Visualization of global explanations for ResNet101 architecture decisions on classes “Indigo Bunting” (Panel-A), “Bugatti Veyron” (Panel-B), and “Kitchen” (Panel-C) from the CUB-200, Stanford Cars, and ADE20k datasets, respectively. The bar plot shows the coverage of each MSX in the train set, whereas the pie chart indicates the coverage of the same MSX (same color) in the test set.
Figure 5. A figure showing the cumulative test coverage vs. the number of rules explaining the training set for the top 3 and bottom 3 classes when sorted based on test coverage.
Table 1. Total mean coverage and standard deviation across all classes from the ADE20k, CUB-200, and Stanford Cars datasets for different model architectures. Entries are (mean coverage %, standard deviation).

Dataset        | VGG19       | ResNet101   | DenseNet121
ADE20k         | 48.5, 25.9  | 53.5, 25.3  | 52.1, 27.5
CUB-200        | 56.7, 12.0  | 59.2, 11.2  | 58.3, 12.4
Stanford Cars  | 98.5, 1.7   | 99.3, 1.34  | 97.2, 3.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
